=Paper= {{Paper |id=Vol-1507/DX_2015_proceedings |storemode=property |title=None |pdfUrl=https://ceur-ws.org/Vol-1507/dx15_proceedings.pdf |volume=Vol-1507 }} ==None== https://ceur-ws.org/Vol-1507/dx15_proceedings.pdf
   26th International Workshop
    on Principles of Diagnosis




                               2015
                                  Paris, France
              August 31 - September 3, 2015




Editors: Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague




                         LABORATOIRE DE RECHERCHE EN INFORMATIQUE
Proceedings of the 26th International Workshop
      on Principles of Diagnosis (DX-15)


                     August 31-September 3, 2015

                               Paris, France




      Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague, Editors
Foreword
The International Workshop on Principles of Diagnosis is an annual event that started in 1989,
originating in the Artificial Intelligence community. Its focus is on theories, principles and
computational techniques for diagnosis, monitoring, testing, reconfiguration and repair of complex
systems and applications of these techniques to real world problems.

    This year, DX-15 received 41 submissions (39 full papers and 2 tool papers) from 15 countries
on 5 continents. Each paper was thoroughly peer reviewed by three reviewers. We accepted 17
regular papers (selection rate 43.6%), 18 posters and 2 benchmark/tool papers. We wish to thank all
the authors of submitted papers, the program committee members for the time and effort spent, and
the invited speakers for their participation.

    As the DX-15 workshop is co-located with the IFAC International Symposium SAFEPROCESS
2015, its organization would not have been possible without the full support of the SAFEPROCESS
organization team and especially Vincent Cocquempot who did a tremendous coordination job
between the two events. Also special thanks to our local contact Nazih Mechbal at École Nationale
Supérieure d’Arts et Métiers (ENSAM), ParisTech, where DX-15 and SAFEPROCESS take place.
Thanks also to the local organization team at LAAS-CNRS and at the CNRS administrative
department of Toulouse (DR14) for their full technical and administrative support.

    We also wish to thank our sponsors: Centre National de la Recherche Scientifique (CNRS),
Université de Toulouse, Laboratoire de Recherche en Informatique, Université Paris-Sud, École
Nationale Supérieure d’Arts et Métiers (ENSAM), Institut National des Sciences Appliquées de
Toulouse (INSA-Toulouse), Université Pierre et Marie Curie (UPMC), and ACTIA.


Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague.
August 2015




          Word cloud generated from the titles of the DX-15 papers by http://www.wordle.net.



Workshop Organization

Program Co-Chairs
Yannick Pencolé          LAAS-CNRS, Univ. Fédérale Toulouse, France
Louise Travé-Massuyès   LAAS-CNRS, Univ. Fédérale Toulouse, France
Philippe Dague            LRI, Université Paris-Sud, France




International Program Committee
Rui Abreu                 PARC, USA
Jose Aguilar              Universidad de los Andes, Venezuela
Carlos Alonso             Universidad de Valladolid, Spain
Gautam Biswas             Vanderbilt University, USA
Anibal Bregon             Universidad de Valladolid, Spain
Luca Console              Università di Torino, Italy
Matthew Daigle            NASA Ames Research Center, USA
Johan de Kleer            PARC, USA
Michael Hofbaur           Joanneum Research, Austria
Alexander Feldman         PARC, USA
Gerhard Friedrich         Klagenfurt University, Austria
Alban Grastien            NICTA, Australia
Claudia Isaza             University of Antioquia, Medellín, Colombia
Meir Kalech               Ben-Gurion University of the Negev, Israel
Mattias Krysander         Linköping University, Sweden
Anastassia Küstenmacher   Bonn-Rhein-Sieg University of Applied Science, Germany
Ingo Pill                 TU Graz, Austria
Gregory Provan            University College Cork, Ireland
Xavier Pucel              ONERA CERT, France
Martin Sachenbacher       LION Smart GmbH, Germany
Ramon Sarrate             Universitat Politècnica de Catalunya, Spain
Neal Snooke               Aberystwyth University, UK
Gerald Steinbauer         TU Graz, Austria
Peter Struss              TU München, Germany
Anna Sztyber              The Institute of Automatic and Robotics, Warsaw University of Technology, Poland
Gianluca Torta            Università di Torino, Italy
Franz Wotawa              TU Graz, Austria
Marina Zanella            Università degli Studi di Brescia, Italy




Additional Reviewers
Moussa Maiga              LAAS-CNRS, Univ. Fédérale Toulouse, France
Nathalie Barbosa Roa      LAAS-CNRS, Univ. Fédérale Toulouse, France
Euriell Le Corronc        LAAS-CNRS, Université Paul Sabatier, Univ. Fédérale Toulouse, France
Élodie Chanthery         LAAS-CNRS, INSA Toulouse, Univ. Fédérale Toulouse, France
Indranil Roychoudhury     SGT Inc, NASA Ames Research Center, USA




Workshop Organizing Committee
Yannick Pencolé          LAAS-CNRS, Univ. Fédérale Toulouse, France
Louise Travé-Massuyès   LAAS-CNRS, Univ. Fédérale Toulouse, France
Vincent Cocquempot        CRISTAL, Université Lille 1, France
Audine Subias             LAAS-CNRS, INSA Toulouse, Univ. Fédérale Toulouse, France
Nazih Mechbal             ENSAM Paris, France




Technical and Administrative Support
Christèle Mouclier       LAAS-CNRS, Toulouse, France
Régine Duran             LAAS-CNRS, Toulouse, France
Dominique Daurat          LAAS-CNRS, Toulouse, France
Fabienne Baduel           LAAS-CNRS, Toulouse, France
Bruno Birac               LAAS-CNRS, Toulouse, France
Stéphanie Saluden        Délégation Régionale CNRS, Toulouse, France
Régine Barthes           Délégation Régionale CNRS, Toulouse, France




                             Table of contents

                                     Regular papers


A Divide-And-Conquer-Method for Computing Multiple Conflicts for Diagnosis
by Shchekotykhin Kostyantyn, Jannach Dietmar, Schmitz Thomas                                          3


A Robust Alternative to Correlation Networks for Identifying Faulty Systems
by Traxler Patrick, Grill Tanja, Gomez Pablo                                                          11


Applied multi-layer clustering to the diagnosis of complex agro-systems
by Roux Elisa, Travé-Massuyès Louise, Le Lann Marie-Véronique                                      19


A Bayesian Framework for Fault diagnosis of Hybrid Linear Systems
by Zhou Gan, Biswas Gautam, Feng Wenquan, Zhao Hongbo, Guan Xiumei                                    27


ADS2: Anytime Distributed Supervision of Distributed Systems that Face Unreliable or Costly Communication
by Herpson Cédric, El Fallah Seghrouchni Amal, Corruble Vincent                                      35

Data Driven Modeling for System-Level Condition Monitoring on Wind Power Plants
by Eickmeyer Jens, Li Peng, Givehchi Omid, Pethig Florian, Niggemann Oliver                           43


Using Incremental SAT for Testing Diagnosability of Distributed DES
by Ibrahim Hassan, Dague Philippe, Simon Laurent                                                      51


Improving Fault Isolation and Identification for Hybrid Systems with Hybrid Possible Conflicts
by Bregon Anibal, Alonso-Gonzalez Carlos, Pulido Belarmino                                            59


State estimation and fault detection using box particle filtering with stochastic measurements
by Blesa Joaquim, Le Gall Françoise, Jauberthie Carine, Travé-Massuyès Louise                      67


Minimal Structurally Overdetermined Sets Selection for Distributed Fault Detection
by Khorasgani Hamed, Biswas Gautam, Jung Daniel                                                       75


Condition-based Monitoring and Prognosis in an Error-Bounded Framework
by Travé-Massuyès Louise, Pons Renaud, Ribot Pauline, Pencolé Yannick, Jauberthie Carine           83


Configuration as Diagnosis: Generating Configurations with Conflict-Directed A* - An Application to Training Plan Generation -
by Grigoleit Florian, Struss Peter                                                                    91

Decentralised fault diagnosis of large-scale systems: Application to water transport networks
by Puig Vicenç, Ocampo-Martinez Carlos                                                               99




Self-Healing as a Combination of Consistency Checks and Conformant Planning Problems
by Grastien Alban                                                                                    105


Implementing Troubleshooting with Batch Repair
by Stern Roni, Kalech Meir, Shinitzky Hilla                                                          113


Formulating Event-Based Critical Observations in Diagnostic Problems
by Christopher Cody, Grastien Alban                                                                  119


A Framework For Assessing Diagnostics Model Fidelity
by Provan Gregory, Feldman Alexander                                                                 127




                                          Posters

A General Process Model: Application to Unanticipated Fault Diagnosis
by Wang Jiongqi, He Zhangming, Zhou Haiyin, Li Shuxing                                               137


A SCADA Expansion for Leak Detection in a Pipeline
by Carrera Rolando, Verde Cristina, Cayetano Raúl                                                   145


Automatic Model Generation to Diagnose Autonomous Systems
by Santos Simón Jorge, Mühlbacher Clemens, Steinbauer Gerald                                       153


Methodology and Application of Meta-Diagnosis on Avionics Test Benches
by Cossé Ronan, Berdjag Denis, Piechowiak Sylvain, Duvivier David, Gaurel Christian                 159


SAT-Based Abductive Diagnosis
by Koitz Roxane, Wotawa Franz                                                                        167


Fault Tolerant Control for a 4-Wheel Skid Steering Mobile Robot
by Fourlas George, Karras George, Kyriakopoulos Kostas                                               177


Data-Driven Monitoring of Cyber-Physical Systems Leveraging on Big Data and the Internet-of-Things for Diagnosis and Control
by Niggemann Oliver, Biswas Gautam, Kinnebrew John, Khorasgani Hamed, Volgmann Sören, Bunte Andreas         185

Diagnosing Advanced Persistent Threats: A Position Paper
by Abreu Rui, Bobrow Daniel, Eldardiry Hoda, Feldman Alexander, Hanley John, Honda Tomonori, De Kleer Johan, Perez Alexandre, Archer Dave, Burke David         193

A Structural Model Decomposition Framework for Hybrid Systems Diagnosis
by Daigle Matthew, Bregon Anibal, Roychoudhury Indranil                                              201


Device Health Estimation by Combining Contextual Control Information with Sensor Data
by Honda Tomonori, Liao Linxia, Eldardiry Hoda, Saha Bhaskar, Abreu Rui, Pavel Radu, Iverson Jonathan         209



On the Learning of Timing Behavior for Anomaly Detection in Cyber-Physical Production Systems
by Maier Alexander, Niggemann Oliver, Eickmeyer Jens                                                  217


The Case for a Hybrid Approach to Diagnosis: A Railway Switch
by Matei Ion, Ganguli Anurag, Honda Tomonori, De Kleer Johan                                          225


Design of PD observer-based fault estimator using a descriptor approach
by Krokavec Dusan, Filasova Anna, Liscinsky Pavol, Serbak Vladimir                                    235


Chronicle based alarm management in startup and shutdown stages
by Vasquez John William, Travé-Massuyès Louise, Subias Audine, Jimenez Fernando, Agudelo Carlos     241


Data-Augmented Software Diagnosis
by Mishali Amir, Stern Roni, Kalech Meir                                                              247


Faults isolation and identification of Heat-exchanger/ Reactor with parameter uncertainties
by Zhang Mei, Dahhou Boutaïeb, Cabassud Michel, Li Ze-Tao                                            253


LPV subspace identification for robust fault detection using a set-membership approach: Application to the wind turbine benchmark
by Chouiref Houda, Boussaid Boumedyen, Abdelkrim Mohamed Naceur, Puig Vicenç, Aubrun Christophe                 261

Processing measure uncertainty into fuzzy classifier
by Monrousseau Thomas, Travé-Massuyès Louise, Le Lann Marie-Véronique                              269




                                  Tools/Benchmarks

Random generator of k-diagnosable discrete event systems
by Pencolé Yannick                                                                                   277


HyDiag: extended diagnosis and prognosis for hybrid systems
by Chanthery Elodie, Pencolé Yannick, Ribot Pauline, Travé-Massuyès Louise                         281








             Regular papers








 A Divide-And-Conquer-Method for Computing Multiple Conflicts for Diagnosis

              Kostyantyn Shchekotykhin¹ and Dietmar Jannach² and Thomas Schmitz²
                             ¹ Alpen-Adria University Klagenfurt, Austria
                               e-mail: kostyantyn.shchekotykhin@aau.at
                             ² TU Dortmund, Germany
                               e-mail: {firstname.lastname}@tu-dortmund.de


Abstract

In classical hitting set algorithms for Model-Based Diagnosis (MBD) that use on-demand conflict
generation, a single conflict is computed whenever needed during tree construction. Since such a
strategy leads to a full “restart” of the conflict-generation algorithm on each call, we propose a
divide-and-conquer algorithm called MERGEXPLAIN which efficiently searches for multiple conflicts
during a single call. The design of the algorithm aims at scenarios in which the goal is to find a
few leading diagnoses and the algorithm can – due to its non-intrusive design – be used in
combination with various underlying reasoners (theorem provers). An empirical evaluation on
different sets of benchmark problems shows that our proposed algorithm can lead to significant
reductions of the required diagnosis times when compared to a “one-conflict-at-a-time” strategy.

1 Introduction

In Model-Based Diagnosis (MBD), the concept of conflicts describes parts of a system which – given
a set of observations – cannot all work correctly. Besides MBD, the calculation of minimal
conflicts is a central task in a number of other AI approaches [1]. Reiter [2] showed that the
minimal hitting sets of conflicts correspond to diagnoses, where a diagnosis is a possible
explanation why a system’s observed behavior differs from its expected behavior. He used this
property for the computation of diagnoses in the breadth-first hitting set tree (HS-tree)
diagnosis algorithm.
   Over time, the principle of this MBD approach was used for a number of different diagnosis
problems such as electronic circuits, hardware descriptions in VHDL, program specifications,
ontologies, and knowledge-based systems [3; 4; 5; 6; 7]. A reason for the broad utilization of
hitting set approaches is that its principle does not depend on the underlying knowledge
representation and reasoning technique, because only a general Theorem Prover (TP) – a component
that returns conflicts – is needed.
   The implementation of a TP can be done in different ways. First, the conflict detection can be
implemented as a reasoning task, e.g., by modifying a consistency checking algorithm [8; 9].
Second, “non-intrusive” conflict detection techniques can be used with a variety of reasoning
approaches, since they require only a very limited reasoning functionality like consistency or
entailment checking without knowing the internals of the reasoning algorithm. Such methods can
benefit from the newest improvements in reasoning algorithms, such as incremental solving,
heuristics, learning strategies, etc., without any modifications.
   A non-intrusive conflict detection algorithm which has shown to be very efficient in different
application scenarios is Junker’s QUICKXPLAIN [10] (QXP for short) which was designed to find a
single minimal conflict based on a divide-and-conquer strategy. The algorithm was originally
developed in the context of constraint problems, but since its method is independent of the
underlying reasoner, it was used in several of the hardware and software diagnosis approaches
mentioned above.
   In many classical hitting set based approaches, conflicts are computed individually with QXP
during HS-tree construction when they are required, as in many domains not all conflicts are known
in advance [11]. This, however, has the effect that QXP has to be “restarted” with a slightly
different configuration whenever a new conflict is needed.
   In this paper, we propose MERGEXPLAIN (MXP for short), a divide-and-conquer algorithm which
searches for multiple conflicts during a single decomposition run. Our method is built upon QXP
and is therefore also non-intrusive. The basic idea behind MXP is that (a) the early
identification of multiple conflicts can speed up the overall diagnosis process, e.g., due to
better conflict “reuses” [2], and that (b) we can identify additional conflicts faster when we
decompose the original components into smaller subsets with the divide-and-conquer strategy of
MXP.
   The paper is organized as follows. After a problem characterization in Section 2, we present
the details of MXP in Section 3 and discuss the properties of the algorithm. Section 4 presents
the results of an extensive empirical evaluation using various diagnosis benchmark problems.
Previous work is finally discussed in Section 5.

2 Preliminaries

2.1 The Diagnosis Problem

We use the definitions of [2] to characterize a system, diagnoses, and conflicts.

Definition 1 (System). A system is a pair (SD, COMPS) where SD is a system description (a set of
logical sentences) and COMPS represents the system’s components (a finite set of constants).
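
As an illustration of the correspondence stated in the Introduction (the minimal hitting sets of the conflicts are the diagnoses [2]), the following Python sketch enumerates minimal hitting sets by brute force. It is not Reiter's HS-tree algorithm; the function name `minimal_hitting_sets` and the two example conflicts are assumptions made only for this illustration.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List


def minimal_hitting_sets(conflicts: Iterable[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Enumerate all minimal hitting sets by increasing size (exponential, demo only)."""
    conflicts = list(conflicts)
    universe = sorted(set().union(*conflicts))
    found: List[FrozenSet[str]] = []
    for size in range(len(universe) + 1):
        for candidate in combinations(universe, size):
            s = frozenset(candidate)
            # Skip supersets of hitting sets already found: they are not minimal.
            if any(h <= s for h in found):
                continue
            # A hitting set shares at least one component with every conflict.
            if all(s & c for c in conflicts):
                found.append(s)
    return found


# Hypothetical conflicts over components c1..c5 (assumed for illustration).
example_conflicts = [frozenset({"c1", "c3"}), frozenset({"c5"})]
diagnoses = minimal_hitting_sets(example_conflicts)
print(sorted(sorted(d) for d in diagnoses))  # [['c1', 'c5'], ['c3', 'c5']]
```

Each printed set picks one component from every conflict, so assuming exactly those components to be faulty resolves all conflicts at once.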






   A diagnosis problem arises when a set of logical sentences OBS, called observations, is
inconsistent with the normal behavior of the system (SD, COMPS). The correct behavior is
represented in SD with an “abnormal” predicate AB/1. That is, for any component cᵢ ∈ COMPS the
literal ¬AB(cᵢ) represents the assumption that the component cᵢ behaves correctly.

Definition 2 (Diagnosis). Given a diagnosis problem (SD, COMPS, OBS), a diagnosis is a minimal set
∆ ⊆ COMPS such that SD ∪ OBS ∪ {AB(c) | c ∈ ∆} ∪ {¬AB(c) | c ∈ COMPS \ ∆} is consistent.

   A diagnosis therefore corresponds to a minimal subset of the system components which, if
assumed to be faulty (and thus behave abnormally) explain the system’s behavior, i.e., are
consistent with the observations.
   Two general classes of MBD algorithms exist. One relies on direct problem encodings and the aim
is often to find one diagnosis quickly, see [12; 13; 14]. The other class relies on the
computation of conflicts and their hitting sets (see next section). Such diagnosis algorithms are
often used when the goal is to find multiple or all minimal diagnoses. In the context of our work,
techniques of the second class can immediately profit when the conflict generation process is done
more efficiently.

2.2 Diagnoses as Hitting Sets

Finding all minimal diagnoses corresponds to finding all minimal hitting sets (HS) of all existing
conflicts [2].

Definition 3 (Conflict). A conflict CS for (SD, COMPS, OBS) is a set {c₁, ..., cₖ} ⊆ COMPS such
that SD ∪ OBS ∪ {¬AB(cᵢ) | cᵢ ∈ CS} is inconsistent.

   Assuming that all components of a conflict work correctly therefore contradicts the
observations. A conflict CS is minimal, if no proper subset of CS is also a conflict.
   To find the set of all minimal diagnoses for a given problem, [2] proposed a breadth-first
HS-tree algorithm with tree pruning and conflict reuse. A correction to this algorithm was
proposed by Greiner et al. which uses a directed acyclic graph (DAG) instead of the tree to
correctly deal with non-minimal conflicts [15]. Our work, however, does not depend on this
correction as QXP as well as our proposed MXP method always return minimal conflicts. Apart from
this, a number of algorithmic variations were suggested in the literature which, for example, use
problem-specific heuristics [16], a greedy search algorithm, or apply parallelization techniques
[17], see also [18] for an overview.

2.3 QUICKXPLAIN (QXP)

QXP was developed in the context of inconsistent constraint satisfaction problems (CSPs) and the
computation of explanations. E.g., in case of an overconstrained CSP, the problem consists in
determining a minimal set of constraints which causes the CSP to become unsolvable for the given
inputs. A simplified version of QXP [10] is shown in Algorithm 1. The rough idea of QXP is to
apply a recursive procedure which relaxes the input set of faulty constraints C by partitioning it
into two sets C₁ and C₂ (line 6). If C₁ is a conflict the algorithm continues partitioning C₁ in
the next recursive call. Otherwise, i.e., if the last partitioning has split all conflicts in C,
the algorithm extracts a conflict from the sets C₁ and C₂. This way, QXP finally identifies single
constraints which are inconsistent with the remaining consistent set of constraints and the
background theory.

Algorithm 1: QUICKXPLAIN(B, C)
   Input: B: background theory, C: the set of possibly faulty constraints
   Output: A minimal conflict CS ⊆ C
   1 if isConsistent(B ∪ C) then return ‘no conflict’;
   2 else if C = ∅ then return ∅;
   3 return GETCONFLICT(B, B, C)

   function GETCONFLICT(B, D, C)
   4     if D ≠ ∅ ∧ ¬isConsistent(B) then return ∅;
   5     if |C| = 1 then return C;
   6     Split C into disjoint, non-empty sets C₁ and C₂
   7     D₂ ← GETCONFLICT(B ∪ C₁, C₁, C₂)
   8     D₁ ← GETCONFLICT(B ∪ D₂, D₂, C₁)
   9     return D₁ ∪ D₂

Theorem 1 ([10]). Let B be a background theory, i.e., a set of constraints considered as correct,
and C be a set of possibly faulty constraints. Then, QUICKXPLAIN always terminates. If B ∪ C is
consistent it returns ‘no conflict’. Otherwise, it returns a minimal conflict CS.

2.4 Using QXP During HS-Tree Construction

Assume that MBD is applied to find an error in the definition of a CSP. The CSP comprises the set
of possibly faulty constraints C. These are the elements of COMPS. The system description SD
corresponds to the semantics of the constraints in C. Finally, the observations OBS are encoded as
unary constraints and are added to the background theory B. During the HS-tree construction, QXP
is called whenever a new node is created and no conflict reuse is possible. As a result, QXP can
either return one minimal conflict, which can be used to label the new node, or return ‘no
conflict’, which would mean that a diagnosis is found at the tree node. Note that QXP can be used
with other algorithms, e.g., preference-based search [19] or boolean search [20], in the same way
as with the HS-tree algorithm.

3 MERGEXPLAIN (MXP): Algorithm Details

3.1 General Considerations

The pseudo-code of MXP, which unlike QXP can return multiple conflicts at a time, is given in
Algorithm 2. MXP, like QXP, is generally applicable to a variety of problem domains. The mapping
to the terminology used in MBD (SD, COMPS, OBS) is straightforward as discussed in the previous
section. In the following, we will use the notation and symbols from [10], e.g., C or B, and
constraints as a knowledge representation formalism.
   Note that there are applications of MBD in which the function isConsistent has to be
“overwritten” to take the specifics of the underlying knowledge representation and reasoning
system into account. The ontology debugging approach presented in [7] for example extends
isConsistent with the verification of entailments of a logical theory. MXP can be used in such
scenarios after the corresponding adaptation of the implementation of isConsistent.
   Furthermore, MXP can be easily extended for cases in which the MBD approach has to support the
specification of (multiple) test cases, i.e., sets of formulas that must be
consistent or inconsistent with the system description, e.g., [21; 22].

3.2 Algorithm Rationale

MXP (Algorithm 2) accepts two sets of constraints as inputs, B as the assumed-to-be-correct set of
background constraints and C, the possibly faulty components/constraints.
   In case C ∪ B is inconsistent, MXP returns a set of minimal conflicts Γ by calling the
recursive function FINDCONFLICTS in line 3. This function again accepts B and C as an input and
returns a tuple ⟨C′, Γ⟩, where Γ is a set of minimal conflicts and C′ ⊂ C is a set of constraints
that does not contain any conflicts, i.e., B ∪ C′ is consistent.

Algorithm 2: MERGEXPLAIN(B, C)
   Input: B: background theory, C: the set of possibly faulty constraints
   Output: Γ, a set of minimal conflicts
    1 if ¬isConsistent(B) then return ‘no solution’;
    2 if isConsistent(B ∪ C) then return ∅;
    3 ⟨_, Γ⟩ ← FINDCONFLICTS(B, C)
    4 return Γ;

   function FINDCONFLICTS(B, C) returns tuple ⟨C′, Γ⟩
    5     if isConsistent(B ∪ C) then return ⟨C, ∅⟩;
    6     if |C| = 1 then return ⟨∅, {C}⟩;
    7     Split C into disjoint, non-empty sets C₁ and C₂
    8     ⟨C₁′, Γ₁⟩ ← FINDCONFLICTS(B, C₁)
    9     ⟨C₂′, Γ₂⟩ ← FINDCONFLICTS(B, C₂)
   10     Γ ← Γ₁ ∪ Γ₂;
   11     while ¬isConsistent(C₁′ ∪ C₂′ ∪ B) do
   12         X ← GETCONFLICT(B ∪ C₂′, C₂′, C₁′)
   13         CS ← X ∪ GETCONFLICT(B ∪ X, X, C₂′)
   14         C₁′ ← C₁′ \ {α} where α ∈ X
   15         Γ ← Γ ∪ {CS}
   16     return ⟨C₁′ ∪ C₂′, Γ⟩

   The logic of FINDCONFLICTS is similar to QXP in that we decompose the problem into two parts in
each recursive call (lines 7–9). Differently from QXP, however, we look for conflicts in both
splits C₁ and C₂ independently and then combine the conflicts that are eventually found in the two
halves (line 10)¹. If there is, e.g., a conflict in the first part and one in the second,
FINDCONFLICTS will find them independently from each other. Of course, there might also be
conflicts in C whose elements are spread across both C₁ and C₂, that is, the set C₁′ ∪ C₂′ ∪ B is
inconsistent. This situation is addressed in lines 11–15. The computation of a minimal conflict is
done by two calls to GETCONFLICT (Algorithm 1).
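
The recursion of Algorithm 2, together with GETCONFLICT from Algorithm 1, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the consistency check is passed in as a black-box callable `ok`, constraints are plain strings, the split simply halves the list, and `toy_oracle` with its three hidden conflicts is a hypothetical example. The comments give the corresponding line numbers of the two algorithms.

```python
from typing import Callable, FrozenSet, List, Tuple

Oracle = Callable[[FrozenSet[str]], bool]  # stands in for isConsistent


def get_conflict(ok: Oracle, B: FrozenSet[str], D: List[str], C: List[str]) -> List[str]:
    """GETCONFLICT of Algorithm 1; C must be non-empty."""
    if D and not ok(B):                                   # Alg. 1, line 4
        return []
    if len(C) == 1:                                       # line 5
        return list(C)
    C1, C2 = C[: len(C) // 2], C[len(C) // 2:]            # line 6
    D2 = get_conflict(ok, B | frozenset(C1), C1, C2)      # line 7
    D1 = get_conflict(ok, B | frozenset(D2), D2, C1)      # line 8
    return D1 + D2                                        # line 9


def find_conflicts(ok: Oracle, B: FrozenSet[str],
                   C: List[str]) -> Tuple[List[str], List[FrozenSet[str]]]:
    """FINDCONFLICTS of Algorithm 2: returns (conflict-free rest, conflicts found)."""
    if ok(B | frozenset(C)):                              # Alg. 2, line 5
        return list(C), []
    if len(C) == 1:                                       # line 6
        return [], [frozenset(C)]
    half = len(C) // 2                                    # line 7
    C1p, gamma1 = find_conflicts(ok, B, C[:half])         # line 8
    C2p, gamma2 = find_conflicts(ok, B, C[half:])         # line 9
    gamma = gamma1 + gamma2                               # line 10
    while not ok(B | frozenset(C1p) | frozenset(C2p)):    # line 11
        X = get_conflict(ok, B | frozenset(C2p), C2p, C1p)   # line 12
        Y = get_conflict(ok, B | frozenset(X), X, C2p)       # line 13
        C1p = [c for c in C1p if c != X[0]]               # line 14 (alpha = X[0])
        gamma.append(frozenset(X + Y))                    # line 15
    return C1p + C2p, gamma                               # line 16


def merge_xplain(ok: Oracle, B: FrozenSet[str], C: List[str]) -> List[FrozenSet[str]]:
    if not ok(B):                                         # line 1
        raise ValueError("no solution: background theory is inconsistent")
    if ok(B | frozenset(C)):                              # line 2
        return []
    return find_conflicts(ok, B, list(C))[1]              # lines 3-4


# Hypothetical oracle: a set is inconsistent iff it contains a hidden conflict.
HIDDEN = [frozenset({"c1", "c3"}), frozenset({"c2", "c4"}), frozenset({"c5"})]


def toy_oracle(constraints: FrozenSet[str]) -> bool:
    return not any(h <= constraints for h in HIDDEN)


conflicts = merge_xplain(toy_oracle, frozenset(), ["c1", "c2", "c3", "c4", "c5"])
print(sorted(sorted(cs) for cs in conflicts))  # [['c1', 'c3'], ['c2', 'c4'], ['c5']]
```

Replacing `toy_oracle` with a call to a real reasoner would leave both functions unchanged, which is the sense in which the method is non-intrusive.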
In the first call this function returns a minimal set X ⊆ C₁′ such that X ∪ C₂′ ∪ B is a conflict
(line 12). In line 13, we then look for a subset of C₂′, say Y, such that Y ∪ X corresponds to a
minimal conflict CS. The latter is added to Γ (line 15). In order to restore the consistency of
C₁′ ∪ C₂′ ∪ B we have to remove at least one element α ∈ CS from either C₁′ or C₂′. Therefore, in
line 14 the algorithm removes α ∈ X ⊆ CS from C₁′.
   Note that MXP allows us to use different split functions in line 7. In our default
implementation we use a function that splits the set of constraints C into two equal parts, i.e.,
split(n) = n/2, where |C| = n. In the worst case this split function results in a perfect binary
tree with n leaves. Consequently, the total number of nodes is 2n − 1, which correspond to
2(2n − 1) consistency checks (lines 5 and 11). Other split functions might result in a similar
number of consistency checks in the worst case as well, since in any case MXP has to traverse a
binary tree with n leaves. For instance, the function split(n) = n − 1 results in a tree with one
branch of the depth n − 1 and n leaves, that is, 2n − 1 nodes to traverse. However, while the
number of nodes to explore might be comparable, the important point is that the computational
costs for the individual consistency checks

C₁ = {c₁, c₂, c₃} and C₂ = {c₄, c₅} and provides them as input to the recursive calls (lines 8 and
9). In the next level of the recursion – marked with 2 in Figure 1 – the input is found to be
inconsistent (line 5) and again partitioned into two sets (line 7). In the subsequent calls, 3 and
4, the two input sets are found to be consistent (line 5) and, therefore, the set {c₁, c₂, c₃} has
to be analyzed using GETCONFLICT (lines 12 and 13) defined in Algorithm 1. GETCONFLICT returns the
conflict {c₁, c₃}, which is added to Γ. Finally, FINDCONFLICTS removes c₁ from the set C₁′ and
returns the tuple ⟨{c₂, c₃}, {{c₁, c₃}}⟩ to 1.
   Next, the “right-hand” part of the initial input, the set C₂ = {c₄, c₅}, is provided as input
to FINDCONFLICTS 5. Since C₂ is inconsistent, it is partitioned into two sets C₁ = {c₄} and
C₂ = {c₅}. The first recursive call 6 returns ⟨{c₄}, ∅⟩ since the input is consistent. The second
call 7, in contrast, finds that the input comprises only one constraint that is inconsistent with
the background theory B. Therefore, it returns ⟨∅, {{c₅}}⟩ in line 6. Since C₁′ ∪ C₂′ = {c₄} ∪ ∅
is consistent with B, FINDCONFLICTS 5 returns ⟨{c₄}, {{c₅}}⟩ to 1.
   Finally, in 1 the set of constraints C₁′ ∪ C₂′ = {c₂, c₃} ∪
can be different depending on the splitting strategy. Un-                   {c4 } is found to be inconsistent with B (line 11) and GET-
der the reasonable assumption that consistency checking of                  C ONFLICT is called. The method returns the conflict {c2 ,c4 }
smaller sets of constraints requires less time, the function                and c2 is removed from C10 . The resulting set {c3 ,c4 } is con-
split(n) = n/2 allows MXP to split the set of constraints                   sistent and MXP returns Γ = {{c1 ,c3 }, {c5 }, {c2 , c4 }}.
faster, thus, improving the overall runtime.
                                                                            3.4 Properties of M ERGE X PLAIN
3.3 Example                                                                 Theorem 2. Given a background theory B and a set of con-
Consider a CSP consisting of six constraints {c0 , ..., c5 }.               straints C, Algorithm 2 always terminates and returns
The constraint c0 is considered correct, i.e., B = {c0 }. Let                  • ‘no solution’, if B is inconsistent,
{{c0 , c1 , c3 }, {c0 , c5 }, {c2 , c4 }} be the set of minimal con-
flicts. Algorithm 2 proceeds as follows (Figure 1).                            • ∅, if B ∪ C is consistent, and
   Since the input CSP (B ∪ C) is not consistent, the al-                      • a set of minimal conflicts Γ, otherwise.
gorithm enters the recursion. In the first step, FIND C ON -
FLICTS partitions the input set (line 7) into the two subsets               Proof. In the first case, given an inconsistent background
                                                                            theory B, the algorithm terminates in line 1 and returns ‘no
   1
       The calls in line 8 and 9 can in fact be executed in parallel.       solution’. In the second case, if the set B ∪ C is consistent,






① C1 = {c1, c2, c3}, C2 = {c4, c5}; ⟨{c2, c3}, {{c1, c3}}⟩; ⟨{c4}, {{c5}}⟩; Γ = {{c1, c3}, {c5}} ∪ {{c2, c4}}; C = {c3, c4}
    ② C1 = {c1, c2}, C2 = {c3}; ⟨{c1, c2}, ∅⟩; ⟨{c3}, ∅⟩; Γ = ∅ ∪ {{c1, c3}}; C = {c2, c3}
        ③ B ∪ C = {c0, c1, c2}; isConsistent ✓
        ④ B ∪ C = {c0, c3}; isConsistent ✓
    ⑤ C1 = {c4}, C2 = {c5}; ⟨{c4}, ∅⟩; ⟨∅, {{c5}}⟩; isConsistent ✓
        ⑥ B ∪ C = {c0, c4}; isConsistent ✓
        ⑦ B ∪ C = {c0, c5}; isConsistent ✗; |C| = 1

Figure 1: MERGEXPLAIN recursion tree. Each node shows values of selected variables in the FINDCONFLICTS function.
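The recursion shown in Figure 1 can be replayed with a small Python sketch of Algorithm 2 on top of the GETCONFLICT function of Algorithm 1. This is an illustration only, not the Java/Choco implementation used in the evaluation. Several parts are assumptions made for the sketch: the theorem prover is replaced by a toy checker (`make_toy_checker`, a hypothetical helper) that treats a set as inconsistent exactly when it contains one of the known minimal conflicts, `split` rounds up so that five constraints are divided 3/2 as in the example, and line 14 removes the first element of X.

```python
from typing import Callable, FrozenSet, Iterable, List, Optional, Sequence, Tuple

Constraint = str
Check = Callable[[Sequence[Constraint]], bool]


def make_toy_checker(minimal_conflicts: Iterable[Iterable[Constraint]]) -> Check:
    # ASSUMPTION: stand-in for the theorem prover TP. A set of constraints is
    # inconsistent iff it contains one of the given minimal conflicts.
    known = [frozenset(c) for c in minimal_conflicts]

    def is_consistent(constraints: Sequence[Constraint]) -> bool:
        present = set(constraints)
        return not any(conflict <= present for conflict in known)

    return is_consistent


def split(c: List[Constraint]) -> Tuple[List[Constraint], List[Constraint]]:
    # split(n) = n/2, rounded up (assumption) so that |C| = 5 yields parts of size 3 and 2.
    k = (len(c) + 1) // 2
    return c[:k], c[k:]


def get_conflict(b: List[Constraint], d: List[Constraint], c: List[Constraint],
                 is_consistent: Check) -> List[Constraint]:
    # GETCONFLICT of Algorithm 1 (QUICKXPLAIN).
    if d and not is_consistent(b):
        return []
    if len(c) == 1:
        return list(c)
    c1, c2 = split(c)
    d2 = get_conflict(b + c1, c1, c2, is_consistent)
    d1 = get_conflict(b + d2, d2, c1, is_consistent)
    return d1 + d2


def find_conflicts(b: List[Constraint], c: List[Constraint],
                   is_consistent: Check) -> Tuple[List[Constraint], List[FrozenSet[Constraint]]]:
    if is_consistent(b + c):                       # line 5
        return c, []
    if len(c) == 1:                                # line 6
        return [], [frozenset(c)]
    c1, c2 = split(c)                              # line 7
    r1, g1 = find_conflicts(b, c1, is_consistent)  # line 8
    r2, g2 = find_conflicts(b, c2, is_consistent)  # line 9
    gamma = g1 + g2                                # line 10
    while not is_consistent(b + r1 + r2):          # line 11
        x = get_conflict(b + r2, r2, r1, is_consistent)       # line 12
        cs = x + get_conflict(b + x, x, r2, is_consistent)    # line 13
        r1 = [e for e in r1 if e != x[0]]          # line 14, alpha = first element of X (assumption)
        gamma.append(frozenset(cs))                # line 15
    return r1 + r2, gamma                          # line 16


def merge_xplain(b: Sequence[Constraint], c: Sequence[Constraint],
                 is_consistent: Check) -> Optional[List[FrozenSet[Constraint]]]:
    b, c = list(b), list(c)
    if not is_consistent(b):
        return None                                # 'no solution'
    if is_consistent(b + c):
        return []                                  # no conflicts
    _, gamma = find_conflicts(b, c, is_consistent)
    return gamma


if __name__ == "__main__":
    tp = make_toy_checker([{"c0", "c1", "c3"}, {"c0", "c5"}, {"c2", "c4"}])
    gamma = merge_xplain(["c0"], ["c1", "c2", "c3", "c4", "c5"], tp)
    print([sorted(cs) for cs in gamma])  # [['c1', 'c3'], ['c5'], ['c2', 'c4']]
```

With B = {c0} and the minimal conflicts of Section 3.3, the sketch returns the three conflicts in the order in which nodes ②, ⑤ and ① of Figure 1 produce them.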


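The recursion tree in Figure 1 has seven nodes for five constraints. The worst case discussed in Section 3.2, where the recursion only stops at single constraints, can be checked with a short Python sketch that counts the nodes and the depth of the tree for the two split functions mentioned there. The function names are hypothetical, and rounding n/2 up is an assumption.

```python
from typing import Callable, Tuple


def tree_stats(n: int, split: Callable[[int], int]) -> Tuple[int, int]:
    """Return (number of nodes, depth) of the worst-case recursion tree for n constraints.

    Worst case (assumption): every recursive call is inconsistent until a single
    constraint is left, e.g. when each constraint alone conflicts with B.
    """
    if n == 1:
        return 1, 0
    left = split(n)          # size of the first part; the second part gets the rest
    right = n - left
    left_nodes, left_depth = tree_stats(left, split)
    right_nodes, right_depth = tree_stats(right, split)
    return 1 + left_nodes + right_nodes, 1 + max(left_depth, right_depth)


def split_half(n: int) -> int:
    return (n + 1) // 2      # split(n) = n/2, rounded up


def split_all_but_one(n: int) -> int:
    return n - 1             # split(n) = n - 1


if __name__ == "__main__":
    for n in (5, 8, 16):
        print(n, tree_stats(n, split_half), tree_stats(n, split_all_but_one))
```

Both strategies visit 2n − 1 nodes, but the halving split keeps the tree shallow, while split(n) = n − 1 produces one branch of depth n − 1 whose calls operate on large constraint sets.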
then no subset of C is a conflict. MXP terminates and returns ∅.
   Finally, if the set B ∪ C is inconsistent, the algorithm enters the recursion in line 3. The function FINDCONFLICTS in each call partitions the input set C into two sets C1 and C2. The partitioning continues until either the found set of constraints C is consistent or a singleton conflict is detected. Therefore, every recursion branch ends after at most log |C| − 1 calls. Consequently, FINDCONFLICTS terminates if the conflict detection loop in lines 11–15 always terminates.
   We consider two situations. If the set C1′ ∪ C2′ is consistent with B, the loop terminates. Otherwise, in each iteration at least one conflict in the set C1′ ∪ C2′ is resolved. This fact follows from Theorem 1, according to which the function GETCONFLICT in Algorithm 1 always returns a minimal conflict if the input parameter C is inconsistent with B. Since the number of conflicts is finite and in each iteration one of the conflicts in C1′ ∪ C2′ is resolved in line 14, the loop will terminate after a finite number of iterations. Consequently, Algorithm 2 terminates and returns a set of minimal conflicts Γ.

Corollary 1. Given a consistent background theory B and a set of inconsistent constraints C, Algorithm 2 always returns a set of minimal conflicts Γ such that there exists a diagnosis ∆i ⊆ ⋃_{CSi ∈ Γ} CSi.
   The proof follows from the fact that, similar to the HS-tree algorithm, a conflict is resolved by removing one of its elements from the set of constraints C1′ in line 14. The loop in line 11 guarantees that every conflict CSi ∈ C1′ ∪ C2′ is hit. Consequently, FINDCONFLICTS hits every conflict in the input set C and the set of constraints {α1, . . . , αn} removed in every call of line 14 is equal to or a superset of a diagnosis of the problem. The construction of at least one diagnosis from the found conflicts Γ can be done by the HS-tree algorithm.
   MXP can in principle use several strategies for the resolution of conflicts in line 14. The strategy used in MXP by default is conservative and allows us to find several conflicts at once. Two additional elimination strategies can be used in line 14: (1) C1′ ← C1′ \ X or (2) C1′ ← C1′ \ CS and C2′ ← C2′ \ CS. These more aggressive strategies result in a smaller number of conflicts returned by MXP in each call, but each call returns the results faster. However, for the latter strategies MXP might not return enough minimal conflicts for the HS-tree algorithm to compute at least one diagnosis. For instance, let {{c1, c2}, {c1, c3}, {c2, c4}} be the set of all minimal conflicts. If MXP returns Γ = {{c1, c2}}, which is one of the possible valid outputs, then the HS-tree algorithm fails to find a diagnosis as {c1, c2} must be hit twice. In this case, the HS-tree algorithm must call MXP multiple times or another algorithm for diagnosis computation must be used, e.g., [23].

Corollary 2. Algorithm 2 is sound, i.e., every set CS ∈ Γ is a minimal conflict, and complete, i.e., given a diagnosis problem for which at least one minimal conflict exists, Algorithm 2 returns Γ ≠ ∅.
   The soundness of the algorithm follows from Theorem 1, since the conflict computation of MXP uses the GETCONFLICT function of QXP. The completeness is shown as follows: Let B be a background theory and C a set of faulty constraints, i.e., B ∪ C is inconsistent. Assume MXP returns Γ = ∅, i.e., no minimal conflicts are found. However, this is impossible, since the loop in line 11 will never end. Consequently, Algorithm 2 will not terminate, which contradicts our assumption. Hence, it holds that MXP is complete.

4 Evaluation
We have evaluated the efficiency of computing multiple conflicts at once with MXP using a number of different diagnosis benchmark problems. As a baseline for the comparison, we use QXP as a Theorem Prover, which returns exactly one minimal conflict at a time. Furthermore, we made measurements with a variant of MXP called PMXP in which lines 8 and 9 are executed in parallel in two threads on a multi-core computer.

4.1 Benchmark Problems
We made experiments with different benchmark problems. First, we used the first five systems of the DX Competition (DXC) 2011 Synthetic Track. For each system, 20 scenarios are specified in which artificial faults were injected. In addition, we made experiments with a number of CSP problems from the CSP solver competition 2008 and several CSP encodings of real-world spreadsheets. The injection of faults was done in the same way as in [17].






   In addition to these benchmark problems, we developed a diagnosis problem generator, which can be configured to generate (randomized) diagnosis problems with varying characteristics, e.g., with respect to the number of conflicts, their size, or their position in the system description SD.

4.2 Measurement Method
We implemented all algorithms in a Java-based MBD framework, which uses Choco as an underlying constraint solver, see [17]. The experiments were conducted on a laptop computer (Intel i7, 8GB RAM). As a performance indicator we use the time needed (“wall clock”) for computing one or more diagnoses. The reported running time numbers are averages of 100 runs of each problem setting that were done to avoid random effects. We furthermore randomly shuffled the ordering of the constraints in each run to avoid effects that might be caused by a certain positioning of the conflicts in SD. For the evaluation of MXP we used the most aggressive elimination strategy (2) as described in Section 3.4.
   Since MXP can return more than one conflict at a time, it is expected to be particularly useful when the problem is to find a set of n first (leading) diagnoses, e.g., in the context of applying MBD to software debugging [5; 7]. We therefore report the results for the tasks “find-one-diagnosis” (as an extreme case) and “find-n-diagnoses”.

   System   #C    #V    #F      #D (range)    #D (avg.)   |D|    #Cf    |Cf|
   74182    21    28    4 - 5   30 - 300      139         4.66   4.9    3.3
   74L85    35    44    1 - 3   1 - 215       66.4        3.13   5.9    8.3
   74283    38    45    2 - 4   180 - 4,991   1,232.7     4.42   78.8   16.1
   74181    67    79    3 - 6   10 - 3,828    877.8       4.53   7.8    10.6
   c432     162   196   2 - 5   1 - 6,944     1,069.3     3.38   15.0   19.8

Table 1: Characteristics of selected DXC benchmarks. #C: number of constraints, #V: number of variables, #F: number of injected faults, #D (range): range of the number of diagnoses, #D (avg.): average number of the diagnoses, |D|: average diagnosis size, #Cf: average number of conflicts, |Cf|: average conflict size.

   System   QXP-5 [ms]   MXP-5 Improv.   QXP-1 [ms]   MXP-1 Improv.
   74182    17.0         19%             17.0         19%
   74L85    20.9         15%             16.1         19%
   74283    61.2         29%             53.8         32%
   74181    691.8        45%             637.0        47%
   c432     707.5        25%             503.9        37%

Table 2: Performance gains for DXC benchmarks when searching for the first n diagnoses of minimal cardinality.
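The measurement procedure of Section 4.2 (100 runs per problem setting, a freshly shuffled constraint order in each run, averaged wall-clock time) can be sketched as follows. This is a hypothetical Python harness, not the Java framework used for the experiments. The formula in `relative_improvement` is an assumption: the text does not state how the "Improv." columns are computed, although (QXP − MXP) / QXP is consistent with, e.g., the 615 ms and 376 ms reported for c8 in Table 4.

```python
import random
import time
from statistics import mean
from typing import Callable, List, Sequence


def average_wall_clock_ms(task: Callable[[List[str]], object],
                          constraints: Sequence[str],
                          runs: int = 100,
                          seed: int = 42) -> float:
    """Average wall-clock time of `task` in milliseconds over `runs` runs.

    The constraint order is shuffled before every run so that the position of
    the conflicts does not systematically favor one algorithm.
    """
    rng = random.Random(seed)        # fixed seed (assumption) for repeatable orderings
    timings = []
    for _ in range(runs):
        shuffled = list(constraints)
        rng.shuffle(shuffled)
        start = time.perf_counter()
        task(shuffled)               # e.g. compute the first n diagnoses
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def relative_improvement(baseline_ms: float, new_ms: float) -> float:
    # ASSUMPTION: improvement relative to the baseline (QXP) running time.
    return (baseline_ms - new_ms) / baseline_ms


if __name__ == "__main__":
    avg = average_wall_clock_ms(lambda cs: sorted(cs), [f"c{i}" for i in range(50)])
    print(f"average: {avg:.4f} ms")
    print(f"c8 (Table 4): {relative_improvement(615.0, 376.0):.0%}")
```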
   The task of finding a single diagnosis is comparably simple and “direct encodings” or algorithms like INVERSE-QUICKXPLAIN [23] are typically more efficient for this task than the HS-tree algorithm. For instance, INVERSE-QUICKXPLAIN requires only O(|∆| log(|C|/|∆|)) calls to TP. If TP can check the consistency in polynomial time, then one diagnosis can also be computed efficiently. The problem of finding more than one diagnosis is very different and computationally challenging, because deciding whether an additional diagnosis exists is NP-complete [24]. In such settings the application of methods that are highly efficient for finding one diagnosis is not always advantageous. For instance, the evaluation presented in [14] demonstrates this fact for direct encodings. Therefore a comparison of our algorithm with approaches for the “find-one-diagnosis” problem is beyond the scope of our work, as we are interested in problem settings in which the HS-tree algorithm is favorable and no assumptions about the underlying reasoner should be made. When the task is to find all diagnoses, the performance of MXP is similar to that of QXP as all existing conflicts have to be determined.

4.3 Results
DXC Benchmark Problems. Table 1 shows the characteristics of the analyzed and CSP-encoded DXC benchmark problems. Since we consider multiple scenarios per system, the number of faults and the corresponding diagnoses can vary strongly across the experiment runs.
   Table 2 shows the observed performance gains when using MXP instead of QXP in terms of absolute numbers (ms) and the relative improvement. For the problem of finding the first 5 diagnoses (QXP-5/MXP-5), the observed improvements range from 15% up to 45%. For the extreme case of finding one single diagnosis, even slightly stronger improvements can be observed. The improvements when searching for, e.g., the first 10 diagnoses are similar for cases in which significantly more than 10 diagnoses actually exist.

Constraint Problems / Spreadsheets. The characteristics for the next set of benchmark problems (six CSP competition instances, five CSP-encoded real-world spreadsheets with injected faults [17]) are shown in Table 3.

   Scenario              #C    #V    #F   #D     |D|    #Cf    |Cf|
   c8                    523   239   8    4      6.25   7      1.6
   costasArray-13        87    88    2    >5     3.6    >565   45.6
   domino-100-100        100   100   3    81     2      2      15
   graceful–K3-P2        60    15    4    >117   2.94   >12    29.2
   mknap-1-5             7     39    1    2      1      1      2
   queens-8              28    8     15   9      10.9   15     2.8
   hospital payment      38    75    4    40     4      4      3
   profit calculation    28    140   5    42     4.25   11     9
   course planning       457   583   2    3024   2      2      55.5
   preservation model    701   803   1    22     1      1      22
   revenue calculation   93    154   4    1452   3      3      15.7

Table 3: Characteristics of selected CSP settings.

   The results for determining the first five minimal diagnoses are shown in Table 4². Again, performance improvements of up to 54% can be observed. The obtained improvements vary quite strongly across the different problem instances: the higher the complexity of the underlying problem, the stronger are the improvements achieved with our new method. Only in the two cases in which only one single conflict exists (see Table 3), the performance can slightly degrade as MXP performs an additional check if further conflicts among the remaining constraints exist.

   ² The results for finding one diagnosis follow the same trend.

Systematically Generated MBD Problems. To be able to systematically analyze which factors potentially influence the obtained performance improvements, we developed an MBD problem generator in which we could vary (i) the






overall number of COMPS, (ii) the number of conflicts and their average size (and as a consequence the number of diagnoses), and (iii) the position of the conflicts in the database. We considered the last aspect because the performance of QXP and MXP can largely depend on this aspect³. If, e.g., there is only one conflict and the conflict is represented by the two “left-most” elements in SD, QXP’s divide-and-conquer strategy will be able to rule out most other elements very fast.

   ³ We assume a splitting strategy in which the elements are simply split in half in the middle with no particular ordering of the elements.

   Scenario              QXP [ms]    MXP [ms]   Impr.
   c8                    615         376        39%
   costasArray-13        1,379,842   629,366    54%
   domino-100-100        417         389        7%
   graceful–K3-P2        1,611       1,123      30%
   mknap-1-5             32          36         -11%
   queens-8              281         245        13%
   hospital payment      1,717       1,360      21%
   profit calculation    86          76         12%
   course planning       2,045       1,544      25%
   preservation model    371         391        -5%
   revenue calculation   109         87         21%

Table 4: Results for CSP benchmarks and spreadsheets when searching for 5 diagnoses.

   We evaluated the following configurations regarding the position of the conflicts (see Table 5): (a) Random: The elements of each conflict are randomly distributed across SD; (b) Left/Right: All elements of the conflict appear in exactly one half of SD; (c) LaR (Left and Right): Conflicts are both in the left and right half, but not spanning both halves; (d) Neighb.: Conflicts appear randomly across SD, but only involve “neighboring” elements.
   One specific rationale of evaluating these constellations individually is that conflicts in some application domains (e.g., when debugging knowledge bases) might represent “local” inconsistencies in SD.
   Since the conflicts are known in advance in this experiment, no CSP solver is needed to determine the consistency of a given set of constraints. Because zero computation times are unrealistic, we added simulated consistency checking times in each call to the TP. The value of the simulated time quadratically increases with the number of constraints to be checked and is capped in the experiments at 10 milliseconds. We made additional tests with different consistency checking times to evaluate to which extent the improvements obtained with MXP depend on the complexity of an individual consistency check for the underlying problem. However, these tests did not lead to any significant differences.

   #Cp   #Cf   |Cf|   Cf Pos.   QXP [ms]   MXP Impr.   PMXP Impr.
   50    5     2      Random    351        27%         30%
   50    5     2      Left      161        6%          10%
   50    5     2      Right     481        69%         70%
   50    5     2      LaR       293        51%         57%
   50    5     2      Neighb.   261        54%         58%
   100   5     2      Random    417        33%         35%
   100   5     2      Left      181        14%         17%
   100   5     2      Right     622        75%         76%
   100   5     2      LaR       351        58%         63%
   100   5     2      Neighb.   314        62%         65%
   50    15    4      Random    2,300      22%         20%
   50    15    4      Left      452        -8%         -4%
   50    15    4      Right     1,850      72%         73%
   50    15    4      LaR       3,596      22%         18%
   50    15    4      Neighb.   166,335    43%         43%

Table 5: Results when varying the problem characteristics.

   Table 5 shows some of the results of this simulation. In this evaluation, we also include the results of the parallelized PMXP variant. The following observations can be made.
   (1) The performance of QXP strongly depends on the position of the conflicts. In the probably most realistic Random case, MXP helps to reduce the computation times by around 20-30%. In the constellations that are “unfortunate” for QXP, the speedups achieved with MXP can be as high as 75%. When QXP is “lucky” and all conflicts are clustered in the left part of SD, some improvements or light deteriorations can be observed for MXP. The latter two situations (all conflicts are clustered in one half) are actually quite improbable but help us better understand which factors influence the performance.
   (2) When comparing the results of the first two blocks in the table, it can be seen that the improvements achieved with MXP are stronger when there are more components in SD and more time is needed for performing the individual consistency checks. This is in line with the results of the other experiments.
   (3) Parallelization can help to obtain modest additional improvements. The strongest improvements are observed for the LaR configuration, which is intuitive as PMXP by design explores the left and right halves independently in parallel. Note that in the experiments with the DXC and the CSP benchmark problems, in most cases we could not observe runtime improvements through parallelization. This is caused by two facts. First, the consistency checking times are often on average below 1 ms, which means that the relative overhead of starting a new thread can be comparably high. Second, the used CSP solver causes some additional overheads and thread synchronization when used in multiple threads in parallel.

5 Related Work
In [10], Junker informally sketches a possible extension of QXP to be able to compute multiple “preferred explanations” in the context of Preference-Based Search (PBS). The general goal of Junker’s approach is partially similar to our work and the proposed extended version of QXP could in theory be used during the HS-tree construction as well.
   Technically, Junker proposes to set a choice point whenever a constraint ci is found to be consistent with a partial relaxation during search and thereby look for (a) branches that lead to conflicts not containing ci and (b) branches leading to conflicts in which the removal of ci leads to a solution.
   Unfortunately, it is not fully clear from the informal sketch in [10] where the mentioned choice point should be set. If applied in line 5 of Algorithm 1, conflicts are only found in the left-most inconsistent partition. The method would then return only a small subset of all conflicts






MERGEXPLAIN would return. If the split is done for every                   techniques such as MARCO [29] aim at the enumeration of
ci consistent with a partial relaxation during PBS, the result-            conflicts. In general, many of these algorithms use a similar
ing diagnosis algorithm corresponds to the binary HS-tree                  divide-and-conquer principle as we do with MXP. How-
method [25], which according to the experiments in [11] is                 ever, such algorithms – including the ones listed above –
not generally favorable over HS-Tree algorithms, in partic-                often modify the underlying knowledge base by adding re-
ular when we are searching for a limited set of diagnoses.                 laxation variables to clauses of a given unsatisfiable formula
   From the algorithm design, note that QXP applies a con-                 and then use a SAT solver to find the relaxations. This strat-
structive conflict computation procedure prior to partition-               egy roughly corresponds to the direct diagnoses approaches
ing, whereas MXP does the partitioning first – thereby re-                 discussed above. MXP, in contrast, acts completely inde-
moving multiple constraints at a time – and then uses a                    pendently of the underlying knowledge representation lan-
divide-and-conquer conflict detection approach. Finally, our               guage. Moreover, the problem-independent decomposition
method can, depending on the configuration, make a guaran-                 approach used by MXP is a novel feature which – to the
tee about the existence of a diagnosis given the returned con-             best of our knowledge – is not present in the existing con-
flicts without the need of computing all existing conflicts.               flict detection techniques from the MaxSAT field. Specifi-
                                                                           cally, it allows our algorithm to find multiple conflicts more
   In general, our work is related to a variety of (com-
                                                                           efficiently because it searches for them within independent
plete) approaches from the MBD literature which aim to
                                                                           small subsets of the original knowledge base. In addition,
find diagnoses more efficiently than with Reiter’s original
                                                                           MXP can find conflicts in knowledge bases formulated in
method. Existing works for example try to speed up the
                                                                           very expressive knowledge representation languages, such
process by exploiting existing hierarchical, tree-like or dis-
                                                                           as description logics, which cannot be efficiently translated
tributed structural properties of the underlying problem [16;
                                                                           to SAT, see also [23].
26], through parallelization [17], or by solving the dual
problem [27; 28; 29]. A main difference to these works
is that we make no assumption about the underlying problem                 6 Conclusions
structure and leave the general HS-tree procedure unchanged.               We have proposed and evaluated a novel, general-purpose
Instead, our aim is to avoid a full restart of the                         and non-intrusive conflict detection strategy called
conflict search process when constructing a new node by                    MergeXplain, which is capable of detecting multiple
looking for additional conflicts that may exist in each                    conflicts in a single call. An evaluation on various benchmark
call, and thereby to speed up the overall process.                         problems revealed that MergeXplain can help to
   Besides complete methods, a number of approximate diagnosis             significantly reduce the required computation times when
approaches have been proposed in recent years,                             applied in a Model-Based Diagnosis setting in which the
which for example use stochastic and heuristic search [30;                 goal is to find a defined number of diagnoses and in
31]. The relation of our work to these approaches is limited               which no assumption about the underlying reasoning engine
as we are focusing on application scenarios where the goal                 should be made.
is to find the first few diagnoses more quickly while at the same             One additional property of MergeXplain is that the
time maintaining the completeness property. Finally, for some              union of the elements of the returned conflict sets is guaranteed
domains, “direct” encodings, either SAT-based, e.g., [32], or              to be a superset of one diagnosis of the original problem.
CSP-based, e.g., [33], have been shown in recent years to be               Recent methods like the one proposed in [23] can
very efficient at finding one or a few diagnoses. For instance, [33]       therefore be applied to find one minimal diagnosis quickly.
suggests an encoding scheme that first translates a given diagnosis
problem (SD, COMPS, OBS) into a CSP. Then a specific
diagnosis algorithm is applied that searches for conflict
                                                                           Acknowledgements
sets with increasing cardinality, i.e., 1, 2, ..., |COMPS|. The            This work was supported by the Carinthian Science Fund
same method is then used to search for diagnoses in the set                (KWF) contract KWF-3520/26767/38701, the Austrian Sci-
of all found conflict sets. In order to speed up the compu-                ence Fund (FWF), and the German Research Foundation
tations the author suggests a kind of hierarchical approach                (DFG) under contract numbers I 2144 N-15 and JA 2095/4-
that helps the user spot the relevant components. Generally,               1 (Project “Debugging of Spreadsheet Programs”).
most of the “direct” methods require the use of additional
techniques like hierarchical diagnosis or iterative deepening              References
that constrain the cardinality of computed diagnoses while
computing minimal diagnoses.                                               [1] Ulrich Junker. QUICKXPLAIN: Conflict Detection
   The concept of conflicts plays a central role in reasoning                  for Arbitrary Constraint Propagation Algorithms. In
contexts other than Model-Based Diagnosis, e.g.,                               IJCAI ’01 Workshop on Modelling and Solving problems
explanations or dynamic backtracking. Specifically, in recent                  with constraints (CONS-1), 2001.
years a number of approaches were proposed in the                          [2] Raymond Reiter. A Theory of Diagnosis from First
context of the maximum satisfiability problem (MaxSAT),                        Principles. Artificial Intelligence, 32(1):57–95, 1987.
see [34] for a recent survey. In these domains the conflicts
                                                                           [3] Gerhard Friedrich, Markus Stumptner, and Franz
are referred to as unsatisfiable cores or Minimally Unsatisfiable
Subsets (MUSes); Minimal Correction Subsets                                    Wotawa. Model-Based Diagnosis of Hardware Designs.
(MCSes), on the other hand, correspond to the concept of                       Artificial Intelligence, 111(1-2):3–39, 1999.
diagnoses in this paper. In [35] or [36], for example, dif-                [4] Cristinel Mateis, Markus Stumptner, Dominik
ferent algorithms were recently proposed to find one so-                       Wieland, and Franz Wotawa. Model-Based Debug-
lution to the MaxSAT problem, which corresponds to the                         ging of Java Programs. In Proceedings AADEBUG
problem of finding one minimal/preferred diagnosis. Other                      ’00 Workshop, 2000.






[5] Dietmar Jannach and Thomas Schmitz. Model-based                  [21] Alexander Felfernig, Gerhard Friedrich, Dietmar Jan-
     diagnosis of spreadsheet programs: a constraint-based                nach, and Markus Stumptner. Consistency-based di-
     debugging approach. Automated Software Engineer-                     agnosis of configuration knowledge bases. Artificial
     ing, 2014.                                                           Intelligence, 152(2):213–234, 2004.
[6] Jules White, David Benavides, Douglas C. Schmidt,                [22] Gerhard Friedrich and Kostyantyn Shchekotykhin. A
     Pablo Trinidad, Brian Dougherty, and Antonio Ruiz                    General Diagnosis Method for Ontologies. In Pro-
     Cortés. Automated Diagnosis of Feature Model                         ceedings ISWC ’05, pages 232–246, 2005.
     Configurations. Journal of Systems and Software,                [23] Kostyantyn Shchekotykhin, Gerhard Friedrich, Patrick
     83(7):1094–1107, 2010.                                               Rodler, and Philipp Fleiss. Sequential diagnosis of
[7] Kostyantyn Shchekotykhin, Gerhard Friedrich,                          high cardinality faults in knowledge-bases by direct
     Philipp Fleiss, and Patrick Rodler.          Interactive             diagnosis generation. In Proceedings ECAI ’14, pages
     ontology debugging: Two query strategies for effi-                   813–818, 2014.
     cient fault localization. Journal of Web Semantics,             [24] Thomas Eiter and Georg Gottlob. The Complexity of
     12-13:88–103, 2012.
                                                                          Logic-Based Abduction. Journal of the ACM (JACM),
[8] Franz Baader and Rafael Penaloza. Axiom Pinpointing                   42(1):1–49, 1995.
     in General Tableaux. Journal of Logic and Computa-
                                                                     [25] Li Lin and Yunfei Jiang. The computation of hitting
     tion, 20(1):5–34, 2008.
                                                                          sets: Review and new algorithms. Information Pro-
[9] Johan de Kleer. A Comparison of ATMS and CSP                          cessing Letters, 86(4):177–184, May 2003.
     Techniques. In Proceedings IJCAI ’89, pages 290–
     296, 1989.                                                      [26] F Wotawa and I Pill. On classification and modeling
                                                                          issues in distributed model-based diagnosis. AI Com-
[10] Ulrich Junker. QUICKXPLAIN: Preferred Explana-
                                                                          munications, 26(1):133–143, 2013.
     tions and Relaxations for Over-Constrained Problems.
     In Proceedings AAAI ’04, pages 167–172, 2004.                   [27] Ken Satoh and Takeaki Uno. Enumerating Minimally
                                                                          Revised Specifications Using Dualization. In JSAI ’05
[11] Ingo Pill, Thomas Quaritsch, and Franz Wotawa. From
                                                                          Workshop, pages 182–189, 2005.
     Conflicts to Diagnoses: An Empirical Evaluation of
     Minimal Hitting Set Algorithms. In Proceedings DX               [28] Roni Stern, Meir Kalech, Alexander Feldman, and
     ’11 Workshop, pages 203–211, 2011.                                   Gregory Provan. Exploring the Duality in Conflict-
[12] Alexander Feldman, Gregory Provan, Johan de Kleer,                   Directed Model-Based Diagnosis. In Proceedings
                                                                          AAAI ’12, pages 828–834, 2012.
     Stephan Robert, and Arjan van Gemund. Solv-
     ing Model-Based Diagnosis Problems with Max-SAT                 [29] Mark H. Liffiton, Alessandro Previti, Ammar Malik,
     Solvers and Vice Versa. In Proceedings DX ’10 Work-                  and Joao Marques-Silva. Fast, Flexible MUS Enumer-
     shop, pages 185–192, 2010.                                           ation. Constraints, pages 1–28, 2015.
[13] Amit Metodi, Roni Stern, Meir Kalech, and Michael               [30] Lin Li and Jiang Yunfei. Computing Minimal Hitting
     Codish. A Novel SAT-Based Approach to Model                          Sets with Genetic Algorithm. In Proceedings DX ’02
     Based Diagnosis. Journal of Artificial Intelligence                  Workshop, pages 1–4, 2002.
     Research, 51:377–411, 2014.                                     [31] A Feldman, G Provan, and A van Gemund. Approximate
[14] Iulia Nica, Ingo Pill, Thomas Quaritsch, and Franz                   Model-Based Diagnosis Using Greedy Stochastic
     Wotawa. The Route to Success – A Performance                         Search. Journal of Artificial Intelligence Research,
     Comparison of Diagnosis Algorithms. In Proceedings                   38:371–413, 2010.
     IJCAI ’13, pages 1039–1045, 2013.                               [32] Amit Metodi, Roni Stern, Meir Kalech, and Michael
[15] R Greiner, B A Smith, and R W Wilkerson. A Correc-                   Codish.      Compiling Model-Based Diagnosis to
     tion to the Algorithm in Reiter’s Theory of Diagnosis.               Boolean Satisfaction. In Proceedings AAAI ’12, pages
     Artificial Intelligence, 41(1):79–88, 1989.                          793–799, 2012.
[16] Markus Stumptner and Franz Wotawa. Diagnos-                     [33] Yannick Pencolé. DITO: a CSP-based diagnostic en-
     ing tree-structured systems. Artificial Intelligence,                gine. In Proceedings ECAI ’14, pages 699–704, 2014.
     127(1):1–29, 2001.
                                                                     [34] Antonio Morgado, Federico Heras, Mark Liffiton,
[17] Dietmar Jannach, Thomas Schmitz, and Kostyantyn                      Jordi Planes, and Joao Marques-Silva. Iterative and
     Shchekotykhin. Parallelized Hitting Set Computation                  core-guided MaxSAT solving: A survey and assess-
     for Model-Based Diagnosis. In Proceedings AAAI ’15,                  ment. Constraints, 18(4):478–534, 2013.
     pages 1503–1510, 2015.
                                                                     [35] Jessica Davies and Fahiem Bacchus. Postponing opti-
[18] Johan de Kleer. Hitting set algorithms for model-based               mization to speed up MAXSAT solving. In Proceed-
     diagnosis. In Proceedings DX ’11 Workshop, pages                     ings CP ’13, pages 247–262, 2013.
     100–105, 2011.
                                                                     [36] Alexey Ignatiev, Antonio Morgado, Vasco Man-
[19] Ulrich Junker. Preference-Based Search and Multi-
                                                                          quinho, Ines Lynce, and Joao Marques-Silva. Progres-
     Criteria Optimization. Annals of Operations Research,                sion in Maximum Satisfiability. In Proceedings ECAI
     130:75–115, 2004.                                                    ’14, pages 453–458, 2014.
[20] Ingo Pill and Thomas Quaritsch. Optimizations for
     the Boolean Approach to Computing Minimal Hitting
     Sets. In Proceedings ECAI ’12, pages 648–653, 2012.








   A Robust Alternative to Correlation Networks for Identifying Faulty Systems
                          Patrick Traxler¹ and Pablo Gómez² and Tanja Grill¹
                          ¹ Software Competence Center Hagenberg, Austria
                     e-mail: patrick.traxler@scch.at, tanja.grill@scch.at
          ² Institute of Applied Knowledge Processing, Johannes Kepler University, Linz, Austria
                                       e-mail: pablo.gomez@faw.jku.at

                         Abstract
     We study the situation in which many systems
     relate to each other. We show how to robustly
     learn relations between systems to conduct fault
     detection and identification (FDI), i.e. the goal is
     to identify the faulty systems. Towards this, we
     present a robust alternative to the sample correlation
     matrix and show how to randomly search in
     it for a structure appropriate for FDI. Our method
     applies to situations in which many systems can
     be faulty simultaneously and thus our method requires
     an appropriate degree of redundancy. We
     present experimental results with data arising in
     photovoltaics and supporting theoretical results.

[Figure 1, two panels: (a) Fitness matrix, a 6 × 6 grid with systems 1–6 on both axes; (b) Digraph on the nodes 1–6.]
Figure 1: Learning relations between 6 systems. We draw
an edge between two systems if there is a strong linear relation
between them. First, we compute the fitness matrix,
Fig. 1(a), our robust alternative to the sample correlation matrix.
Darker colors mean a stronger linear relation. Going from
Fig. 1(a) to 1(b) is a discretization step via thresholding. The
digraph is the input for conducting FDI.
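
The thresholding step from Fig. 1(a) to Fig. 1(b) can be illustrated with a minimal Python sketch. It assumes that smaller matrix entries mean a stronger linear relation, as for the fitness measure of Sec. 2.2. The function name `threshold_digraph`, the threshold `tau`, the toy matrix, and the edge direction (j is a neighbor of i when `fitness[j][i] <= tau`) are illustrative assumptions and are not taken from the paper.

```python
def threshold_digraph(fitness, tau):
    """Return neighbors[i] = all j != i with fitness[j][i] <= tau.

    The edge direction and the value of tau are assumptions for illustration.
    """
    n = len(fitness)
    neighbors = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i != j and fitness[j][i] <= tau:
                neighbors[i].append(j)
    return neighbors


# Toy 3-system matrix (made-up values): systems 0 and 1 fit each other well,
# system 2 is not linearly related to either of them.
F = [
    [0.00, 0.02, 0.90],
    [0.03, 0.00, 0.80],
    [0.85, 0.75, 0.00],
]
graph = threshold_digraph(F, tau=0.1)
```

In this toy case systems 0 and 1 become mutual neighbors while system 2 has no neighbors, so no estimate could be derived for it.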
1 Introduction
The increasing number of technical systems connected to
                                                                      ilar concepts. The concept that fits our needs are correlation
the Internet raises new challenges and possibilities in
                                                                      networks. A correlation network is some structure in the
diagnosis. Large amounts of data need to be processed and
                                                                      correlation matrix, e.g. a minimum spanning tree or a clus-
analyzed. Faults need to be detected and identified. Sys-
                                                                      tering. In our application we have n variables which rep-
tems exist in different configurations, e.g. two systems of
                                                                      resent the produced energy per photovoltaic system. Given
the same type that have different sets of sensors. Knowl-
                                                                      that a single system correlates strongly with enough other
edge about the system design is often incomplete. Data is
                                                                      systems, we use this information for FDI via applying a me-
often unavailable due to unreliable data connections. Be-
                                                                      dian.
sides these and other difficulties, the large amount of data
also opens new possibilities for diagnosis based on machine              We can also think of correlation networks as a method
learning.                                                             for knowledge discovery. It has been applied in areas such
   The idea of our approach is to conduct fault detection and         as biology [18; 10] and finance [12] to analyze gene co-
identification (FDI) by comparing data of similar systems.            expression and financial markets. In our situation, the first
We assume that we have data from machines, devices, or systems of a   step is to learn linear relations between systems. For learning
similar type and want to know whether some system is faulty and, if   we need historical data. A sample result of this step is
so, to identify the faulty systems. This situation may deviate        depicted in Fig. 1. In Fig. 1(a) the fitness matrix, our robust
from classic diagnosis problems in that we just have limited          alternative to the correlation matrix, is shown. It represents
information (e.g. sensor or control information) of system            the degree of linearity between any two systems. For FDI,
internals. Moreover, we may have incomplete knowledge                 the second step of our method, we work with the result as
about the system design. This makes manual system modeling            depicted in Fig. 1(b) and current data. In the example, we derive
hard or even impossible. The problem is then to compare               for each of the six systems an estimate $\hat{m}_i$ of its current
the limited information of the working systems (perhaps               value $y_i$ from its neighbors' current values, e.g. for system
only input-output information) to identify faulty                     1 we get an estimate from the current values of the systems
systems.                                                              2, 3, 4 and for system 5 from system 6. Finally, we test for a
   In this work we tackle one concrete problem of this kind.          fault by checking if $|\hat{m}_i - y_i|$ is large.
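
The second step, estimating each system's current value from its neighbors and testing the deviation, can be sketched in Python as follows. This is only an illustration under stated assumptions: each neighbor j is assumed to predict system i through a learned line `(intercept, slope)`, the estimate is taken as the median of these predictions, and the absolute `threshold` is a made-up parameter. The function and variable names are hypothetical.

```python
from statistics import median


def identify_faulty(current, neighbors, lines, threshold):
    """Flag system i as faulty if |m_hat_i - y_i| exceeds `threshold`.

    current[i]    : current value y_i of system i
    neighbors[i]  : systems j adjacent to i in the learned digraph
    lines[(i, j)] : assumed (intercept, slope) of the line predicting i from j
    """
    faulty = []
    for i, y_i in current.items():
        preds = [lines[(i, j)][0] + lines[(i, j)][1] * current[j]
                 for j in neighbors.get(i, [])]
        if not preds:
            continue  # no neighbors, so no estimate can be derived
        m_hat = median(preds)
        if abs(m_hat - y_i) > threshold:
            faulty.append(i)
    return faulty


# Toy example: four mutually related systems, system 4 produces too little.
systems = [1, 2, 3, 4]
current = {1: 10.0, 2: 10.2, 3: 9.9, 4: 4.0}
neighbors = {i: [j for j in systems if j != i] for i in systems}
lines = {(i, j): (0.0, 1.0) for i in systems for j in systems if i != j}
result = identify_faulty(current, neighbors, lines, threshold=2.0)
```

Because the median ignores a minority of deviating predictions, the single faulty neighbor does not distort the estimates of systems 1 to 3, and only system 4 is reported.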
It is motivated by photovoltaics. We describe it in more de-             The major difficulty we try to tackle with this approach
tail below. The problem that arises in our and other appli-           is the presence of many faults. Faults influence both the
cations is that not every two systems can be compared. We             learning problem and the FDI problem. Robustness is an
thus need to learn relations between systems.                         essential property of our algorithms. Our result can be seen
   There are different approaches to learn structure, e.g.            as a robust structure learning algorithm for the purpose of
learning Bayesian networks, Markov random fields, or sim-             FDI. Robustness is a preferable property of many learning






and estimation algorithms. However, the underlying opti-              1.2 Related Work
mization problems unlike their non-robust variants are often          Correlation networks have applications in biology and fi-
NP-hard. This is for example the case for computing robust            nance. See e.g. [12; 18; 10] and the references therein. In
and non-robust estimators for linear regression, e.g. Least           biology [18; 10], they are applied to study gene interactions.
Median of Squares versus Ordinary Least Squares [16]. We              The correlation matrix is the basis for clustering genes and
avoid NP-hardness by a careful modeling of our problem.               the identification of biologically significant clusters. In [18;
In particular, our algorithms are computationally efficient.          10], a scale-free network is derived via the concept of topo-
Under some conditions, FDI can be done in (almost) linear             logical overlap. Scale-free networks tend to have few nodes
time in the number of systems n.                                      (genes) with many neighbors, so-called hubs.
   To summarize our contributions, we introduce a novel al-              Correlation networks are primarily used for knowledge
ternative to the sample correlation matrix and present a first        discovery. In particular, concepts such as clusters, hubs, and
use of it to discover structure appropriate for general FDI           spanning trees are interpreted in the context of biology and
and in particular for identifying faulty photovoltaic systems         finance. In our work, we introduce a robust alternative to
(PV). Our method works in the presence of many faults. Our            correlation networks.
algorithms are computationally efficient. Our method incor-              Other structural approaches, i.e. approaches based on
porates a couple of techniques from machine learning and              graphical models, are based on Bayesian networks, Markov
statistics: (repeated) Theil-Sen estimation for robust simple         random fields and similar concepts. Gaussian Markov random
linear regression, trimming to obtain a robust fitness                fields are loosely related to correlation networks. Their
measure, randomized subset selection for improved running             structure is described by the precision matrix, the inverse
time, and a median mechanism to conduct FDI.                          covariance matrix (ch. 17.3, [9]).
   In Sec. 2 we discuss our method. In Sec. 3 we present                 Another structural approach is FDI in sensor networks [7;
experimental and theoretical results.                                 4; 19; 20]. The current approach [7; 4; 19] mainly deals with
                                                                      wireless sensor networks. The algorithms usually use the
1.1 Motivating Application: Identifying Faulty                        median for FDI, as we do. The difference is that FDI
    Photovoltaic Systems                                              in wireless sensor networks uses a geometric model similar
Faults influence the performance of photovoltaic systems              to interpolation methods. It requires the geographic location
(PV). PV systems produce less energy than possible if faults          of the sensors. It is assumed that two sensors close to each
occur. We can distinguish between two kinds of faults:                other have a similar value. This cannot be assumed in general.
faults caused by an exogenous event such as shading, (melting)        To overcome these problems of manual modeling, we
snow, and tree leaves covering solar modules, and faults              apply machine learning techniques.
caused by endogenous events such as module defects and                   Models for PV systems are compared in [14]. All these
degradation, defects at the power inverter, and string dis-           models require the plane-of-array irradiance. Fault de-
connections.                                                          tection of PV systems is the topic of e.g. [3; 8; 5; 2;
   We are going to detect faults by estimating the drop in            17]. Firth et al. [8] consider faults if the PV system gen-
produced energy. Most of the common faults result in such             erates no energy. Another type of fault occurs if the pan-
a drop. The particular problem is given by the sensor setup.          els are covered by snow, tree leaves, or something else.
We assume only that we know the produced energy and possibly,         In this case, we can observe a drop in energy. It is considered
but not necessarily, the area (e.g. the zip code) where the PV        e.g. in [5]. The fraction of panel area covered
system is located.                                                    is a crucial parameter. All these approaches [3; 8; 5; 2;
   We apply our method to PV system data. Difficulties in             17] require at least the knowledge of the plane-of-array irradiance,
the application are different system types and deployments            i.e. they require an installed irradiance sensor. We do
of systems, for example, different numbers of strings and             not make this assumption.
modules per string and differing orientation (north, west,               The median is common in fault detection and identification.
south, east) of the modules; see Fig. 2. Further difficulties are     One reason for this is its optimal breakdown
the lack of information due to the lack of sensors and incomplete     point [16]. We also make use of (repeated) Theil-Sen
data due to unreliable data connections. Faults occur frequently,     algorithms [6; 15] for learning. An ingredient of our
in particular exogenous faults during winter.                         fault identification algorithm is the algorithm for median
   The novelty of our work in the context of photovoltaics            selection [1] and an algorithm for generating uniform subsets
is that it works in an extremely restrictive (only power              efficiently (see e.g. the Fisher-Yates or Knuth shuffle
measurement) sensor setting. To the best of our knowledge, we         in [13]). In our algorithm analysis we derive bounds for a
are the first to consider this restrictive sensor setting. We         partial Cauchy-Vandermonde identity (pg. 20 in [11]).
only need to know the produced energy of a PV system.
There is also the implicit assumption, which is tested by             2 Method
the learning algorithm, that the systems are not too far from
each other so that we can observe them in similar working             2.1 Data Model for Incomplete Data
(environmental) conditions. Distances of a couple of kilometers       We have data from n systems and one data stream per system.
are possible. Systems which are very close to each                    A data stream for system $i \in \{1, \dots, n\}$ is given by a
other and have the same orientation, such as systems in a             set $N_i \subseteq \{1, \dots, N\}$ of available data and values $x_{i,t} \in \mathbb{R}$
solar power plant, yield the best results. Other approaches           with $t \in N_i$. We can think of the parameter t as discrete
assume the presence of plane-of-array irradiance sensors,             time. With $N_i$, we explicitly model data availability. Incomplete
which are mostly deployed for solar power plants. Irradiance          data is a common problem in our situation. Causes in
estimations via satellite imaging are usually not accurate            practice are unreliable data connections or unreliable sensors.
enough.                                                               We call $D := \{(x_{i,t})_{t \in N_i} : i \in \{1, \dots, n\}\}$ a data






set. Sets of historical and current data are the inputs to our               wide availability of efficient implementations of near-linear
algorithms.                                                                  time algorithms. There is also a variant of $T^{TS}$, called the
                                                                             repeated Theil-Sen estimator, which has a breakdown point
2.2 Fitness Matrix – Definition and Robustness                               of 0.5, but less efficient implementations. The concrete definition
The fitness matrix is intended as a robust replacement for                   of $T^{TS}$ can be found e.g. in [6]. It is however not
the sample correlation matrix. The sample correlation coefficient,           important for our application, only its robustness property
like the sample covariance, is well known to be                              and the existence of efficient implementations are.
sensitive to faults (outliers) [16]. As an example, we generated                To define the breakdown point of a fitness matrix, let f
the data for Fig. 1 with faults. The non-robust sample                       be a real-valued function defined on any finite data set. We
correlation matrix would have yielded a digraph without edges                define the fitness matrix as
instead of the digraph in Fig. 1(b).
   A fault can be an arbitrary corruption of a single data item                        $$F_{ji} := f(Z_{i,j})$$
$x_{i,t}$. That is, $x_{i,t} = \tilde{x}_{i,t} + \Delta$, $\Delta \neq 0$, where $\Delta$ is the fault.   and its breakdown point as
We think of $\tilde{x}_{i,t}$ as the actual or true but unobserved value.
   We do not make any assumptions on faults themselves                                 $$\varepsilon^*(F) := \min_{i,j} \varepsilon^*(f, Z_{i,j}).$$
but only on their number. This is at the core of the definition of
the breakdown point. This statistical concept is defined for                 Next, we provide the fitness matrix we are going to use. It
a particular estimation or learning problem, in our case for                 has the property that $F_{ji}$ is close to zero if $x_i$ and $x_j$ are
simple linear regression.                                                    strongly linearly related and it has a high breakdown point.
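
As an illustration of the data model and of the generic definition $F_{ji} := f(Z_{i,j})$, the following Python sketch stores each data stream as a mapping from available time indices to values and builds $Z_{i,j}$ from the time indices that both streams share. The function names, the toy data set, and the choice to leave out the diagonal are assumptions made for illustration; `len` is used only as a placeholder for f.

```python
def paired_data(stream_i, stream_j):
    """Z_{i,j}: pairs (x_{i,t}, x_{j,t}) for every t available in both streams."""
    common = sorted(stream_i.keys() & stream_j.keys())
    return [(stream_i[t], stream_j[t]) for t in common]


def fitness_matrix(data, f):
    """F[j][i] = f(Z_{i,j}) for all ordered pairs i != j (diagonal omitted)."""
    systems = sorted(data)
    return {j: {i: f(paired_data(data[i], data[j])) for i in systems if i != j}
            for j in systems}


# Toy data set D: stream i maps each available time index t in N_i to x_{i,t}.
D = {
    1: {1: 5.0, 2: 6.0, 4: 8.0},           # t = 3 is missing for system 1
    2: {1: 2.5, 2: 3.0, 3: 3.5, 4: 4.0},
}
Z_12 = paired_data(D[1], D[2])
F = fitness_matrix(D, f=len)  # placeholder f: number of usable pairs
```

Only the three time indices shared by both streams contribute to `Z_12`, which mirrors the intersection $N_i \cap N_j$ in the definition of $Z_{i,j}$.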
   Linear regression is closely related to the correlation coefficient.        Let $y_t := x_{i,t}$ and $\hat{y}_t := x_{j,t} \cdot \hat{\theta}_2 + \hat{\theta}_1$, $t \in N_i \cap N_j$, for
For simple linear regression – a regression model                            the Theil-Sen estimate $\hat{\theta}$ of $Z_{i,j}$. Let $r_t := \hat{y}_t - y_t$ be the
with one independent and one dependent variable – the                        residuals. And let $i_1, \dots, i_k \in N_i \cap N_j$ with $k := |N_i \cap N_j|$
correlation coefficient can be seen as a fitness measure of                  be such that $|r_{i_1}| \le \dots \le |r_{i_k}|$. We define
the line which fits the data best w.r.t. vertical squared distances.
                                                                             $$f^{TS}(Z_{i,j}) := \frac{1}{\sum_{t=1}^{\lfloor k/\sqrt{2} \rfloor} |y_{i_t}|} \cdot \sum_{t=1}^{\lfloor k/\sqrt{2} \rfloor} |r_{i_t}|. \qquad (1)$$
See e.g. [16]. However, the corresponding estimator,
namely $\ell_2$-regression a.k.a. ordinary least squares, is known
to be sensitive to outliers [16]. On the other hand, there are
estimators for simple linear regression which are robust to
a large number of faults, i.e. they have a large breakdown                     We define $F^{TS}$ w.r.t. $f^{TS}$, i.e.
point.                                                                                 $$F^{TS}_{ji} := f^{TS}(Z_{i,j}).$$
   The idea underlying the fitness matrix is thus to replace
the correlation coefficient (and $\ell_2$-regression) by a robust
notion of fitness based on robust linear regression. In the remainder        Note that the sum goes from 1 up to $\lfloor k/\sqrt{2} \rfloor$. This trimming
of this section we recall the definition of the breakdown                    together with the high breakdown point of Theil-Sen
point following [16], pg. 9, and we are going to formalize                   directly implies the following result.
the notion of fitness matrices.                                              Theorem 1. It holds that $\varepsilon^*(F^{TS}) \ge 1 - \frac{1}{\sqrt{2}}$.
   We define the breakdown only for simple linear regres-
                                                                                Finally, we compare the sample correlation matrix and the
sion. We fix two systems i, j ∈ {1, . . . , n} and define
                                                                             fitness matrix. Let C denote the sample correlation matrix
Z := Zi,j := {(xi,t , xj,t ) : t ∈ Ni ∩ Nj }. Let T be a
                                                                             and define Cji0
                                                                                              = 1 − |Cji |. Both matrices have the property
regression estimator, i.e. T (Zi,j ) = θ̂ ∈ R2 is the intercept              that if some entry is close to 0 then xi and xj have a strong
and slope for the data set Zi,j . For Z, we define Z 0 as Z                  linear relation. It is guaranteed that Cji
                                                                                                                     0
                                                                                                                        is at most 1. A value
with m data points arbitrarily corrupted. Define
                                                                             close to 1 means a weak linear relation. For Fji    TS
                                                                                                                                    , it is not
           bias(m; T, Z) := sup kT (Z) − T (Z 0 )k.
                                Z0                                           guaranteed that Fji  TS
                                                                                                      ≤ 1, but experimental results suggest
If bias(m; T, Z) is infinite, then m faults (outliers) have an               that it is usually the case. We also note that both matrices
arbitrarily large effect on the estimate T (Z 0 ). The (finite               obey a weak form of the triangle inequality since if xi and
sample) breakdown point of T and Z is defined as                             xj correlate strongly and xj and xk correlate strongly, then
                                                                           also xi and xk correlate.
                              m
        ∗
       ε (T, Z) := min           : bias(m; T, Z) = ∞ .                          There are two important benefits of fitness matrices over
                             |Z|                                             correlation matrices. They are robust and are also defined
   To explain this notion, we consider four typical examples.                for incomplete data. On the negative side, the fitness matrix
The breakdown point ε∗ (T`2 , Z) is 1/n for `2 -regression.                  is not positive semi-definite, in particular not symmetric.
This holds for any Z. The situation is different for `1 -
regression in that ε∗ (T`1 , Z) = 1/n for some Z.                            2.3 Structure in Fitness Matrices – Algorithm
   In this work we are going to use the Theil-Sen estimator1                     LEARN and IDENTIFY
T a.k.a. median slope selection. The reason is its break-
  TS
                                                                             We want to identify faulty systems. In a first step, we learn
down point of at least 1 − √12 ≥ 0.292 (see e.g. [6]) and the                a structure appropriate for FDI; see algorithm LEARN. We
    1                                                                        obtain it via thresholding the fitness matrix. Most of the
      There is a subtle issue here we have to deal with. Regres-             correlation networks, i.e. structures arising from the sample
sion problems are optimization problems. The solution to the con-
crete optimization problem does not need to be unique. In our
                                                                             correlation matrix, are obtained in this way [12; 18; 10].
situation, intercept and slope are unique for `2 -regression but not         We denote the threshold by θ ≥ 0 and the threshold fitness
for `1 -regression. The estimator T`1 is however unique for a (de-           matrix as                 
terministic) algorithm solving the optimization problem. We thus                                         Fji if Fji ≤ θ
                                                                                             Fji;θ :=                      .
think of T`1 as the output of a particular (deterministic) algorithm.                                    0     if Fji > θ
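The trimmed fitness of Eq. 1 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the text: the function names `theil_sen` and `trimmed_fitness` are hypothetical, $r_t$ is read as the residual of the Theil-Sen fit and $y_t$ as the value of the dependent variable (their definition precedes this passage), and the slope is computed naively over all pairs in quadratic time rather than with the faster algorithm cited as [6].

```python
from itertools import combinations
from math import floor, sqrt
from statistics import median


def theil_sen(xs, ys):
    """Naive Theil-Sen fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (y2 - y1) / (x2 - x1)
        for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
        if x2 != x1  # skip vertical pairs, whose slope is undefined
    ]
    b = median(slopes)
    a = median(y - b * x for x, y in zip(xs, ys))
    return a, b


def trimmed_fitness(xs, ys):
    """Sketch of Eq. 1: sum of the smallest floor(k / sqrt(2)) absolute
    residuals, normalised by the sum of |y| over the same data points
    (assumed reading of r and y; the sum of |y| is assumed to be nonzero)."""
    a, b = theil_sen(xs, ys)
    # Sort data points by absolute residual, keeping |y| alongside.
    pairs = sorted((abs(y - (a + b * x)), abs(y)) for x, y in zip(xs, ys))
    h = floor(len(pairs) / sqrt(2))
    numerator = sum(r for r, _ in pairs[:h])
    denominator = sum(v for _, v in pairs[:h])
    return numerator / denominator


# Ten points on the line y = 2x + 1, two of them grossly corrupted.
xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]
ys[3], ys[7] = 100.0, -50.0
print(theil_sen(xs, ys), trimmed_fitness(xs, ys))
```

In this toy data set the two corrupted points fall outside the trimmed sum, so they influence neither the fit nor the fitness value; with ordinary least squares both would be affected.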






Algorithm 1 Algorithm LEARN with input $D$ (data set) and parameter $\theta$ (fitness threshold). Output is a digraph $G$ with edge labels (intercept, slope) representing the threshold fitness matrix.
  Let $G = (V, E)$ be a digraph with $V = \{1, \dots, n\}$ and $E = \{\}$.
  for all $i \in V$ and $j \in V \setminus \{i\}$ do
      Learn (Theil-Sen) the intercept $a_{j,i}$ and slope $b_{j,i}$ between $x_i$ (dependent variable) and $x_j$ (independent variable).
      Compute the trimmed fitness $f = f^{TS}$ (Eq. 1) of $Z_{i,j}$.
      if $f \le \theta$ then
          Add to $G$ the directed edge from $j$ to $i$ with edge labels $(a_{j,i}, b_{j,i})$.
      end if
  end for

The input to algorithm LEARN is a data set $D$ as described in Sec. 2.1. It outputs a digraph $G = (V, E)$, i.e. the (possibly sparse) threshold fitness matrix $F^{TS}_{\theta}$. Additionally, intercept and slope of the simple linear regressions are added as edge labels.

Algorithm 2 Algorithm IDENTIFY with input $G$ (digraph with edge labels), current data $y_i$ for the $i$-th system, and parameters $k$ and $s$ (deviation). It outputs the set of all faulty systems $H$.
  Set $H = \{\}$.
  for all $i \in V = \{1, \dots, n\}$ do
      Let $N_i^- := \{j \in V : (j, i) \in E\}$.
      if $|N_i^-| = 0$ then
          Continue with the next (system) $i$.
      end if
      if $|N_i^-| \ge k$ then
          Select uniformly at random a $k$-element subset $S$ from $N_i^-$.
      else
          Set $S := N_i^-$.
      end if
      Compute $M_i := \{\hat{y}_j = b_{j,i} \cdot y_j + a_{j,i} : j \in S\}$.
      Compute the median $\hat{m}_i$ of $M_i$.
      Add $i$ to $H$ if $|\hat{m}_i - y_i| > s$.
  end for
  Output $H$.

   In the second step, we identify the faulty systems; see algorithm IDENTIFY. Its input is the result of algorithm LEARN. Algorithm IDENTIFY constructs a random digraph of in-degree at most $k$ for FDI. It works as follows. Independently for every system, we choose uniformly at random at most $k$ of its neighbors in the digraph $G$ and compute the median $\hat{m}_i$ of estimated values derived from the selected neighbors' values. We compare the median $\hat{m}_i$ to the current system value and decide whether it has a fault or not via the deviation parameter $s$.
   We discuss the threshold parameter $\theta$ and the deviation parameter $s$ in Sec. 3.1. They essentially depend on the variance in the data set $D$. Parameter $k$ in algorithm IDENTIFY has the purpose of improving running time efficiency. In particular, we have the following result.

Theorem 2. Let $D$ be a data set with $n$ systems and let $m := \max_i |N_i|$. The running time of LEARN is $O(n^2 \cdot m \cdot \log(m))$. The running time of IDENTIFY is $O(k \cdot n)$.

Proof. LEARN. There are $O(n^2)$ pairs of systems. The Theil-Sen estimator can be computed in time $O(m \log(m))$ [6]. The computation of $f^{TS}$, Eq. 1, is done via sorting and thus takes time $O(m \log(m))$.
   IDENTIFY. Assume $|N_i^-| \ge k$. We uniformly at random choose a $k$-element subset out of $N_i^-$ and compute the median. For random selection we can use for example the Fisher-Yates (or Knuth) shuffle [13] which runs in time $O(k)$ and for median selection the algorithm in [1] which also runs in time $O(k)$. The second case, $1 \le |N_i^-| \le k - 1$, is analogous. This shows that the overall running time of IDENTIFY is $O(kn)$.

   In Sec. 3.2, we provide some sufficient conditions under which IDENTIFY works correctly even if $k = O(\log(n))$. This is a strong running time improvement from $O(n^2)$ to $O(n \cdot \log(n))$.

3 Results

3.1 Experimental

In this section we are going to discuss how to apply our method, Sec. 2, to photovoltaic data. In particular, it remains to discuss how the use-case fits the model, more precisely, why there is strong correlation between PV systems. Finally, we present experimental results to verify the estimation and fault identification quality of our algorithms.

Use-Case Photovoltaics
A simple system model for PV systems is as follows:

\[ P_i = c_i \cdot I_i. \]

Here, $P_i$ is the power, $I_i$ the plane-of-array (POA) irradiance of the $i$-th system, and $c_i$ a constant of the system which can be interpreted as the efficiency of converting solar energy into electrical energy. More complex physical models include system variables such as the module temperature [14; 17]. Our considerations translate to the more complex models as long as they are time-independent. We also note that these models are more accurate, but only slightly, since the POA-irradiance has the most critical influence on the produced energy.
   We get from the above considerations that $P_i = c'_{ij} \cdot P_j$ given that $I_i = I_j$. In our situation we cannot test the condition $I_i = I_j$ since we do not know the POA-irradiance, but $I_i \approx I_j$ holds if the systems operate under similar weather conditions and have a similar orientation. The former holds if the systems are close to each other. To reduce the effect of different orientations, see Fig. 2, we consider the following model: $P_i^{\Delta} = u_{ij} \cdot P_j^{\Delta} + v_{ij}$. The variable $P_i^{\Delta}$ is the power within a time interval $\Delta$, usually one hour. The variables $u_{ij}$ and $v_{ij}$ are the unknowns.
   In more general words, let $Y_i$ be the output of the $i$-th system and let $X_i$ describe the system input and system internals. Our model assumption is that for a reasonable number of system pairs $(i, j)$, the system outputs $Y_i$ and $Y_j$ are linearly related given that $X_i \approx X_j$. By the above considerations, it is plausible that PV systems fulfill these requirements.
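To make the interplay of the linear model and algorithm IDENTIFY concrete, the following is a minimal Python sketch of one pass of IDENTIFY over current system values. The function name `identify`, the dictionary layout of `in_edges`, and the parameter `rel_s` are hypothetical choices for illustration. The deviation is taken relative to the median, as in the parameter choice $s = 0.25 \cdot |\hat{m}_i|$ reported for the experiments in Sec. 3.1; the toy values are invented and are not measurements.

```python
import random
from statistics import median


def identify(in_edges, y, k, rel_s, rng=random):
    """One pass of IDENTIFY (sketch).

    in_edges[i] maps each in-neighbor j of system i to its edge label
    (a_ji, b_ji), i.e. intercept and slope of the model y_i ~ b_ji * y_j + a_ji.
    y[i] is the current value of system i. Returns the set of flagged systems.
    """
    faulty = set()
    for i, neighbors in in_edges.items():
        if not neighbors:
            continue  # no neighbors, no estimate for this system
        candidates = list(neighbors)
        # Use a uniformly random k-subset if there are at least k neighbors.
        chosen = rng.sample(candidates, k) if len(candidates) >= k else candidates
        estimates = [neighbors[j][1] * y[j] + neighbors[j][0] for j in chosen]
        m_hat = median(estimates)
        s = rel_s * abs(m_hat)  # relative deviation threshold (assumption)
        if abs(m_hat - y[i]) > s:
            faulty.add(i)
    return faulty


# Toy example: five systems with identity models (intercept 0, slope 1);
# system 4 delivers 40% less than its neighbors suggest.
edges = {i: {j: (0.0, 1.0) for j in range(5) if j != i} for i in range(5)}
values = {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 60.0}
print(identify(edges, values, k=3, rel_s=0.25))
```

Because the median is used, a single faulty neighbor among the three sampled ones does not distort the estimate for the healthy systems in this example.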








[Plots for Figures 2 and 3: Power [Watt] over Time. The curves in the plot for Figure 3 are labeled POA-irradiance, Estimate, and Real.]




Figure 2: Four power curves of a sunny day in August, data set $D_K$. Two PV systems have their maximum power peak before and the other two after 13:00. They have different orientations, i.e. they produce more energy in the morning or evening.

Figure 3: A faulty system. The real power curve of observed values shows a fault from roughly 11:00 to 14:30. The estimated values are considerably higher during this period. The PV system has a plane-of-array irradiance sensor installed. A cross check with its power curve reveals that the fault was detected correctly.

   We next describe our experimental setup to verify it with real data.

Experimental Setup
To demonstrate our method, we use two data sets $D_A$ and $D_K$. $D_A$ arises from 13 systems from a solar park located in Arizona$^2$. The PV systems there are geographically close. We use data for one year. $D_K$ arises from 40 systems spread across a typical municipality located in Austria, i.e. the systems can be up to some kilometers apart. Their orientation can differ significantly. Some systems may be oriented to the west, others to the east. We have data for almost a year.

$^2$ http://uapv.physics.arizona.edu/

   A system is faulty if it produces considerably less energy than estimated; see Fig. 3. This definition is motivated by the fact that most faults imply a drop in energy. The difficulty in setting up an experiment is that we do not know in advance whether a PV system is faulty, i.e. we do not have labeled data. We thus design our experiment as follows: We verify the accuracy of the energy estimation, namely the relative deviation $|\hat{m}_i - y_i| / |\hat{m}_i|$ for every system $i$ and over the period of a week, $\hat{m}_i$ and $y_i$ as in algorithm IDENTIFY.
   This relative deviation is noted in column Hour of Table 1 for the time period 12:00 to 13:00. In column Day of Table 1 we note the same but for a whole day, i.e. $\hat{m}_i$ is the estimated energy (power) for the whole day calculated from the hourly estimates and $y_i$ the actual energy for the whole day. For the whole day we consider the time period from 9:00 to 16:00.
   The number $|\hat{m}_i - y_i| / |\hat{m}_i|$ can be read as some relative deviation, i.e. the estimation is $100 \cdot x\%$ away from the true value where $x$ is some entry in the column Hour and Day. The entry $x$ is an average over all systems and 7 days. The first day is noted in column Start.
   Algorithm LEARN is executed once for every week and with $\theta = 0.8$ and roughly three months of historical data, e.g. for the months January, February, and March to get estimates for the days April, 1. to April, 7. Algorithm IDENTIFY is executed with $s = 0.25 \cdot |\hat{m}_i|$ and $k = 11$ for both data sets. The choice of parameters $\theta$ and $s$ depends on the variance of the input data; they were chosen manually so as to get a reasonable number of good estimates. The same holds for $k$. The difficulty in choosing the parameters is that decreasing $\theta$, i.e. demanding a stronger linear relation, will usually reduce the number of neighbors. For a reasonable number of good estimates we need both: A strong linear relation of a system to its neighbors and enough neighbors. The parameters were chosen accordingly. For parameter $k$, we derive a theoretical result in Sec. 3.2 which says that $k = O(\log(n))$ is a good choice for $n$ the number of systems.

Experimental Results
The false positive rate (FPR), the false negative rate (FNR), and the estimation accuracy are the most interesting numbers for us. As remarked above, we do not have labeled data. The faults as recorded in Table 1 are faults as detected by our algorithm.
   We make a worst case assumption, namely that all detected faults are false positives. This yields an FPR of at most 0% to 5% per 7 day period (rows in the table). To get an understanding of FNR, we simulated faults by subtracting 33% of energy. The FNR in this case is at most 10% per 7 day period. In the rows Sum and Sum−33% in Table 1 we summed up the faults to get the FPR and FNR for the whole






data sets.

(a) Results for $D_A$.
    Start        Hour     Faults       Day      Faults
    April, 1.    0.058    2/77         0.037    1/77
    May, 1.      0.040    2/77         0.014    0/77
    June, 1.     0.019    0/91         0.021    0/91
    July, 1.     0.068    7/88         0.071    7/88
    Aug., 1.     0.362    12/91        0.250    7/91
    Sept., 1.    0.096    4/65         0.034    1/78
    Oct., 1.     0.019    0/84         0.016    0/84
    Nov., 1.     0.039    0/72         0.025    0/72
    Dec., 1.     0.135    10/84        0.130    7/84
    Sum                   37/729                23/742
    Sum−33%               673/729               682/742

(b) Results for $D_K$.
    Start        Hour     Faults       Day      Faults
    June, 1.     0.056    7/269        0.068    11/273
    June, 15.    0.055    7/238        0.097    17/258
    July, 1.     0.077    7/267        0.068    9/280
    July, 15.    0.025    0/267        0.044    6/280
    Aug., 1.     0.037    2/279        0.030    3/280
    Aug., 15.    0.031    1/280        0.032    0/280
    Sept., 1.    0.040    0/280        0.033    0/280
    Sept., 15.   0.092    20/280       0.056    0/280
    Sum                   42/2160               46/2211
    Sum−33%               1960/2154             2033/2207

Table 1: The values in column Hour and Day contain the relative deviation $|\hat{m}_i - y_i| / |\hat{m}_i|$, $\hat{m}_i$ and $y_i$ as in algorithm IDENTIFY. They are averages over all systems and the period of a week. Column Start contains the start date of the 7 day period. The two columns labeled Faults contain the number of (possibly falsely detected) faults relative to the total number of analyzed hours and days, respectively. The rows Sum contain the summed up number of faults, once for the actual data sets and then with a simulated fault of 33% less energy.

   The interpretation of these results is as follows. Setting the parameter $s$ to $0.25 \cdot |\hat{m}_i|$ means that we define a fault as a 25% relative deviation of the observed produced energy from its true value. Setting $s$ to this value yields the above mentioned FPR. Simulating a 33% drop in energy, which corresponds naturally to the faults we want to detect, yields the above FNR.
   For the data set $D_A$ we have knowledge about the POA-irradiance. We can thus cross-check with the irradiance to check if faulty systems were identified correctly; see Fig. 3. This manual inspection suggests that the FPR is much smaller than 5%, close to 1%. Furthermore, increasing the drop implies a decreasing FNR, i.e. stronger energy drops are easier to identify.
   Depending on the application, these rates may be considered appropriate or not. In some applications, we may want to detect faults which yield a drop in energy of less than 25%. This worsens the FPR and FNR. On the other side, if we want to improve the FPR and FNR, we may have to specify a fault as a drop in energy of 50%. In other words, our parameter setting is one out of many reasonable parameter settings.

3.2 Theoretical

We argued in Sec. 3.1 that algorithm LEARN yields good estimates for the system's current value. For an estimate to be good, the neighboring system $j$ in $G$ of system $i$ needs to work correctly. Moreover, the regression estimates, the intercept and slope, need to be accurate enough. In this section, we provide a supporting theoretical result which says that, if enough estimates are good, algorithm IDENTIFY correctly identifies all faulty systems.
   The input to IDENTIFY is a digraph $G = (V, E)$ with edge labels. Let $y_i$ be the current value of system $i$. Let $y_i = \tilde{y}_i + \Delta_i$. We think of $\tilde{y}_i$ as the true value. We say that system $i$ is correct if $\Delta_i = 0$ and faulty otherwise.
   The input to IDENTIFY has to satisfy two conditions, Eq. 2 and 3, for IDENTIFY to work correctly. These conditions state that there are more good than bad estimates. We formulate them below.

Theorem 3. Let $0 < p < 1$ and $s > 0$. Let $H := \{i \in \{1, \dots, n\} : |\Delta_i| > 2s\}$. Assume that the input digraph $G$ satisfies Eq. 2 and 3. Then, algorithm IDENTIFY outputs $H$ with probability at least $1 - p$.

   Let $\hat{y}_j$ be the estimates as computed in IDENTIFY. Fix a system $i$ and let $j \in N_i^-$. We say that $\hat{y}_j$ is $s$-good for system $i$ if $|\tilde{y}_i - \hat{y}_j| \le s$. Let $A_i := \{j \in N_i^- : |\tilde{y}_i - \hat{y}_j| \le s\}$ be the $s$-good estimates for system $i$. Condition 2 is as follows: For every system $i$ with $1 \le |N_i^-| \le k - 1$ it holds that

\[ |A_i| > \frac{|N_i^-|}{2}, \tag{2} \]

i.e. there are more good than bad estimates. For the case that $|N_i^-| \ge k$ we assume

\[ |A_i| > \left( 1 - \frac{1}{c_{n,p,k}} \right) \cdot |N_i^-|, \tag{3} \]

with $c_{n,p,k} := \left( \frac{n}{p} \cdot 18^k \right)^{2/(k-1)}$. Setting $k = \Omega(\log(\frac{n}{p}))$ makes $c_{n,p,k}$ larger than some constant independent of $n$ and $p$. This is the most reasonable setting as it implies that a constant fraction of estimates can be bad and IDENTIFY still identifies the faulty systems correctly. We remark that the asymptotic analysis which yields $c_{n,p,k}$ is not optimal. In particular, it seems that the factor $18^k$ is not optimal and may be improved to a factor as small as $2^{k/2}$. For practical applications, the following heuristic seems reasonable: For $n$ systems and a failure probability $p$ of IDENTIFY, set $k$ to $10 \cdot \log(\frac{n}{p})$.

3.3 Proof of Theorem 3

We apply the following lemma with $A = A_i$ and $M = N_i^-$. It directly gives us the probability that IDENTIFY correctly identifies the faulty systems since the median works correctly if $|S \cap A_i| > |S \cap (N_i^- \setminus A_i)|$, where $S$ is the (random) set chosen in IDENTIFY.

Lemma 1. Let $M$ be a finite set and $A \subseteq M$. Let $k \ge 2$ be an integer. Let $S \subseteq M$ be a $k$-element subset selected uniformly at random. Then

\[ \Pr_S\big( |S \cap A| > |S \cap (M \setminus A)| \big) \ge 1 - 18^k \left( \frac{|M \setminus A|}{|M|} \right)^{\lfloor k/2 \rfloor}. \]
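As a numerical sanity check of Lemma 1, the exact probability that the good elements form a strict majority of a random $k$-subset can be computed from binomial coefficients and compared with the stated lower bound. The sketch below is illustrative only; the function names `exact_majority_prob` and `lemma_bound` and the example sizes are assumptions, not part of the paper.

```python
from math import comb


def exact_majority_prob(m, a, k):
    """Exact Pr(|S ∩ A| > |S ∩ (M minus A)|) for a uniformly random
    k-element subset S of M, where |M| = m and |A| = a."""
    r = m - a  # number of elements outside A
    # A strict majority needs i elements from A with i > k - i.
    favorable = sum(
        comb(a, i) * comb(r, k - i) for i in range(k // 2 + 1, k + 1)
    )
    return favorable / comb(m, k)


def lemma_bound(m, a, k):
    """Lower bound of Lemma 1: 1 - 18^k * (|M minus A| / |M|)^floor(k/2)."""
    return 1 - 18**k * ((m - a) / m) ** (k // 2)


# Example sizes (assumed): 100000 elements, 10 of them bad, k = 4.
print(exact_majority_prob(100_000, 99_990, 4), lemma_bound(100_000, 99_990, 4))
```

The bound is informative only when the fraction of bad elements is very small relative to $18^{-2}$; for larger fractions the right-hand side becomes negative and the statement holds trivially, which matches the remark above that the factor $18^k$ is probably not optimal.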






Proof. Let $M := \{1, \dots, m\}$, $F := M \setminus A$, and $r := |F|$. First, we are going to bound the number of $k$-element subsets $S \subseteq M$ for which $|S \cap A| \le k'$ with $k' = \lfloor k/2 \rfloor$. The exact number of these sets is

\[ \sum_{i=0}^{k'} \binom{m-r}{i} \binom{r}{k-i} \tag{4} \]

since there are $\binom{|A|}{i}$ ways to choose an $i$-element subset from $A$ and $\binom{|F|}{k-i}$ ways to choose from $F$ for the remaining $k - i$ elements.
   Note that $|S \cap A| > |S \cap F|$ iff $|S \cap A| > \lfloor |S|/2 \rfloor = k'$. Moreover, we can assume that $r = |F| \ge 1$ since the claim holds for $r = 0$. To provide a lower bound for the probability of this event we show an upper bound on the complementary event, i.e. $|S \cap A| \le k'$. First, we derive an upper bound for Eq. 4 using

\[ \left( \frac{m}{k} \right)^k \le \binom{m}{k} \le \left( \frac{me}{k} \right)^k \tag{5} \]

for $e = 2.718\dots$ and $1 \le k \le m$. (See e.g. pg. 12 in [11].) Since this inequality holds just for $k \ge 1$ we rewrite Eq. 4 as

\[ \binom{r}{k} + \sum_{i=1}^{k'} \binom{m-r}{i} \binom{r}{k-i}. \tag{6} \]

It holds that $\binom{r}{k} \le \left( \frac{re}{k} \right)^k$ and for the second term in Eq. 6

\[ \sum_{i=1}^{k'} \binom{m-r}{i} \binom{r}{k-i} \le \sum_{i=1}^{k'} \left( \frac{(m-r)e}{i} \right)^{i} \left( \frac{re}{k-i} \right)^{k-i} = (re)^k \sum_{i=1}^{k'} \left( \frac{m-r}{r} \right)^{i} \left( \frac{k-i}{i} \right)^{i} \left( \frac{1}{k-i} \right)^{k}. \]

Next, we prove the upper bound on the probability $p$ that $|S \cap A| \le k'$. We select uniformly at random a k-element

Proof of Theorem 3. We show that the success probability of IDENTIFY is at least $1 - p$. Let $p' := \frac{p}{n}$. We show that for every $i \in V$, $G = (V, E)$, the success probability of a single iteration in the loop of IDENTIFY is at least $1 - p'$. This implies the above claim since $(1 - p')^n \ge 1 - p'n$ by e.g. the Binomial Theorem.
   Fix some $i \in V$, i.e. we consider one iteration in the loop of IDENTIFY. We apply Lemma 1. Let us assume that $|S \cap A_i| > |S \cap (N_i^- \setminus A_i)|$, $S$ the random $k$-element subset as in IDENTIFY and $A_i$ the good estimates as defined above. Since $|S \cap A_i| > |S \cap (N_i^- \setminus A_i)|$, it follows that $|\hat{m}_i - \tilde{y}_i| \le s$ for the median $\hat{m}_i$ as computed in IDENTIFY and $y_i = \tilde{y}_i + \Delta_i$.
   Assume $\Delta_i = 0$, i.e. system $i$ works correctly. Then, $|\hat{m}_i - y_i| = |\hat{m}_i - \tilde{y}_i| \le s$. Thus, $i$ is not output.
   Assume $\Delta_i \ne 0$, i.e. system $i$ is faulty. Here, $|\hat{m}_i - y_i| = |\hat{m}_i - \tilde{y}_i - \Delta_i|$. It follows from $|\hat{m}_i - \tilde{y}_i| \le s$ and $|\Delta_i| > 2s$ that $|\hat{m}_i - y_i| > s$. Thus, $i$ is output.
   Finally, we want the probability of failure for a single step to be at most $\frac{p}{n}$. By Lemma 1, it suffices that $18^k \alpha^{\lfloor k/2 \rfloor} \le 18^k \alpha^{(k-1)/2} \le \frac{p}{n}$ with $\alpha := \frac{|N_i^- \setminus A_i|}{|N_i^-|}$. With $c = c_{n,p,k} := \left( \frac{n}{p} \cdot 18^k \right)^{2/(k-1)}$, this reads $c \cdot |N_i^- \setminus A_i| \le |N_i^-|$ and thus $(1 - 1/c) \cdot |N_i^-| \le |A_i|$.

4 Conclusions and Open Problems

We presented a method for learning structure to identify faulty systems. The basic method of correlation networks has found many applications in biology and finance. In our application, the presence of many faults required the design and analysis of robust algorithms. We provided an experimental analysis of our algorithms to verify their estimation and fault identification quality. We also provided a supporting theoretical result which allowed us to considerably improve the running time of algorithm IDENTIFY.
   Improving the running time of LEARN remains as an
                                      −1
subset of M . Its probability is m        . We multiply Eq. 6                open problem. It is not directly clear that it is necessary
                                    k
        
      m −1                                                                   to compare every two systems. The reason is that if systems
with k       and get two parts p1 + p2 ≥ p. For the first part               (i, j) and (j, k) correlate strongly, then also (i, k) correlate,
                       −1
p1 ≤ ( m ) since m
       re k
                     k      ≤ (k/m)k . For the second part p2 ,              but not necessarily strongly. Thus, it may not be necessary
we use r ≤ r , ((k−i)/i)i ≤ 2k , and (k/(k−i))k ≤ 2k .
        m−r      m                                                           to solve a simple linear regression problem for every system
The latter since i ≤ k 0 . We get an upper for the second part:              pair.
          k X                                                                 In other applications it may be useful to solve a general
                    k0         i         i       k
            re             m−r        k−i         k                          linear regression problem instead of a simple linear regres-
   p2 ≤                                                  ≤                   sion, e.g. if our model depends on more than one variable
            m               r           i        k−i
                   i=1                                                       per system. The corresponding correlation networks are
                                                k X
                                                    k0        i            based on the partial correlation coefficient [12]. Since ro-
                                           12r             m
                                                                    .        bust estimators for general linear regression are based on re-
                                            m      i=1
                                                           r                 gression problems which are NP-hard, it remains as an open
                                                               0             problem to find a robust alternative to partial correlation net-
An upper bound for the geometric sum is k 0 (m/r)k . In                      works that can be computed efficiently.
total                                                                           Finally, to put our method and results into a broader con-
                     k             k  k
                      re          12r       m                                text, we approached the problem of FDI via learning graph-
      p ≤ p1 + p2 ≤        + k0                 .                            ical models. It seems to be a challenge to learn classical
                      m            m         r
                                                                             component-models of technical systems to conduct diagno-
Substituting k−1
              2 for k and further simplification yields
                     0
                                                                             sis. In this work we were able to close the gap between
                     (k−1)/2            (k−1)/2                          (structure) learning on the one side and FDI on the other
        (k + 2)(12)k r                      r                                side for a concrete problem setting.
  p≤                              ≤ 18k                 .
              2        m                   m
The latter since ((k + 2)/2)1/k ≤ 1.5 for k ≥ 3. We have                     References
thus a lower bound for the probability 1 − p and the claim                   [1] Manuel Blum, Robert W. Floyd, Vaughan Pratt,
follows.                                                                         Ronald L. Rivest, and Robert E. Tarjan. Linear time






    bounds for median computations. In Proc. of the 4th Annual ACM Symposium on Theory of Computing, pages 119–124, 1972.
[2] H. Braun, S. T. Buddha, V. Krishnan, A. Spanias, C. Tepedelenlioglu, T. Yeider, and T. Takehara. Signal processing for fault detection in photovoltaic arrays. In 37th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1681–1684, 2012.
[3] K. H. Chao, S. H. Ho, and M. H. Wang. Modeling and fault diagnosis of a photovoltaic system. Electric Power Systems Research, 78(1):97–105, 2008.
[4] Jinran Chen, Shubha Kher, and Arun Somani. Distributed fault detection of wireless sensor networks. In Proc. of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks, pages 65–72, 2006.
[5] A. Chouder and S. Silvestre. Fault detection and automatic supervision methodology for PV systems. Energy Conversion and Management, 51:1929–1937, 2010.
[6] R. Cole, J.S. Salowe, W.L. Steiger, and E. Szemeredi. An optimal-time algorithm for slope selection. SIAM Journal on Computing, 18(4):792–810, 1989.
[7] M. Ding, Dechang Chen, Kai Xing, and Xiuzhen Cheng. Localized fault-tolerant event boundary detection in sensor networks. In Proc. of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pages 902–913, 2005.
[8] S.K. Firth, K.J. Lomas, and S.J. Rees. A simple model of PV system performance and its use in fault detection. Solar Energy, 84:624–635, 2010.
[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning. Springer, 2008.
[10] Steve Horvath. Weighted network analysis: applications in genomics and systems biology. Springer Science & Business Media, 2011.
[11] S. Jukna. Extremal combinatorics: with applications in computer science. Springer, 2nd edition, 2011.
[12] Dror Y. Kenett, Michele Tumminello, Asaf Madi, Gitit Gur-Gershgoren, Rosario N. Mantegna, and Eshel Ben-Jacob. Dominating clasp of the financial sector revealed by partial correlation analysis of the stock market. PLoS ONE, 5(12):e15032, 12 2010.
[13] Donald E. Knuth. The art of computer programming: seminumerical algorithms, volume 2. Addison-Wesley Longman Publishing Co., Inc., 3rd edition, 1997.
[14] B. Marion. Comparison of predictive models for PV module performance. In 33rd IEEE Photovoltaic Specialist Conference, pages 1–6, 2008.
[15] J. Matousek, D. M. Mount, and N. S. Netanyahu. Efficient randomized algorithms for the repeated median line estimator. Algorithmica, 20(2):136–150, 1998.
[16] Peter J Rousseeuw and Annick M Leroy. Robust regression and outlier detection, volume 589. John Wiley & Sons, 2005.
[17] P. Traxler. Fault detection of large amounts of photovoltaic systems. In Online Proc. of the ECML PKDD 2013 Workshop on Data Analytics for Renewable Energy Integration (DARE’13), 2013.
[18] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(17), 2005.
[19] Chongming Zhang, Jiuchun Ren, Chuanshan Gao, Zhonglin Yan, and Li Li. Sensor fault detection in wireless sensor networks. In Proc. of the IET International Communication Conference on Wireless Mobile and Computing, pages 66–69, 2009.
[20] Yang Zhang, N. Meratnia, and P. Havinga. Outlier detection techniques for wireless sensor networks: a survey. Communications Surveys and Tutorials, IEEE, 12(2):159–170, 2010.








Applied multi-layer clustering to the diagnosis of complex agro-systems

Elisa Roux (1), Louise Travé-Massuyès (1) and Marie-Véronique Le Lann (1,2)
(1) CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
    emails: lisa.roux@laas.fr, louise@laas.fr, mvlelann@laas.fr
(2) Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France




Abstract

In many fields, such as medicine and environmental science, large amounts of data are produced every day. In many cases, the task of machine learning is to analyze data composed of very heterogeneous types of features. In previous work we developed a classification method based on fuzzy logic, capable of processing three types of features (data): qualitative, quantitative, and, more recently, intervals. We propose to add a new one: the object type, a meaningful combination of other features that makes it possible to develop hierarchical classifications. This is illustrated by a real-life case study taken from agriculture.¹

¹ This work was supported by the FUI/FEDER project MAISEO involving the companies VIVADOUR, CACG, GEOSYS, METEO FRANCE, PIONEER and laboratories CESBIO, LAAS-CNRS.

1   Introduction

Nowadays, large scale datasets are produced in various fields such as social networks, medicine, process operation, and agriculture/environment. Many studies relate to data mining with the intention of analyzing these data and, if possible, extracting knowledge from them. Data classification has to provide a relevant and well-fitted representation of reality. In this context, the issue of representing data is crucial since the formalisms must be generic yet well suited to every new problem. For machine learning, the concern is to be able to detect adequate patterns from heterogeneous, large, and sometimes uncertain datasets. In diagnosis, it is essential to recognize a problem quickly in order to provide a reliable solution. One of the main challenges is the necessity to process heterogeneous data (qualitative, quantitative...) and sometimes to merge data obtained in different contexts. We developed a classification method based on fuzzy logic [1] capable of processing heterogeneous data types and noisy data. The LAMDA (Learning Algorithm for Multivariate Data Analysis) method is a classification method capable of processing three types of data: qualitative, quantitative, and intervals [2]. We addressed one of the main difficulties encountered in data analysis tasks: the diversity of information types. Such information types are given by qualitative valued data, which can be nominal or ordinal, mixed with quantitative and interval data. Many situations leading to well-conditioned algorithms for quantitative valued information become very complex whenever several data are given in qualitative form. In a non-exhaustive list, we can mention rule based deduction, classification, clustering, and dimensionality reduction. During the last decades, few research works have addressed the issue of representing multiplicity for data analysis purposes [3, 11]. However, no standard principle has been proposed in the literature to handle heterogeneous data in a unified way. Indeed, many of the proposed techniques process quantitative and qualitative data separately. In data reduction tasks, for example, they are based either on distance measures for the former type [12] or on information or consistency measures for the latter. In classification and clustering tasks, sometimes only a Hamming distance is used to handle qualitative data [4,11,14]. Other approaches are originally designed to process only quantitative data, and therefore arbitrary transformations of qualitative data into a quantitative space are proposed without taking into account their nature in the original space [12,15,16]. For example, the variable shape can take values in a discrete unordered set {round, square, triangle}. These values are transformed respectively to the quantitative values 1, 2, and 3. However, we could equally choose to transform them to 3, 2 and 1. The inverse practice is to enhance the qualitative aspect and discretize the quantitative value domain into several intervals, so that objects in the same interval are labeled by the same qualitative value [17,18]. Obviously, both approaches introduce distortion and end up with information loss with respect to the original data. Moreover, none of the previously proposed approaches combines, in a fully adequate way, the processing of symbolic intervals simultaneously with quantitative and qualitative data. Although extensive studies were performed to process this type of data in the Symbolic Data Analysis framework [19], they generally focused on clustering tasks [8, 10] and no unified principle was given to handle the three types of data simultaneously for different analysis purposes. In [2], a new general principle was introduced as “Simultaneous Mapping for Single Processing (SMSP)”, enabling reasoning in a unified way about heterogeneous data for several data analysis purposes. The fact that SMSP together with






LAMDA can process simultaneously these three types of data without pre-processing is one of its principal advantages compared to other classical machine learning methods such as SVM (Support Vector Machine [20]) and K-NN [21]. Decision trees are very powerful tools for classification and diagnosis [22] but their sequential approach is still not advisable to process multidimensional data since, by their very nature, they cannot be processed as efficiently as totally independent information [23]. A complete description of the LAMDA method and comparison with other classification techniques on various well known data sets can be found in [24, 25, 26]. Its other main characteristic is the fuzzy formalism, which enables an element to belong to several classes simultaneously. It is also possible to perform clustering (i.e. with no a priori knowledge of the number of classes and of the class prototypes).

Besides the three existing types, we propose to add another type: the class type, which can be processed simultaneously with the three former ones (quantitative, qualitative, intervals) thanks to the “SMSP”. In this configuration the class feature represents a meaningful aggregation of other features. This aggregation can be defined by a class determined by a previous classification, or be the result of an abstraction. This new type gives the possibility to develop hierarchical classifications or to fuse different classifications. It allows an easier representation of many various and complex types of data, like multi-dimensional data, while being realistic and conserving their constraints. In a first part, the LAMDA method is briefly explained. The second part is devoted to the new type of data introduced: the object type. Finally, this new method is exemplified through an agronomical project.

2   The LAMDA method

The LAMDA method is an example of fuzzy logic based classification methods [9]. The classification method takes as input a sample x made up of N features. The first step is to compute, for each feature of x, an adequacy degree to each class $C_k$, $k = 1..K$, where K is the total number of classes. This is obtained by the use of a fuzzy adequacy function providing K vectors of Marginal Adequacy Degrees (MAD). This degree estimates the closeness of every single sample feature to the prototype corresponding to its class. At this point, all the features are in a common space. Then the second step is to aggregate all these marginal adequacy degrees into one global adequacy degree (GAD) by means of a fuzzy aggregation function. Thus the K MAD vectors become K GADs. Fuzzy logic [1] is used here to express MADs and GADs, since the membership degree of a sample to a given class is not binary but takes a value in [0,1]. Classes can be known a priori, commonly determined by an expert, in which case the learning process is supervised, or classes can be created during the learning itself (unsupervised mode or clustering). Three types of features can be processed by the LAMDA method for the MAD calculation: quantitative, qualitative and intervals [2]. The membership functions µ(x) used by LAMDA are based on the generalization of a probabilistic rule defined on {0, 1} to the [0,1]-space.

2.1 Calculation of MAD for quantitative features

The quantitative type allows the representation of numerical values, assuming that the including space is known as a defined interval. For this type of descriptor, membership functions can be used, such as the Gaussian membership function, so that the membership function for the ith descriptor of sample x to the kth class is:

\[ \mu_k^i(x_i) = \exp\left( -\frac{(x_i - \rho_{ki})^2}{2\sigma_i^2} \right) \tag{1} \]

or the binomial membership function:

\[ \mu_k^i(x_i) = \rho_{ki}^{\,x_i} \, (1 - \rho_{ki})^{(1 - x_i)} \tag{2} \]

where $\rho_{ki} \in [0, 1]$ is the mean of the ith feature based on the samples belonging to the class $C_k$, $x_i \in [0, 1]$ is the normalized ith feature and $\sigma_i$ the standard deviation of the ith feature value based on the samples belonging to the class $C_k$.

2.2 Calculation of MAD for qualitative features

In the case of a qualitative feature, the possible values of the ith feature form a set of modalities $D_i = \{Q_1^i, Q_2^i, \ldots, Q_m^i\}$ with m the total number of modalities. The qualitative type makes it possible to express the different modalities of a criterion in words. The frequency of a modality $Q_l^i$ of the ith feature for the class $C_k$ is the quantity of samples belonging to $C_k$ whose modality for their ith feature is $Q_l^i$ [1]. So each modality $Q_l^i \in D_i$ has an associated frequency. Let $\theta_{kj}^i$ be the frequency of a modality $Q_j^i$ for the class $C_k$. The membership function concerning the ith feature is:

\[ \mu_k^i(x_i) = (\theta_{k1}^i)^{q_1^i} \cdot (\theta_{k2}^i)^{q_2^i} \cdots (\theta_{km}^i)^{q_m^i} \tag{3} \]

where $q_l^i = 1$ if $x_i = Q_l^i$ and $q_l^i = 0$ otherwise, for $l = 1, \ldots, m$.

2.3 Calculation of MAD for interval features

Finally, to take into account the potential uncertainties or noise in data, we can use the interval representation [2]. The membership function for the interval type descriptors is regarded as being the similarity $S(x_i, \rho_{ki})$ between the symbolic interval value for the ith feature $x_i$ and the interval $[\rho_{ki}^-, \rho_{ki}^+]$ which represents the value of the ith feature for the class $C_k$, so that:

\[ \mu_k^i(x_i) = S(x_i, \rho_{ki}) \tag{4} \]
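As an illustration only, the membership functions of Eqs. (1) to (3) can be sketched in a few lines of Python. The function names, the dictionary representation of the modality frequencies, and the sample values are assumptions made for this sketch and are not taken from the LAMDA implementation; in particular, the frequencies are assumed to be relative frequencies in [0, 1].

```python
import math


def mad_gaussian(x, rho, sigma):
    """Eq. (1): Gaussian membership of a normalized value x in [0, 1]
    to a class whose feature mean is rho and standard deviation is sigma."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return math.exp(-((x - rho) ** 2) / (2.0 * sigma ** 2))


def mad_binomial(x, rho):
    """Eq. (2): binomial membership rho**x * (1 - rho)**(1 - x)."""
    return (rho ** x) * ((1.0 - rho) ** (1.0 - x))


def mad_qualitative(x, theta):
    """Eq. (3): product of theta[l] ** q_l, where q_l is 1 only for the
    observed modality, so the product reduces to theta[x].
    theta maps each modality to its (assumed relative) frequency in the class."""
    if x not in theta:
        raise ValueError(f"unknown modality: {x!r}")
    result = 1.0
    for modality, frequency in theta.items():
        q = 1 if modality == x else 0
        result *= frequency ** q
    return result


# Hypothetical sample and class prototype with three features.
sample = {"temperature": 0.62, "humidity": 0.30, "shape": "round"}
mads = [
    mad_gaussian(sample["temperature"], rho=0.60, sigma=0.10),
    mad_binomial(sample["humidity"], rho=0.25),
    mad_qualitative(sample["shape"],
                    {"round": 0.6, "square": 0.3, "triangle": 0.1}),
]
print(mads)  # one marginal adequacy degree per feature, each in [0, 1]
```

Because every function returns a value in [0, 1], the three feature types end up in the common space that the aggregation step requires.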






Let $\varpi$ be defined as the scalar cardinal of a fuzzy set in a discrete universe $V$: $\varpi[X] = \sum_{x_i \in V} \mu_X(x_i)$.
In the case of a crisp interval, it becomes:
$\varpi[X] = \mathrm{upperBound}(X) - \mathrm{lowerBound}(X)$.
Given two intervals $A = [a^-, a^+]$ and $B = [b^-, b^+]$, the distance is defined as:

\[ \delta[A, B] = \max\left[ 0, \left( \max\{a^-, b^-\} - \min\{a^+, b^+\} \right) \right] \tag{5} \]

and the similarity measure between two crisp intervals is defined as:

\[ S(I_1, I_2) = \frac{1}{2} \left( \frac{\varpi[I_1 \cap I_2]}{\varpi[I_1 \cup I_2]} + 1 - \frac{\delta[I_1, I_2]}{\varpi[V]} \right) \tag{6} \]

The similarity combines the Jaccard similarity measure, which computes the similarity when the intervals overlap, and a second term which takes into account the case where the intervals do not overlap.

2.4 Calculation of feature weights

It is possible to determine the relevance of a feature to optimize the separation between classes. The MEMBAS method [8, 9] is a feature weighting method based on a membership margin. A distinguishing property of this method is its capability to process problems characterized by mixed-type data (quantitative, qualitative and interval). It relies on the maximization of the margins between the two closest classes for each sample. It can be expressed as:

\[ \max_{w_f} \sum_{j=1}^{J} \beta_j(w_f) = \frac{1}{N} \sum_{j=1}^{J} \left\{ \sum_{i=1}^{N} w_{f_i} \, \mu_c^i(x_i^{(j)}) - \sum_{i=1}^{N} w_{f_i} \, \mu_{\tilde{c}}^i(x_i^{(j)}) \right\} \tag{7} \]

subject to the following constraints: $\|w_f\|_2^2 = 1$, $w_f \ge 0$.

The first constraint is the normalized bound for the modulus of $w_f$ so that the maximization ends up with non-infinite values, whereas the second guarantees the nonnegative property of the obtained weight vector. Then (7) can be simplified as:

\[ \max_{w_f} \; (w_f)^T s \quad \text{subject to } \|w_f\|_2^2 = 1, \; w_f \ge 0 \tag{8} \]

where $s = \frac{1}{N} \sum_{j=1}^{J} \{ U_{jc} - U_{j\tilde{c}} \}$ and $U_{jc} = [\mu_c^1(x_1^{(j)}), \ldots, \mu_c^N(x_N^{(j)})]$. Here $\mu_c^i(x_i^{(j)})$ is the membership function of class $c$ ($c$ corresponds to the “right” class for sample $x^{(j)}$, $\tilde{c}$ to the closest class) evaluated at the given value $x_i^{(j)}$ of the ith feature of pattern $x^{(j)}$. $s$ is computed with respect to all samples contained in the data base excluding $x^{(j)}$ (“leave-one-out margin”).

This optimization problem has an analytical solution determined by the classical Lagrangian method. Details of the method can be found in [9].

3   The new object type

In order to allow the combination of various data types into one single global object and therefore to support multi-dimensional features, we develop a novel data type. Each feature of an object descriptor can be described by a measured value and an extrinsic object-related weight. The GAD of a sample is then computed as the weighted mean of all MADs:

\[ \mathrm{GAD}_k^{j} = \sum_{i} \mathrm{MAD}_k^{ji} \cdot \tilde{w}_{f_i} \quad \text{for } j = 1 \ldots J \tag{9} \]

where $\mathrm{MAD}_k^{ji}$ is the MAD of the jth sample for the ith feature to class k, $\tilde{w}_{f_i} \in [0,1]$ is the normalized value of the weight $w_{f_i}$ of the ith feature determined by the MEMBAS method, and J is the total number of samples which have been classified.

[Figure 1: LAMDA architecture]

The main advantage of using this new object-oriented data type is to capture the distinct features of the same object as a whole. An object of layer i-1 is regarded as one single feature for layer i and can then be processed like all other descriptors. The weights of the descriptors composing the objects are determined using MEMBAS once the clustering is finished for layer i-1. An object is regarded as being a combination of features, each of which is associated with its weight. In other words, an object regarded as a single entity in reality can be processed as a complex unit. For instance, the weather can be considered as a global concept but also as detailed data (rain, temperature, etc.). All of its features are parts of the same object and are strongly connected together. That realistic consideration implies several distinct clustering layers. Layer i concerns the classification of a sample set called A and layer i-1 involves some of their constituent units. Obviously, a second layer of classification is consistent only if at least one of the sample features is a complex entity. Therefore, for each sample of the set, an object feature becomes itself a whole sample in layer i-1 and is compared to the others






to constitute a new sample set called B. Then a classification of the B samples is performed. Once the classification of the B samples has been done, its results are used to compute the classification of A. If the samples of the A set have C complex features, the second classification level implies C distinct sample sets B1, B2, …, BC and thus C distinct classifications.

The MEMBAS algorithm [8, 9] can then calculate the weights of every feature for the definition of the classes. It is applied to the B samples so that the features involved become the weighted components of a meaningful object. Each complex feature of an A sample is then a balanced combination of attributes.

Figure 2: Principle of hierarchical classification

As explained in Figure 2, the sample Sample1 is described by X features, including the object-type feature Desc1,1. Desc1,1 is described by Desc1,α, Desc1,β, etc. To get their respective importance Wα, Wβ, etc. in the description of Desc1,1, a preliminary classification is performed regarding Desc1,1 as a sample (Sample1,p), so that each weight can be calculated using the MEMBAS algorithm [8, 9]. Once the respective weights of each feature are known, objects are automatically instantiated to be involved in the main classification. Desc1,1 is then described in line with the obtained weights Wα, Wβ and the known values V1,α, V1,β.

2.5 Evaluation of a classification quality

The comparison of two classifications can be performed by measuring their respective compactness and their separation. The more compact and well separated the classes are, the easier the recognition process will be.

A method to measure the quality of a partition has been proposed by [10]. This index measures the quality of the partition in terms of class compactness and separation. This partition index is the Clusters Validity Index (CV, Eq. (10)), which depends only on the GADs (membership degree of an individual to a class) and not explicitly on data values.

$$CV = \frac{Dis}{N} \cdot D^*_{min} \cdot K \qquad (10)$$

where N is the total number of individuals in the data base and K the total number of classes. Dis represents the dispersion given by:

$$Dis = \sum_{k=1}^{K} \left( 1 - \frac{\sum_{j=1}^{J} \delta_{kj} \cdot \exp(\delta_{kj})}{N \cdot GADM_k \cdot \exp(GADM_k)} \right) \qquad (11)$$

with:

$$\delta_{kj} = GADM_k - GAD_k^j \quad \forall j,\ j \in [1, J] \qquad (12)$$

and

$$GADM_k = \max_j \left[ GAD_k^j \right] \qquad (13)$$

$D^*_{min}$ is the minimum distance between two classes. This distance is computed by using the distance d*(A,B) between two fuzzy sets A and B [8] defined by:

$$d^*(A, B) = 1 - \frac{M[A \cap B]}{M[A \cup B]} = 1 - \frac{\sum_{j=1}^{J} \min\left(GAD_A^j, GAD_B^j\right)}{\sum_{j=1}^{J} \max\left(GAD_A^j, GAD_B^j\right)} \qquad (14)$$

A higher value of CV corresponds to a better partition.

4     Application to an agronomical project

The agronomical project aims at developing a diagnosis system for optimized water management and efficient, plot-specific guidance for corn farmers, in order to decrease the use of phytosanitary products and the water consumption for irrigation. The project involves two aspects. The first one aims at complementing the benefits of adopting and implementing the cultural profile techniques [28, 29]. In this context, we perform a classification of plots based on various agronomic and SAFRAN meteorological data [30], so that each plot should mostly belong to one particular class whose features are known. Thanks to the information stemming from the classification results, advice can be offered to the corn farmers concerning the corn variety they should sow and the schedule they should follow for an optimized yield. This study includes two steps, which are described in Figure 3. The first one concerns the clustering of a training set of 50 plots, using the unsupervised LAMDA classification.
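For the validity index of Section 2.5, the distance of Eq. (14) and the resulting minimum inter-class distance D*min depend only on the GADs, so they can be computed from the membership values alone. The sketch below is a minimal illustration under stated assumptions, not the implementation used in this work: the function names and the layout `gad[k][j]` (GAD of individual j to class k) are hypothetical.

```python
from itertools import combinations


def fuzzy_distance(gad_a, gad_b):
    """d*(A, B) of Eq. (14): 1 - sum(min) / sum(max) over all individuals."""
    intersection = sum(min(a, b) for a, b in zip(gad_a, gad_b))
    union = sum(max(a, b) for a, b in zip(gad_a, gad_b))
    if union == 0:
        # Assumption: the distance is undefined when both classes are empty.
        raise ValueError("both classes have zero membership everywhere")
    return 1.0 - intersection / union


def min_class_distance(gad):
    """D*min: smallest d* over all pairs of classes.

    gad[k][j] is assumed to hold the GAD of individual j to class k.
    """
    return min(fuzzy_distance(a, b) for a, b in combinations(gad, 2))


# Two well separated classes described over three individuals.
gad = [
    [0.9, 0.8, 0.1],  # class 1
    [0.1, 0.2, 0.9],  # class 2
]
d_min = min_class_distance(gad)
```

Two identical classes give a distance of 0 and two classes with disjoint supports give a distance of 1, which matches the reading of D*min as a separation measure.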






Figure 3: Learning System functioning

The data used for this classification are six distinctive agronomical descriptors, which describe the plots' features and are highly involved in their capacity for yield and water retention, and twenty-one weather features, which define the meteorological class in which the plot is situated. The second part of the project will be repeated annually to update and improve the clustering performed previously by adding new information returned by the farmers after harvest. In the following, only the first part is presented.

First, a preliminary meteorological clustering (B) is required to obtain a realistic plot classification, since the yield of seedlings is highly related to the meteorological conditions. The weather is then regarded as a complex entity, so that it is only one of the features of a plot. It is based on the historical meteorological data of the geographical position corresponding to the studied plot. Those descriptors refer to the temperature, the quantity of rainfall, and the evapotranspiration which occurred during three crucial periods of the year. Each feature is described in several distinctive ways. For instance, the temperature of one period is evaluated according to three types of information. This meteorological clustering is an unsupervised classification based on weather data covering every single day of the determined periods during the last fifty years for all the geolocalized points belonging to the area studied in this project (South-West of France). If the plot is part of the training set (studied area), the weather type of its area is known and the plot classification can be done directly. Otherwise, the weather type is obtained through a supervised classification mode (B') delivering the most appropriate context. In either case, the weather type is an object-feature. This hierarchical treatment makes it possible to regard each meteorological type as a whole and to let the weather contexts follow their natural evolution independently of agronomical variations. Moreover, considering the meteorological features as a single global object makes it possible to take the environmental constraints into account and to obtain a realistic model. As we can observe in Figure 4, the meteorological clustering (B) divides the area into three sub-areas. The clustering (B) and the meteorological supervised classification (B') were first performed with every sample of the set, and the distribution of the weights between the meteorological features was determined.

The result of this classification is consistent, so we can use the obtained classes and the weights of the meteorological features (obtained with MEMBAS) as object-features in classification (A). To analyze the benefit of using hierarchical classification, a clustering (A') has been performed by using the twenty-one meteorological features separately together with the agronomical features (twenty-seven features taken indistinctly). We can notice that the prototypes of the classes are highly dependent on the meteorological classes for clustering (A), while clustering (A') is mainly influenced by the ground type.

Figure 4: Meteorological sub-areas obtained with classification (B)

To illustrate this, we arbitrarily chose two very close classes containing similar plots in both clusterings. Each class prototype is described by the mean value of its marginal membership degrees (MAD). Figure 5 represents these prototype parameters for the meteorological features only, for both cases (A with diamonds and A' with squares), with the marginal membership degree for class 1 on the abscissa and the same marginal membership degree for class 2 on the ordinate. To better quantify the benefits brought by the object representation, the CV is systematically calculated in order to determine the better partition quality. The results are very encouraging since CV = 0.69 when the meteorological data are regarded as a whole object and 0.2 when they are treated separately. The object-type representation multiplies this index by more than 3, and therefore improves the compactness of the obtained partition.
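The class prototypes compared in Figure 5 are averages of marginal membership degrees. A minimal sketch of that averaging is given here, assuming that each sample comes with one MAD per feature and a class label; the function name `class_prototypes` and the data layout are hypothetical and only illustrate the idea.

```python
from collections import defaultdict


def class_prototypes(mads, labels):
    """Average, feature by feature, the MAD vectors of the samples of each class.

    mads[n][i] is assumed to be the MAD of sample n on feature i,
    labels[n] the class assigned to sample n.
    """
    grouped = defaultdict(list)
    for vector, label in zip(mads, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


# Three samples, two features, two classes (made-up values).
mads = [[0.2, 0.4], [0.4, 0.8], [1.0, 0.0]]
labels = ["class 1", "class 1", "class 2"]
prototypes = class_prototypes(mads, labels)
```

Plotting, for each meteorological feature, the prototype value of one class against that of the other gives the kind of scatter shown in Figure 5.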






Figure 5: Meteorological prototypes for two close classes in case (A) and (A')

The second aspect of our involvement in the project deals with the water utilization of various clusters of farmers, with the aim of forecasting the needs of each cluster and adjusting the distribution. From this perspective, we perform an unsupervised classification of a training data-set of 2900 samples described by seven features: distance to the closest waterway, orientation, altitude… Orientation concerns cardinal points, and we assume that it cannot be expressed with different modalities since continuity cannot be represented by qualitative descriptors. It cannot be a number or an interval either, because the cyclic form must be kept. Thus we choose to regard a cluster orientation as an object composed of two descriptors that correspond to the coordinates of its cardinal point on the trigonometric circle. The orientation of each cluster can take eight different values: N, NE, E, SE, S, SW, W, and NW, which leads us to consider eight different combinations. In accordance with the trigonometric circle, these eight combinations are respectively: N (0, 1), NE (√2/2, √2/2), E (1, 0), SE (√2/2, -√2/2), S (0, -1), SW (-√2/2, -√2/2), W (-1, 0) and NW (-√2/2, √2/2).

Once our results are validated by an expert, the classification is run twice: first treating each descriptor separately and then involving the object type. As for the meteorological data in the first example, the CV is calculated in order to determine the better partition quality.

In this case, which involves 2900 samples, CV = 0.08 when abscissa and ordinate are separated, and CV = 0.13 when using an orientation object. As in the first example, these results show a qualitative gain for the partition when the object type is used to express the semantically connected data.

5   Conclusion

This modular architecture allows more flexibility and a more precise treatment of data. As we can notice with the previous agronomical classification, the object approach makes each module manageable independently of the others, so that they can evolve autonomously, depending on their own specific features and contexts. The object representation preserves multi-dimensionality and makes the fusion of datasets easier. A better overview is offered since we can perceive the variations of each module distinctly, as well as the evolution of their influences.

As a perspective, an agent-oriented architecture based on the multi-agent theory [31] will be developed so that each sample can be considered independently of the others. The samples would thus be able to create classes by acting simultaneously and comparing themselves to the others, so that the definition of the classes will no longer depend on the order of the samples in the file but will directly result from the definition of the sample set. This orientation will ensure that the classification result of our method is unique and stable for a given sample set. We also aim at developing methods to allow semantic data processing.

References

[1] D. Dubois and H. Prade editor. The three semantics of fuzzy sets, Fuzzy sets and systems, vol. 90, N° 2, pp. 141-150, Elsevier, London, 1997.
[2] L. Hedjazi, J. Aguilar-Martin, M.V. Le Lann and T. Kempowsky, Towards a unified principle for reasoning about heterogeneous data: a fuzzy logic framework, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 20, N° 2, pp. 281-302, World Scientific, 2012.
[3] R.S. Michalski and R.E. Stepp, Automated construction of classifications: Conceptual clustering versus numerical taxonomy, IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, no. 4 (1980), pp. 396-410.
[4] D.W. Aha, Tolerating noisy, irrelevant and novel attributes in instance based learning algorithms, Int. Man-Machine Studies 36 (1992), pp. 267-287.
[5] S. Cost, S. Salzberg, A weighted nearest neighbor algorithm for learning with symbolic features, Machine learning (10) (1993), pp. 57-78.
[6] T. Mohri, T. Hidehiko, An optimal Weighting Criterion of case indexing for both numeric and symbolic attributes, in D.W. Aha (Ed.), Case-based Reasoning: papers from the 1994 workshop. Menlo Park, CA: AAAI Press, pp. 123-127.
[7] C. Giraud-Carrier, M. Tony, An Efficient Metric for heterogeneous Inductive Learning Applications in the Attribute-Value Language, Intelligent systems (1995) 341–350.
[8] K.C. Gowda, E. Diday, Symbolic clustering using a new similarity measure, IEEE Trans. SMC 22(2) (1992) 368–378.
[9] Q.H. Hu, Z.X. Xie, D.R. Yu, Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation, Pattern Recognition 40 (2007) 3509–3521.
[10] F.A.T. De Carvalho, R.M.C.R. De Souza, Unsupervised Pattern Recognition Models for Mixed Feature-Type Symbolic Data, Pattern Recognition Letters 31 (2010) 430–443.






[11] I. Kononenko, Estimating Attributes: Analysis and Extensions of Relief, Proc. European Conf. Mach. Learning ECML (1994), pp. 171-182.
[12] K. Kira, L. Rendell, A practical approach to feature selection. In proced. 9th Int’l Workshop on Machine Learning (1992), pp. 249-256.
[13] M. Dash, H. Liu, Consistency-based search in feature selection, Artif. Intell. 151 (2003) 155–176.
[14] D.W. Aha, Incremental, instance-based learning of independent and graded concept descriptions, in Proced. Of the 6th int’l Mach. Learning Workshop. (1989) 387–391.
[15] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, V. Vapnik, Feature Selection for SVMs, Advances in Neural Information Processing Systems (2001), pp. 668-674.
[16] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Trans. Inf. Theory 13 (1967), pp. 21-27.
[17] H. Liu, F. Hussian, C.L. TAM, M. Dash, Discretization: an enabling technique, J. Data Mining and Knowledge Discovery 6(4) (2002) 393–423.
[18] M.A. Hall, Correlation-based Feature Selection for Discrete and Numeric Class Machine Learning, Int. Conf. Mach. Learning ICML (2000), pp. 359-366.
[19] H.H. Bock, Diday E. Analysis of Symbolic Data, Exploratory methods for extracting statistical information from complex data. (Springer, Berlin Heidelberg, 2000).
[20] V. Vapnik, The Nature of Statistical Learning Theory Data Mining and Knowledge Discovery, pp. 1-47, 6, Springer Verlag, 1995.
[21] T. Cover and P. Hart, Nearest neighbor pattern classification Information Theory, IEEE Transactions on, 13, 1, pp. 21-27, 1967.
[22] Michie D., Spiegelhalter D.J., Taylor C.C., Machine Learning, Neural and Statistical Classification, Ellis Horwood series in Artificial Intelligence, february, 1994.
[23] Rakotomalala R., Decision Trees, review MODULAD, 33, 2005.
[24] J. C. Aguado and J. Aguilar-Martin, A mixed qualitative-quantitative self-learning classification technique applied to diagnosis, The Thirteenth International Workshop on Qualitative Reasoning, (QR'99) pp. 124-128, 1999.
[25] L. Hedjazi, J. Aguilar-Martin, M.V. Le Lann, Similarity-margin based feature selection for symbolic interval data, Pattern Recognition Letters, Vol. 32, N° 4, pp. 578-585, 2012.
[26] L. Hedjazi, J. Aguilar-Martin, M.V. Le Lann, and Tatiana Kempowsky-Hamon, Membership-Margin based Feature Selection for Mixed Type and High-dimensional Data: Theory and Applications, Information Sciences, accepted to be published, 2015.
[27] C. V. Isaza, H. O. Sarmiento, T. Kempowsky-Hamon, M.V. Le Lann, Situation prediction based on fuzzy clustering for industrial complex processes, Information Sciences, Volume 279, 20 September 2014, pp. 785-804, 2014.
[28] Henin S., Gras R., Monnier G., 1969, Le profil cultural (2e edition), Masson Ed. Paris.
[29] Gautronneau Y., Gigleux C., 2002, Towards holistic approaches for soil diagnosis in organic orchards, Proceedings of the 14th IFOAM Organic World Congress, Victoria, p 34.
[30] Quintana-Seguí, P., Le Moigne, P., Durand, Y., Martin, E., Habets, F., Baillon, M., ... & Morel, S. (2008). Analysis of near-surface atmospheric variables: Validation of the SAFRAN analysis over France. Journal of applied meteorology and climatology, 47(1), 92-107.
[31] Ferber, J. (1999). Multi-agent systems: an introduction to distributed artificial intelligence (Vol. 1). Reading: Addison-Wesley.








          A Bayesian Framework for Fault diagnosis of Hybrid Linear Systems

           Gan Zhou1 Gautam Biswas2 Wenquan Feng1 Hongbo Zhao1 and Xiumei Guan1
         1
           School of Electronic and Information Engineering, Beihang University, Beijing, China
                         email: zhouganterry@hotmail.com; buaafwq@buaa.edu.cn;
                                 bhzhb@buaa.edu.cn; guanxm@buaa.edu.cn
            2
              Institute for Software Integrated Systems, Vanderbilt University, Nashville, USA
                                   email: gautam.biswas@vanderbilt.edu

                         Abstract

    Fault diagnosis is crucial for guaranteeing safe, reliable and efficient operation of modern engineering systems. These systems are typically hybrid. They combine continuous plant dynamics described by continuous-state variables and discrete switching behavior between several operating modes. This paper presents an integrated approach for online tracking and diagnosis of hybrid linear systems. The diagnosis framework combines multiple modules that realize the hybrid observer, fault detection, isolation and identification functionalities. More specifically, a Dynamic Bayesian Network (DBN)-based particle filtering (PF) method is employed in the hybrid observer to track nominal system behavior. The diagnostic module combines a qualitative fault isolation method using hybrid TRANSCEND, and a quantitative estimation method that again employs a DBN-based PF approach to isolate and identify abrupt and incipient parametric faults, discrete faults and sensor faults in a computationally efficient manner. Finally, simulation and experimental studies performed on a hybrid two-tank system demonstrate the effectiveness of this approach.

1 Introduction

The increasing complexity of modern industrial systems motivates the need for online health monitoring and diagnosis to ensure their safe, reliable, and efficient operation. These systems are typically hybrid, involving the interplay between discrete switching behavior and continuous plant dynamics. More specifically, the system configuration changes consist of known controlled mode transitions generated by an external supervisory controller and autonomous mode transitions triggered by internal variables crossing boundary values. The continuous dynamic behavior is modeled by continuous-state variables that are a function of the particular discrete mode of operation. As a result, tasks like online monitoring and diagnosis have to seamlessly integrate continuous behaviors interspersed with discrete transitions that often require model switching to accommodate the discrete transitions [1].

For complex hybrid systems, faults will typically affect the continuous behavior and the discrete dynamics of the system. Some faults may be parametric, and they directly affect the continuous behavior; others are discrete, and thus directly affect the mode of system operation. Both types of faults also have indirect effects on the other type of behavior. Moreover, faults can have different time-varying profiles, such as abrupt faults, intermittent faults and incipient faults [2]. In addition, faults may occur in the plant, the actuators and the sensors. The diagnosis of multiple fault types in the same framework is challenging, because some faults may produce similar effects in the particular measurements. Therefore, the diagnosis approach should provide more discriminatory power.

Previous model-based diagnosis approaches of hybrid systems were developed separately for parametric faults or discrete faults. For example, [1], [3] combined system monitoring with an integrated approach: qualitative and quantitative fault isolation to generate, refine, and identify parametric faults. [4]-[5] are typical discrete fault diagnosis approaches, which modeled the discrete faults as fault modes, and relied on estimating the system behavior for diagnosis. In recent years, some integrated approaches have been proposed for diagnosis of parametric and discrete faults together. [6] introduced a global ARRs (GARRs)-based mode diagnoser to track discrete system modes, and combined it with a quantitative approach to diagnose discrete and abrupt or incipient parametric faults within a common framework. The approach presented in [7] monitored system behavior using a timed Petri-Net model and mode estimation techniques, and isolated the faults by means of a decision tree approach. Unfortunately, this method was application-specific, and was not generalized.

Our goal in this paper is to propose an integrated model-based approach to diagnose single and persistent incipient or abrupt parametric faults, discrete faults and sensor faults in hybrid linear systems. This extends our earlier work [8] from continuous systems to hybrid systems. A PF technique using switched DBN is adopted for tracking nominal hybrid system behavior. When a non-zero residual value is detected using a statistical hypothesis testing method, this fault detection scheme triggers the fault isolation and identification modules. We combine a fast qualitative fault isolation (Qual-FI) scheme using the hybrid TRANSCEND approach [1] with a quantitative fault isolation and identification (Quant-FII) scheme based on a PF-based parameter estimation technique to support the diagnosis of multiple fault types in hybrid linear systems. The




Quant-FII scheme derives a switched faulty DBN model for each fault hypothesis that remains when
the switch from Qual-FI to Quant-FII is initiated. In addition, Quant-FII is also designed to
estimate possible parameter values [8].
   The rest of this paper is organized as follows. Section 2 briefly presents the models employed
in our diagnosis approach and the basic definitions of the different fault types. A hybrid
two-tank system is used as a running example to explain the hybrid bond graph modeling method and
the derivation of the temporal causal graph and DBN from hybrid bond graph models. Section 3 gives
a brief overview of our diagnosis architecture, and then presents our online tracking and fault
detection, qualitative fault isolation, and quantitative fault isolation and identification
schemes in some detail. Section 4 discusses the results of applying our algorithm to the hybrid
two-tank system. Finally, the discussion and conclusions of this paper are presented in the last
section.

2 Theoretical Background
In this section, we formalize the basic definitions, concepts and notation of the modeling
approach that goes in conjunction with our diagnosis architecture.

2.1 Hybrid Bond Graphs
Bond graphs (BGs) are a domain-independent topological-modeling language that captures
energy-based interactions among the processes that make up a physical system [9]. The nodes in
bond graphs represent components of dynamic systems including energy storage elements
(capacitances, C and inertias, I), energy dissipation elements (resistors, R), energy sources
(effort source, Se and flow source, Sf) and energy transformation elements (gyrators, GY and
transformers, TF). Bonds, drawn as half arrows, represent the energy exchange paths between the
bond graph elements. Two junctions (1 and 0), also modeled as nodes, represent the equivalent of
series and parallel topologies respectively.

Figure 1 Schematics of hybrid two-tank system

   Hybrid bond graphs (HBGs) extend BGs by introducing switched junctions to enable discrete
changes in the system configuration [10]. The switched junctions may be dynamically switched on
and off as system behavior evolves. When a switched junction is on, it behaves as a normal
junction. When off, the 1 and 0 junctions behave as sources of zero flow and zero effort,
respectively. The dynamic behavior of switched junctions is implemented by a finite state machine
control specification (CSPEC). A CSPEC defines a finite number of states, and captures controlled
and autonomous changes.
   The hybrid two-tank system, shown in Figure 1, is the running example we employ in this paper.
This system consists of two tanks connected by a pipe, a source of flow into the first tank, and
drain pipes at the bottom of each tank. Three valves, valve1, valve2 and valve3, can be turned on
and off by commands generated by the supervisory controller. When the liquid level in tank 1 (h1)
and/or tank 2 (h2) reaches the height at which pipe R12 is placed (h), a flow is initiated through
pipe R12. The autonomous mode changes associated with this pipe are triggered when the liquid
level in tank 1 and/or tank 2 goes above or below the height of the pipe R12. We assume five
sensors: M1 and M2 measure the outflow from tank 1 and tank 2, respectively. M3 measures the flow
through the autonomous pipe R12, and M4 and M5 measure the liquid pressure in tank 1 and tank 2,
respectively.

Figure 2 Hybrid bond graph of the plant

   Figure 2 illustrates the HBG model for the plant in Figure 1 (the HBG model for autonomous pipe
R12 is shown separately at the bottom of Figure 2). The tanks and pipes are modeled as fluid
capacitances C and resistances R, respectively. Measurement points occur at junctions. They are
denoted by elements with symbols De for effort variable measurements and Df for flow variable
measurements. Moreover, the two-tank system has five switched junctions: CSPEC1, CSPEC2 and CSPEC3
describe the control logic for the three valves. CSPEC4 and CSPEC5 together capture the autonomous
mode transitions of the connecting
pipe between the two tanks. Figure 3 (a) shows the CSPEC for a valve controlled by the switching
signal sw. Figure 3 (b) shows CSPEC4, which describes the state of the left tank. When the liquid
height in tank 1 is below that of the autonomous pipe R12, the state is OFF. If the liquid level
exceeds the height of the pipe, this CSPEC transitions to the ON state. Similarly, CSPEC5 denotes
the state of the right tank, and the mode of the autonomous pipe depends on the combination of
these two CSPECs. Table 1 shows the discrete modes of pipe R12 and the corresponding states of
CSPEC4 and CSPEC5 in detail. The corresponding bond graph configurations are described in [15].

Figure 3 (a) Controlled transition; (b) Autonomous transition for CSPEC4

Table 1 Four different possible configurations for autonomous pipe R12
    Mode    Constraint Function            CSPEC4    CSPEC5
    1       $h_1 \ge h \wedge h_2 < h$     ON        OFF
    2       $h_1 < h \wedge h_2 \ge h$     OFF       ON
    3       $h_1 < h \wedge h_2 < h$       OFF       OFF
    4       $h_1 \ge h \wedge h_2 \ge h$   ON        ON

   The temporal causal graph (TCG) is a signal flow diagram that captures the causal and temporal
relations between system variables, and can also be systematically derived from a BG [11]. In our
work, we can efficiently reason about the qualitative behavior of each continuous mode of hybrid
system behavior using the TCG when a fault is detected. Formally, a TCG is defined as follows [2]:
   Definition 1 (Temporal Causal Graph): A TCG is a directed graph that can be denoted by a tuple
$\langle V, L, D \rangle$. $V = E \cup F \cup S \cup M$ is a set of vertices involving effort
variables E, flow variables F, discrete fault events S and measurements M in the hybrid bond graph
model. L is the label set $\{1, -1, =, p, p^{-1}, N, Z, p \cdot dt, p^{-1} \cdot dt\}$. The
propagation type of the first seven labels is instantaneous, and the last two are temporal.
$D \subseteq V \times L \times V$ is a set of edges.
   For lack of space, the TCG for the hybrid two-tank system is not shown in this paper, but the
algorithms for deriving TCGs directly from the bond graph model can be found in [2]. It should be
noted that for each mode of operation, the TCG may need to be re-derived to capture the changes in
the BG model configuration when mode transitions occur.

2.2 Dynamic Bayesian Networks
Assuming that the system is Markovian and time-invariant, we can model the system as a two-slice
temporal Bayes net that illustrates not only the relations between system variables at any time
slice t, but also the across-time relations between the variables [12]. The system variables
consist of four sets, $(X_t, Z_t, U_t, Y_t)$, which denote the continuous state variables, other
hidden variables, input variables and measured variables of the dynamic system, respectively. The
relations between these variables can be generated as equations in the state space formalism. The
across-time links between the successive time slices t and t+1 are derived as transition equations
between the state variables in the system. Since the TCG describes the causal constraints between
system variables, the DBN can be easily constructed from the TCG. More details of this process are
presented in Lerner, et al. [13].

Figure 4 Nominal DBN

   When all the valves are ON and the liquid levels in tank 1 and tank 2 are above the height of
the autonomous pipe R12, the nominal DBN model for the hybrid two-tank system is as shown in
Figure 4. This DBN model, derived from the TCG, has the following random variables: the continuous
state variables $X = \{e_4, e_{12}\}$ represent the pressures at the bottom of each tank, the
input variable $U = \{f_1\}$ denotes the input flow into tank 1, and the measured variables
$Y = \{f_6, f_9, f_{14}\}$ indicate the outflow from tank 1, the flow through the autonomous pipe
R12 and the outflow from tank 2.

Figure 5 Single DBN model for both abrupt and incipient parametric fault

   Since the discrete faults only influence the system mode, but not parameter variables, the DBN
fault model corresponding to a discrete fault is constructed from the TCG in the particular
discrete mode. For parametric faults, the DBN fault model is generated on the basis of the nominal
DBN model by adding a new random variable for each fault candidate. Figure 5 shows the DBN model
with parametric faults represented explicitly for the hybrid two-tank system.
The abrupt fault R1^a and incipient fault R1^i are represented in the same model. When the fault
occurs, the fault parameter R1 becomes an additional state variable that needs to be tracked.

2.3 Modeling Faults
In this paper, we focus on the diagnosis of persistent single faults. We consider incipient or
abrupt parametric faults and discrete faults occurring in hybrid linear systems, as well as sensor
faults. These faults are defined as follows.
   Definition 2 (Incipient parametric fault): An incipient fault profile is defined by a gradual
drift in the corresponding component parameter value p(t) from the fault occurrence time $t_f$.
The incipient fault parameter $p^i(t)$ can be described by:

$$p^i(t) = \begin{cases} p(t), & t < t_f \\ p(t) + d(t) = p(t) + \sigma_p^i\,(t - t_f), & t \ge t_f \end{cases} \qquad (1)$$

   where $d(t) = \sigma_p^i\,(t - t_f)$ is a linear function with a constant slope $\sigma_p^i$
that is added to the nominal parameter value from the time point of fault occurrence. Our approach
to isolation and identification of incipient fault parameters is to calculate this constant slope
$\sigma_p^i$ [8].
   Definition 3 (Abrupt parametric fault): An abrupt parametric fault is characterized by a step
change in the nominal component parameter value p(t) from the fault occurrence time $t_f$. The
abrupt fault parameter $p^a(t)$ is given by:

$$p^a(t) = \begin{cases} p(t), & t < t_f \\ p(t) + b(t) = p(t) + \sigma_p^a \cdot p(t), & t \ge t_f \end{cases} \qquad (2)$$

   where $b(t) = \sigma_p^a \cdot p(t)$ is a step function that gets added to the parameter value
from the time point of fault occurrence. $\sigma_p^a$ is the percentage change in the parameter
expressed as a fraction, and our goal is to estimate this value [8].
   Definition 4 (Discrete fault): A discrete fault manifests as a discrepancy between the actual
and expected mode of a switching element in the model [2].
   Discrete faults occur in discrete actuators, like valves and switches, that operate in discrete
modes (e.g., on and off). Consider a valve: it may be commanded to close, but remain stuck open.
It may also unexpectedly open or close without a command. This type of fault manifests as an
unexpected system mode change, unlike parametric faults, which cause deviations in continuous
behavior.
   Definition 5 (Sensor fault): A sensor fault is a discrepancy between the measurement and the
actual value in the model.
   In this paper, we only consider sensor bias faults, which can be represented as:

$$m^b(t) = \begin{cases} m(t), & t < t_f \\ m(t) + \sigma_m^b, & t \ge t_f \end{cases} \qquad (3)$$

   where m(t) is the true value, and $\sigma_m^b$ is the sensor bias term.

3 Diagnosis Approach of Hybrid Linear Systems
Our integrated diagnosis approach for hybrid linear systems (see Figure 6) combines the Hybrid
TRANSCEND approach [2] with a switched DBN-based PF scheme [14] to diagnose abrupt or incipient
parametric faults, discrete faults and sensor faults in a common framework. It includes three main
parts: system monitoring, qualitative fault isolation (QFI) and quantitative fault isolation and
identification (QFII). These three steps are summarized below.
   Initially, a nominal DBN is constructed from the current TCG model. A hybrid observer uses a
PF-based nominal DBN model to track the system behavior in individual modes of operation. At the
same time, a finite automaton method in the hybrid bond graph scheme implements the CSPECs,
executes controlled and autonomous mode changes, and determines the system model for the hybrid
observer.
   The fault detection module continually monitors for statistically significant deviations
between the observation y(t) and the estimate $\hat{y}(t)$ generated by the hybrid observer. Once
a fault is detected, QFI is triggered to generate the initial fault hypotheses, and to refine them
as additional deviations are observed. When the remaining fault hypothesis set satisfies a
particular condition, the QFII scheme is invoked to run in parallel with QFI. The goal of this
scheme is to refine the fault hypotheses further and estimate the value of the fault parameter.
The following subsections describe these steps in more detail.

3.1 Online Tracking and Fault Detection
Since the hybrid system is piecewise continuous, discrete mode changes of the hybrid system have
to be detected accurately as the continuous behavior of the system evolves. In our work, we have
designed hybrid observers that are based on the nominal DBN-based PF scheme to track the
continuous behavior in individual modes of operation. PF is a general purpose sequential Monte
Carlo method that approximates the belief state using a set of samples or particles, and keeps the
distribution updated as new observations are made over time. Moreover, the PF approach for DBNs
exploits the sparseness and compactness of the DBN representation to provide computationally
efficient solutions, because each measured variable in a DBN typically depends on some but not all
continuous state variables.
   For discrete mode changes, the finite state machine (FSM) for each switched junction determines
mode transitions. Since the continuous behavior and discrete mode changes interact with each other
as the system evolves, the FSM needs to execute controlled or autonomous mode changes. Explicit
controlled changes are relatively simple, but the autonomous mode changes depend on the internal
continuous variables. If mode changes occur, the hybrid observer regenerates the nominal DBN model
from the TCG in the new mode, and uses the PF to continue tracking the system dynamic behavior.
The online tracking algorithm for hybrid systems is shown in Algorithm 1.
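The predict, weight and resample cycle described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scalar single-tank pressure model, the constants `C`, `R` and `DT`, the noise levels, and the function names `transition`, `pf_step` and `track` are all assumptions made for illustration. The mode sequence is taken as given (standing in for the FSM that executes mode changes), and selecting the transition function by mode plays the role of regenerating the DBN for the new configuration.

```python
import math
import random

# Hypothetical tank constants (not from the paper): capacitance, resistance, time step.
C, R, DT = 1.0, 1.0, 0.1
PROCESS_STD, MEAS_STD = 0.02, 0.05


def transition(x, u, mode):
    """Deterministic pressure update for one time step in the given mode."""
    outflow = x / R if mode == "open" else 0.0  # drain valve closed: no outflow
    return x + DT * (u - outflow) / C


def likelihood(y, x):
    """Unnormalized Gaussian likelihood of pressure measurement y given state x."""
    z = (y - x) / MEAS_STD
    return math.exp(-0.5 * z * z)


def pf_step(particles, u, y, mode, rng):
    """One predict / weight / resample cycle; returns (new_particles, estimate)."""
    # Prediction: propagate each particle through the model of the current mode.
    predicted = [transition(x, u, mode) + rng.gauss(0.0, PROCESS_STD) for x in particles]
    # Weighting: score each particle against the observation.
    weights = [likelihood(y, x) for x in predicted]
    if sum(weights) == 0.0:  # all particles implausible: fall back to uniform weights
        weights = [1.0] * len(predicted)
    # Resampling: draw N new particles in proportion to the weights
    # (random.choices normalizes the weights internally).
    resampled = rng.choices(predicted, weights=weights, k=len(predicted))
    return resampled, sum(resampled) / len(resampled)


def track(measurements, inputs, modes, n_particles=500, seed=0):
    """Track the state over time; modes[t] is the mode assumed known at step t."""
    rng = random.Random(seed)
    particles = [rng.gauss(0.0, 0.1) for _ in range(n_particles)]  # prior at t = 0
    estimates = []
    for y, u, mode in zip(measurements, inputs, modes):
        particles, x_hat = pf_step(particles, u, y, mode, rng)
        estimates.append(x_hat)
    return estimates
```

Calling `track(ys, us, modes)` returns one state estimate per time step, where each entry of `modes` is `"open"` or `"closed"`. A full hybrid observer would also have to detect autonomous mode changes from the estimated state, which this sketch leaves out.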




[Block diagram. Blocks shown: System Monitoring (System, Hybrid Observer, Fault Detection, with signals u(t), y(t), $\hat{y}(t)$ and residual r(t)); Qualitative Fault Isolation (Symbol Generation, Hypothesis Generation, Progressive Monitoring); Quantitative Fault Isolation and Identification (Fault Isolation, Fault Identification); Nominal DBN; Temporal Causal Graph; DBN Modeling Faulty Behavior; Hybrid bond graph.]
                                                        Figure 6 The diagnosis architecture
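
The Fault Detection block of Figure 6 has to decide whether the residual r(t) deviates significantly from zero, for which the paper adopts a sliding-window Z-test. A minimal Python sketch of such a test follows. The details are assumptions made for illustration, not the authors' design: a short window of the newest residuals estimates the current mean, a longer window of older residuals estimates the nominal variance, and a fixed threshold is applied. The class name `ZTestDetector` and its parameters are hypothetical.

```python
import math
from collections import deque


class ZTestDetector:
    """Sliding-window Z-test on a residual stream (illustrative sketch)."""

    def __init__(self, mean_window=5, var_window=50, z_threshold=3.0):
        self.recent = deque(maxlen=mean_window)   # newest residuals, used for the mean
        self.history = deque(maxlen=var_window)   # older residuals, used for the variance
        self.z_threshold = z_threshold

    def update(self, residual):
        """Add one residual; return True if the recent mean is significantly nonzero."""
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent sample ages into the variance window.
            self.history.append(self.recent[0])
        self.recent.append(residual)
        if len(self.history) < self.history.maxlen:
            return False  # not enough data yet to estimate the nominal variance
        n = len(self.history)
        mu = sum(self.history) / n
        variance = sum((r - mu) ** 2 for r in self.history) / (n - 1)
        recent_mean = sum(self.recent) / len(self.recent)
        # Standard error of a mean over len(recent) samples, scaled by the threshold.
        bound = self.z_threshold * math.sqrt(variance / len(self.recent))
        return abs(recent_mean) > bound
```

Feeding `update` the residual y(t) minus its estimate at every time step yields a boolean fault flag. Adding a constant offset to the residual from some time point onward, as in the sensor bias model of Equation (3), is a simple way to exercise the detector. Keeping the newest samples out of the variance window is a design choice that delays the moment at which a fault inflates its own detection bound.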
Algorithm 1: Online tracking algorithm
Input: Number of particles, N; an initial DBN model D = {X, Z, U, Y}
For each particle i, from 1 to N do
    Sample $X_0^i$ from the prior probability distribution
    Assign $Y_0^i$ as the measurement at time step 0
End For
For each time-step t > 0 do
    If a controlled or autonomous mode change occurs
        Regenerate a DBN model D' from the TCG in the new system configuration
    End If
    Prediction: Sample each particle in DBN model D'
    Weighting: Compute the weight considering the observation
    Resampling: Normalize the weighted samples, and resample N new samples
    Calculate the estimated continuous state variables $X_t$ and $Y_t$ at time step t
End For

   The fault detection module compares the measured variable y(t) from the sensors with its
estimate $\hat{y}(t)$, computed by the hybrid observer at each time-step t. Ideally, any nonzero
residual $r(t) = y(t) - \hat{y}(t)$ implies a fault, and invokes the qualitative fault isolation
module. However, to account for noise in the measurements and modeling errors, statistical
techniques are employed to determine significant deviations from zero for the residual. In this
paper, a Z-test, which uses a sliding window to compute the residual mean and variance, is adopted
for reliable fault detection with low false-alarm rates [3].

3.2 Qualitative Fault Isolation
The QFI scheme is based on the qualitative fault signature (QFS) method, which was proposed by
Mosterman and Biswas [11] and then extended by Narasimhan and Biswas [1] to hybrid systems.
Daigle, et al. [2] extended this method to model discrete and sensor faults in continuous and
hybrid systems. All of these methods are based on a formal definition of fault signature as
follows:
   Definition 6 (Qualitative Fault Signature): Given a fault f and measurement m, the qualitative
fault signature can be denoted by
$QFS(f, m) = \{(s_1 s_2, s_3),\ s_1, s_2 \in (+, -, 0, *),\ s_3 \in (N, Z, X, *)\}$; where +, -
and 0 indicate an increase, decrease, and no change for residual magnitude or slope. N, Z and X
imply zero to nonzero, nonzero to zero, and no discrete change behavior in the measurement from
the estimate. * denotes the ambiguity in the signatures.

Table 2 Selected fault signatures for the hybrid two-tank system in the mode where all the valves
are open and the liquid level in both tanks is above the height of the autonomous pipe
    Fault      f6          f9          f14
    C1 a       (, X )      (, X )      (0, X )
    C1 i       (0, X )     (0, X )     (0, X )
    R1 a       (, X )      (0, X )     (0, X )
    R1 i       (0, X )     (0, X )     (0, X )
    v1.off     (0, X )     (0, X )     (0, X )
    v2.off     (, X )      (0, X )     (0, X )
    f6         ( 0, )      (00, X )    (00, X )
    f6         ( 0, )      (00, X )    (00, X )

   When measurement deviations are detected, the symbol generator module in the QFI scheme is
triggered to calculate the QFS for the current mode of operation. However, since the fault may
have occurred but not been detected in an earlier mode, the fault hypothesis generation module
rolls back to find the previous modes in which the fault may have occurred, and generates the
fault hypothesis set $F = \{(f_i, \sigma_i, q_i)\}$, where $\sigma_i$ denotes the deviation of the
fault parameter value, and $q_i$ indicates the possible modes. The progressive monitoring module
applies the forward propagation algorithm to continually refine the fault candidates in the fault
hypothesis set. For hybrid systems, the progressive monitoring also has
to include forward propagation through mode changes,                        tively. The abrupt parameter faults are modeled as step
which makes the tracking algorithm much more complex.                       decreases in tank capacity and step increases in pipe resis-
Narasimhan and Biswas [1] discuss the details of the roll                   tances, and are represented as $C_1^a$, $C_2^a$, $R_1^a$, $R_2^a$ and $R_{12}^a$, re-
back and roll forward algorithms used to support the pro-                   spectively. We consider discrete faults for each controlled
gressive monitoring task. When a fault signature is no                      valve, including the valve getting stuck and the valve changing
longer consistent with the observed measurements, and the                   mode without a command. For sensor faults, bias faults
changes cannot be resolved by autonomous mode transi-                       causing abrupt changes in the measurement are considered.
tions, this fault candidate is dropped.                                        We assume that the tanks are initially empty, and start to
   The selected qualitative fault signatures for the hybrid                 fill at a constant rate. In the initial configuration of the
two-tank system in a particular mode are shown in Table 2. For              system, all the valves are set to open. We denote the
incipient parametric faults, the QFS is shown as (0 , s3 ) ,               system mode as qijkm , where i, j and k are the modes of
where  is the first nonzero symbol in the QFS for the
                                                                            valve1, valve2 and valve3 respectively, and m is the mode
abrupt faults with same system parameter. Sensor faults
                                                                            of autonomous pipe R12 . More specifically, the mode of
only affect the measurement provided by the sensor, so
other measurements that are not affected are denoted by 00.                 valves includes S1 : on, S2 : off , S3 : Stuck _ on and S 4 :
                                                                            Stuck _ off . Therefore, the initial mode of the system is
3.3 Quantitative Fault Isolation and Identifica-
                                                                            q1113 . At time step t=6.7s, the liquid level in tank 1 reaches
    tion
                                                                            the height of autonomous pipe R12 . The system mode tran-
Quant-FII scheme will be activated when any of the fol-
lowing conditions are fulfilled: 1) All the measurements                    sitions from q1113 into q1111 . Now the autonomous pipe R12
have deviated from nominal, so the remaining fault candi-                   acts as an outflow pipe for the tank 1 but as flow source for
dates cannot be refined further only by the Qual-FI scheme;                 the tank 2. As system evolves, the liquid level in tank 2 will
2) The number of fault candidates has been reduced to a                     also reach the autonomous pipe at time step t=53s. After
predefined value k; 3) A predefined time l has elapsed. We                  that, system mode changes into q1114 . The experiments
restrict the duration of the Quant-FII scheme to a pre-specified            have been run for a total of 400 s using a sampling period of
value, and assume that no autonomous change occurs dur-                     0.1 s. Gaussian white noise with zero mean and variance
ing this period.                                                            0.018 is added to the measurements.
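The three activation conditions of the Quant-FII scheme can be read as a simple disjunction. The sketch below is illustrative only: the function name, the argument names and the threshold values used in the usage note are assumptions, not part of the original method description.

```python
def should_activate_quant_fii(deviated, num_candidates, elapsed, k, l):
    """Return True when any of the three activation conditions holds.

    deviated:       dict mapping measurement name -> True if the measurement
                    has deviated from nominal (hypothetical representation)
    num_candidates: current size of the fault hypothesis set
    elapsed:        time elapsed since fault detection, in seconds
    k, l:           predefined candidate-count and time thresholds
    """
    # 1) All measurements have deviated, so Qual-FI cannot refine further.
    all_deviated = bool(deviated) and all(deviated.values())
    # 2) The candidate set has been reduced to the predefined size k.
    few_candidates = num_candidates <= k
    # 3) The predefined time l has elapsed.
    timed_out = elapsed >= l
    return all_deviated or few_candidates or timed_out
```

For instance, with hypothetical thresholds k = 2 and l = 10 s, the scheme would be triggered as soon as two candidates remain, even if some measurements are still nominal.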
   The steps describing this scheme are illustrated as fol-
lows: First, a separate DBN faulty model will be con-                       4.1 Incipient Parametric Fault in R1
structed for each remaining fault candidate in the hypothe-                 In this first experiment, we present our diagnosis approach
sis set. Second, we combine each switched DBN faulty
                                                                            for a fault scenario. A 10% rate of increase in pipe R1 is
model with PF method to estimate the system behavior.
Similar to fault detection scheme, a Z-test method is em-                   injected as the incipient fault at time step t = 60s.
ployed to detect the inconsistency between estimated values
from the PF and the measurements. Ideally, only the true
fault model will converge to the observed values of the
measurements. Once a deviation is detected, the cor-
responding fault candidate is dropped. This scheme
runs in parallel with the qualitative fault isolation scheme,
and if a controlled mode change occurs, both schemes
need to reload the DBN model for the new system mode. This is
the main difference between continuous systems and hybrid
systems.
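The candidate-dropping step can be sketched as a sliding-window Z-test applied to the residual between each fault model's PF estimate and the measurement. This is a minimal sketch under stated assumptions: the two-window layout (a short window for the mean, a longer preceding window for the variance), the window sizes and the threshold z are illustrative choices, and the function names are hypothetical.

```python
import math
from statistics import mean, pvariance


def z_test_deviates(residuals, mean_window=5, var_window=50, z=3.0):
    """Flag a significant deviation of the residual mean from zero.

    The mean is taken over the last `mean_window` samples; the variance is
    estimated from up to `var_window` samples that precede them.
    """
    if len(residuals) < mean_window + 2:
        return False  # not enough history to estimate a variance
    recent = residuals[-mean_window:]
    history = residuals[-(mean_window + var_window):-mean_window]
    sigma = math.sqrt(pvariance(history))
    if sigma == 0.0:
        return mean(recent) != 0.0
    # Compare the window mean against z standard errors of that mean.
    return abs(mean(recent)) > z * sigma / math.sqrt(mean_window)


def prune_candidates(residuals_by_candidate, **kwargs):
    """Keep only the fault candidates whose model residuals pass the Z-test."""
    return [name for name, res in residuals_by_candidate.items()
            if not z_test_deviates(res, **kwargs)]
```

A candidate whose residual stays within noise is kept, while one whose recent residual mean drifts away from zero is dropped.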
   If the fault hypothesis cannot be refined further or only a
single parametric or sensor fault candidate is left, fault                   Figure 7 Observed and estimated result for nominal DBN
identification scheme will be activated to identify the abrupt                                       model
or incipient parametric fault in the same model and estimate
the fault parameter value. We can use the PF result of the                     We only consider the measurement M 3 and M 2 for the
fault parameter to calculate the abrupt parameter fault                     flow f 9 through the autonomous pipe R12 and the output
magnitude  pa , incipient parameter fault slope  ip or sensor             flow f14 from tank 2. At time step t=82s, the fault detection
fault bias term bm .                                                       scheme detects an increase in the flow f 9 , resulting in the
                                                                            initial fault hypothesis $F = \{(C_1^a, q_{1114}), (C_1^i, q_{1114}), (R_1^a, q_{1114}), (R_1^i, q_{1114}), (v_2.\mathrm{off}, q_{1414}), (f_9, q_{1114})\}$.
4 Experimental Results
                                                                            At 88.4s, the
To demonstrate the effectiveness of our approach, we apply
                                                                            flow f14 shows an increase above nominal (+). A possible
it to the hybrid two-tank system in Figure 1. In this plant,
the incipient parametric faults are modeled as gradual de-                  autonomous transition is executed for the current inconsis-
crease in tank capacity and gradual increases in pipe resis-                tent candidate $(f_9, q_{1114})$. After that, the first order change
tances and denoted as $C_1^i$, $C_2^i$, $R_1^i$, $R_2^i$ and $R_{12}^i$ respec-        of flow $f_9$ is determined to decrease and increase in mode






$q_{1414}$ and $q_{1114}$ at time steps t=94.8s and 97.7s, respective-                models are shown in Figure 8 and Figure 9 respectively, and
ly, and finally the possible fault hypotheses are                                  the plot of the estimated value of $R_1$ is presented in Figure 10.
$F = \{(C_1^i, q_{1114}), (R_1^a, q_{1114}), (R_1^i, q_{1114})\}$. According to
                                                                                   4.2 Discrete Fault in Valve 2
the fault signatures in mode q1114 , these three candidates
                                                                                   In this subsection, we investigate an unexpected switch
cannot be refined further using observed deviations. Figure                        fault: valve 2 closes without a command at time step t=80s.
7 represents observed and estimated result generated by the
                                                                                   We only consider the flow f 6 and flow f 9 in this experi-
nominal DBN model.
                                                                                   ment.
                                                                                      Figure 11 shows the observed and estimated outputs us-
                                                                                   ing nominal DBN model. The fault is detected at time step
                                                                                   t=80.1s, and the symbol generator reports a decrease in
                                                                                   flow $f_6$. QFI scheme generates the fault hypothesis set
                                                                                   $F = \{(R_1^a, q_{1114}), (R_1^i, q_{1114}), (v_1.\mathrm{off}, q_{4114}), (v_2.\mathrm{off}, q_{1414}),$
                                                                                   $(f_6, q_{1114})\}$. At time step t=80.6s, the symbol generator
                                                                                   determines the flow f 6 to Z in mode q1114 and q4114 , be-
                                                                                   cause of estimated flow fˆ  0 and the observation f  0 .
                                                                                                                     6                                     6

                                                                                   This symbol eliminates all the parametric faults and discrete
   Figure 8 Estimated observation using fault model $C_1^i$                       fault $v_1.\mathrm{off}$ from the current trajectory. At 83.6s, the flow $f_{10}$
                                                                                   shows a positive deviation (+), so the fault candidate
                                                                                   $(v_2.\mathrm{off}, q_{1414})$ is correctly isolated. In this experiment, the
                                                                                   real fault candidate is isolated by the QFI scheme, so the
                                                                                   QFII scheme is not invoked.




  Figure 9 Estimated observation using fault model $R_1^{a/i}$



                                                                                   Figure 11 Observed and estimated result for nominal DBN
                                                                                                           model

                                                                                      We also perform several additional experiments with
                                                                                   different fault types, fault magnitude, noise level and fault
                                                                                   occurrence time, and obtain satisfactory results. For lack of
                                                                                   space, we do not discuss these results in detail.


                                                                                   5 Conclusion
   Figure 10 Estimated value of true fault parameter $R_1^i$                      In this paper, we presented an integrated approach for on-
                                                                                   line monitoring and diagnosis of incipient or abrupt para-
   The QFII scheme is initiated at time step t=72s, and two
                                                                                   metric faults, discrete faults and sensor faults in hybrid
separate DBN fault models using $C_1^i$ and $R_1^{a/i}$ are con-                        linear systems. First, we adopt HBGs to model the
structed. As more measurements are obtained, the Z-tests                           system, and construct the diagnosis models, i.e., the TCGs
indicate a deviation in the measurement estimates obtained                         and the DBN models, from the HBG model in different
by the fault model $C_1^i$, and the estimation generated by                         modes. A PF method based on the switched DBN model is
                                                                                   employed for online monitoring of the system dynamic
possible true fault model R1 a / i is consistent with mea-
                                                                                   behavior. Once the discrete finite automaton in the HBGs
surement. The quantitative fault identification part esti-                         detects the controlled or autonomous mode changes, HBGs
mates the value of R1 , and determines that R1 indeed has an                       will regenerate the TCGs and DBN model in new mode.
incipient fault. While the actual fault slope is 0.1, the esti-                    These modeling approaches guarantee that the hybrid sys-
mated slope is 0.1009. The estimation using two faulty                             tems can be tracked correctly.






   Then, we demonstrate that we can accommodate discrete                the 5th Symposium on Fault Detection, Supervision
faults and sensor fault models into the TCG and DBN                     and Safety for Technical Processes, 1125-1131, 2003
models that represent dynamic system behavior. As a result,        [4] Dearden, R. and Clancy, D. Particle filters for real-time
our model-based approach can diagnose parametric, dis-                  fault detection in planetary rovers. In Proceedings of
crete and sensor faults within the same modeling and                    the Thirteenth International Workshop on Principles of
tracking framework. Finally, QFI scheme using Hybrid                    Diagnosis, 2002
TRANSCEND approach and QFII scheme by means of
                                                                   [5] Hofbaur, M. W. and Williams, B. C. Hybrid estimation
switched DBN-based PF approach are combined together                    of complex systems. Systems, Man, and Cybernetics,
into a common framework, which provides more discri-
                                                                        Part B: Cybernetics, IEEE Transactions on, 34(5):
minatory power and less computational complexity.
                                                                        2178-2191, 2004.
   This work builds on approaches presented in
[1][2][11][14]. [1] extends our previous work [11] from            [6] Levy, R., Arogeti, S., and Wang, D.. An integrated
continuous systems to hybrid systems, but previous diag-                approach to mode tracking and diagnosis of hybrid
nosis framework could only handle abrupt parametric faults.             systems. IEEE Transactions on Industrial Electronics,
Soon after, Daigle [2] further extended the work in [1] to              61(4), 2024–2040, 2014.
capture discrete faults and sensor faults. Roychoudhury            [7] Zhao, F., Koutsoukos, X., Haussecker, H., Reich, J.
[8][14] combined a qualitative fault isolation scheme with              and Cheung, P. Monitoring and fault diagnosis of
an efficient DBN approach to diagnose both abrupt and                   hybrid systems. Systems, Man, and Cybernetics, Part
incipient parametric faults for continuous systems. This                B: Cybernetics, IEEE Transactions on, 35(6),
paper proposes a comprehensive diagnosis methodology,                   1225-1240, 2005
which extends DBN-based PF observer [8][14] to track               [8] Roychoudhury, I., Biswas, G., Koutsoukos, X..
behavior of linear hybrid systems within and across mode                Comprehensive diagnosis of continuous systems using
changes, and combines qualitative fault isolation scheme in             dynamic bayes nets. Proceedings of the 19th
[2] with PF-based quantitative fault isolation and identifi-            International Workshop on Principles of Diagnosis.
cation scheme in [8][14] to diagnose multiple fault types.              151-158, 2008
   This method has been successfully applied to a hybrid           [9] Karnopp, D. C., Margolis, D. L. and Rosenberg, R.
two-tank system, and experimental results demonstrate the               C. System Dynamics: Modeling, Simulation, and
effectiveness of the approach. However, since the applica-              Control of Mechatronic Systems. Wiley. 2012
tion in this paper is only a relatively simple hybrid linear
                                                                   [10] Roychoudhury, I., Daigle, M. J., Biswas, G. and
system, our future work will scale up this methodology for
                                                                        Koutsoukos, X. Efficient simulation of hybrid systems:
more realistic linear and nonlinear hybrid systems. More-
                                                                        A hybrid bond graph approach. Simulation, 87(6),
over, distributed diagnostics techniques can efficiently
                                                                        467-498, 2011.
decrease the computational complexity for complex real
systems, so this is also a future research direction [16].         [11] Mosterman, P. J., and Biswas, G. Diagnosis of
                                                                        continuous valued systems in transient operating
Acknowledgments                                                         regions. Systems, Man and Cybernetics, Part A:
                                                                        Systems and Humans, IEEE Transactions on, 29(6),
This research was supported by China Scholarship Council                554-565, 1999.
under contract number 201306020068. The work was per-
                                                                   [12] Murphy, K. P. Dynamic bayesian networks:
formed in Prof. Biswas’ lab at the Institute for Software
                                                                        representation, inference and learning. PhD thesis,
Integrated Systems (ISIS), Vanderbilt University, USA                   University of California, Berkeley. 2002
                                                                   [13] Lerner, U., Parr, R., Koller, D. and Biswas, G.
References
                                                                        Bayesian fault detection and diagnosis in dynamic
[1] Narasimhan, S. and Biswas, G. Model-based diagnosis                 systems. In AAAI/IAAI, 531-537, 2000.
    of hybrid systems. Systems, Man, and Cybernetics,              [14] Roychoudhury, I. Distributed diagnosis of continuous
    Part A: Systems and Humans, IEEE Transactions on,                   systems: Global diagnosis through local analysis. PhD
    37(3): 348-361, 2007.                                               thesis, Vanderbilt University. 2009
[2] Daigle M J. A qualitative event-based approach to fault        [15] Narasimhan, S. Model-based diagnosis of hybrid
    diagnosis of hybrid systems. PhD thesis, Vanderbilt                 systems. PhD Dissertation, Vanderbilt University.
    University, 2008                                                    Department on Electrical Engineering and Computer
[3] Biswas, G., Simon, G., Mahadevan, N., Narasimhan,                   Science, August 2002.
    S., Ramirez, J. and Karsai, G. A robust method for             [16] Roychoudhury, I., Biswas, G., & Koutsoukos, X.
    hybrid diagnosis of complex systems. Proceedings of                 (2009). Designing distributed diagnosers for complex
                                                                        continuous systems. Automation Science and
                                                                        Engineering, IEEE Transactions on, 6(2), 277-290.








     ADS2 : Anytime Distributed Supervision of Distributed Systems that Face
                     Unreliable or Costly Communication

              Cédric Herpson∗ and Vincent Corruble and Amal El Fallah Seghrouchi
          Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France
                           CNRS, UMR 7606, LIP6, F-75005, Paris, France
                                  e-mail: firstname.lastname@lip6.fr


                        Abstract                                        Within the Dem@tFactory1 project, our objective is thus
                                                                     to improve the supervision of an existing digitizing chain
     The purpose of a supervision system is to detect,               distributed over several sites (see Fig 1). Different faults
     identify and repair any fault that may occur in the             – single or multiple – can occur and alter or prevent the
     system it supervises. Nowadays industrial pro-                  processing of the documents (e.g. a scanner quits working,
     cesses are mainly distributed, while their supervision          a disruption of the connection between different sites halts
     systems are still centralized. Consequently, when               or corrupts a data transfer, an OCR software is poorly set
     communications are disrupted, it slows down or                  and generates unexploitable results, etc.).
     stops the supervision process. Increasing produc-
     tion rates make this dependence on the state of the
     communications no longer acceptable. To allow the
     anytime supervision of such systems, we propose
     a distributed approach based on a multi-agent sys-
     tem where each supervision agent autonomously
     handles both diagnosis and repair at a given lo-
     cation. This degree of delegation, never consid-
     ered in the literature or in industry outside
     of the theoretical framework, requires over-
     coming several difficulties: How can one agent au-
     tonomously make a diagnosis with dynamically
     arriving information? How can several agents
     coordinate and reach a consensus on a given
     diagnosis or repair with asynchronous communi-
     cation ? Finally, how to allow a human to trust                 Figure 1: In red, the communication links between the main
     the decisions of such a system? This paper devel-               sites of the digitization chain of the Dem@tFactory project.
     ops our proposal along these three axes and evalu-              In yellow, the links with the current (centralised) supervi-
     ates ADS2 using an industrial case-study. Exper-                sion system.
     iments demonstrate the relevance of our approach
     with an overall reduction of the supervised system                 Centralized supervision systems are currently the most
     down-time of 34%.                                               common in industry. However, they do not perform well
                                                                     in asynchronous contexts. Indeed, communication malfunc-
                                                                     tions between the supervision system and the geographically
1 Introduction                                                       distributed regions of the supervised system delay the repair
                                                                     and prevent a quick return to normalcy, even though
Supervision systems were initially monitoring tools whose            a number of malfunctions may have local predefined repair
role was limited to collecting and displaying information for their  procedures available. The unbounded communication time
interpretation and use by the human expert. Today, the ad-           between the supervision and the supervised system is the
vent of complex and physically distributed systems leads to          main reason for this problem.
a semantic shift from supervision tools to supervision sys-             To overcome this lack of robustness when facing unreli-
tems. Indeed, as the complexity of systems increases, hu-            able communications and to reduce the supervised system
mans can no longer process the flow of information arriving          down-time, we present in this article ADS2 : a multi-agent
at each instant. The need to minimize the down-time and              architecture where each supervision agent autonomously
to improve system effectiveness requires the delegation of           handles both diagnosis and repair on a given location. The
some of the decision-making power of the human supervi-              proposed architecture is composed of three mechanisms: a
sor to the supervision system. This requirement has led to           decision mechanism, a coordination and consistency recov-
the (re)birth of a research community around the notions of
autonomic computing [1] and self-* systems [2]. Our work                1
                                                                         Project of the French R&D initiative Cap Digital federating 4
lies within this context.                                            industrials and 3 laboratories.






ery mechanism and an intertwining mechanism. The deci-                    messages between agents can be lost but not corrupted. The
sion mechanism tackles the dynamicity of the information                  agents are supposed to be reliable (no Byzantine behaviour).
available to an agent in order to make a diagnosis. The coor-             Finally, we consider that the simultaneous occurrence of dif-
dination and consistency mechanism deals with the problem                 ferent faults does not result in phenomena of masking ob-
of reaching a consensus between several agents on a global                servables.
diagnosis (or repair) in a context of asynchronous commu-
nications. Finally, the intertwining mechanism addresses the              2.2 Fault model and repair plan
problem of the size of the search-space in a multiple-faults              Let F be the set of known faults of a system S and R be the
context.                                                                  set of existing repair plans. The signature of a fault f is a
   In this article, we first present our fault and repair model           sequence of observable events generated by the occurrence
and the various assumptions made in section 2. We then de-                of f . The set of signatures of a given fault f is Sig(f ).
scribe the three mechanisms of our multi-agent architecture                  To be able to represent any temporal dependencies, each
in sections 3 to 5. We then demonstrate the viability of our              fault is modeled as a t-temporised Petri net (Fig. 3). Each
proposal with experiments in section 6. Finally, we discuss               fault is supposed to be repairable, that is to say that there ex-
related work in section 7 before concluding.                              ists at least one partially ordered sequence of atomic repairs
                                                                          rk that repairs it (a repair plan).
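To make the notion of a temporally constrained signature concrete, the following sketch checks whether a timestamped observation sequence contains a given signature. It is a simplification and not the t-temporised Petri net model used in the paper: a signature is flattened into an ordered list of (event, max_delay) pairs, where max_delay is the maximum delay in seconds after the first event of the signature, or None when unconstrained. The function name and this representation are assumptions made for illustration.

```python
def matches_signature(signature, observations):
    """Check whether `observations` contains `signature` as an ordered subsequence.

    signature:    non-empty list of (event, max_delay) pairs; max_delay is
                  relative to the first event of the signature, or None
    observations: list of (event, timestamp) pairs sorted by timestamp
    """
    first_event = signature[0][0]
    # Try every occurrence of the first event as the temporal anchor.
    for start, (event, t0) in enumerate(observations):
        if event != first_event:
            continue
        pos = 1
        for ev, ts in observations[start + 1:]:
            if pos == len(signature):
                break
            expected, max_delay = signature[pos]
            if ev == expected and (max_delay is None or ts - t0 <= max_delay):
                pos += 1
        if pos == len(signature):
            return True
    return False
```

With this representation, a signature requiring o2 then o3 within 5 seconds of o1 would be written `[("o1", None), ("o2", 5.0), ("o3", 5.0)]`.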
2 A Multi-Agent Architecture for the
  Supervision of Distributed Systems
Our architecture comes within the scope of fault-based
model2 approaches with spatially distributed knowledge.
The supervision process is distributed among several au-
tonomous agents having each a local view of the system to                 Figure 3: Let f be a fault that possesses 2 signatures.
be supervised, and endowed with diagnosis and repair capa-                $Sig(f) = \{o_1 o_5 ;\; o_1 [o_2, o_3]^{[t_{o_1},\, t_{o_1}+5'']}\}$. The $o_i$ are the
bilities. The supervised system is partitioned into regions,              events observed on the supervised system. The $t_{o_i}$ indicate
each of which is supervised by one agent. As illustrated in Fig.          the temporal constraints. Thus, $[t_{o_1}, t_{o_1} + 5'']$ constrains the
2, the supervision agents ($A_i$) exchange information in order           sequence of observations $[o_2, o_3]$ to appear within the 5 sec-
to establish a diagnosis and a repair consistent with the ob-             onds that follow the occurrence of $o_1$ for f to be recognized.
servations ($O_j$) they get from the various units of the super-
vised system ($U_k$). The links between the square units rep-                The supervised system is partitioned into regions $rg_j$.
resent the standard workflow of the supervised system. The                Each supervision agent is associated with one unique region
dashed arrows represent the fact that some elements may be                and knows the models of the faults that may occur in the re-
reprocessed if the quality is not sufficient. The arrows be-              gion it oversees. However, a fault can cover several regions.
tween the units and the agents represent the communication                In that case, an agent only knows the part of the model that
links used to transmit alarm logs. The remaining links rep-               concerns its region. Its model is completed with the names
resent the communications between the supervision agents.                 of the agents responsible for the other regions. This hy-
                                                                          pothesis allows us to model workflows involving different actors
                                                                          that do not share their data.
                                                                          $$Sig(f) = \{o_1^{rg_b}\, o_2^{rg_b}\, o_3^{rg_c}\, o_4^{rg_c}\} \Longrightarrow \begin{cases} Sig_{A_{rg_b}}(f) = o_1\, o_2\, A_{rg_c} \\ Sig_{A_{rg_c}}(f) = A_{rg_b}\, o_3\, o_4 \end{cases}$$
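The projection of a global signature onto the local views of the agents can be sketched as follows. This is an illustrative reading of the example above, not the authors' implementation: the function name, the (event, region) pair representation and the agent naming scheme are assumptions. Events belonging to another region are replaced by the name of the agent responsible for that region, and consecutive references to the same remote agent are collapsed.

```python
def project_signature(signature, agent_of_region):
    """Split a global signature into one local signature per involved agent.

    signature:       ordered list of (event, region) pairs
    agent_of_region: dict mapping region name -> agent name
    """
    # Agents involved in this fault, in order of first appearance.
    involved = []
    for _, region in signature:
        agent = agent_of_region[region]
        if agent not in involved:
            involved.append(agent)

    local = {}
    for agent in involved:
        seq = []
        for event, region in signature:
            owner = agent_of_region[region]
            item = event if owner == agent else owner
            # Collapse consecutive references to the same remote agent.
            if owner != agent and seq and seq[-1] == item:
                continue
            seq.append(item)
        local[agent] = seq
    return local
```

Applied to the four-event signature of the example, the agent of region rgb would obtain its two local events followed by a reference to the agent of rgc, and conversely for the agent of rgc.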
                                                                             Besides obtaining the fault models, the issue of defining a
                                                                          global precedence relation between events that occur within
                                                                          the supervised system remains. Indeed, the different regions
                                                                          share no common clock. It is therefore necessary to
                                                                          add to each agent a stamping mechanism that allows it to recre-
Figure 2: Example of our supervision systems deployed on                  ate this order relation. We will not detail here the concept
a workflow.                                                               of distributed clocks. We consider in the following that the
                                                                          agents are able to recreate this partial-order relation.
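The paper does not specify which stamping mechanism is used. As one standard possibility, vector clocks allow agents without a common clock to recover a partial order between events. The sketch below is an assumption introduced for illustration; the class and function names are hypothetical.

```python
class VectorClock:
    """Per-agent logical clock used to stamp observations and messages."""

    def __init__(self, agent, agents):
        self.agent = agent
        self.clock = {a: 0 for a in agents}

    def stamp(self):
        """Stamp a local observation or an outgoing message."""
        self.clock[self.agent] += 1
        return dict(self.clock)

    def receive(self, remote_stamp):
        """Merge the stamp carried by an incoming message, then tick."""
        for a, c in remote_stamp.items():
            self.clock[a] = max(self.clock.get(a, 0), c)
        self.clock[self.agent] += 1
        return dict(self.clock)


def happened_before(s1, s2):
    """True if the event stamped s1 causally precedes the event stamped s2."""
    keys = set(s1) | set(s2)
    return (all(s1.get(k, 0) <= s2.get(k, 0) for k in keys)
            and any(s1.get(k, 0) < s2.get(k, 0) for k in keys))
```

Two stamps for which `happened_before` is false in both directions correspond to concurrent events, which is why the recovered relation is only a partial order.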

2.1 Assumptions                                                           2.3 Diagnosis and multiple faults
                                                                          During the period of time [t − ∆t , t], agent Ai collects
We consider that communications are asynchronous and that
                                                                          a sequence of observations seqObsAi (t, ∆t ) generated by
there is no upper bound on transmission delay. We assume
                                                                          the occurrence of faults on the system. However agent Ai
that the messages exchanged between supervised units may
                                                                          does not know which faults have occurred. It thus anal-
be lost or corrupted, and that some units are not supervised
                                                                          yses seqObs in order to determine the set of all faults
(e.g. unit U2 on Fig. 2). This assumption is based on the
                                                                          f pAi (t, ∆t ) whose signatures partially or totally match ele-
fact that a complex industrial process commonly involves
                                                                          ments of seqObs. A diagnosis dg is a set of faults that can
different actors that do not share their supervision informa-
                                                                          explain seqObs. Dg is the set of all possible diagnoses of
tion3 . Moreover, we assume that the observations and the
                                                                          seqObs.
   2
      No model of the system’s correct behaviour is available. The
system can only use fault models, known a priori or dynamically          2.4 Fault cost and repair cost
learned from observation of the system.                                  Finally, each fault f (respectively each repair plan rp(f))
    3
      Subcontractors in the case of the Dem@tFactory project.            is associated with a cost of malfunction which depends on






the fault duration, Ctdysf(f, t) (resp. a cost of execution                     these explanations according to available information and to
CtEx(rp(f))). The cost of a diagnosis dg for the supervision                    the constraints we chose to focus on (e.g. the most probable
system is the result of the aggregation of the respective                       explanation). After this step, the first element of Dg is the
costs of the faults that compose it. In the general case:                       diagnosis considered as the most relevant at the current time.
                                                                                It is then necessary to estimate its cost.
   $Ct_{dysf}(dg_i, t) = Aggreg_{f_j \in dg_i}(Ct_{dysf}(f_j, t))$   (1)           The cost of the immediate repair Ct(Dimmopt) must take
   Similarly, the execution cost of a repair plan rp associated                 into account the execution cost of the repair plan associated
with a given diagnosis depends on the aggregation of the                        with the retained diagnosis (CtEx, equation 2), as well as a
respective costs of the repairs that compose it. Thus, in the                   cost representing the potential error of this decision,
case where the repair plan depends directly on the faults:                      CtErr. Indeed, if the only cost considered is that of the
                                                                                execution of the repair plan, the final decision (step 5)
   $Ct_{Ex}(rp(dg_i)) = Aggreg'_{f_j \in dg_i}(Ct_{Ex}(rp(f_j)))$   (2)            will always favour an immediate action over a delayed one,
                                                                                because of the additional waiting cost of the delayed
3 Agent Decision Model                                                          action.
We consider highly dynamic systems. Consequently, the
information available to an agent at a given time can be                   $Ct(rp(dg_i)) = Ct_{Ex}(rp(dg_i)) + Ct_{Err}(dg_i \mid Dg \setminus \{dg_i\})$   (3)
insufficient to determine with certainty which action to
select. A supervision agent thus has to determine the                      The computation of the error cost CtErr relies on the
optimal decision (Dopt) between the immediate triggering                assumption that the correct diagnosis (and so the correct
of the plan made under uncertainty (Dimmopt) and a                      repair) belongs to the sorted list Dg of potential
delayed action (Ddelayopt), which lets it wait and                      diagnoses. Thus, in case of misdiagnosis when selecting the
communicate with other supervision agents during k time                 first diagnosis dg1 of Dg, the system loses a time equal to
steps. This waiting time can yield information that reduces             the execution time of the first repair plan,
uncertainty and thereby improves decision-making. The                   CtExecTime(rp(dg1)), to which is added the execution cost
drawback is that the elapsed time may have a significant                of the newly chosen repair plan, CtEx(rp(dg2)), associated
negative impact on the system. The expected potential gain              with the second diagnosis of Dg. As this second choice may
in terms of accuracy must be balanced against the risks taken.          also turn out to be an error, we define CtErr recursively
                                                                        on Dg. Thus:
   Let Ct(x) be the cost of an action x and Ctwait(k) the cost
related to the extra time k before selecting a repair plan. The
decision-making process of each supervision agent works as              $Ct_{Err}(dg_1 \mid []) = 0 \quad \text{(no other diagnosis remains in Dg, the diagnosis is correct)}$
follows:                                                                $Ct_{Err}(dg_1 \mid Dg \setminus \{dg_1\}) = P(dg_1 \mid Dg \setminus \{dg_1\}) \times \big( Ct_{ExecTime}(rp(dg_1)) + Ct_{Ex}(rp(dg_2)) + Ct_{Err}(dg_2 \mid Dg \setminus \{dg_1, dg_2\}) \big)$

 1. Observation gathering

 2. Computation of the different sets of faults that can
    explain the current observations: Dg (set of diagnoses)
                                                                       with P (dg1 |Dg\{dg1 }), the probability that choosing dg1
 3. Determination of the immediate repair Dimmopt                      as the final diagnosis is an error.
    based on available information and on the constraints
    we chose to focus on (most probable explanation, law                3.2 Delayed repair Ddelayopt
    of parsimony, worst case, ...) and computation of its               At time t, an agent knows the set fpAi(t, ∆t) of the faults
    estimated cost Ct(Dimmopt)                                          that may be occurring in the region it supervises. The
 4. At time t, an agent knows the set fpAi(t, ∆t) of the faults         different fault models are represented using t-temporised
    that may be occurring in the region it supervises.                  Petri nets (Fig. 3 page 2). The agent is thus able to predict,
    Knowing their signatures, the agent is able to predict,             for each fault of fpAi(t, ∆t), the set of observables that
    for each fault of fpAi(t, ∆t), the set of observables               can be expected to appear during the time interval [t, t + k],
    that can be expected to appear during the time interval             with k an a priori fixed parameter. Note that the agent uses
    [t, t + k], with k an a priori fixed parameter. The agent           the current transmission duration (computed over the
    uses this information to compute the waiting cost                   interval [t − ∆t, t]) to determine the set of potential
    Ctwait(k), the expected potential gain of a delayed                 observations.
    repair Ddelayopt and its associated cost Ct(Ddelayopt).
                                                                           From this information, the agent builds the tree
 5. Choice between the immediate repair Dimmopt and                     representing the set of all possible futures up to the current
    the delayed repair Ddelayopt                                        time plus k units of time, $Arb^{possibles}_{A_i}(k)$. Each node
                                                                        of the tree is associated with a set of observations and
                                                                        represents one possible future (Fig. 4 below). The agent then
  This algorithm is executed at each time step by each                  computes, for each node of the tree, the set of diagnoses
agent when faults occur. The value k represents an upper                that explain this future (Dg′).
bound on the delay, as an agent's decision is updated each
time an observation is received. The following subsections                 The agent can then compute, for each possible future,
detail steps 3 and 4, i.e. the determination of the immediate           the immediate decision considered as optimal. At time
and delayed repairs and of their respective costs.                      t, the determination of the delayed decision with horizon
                                                                        k (Ddelayopt) involves choosing between the various
3.1 Immediate repair Dimmopt                                            possible situations. This choice is made by sorting
The knowledge of the different signatures of faults allows              the first elements of each Dg′ of the tree of possible
us to establish a list of potential diagnoses Dg. We sort               futures against each other using the same criterion as the






                                                                            The multi-Paxos algorithm, initially developed for
                                                                         reaching an agreement in a network of unreliable
                                                                         processors, falls into this category. The interesting aspect
                                                                         of this algorithm is that it was designed to withstand halt
                                                                         failures (with the possibility of recovery) of a number of
                                                                         processes, including the coordinator. Its very low number of
                                                                         assumptions makes it operational in an environment with
                                                                         unreliable communications. These properties make it
                                                                         particularly suited to multiagent systems. Using
                                                                         multi-Paxos, each agent is able to initiate, join or leave
                                                                         a coalition.
Figure 4: Illustrative example of a tree of the possible
futures.                                                                    The fact that there is no upper bound on the time needed
                                                                         to reach a consensus will inexorably lead to some unilat-
one used to identify the immediate repair in the sub-section             eral decision-making by agents or agent groups in case of
3.1.                                                                     communication disruption. This feature of our system guar-
                                                                         antees the avoidance of deadlock situations when communi-
   Once the delayed decision is identified, its cost                     cations are too unstable to let the agents reach a consensus.
Ct(Ddelayopt ) is established using Equation (3). We then                However, this ability requires the introduction of an algo-
have to add to this cost the waiting cost Ctwait . This waiting          rithm to restore a consistent view of the system state by all
cost represents the consequences of the faults on the super-             agents.
vised system during the time where no action was triggered.
The computation of the waiting cost depends on the respec-               4.2 System consistency
tive costs of the malfunctions associated to the remaining               Algorithm 1 works in the manner of producer-consumer
diagnoses and of the elapsed time.                                       with the decision-making process introduced in section 3.
                                                                         The two algorithms share, within an agent, a common
   $Ct_{wait}(k) = Aggreg_{dg_i \in Dg}(Ct_{dysf}(dg_i, k))$   (4)          inconsistency queue Finc. When a coalition is left by at least
                                                                         one agent before reaching a consensus (due to a communi-
                                                                         cation breakdown or to an agent’s decision), the members of
4 Distributed Supervision and System                                     the coalition store their respective decision-making context
  Consistency                                                            (the current sequence of observables, the set of considered
In the previous section, we addressed the problem of one                 explanations and the list of agents which belong to the
agent making a decision. However, as each agent has a                    coalition) into their own potential inconsistency queue Finc.
local view of the system, a decision about a diagnosis and/or a             The consistency maintenance algorithm is available
repair frequently requires information and knowledge from                within each agent as a behaviour that continuously observes
other supervision agents. It therefore becomes necessary to              the state of the queue Finc. When an entry is added to Finc,
reach a consensus on the decision to make.                               the algorithm is automatically triggered.
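
Before turning to consensus, the single-agent decision of Section 3 can be summarised in a few lines of Python. This is only a minimal sketch under stated assumptions: costs are plain floats, aggregation is additive by default, the recursive term of the error cost is taken to lie inside the factor multiplied by the error probability, and all function names and numbers are hypothetical illustrations rather than part of the original system.

```python
from typing import Callable, Iterable, Sequence


def aggregate_cost(costs: Iterable[float],
                   aggreg: Callable[[Iterable[float]], float] = sum) -> float:
    """Equations (1), (2) and (4): aggregate individual costs (additive by default)."""
    return aggreg(costs)


def error_cost(exec_time: Sequence[float], exec_cost: Sequence[float],
               p_error: Sequence[float]) -> float:
    """Recursive error cost CtErr over the sorted list Dg.

    Index i refers to the i-th diagnosis of Dg:
      exec_time[i] ~ CtExecTime(rp(dg_i)), exec_cost[i] ~ CtEx(rp(dg_i)),
      p_error[i]   ~ probability that choosing dg_i is an error.
    """
    if len(exec_cost) <= 1:
        return 0.0  # no alternative diagnosis left: base case of the recursion
    rest = error_cost(exec_time[1:], exec_cost[1:], p_error[1:])
    # Assumed grouping: the recursive term is weighted by the error probability.
    return p_error[0] * (exec_time[0] + exec_cost[1] + rest)


def repair_cost(exec_time: Sequence[float], exec_cost: Sequence[float],
                p_error: Sequence[float]) -> float:
    """Equation (3): execution cost of the first repair plus its error cost."""
    return exec_cost[0] + error_cost(exec_time, exec_cost, p_error)


def choose_decision(immediate: float, delayed: float, wait: float) -> str:
    """Step 5: act now unless waiting k time steps is expected to be cheaper."""
    return "immediate" if immediate <= delayed + wait else "delayed"


# Hypothetical numbers for a sorted list Dg of three diagnoses.
times, costs, probs = [2.0, 3.0, 1.0], [10.0, 20.0, 30.0], [0.25, 0.5, 0.1]
now = repair_cost(times, costs, probs)
print(now, choose_decision(now, delayed=12.0, wait=aggregate_cost([1.0, 2.0])))
```

Passing another callable as `aggreg` (for instance `max`) would model a non-additive aggregation, which the general form of equations (1), (2) and (4) leaves open.
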
However, distributed supervision works in a context of
asynchronous and unbounded communication. Under these                    Algorithm 1 Check consistency
constraints, the Fischer-Lynch-Paterson theorem [3] states               Require: Pattern observer on Finc
that reaching a consensus between different components                   if Finc ≠ ∅ then
cannot be guaranteed.                                                        Try to contact Finc.getFirst().getCoalition()
   To circumvent this difficulty, the literature on supervision              if contact successful then
frequently introduces hypotheses on the quality of the                           Send Finc.getFirst()
communications. As our work aims to operate under real-life                      Receive the other agents' decision contexts
hypotheses, we make none regarding the (un)reliability                           Pair the local decision context with the others
of the communication. We discuss in this section the use of                      if pairing is ok then
the multi-Paxos algorithm [4] to reach a consensus when the                          Finc.removeFirst()
state of the communication allows it, and propose a                              else
consistency mechanism to restore a common view of the system                         Start a new Paxos instance
among the agents after a unilateral decision taken by some of                    end if
them.                                                                        end if
                                                                         end if
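
The control flow of Algorithm 1 can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: the `DecisionContext` fields follow the decision-making context described in Section 4.2, while the `contact`, `pairing_ok` and `start_paxos` callbacks are hypothetical stand-ins for the agent's communication layer, the pairing test and the consensus layer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class DecisionContext:
    observations: Tuple[str, ...]   # current sequence of observables
    explanations: FrozenSet[str]    # set of considered explanations
    coalition: FrozenSet[str]       # agents that belonged to the coalition


def check_consistency(
    finc: Deque[DecisionContext],
    contact: Callable[[DecisionContext], Optional[List[DecisionContext]]],
    pairing_ok: Callable[[DecisionContext, List[DecisionContext]], bool],
    start_paxos: Callable[[FrozenSet[str]], None],
) -> str:
    """One pass of the consistency check on the inconsistency queue Finc."""
    if not finc:
        return "empty"
    local = finc[0]
    # Send the local context and receive the other agents' contexts;
    # the hypothetical callback returns None when the coalition is unreachable.
    others = contact(local)
    if others is None:
        return "unreachable"        # entry stays queued for a later attempt
    if pairing_ok(local, others):
        finc.popleft()              # views are consistent: drop the entry
        return "resolved"
    start_paxos(local.coalition)    # views differ: start a new Paxos instance
    return "paxos"


ctx = DecisionContext(("a", "b"), frozenset({"f1", "f3"}), frozenset({"A1", "A2"}))
queue = deque([ctx])
print(check_consistency(queue, lambda c: [ctx],
                        lambda mine, theirs: all(t == mine for t in theirs),
                        lambda coalition: None))
```

As in Algorithm 1, the entry is removed only when the pairing succeeds; in the other cases it stays at the head of the queue.
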

4.1 Consensus algorithm                                                     This algorithm lets each agent find a match between its
In the general case, a consensus protocol must satisfy the               actions and those selected by the other members of the
following properties: (Agreement) All correct processes                  coalition. Thus, in case of faults due to past inconsistent
decide the same value. (Integrity) Every process decides at              decisions taken by the agents, they are able to trigger a
most once. (Validity) Each decided value belongs to the                  sequential diagnosis and to discriminate initial disturbances
set of proposed values. (Termination) Every correct process              from the consequences of their decisions.
eventually decides in a finite time. However, in the context                When the potential inconsistency queue of an agent
of the Fischer-Lynch-Paterson theorem, the supervision system            contains an element, the agent tries to resolve it. The agent
can only offer a “best-effort” guarantee, i.e. ensure that               tries to contact each of the agents of the coalition concerned
the consensus can be reached, but only if the system is stable           with this potential inconsistency Pinc. If these agents are
for a sufficiently long period of time [5].                              able to communicate (the communications are restored),






they will exchange their respective decision-making                     faults which are not real and which prevent the repair of
contexts. By comparing them, they will be able to determine             the system. We call these situations virtual deadlocks.
whether the decisions made locally by the different groups              To disambiguate these situations, we added a relation
of agents are consistent with one another. If this is the case          of innocuousness I to these definitions. Thus, for a fault
(the faults are repaired, the system is in a stable and                 f, a repair plan r(f), its repair conflict set $C^R(f)$ and
correct state), each agent removes Pinc from the queue.                 considering the current state of the system, I returns the set
Otherwise, the subset of agents involved initiates a new                of sets of faults belonging to $C_t^R$ on which the execution of
coalition in order to resynchronize their respective views of           repair plan r(f) leaves the system unchanged. The result
the system and make a decision that is consistent at the scale          of this innocuousness relation is the set of disambiguation
of the system. If communications are too unstable (or too               under repair, denoted by $D^R(f)$. Taking into account all
costly), this consensus will not be reached, which results in           this information, we are then able to propose an algorithm
adding a new entry to Finc.                                             to plan the order of the repairs and to resolve some conflicts.
   Restoring the consistency of the system state as perceived           We illustrate how it operates below:
by the supervision agents again relies on the stability of
the communication links for a sufficient amount of time.                Example: Let F = {f1, f2, f3} with Sig(f1) =
                                                                        Sig(f2 ) = {a} and Sig(f3 ) = {b}. Rp(f1 ) = {r1 },
5 Intertwining Diagnosis and Repair Stages                              Rp(f2) = {r2} and Rp(f3) = {r3}. Moreover, we know
In the previous sections, we endowed the supervision agents             that $C^R(f_1) = \{\{f_3\}\}$ and that $D^R(f_2) = \{\{f_2\}, \{f_1\}\}$.
with decision-making and coordination mechanisms. These                 As the signatures of f1 and f2 are identical, it follows that
abilities allow the agents to dynamically adapt their                   $C^D(f_1) = \{\{f_2\}\}$ and $C^D(f_2) = \{\{f_1\}\}$. We assume that
behaviours to the current state of the communications and of            an agent detects the observables a ∧ b.
the supervised system. In case of uncertainty regarding the
decision to make, the agents are thus able to explore the               Figure 5: Diagnosis state for the agent: dg1 = {f2, f3},
solution space, collectively as well as individually. However,          dg2 = {f1, f3}. [Diagram nodes: F {a, b}; MF12 {a},
the large size of this space remains a problem. Indeed, it is           MF13 {a, b}, MF23 {a, b}; f1 {a}, f2 {a}, f3 {b}; ∅ {ok}]
both a source of misdiagnosis in case of local decision-making
and the cause of a large number of supervision messages                 Figure 6: Operation scheme of the active repair algorithm.
when a consensus must be reached. In order to reduce the                [Diagram stages: Init, Planification, Disambiguation, End]
complexity of the decision process, we address in this section
the question of obtaining the minimal set of diagnoses and
of associated repair plans. To this end, we discuss the idea
of intertwining the diagnosis and repair stages.
   This idea was introduced by Cordier et al. [6] in the
formalization of self-healing systems [7]. Several failures
may indeed have the same signature without calling into
question the repairability of the system: all that is needed
is that a repair be common to all of the faults involved
(notion of macrofault).
   However, restricted to the single-fault context, this
formal model defines the diagnosability and repairability of
a system as static properties that can be computed offline.                In Fig. 6, the initialization of the algorithm determines for
This is not the case in the multiple-fault context. Indeed,             each potential fault fi all repair conflicts existing at the
the appearance of faults can prevent the triggering of a                current time, $C_t^R(f_i)$. The planning phase recursively builds the
repair associated with another fault currently occurring in             repair order from $C_t^D$ and $C_t^R$, adding the faults whose
the system, and the possible situations are endless. Being              conflict sets are empty, and then updates the remaining ones.
able to represent this kind of interference is essential to our         At the end of this phase, if some faults remain, they are
work. This led us to introduce context-dependent notions                potentially in a deadlock. In our example, the agent has to
of diagnosability and repairability.                                    choose between {f2, f3} and {f1, f3}. As highlighted in
                                                                        Fig. 5, the agent can repair f3 but is unable to make a
                                                                        distinction between f1 and f2 (we assume that this conflict is
Definition 1 (Conditional Diagnosability).
                                                                        virtual and that only one of these faults is occurring).
   $Diagnosable(f_i, t) \iff \forall x \in C^D(f_i),\ x \notin ss(t)$
   $\iff C_t^D(f_i) = \emptyset$                                           The disambiguation phase then attempts, from the
                                                                        proven innocuousness of some repairs in the current
                                                                        context, to solve these conflicts. If one of them is solved, the
   A fault f is diagnosable at time t if none of the faults that        planning phase is retried after updating the conflict sets. If
may prevent its diagnosis (e.g. because they share the same             the disambiguation does not work, it means that the agent
signature) is present in the system at this instant. This               does not have, at the current time, enough information to
set of faults is the conflict set in diagnosis of the fault f           solve the problem. The decision-making process previously
(denoted by $C_t^D(f)$). Following the same reasoning, we can           introduced in section 3 is then triggered. In our example,
define $C_t^R(f)$ as the conflict set in repair of the fault f.         repair r2 is selected. If the system returns to normalcy, then
   Finally, the uncertainty regarding the faults that are               both diagnosis and repair phases end. If not, the previous
currently occurring in the supervised system may lead                   action guarantees the occurrence of f1, and the associated
the supervision agents to “believe” in the occurrence of                repair plan is executed.
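
The planning and disambiguation phases of Fig. 6 can be sketched on the running example as follows. This is a simplified illustration under explicit assumptions: each conflict set (a set of sets in the text) is flattened into a plain set of faults, diagnosis and repair conflicts are merged into a single `conflicts` map, a conflict is considered lifted once the conflicting fault has been scheduled, and the function names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def plan_repairs(candidates: Set[str],
                 conflicts: Dict[str, Set[str]]) -> Tuple[List[str], Set[str]]:
    """Planning: schedule faults whose conflicts with unscheduled faults are empty."""
    remaining = set(candidates)
    order: List[str] = []
    progress = True
    while progress:
        progress = False
        for fault in sorted(remaining):
            if not (conflicts.get(fault, set()) & remaining):
                order.append(fault)
                remaining.discard(fault)
                progress = True
                break
    return order, remaining  # a non-empty `remaining` is a potential deadlock


def disambiguate(remaining: Set[str], conflicts: Dict[str, Set[str]],
                 innocuous: Dict[str, Set[str]]) -> bool:
    """Disambiguation: drop conflicts on which the repair is proven innocuous."""
    changed = False
    for fault in remaining:
        harmless = conflicts.get(fault, set()) & innocuous.get(fault, set())
        if harmless:
            conflicts[fault] = conflicts[fault] - harmless
            changed = True
    return changed


def active_repair_order(candidates: Set[str], conflicts: Dict[str, Set[str]],
                        innocuous: Dict[str, Set[str]]) -> Tuple[List[str], Set[str]]:
    """Alternate planning and disambiguation until no more progress is possible."""
    conflicts = {fault: set(c) for fault, c in conflicts.items()}  # work on a copy
    order, remaining = plan_repairs(candidates, conflicts)
    while remaining and disambiguate(remaining, conflicts, innocuous):
        more, remaining = plan_repairs(remaining, conflicts)
        order += more
    return order, remaining


# Running example: f1 conflicts with f2 (diagnosis) and f3 (repair), f2 with f1
# (diagnosis); repair r2 is innocuous if f1 is the fault actually occurring.
conflicts = {"f1": {"f2", "f3"}, "f2": {"f1"}, "f3": set()}
print(active_repair_order({"f1", "f2", "f3"}, conflicts, {"f2": {"f1"}}))
```

On this input the sketch schedules f3 first, then f2, and keeps f1 as the fallback, which matches the example: repair r2 is tried, and the repair plan of f1 is executed only if the system does not return to normal. When faults remain after the loop, the agent falls back on the decision-making process of Section 3.
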






6 Experimental Evaluation                                                  We fixed a priori the locations and responsibilities (the
To evaluate our approach, we developed a simulator for                  regions) of the supervision agents according to the
distributed systems. Based on the JADE multi-agent platform             geographical location of the units that compose the supervised
[8], our environment allows us to model both physical units             process. We used a dataset provided by our industrial
and communication links, and to simulate the occurrence                 partners (8 GB of data corresponding to 48 hours of logs) to
of failures in them. For a given simulation, a list of faults is        extract nineteen different fault models. From these data,
associated with each site and each communication link (a                we determined that the probability of occurrence of n faults
communication link can increase its transmission time, a                per unit of time follows a Poisson distribution with
unit may stop working properly, ...). Each fault is associated          parameter λ = 0.043. Finally, using information gathered from
with one or more trigger conditions: a date and/or the                  our partners, we were able to estimate the costs of the faults
occurrence of another failure. This allows us to simulate               over time (constant, logarithmic, ...) and the associated
cascading faults. Our supervision system is deployed                    repair costs.
on this simulator. When running the simulation, some                    6.2 Experiments
faults trigger the sending of an alarm message to the agent
responsible for the site where they appear. The agents                  Our experiments study the behaviour of the supervision
try to determine the appropriate behaviour from these                   systems when varying the (heterogeneous) transmission
messages.                                                               cost. We used a random generator to assign a transmission
                                                                        time (between 0 and 30 units of time) to each transmission
   Our agent decision model is generic. It can be                       link for each time unit of the simulation. We arbitrarily set
instantiated with various criteria (the most probable                   to 10% the probability that a link gets a transmission time
explanation, the worst-case hypothesis, etc.) depending on              greater than 1. In order to obtain a baseline, we first
available information and on the constraints we choose to               performed different simulations with homogeneous
focus on. In the Dem@tFactory project, it appeared essential            transmission costs.
to consider the utility of a decision based on the cost of the             The performance evaluation is based on three criteria:
occurrence of a given set of faults rather than on its                  (1) the average response time to a malfunction; (2) the
occurrence probability. This reasoning led us to favour a               average number of supervision messages exchanged; (3) the
robust decision criterion. The decisions taken by the agents            average total cost of repairs made during the experiment.
therefore rely on the worst-case hypothesis.
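
The random inputs of the simulation described above can be sketched as follows. This is a minimal sketch, not the authors' generator: the Poisson sampler uses Knuth's multiplication method, and the uniform integer draws on either side of the 10% threshold are an assumption, since the text only gives the range (0 to 30) and the probability of a transmission time greater than 1. Function names are hypothetical.

```python
import math
import random


def sample_fault_count(lam: float, rng: random.Random) -> int:
    """Number of faults in one unit of time, Poisson(lam), via Knuth's method."""
    threshold = math.exp(-lam)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def sample_transmission_time(rng: random.Random, p_slow: float = 0.10,
                             max_time: int = 30) -> int:
    """Transmission time of one link for one time unit.

    Assumption: uniform integer in [2, max_time] with probability p_slow,
    uniform integer in [0, 1] otherwise.
    """
    if rng.random() < p_slow:
        return rng.randint(2, max_time)
    return rng.randint(0, 1)


rng = random.Random(42)  # fixed seed so that a run is reproducible
faults_per_step = [sample_fault_count(0.043, rng) for _ in range(1000)]
link_delays = [sample_transmission_time(rng) for _ in range(1000)]
print(sum(faults_per_step), max(link_delays))
```

With λ = 0.043 most time units contain no fault, so cascading effects in such a simulation come mainly from the trigger conditions attached to each fault rather than from simultaneous independent occurrences.
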
   The upper bound k, which is the horizon considered by                   Figs. 8(a) and 8(b) present the evolution of the behaviour
the agents of ADS2 for the computation of the delayed                   of supervision systems ADS2 and SC for the first two
decision, is set to 15 units of time. Moreover, we assume in            criteria in the case of homogeneous (Ho) and heterogeneous
these experiments that the respective costs of the faults that          (He) communication links. The vertical bar at t = 15 ut is the
compose a diagnosis are additive. Finally, in order to have             horizon considered by the agents of ADS2 for the
a benchmark for the evaluation of the principles underlying             computation of the delayed decision.
our architecture (ADS2), we also implemented a centralized                 Fig. 8(a) shows that our architecture is very robust,
supervision system (SC) in which all observables are                    allowing the supervised system to recover rapidly from
transmitted to a single supervision agent.                              failures. The response time of ADS2 (Ho and He)
                                                                        progressively stabilizes around 15 ut, whereas the response
6.1 Experimental setup                                                  time of SC increases over time and becomes higher than that
Our goal is to study the behaviour of the supervision system            of ADS2. This is due to the fact that the agents of ADS2 can
facing an industrial case study.                                        decide to act without waiting for the reception of all the
                                                                        messages that come from the units of the supervised system.
                                                                        We can observe an increase of the average response time
                                                                        of ADS2(Ho) when the transmission delay is close to 15
                                                                        ut. This is due to the parameter k of our algorithm, fixed a
                                                                        priori to 15 ut. This parameter defines the agent’s horizon
                                                                        for the computation of the delayed decision. When the
                                                                        transmission time becomes greater than or equal to k, an
                                                                        agent no longer sees any interest in waiting or trying to
                                                                        exchange information with other agents of ADS2, so it
                                                                        decides to act despite the risk of making a mistake. The
Figure 7: Workflow of the digitizing chain of the                       impact of parameter k is less important on ADS2(He). Indeed,
Dem@tFactory project.                                                   as the communication links are in this case heterogeneous, a
                                                                        supervision agent is still able to exchange information with
   Fig. 7 represents the digitizing chain of the                        some of the other agents. This leads to a better response
Dem@tFactory used for the experiments. Each dotted                      time for ADS2(He) than for ADS2(Ho). This behaviour
rectangle corresponds to a factory situated at a given                  is clearly highlighted in Fig. 8(b). The number of
geographical location (2 in France, 1 in Madagascar and 1 in            messages exchanged by the agents of ADS2 drops as the
Mauritius). The circles correspond to the various processes             transmission delay increases. We see a sudden drop of
required for the digitization. The inter-rectangle links                this number when the transmission delay becomes greater
correspond to communications between the different factories,           than 15 ut for ADS2(Ho), confirming the local decisions-
and the intra-rectangle one to local communications. All                making of the agents.
theses components are modeled in the simulator and can en-                 Fig. 8(c) shows that the decisions of ADS2(He) agents
gender the occurrence of faults.                                        generate a limited repair extra-cost in comparison to the cen-




                                                                   40
                        Proceedings of the 26th International Workshop on Principles of Diagnosis




    (a) Response time to a malfunction              (b) Communication cost                         (c) Total cost of repairs

Figure 8: Experimental results. The curves corresponding to homogeneous/heterogeneous communication links are respec-
tively marked with (Ho) and (He). The x-axis is the transmission delay. The y-axis correspond for each figure to one of the
evaluation criteria.

tralised approach (9%). With the Dem@tFactory fault mod-              They have shown that to obtain a minimum overall diagno-
els, the overall gain regarding the supervised system down-           sis is NP-hard in the case of spatially distributed information
time reach 34%.                                                       and that the complexity of obtaining the diagnosis is inde-
   Considering the reactivity of ADS2 and the limited                 pendent of the communications costs engendered during its
repair extra-cost it generates, the communication extra-cost          establishment[13].
for low transmission delays can be considered as an                      Given these theoretical results, reducing the space of
acceptable consequences compared with a total-absence of              potential solutions is generally based on a hierarchical
supervision (SC).                                                     structure of the diagnosis agents [12] and on the choice of
                                                                      not returning to previously excluded explanation. Though
   Our next set of experiments evaluates the impact of the            the no back-track of past decisions guarantees convergence
intertwining of the diagnosis and repair phases on the per-           and termination of the algorithm, it is a source of diagnosis
formances of the supervision system. In order to evaluate             errors in an asynchronous environment. The best-effort
the impact of this behaviour for a supervision agent, we ini-         approach we chose allows us to reduce these diagnosis
tially activated this capability in only one agent of ADS2.           errors, and the termination of the diagnosis algorithm is
We realised 100 simulations. The 10 first simulations are             guaranteed through the anytime decision-making process of
performed with a number a simultaneous faults restricted to           our agents.
1. Then the number is gradually incremented every 10 simu-
lations to reach 10 simultaneous faults. For each simulation,
the number of potential diagnoses considered by the agent is             To our knowledge, the work of Nejdl et al [15] is the only
saved at 5 specific time steps. In order to obtain a baseline,        one that addresses the distribution of both the diagnosis
we performed the same 100 simulation with the intertwining            and repair phases. However, placed at a relatively abstract
behaviour deactivated. Fig. 9(a) shows that the interleaving          level of analysis, this work makes the assumptions that
of the diagnosis and repair process does lead to a reduction          communication links are reliable and that messages can be
of the diagnosis search space of an agent between 10 to 20%           exchanged between agents at no cost. In a real situation
for the set of faults of the Dem@tFactory project.                    these hypotheses are too restrictive. Indeed, to not consider
   Fig. 9(b) shows that the reduction of the number of po-            the communication state may render the supervision system
tential explanations of each agent is of an extend sufficient         ineffective or inoperable. Our proposal does not make such
to allow the agent to reduce the number of supervision mes-           assumptions.
sages. The everage response time to a malfunction is not
significantly improved (Fig. 9(c)) but the repair extra-cost             The problem of online decision-making under uncertainty
fall from +9% (ADS2) to 7.2% (ADS2+) (with p<0.05).                   is the central point of the work by Horvitz [16], Hansen et
                                                                      Zilberstein [17] on the control of anytime algorithms. In-
7 Related Work                                                        deed, they propose a formal framework to dynamically de-
The supervision of a system consists of four steps: De-               termine the time to stop a calculation taking into account the
tection, Isolation, Identification and Repair. The literature         quality of the current solution and the cost of the algorithm
aggregates the 3 first steps under the name FDI (Fault De-            computation.
tection and Isolation) [9]. Although several approaches for              The first distinction between these work and ours is that in
the distributed supervision of distributed systems have been          their work the authors determine when to stop the computa-
proposed in the literature, whether work is from the diagno-          tion based on the distance between the current solution and
sis and control communities[10; 11] or from the multi-agent           the optimal one. This requires knowing the optimal solution
domain [12; 13], they do not cover the repair phase.                  (or an estimatation) and to be able to dynamically determine
   In the work from areas related to distributed systems,             this distance. In our work, talking about the quality of a so-
emphasis is placed on the distribution of available knowl-            lution (i.e a diagnosis) is meaningless insofar as a diagnosis
edge on the status and behaviour of the supervised system.            is right or wrong, and its “value” is only known a posteri-
Frohlich et al [14] and Roos et al [13] have addressed the            ori. The second point of divergence is that we try to select a
question of the ability of a set of agents to determine an            candidate (a diagnosis or a repair) among a set of potential
overall diagnosis according to the shape of this distribution.        solutions. The complexity of the task is therefore increased.
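
The idea of stopping an anytime computation by weighing solution quality against computation cost can be illustrated with a small sketch. It is a simplified myopic stopping rule, not the dynamic programming approach of [17] and not the decision model of ADS2. The function name `myopic_stop_time`, the performance profile and the cost value are hypothetical and chosen only for illustration.

```python
def myopic_stop_time(expected_quality, cost_per_step):
    """Return the first step at which one more step is not expected to pay off.

    expected_quality[t] is the (assumed known) expected solution quality after
    t steps; cost_per_step is the assumed cost of running one more step.
    """
    for t in range(len(expected_quality) - 1):
        marginal_gain = expected_quality[t + 1] - expected_quality[t]
        # Stop as soon as the expected improvement no longer covers the cost.
        if marginal_gain <= cost_per_step:
            return t
    return len(expected_quality) - 1


# Hypothetical performance profile with diminishing returns.
profile = [0.0, 0.5, 0.75, 0.875, 0.9375]
cheap_stop = myopic_stop_time(profile, cost_per_step=0.1)
costly_stop = myopic_stop_time(profile, cost_per_step=0.3)
```

The rule needs a performance profile, that is, an estimate of how good the solution will be after each step. This is the kind of information that is not available when a diagnosis is simply right or wrong and its value is only known a posteriori.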








    (a) Size of the potential diagnosis set      (b) Response time to a malfunction                (c) Communication cost

Figure 9: Simulation results when integrating the interleaving of the diagnosis and repair steps. The symbol "+" is associated
with the components which integrate this additional mechanism.


8 Conclusions

We presented the first anytime multi-agent architecture for the supervision of distributed systems that is able to dynamically adapt its behaviour to the current state of the supervised system. In particular, the decision model allows each supervision agent to find a balance between a quick local diagnosis and repair under uncertainty, and a delayed, systemic one, based on the respective costs of misdiagnosis and communication. The distributed consistency algorithm allows each agent to form a coalition to reduce its uncertainty or to restore a consistent view of the system state in case some had to act locally with incomplete information at an earlier stage. Moreover, the intertwining of the diagnosis and the repair phases allows an efficient reduction of the diagnosis search space. The overall reduction of 34% of the Dem@tFactory system down-time associated with a repair extra-cost of 7.2% demonstrates that ADS2 is able to efficiently supervise complex systems under real-life assumptions.

   A fully autonomous supervision system is presently not realistic in an industrial context, as humans want to keep control over what they perceive as critical decisions. ADS2 represents what we see as an acceptable trade-off, as its degree of autonomy can be easily defined. Thus ADS2 organizes the set of known faults and repairs in two subclasses: the ones whose repair plan can be triggered automatically, and those whose final repair decision rests with a human supervisor. The risk aversion of the users defines the size of these two respective sets. If the confidence in the efficiency of the autonomous supervision of complex and distributed systems is not common today, we believe that the work presented herein provides a step towards this goal.

References

[1] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.
[2] M. Salehie and L. Tahvildari. Self-adaptive software: Landscape and research challenges. ACM Trans. Auton. Adapt. Syst., 4(2):1–42, 2009.
[3] M.J. Fischer, N.A. Lynch, and M.S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM (JACM), 32(2):374–382, 1985.
[4] L. Lamport. Paxos made simple. ACM SIGACT News, 32:18–25, 2001.
[5] R. De Prisco, B. Lampson, and N. Lynch. Revisiting the Paxos algorithm. Distributed Algorithms, pages 111–125, 2000.
[6] Marie-Odile Cordier, Y. Pencolé, L. Travé-Massuyes, and T. Vidal. Self-healability = diagnosability + repairability. In The 18th International Workshop on Principles of Diagnosis, volume 7, pages 251–258. Citeseer, 2007.
[7] Philip Koopman. Elements of the self-healing system problem space. In ICSEWADS03, 2003.
[8] K. Chmiel, M. Gawinecki, P. Kaczmarek, M. Szymczak, and M. Paprzycki. Efficiency of JADE agent platform. volume 13, pages 159–172. IOS Press, 2005.
[9] Giovanni Betta and Antonio Pietrosanto. Instrument fault detection and isolation: state of the art and new research trends. volume 49, pages 100–107, 1998.
[10] S. Lafortune, D. Teneketzis, M. Sampath, R. Sengupta, and K. Sinnamohideen. Failure diagnosis of dynamic systems: an approach based on discrete event systems. In Proc. American Control Conference, volume 3, pages 2058–2071, June 25–27, 2001.
[11] Christos G. Cassandras and Stéphane Lafortune. Introduction to Discrete Event Systems. Springer, 2008.
[12] H. Wörn, T. Längle, M. Albert, A. Kazi, A. Brighenti, S. Revuelta Seijo, C. Senior, M A S. Bobi, and JV. Collado. Diamond: distributed multi-agent architecture for monitoring and diagnosis. Production Planning & Control, 15:189–200, 2004.
[13] Nico Roos, Annette ten Teije, André Bos, and Cees Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In Proceedings of the second international joint conference on Autonomous agents and multiagent systems, page 7. ACM, 2003.
[14] Peter Fröhlich and Wolfgang Nejdl. Resolving conflicts in distributed diagnosis. In ECAI Workshop on Modelling Conflicts in AI, 1996.
[15] W. Nejdl and M. Werner. Distributed intelligent agents for control, diagnosis and repair. RWTH Aachen, Informatik, Tech. Rep, 1994.
[16] E. Horvitz and G. Rutledge. Time-dependent utility and action under uncertainty. pages 151–158, 1991.
[17] E.A. Hansen and S. Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126:139–157, 2001.








 Data Driven Modeling for System-Level Condition Monitoring on Wind Power Plants

       Jens Eickmeyer¹, Peng Li², Omid Givehchi², Florian Pethig¹ and Oliver Niggemann¹,²
       ¹ Fraunhofer Application Center Industrial Automation IOSB-INA
         e-mail: {jens.eickmeyer, florian.pethig, oliver.niggemann}@iosb-ina.fraunhofer.de
       ² inIT - Institute Industrial IT
         e-mail: {peng.li, omid.givehchi, oliver.niggemann}@hs-owl.de

Abstract

The wind energy sector grew continuously in the last 17 years, which illustrates the potential of wind energy as an alternative to fossil fuel. In parallel to physical architecture evolution, the scheduling of maintenance optimizes the yield of wind power plants. This paper presents an innovative approach to condition monitoring of wind power plants that provides system-level anomaly detection for preventive maintenance. First, a data-driven modeling algorithm is presented which utilizes generic machine learning methods. This approach makes it possible to automatically model a system in order to monitor the behavior of a wind power plant. Additionally, this automatically learned model is used as a basis for the second algorithm presented in this work, which detects anomalous system behavior and can alert the operator. Both presented algorithms are used in an overall solution that neither relies on specialized wind power plant architectures nor requires specific types of sensors. To evaluate the developed algorithms, two well-known clustering methods are used as a reference.

1 Introduction

According to a wind market statistic by the GWEC (Global Wind Energy Council) [1], the global wind power capacity grew continuously for the last 17 years. In 2014, the global wind industry had a 44% rise of annual installations and the worldwide total installed capacity reached 369553 megawatts at the end of 2014. In Europe, renewable energy from wind power plants (WPP) covers up to 11% of the energy demand [2]. With this rapid continuous growth, wind power is considered one of the most competitive alternatives to fossil fuels.

   In a case study, Nilsson [3] puts the cost of unscheduled downtime at 1000 € per man-hour, with costs of up to 300000 € for replacements. This does not take into account the reduced yield through production loss. Therefore, the objective of maintenance is to reduce WPP downtimes and provide high availability and reliability.

   High availability is currently achieved by two different strategies. On the one hand, maintenance is planned at regular time intervals based on the manufacturer's data of specific WPP parts. This is performed in order to prevent wearout failure. On the other hand, there is the strategy of corrective maintenance, which reacts to failures that have occurred. Both strategies need time for actual maintenance, which leads to non-productive downtimes. Especially when considering offshore WPP, these downtimes produce high costs.

   To reduce these downtimes, a precise proactive scheduling of maintenance tasks is needed. This is achieved through condition monitoring (CM) systems [4]. Those systems try to reason about inherent system states such as wear. These conditions cannot be measured directly, but the growing number of sensors in modern WPP enables an adequate description of the machine's state. To make use of this, CM systems need a model of the WPP, which describes the system behavior based on observed data.

   Existing CM solutions for WPP rely on specific sensors and are specialized to monitor single parts of the system. The gearbox [5], the bearing [6], the generator [7] or the blades [8] have been monitored in order to perform proactive maintenance. Here, specific sensors are needed as a requirement for these specialized methods.

   This article presents a system-level solution which handles heterogeneous WPP architectures regardless of installed sensor types. Also, an algorithm for modeling a WPP on system level and another algorithm for anomaly detection are presented. To achieve this, three challenges are tackled and their solutions are presented:

  I. Logging data from available sensors of a WPP, using existing infrastructure independent of the architecture. Additionally, it must be possible to add new sensors and sensor types on demand.
 II. Automatic modeling of a WPP, by combining existing and generic data-driven methods. Such a model must be able to learn the complex sensor interdependencies without extra manual effort.
III. Anomaly detection for a WPP regardless of its kind of architecture, especially with no assumptions on available types of sensors.

The article is structured as follows. Section 2 deals with state of the art technology in WPP CM. Hardware and data acquisition for the presented solution are specified in section 3, where point I is the central issue. Data-driven models realizing point II and the analyzed machine learning approaches are the subject of section 4. Anomaly detection and its general approach, according to point III, are described in section 4.2. The results of an evaluation of the presented methods are the content of section 5. Finally, this paper concludes in section 6 and describes future aims of the presented work.






2 Related Work
The core task of a CM system is anomaly detection. As stated in [9], the models used for anomaly detection of complex systems should be learned automatically and data-driven approaches to learning such models should be moved into the research focus.

   A wide range of data-driven algorithms that deal with modeling the system behavior for anomaly detection are available in the literature.

   Because of their simplicity in processing huge amounts of data, Principal Component Analysis (PCA) based algorithms are widely applied in the condition monitoring of WPP [10][11].

   As one of the classic density based clustering methods, DBSCAN shows its advantages over the statistical method on anomaly detection in temperature data [12].

   Piero and Enrico proposed a spectral clustering based method for fault diagnosis where fuzzy logic is used to measure the similarity and fuzzy C-Means is used for clustering the data [13].

   Due to the high complexity of a WPP and its harsh working environment, the modeling of WPPs on system level is very challenging. Most data-driven solutions to WPP condition monitoring concentrate on the errors of one particular component (on component level) [4]. These methods are designed to detect specific faults (e.g. a fault in the gearbox or generator).

   The application of such methods is described in different studies. In [6], a shock pulse method is adapted for bearing monitoring. A multi-agent system is developed in [5] for condition monitoring of the wind turbine gearbox and oil temperature. In [8], ultrasonic and radiographic techniques are used for non-destructive testing of the WPP blades. Using these methods can prevent the WPP breakdowns caused by these particular faults. For enhancing the availability and the reliability of the whole WPP, a method for monitoring the WPP on system level is desired.

   In this work, a PCA-based algorithm for condition monitoring of WPP is presented. This approach aims to model a WPP on system level in order to perform automatic anomaly detection. As a comparison, DBSCAN and spectral clustering are utilized for the same purpose. To the best of our knowledge, no application of either DBSCAN or spectral clustering in condition monitoring of WPP exists.

3 Data Acquisition Solution
A WPP includes different types of sensors, actuators and controllers installed to monitor and control the different devices and components as shown in Figure 1. To monitor the condition of a WPP, it is necessary to collect process data from its sensors and components accurately and continuously feed this data to the diagnosis algorithms. To maximize accuracy, data should be acquired directly from the sensors and components or via the existing communication systems. Despite the fact that IEC 61400-25 [14] addresses a variety of standards and protocols in WPP, lots of proprietary solutions exist today. A general approach to accurate data acquisition in a uniform way implies protocol adapters or data loggers (DL) to connect the diagnosis framework. This is done not only for IEC 61400-25 conformant WPP, but also for proprietary ones using e.g. the MODBUS protocol or a direct connection via general-purpose input/output (GPIO) [15]. Also the data logger should model data based on generic industrial standards (IEC 61400-25) and transfer them to a database for storage and processing. Such a data logger meets point I (see section 1). In addition, the timestamp of the data should be synchronized accurately between data loggers, database and application.

Figure 1: Diagram showing the inside of a nacelle and main components [4]

   In this work we followed a three layer architecture for data acquisition, as shown in Figure 2, which covers all of the CM system components. In layer 1, the physical machine components are connected to the data logger hardware using different industrial connections and protocols, e.g. digital GPIO, RS485, MODBUS, etc. The data loggers are time synchronized using global positioning system (GPS) or network time protocol (NTP) time references via an embedded time client running in the data logger. Collected sensor data is attached to accurate timestamps by an embedded OPC UA server inside the data logger. The sensor data is categorized based on an OPC unified architecture (OPC UA) data model (e.g. conformant to IEC 61400-25) for a standalone WPP.

   The communication between data logger, OPC UA server and layer 2 is realized with a secure general packet radio service network (GPRS) or a virtual private network (VPN), so that it can be accessed for widely distributed WPPs in different geographical locations. Layer 2 comprises a middleware to collect and host the sensor data coming from distributed data loggers. It mainly covers a database with support for historical data and also an OPC UA server that aggregates the data incoming from distributed WPP data loggers and pushes it to the database using an OPC UA database wrapper. As shown in Figure 2, the main component of layer 3 is an analysis engine. This engine applies algorithms on the database. Based on the learned machine models, an output about the machine's condition is presented to the operator by a human machine interface (HMI).

4 Modeling Solution
The main idea of the presented solution is to automatically learn a model of normal system behavior from the observed data using data-driven methods. Classical manual modeling utilizes expert process knowledge to build a simulation model as a reference for anomaly detection. But a process such as a WPP contains numerous continuous sensor values, which make it difficult to model the system manually. Therefore, as the first step of the solution a model is learned from a set of data. The second step utilizes this model as a reference to perform anomaly detection. This section considers these two steps.
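
The two-step idea of Section 4, learning a model of normal behavior and then using it as a reference for anomaly detection, can be illustrated with a minimal sketch. It deliberately swaps in a much simpler stand-in model (a per-sensor mean and standard deviation with a z-score threshold) and is not the PCA-based algorithm of this paper. The function names, the threshold `max_z` and the sample values are assumptions made for illustration only.

```python
import statistics


def learn_normal_model(training_rows):
    """Step 1: learn per-sensor mean and standard deviation from normal data."""
    columns = list(zip(*training_rows))  # one tuple of values per sensor
    return [(statistics.fmean(col), statistics.stdev(col)) for col in columns]


def detect_anomaly(model, row, max_z=3.0):
    """Step 2: flag a row if any sensor deviates more than max_z std devs."""
    for (mean, std), value in zip(model, row):
        if std > 0 and abs(value - mean) / std > max_z:
            return True
    return False


# Hypothetical readings of two sensors recorded during normal operation.
training = [(10.0, 50.0), (11.0, 52.0), (9.0, 48.0), (10.0, 50.0)]
model = learn_normal_model(training)

normal_flag = detect_anomaly(model, (10.5, 51.0))
faulty_flag = detect_anomaly(model, (10.0, 70.0))
```

A per-sensor threshold of this kind ignores the interdependencies between sensors, which is exactly what a system-level model is meant to capture, so it should be read only as an outline of the learn-then-compare workflow.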






Clustering based modeling
The goal of cluster analysis is to partition data points into different groups. Similarity of points is defined by a minimal intra-cluster distance, whereas different clusters aim for a maximum inter-cluster distance. Thus, cluster analysis can be utilized to find the pattern of a system directly from the multi-dimensional data without explicit descriptions of the system features. This is the main advantage of using cluster analysis for modeling complex systems with seasonal components, e.g. WPP.

   In the presented solution, a system model for anomaly detection should characterize the normal system behavior so that it can be used to identify unusual behavior. For most complex systems, the normal behavior might consist of multiple modes that depend on different factors, e.g. work environments or operations of the systems. When cluster analysis is performed on a data set representing the normal behavior of a system, multiple clusters can be recognized. Each cluster (group) represents a particular status of the system. Such multiple clusters can then be used as the normal behavior model of a system for anomaly detection.
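
As a minimal sketch of how a set of clusters could serve as a normal behavior model, the snippet below flags a sample as anomalous when it lies outside every learned cluster. Representing each cluster by a centroid and a radius is an assumption made for illustration, as are the function name and the sample values; the text above does not prescribe this representation.

```python
import math


def is_anomalous(sample, clusters):
    """Return True if the sample lies outside every normal-behavior cluster.

    clusters is a list of (centroid, radius) pairs, a hypothetical and
    simplified description of the clusters found on normal data.
    """
    return all(math.dist(sample, centroid) > radius
               for centroid, radius in clusters)


# Two hypothetical operating modes of a system, learned from normal data.
normal_clusters = [((0.0, 0.0), 1.0), ((10.0, 10.0), 2.0)]

inside_first_mode = is_anomalous((0.5, 0.5), normal_clusters)
between_modes = is_anomalous((5.0, 5.0), normal_clusters)
inside_second_mode = is_anomalous((10.0, 11.5), normal_clusters)
```

Each cluster stands for one operating mode, so a sample is only reported when it matches none of the modes.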
                                                                         In this paper, two well-known clustering algorithms, DB-
                                                                      SCAN and spectral clustering, are utilized to model the nor-
Figure 2: Architecture overview of the presented system-              mal behavior of a WPP on system level. Each of them has
level condition monitoring solution for a WPP                         advantages in clustering the data with complex correlations.
                                                                         DBSCAN is resistant to noise and can recognize patterns
                                                                      of arbitrary shapes. In DBSCAN, the density for a particu-
4.1 Step 1: Data-Driven Modeling                                      lar point is defined as the number of neighbor points within
                                                                      a specified radius of that point [17]. Two user-defined
                                                                      parameters are required: Eps, the radius, and MinPts, the
In order to automatically compute a system model, the presented       minimal number of neighbors within Eps. DBSCAN uses such
solution uses generic methods to analyze training data and derive     center-based density to classify the data points as core points
process knowledge. These methods from the field of machine            (Eps-neighbors ≥ MinPts), border points (not a core point
learning reduce the effort and time that the complex sensor           but the neighbor of at least one core point) or noise points
interdependencies would otherwise cause when generating a             (neither a core nor a border point). Two core points that are
system model. Additionally, a WPP is influenced by seasonal           within Eps of each other are defined as density-reachable
components, and a normal state of work cannot be declared as          core points. DBSCAN partitions the data into clusters by
precisely as for a machine that works in the homogeneous              iteratively labeling the data points and collecting
environment of a factory. This meets the requirement in point         density-reachable core points into the same cluster. As a result,
II (see section 1). In this solution, step 2 detects anomalies        DBSCAN delivers several clusters in which noise points are also
as deviations between an observation and the learned reference        collected in a cluster. DBSCAN is not suitable to cluster high
model of the system; this is described in section 4.2.                dimensional data because density is more difficult to define
                                                                      in high dimensional space. Therefore, a method to reduce
   Common strategies for data-driven modeling are super-              dimensionality should be applied to the data before using
vised and unsupervised learning methods. Supervised meth-             the DBSCAN. This leads to a density based description of
ods such as Multilayer Perceptron, Support Vector Ma-                 the normal behavior.
chines or Naive Bayes Classifier (see [16] for more infor-               This method assumes that the training data perfectly de-
mation) can be used to directly classify data according to            scribe the distribution of system normal states. For WPP,
learned hyperplanes in the data space. To be reliable, those          some special states of the plant occur so rarely that the
methods need a-priori knowledge from labeled data of possible         recorded data cannot represent such special states very well.
faults and the normal state. Gathering those precise                  In addition, environmental influences lead to noise points
data for a continuous production system like a WPP is hard            within the data set. Therefore, a complete coverage of the
to realize, as faults are rare and environmental conditions           normal states of a WPP in learning data set is unrealistic to
increase the number of possible faults dramatically.                  achieve.
   In comparison, unsupervised learning methods (e.g.                    Compared to the traditional approaches to clustering (e.g.
clustering, Self-Organizing Maps) seek to model data without          k-means, DBSCAN), spectral clustering can generally deliver
any a-priori knowledge. Therefore, they are able to                   better results and can be solved efficiently by standard
extract knowledge from unlabeled data sets and generate               linear algebra methods [18]. Another advantage of spectral
a model out of this knowledge. In this article, two types             clustering is the ability to handle high dimensional data
of unsupervised learning methods are investigated to model            using spectral analysis. Thus, no extra dimensionality
a WPP using unlabeled data. The PCA based modeling is                 reduction method is required. The idea of spectral clustering
compared against cluster based modeling methods, which                is to represent the data in the form of a similarity graph
are used as reference.                                                G(V, E) where each vertex vi ∈ V presents a data point
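
As an illustration of the DBSCAN point types described above (core, border and noise points), the following minimal Python sketch labels every point of a small data set. The function name classify_points, the example values and the choice not to count a point as its own neighbor are assumptions made for this illustration only; the cluster expansion step of DBSCAN is omitted.

```python
import math


def classify_points(points, eps, min_pts):
    """Return the DBSCAN point type ('core', 'border' or 'noise') for each point."""
    n = len(points)
    # Eps-neighborhood of each point; the point itself is not counted (assumption).
    neighbors = [
        [j for j in range(n) if j != i and math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append("core")
        elif any(is_core[j] for j in neighbors[i]):
            labels.append("border")  # not core, but neighbor of a core point
        else:
            labels.append("noise")   # neither core nor border
    return labels


# Hypothetical example: four points on a line and one far outlier.
example = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (1.5, 0.0), (10.0, 10.0)]
print(classify_points(example, eps=0.6, min_pts=2))
```

Only the points labeled as core would later serve as reference points for the distance computation in the anomaly detection step.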






Algorithm 1 PCA based modeling                                        \[ \Sigma_0 \approx \frac{1}{N-1} X^T X \]
 1: Input: X                          ▷ learning data set                By means of EVD (eigenvalue decomposition) or the
 2: Output: Model_X                   ▷ model of input data           equivalent SVD (singular value decomposition) the covariance
 3: procedure PCA_BASED_MODELING(X)                                   matrix is decomposed as follows:
 4:    l: reduced dimensionality                                      \[ \Sigma_0 = P \Lambda P^T, \quad \Lambda = \begin{pmatrix} \Lambda_{pc} & 0 \\ 0 & \Lambda_{res} \end{pmatrix} \]
 5:    PCA_Matrix = performPCA(X)                                        With $\Lambda = \mathrm{diag}(\sigma_1^2 < \sigma_2^2 < \cdots < \sigma_i^2)$ where $\sigma_i$,
 6:    X_PCA = mapToLowDimension(X)                                   i = 1, ..., m is the i-th eigenvalue and P is a matrix of the
 7:    Model_X = generate_N-Tree(X_PCA)                               eigenvectors, sorted according to the eigenvalues of $\Lambda$. $\Lambda_{pc}$
 8: end procedure                                                     are the chosen principal components according to a threshold
 9: function GENERATE_N-TREE(X_PCA)                                   l and $\Lambda_{res}$ denotes the less informative rest. l is a parameter
10:    Tree: List with length 2^l                                     which depends on the eigenvalues proportion of total
11:    for (x_pca in X_PCA) do                                        variance and determines the dimension of the reduced
12:        i = determine_orthant(x_pca)                               normal space.
13:        Tree_i = append(Tree_i, x_pca)                             \[ Y = P^T X \]
14:    end for                                                        Transforms the p-dimensional dataset X into a dataset Y of
15:    for (leaf in Tree) do                                          a lower dimension l, with a minimum of information loss.
16:        if (sizeOf(leaf) > 1) then                                 The axes of the dimensionally reduced data space are
17:            leaf = generate_N-Tree(leaf)                           orthonormal and aligned to the maximum variance of data.
18:        end if                                                     Prerequisite for modeling a WPP with this kind of
19:    end for                                                        transformation is the input data to calculate eigenvalues and the
20:    return (Tree, PCA_Matrix)                                      rotation matrix. Therefore, the presented data set of a WPP
21: end function
in the dataset. Each edge e_ij ∈ E between two vertices v_i           needs to describe a period of fault-free operation, which is
and v_j carries a non-negative weight w_ij (the similarity            denoted by the term 'normal state'. Using this data set as
between the two points). Then, the clustering problem can be          a learning base, the PCA described above spans a reduced
handled as a graph partitioning problem [19]. G is divided            normal state space, where signal covariances are taken into
into smaller components, such that the vertices within a              account, since the eigendecomposition of the covariance matrix
component are densely connected and there are few connections         is the basis for the transformation. The input variables are
between the components. These components correspond to the            transformed in line 6 of algorithm 1.
clusters in the results of spectral clustering and can be                In comparison to clustering methods, only the covariance
used as the normal status model for anomaly detection.                matrix stores explicit shape information. This makes it
                                                                      necessary to take all data points into account when classifying
                                                                      a new observation. That is why the computational effort
PCA based modeling                                                    for this model increases with the number of data points in
Algorithm 1 presents the stated modeling solution for a               the data set and their dimension. To overcome this issue,
system-level approach to a WPP. The algorithm utilizes                the model is extended with an N-Tree as a geometrical data
Principal Component Analysis (see line 5 of algorithm 1) as a         structure (see function generate_N-Tree in algorithm 1).
very first step to achieve a dimensionally reduced description        The axes of the PCA transformed normal state space divide
of the training data set. Although a part of the information is       the data into $2^l$ subspaces. Centering these subspaces in
lost due to the reduction, the sensor correlations in the low         each iteration divides the subspaces recursively until each
dimensional space are reduced drastically, which minimizes            leaf of the tree contains one data point or is empty. Note
the computational effort.                                             that the mean of each subspace needs to be stored.
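
The recursive orthant splitting of the N-Tree can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the evaluation: the names orthant_index, build_n_tree and find_leaf, the dictionary-based node layout and the rule that a coordinate equal to the mean falls into the upper half are all hypothetical. The points are assumed to be already mapped into the l-dimensional PCA space.

```python
def orthant_index(point, center):
    """Index in [0, 2**l): bit a is set if coordinate a is >= the center."""
    index = 0
    for axis, (p, c) in enumerate(zip(point, center)):
        if p >= c:
            index |= 1 << axis
    return index


def build_n_tree(points):
    """Recursively split the points by orthant around their mean."""
    if len(points) <= 1:
        return {"leaf": True, "points": list(points)}
    dim = len(points[0])
    center = tuple(sum(p[a] for p in points) / len(points) for a in range(dim))
    buckets = [[] for _ in range(2 ** dim)]
    for p in points:
        buckets[orthant_index(p, center)].append(p)
    if any(len(b) == len(points) for b in buckets):
        # Points that cannot be separated (e.g. duplicates) stay in one leaf.
        return {"leaf": True, "points": list(points)}
    # The mean is stored because it is needed again when searching the tree.
    return {"leaf": False, "mean": center,
            "children": [build_n_tree(b) for b in buckets]}


def find_leaf(tree, point):
    """Descend to the leaf whose chain of orthants contains the point."""
    while not tree["leaf"]:
        tree = tree["children"][orthant_index(point, tree["mean"])]
    return tree


# Hypothetical two-dimensional example (l = 2, so every node has 4 children).
data = [(1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0), (2.0, 2.0)]
tree = build_n_tree(data)
print(find_leaf(tree, (2.1, 1.9))["points"])
```

A search can end in an empty leaf; the aggregation of neighboring leaves for that case is not part of this sketch.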
   The PCA is based on the assumption that most of the
information is located in the direction of most variance.             4.2 Step 2: Anomaly Detection
Therefore, this method aims to project a data set to a                To comply with point III (see section 1), the prerequisite
subspace with a lower dimension by minimizing the sum of              for a system-level anomaly detection is a data-driven model
squared distances between the points $y_i$ and their                  as stated above. Given such a model, a distance measure
projections $\theta_i$, i.e. the following cost function:             is needed to calculate the deviation between a new system
\[ \sum_{i=1}^{m} \| y_i - \theta_i \|^2 \]                           observation and the model in order to identify anomalies.
   Let $x_1, \ldots, x_m$ be a data point of m sensor values          Therefore, an observation vector needs to be transformed
and X a historical dataset of N scaled data points:                   into the dimensionally reduced space of the model. Then
\[ X = \begin{pmatrix}                                                the deviation between an actual observation and the learned
   x_{1,1} & \cdots & x_{1,m} \\                                      model can be calculated using a distance metric, such as
   \vdots & \ddots & \vdots \\                                        Euclidean distance, Mahalanobis distance or Manhattan distance.
   x_{N,1} & \cdots & x_{N,m}                                            DBSCAN-generated clusters provide a discrimination of
   \end{pmatrix} \in \mathbb{R}^{N \times m} \]                       core and border data points. Distance computation in
Then, as a first step for computing the PCA, the covariance           DBSCAN uses the Euclidean distance metric. Only core points
matrix is formed as                                                   are used to measure the distance between an observation






                                                                      Algorithm 2 Anomaly detection
                                                                       1: Input: Tree                ▷ learned model, see algorithm 1
                                                                       2: Input: O                   ▷ input observation
                                                                       3: Output: Boolean            ▷ anomaly
                                                                       4: procedure ANOMALY_DETECTION(Tree, O)
                                                                       5:    O_PCA = mapToLowDimension(O)
                                                                       6:    subset = get_subset(Tree, O_PCA)
                                                                       7:    dist = calculate_distance(O_PCA, subset)
                                                                       8:    if (dist > 0) then
                                                                       9:        anomaly: TRUE
                                                                      10:    else
                                                                      11:        anomaly: FALSE
                                                                      12:    end if
                                                                      13:    return (anomaly)
                                                                      14: end procedure
Figure 3: Characteristics of Gaussian distribution in comparison      15: function GET_SUBSET(Tree, O_PCA)
to Marr wavelet (dashed). Spots are marked where the Marr             16:    i = determine_orthant(O_PCA)
wavelet reaches zero.                                                 17:    if (size(leaf_i) > 1) then
                                                                      18:        subset = get_subset(leaf_i, O_PCA)
and the core points. This leads to the decision whether an            19:    else
observation is part of the model's clusters or not.                   20:        subset = neighbors(leaf_i)
   Spectral clustering computes clusters in a dimensionally           21:    end if
reduced space but gives no further information about core or          22:    return (subset)
border points. Measuring the distance to such clusters                23: end function
can be achieved by a prototype, for example the cluster
center. Then, for computing the distance, a metric like               Where l denotes dimensions of reduced normal-space and
the Mahalanobis distance is used, which is sensitive to the           \[ k = \sqrt{\sum_{i}^{l} (O_{pca_i} - X_{pca_i})^2} \]
multidimensionality of such a cluster. Representing a cluster
based on a prototype is a generalization.                             k is the l-space Euclidean distance. For ψ > 0 an observation
   The PCA based modeling approach uses the dimensionally             in principal space O_pca is denoted part of the normal
reduced input data as description of the multidimensional             state space (see line 17).
normal state space. Algorithm 2 shows how the model,
computed with algorithm 1, is used for anomaly detection.
At first a new observation is mapped to the low
dimensional space of the model, using the rotation matrix
from the PCA (see line 5). Then the mapped observation
is compared with the normal state space. Therefore, the
N-Tree is searched for its corresponding subset first (see
function get_subset). If an empty leaf is found, all neighbor         5 Results
leaves are aggregated to a most relevant subset of data points.       The data used in the evaluation was collected over a duration
As the data is not generalized by border points or cluster            of 4 years from 11 real WPPs in Germany with 10 minutes
means as prototypical points, it is necessary to measure the          resolution. The dataset consists of 12 variables which describe
distance of the observation to each point of this subset. Now         the work environment (e.g. wind speed, air temperature)
the distance is computed (see line 7).                                and the status of the WPP (e.g. power capacity, rotation
   Absolute distance measuring is missing a threshold to              speed of generator, voltage of the transformer).
decide when an observation meets the model or not. Even               For evaluation, a training data set of 232749 observations
when utilizing a Gaussian density function to provide an              of the 10 minutes resolution was used to model the normal
indicator for classification, a threshold needs to be estimated       behavior of a WPP. The evaluation data set of 11544
for classification. In this project, a Marr wavelet function          observations contains 4531 reported failures and 7013
is used to decide whether a new observation is part of the            observations of normal behavior. Table 1 shows the confusion
learned normal space. Unlike a Gaussian distribution, the             matrix [21] as a result of the evaluation. Here, true negative
characteristic form of a Marr wavelet [20] allows a                   denotes a correctly predicted normal state and true positive a
classification where the threshold can be set to zero, see            correctly classified failure. For this use case, the F1-score is
figure 3. Taking into account the Marr wavelet and the                used to analyze the system's performance in anomaly detection.
Euclidean distance function, the process of distance measuring        Also, the runtime for the evaluation is denoted in
is computed as follows.                                               Table 1 to compare speed performance of the different analyzed
   Let $X_{pca} = [x_1, \cdots, x_l]$ be a vector of the model's      methods.
principal normal-space and $O_{pca} = [o_1, \cdots, o_l]$ a           As can be seen, the presented PCA based algorithm outperforms
transformed observation, where l denotes the number of                the standardized spectral clustering. Especially a significant
principal components. Then the distribution function to measure       performance boost in computation time is achieved
if a new observation is part of the normal state space is             due to the extended N-Tree data structure.
formed as:                                                               Both DBSCAN and spectral clustering rely on complete
\[ \psi(X_{pca}, O_{pca}) = \frac{2}{\sqrt{3\sigma}\,\pi^{1/4}}       sensor information for clustering the data set. A defective
   \cdot \left(1 - \frac{k}{\sigma^2}\right)                          sensor leads to a maintenance action. The delay for this
   \cdot \exp\left(-\frac{k}{2\sigma^2}\right) \]
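
The zero-threshold decision with a Marr wavelet can be illustrated with the following minimal Python sketch. It uses the textbook form of the Marr (Mexican hat) wavelet, written in terms of the squared ratio (k/σ)²; the exponent, the value of σ and the rule that an observation counts as normal as soon as one point of the subset yields a positive value are assumptions of this sketch and may differ from the formula and aggregation used in the presented solution. The names marr_wavelet and is_normal are hypothetical.

```python
import math


def marr_wavelet(k, sigma):
    """Textbook Marr (Mexican hat) wavelet at distance k with width sigma."""
    norm = 2.0 / (math.sqrt(3.0 * sigma) * math.pi ** 0.25)
    ratio = (k / sigma) ** 2
    return norm * (1.0 - ratio) * math.exp(-ratio / 2.0)


def is_normal(observation, subset, sigma):
    """True if the wavelet is positive for at least one point of the subset."""
    return any(
        marr_wavelet(math.dist(observation, x), sigma) > 0.0 for x in subset
    )


# Hypothetical points in a two-dimensional reduced space.
subset = [(0.0, 0.0), (5.0, 5.0)]
print(is_normal((0.1, 0.0), subset, sigma=1.0))   # close to a learned point
print(is_normal((2.5, 2.5), subset, sigma=1.0))   # far from both learned points
```

In this form the wavelet changes sign at k = σ, so σ plays the role of the radius around each learned point within which an observation is still accepted as normal.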





                      True Pos.   True Neg.   False Pos.   False Neg.   Bal. Acc.   F-Measure   Elapsed time
 DBSCAN               1812        6827        186          2719         68.66%      55.50%      3 s
 Spectral Clustering  3832        6328        685          699          87.40%      84.71%      6637 s
 PCA based            3970        6517        496          561          90.27%      88.25%      68 s

                                    Table 1: Evaluation results of wind power station data.
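
The balanced accuracy and the F-Measure in Table 1 follow directly from the four confusion matrix counts. The short Python sketch below recomputes both values for the "PCA based" row, treating a reported failure as the positive class as stated in the text; the function names are chosen for this illustration only.

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (failures found) and specificity (normal states kept)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of the counts."""
    return 2.0 * tp / (2.0 * tp + fp + fn)


# Counts taken from the "PCA based" row of Table 1.
tp, tn, fp, fn = 3970, 6517, 496, 561
print(f"balanced accuracy: {balanced_accuracy(tp, tn, fp, fn):.2%}")
print(f"F1-score:          {f1_score(tp, fp, fn):.2%}")
```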


maintenance depends on the location of the WPP and                    References
causes missing sensor values for a certain time. To be                [1] Global Wind Energy Council. Global wind statistics
operable in the use case of WPP, such a model needs a fallback             2014. Available online at
strategy in case of missing sensor values. Here, redundancy                http://www.gwec.net/wp-content/uploads/2015/02/GWEC_GlobalWindStats2014_FINAL_10.2.2015.pdf,
and correlation of different sensors come in handy. By                     March 2015.
extending the PCA to a Probabilistic Principal Component              [2] European Wind Energy Association et al. Wind in
Analysis (PPCA), missing values can be estimated according                 power, 2012 european statistics, 2013. Available online at
to the data learned from the data set. Tipping and Bishop                  http://www.ewea.org/fileadmin/files/library/publications/statistics/Wind_in_power_annual_statistics_2012.pdf,
[22] extend a classic PCA by a probability model. This                     March 2015.
model assumes Gaussian distributed latent variables which             [3] Julia Nilsson and Lina Bertling. Maintenance management
can be inferred from the existing variables and the matrix                 of wind power systems using condition monitoring
of eigenvectors from the PCA. With the use of a PPCA, the                  systems life cycle cost analysis for two case
system-level solution is robust enough to stay reliable                    studies. Energy Conversion, IEEE Transactions on,
even when sensor values are missing. This was tested by                    22(1):223–229, 2007.
training the model with a defective data set, containing 10%          [4] Wenxian Yang, Peter J Tavner, Christopher J Crabtree,
missing sensor values. While evaluating this model, also 10%               Y Feng, and Y Qiu. Wind turbine condition monitoring:
of the data was damaged, simulating missing sensor values.                 technical and commercial challenges. Wind Energy,
The result of this evaluation is presented in Table 1.                     17(5):673–693, 2014.
6 Conclusion                                                          [5] AS Zaher and SDJ McArthur. A multi-agent fault detection
                                                                           system for wind turbine defect recognition and
In this work a solution for system-level anomaly detection                 diagnosis. In Power Tech, 2007 IEEE Lausanne, pages
was presented. Three main requirements are identified and                  22–27. IEEE, 2007.
satisfied: At first a hardware concept for sensor data                [6] Li Zhen, He Zhengjia, Zi Yanyang, and Chen Xuefeng.
acquisition in the heterogeneous environment of WPPs was                   Bearing condition monitoring based on shock
developed. This hardware logs existing sensor values and                   pulse method and improved redundant lifting scheme.
offers an adaptive solution to integrate new sensors on                    Mathematics and computers in simulation,
demand. Second, generic data-driven algorithms to                          79(3):318–338, 2008.
automatically compute a system-level model out of minimally           [7] Saad Chakkor, Mostafa Baghouri, and Abderrahmane
labeled, historical sensor data are presented. At last an                  Hajraoui. Performance analysis of faults detection
anomaly detection method has been shown, which reaches an                  in wind turbine generator based on high-resolution
F-Measure of 89.02% and a balanced accuracy of 91.46%. This                frequency estimation methods. arXiv preprint
solution is not specialized for specific parts of a WPP and                arXiv:1409.6883, 2014.
can be trained in a short period. With an extension of the            [8] E Jasiūnienė, R Raišutis, A Voleišis, A Vladišauskas,
standard PCA to a probabilistic PCA, the robustness of the                 D Mitchard, M Amos, et al. NDT of wind turbine blades
algorithmic solution against sensor failures is ensured.                   using adapted ultrasonic and radiographic techniques.
   In the future, this solution will be evaluated using data               Insight-Non-Destructive Testing and Condition
from more WPPs with different working environments. Beyond                 Monitoring, 51(9):477–483, 2009.
the task of anomaly detection, diagnosis of the root                  [9] Oliver Niggemann and Volker Lohweg. On the diagnosis
cause of an anomaly is also a sensible functionality of a CM               of cyber-physical production systems: State-of-the-art
system. The presented solution will be extended by a root                  and research agenda. In Association for the
cause analysis. Such an extension can support maintenance                  Advancement of Artificial Intelligence (AAAI), 2015.
personnel to trace the detected anomaly. Another focus will           [10] O. Bennouna, N. Heraud, and Z. Leonowicz. Condition
be the prognosis of anomalies in a WPP. To achieve this, an                monitoring and fault diagnosis system for offshore
appropriate algorithm will be developed to predict the future              wind turbines. In Environment and Electrical
system status using the learned model of the system behavior.              Engineering (EEEIC), 2012 11th International Conference
                                                                           on, pages 13–17, May 2012.
Acknowledgments                                                        [11] Ning Fang and Peng Guo. Wind generator tower vibra-
                                                                            tion fault diagnosis and monitoring based on pca. In
Funded by the German Federal Ministry for Economic Af-                      Control and Decision Conference (CCDC), 2013 25th
fairs and Energy, KF2074717KM3 & KF2074719KM3                               Chinese, pages 1924–1929, May 2013.






[12] M. Celik, F. Dadaser-Celik, and A.S. Dokuz. Anomaly
     detection in temperature data using DBSCAN algorithm.
     In Innovations in Intelligent Systems and Applications
     (INISTA), 2011 International Symposium on, pages
     91–95, June 2011.
[13] P. Baraldi, F. Di Maio, and E. Zio. Unsupervised clus-
     tering for fault diagnosis. In Prognostics and System
     Health Management (PHM), 2012 IEEE Conference
     on, pages 1–9, May 2012.
[14] Karlheinz Schwarz and Im Eichbaeumle. IEC 61850,
     IEC 61400-25 and IEC 61970: Information models and
     information exchange for electric power systems.
     Proceedings of the Distributech, pages 1–5, 2004.
[15] Richard A Zatorski. System and method for control-
     ling multiple devices via general purpose input/output
     (gpio) hardware, June 27 2006. US Patent 7,069,365.
[16] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
     Introduction to Data Mining. Addison-Wesley, 2006.
[17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
     density-based algorithm for discovering clusters in
     large spatial databases with noise. In Proceedings of
     the 2nd International Conference on Knowledge Dis-
     covery and Data Mining (KDD96), 1996.
[18] Ulrike von Luxburg. A tutorial on spectral clustering.
     In Statistics and Computing, 17 (4), 2007.
[19] Shifei Ding, Liwen Zhang, and Yu Zhang. Research on
     spectral clustering algorithms and prospects. In Com-
     puter Engineering and Technology (ICCET), 2010 2nd
     International Conference on, volume 6, 16-18 2010.
[20] Lei Nie, Shouguo Wu, Jianwei Wang, Longzhen
     Zheng, Xiangqin Lin, and Lei Rui. Continuous
     wavelet transform and its application to resolving and
     quantifying the overlapped voltammetric peaks. Ana-
     lytica chimica acta, 450(1):185–192, 2001.
[21] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
     Introduction to Data Mining. Addison-Wesley, 2006.
[22] Michael E Tipping and Christopher M Bishop. Prob-
     abilistic principal component analysis. Journal of the
     Royal Statistical Society: Series B (Statistical Method-
     ology), 61(3):611–622, 1999.








         Using Incremental SAT for Testing Diagnosability of Distributed DES

                    Hassan IBRAHIM1 and Philippe DAGUE1 and Laurent SIMON2
                              1
                                LRI, Univ. Paris-Sud and CNRS, Orsay, France
                                 hassan.ibrahim@lri.fr, philippe.dague@lri.fr
                          2
                            LaBRI, Univ. Bordeaux and CNRS, Bordeaux, France
                                               lsimon@labri.fr


                         Abstract                                     positioning the sensors to manage the observation requirements.
                                                                      The main difficulty in diagnosability algorithms is
     We extend in this work the existing approach to                  related to the state space explosion. Another difficulty
     analyse diagnosability in discrete event systems                 appears when checking diagnosability of a system which
     (DES) using satisfiability algorithms (SAT), in order            is actually diagnosable, i.e. the nonexistence of a
     to analyse the diagnosability in distributed                     counterexample witnessing non diagnosability. Thus all
     DES (DDES) and we test this extension. For this,                 possibilities need to be tested, as for proving the nonexistence
     we handle observable and non observable communication            of a plan in a planning problem, and usually in this case some
     events at the same time. We also propose                         approximations are used to avoid exploring all the search
     an adaptation to use incremental SAT over                        space.
     the existing and the extended approaches to over-                 The paper is structured as follows. Section 2 will introduce
     come some of the limitations, especially concern-                 the system transition models for centralized DES and recall
     ing the length and the distance of the cycles that                the traditional definition of the diagnosability in those mod-
     witness the non diagnosability of the fault, and                  els and the state of the art of encoding this definition as a
     improve the process of dealing with the reacha-                   satisfiability problem in propositional logic. Section 3 will
     bility limit when scaling up to large systems.                    present our first contribution, an extension of this state of the
                                                                       art to DDES with observable and non observable communi-
1 Introduction                                                         cation events in the same model, and will give experimental
                                                                       results of this extension. Section 4 is devoted to our sec-
Diagnosis task is mainly using the available observations to
                                                                       ond contribution, using incremental SAT calls to overcome
explain the difference between the expected behavior of a
                                                                       the limitation when the number of steps required to check
system and its real behavior which may contain some faults.
                                                                       diagnosability, i.e., the length of possible paths with cycles
Many works have been done to study the automatic ap-
                                                                       witnessing non diagnosability, is large, and will present ex-
proaches to system fault diagnosis. They all try to deal with
                                                                       perimental results showing how the method scales up. Sec-
the main problem, i.e. the compromise between the number
                                                                       tion 5 will present related works and section 6 will conclude
of possible diagnoses to the considered faults and the num-
                                                                       and give our perspectives for future work.
ber of observations which must be given to make the deci-
sion. Diagnosis problem is NP-hard and one always needs
to cope with an explosion in the number of system model                2 Using SAT in Diagnosability Analysis of
states. Moreover, the diagnosis decision is not always cer-              Centralized Systems
tain, and thus running a diagnosis algorithm may not be ac-
curate. For example, two sets of observations provided by              We recall first the definitions of DES models we use and of
different sets of sensors or at different times may lead to            diagnosability for these models.
different diagnoses. This uncertainty raises the problem of
diagnosability which is essential while designing the system           2.1 Preliminaries
model. After that, the model based diagnosis will be used              We will use finite state machines (FSM) to model systems.
in applications to explain any anomaly, with a guarantee of            We define labeled transition systems following [1].
correctness and precision at least for anticipated faults.
Diagnosability of the considered systems is a property de-             Definition 1. A Labeled Transition System (LTS) is
fined to answer the question about the possibility to distin-          a tuple T = hX, Σo , Σu , Σf , δ, s0 i where:
guish any possible faulty behavior in the system from any
other behavior without this fault (i.e., correct or with a dif-          • X is a finite set of states,
ferent fault) within a finite time after the occurrence of the           • Σo is a finite set of observable correct events,
fault. A fault is diagnosable if it can be surely identified
from the partial observation available in a finite delay af-             • Σu is a finite set of unobservable correct events,
ter its occurrence. A system is diagnosable if every possible            • Σf is a finite set of unobservable faulty events,
fault in it is diagnosable. This property provides information
                                                                         • δ ⊆ X × (Σo ∪ Σu ∪ Σf ) × X is the transition relation,
before getting into finding the explanations of the fault. It
also helps in designing a robust system against faults and in            • s0 is the initial state.
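
Definition 1 can be mirrored almost literally as a small data structure. The Python sketch
below is only an illustration and is not part of the authors' tool (which is written in
Java): the class name LTS, its field names, the method successors and the toy system are all
hypothetical, and the sketch additionally assumes that the three event sets are pairwise
disjoint, which Definition 1 suggests but does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LTS:
    """Illustrative container for the tuple <X, So, Su, Sf, delta, s0> of Definition 1."""
    states: frozenset        # X
    observable: frozenset    # So, observable correct events
    unobservable: frozenset  # Su, unobservable correct events
    faults: frozenset        # Sf, unobservable faulty events
    transitions: frozenset   # delta, a set of (source, event, target) triples
    initial: str             # s0

    def __post_init__(self):
        # Assumption of this sketch: the event sets are pairwise disjoint.
        if (self.observable & self.unobservable
                or self.observable & self.faults
                or self.unobservable & self.faults):
            raise ValueError("event sets must be pairwise disjoint")
        if self.initial not in self.states:
            raise ValueError("the initial state must belong to X")
        events = self.observable | self.unobservable | self.faults
        # delta must be a subset of X x (So u Su u Sf) x X.
        for src, ev, dst in self.transitions:
            if src not in self.states or dst not in self.states or ev not in events:
                raise ValueError(f"ill-formed transition {(src, ev, dst)}")

    def successors(self, state):
        """Return the set of (event, target) pairs enabled in `state`."""
        return {(ev, dst) for src, ev, dst in self.transitions if src == state}


# Hypothetical toy system: from x0, either the fault f or the silent event u occurs,
# after which the observable event o1 loops forever.
toy = LTS(
    states=frozenset({"x0", "x1", "x2"}),
    observable=frozenset({"o1"}),
    unobservable=frozenset({"u"}),
    faults=frozenset({"f"}),
    transitions=frozenset({
        ("x0", "f", "x1"), ("x0", "u", "x2"),
        ("x1", "o1", "x1"), ("x2", "o1", "x2"),
    }),
    initial="x0",
)
```

Calling toy.successors("x0") returns the two transitions leaving the initial state, and
building an LTS whose transition uses an undeclared state or event raises a ValueError.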






In [2] the authors used an equivalent but more compact representation than LTS for modeling
systems in order to analyze their diagnosability: succinct transition systems, which exploit
the regularity in the structure of the systems and are expressed in terms of propositional
variables. This allowed them to translate more easily to a SAT problem the twin plant method
proposed by [3] for checking diagnosability.

As we aim at studying the diagnosability of DDES using SAT solvers, we will follow the model
of [2], who studied the same problem in centralized DES. It represents the system states by
the valuations of a finite set A of Boolean state variables, where valuation changes reflect
the transitions between states according to the events. The set of all literals issued from
A is L = A ∪ {¬a | a ∈ A} and ℒ is the language over A that consists of all formulas that
can be formed from A and the connectives ∨ and ¬. We use the standard definitions of further
connectives Φ ∧ Ψ ≡ ¬(¬Φ ∨ ¬Ψ), Φ → Ψ ≡ ¬Φ ∨ Ψ and Φ ↔ Ψ ≡ (Φ → Ψ) ∧ (Ψ → Φ). The
transition relation is defined to allow two or more events to take place simultaneously.
Thus each event is described by a set of pairs ⟨φ, c⟩ which represent its possible ways of
occurrence by indicating that the event can be associated with changes $c \in 2^{L}$ in
states that satisfy the condition φ ∈ ℒ.

Definition 2. A Succinct Transition System (SLTS) is described by a tuple
T = ⟨A, Σo, Σu, Σf, δ, s0⟩ where:

  • A is a finite set of state variables,
  • Σo is a finite set of observable correct events,
  • Σu is a finite set of unobservable correct events,
  • Σf is a finite set of unobservable faulty events,
  • $\delta : \Sigma = \Sigma_o \cup \Sigma_u \cup \Sigma_f \rightarrow 2^{\mathcal{L} \times 2^{L}}$
    assigns to each event a set of pairs ⟨φ, c⟩,
  • s0 is the initial state (a valuation of A).

It is straightforward to show that any LTS can be represented as an SLTS (one takes
⌈log(|X|)⌉ Boolean variables and represents states by different valuations of these
variables; one assigns to each occurrence of an event e labeling a transition (x, e, y) a
pair ⟨φ, c⟩, with φ expressing the valuation of x and c the valuation changes between x and
y). Reciprocally, any SLTS can be mapped to an LTS (see Definition 2.4 in [2]).

The formal definition of diagnosability of a fault f in a centralized system modeled by an
LTS or SLTS T was proposed by [1] as follows:

Definition 3. Diagnosability. A fault f is diagnosable in a system T iff

\[ \exists k \in \mathbb{N},\ \forall s_f \in L(T),\ \forall t \in L(T)/s_f,\ |t| \geq k \Rightarrow
   \forall p \in L(T),\ (P(p) = P(s_f.t) \Rightarrow f \in p). \]

In this formula, L(T) denotes the prefix-closed language of T whose words are called
trajectories, $s_f$ any trajectory ending with the fault f, L(T)/s the post-language of L(T)
after s, i.e., {t ∈ Σ* | s.t ∈ L(T)}, and P the projection of trajectories on observable
events. The above definition states that for each trajectory $s_f$ ending with fault f in T,
for each t that is an extension of $s_f$ in T with enough events, every trajectory p in T
that is equivalent to $s_f.t$ in terms of observation should contain f. As usual, it will be
assumed that L(T) is live (i.e., for any state, there is at least one transition issued from
this state) and convergent (i.e., there is no cycle made up only of unobservable events).

A system T is said to be diagnosable iff any fault f ∈ Σf is diagnosable in T. In order to
avoid exponential complexity in the number of faults during diagnosability analysis, only
one fault at a time is checked for diagnosability. It will thus be assumed in the following
that there exists only one fault event f (Σf = {f}), without restriction on the number of
its occurrences. Diagnosability checking has been proved in [3] to be polynomial in the
number |X| of states for LTS, so exponential in the number |A| of state variables for SLTS
(actually the problem is NLOGSPACE-complete for LTS and PSPACE-complete for SLTS [4]).

2.2 SLTS Diagnosability as Satisfiability

An immediate rephrasing of Definition 3 shows that T is non diagnosable iff there exists a
pair of trajectories corresponding to cycles (and thus to infinite paths), a faulty one and
a correct one, sharing the same observable events. This is equivalent to the existence of an
ambiguous cycle (i.e. a cycle made up of pairs of states respectively reachable by a faulty
path and a correct path) in the product of T by itself, synchronized on observable events,
which is at the origin of the so called twin plant structure introduced in [3]. This non
diagnosability test was formulated in [2] as a satisfiability problem in propositional
logic. We recall below this encoding with the variables and the formulas used, where
superscripts t refer to time points and $(e^t_o)$ and $(\hat{e}^t_o)$ refer respectively to
the faulty and correct sequences of event occurrences (the corresponding states being
described by valuations of $(a^t)$ and $(\hat{a}^t)$) of a pair of trajectories witnessing
non diagnosability (so sharing the same observable events, represented by $(e^t)$, and
forming a cycle). Increasing the time step corresponds to the triggering of at least one
transition and the extension by an event of at least one of the two trajectories.
T = ⟨A, Σo, Σu, Σf, δ, s0⟩ being an SLTS, the propositional variables are thus:

  • $a^t$ and $\hat{a}^t$ for all a ∈ A and t ∈ {0, ..., n},
  • $e^t_o$ for all e ∈ Σo ∪ Σu ∪ Σf, o ∈ δ(e) and t ∈ {0, ..., n − 1},
  • $\hat{e}^t_o$ for all e ∈ Σo ∪ Σu, o ∈ δ(e) and t ∈ {0, ..., n − 1},
  • $e^t$ for all e ∈ Σo and t ∈ {0, ..., n − 1}.

The following formulas express the constraints that must be applied at each time step t or
between t and t + 1.

  1. The event occurrence $e^t_o$ must be possible in the current state:

     \[ e^t_o \rightarrow \varphi^t \quad \text{for } o = \langle \varphi, c \rangle \in \delta(e) \tag{2.1} \]

     and its effects must hold at the next time step:

     \[ e^t_o \rightarrow \bigwedge_{l \in c} l^{t+1} \quad \text{for } o = \langle \varphi, c \rangle \in \delta(e) \tag{2.2} \]

     We have the same formulas with $\hat{e}^t_o$.

  2. The present value (True or False) of a state variable changes to a new value (False or
     True, respectively) only if there is a reason for this change, i.e., because of an
     event that has the new value in its effects (so, change without reason is prohibited).
     Here is the change from






     True to False (the change from False to True is defined similarly by interchanging a
     and ¬a):

     \[ (a^t \wedge \neg a^{t+1}) \rightarrow (e^{t}_{i_1 o_{j_1}} \vee \cdots \vee e^{t}_{i_k o_{j_k}}) \tag{2.3} \]

     where the $o_{j_l} = \langle \varphi_{j_l}, c_{j_l} \rangle \in \delta(e_{i_l})$ are all the
     occurrences of events $e_{i_l}$ with $\neg a \in c_{j_l}$.
     We have the same formulas with $\hat{a}^t$ and $\hat{e}^{t}_{i_l o_{j_l}}$.

  3. At most one occurrence of a given event can occur at a time and the occurrences of two
     different events cannot be simultaneous if they interfere (i.e., if they have two
     contradicting effects or if the precondition of one contradicts the effect of the
     other):

     \[ \neg(e^t_o \wedge e^t_{o'}) \quad \forall e \in \Sigma,\ \forall \{o, o'\} \subseteq \delta(e),\ o \neq o' \tag{2.4} \]

     \[ \neg(e^t_o \wedge {e'}^{t}_{o'}) \quad \forall \{e, e'\} \subseteq \Sigma,\ e \neq e',\ \forall o \in \delta(e),\
        \forall o' \in \delta(e') \text{ such that } o \text{ and } o' \text{ interfere} \tag{2.5} \]

     We have the same formulas with $\hat{e}^t_o$.

  4. The formulas that connect the two event sequences require that observable events take
     place in both sequences whenever they take place (use of $e^t$):

     \[ \bigvee_{o \in \delta(e)} e^t_o \leftrightarrow e^t \quad \text{and} \quad
        \bigvee_{o \in \delta(e)} \hat{e}^t_o \leftrightarrow e^t \qquad \forall e \in \Sigma_o \tag{2.6} \]

The conjunction of all the above formulas for a given t is denoted by T(t, t + 1).
A formula for the initial state s0 is:

\[ I_0 = \bigwedge_{a \in A,\ s_0(a) = 1} (a^0 \wedge \hat{a}^0) \ \wedge
         \bigwedge_{a \in A,\ s_0(a) = 0} (\neg a^0 \wedge \neg \hat{a}^0) \tag{2.7} \]

Finally, the following formula can be defined to encode the fact that a pair of executions
is found with the same observable events and no fault in one execution (first line), but one
fault in the other (second line), which are infinite (in the form of a non trivial cycle, so
containing at least one observable event,¹ at step n; third line), witnessing non
diagnosability:

\[ \begin{aligned}
   \Phi^T_n = {} & I_0 \wedge T(0, 1) \wedge \cdots \wedge T(n-1, n) \ \wedge \\
   & \bigvee_{t=0}^{n-1} \bigvee_{e \in \Sigma_f} \bigvee_{o \in \delta(e)} e^t_o \ \wedge \\
   & \bigvee_{m=0}^{n-1} \Big( \bigwedge_{a \in A} \big( (a^n \leftrightarrow a^m) \wedge (\hat{a}^n \leftrightarrow \hat{a}^m) \big)
     \wedge \bigvee_{t=m}^{n-1} \bigvee_{e \in \Sigma_o} e^t \Big)
   \end{aligned} \]

From this encoding in propositional logic follows the result (theorem 3.2 of [2]) that an
SLTS T is not diagnosable if and only if ∃n ≥ 1, $\Phi^T_n$ is satisfiable. It is also
equivalent to $\Phi^T_{2^{2|A|}}$ being satisfiable, as the number of twin plant states is an
obvious upper bound for n, but one that is often impractically high (see in [2] some ways to
deal with this problem).

¹ This verification that the cycle found is not trivial was not done in [2]; this is why the
authors had to add for each time point a formula, not needed here, guaranteeing that at
least one event took place, to avoid silent loops with no state change.

3 Using SAT in Diagnosability Analysis of Distributed Systems

We extend from centralized systems to distributed systems the satisfiability framework of
subsection 2.2 for testing diagnosability and we provide some experimental results.

3.1 DDES Modeling

In order to model DDES with SLTS, we need to extend the latter by adding communication
events to each component. So we use the following definition for a distributed SLTS with k
different components (sites):

Definition 4. A Distributed Succinct Transition System (DSLTS) with k components is
described by a tuple T = ⟨A, Σo, Σu, Σf, Σc, δ, s0⟩ where (subscripts i refer to
component i):

  • A is a union of disjoint finite sets $(A_i)_{1 \leq i \leq k}$ of component own state
    variables, $A = \cup_{i=1}^{k} A_i$,
  • Σo is a union of disjoint finite sets of component own observable correct events,
    $\Sigma_o = \cup_{i=1}^{k} \Sigma_{o_i}$,
  • Σu is a union of disjoint finite sets of component own unobservable correct events,
    $\Sigma_u = \cup_{i=1}^{k} \Sigma_{u_i}$,
  • Σf is a union of disjoint finite sets of component own unobservable faulty events,
    $\Sigma_f = \cup_{i=1}^{k} \Sigma_{f_i}$,
  • Σc is a union of finite sets of (observable or unobservable) correct communication
    events, $\Sigma_c = \cup_{i=1}^{k} \Sigma_{c_i}$, which are the only events shared by at
    least two different components (i.e., $\forall i, \forall c \in \Sigma_{c_i}, \exists j \neq i, c \in \Sigma_{c_j}$),
  • $\delta = (\delta_i)$, where
    $\delta_i : \Sigma_i = \Sigma_{o_i} \cup \Sigma_{u_i} \cup \Sigma_{f_i} \cup \Sigma_{c_i} \rightarrow 2^{\mathcal{L}_i \times 2^{L_i}}$,
    assigns to each event a set of pairs ⟨φ, c⟩ in the propositional language of the
    component where it occurs (so, for communication events, in each component separately
    where they occur),
  • $s_0 = (s_{0i})$ is the initial state (a valuation of each $A_i$).

In this distributed framework, synchronous communication is assumed, i.e., communication
events are synchronized such that they all occur simultaneously in all components where they
appear. More precisely, a transition by a communication event c may occur in a component iff
a simultaneous transition by c occurs in all the other components where c appears (has at
least one occurrence). In particular, all events before c in trajectories in all these
components necessarily occur before all events after c in these trajectories. The global
model of the system is thus nothing else than the product of the models of the components,
synchronized on communication events. Notice that, in full generality, we allow
communication events to be partially or totally unobservable, so one has in general to wait
for further observations to know that some communication event occurred between two or more
components. On the other hand, assuming these communications to be faultless is not actually
a limitation. If a communication process or protocol may be faulty, it just has to be
modeled as a proper component with its own correct and faulty behaviors (as is done, e.g.,
for a wire in an electrical circuit). In this sense, communications between components are
just a modeling concept, not subject to diagnosis. It will also be assumed that the
observable information is global, i.e. centralized (when observable information is only
local to each component, distributed diagnosability checking becomes undecidable [5]),
which allows us to keep Definition 3 for diagnosability.

3.2 DSLTS Diagnosability as Satisfiability

Let T be a DSLTS made up of k components denoted by indexes i, 1 ≤ i ≤ k. In order to
express the diagnosability analysis of T as a satisfiability problem, we have to extend






the formulas of subsection 2.2 to deal with communication events between components. Let
Σc = Σco ∪ Σcu be the communication events, with $\Sigma_{co} = \cup_{i=1}^{k} \Sigma_{co_i}$
the observable ones and $\Sigma_{cu} = \cup_{i=1}^{k} \Sigma_{cu_i}$ the unobservable ones.

The idea is to treat each communication event as any other event in each of its owners and,
as has been done with events $e^t$ for e ∈ Σo for synchronizing observable event occurrences
in the two executions, to introduce in the same way a global reference variable for each
communication event at each time step, in charge of synchronizing any communication event
occurrence in any of its owners with occurrences of it in all its other owners. We use one
such reference variable for each trajectory, $e^t$ and $\hat{e}^t$, for unobservable events
e ∈ Σcu, and only one for both trajectories, $e^t$, for observable events e ∈ Σco, as it
will in addition play the role of synchronizing observable events between trajectories
exactly as the $e^t$ for e ∈ Σo. So, we add to the previous propositional variables the
following new ones:

  • $e^t_o$, $\hat{e}^t_o$ for all e ∈ Σc, $o \in \delta(e) = \cup_i \delta_i(e)$ and
    t ∈ {0, ..., n − 1},
  • $e^t$ for all e ∈ Σc, $\hat{e}^t$ for all e ∈ Σcu and t ∈ {0, ..., n − 1}.

Formulas in T(t, t + 1) are extended as follows.

  1. Formulas (2.1), (2.2), (2.3) and (2.5) extend unchanged to $e^t_o$ and $\hat{e}^t_o$
     ∀e ∈ Σc, expressing that a communication event must be possible and has effects in each
     of its owner components and that two such different events cannot be simultaneous if
     they interfere.

  2. Formulas (2.4) extend to prevent two simultaneous occurrences of a given communication
     event in the same owner component, i.e. apply
     $\forall e \in \Sigma_c, \forall i, \forall \{o_i, o_i'\} \subseteq \delta_i(e), o_i \neq o_i'$
     and the same with $\hat{e}$ (obviously they do not apply to different owner components,
     by the very definition of communication events).

  3. Finally, the following new formulas express the communication process itself, i.e. the
     synchronization of the occurrences of any communication event e in all its owner
     components (S(e) being the set of indexes of the owner components of e) and also extend
     formulas (2.6) to observable communication events:

     \[ \bigvee_{o_i \in \delta_i(e)} e^t_{o_i} \leftrightarrow e^t \quad \text{and} \quad
        \bigvee_{o_i \in \delta_i(e)} \hat{e}^t_{o_i} \leftrightarrow \hat{e}^t
        \qquad \forall e \in \Sigma_{cu},\ \forall i \in S(e) \]

     \[ \bigvee_{o_i \in \delta_i(e)} e^t_{o_i} \leftrightarrow e^t \quad \text{and} \quad
        \bigvee_{o_i \in \delta_i(e)} \hat{e}^t_{o_i} \leftrightarrow e^t
        \qquad \forall e \in \Sigma_{co},\ \forall i \in S(e) \]

The formula $\Phi^T_n$ is unchanged except that, in the verification that the found cycle
(third line) is not trivial, any observable event can be used, so the final disjunction of
events $e^t$ is extended to all e ∈ Σo ∪ Σco. We thus have the result that a DSLTS T is not
diagnosable if and only if ∃n ≥ 1, $\Phi^T_n$ is satisfiable.

3.3 Implementation and Experimental Testing

We have implemented the above extension in Java. We used the well designed API of the SAT
solver Sat4j [6]. Although more efficient solvers could have been chosen, Sat4j fitted well
with our clause generator written in Java, and only a limited speed-up can be expected from
C++ solvers (a speed-up of 4, i.e. a 75% reduction of the runtime, is often observed).

We have tested our tool on small examples with several communication events with multiple
occurrences (three communicating components) with global communication (all components
share the same event) or partial communication (only some components share the same event),
as in Figure 1, which was the running example in [7].

Figure 1: A DDES made up of 3 components C1, C2 and C3 from left to right. $c_i$, 1 ≤ i ≤ 2,
are unobservable communication events, $o_i$, 1 ≤ i ≤ 5, are observable events and $f_i$,
1 ≤ i ≤ 2, are faulty events.

The total number of propositional variables VarsNum in the generated formula $\Phi^T_n$
after n steps is:

\[ \mathit{VarsNum} = n \times \Big( 2|A| + 3 \sum_{i=1}^{\mathit{Obs}} \mathit{ObOcc}_i
   + \sum_{i=1}^{\mathit{Faults}} \mathit{FaultOcc}_i + 2 \sum_{i=1}^{\mathit{Unobs}} \mathit{UnobOcc}_i \Big) \]

where:
  • |A| is the total number of state variables,
  • Obs is the total number of observable events,
  • $\mathit{ObOcc}_i$ is the total number of occurrences of the observable event $e_i$,
  • Faults is the total number of faults,
  • $\mathit{FaultOcc}_i$ is the total number of occurrences of the faulty event $e_i$,
  • Unobs is the total number of unobservable correct events,
  • $\mathit{UnobOcc}_i$ is the total number of occurrences of the unobservable correct
    event $e_i$.

The results are in Table 1, where the columns show the system and the fault considered
(3 cases), the number of steps n, whether the formula is satisfiable, the numbers of
variables and clauses and the runtime.

  System        Fault   Steps   SAT?   Variables   Clauses   Runtime (ms)
  C2            f2          4   No           106       628             27
  C2            f2          5   Yes          131       783             15
  C2, C3        f2          5   No           225      1157             28
  C2, C3        f2         32   No          1386      7340            641
  C2, C3        f2         64   No          2762     14668           1422
  C2, C3        f2        128   No          5514     29324           5061
  C2, C3        f2        256   No         11018     58636          18970
  C2, C3        f2        512   No         22026    117260         130164
  C2, C3        f2       1024   No         44042    234508         548644
  C1, C2, C3    f1          8   No           576      3546             91
  C1, C2, C3    f1          9   Yes          646      3987            110

  Table 1: Results on the example of Figure 1.

This means that f2 is not diagnosable in C2 alone while it becomes diagnosable when
synchronizing C2 and C3. For this last result, we have increased the number of steps until
reaching $2^{2|A|}$, which is the theoretical upper bound of the twin plant states
represented in the logical formula. As in general it is not always possible to reach this
bound in practice, we propose in section 4 using incremental SAT to improve the management
of an increasing number of steps. While






$f_1$ is not diagnosable even after synchronizing all three components together. The numbers of variables and clauses are small in comparison to what SAT solvers can handle (up to hundreds of thousands of propositional variables and millions of clauses). These tests are mentioned as a proof of concept. However, to test the tool on larger systems, and because of the absence of benchmarks in the literature, we have created in subsection 4.2 an example that can be scaled up.

4 Adaptation to Incremental SAT Diagnosability Checking

We adapt satisfiability algorithms for checking diagnosability of both centralized (subsection 2.2) and distributed (subsection 3.2) DES in order to incrementally process the maximum length of paths with cycles searched for witnessing non diagnosability, and we provide experimental results.

4.1 Diagnosability as Incremental Satisfiability

Two cases have to be distinguished while testing diagnosability using SAT solvers to verify the satisfiability of the logical formula $\Phi^T_n$ for a given $n$ [2]. The first case is when we find a model for $\Phi^T_n$, which definitely indicates the non diagnosability of the studied fault. The second case is when we do not find such a model: this result indicates only that the studied fault has not been found non diagnosable for this value of $n$. In other words, after testing all the possible first $n$ steps, we did not find a pair of executions of length at most $n$ containing cycles such that one of them contains the fault and not the other and such that the two executions are equivalent in terms of observation. However, as the theoretical upper bound $n = 2^{2|A|}$, which would guarantee that the fault is actually diagnosable, is often unreachable in practice, such a pair may exist for a greater value of $n$. Testing it means increasing $n$, rebuilding the logical formula $\Phi^T_n$ and then calling the SAT solver again.

Instead, we propose to adapt the formula $\Phi^T_n$ in order to be tested in an incremental SAT mode by multiple calls to a Conflict Driven Clause Learning (CDCL) solver. Using CDCL solvers in a specialized, incremental, mode is relatively new but already widely used [8] in many applications. In this operation mode, the solver can be called many times with different formulas. However, solvers are designed to work with similar formulas, where clauses are removed and added from call to call. Learnt clauses can be kept as long as the solver can ensure that the clauses used to derive them are not removed. This is generally done by adding specialized variables, called assumptions, to each clause that can be removed. By assuming the variable to be False, the clause is activated and by assuming the variable to be True, the clause is trivially satisfied and no longer used by the solver. What is interesting for our purpose is that the CDCL solver can save clauses learnt during the previous calls and test multiple assumptions in each new call. This means that after $n$ steps we hope that the solver will have learnt some constraints about the behavior of the system. Although we are interested in testing the diagnosability property on a defined system, this property is independent from the system behavior, which can be learnt by the solver from the previous calls.

In order to extend the clauses representation given in subsections 2.2 and 3.2 to this mode of operation, we propose to divide the formula $\Phi^T_n$ in two parts. The first part $T_n$ describes the first $n$ steps, synchronized on the observations, of the behavior of both trajectories (represented by the conjunction of formulas $T(t, t+1)$, $0 \le t \le n-1$, representing the $(t+1)$th step). The second part $D_n$ describes the diagnosability property at step $n$, i.e., the occurrence of a fault in the $n$ previous steps of the faulty trajectory (given by the formula $F_n$) and the detection of a cycle at step $n$ (given by the formula $C_n$). So we obtain, for $n \ge 1$:

\[ \Phi^T_n = T_n \wedge D_n \]
\[ T_n = I_0 \wedge \bigwedge_{t=0}^{n-1} T(t, t+1) \qquad D_n = F_n \wedge C_n \]
\[ F_n = \bigvee_{t=0}^{n-1} \bigvee_{e \in \Sigma_f} \bigvee_{o \in \delta(e)} e^t_o \]
\[ C_n = \bigvee_{m=0}^{n-1} \Big( \bigwedge_{a \in A} \big( (a^n \leftrightarrow a^m) \wedge (\hat{a}^n \leftrightarrow \hat{a}^m) \big) \wedge \bigvee_{t=m}^{n-1} \bigvee_{e \in \Sigma_o} e^t \Big) \]

Now add at each step $j$ a control variable $h_j$ that disables (when its truth value is False) or activates (when its truth value is True) the formulas $F_j$ and $C_j$, and keep at step $n$ all these controlled formulas for $1 \le j \le n$. We obtain the following formula $\Psi^T_n$, for $n \ge 1$:

\[ \Psi^T_n = T_n \wedge \bigwedge_{j=1}^{n} D'_j \qquad D'_j = F'_j \wedge C'_j \qquad 1 \le j \le n \]
\[ F'_j = \neg h_j \vee F_j \qquad C'_j = \neg h_j \vee C_j \qquad 1 \le j \le n \]

We thus have the equivalence, for all $n \ge 1$:

\[ \Phi^T_n \equiv \Psi^T_n \wedge h_n \wedge \bigwedge_{j=1}^{n-1} \neg h_j \]

This allows one, for all $n \ge 1$, to replace the SAT call on $\Phi^T_n$ by a SAT call on $\Psi^T_n$ under the control variables setting given by $H_n = \{\neg h_1, \ldots, \neg h_{n-1}, h_n\}$ (indicated in a second argument of the call):

\[ SAT(\Phi^T_n) = SAT(\Psi^T_n, H_n) \]

The idea is now to consider the control variables $h_j$ as assumptions and use incremental SAT calls $IncSAT_j$ under varying assumptions, for $1 \le j \le n$. For this, we use the following recurrence relationships for both formulas $\Psi^T_j$ and assumptions $H_j$:

\[ \Psi^T_0 = I_0 \qquad \Psi^T_{j+1} = \Psi^T_j \wedge T(j, j+1) \wedge D'_{j+1} \qquad j \ge 0 \]
\[ H_1 = \{h_1\} \qquad H_{j+1} = H_j[\{\neg h_j, h_{j+1}\}] \qquad j \ge 1 \]

where the notation $H_j[\{ass_i\}]$ means updating in $H_j$ the assumptions $h_i$ by their new settings $ass_i$, i.e., in the formula above, replacing the truth value of $h_j$, which was True, by False, and adding the new assumption $h_{j+1}$ with truth value True. From these relationships, the unique call to SAT under given assumptions $SAT(\Psi^T_n, H_n)$ can be replaced, starting with the set of clauses $I_0$, by multiple calls, $0 \le j \le n-1$, to an incremental SAT under varying assumptions:

\[ IncSAT_{j+1}(NewClauses_{j+1}, NewAssumptions_{j+1}) = IncSAT_{j+1}(T(j, j+1) \wedge D'_{j+1}, \{\neg h_j, h_{j+1}\}) \qquad (4.1) \]

If $IncSAT_j$ answers SAT, the search is stopped as non diagnosability is proved; if it answers UNSAT, then $IncSAT_{j+1}$ is called.
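The control flow of equation (4.1) can be illustrated by the following minimal Python sketch. It is not the authors' implementation: the system is a hypothetical 2-bit counter standing in for the twin plant, the property "counter equals 3 at step j" stands in for $D_j$, and the helper names (`b0`, `b1`, `h`, `transition`, `solve`, `first_sat_step`) are invented for the example. The tiny DPLL routine is only a stand-in for a solver call under assumptions; unlike a real incremental CDCL solver, it learns nothing and keeps nothing between calls.

```python
# Toy illustration of incremental SAT calls under assumptions (equation 4.1).
# Assumption: literals are DIMACS-style non-zero integers; all names are hypothetical.

def b0(t): return 3 * t + 1   # low bit of the counter at step t
def b1(t): return 3 * t + 2   # high bit of the counter at step t
def h(t):  return 3 * t + 3   # control variable h_t guarding the property at step t

def transition(t):
    """Clauses for T(t, t+1): the 2-bit counter is incremented by one."""
    x0, x1, y0, y1 = b0(t), b1(t), b0(t + 1), b1(t + 1)
    return [
        [y0, x0], [-y0, -x0],                  # y0 <-> not x0
        [-y1, x1, x0], [-y1, -x1, -x0],        # y1 <-> (x1 xor x0)
        [y1, -x1, x0], [y1, x1, -x0],
    ]

def simplify(clauses, lit):
    """Assign lit to True; return the reduced clause list, or None on conflict."""
    out = []
    for clause in clauses:
        if lit in clause:
            continue                           # clause already satisfied
        reduced = [l for l in clause if l != -lit]
        if not reduced:
            return None                        # empty clause: conflict
        out.append(reduced)
    return out

def dpll(clauses, model):
    if clauses is None:
        return None
    if not clauses:
        return model
    unit = next((c[0] for c in clauses if len(c) == 1), None)
    if unit is not None:
        return dpll(simplify(clauses, unit), model | {unit})
    lit = clauses[0][0]
    for choice in (lit, -lit):
        result = dpll(simplify(clauses, choice), model | {choice})
        if result is not None:
            return result
    return None

def solve(clauses, assumptions):
    """Stand-in for SAT(Psi, H): satisfiability under the given assumptions."""
    current = [list(c) for c in clauses]
    for a in assumptions:
        current = simplify(current, a)
        if current is None:
            return None
    return dpll(current, set(assumptions))

def first_sat_step(max_steps):
    """Return the first j for which the guarded property is satisfiable, else None."""
    clauses = [[-b0(0)], [-b1(0)]]                           # I_0: counter starts at 0
    for j in range(1, max_steps + 1):
        clauses += transition(j - 1)                         # add T(j-1, j)
        clauses += [[-h(j), b0(j)], [-h(j), b1(j)]]          # add D'_j = not h_j or D_j
        assumptions = [-h(i) for i in range(1, j)] + [h(j)]  # H_j
        if solve(clauses, assumptions) is not None:
            return j                                         # SAT: stop the search
    return None                                              # UNSAT up to max_steps

if __name__ == "__main__":
    print(first_sat_step(5))
```

The clause set only grows from one iteration to the next, and the guarded clauses of earlier steps are neutralized by assuming $\neg h_i$ rather than being deleted, which is what would let a real incremental solver keep its learnt clauses.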






   Notice that we used a unique assumption $h_j$ for controlling both $F_j$ and $C_j$, as non diagnosability checking requires the presence of both a fault occurrence in the faulty trajectory and of a cycle. But the same framework allows the independent control of formulas by separate assumptions. For the sake of simplicity, we also assumed that we called IncSAT at each step, but this is not mandatory and the indexes $j$ for the successive calls can be decoupled from the indexes $t$ for steps. We should also say that, even if IncSAT allows us to reactivate an already disabled clause, we are sure in our case never to use this function (when $h_k$ has been set to False, it always remains so) and we can thus force the solver to do a hard simplification process that removes the forgotten clauses permanently. As a result of our adaptation we will be able to scale up the size of the tested system and the distance and length of a cycle witnessing non diagnosability.

4.2 Experimental Results

We show in this subsection a comparison between our adapted version of subsection 4.1, which uses incremental SAT, and the previous versions, for the centralized model (subsection 2.2 following [2]) and for the distributed model (subsection 3.2). We have created the example in Figure 2, which contains $2k + 1$ components: one faulty component and two sets of $k$ neighboring components. The faulty component has two separate paths, each one containing $k$ different successive unobservable events $c_i$ and ending with the same observable cycle of length 1, but only one of them contains the fault. The centralized model is limited to this faulty component alone and thus in this case the events $c_i$, $1 \le i \le 2k$, are just unobservable events, as is $u$. In the distributed model, these events $c_i$ are communication events and the faulty component is considered together with the other two sets of components, where each component in both sets shares one event $c_i$ with the faulty component, to ensure a number $2k$ of communications before arriving at the cycles that will witness the non diagnosability of the fault. Each set of components will be synchronized with only one path, either the faulty path or the correct one. This allows us to study the effect of the cycle distance in both models.

Figure 2: One faulty component that communicates with two sets of k components. Each set communicates with one path (resp. faulty and correct) in the faulty component.

   The results are in Table 2 for the centralized model (for $k$ = 18, 28, 38, 48, 58 and 98) and in Table 3 for the distributed model (for $k$ = 3, 13, 23, 33, 43 and 63). The length of a pair of executions with cycles witnessing the non diagnosability of $f$ in each example is $k + 2$ and we consider the satisfiability of the formula $\Phi^T_{k+2}$, so the number of steps required for SAT to provide the answer Yes is $|Steps| = k + 2$. In order to obtain a fair comparison between IncSAT, which manages internally, by handling assumptions, the successive satisfiability checks of increasing formulas for $j = 1, \ldots, k + 2$, and SAT, for which $k + 2$ successive calls are made to the solver with respective formulas $\Phi^T_n$ for $n = 1, \ldots, k + 2$, the sum of the $k + 2$ runtimes of the SAT solver calls is considered in this case (last column in the tables).

       |Steps|   |Clauses|     Inc. SAT (s)   SAT (s)
       20        42,614        1.5            1.3
       30        131,714       10.3           13.1
       40        303,736       49.3           77.8
       50        576,466       106            223
       60        970,156       320            699
       100       4,334,018     9410           13040

   Table 2: Results on the faulty component of Figure 2.

       |Steps|   |Comps|   |Clauses|     Inc. SAT (s)   SAT (s)
       5         7         1,962         0.04           0.06
       15        27        30,313        0.8            0.5
       25        47        113,906       6.5            4.8
       35        67        277,873       33.8           33.7
       45        87        542,033       111            132
       65        127       1,490,590     967            1090

   Table 3: Results on the whole system of Figure 2.

   Although these examples remain relatively simple and do not reflect any potential constraint that could be captured by some learnt clauses (e.g. no interfering events), we can already notice the difference in runtime in favor of our incremental version in the centralized case (for all but the smallest instance) and for the two largest values of $k$ in the distributed case. This difference could be explained by the fact that generating all variables from the beginning for all time steps and for all events implies many meaningless clauses that would add a load on the solver in the version in [2], this load being avoided in our incremental version because of the clauses learnt by the CDCL SAT solver. On the other hand, we should say that generating all variables from the beginning in both versions has two main advantages: firstly, it allows the system description without unfolding it (even if this description is verbose); secondly, it allows the ordering of these variables by their time step, in order to generate the constraints for only one time step and then get the constraints of the next steps by just shifting the numbers (as we are representing the clauses in DIMACS format). One last point could help towards a more efficient description of the system: in the succinct systems we represent all the occurrences of an event together, but in its SAT encoding we "unfold" this succinctness by generating for each occurrence $n$ variables (for $n$ time steps), even though logically only one of them will be assigned to True. We could thus mark this relation among these $n$ copies by introducing a global cardinality constraint to express that these copies belong to only one occurrence of an event.
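The "shifting the numbers" idea mentioned for the DIMACS representation can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' clause generator: it assumes that the variables of time step $t$ are numbered from $t \cdot V + 1$ to $(t+1) \cdot V$ for a fixed number $V$ of variables per step, and that the one-step template is written over steps 0 and 1. The function names `shift_literal`, `unroll` and `to_dimacs` are hypothetical.

```python
# Sketch: build the clauses of every time step by shifting a one-step template.
# Assumption: step t owns variables t*V+1 .. (t+1)*V; literals are signed integers.

def shift_literal(lit, offset):
    """Shift the variable index of a literal while keeping its sign."""
    return lit + offset if lit > 0 else lit - offset

def unroll(template, vars_per_step, n_steps):
    """Copy the template (written over steps 0 and 1) to steps t and t+1, for t < n_steps."""
    clauses = []
    for t in range(n_steps):
        offset = t * vars_per_step
        clauses.extend([shift_literal(l, offset) for l in clause] for clause in template)
    return clauses

def to_dimacs(clauses):
    """Serialize clauses in DIMACS CNF: a 'p cnf' header, then 0-terminated clauses."""
    n_vars = max(abs(l) for clause in clauses for l in clause)
    lines = [f"p cnf {n_vars} {len(clauses)}"]
    lines += [" ".join(map(str, clause)) + " 0" for clause in clauses]
    return "\n".join(lines)

if __name__ == "__main__":
    # Two variables per step: the template links variables 1, 2 (step 0) to 3, 4 (step 1).
    template = [[-1, 3], [2, -4]]
    print(to_dimacs(unroll(template, vars_per_step=2, n_steps=2)))
```

With this numbering only the template has to be generated from the system description; every further step costs one pass of integer additions.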






5 Selection of Related Works

The notion of diagnosability was first introduced by [1]. The authors studied diagnosability of FSM, as defined in definition 1. Their formal definition of diagnosability is the one we mentioned in definition 3. They introduced an approach to test this property by constructing a deterministic diagnoser. However, in the general case, this approach is exponential in the number of states of the system, which makes it impractical.

   In order to overcome this limitation, [3] introduced the twin plant approach, which is a special structure built by synchronizing on their observable events two identical instances of a nondeterministic fault diagnoser, and then searched for a path in this structure with an observed cycle made up of ambiguous states, i.e. states that are pairs of original states, one reached by going through a fault and the other not. Thus fault diagnosability is equivalent to the absence of such a path, called a critical path. This approach turns the diagnosability problem into a search for a path with a cycle in a finite automaton, and this reduces its complexity to be polynomial of degree 4 in the number of states (and exponential in the number of faults, but processing each fault separately makes it linear in the number of faults).

   Let us mention here that the two previous works were interested in centralized systems with simple faults modeled as distinguished events. The first studies about fault patterns were introduced in [9] and [10], which generalize the simple fault event in a centralized DES to handle a sequence of events considered together as a fault, or to handle multiple occurrences of the same fault or of different faults. More generally, a fault pattern is given as a suffix-closed rational events language (so by a complete deterministic automaton with a stable subset of final states).

   The first work that addressed diagnosability analysis in DDES was [7]. A DDES is modeled as a set of communicating FSM. Each FSM has its own events set, communication events being the only ones shared by at least two different FSM. [7] introduced an incremental diagnosability test which avoids building the twin plant for the whole distributed system if not needed. Thus one starts by building a local twin plant for the faulty component to test the existence of a local critical path. If such a path exists, one builds the local twin checkers of the neighboring components. A local twin checker is a structure similar to a local twin plant, i.e., where each path in it represents a pair of behaviors with the same observations, except that there is no fault information in it since it is constructed from a non-faulty component. After constructing local twin checkers, one tries to solve the ambiguity resulting from the existence of a critical path in the local twin plant. This is done by synchronizing on their communication events this local twin plant with the local twin checker of one neighboring component. In other words, one tries to distinguish the faulty path from the correct one by exploiting the observable events in the neighboring components, because the occurrences of these events that are consistent with the occurrences of the communication events could solve the ambiguity. The process is repeated until the diagnosability is answered, so only in the worst case does the whole system have to be visited. Another important contribution in this work was to delete the unambiguous parts after each synchronization on the communication events, thus reducing the amount of information transferred to the next check (if needed). The approach assumed simple faults.

   The work by [11] has optimized the construction of local twin plants, by exploiting the fact that one distinguishes two behaviors (faulty and correct) and one synchronizes at two levels (observations first and communications later). It improved the construction of the twin plants proposed by [7] by exploiting the different identifiers given to the communication events at the observation synchronization level (depending on which instance, left or right, they belong to) to assign them directly to the two behaviors studied (left copy assigned to the faulty behavior, right copy to the correct one). This helped in deleting the redundant information, then in abstracting the amount of information to be transferred later to the next steps if the diagnosability was not answered. The generalization to fault patterns in DDES was introduced by [12].

   After the reduction of the diagnosability problem to a path finding problem by [3], it became transferable to a satisfiability problem, as is the case for planning problems [13]. This was done by [2], which formulated the diagnosability problem (in its twin plant version) as a SAT problem, assuming a centralized DES with simple fault events. The authors represented the studied transition system by a succinct representation (cf. definition 2). This allows both a compact representation of the system states and a maximum amount of non interfering events to be fired simultaneously. Thus, they represented the system states by the valuation of a set of Boolean state variables ($\lceil \log(q) \rceil$ state variables for $q$ states) and the interference relation between two events according to the consistency among their effects and preconditions, one versus the other. They distinguished between an occurrence of an event in the faulty sequence or in the correct sequence by introducing two versions of it, and constructed the logical formula expressing state transitions for each possible step in the system. Each step may contain simultaneous events that belong to faulty and correct sequences but must synchronize the occurrence of observable events whenever they take place. For a given bound $n$ on the path length, they made the conjunction of these formulas for $n$ steps and added the logical formula that represents the occurrence of the fault in the faulty sequence and the occurrence of a cycle in both sequences. The satisfiability of the obtained formula is equivalent to finding a critical path, i.e. to the non diagnosability of the fault (see subsection 2.2 for a summary of this approach). Although this approach allows one to test diagnosability in large systems, it has the limitation that we cannot dynamically increase $n$ to ensure reaching more states while scaling up the size of the system, where the cycles that witness non diagnosability can be very long. However, the authors notice that we are not always forced to test all reachable states, as in many cases an approximation of the reachable states can be applied, but without explaining explicitly how such an approximation can be found.

6 Conclusion and Future Works

By extending the state of the art works for centralized DES, we have expressed diagnosability analysis of DDES as a satisfiability problem by building a propositional formula whose satisfiability, witnessing non diagnosability, can be checked by SAT solvers. We allow both observable and non observable communication events in our model. Our expression of these communication events, which avoids merging all their owner components, helps in reducing the number of clauses used to represent them, and this reduction is proportional to the number of their occurrences. We have also proposed an adaptation of the logical formula in order to use incremental SAT calls, which helps manage the scaling up of the distance and the length of the intended cycles witnessing non diagnosability, and thus the size of the tested system. Thus we exploited the clauses learnt about the system behavior in the previous calls. This approach is more practical and more efficient for complex systems than existing ones, as it avoids starting from scratch at each call.

   We are now considering the extension of this work to fault patterns diagnosability [12]. We will use the same approach to express predictability analysis [14] as a satisfiability problem, for DES and DDES [15] and both for simple fault events and fault patterns [16]. Although our representation can be easily extended to deal with local observations (i.e., observable events in one component are observed only by this component), we know that in general diagnosability checking then becomes undecidable, e.g. when communication events are unobservable (obviously it remains decidable when these events are observable in all their owners) [5]. A future work will be to study decidable cases of diagnosability checking in DDES with local observations, e.g. assuming some well chosen communication events to be observable. Another natural question is whether the methods used in [7] and refined in [11] to check diagnosability in DDES in an incremental way in terms of the system components could be transposed as guiding strategies for some component incremental SAT based approach for testing diagnosability in DDES. Transposing these methods to SAT, based on building a local twin plant and local twin checkers for gaining efficiency with regard to a global checking, seems difficult. Basically, at any step $k$, corresponding to considering a subsystem made up of $k$ components, these methods build all critical paths witnessing non diagnosability at the level of this subsystem, and the incremental step, when adding a $(k+1)$th neighboring component, consists in checking the consistency of these pairs with the observations in the new component: only those pairs which can be consistently extended are kept, if any. In addition, in [11], only useful and abstracted information is kept from one step to the next one. With SAT, only one critical pair witnessing non diagnosability of the subsystem (i.e., a model for the formula) will be built. If it is not consistent, and thus disappears, when adding the $(k+1)$th component, diagnosability is not proven for all that: other critical pairs in the subsystem, not completely computed at step $k$, may exist and be extendible to step $(k+1)$. So, they have to be computed now, which limits the incremental character of the approach. In the same way, abstracting some information is difficult to achieve with SAT. So, there is no evidence a priori that an efficiency gain could be obtained by trying to develop a component incremental SAT based approach for testing DDES diagnosability.

References

[1] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40(9):1555–1575, 1995.
[2] J. Rintanen and A. Grastien. Diagnosability testing with satisfiability algorithms. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), pages 532–537, 2007.
[3] S. Jiang, Z. Huang, V. Chandra, and R. Kumar. A polynomial algorithm for testing diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 46(8):1318–1321, 2001.
[4] J. Rintanen. Diagnosers and diagnosability of succinct transition systems. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), pages 538–544, 2007.
[5] L. Ye and P. Dague. Undecidable case and decidable case of joint diagnosability in distributed discrete event systems. International Journal On Advances in Systems and Measurements, 6(3 and 4):287–299, 2013.
[6] D. Le Berre and A. Parrain. The sat4j library, release 2.2. Journal on Satisfiability, Boolean Modeling and Computation, 7:59–64, 2010.
[7] Y. Pencolé. Diagnosability analysis of distributed discrete event systems. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI'04), 2004.
[8] A. Nadel and V. Ryvchin. Efficient SAT solving under assumptions. In Proceedings of the 15th International Conference on Theory and Applications of Satisfiability Testing (SAT'12), 2012.
[9] T. Jéron, H. Marchand, S. Pinchinat, and M.-O. Cordier. Supervision patterns in discrete event systems diagnosis. In Proceedings of the 8th International Workshop on Discrete Event Systems, 2006.
[10] S. Genc and S. Lafortune. Diagnosis of patterns in partially-observed discrete-event systems. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 422–427. IEEE, 2006.
[11] L. Ye and P. Dague. An optimized algorithm for diagnosability of component-based systems. In Proceedings of the 10th International Workshop on Discrete Event Systems (WODES'10), 2010.
[12] L. Ye, Y. Yan, and P. Dague. Diagnosability for patterns in distributed discrete event systems. In Proceedings of the 21st International Workshop on Principles of Diagnosis (DX'10), 2010.
[13] H. Kautz and B. Selman. Planning as satisfiability. In Proceedings of the 10th European Conference on Artificial Intelligence (ECAI'92), pages 359–363, 1992.
[14] S. Genc and S. Lafortune. Predictability of Event Occurrences in Partially-observed Discrete-event Systems. Automatica, 45(2):301–311, 2009.
[15] L. Ye, P. Dague, and F. Nouioua. Predictability Analysis of Distributed Discrete Event Systems. In Proceedings of the 52nd IEEE Conference on Decision and Control (CDC-13), pages 5009–5015. IEEE, 2013.
[16] T. Jéron, H. Marchand, S. Genc, and S. Lafortune. Predictability of Sequence Patterns in Discrete Event Systems. In Proceedings of the 17th World Congress, pages 537–453. IFAC, 2008.








   Improving Fault Isolation and Identification for Hybrid Systems with Hybrid
                                Possible Conflicts


                Anibal Bregon and Carlos J. Alonso-González and Belarmino Pulido
             Departamento de Informática, Universidad de Valladolid, Valladolid, 47011, Spain
                              e-mail: {anibal,calonso,belar}@infor.uva.es




                          Abstract

Model-based fault isolation and identification in hybrid systems is computationally expensive or even infeasible for complex systems due to the presence of uncertainty concerning the actual state, and also due to the presence of both discrete and parametric faults coupled with changing modes in the system. In this work we improve fault isolation and identification performance for hybrid systems diagnosis using Hybrid Possible Conflicts. The Hybrid Bond Graph modeling approach makes it feasible to track system behavior without enumerating the complete set of system modes. Hybrid Possible Conflicts focus the analysis on potential mode changes on those subsystems whose behavior deviates from expected. Moreover, using information derived from the Hybrid Bond Graph model, we can cope with both discrete and parametric faults in a unified framework.

Fault detection with Hybrid Possible Conflicts relied upon a statistical test to decide when a significant deviation in the residual occurs. Fault detection time was later used to start the fault isolation and identification stages. In this work we propose to analyze the evolution of the residual signal using CUSUM to find a more accurate estimation of the time of fault occurrence, which allows us to improve both the tracking of potential new modes and the parametric fault identification. Moreover, we extend our previous proposal for fault identification in continuous systems to cope with fault identification along a set of mode changes while performing parameter identification. We have tested these ideas in a four-tank hybrid system with satisfactory results.

1 Introduction

Complex hybrid systems are present in a broad range of engineering applications, such as mechanical systems, electrical circuits, or embedded computation systems. The behavior of these systems is made up of continuous and discrete event dynamics. The main sources of hybrid behavior are discrete actuators, like discrete valves or switches in fluid or electrical systems, respectively. These changes in the continuous behavior increase the difficulties for accurate and timely online fault diagnosis. Our focus in this paper is on developing efficient model-based methodologies for online fault isolation and identification in complex hybrid systems.

Both the DX and the FDI communities have approached hybrid systems modeling and diagnosis during the last 20 years. They have used different modeling proposals [1; 2; 3], and have approached diagnosis either as hybrid state estimation [2] or as online state tracking [4; 5; 6], or a combination of both methods [7]. The main difficulty in any approach is to estimate the current state or set of states, and to diagnose that set of feasible states. Both tasks are computationally expensive or even infeasible for complex systems. Several approaches have been proposed in the DX field to tackle these problems [4; 6].

In this work we have selected the hybrid system modeling based on Hybrid Bond Graphs (HBGs) [1; 6], together with consistency-based diagnosis using Possible Conflicts (PCs) [8]. HBGs are an extension of Bond Graphs (BGs) [9], which model the discrete changes as ideal switching junctions that can be set to ON or OFF according to an automaton. In [10] we presented Hybrid Possible Conflicts (HPCs) as an extension of Possible Conflicts using HBGs to track hybrid systems behavior. Later, the HPCs approach was extended to integrate fault diagnosis of both parametric and discrete faults using HPCs [11] in a unified framework.

In order to achieve efficient fault identification, it is very important to determine the time of fault occurrence as accurately and quickly as possible. But there is a required trade-off between fast and reliable fault detection. In our approach we relied upon a statistical test to decide when a residual deviates from the current mode, and used this time to start the fault isolation and identification stages. However, the fault detection instant can be delayed from the fault occurrence time and this has some problems (e.g., that the fault identification process is delayed, or that we have to assume that we know the value of the state variables at the beginning of the identification process). In this work we propose to analyze the evolution of the residual signal using the CUSUM algorithm [12; 13] to find a more accurate estimation of the time of fault occurrence, both for potential new modes tracking and for parametric fault identification. Moreover, we extend our previous proposals for fault identification [14; 15] to cope with fault identification along a set of mode changes while performing the parameter identification.






The rest of the paper is organized as follows. Section 2 presents the case study used throughout the paper and introduces the Hybrid Bond Graph (HBG) modeling technique. Section 3 summarizes the Hybrid Possible Conflicts (HPCs) background, while Section 4 explains the unified framework for both discrete and parametric faults. Section 5 introduces some concepts related to the CUSUM algorithm required in our approach. Section 6 explains our approach for fault identification. Section 7 introduces some results obtained applying our proposal on our case study. Finally, Section 8 draws some conclusions.

2 Case Study

The hybrid four-tank system in Figure 1 will be used to show some concepts and to present some results in this work. The system has an input flow which can be sent to tank 1, to tank 3 or to both tanks. Next to tank 1 there is tank 2; once the liquid in tank 1 reaches a level of h, it also starts to fill tank 2. The lower part of the system has the same configuration: tank 4 is next to tank 3, connected by a pipe at a distance h above the base of the tanks.

       Figure 1: Schematics of the four-tank system

The methodology chosen to model the system in this work is Hybrid Bond Graph (HBG), which is an extension of Bond Graphs (BGs). BGs are defined as a domain-independent energy-based topological modeling language for physical systems [9]. Several types of primitive elements are used to build BGs: storage elements (capacitances, C, and inductances, I), dissipative elements (resistors, R) and elements to transform energy (transformers, TF, and gyrators, GY). There are also effort and flow sources (Se and Sf), which are used to define interactions between the system and the environment. Elements in a BG are connected by 0 or 1 junctions (representing ideal parallel or series connections between components). Each bond has two associated variables (effort and flow). The power is defined as effort × flow for each bond. The SCAP algorithm [16] is used to assign causality automatically to the BG.

To model hybrid systems using BGs we need to use some kind of connections which allow changes in their state. Hybrid Bond Graphs (HBGs) [1] extend BGs by including those connections. They are idealized switching junctions that allow mode changes in the system. If a switching junction is set to ON, it behaves as a regular junction. When it changes to OFF, all bonds incident on the junction are deactivated, forcing 0 flow (or effort) for 1 (or 0) junctions. A finite state machine control specification (CSPEC) implements those junctions. Transitions between the CSPEC states can be triggered by endogenous or exogenous variables, called guards. CSPECs capture controlled and autonomous changes as described in [17]. Figure 2 shows the HBG model of the four-tank system in Figure 1.

       Figure 2: Bond graph model of the plant.

The system has four switching junctions: SW1, SW2, SW3 and SW4. SW1 and SW3 are controlled ON/OFF transitions, while SW2 and SW4 are autonomous transitions. Both kinds of transitions are represented using a finite state machine. Figure 3 shows: a) the automaton associated with switching junction SW1 and b) the automaton representing the autonomous transition in SW2. Since the system is symmetric, automata for SW3 and SW4 are equivalent to the ones shown in Figure 3.

       Figure 3: a) Automaton associated with the ON/OFF switching junction SW1; b) Automaton representing the autonomous transition in SW2.

3 Hybrid Possible Conflicts background

Consistency-based diagnosis of continuous systems using Possible Conflicts (PCs) [8] is based upon a dependency-compilation technique from the DX community. PCs are computed offline, finding minimal structurally overdetermined subsets of equations with sufficient analytical redundancy to generate fault hypotheses from observed measurement deviations. Only structural and causal information about the system description is required. This information can be obtained from a set of algebraic and/or differential equations, or can be automatically derived from bond graph models [18; 19]. Once the set of PCs is found, they can be implemented as simulation, state-observers or gray-box






models for tracking online actual system behavior [20], or for online fault identification [14].

The PCs approach has been recently extended to cope with hybrid system dynamics, and the set of PCs for hybrid systems was called Hybrid Possible Conflicts (HPCs) [10]. HPCs rely upon the Hybrid Bond Graph modeling formalism [1], whose main advantage is that the set of possible modes in the system does not need to be enumerated. Moreover, HBGs are capable of tracking hybrid system behavior online, performing online causality reassignment in the system model by means of the HSCAP algorithm [17]. Using HPCs we make the HSCAP algorithm even more efficient, because causality needs only to be revised within the subsystem defined for each HPC, and these changes are local to the switching junction affected by the mode change.

Table 1: Reduced Qualitative Fault Signature Matrix.

           HPC1    HPC2    HPC3    HPC4
  C1+      −+
  C2+              −+
  C3+                      −+
  C4+                              −+
  R01+     0−              0+
  R03+     0+              0−
  R1+      0+
  R2+              0+
  R3+                      0+
  R4+                              0+
  R12+     0−      0−
  R34+                     0+      0−
For the four-tank system we have found four HPCs. Each one of them estimates one of the measured variables (p1, p2, p3, or p4). Figure 4 shows the BG fragments of these four HPCs. In this example, the four HPCs were computed assuming that all switching junctions are set to ON.

As mentioned before, when any of these junctions is switched to OFF, causality in the system needs to be reassigned, but the HPCs generation process does not need to be restarted [10]. The decomposition of a hybrid system model obtained from HPCs is unique, and after a mode change some portions of some HPCs can disappear (or even the entire HPC), but no additional HPC appears. It is proved in [10] that once the PCs of the system have been generated considering all switching junctions set to ON, turning a switch from ON to OFF or vice versa never produces genuinely new HPCs.

Regarding fault profiles, our current proposal works with single-fault and abrupt-fault assumptions. Abrupt faults are modeled as an instantaneous change in a parameter, whose magnitude does not change afterwards (it can be modeled as a step function).

Regarding parametric faults, fault isolation is performed by means of the Reduced Qualitative Fault Signature Matrix (RQFSM). Table 1 shows the RQFSM for the mode where each switch is set to ON. For a given mode, the RQFSM can be computed online from the TCG associated to an HPC [1]. In this table there is a row for each fault considered and a column for each HPC. Each entry in the table represents the Qualitative Fault Signature of the fault in the HPC residual, as computed in TRANSCEND [1]. The “reduced” tag means that the Qualitative Fault Signature is computed within the subsystem delimited by an HPC, and not for the whole set of measurements [18]. Once fault detection is performed, we can use this information to reject those faults whose residual evolution does not match the qualitative signatures in this table.

We also consider discrete faults, i.e. faults in discrete actuators, as commanded mode switches which do not perform the correct action. In our case study, there are four faulty situations to be considered, where SWi denotes the switching junction i of the system.

  1. SWi = 11: SWi stuck ON (1).
  2. SWi = 00: SWi stuck OFF (0).
  3. SWi = 01: Autonomous switch ON (SWi is OFF (0) and it switches to ON itself (1)).
  4. SWi = 10: Autonomous switch OFF (SWi is ON (1) and it switches to OFF itself (0)).

The relation between the HPCs and their related switching junctions can be seen in Table 2, which is called Hybrid Fault Signature Matrix (HFSM). This information can be used in the unified framework for discrete and parametric fault isolation and identification [11].

Table 2: Hybrid Fault Signature Matrix (HFSM) showing the relations between switching junctions and each HPC.

           HPC1    HPC2    HPC3    HPC4
  1SW1     1               1
  1SW2     1       1
  1SW3     1               1
  1SW4                     1       1

Discrete faults usually introduce high non-linearities in the system outputs, which should be easily detected if magnitudes related to the failing switch were measured, generating almost instantaneous detection for discrete faults. In this case, exoneration could be applied. But even if those measurements are not available we can still use the qualitative signature of the effects of the discrete faults in the HPC residuals. With this information we can build the so-called Hybrid Qualitative Fault Signature Matrix (HQFSM) that can also be used for exoneration purposes in the fault isolation stage. In our system we can build the following HQFSM for HPC1 and HPC3, which are linked to commanded switches SW1 and SW3, the potential sources of discrete faults in our system. We do not show SW2 and SW4 in the table since they introduce hybrid dynamics in the system, but they cannot be the source of a discrete fault.

Table 3: Hybrid Qualitative Fault Signature Matrix.

               HPC1    HPC3
  1SW1 (11)    +       −
  1SW1 (00)    −       +
  1SW1 (01)    +       −
  1SW1 (10)    −       +
  1SW3 (11)    −       +
  1SW3 (00)    +       −
  1SW3 (01)    −       +
  1SW3 (10)    +       −

The next section presents our diagnosis framework for hybrid systems using HPCs.

4 Hybrid Systems Diagnosis using HPCs

As we mentioned before, tracking of hybrid systems can be performed using Hybrid PCs [10]. Initially, the set of HPCs is built assuming all switching junctions are set to ON.
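
The signature matrices in Tables 1 and 3 lend themselves to a simple lookup-table encoding. The sketch below stores both tables as Python dictionaries and keeps the fault candidates whose signatures agree with the observed residual deviations. The function name, the dictionary layout and the matching rule are assumptions made for illustration and are not taken from our implementation: every activated residual must have a matching entry, while non-activated residuals are ignored (no exoneration). Signatures are kept as opaque strings, and the ASCII character `-` stands for the minus sign used in the tables.

```python
# Table 1 (RQFSM): parametric fault -> {HPC: qualitative signature}
RQFSM = {
    "C1+": {"HPC1": "-+"},
    "C2+": {"HPC2": "-+"},
    "C3+": {"HPC3": "-+"},
    "C4+": {"HPC4": "-+"},
    "R01+": {"HPC1": "0-", "HPC3": "0+"},
    "R03+": {"HPC1": "0+", "HPC3": "0-"},
    "R1+": {"HPC1": "0+"},
    "R2+": {"HPC2": "0+"},
    "R3+": {"HPC3": "0+"},
    "R4+": {"HPC4": "0+"},
    "R12+": {"HPC1": "0-", "HPC2": "0-"},
    "R34+": {"HPC3": "0+", "HPC4": "0-"},
}

# Table 3 (HQFSM): (switch, fault mode) -> {HPC: sign of the deviation}
HQFSM = {
    ("SW1", "11"): {"HPC1": "+", "HPC3": "-"},
    ("SW1", "00"): {"HPC1": "-", "HPC3": "+"},
    ("SW1", "01"): {"HPC1": "+", "HPC3": "-"},
    ("SW1", "10"): {"HPC1": "-", "HPC3": "+"},
    ("SW3", "11"): {"HPC1": "-", "HPC3": "+"},
    ("SW3", "00"): {"HPC1": "+", "HPC3": "-"},
    ("SW3", "01"): {"HPC1": "-", "HPC3": "+"},
    ("SW3", "10"): {"HPC1": "+", "HPC3": "-"},
}


def consistent_candidates(matrix, observed):
    """Return the faults whose signature matches every observed deviation.

    matrix   : RQFSM or HQFSM as defined above
    observed : {HPC name: observed signature} for the activated residuals
    """
    return [
        fault
        for fault, signature in matrix.items()
        if all(signature.get(hpc) == dev for hpc, dev in observed.items())
    ]
```

For instance, `consistent_candidates(RQFSM, {"HPC1": "0-", "HPC3": "0+"})` leaves `R01+` as the only parametric candidate, whereas observing only `{"HPC1": "0-"}` keeps both `R01+` and `R12+`.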







                                Figure 4: Bond graphs of the four PCs found for the four-tank system.


Afterwards, the set of models for the HPCs for the current mode is efficiently built, and the models start tracking the system. Whenever a mode change, commanded or autonomous, is detected, a new set of models for the HPCs is computed online.

In case a fault occurs, one or more HPC residuals will trigger. Significant deviations in the residuals are found using the statistical Z-test. Based on the activated residuals for the set of HPCs in the current mode, the structural information in the HQFSM (Table 3), and the RQFSM (Table 1), we build the current set of fault candidates. This set can contain both discrete and parametric faults. Since discrete faults generally have a bigger and potentially more dangerous influence on the system behavior, in our framework we consider discrete faults as preferred candidates before considering the parametric ones. If there is no discrete fault as candidate, then we directly go to the fault identification as described in Section 6.

At this point we run the CUSUM algorithm (described in Section 5) to approximately determine the time of fault occurrence. Once this is done, we create a new simulation model using the HPCs, and starting at the fault time determined by the CUSUM, we begin tracking the system behavior in each one of the hypothesized mode changes (the HQFSM and the qualitative value of the HPC residuals are used to reject those modes that are inconsistent with expected deviations in the HQFSM). If the hypothesized mode is the correct one, the residual for that mode will go to zero after a relatively small period of time (this is possible, as we will show later, thanks to the accurate estimation of the fault time provided by the CUSUM). If the hypothesized mode is not correct, the residual will keep deviating from zero, and after an empirically determined time window without converging, the discrete fault candidate will be discarded. If only one mode has the residual close to zero, this is the new system mode.

If the residual for each hypothesized new mode does not converge to zero, discrete faults (as mode changes) are discarded and we focus on parametric faults, starting the identification stage. As mentioned before, qualitative fault signatures in the RQFSM can be used to reject those parametric faults not consistent with current observations, thus focusing even further the fault identification stage.

Finally, once the set of parametric fault candidates is refined through the RQFSM, we perform fault identification for the set of remaining parametric fault candidates. Fault identification is done with hybrid parameter estimators, which are presented in Section 6.

5 Time of Fault Estimation using CUSUM

In the previous section we have presented our fault isolation approach for discrete faults, which hypothesizes the faults compatible with the Hybrid Qualitative Fault Signature Matrix and filters out those faults whose models do not converge. Divergence of non-current models is usually easy to check when we are dealing with discrete faults. However, the convergence of the current model may be slow if initial values of the state variables of the model are not known or our initial guess is far from the actual value. We are assuming that we are able to track the system dynamics before the occurrence of a fault. In other words, we are assuming that we know, or are able to estimate, the state variables before the time of fault occurrence. Hence, in order to speed up the convergence of the current model, it is important to have a good estimation of that time.

The cumulative sum algorithm, CUSUM, introduced by [12] and discussed in detail in [13] and elsewhere, is an op-






timal fault detection algorithm that can also provide an esti-          Proposition 1. An HPC, HPCk , along with its set of in-
mation of the time of fault occurrence t0 , as we will detail           put variables, uhpck , the commanded signals of the switch-
later. Nevertheless, it makes the strong assumption that the            ing junctions, swhpck , and initial value of the parameter
signal we are tracking changes its mean value from a con-               to identify, θf , can be used as a parameter estimator using
stant initial mean µ0 to a final constant mean µ1 .                     ŷhpck = ehpck (uhpck , θf , swhpck (t)), where the measured
   On the other hand, the Z-test [21] is a sub-optimal fault            variable estimated by the HPC, ŷhpck , is solved in terms of
detection algorithm compared to CUSUM, but it makes no                  the remaining measured variables.
assumptions concerning the properties of the new mean                      Each estimator is uniquely related to one HPC, hence it
value. In particular, it does not require this to be constant.          contains the minimal redundancy required for parameter esti-
   In order to have a robust fault detection mechanism and              mation. In this case, each HPC has an executable model
a good approximation of the fault time, we have opted for               that can be used for simulation purposes. For the four-tank
combining both tests. We use the Z-test to perform fault de-            system we have obtained four hybrid parameter estimators
tection and, afterwards, we estimate the fault time using               shown in Table 4, one for each HPC.
CUSUM.
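
As a minimal sketch of the second step (estimating the fault time once the Z-test has fired), the function below accumulates the CUSUM decision signal of this section over a window of residual samples and returns the index at which it is minimal. The function and argument names are illustrative assumptions, not part of our implementation; the residual is assumed Gaussian with known standard deviation `sigma`, and `mu1 > mu0` is assumed so that the decision signal decreases before the change and increases after it.

```python
import numpy as np


def cusum_change_time(res, mu0, mu1, sigma):
    """Estimate the change time as arg min_k S_k over a residual window.

    res   : 1-D sequence of residual samples (assumed Gaussian)
    mu0   : known residual mean before the change
    mu1   : assumed residual mean after the change
    sigma : residual standard deviation
    Returns the 0-based index k at which the decision signal S_k is minimal.
    """
    res = np.asarray(res, dtype=float)
    # Increment s_i: log-likelihood ratio of a Gaussian mean shift mu0 -> mu1
    s = (mu1 - mu0) / sigma**2 * (res - (mu0 + mu1) / 2.0)
    # Decision signal S_k = s_1 + ... + s_k
    S = np.cumsum(s)
    return int(np.argmin(S))
```

For a noise-free step residual with 50 samples at 0 followed by 50 samples at 1, `cusum_change_time(res, 0.0, 1.0, 1.0)` returns 49, the index of the last sample before the shift; with this 0-based convention the first shifted sample is the next one.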
   CUSUM was designed to detect abrupt changes in the                   Estimator   Related PC   Parameters                Inputs        Output
mean of stochastic signals. In the simple case of a Gaussian            e1          HPC1         R01, R03, R12, R1, C1     Sf, p2, p3    p1
residual, res(i), of constant variance σ², constant and known           e2          HPC2         R12, R2, C2               p1            p2
initial mean µ0 and constant and known final mean µ1, the               e3          HPC3         R01, R03, R34, R3, C3     Sf, p1, p4    p3
decision signal, Sk, is                                                 e4          HPC4         R34, R4, C4               p3            p4

   $S_k = \sum_{i=1}^{k} s_i = \sum_{i=1}^{k} \frac{\mu_1 - \mu_0}{\sigma^2} \left( res(i) - \frac{\mu_0 + \mu_1}{2} \right)$.

Hence, for a window of N samples with a change in mean at               Table 4: Hybrid parameter estimators found for the four-
1 ≤ t0 ≤ N, Sk decreases at the constant rate                           tank system, and their related HPCs.
$\mu = \frac{\mu_1 - \mu_0}{2}$ for k < t0 and increases by µ              The basic idea is to use the estimator ehpck to compute
for t0 ≤ k. It can be shown [13] that the change time t0 can            estimations for ŷhpck with different values of the parameter
be estimated as $\hat{t}_0 = \arg\min_k S_k$.                           θf , so that we can find a value of the parameter that min-
                                                                        imizes the least squares (LS) error between the estimation
   When µ1 is unknown, it can be set to the residual corre-             ŷhpck and the measured value yhpck .
sponding to the smallest fault to be detected, typically some              Fig. 5 shows the parameter estimation process using the
units of the residual noise deviation, σ. This can be done              hybrid estimators. A parametrized estimator, ehpck , uses the
without increasing the fault positive alarm rate because we             inputs of the system, uhpck , and a parameter value, θf , to
use Z-test to perform fault detection, and we only use this             generate an estimation of the output, ŷhpck . This estimated
CUSUM variant to estimate the time of fault occurrence, t0 .            output is compared against the observed output, yhpck , by
We have also tried estimating µ as the empirical mean of                the quadratic error calculator block. This block computes
the residual, with similar results. In all the cases we have            the quadratic error between ŷhpck and yhpck for the fault
tested, the estimated time of fault occurrence, t̂0 , computed          candidate f , Ef2 . Then, the iteration engine block, that con-
by CUSUM, is smaller than the detection time provided by                tains a nonlinear optimization algorithm, finds the minimum
Z-test.                                                                 of the error surface Ef2 (θf ), by iteratively invoking the es-
                                                                        timator with different parameter values. The value of the
6 Fault Identification with HPCs                                        parameter and its minimum LS error will be the output of
                                                                        the parameter estimation block (and the input for the deci-
Once all the discrete fault candidates have been discarded,
                                                                        sion procedure block).
we have to do fault identification for the set of isolated para-
metric faults. In previous work [14] we proposed to use                 Fault candidate f (θf initial value)
minimal parameter estimators computed from PCs to gen-
erate parameterized estimators. However, that approach is                                                          eHPC k estimator                     
not applicable for hybrid systems fault identification since                              Inputs: uHPC k
                                                                                                                (obtained from HPCk)
we can have mode changes during the identification pro-
cess. As a solution, we propose a extension of our minimal
parameterized estimators which are computed directly from                                            ŷHPCk                            θ*f
HPCs, thus being able to handle mode changes during the
identification process.
   The fault identification process is done by the following                                                       quadratic            E2f           Iteration
                                                                                        Output: yHPC k               error
steps: (i) model decomposition by offline computation of                                                           calculator
                                                                                                                                                       Engine
the set of HPCs from the hybrid bond graph model; (ii)
offline computation and selection of the better hybrid esti-
mator for each fault candidate; (iii) after the fault isolation         Figure 5: Parameter estimation using the hybrid estimators
process, online quantitative parameter estimation procedure             from HPCs.
over the hybrid estimators related with the set of isolated
fault candidates; and (iv) decision procedure to select the
faulty candidate.                                                       7 Results
   Using HPCs we can derive the structure of a hybrid pa-               To test the validity of the approach, we implemented the
rameterized estimator, ehpck , for a hybrid system. The pa-             four hybrid HPCs for the four-tank system, with its cor-
rameterized estimator ehpck can be used as a hybrid estima-             responding estimators, and run different simulation exper-
tor as stated in the following proposition:                             iments.
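As an illustration of the estimation loop of Fig. 5 (a parameterized estimator
invoked repeatedly by an iteration engine that minimizes the quadratic error),
the following is a minimal sketch under strong assumptions. The one-tank model,
its capacity and time step, the function names, and the use of golden-section
search are all hypothetical; the paper only states that the iteration engine
contains a nonlinear optimization algorithm. The inflow is switched off halfway
through the data to mimic a mode change during estimation.

```python
import math


def simulate_tank(theta, inflow, p0, dt=1.0, capacity=10.0):
    """Hypothetical estimator: C * dp/dt = inflow(t) - p / theta (forward Euler).
    theta plays the role of the unknown parameter of the fault candidate."""
    p, estimate = p0, []
    for u in inflow:
        estimate.append(p)
        p += dt * (u - p / theta) / capacity
    return estimate


def quadratic_error(theta, inflow, measured, p0):
    """Quadratic error between the estimated and the measured output."""
    estimate = simulate_tank(theta, inflow, p0)
    return sum((y_hat - y) ** 2 for y_hat, y in zip(estimate, measured))


def golden_section_min(f, lo, hi, tol=1e-6):
    """Iteration engine (assumed): golden-section search for a unimodal f."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = f(c)
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = f(d)
    theta = (a + b) / 2.0
    return theta, f(theta)


if __name__ == "__main__":
    # Inflow on for 30 samples, then switched off (a mode change).
    inflow = [1.0] * 30 + [0.0] * 30
    measured = simulate_tank(4.0, inflow, p0=0.0)  # synthetic "measurement"
    theta_hat, err = golden_section_min(
        lambda th: quadratic_error(th, inflow, measured, 0.0), 1.0, 10.0)
    print(theta_hat, err)
```

Golden-section search is used here only because it needs no derivatives and no
external library; it assumes that the error surface has a single minimum in the
search interval, which holds for this toy model but is not guaranteed in
general.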








[Figure 6 plot: pressures p1, p2, p3, p4 (Pascals) versus time (s).]

Figure 6: Measured pressures in the four tanks when a fault in SW1 is
introduced at t = 190 s.

[Figure 7 plot: CUSUM value versus time (s).]

Figure 7: CUSUM output for a fault in SW1.

[Figure 8 plots: p1 measurement and p1 HPC1 estimation, pressure (Pascals)
versus time (s); residual for HPC1 versus time (s).]

Figure 8: Estimation and residual for HPC1 (using CUSUM) when a fault in SW1
occurs and the hypothesized fault is SW1.
Figure 9: Estimation and residual for HPC1 (using CUSUM) when a fault in SW1
occurs and the hypothesized fault is SW3.

    In the first experiment, we assume that the water tanks are initially
empty, and start to fill in at constant rate. Hence, the initial configuration
of the system is SW1 and SW3 set to ON, and SW2 and SW4 set to OFF. Tanks 1 and
3 start to fill in, and approximately at time 20 s the level in both tanks
reaches the height of the connecting pipes and tanks 2 and 4 start to fill in.
At time 190 s, a fault occurs in the controlled junction SW1, which switches
off (see Fig. 6 for the measured pressures in the four tanks for this
experiment).

    Four seconds after the fault is introduced, at t = 194 s, both HPC1 and
HPC3 trigger, and consequently both SW1 and SW3 are initially considered as
discrete fault candidates. At this point, the CUSUM algorithm is run,
determining that the fault has occurred at t = 191 s. In this case study we use
a CUSUM window of size 100. Figure 7 shows the output of the CUSUM algorithm
where the absolute maximum represents the approximate time (due to noise in the
system) of fault occurrence.

    Once the point of fault occurrence has been determined at t = 191 s, the
diagnosis framework takes the values of the simulation at such time instant and
launches two parallel diagnosis experiments, one for each hypothesized fault
candidate, i.e., SW1(10) and SW3(10). Figs. 8 and 9 show the estimation and the
residual for HPC1 when the hypothesized faults are SW1(10) and SW3(10),
respectively (we do not show the results for HPC3 since they are similar to the
results obtained for HPC1). Looking at the results, it is obvious that the
residual converges to zero when a fault in SW1(10) is hypothesized, while the
residual when SW3(10) is hypothesized does not converge. Hence, SW1(10) is
confirmed as the fault. This confirmation is done by continuously analyzing
residual signals with the Z-test. Please note that, since the CUSUM algorithm
gives a good approximation of the point of failure, the residual is able to
converge very quickly when the true fault is hypothesized. For comparison
purposes, Fig. 10 shows the estimation and residual for HPC1 when CUSUM is not
used to re-initialize the simulation (for the hypothesized fault SW1). By
comparing this figure with Fig. 8 it is clear that using CUSUM allows the HPC
to converge faster.

    As a second diagnosis experiment, we start off from the same situation of
the previous experiment, but in this case, we introduce a small parametric
fault and after a short while,




a discrete change. Specifically, a 20% blockage in the input pipe of tank 1,
R01, is introduced at t = 190 s, and then SW1 is commanded to switch OFF at
t = 210 s (Fig. 11 shows the measured pressures in the four tanks for this
experiment).

[Figure 10 plots: p1 measurement and p1 HPC1 estimation, pressure (Pascals)
versus time (s); residual for HPC1 versus time (s).]

Figure 10: Estimation and residual for HPC1 (without using CUSUM to
re-initialize the simulation) when a fault in SW1 occurs and the hypothesized
fault is SW1.

[Figure 11 plot: pressures p1, p2, p3, p4 (Pascals) versus time (s).]

Figure 11: Measured pressures in the four tanks when a fault in R01 is
introduced at t = 190 s and the switching junction SW1 is turned off at
t = 210 s.

   For this experiment, both HPC1 and HPC3 trigger at t = 198 s (as an example,
see Fig. 12 with the estimation and residual for HPC1), and consequently both
SW1(10) and SW3(10) are initially considered as discrete fault candidates.
However, in this scenario, after running the CUSUM (see Fig. 13 for the CUSUM
output), which estimated the fault time at t = 191 s, and the diagnosis
experiments for both fault candidates, none of the residuals was able to
converge within a reasonable, empirically determined, amount of time, thus
concluding that a parametric fault has occurred. At this point, the fault
identification process is triggered for R01, which is the only parametric fault
candidate (R03 is discarded due to the qualitative sign in the residuals). The
estimated value for parameter R01 was 0.1937, i.e., a 19.37% blockage in the
pipe. Please note that the estimator used a total of 60 seconds of data
starting from t = 191 s, hence, the estimator was capable of correctly
estimating the value of the faulty parameter even if the system transitions
from one mode to another during the estimation process.

[Figure 12 plots: p1 measurement and p1 HPC1 estimation, pressure (Pascals)
versus time (s); residual for HPC1 versus time (s).]

Figure 12: Estimation and residual for HPC1 when a fault in R01 occurs and then
SW1 is set to OFF mode.

[Figure 13 plot: CUSUM value versus time (s).]

Figure 13: CUSUM output for a fault in R01.

   We ran several experiments with different mode configurations and different
faults, varying the size and the time of fault occurrence (in some of them by
introducing faults immediately after the mode change). Results for all these
situations were equivalent to the examples shown in this section.

8 Conclusions

In this work we have presented an approach for hybrid systems fault
identification using Hybrid Possible Conflicts. Using HBGs we can generate
minimal estimators that can be used for fault identification just considering
the possible mode changes within the estimators. Additionally, we have proposed
the integration of the CUSUM algorithm to accurately determine the time of
fault occurrence. A more accurate estimation of the fault instant allows
discrete faults to be isolated quickly, and gives a better approximation of the
values of the state variables, which are needed as initial values for the fault
identification.






   Diagnosis results using a four-tank system showed that the proposed
approach can be successfully used for fault identification of hybrid systems.
   In future work, we will test the approach in more complex systems with
real data, and will propose a distributed approach for hybrid systems fault
diagnosis.

Acknowledgments
This work has been funded by the Spanish MINECO DPI2013-45414-R grant.

References
[1] P. Mosterman and G. Biswas. Diagnosis of continuous valued systems in
     transient operating regions. IEEE Trans. on Sys., Man, and Cyber.
     Part A, 29(6):554–565, 1999.
[2] M.W. Hofbaur and B.C. Williams. Hybrid estimation of complex systems.
     IEEE Trans. on Sys., Man, and Cyber. Part B, 34(5):2178–2191, Oct. 2004.
[3] E. Benazera and L. Travé-Massuyès. Set-theoretic estimation of hybrid
     system configurations. IEEE Trans. on Sys. Man Cyber. Part B,
     39:1277–1291, October 2009.
[4] S. Narasimhan and L. Brownston. Hyde - a general framework for
     stochastic and hybrid model-based diagnosis. In Proc. of the 18th Int.
     WS on Pples. of Diagnosis, DX07, pages 186–193, Nashville, TN, USA,
     May 29-31 2007.
[5] Mehdi Bayoudh, Louise Travé-Massuyès, and Xavier Olive. Fault detection
     and diagnosis; hybrid systems modeling and control; discrete event
     systems modeling and control. In Proc. of the Int. Conference on
     Control, Automation and Systems, ICCAS08, pages 7265–7270, Seoul,
     Korea, October 2008.
[6] S. Narasimhan and G. Biswas. Model-Based Diagnosis of Hybrid Systems.
     IEEE Trans. on Sys., Man and Cyber., Part A, 37(3):348–361, May 2007.
[7] Th. Rienmüller, M. Bayoudh, M.W. Hofbaur, and L. Travé-Massuyès. Hybrid
     Estimation through Synergic Mode-Set Focusing. In Proc. of the 7th IFAC
     Symposium on Fault Detection, Supervision and Safety of Technical
     Processes, SAFEPROCESS09, pages 1480–1485, Barcelona, Spain, 2009.
[8] B. Pulido and C. Alonso-González. Possible Conflicts: a compilation
     technique for consistency-based diagnosis. IEEE Trans. on Sys., Man,
     and Cyber. Part B: Cybernetics, 34(5):2192–2206, October 2004.
[9] D.C. Karnopp, D.L. Margolis, and R.C. Rosenberg. System Dynamics:
     Modeling and Simulation of Mechatronic Systems. John Wiley & Sons,
     Inc., New York, NY, USA, 2006.
[10] A. Bregon, C. Alonso, G. Biswas, B. Pulido, and N. Moya. Fault
     diagnosis in hybrid systems using possible conflicts. In Proc. of the
     8th IFAC Symposium on Fault Detection, Supervision and Safety of
     Technical Processes, SAFEPROCESS12, Mexico City, Mexico, 2012.
[11] N. Moya, A. Bregon, CJ Alonso-González, and B. Pulido. A Common
     Framework for Fault Diagnosis of Parametric and Discrete Faults Using
     Possible Conflicts. In Advances in Artificial Intelligence, CAEPIA
     2013, volume 8109 of Lecture Notes in Computer Science, pages 239–249.
     Springer-Verlag Berlin, 2013.
[12] E.S. Page. Continuous inspection schemes. Biometrika, 41:100–115, 1954.
[13] M. Basseville and I.V. Nikiforov. Detection of Abrupt Changes: Theory
     and Applications. Prentice Hall, 1993.
[14] A. Bregon, G. Biswas, and B. Pulido. A Decomposition Method for
     Nonlinear Parameter Estimation in TRANSCEND. IEEE Trans. Syst. Man,
     Cyber. Part A, 42(3):751–763, 2012.
[15] N. Moya, B. Pulido, CJ Alonso-González, A. Bregon, and D. Rubio.
     Automatic Generation of Dynamic Bayesian Networks for Hybrid Systems
     Fault Diagnosis. In Proc. of the 23rd Int. WS on Pples. of Diagnosis,
     DX12, Great Malvern, UK, Jul-Aug 2012.
[16] D.C. Karnopp, R.C. Rosenberg, and D.L. Margolis. System Dynamics, A
     Unified Approach. 3rd ed., John Wiley & Sons, 2000.
[17] I. Roychoudhury, M. Daigle, G. Biswas, and X. Koutsoukos. Efficient
     simulation of hybrid systems: A hybrid bond graph approach. SIMULATION:
     Transactions of the Society for Modeling and Simulation International,
     (6):467–498, June 2011.
[18] A. Bregon, B. Pulido, G. Biswas, and X. Koutsoukos. Generating Possible
     Conflicts from Bond Graphs Using Temporal Causal Graphs. In Proc. of
     the 23rd European Conference on Modelling and Simulation, ECMS09,
     pages 675–682, Madrid, Spain, 2009.
[19] A. Bregon, G. Biswas, B. Pulido, C. Alonso-González, and H. Khorasgani.
     A Common Framework for Compilation Techniques Applied to Diagnosis of
     Linear Dynamic Systems. IEEE Trans. on Syst., Man, and Cyb.: Systems,
     44(7):863–873, 2013.
[20] A. Bregon, C. Alonso-González, and B. Pulido. Integration of simulation
     and state observers for on line fault detection of nonlinear continuous
     systems. IEEE Trans. on Syst., Man, and Cyb.: Systems,
     44(12):1553–1568, 2014.
[21] G. Biswas, G. Simon, N. Mahadevan, S. Narasimhan, J. Ramirez, and
     G. Karsai. A robust method for hybrid diagnosis of complex systems. In
     Proceedings of the 5th IFAC Symposium on Fault Detection, Supervision
     and Safety of Technical Processes, SAFEPROCESS 2003, pages 1125–1130,
     Washington D.C., USA, June 2003.








State estimation and fault detection using box particle filtering with stochastic measurements

Joaquim Blesa (1), Françoise Le Gall (2), Carine Jauberthie (2,3) and Louise Travé-Massuyès (2)

(1) Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Llorens i Artigas, 4-6, 08028 Barcelona, Spain
    e-mail: joaquim.blesa@upc.edu
(2) CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
    Univ de Toulouse, LAAS, F-31400 Toulouse, France
    e-mail: legall,cjaubert,louise@laas.fr
(3) Univ de Toulouse, UPS, LAAS, F-31400 Toulouse

Abstract

In this paper, we propose a box particle filtering algorithm for state
estimation in nonlinear systems whose model assumes two types of
uncertainties: stochastic noise in the measurements and bounded errors
affecting the system dynamics. These assumptions respond to situations
frequently encountered in practice. The proposed method includes a new way
to weight the box particles as well as a new resampling procedure based on
repartitioning the box enclosing the updated state. The proposed box
particle filtering algorithm is applied in a fault detection schema
illustrated by a sensor network target tracking example.

1 Introduction
For various engineering applications, system state estimation plays a
crucial role. Kalman filtering (KF) has been widely used in the case of
stochastic linear systems. The Extended Kalman Filter (EKF) and Unscented
Kalman Filter (UKF) are extensions of the KF for nonlinear systems. These
methods assume unimodal, Gaussian distributions. On the other hand,
Particle Filtering (PF) is a sequential Monte Carlo Bayesian estimator
which can be used in the case of non-Gaussian noise distributions.
Particles are point states associated with weights whose likelihoods are
defined by a statistical model of the observation error. The efficiency and
accuracy of PF depend on the number of particles used in the estimation and
propagation at each iteration. If the number of required particles is too
large, a practical implementation becomes infeasible, and this is the main
drawback of PF. Several methods have been proposed to overcome this
shortcoming, mainly based on variants of the resampling stage or different
ways to weight the particles [1].
   Recently, a new approach based on box particles was proposed by [2; 3].
The Box Particle Filter handles box states and bounded errors. It uses
interval analysis in the state update stage and constraint satisfaction
techniques to perform the measurement update. The set of box particles is
interpreted as a mixture of uniform pdfs [4]. Using box particles has been
shown to control quite efficiently the number of required particles, hence
reducing the computational cost and providing good results in several
experiments.
   In this paper, we take into account the box particle filtering ideas but
consider that measurements are tainted by stochastic noise instead of
bounded noise. The errors affecting the system dynamics are kept bounded
because this type of uncertainty corresponds to many practical situations,
for example tolerances on parameter values. Combining these two types of
uncertainties following the seminal ideas of [5] and [6] within a particle
filter schema is the main issue driving the paper. This issue is different
from the one addressed in [7], in which the focus is put on Bernoulli
filters able to deal with data association uncertainty. The proposed method
includes a new way to weight the box particles as well as a new resampling
procedure based on repartitioning the box enclosing the updated state.
   The paper is organized as follows. Section 2 describes the problem
formulation. A summary of the Bayesian filtering is presented and the
box-particle approach is introduced. The main steps of this approach are
developed in section 3. Sections 4 and 5 are devoted to the repartitioning
of the boxes and the computation of the weight of the box particles in
order to control the number of boxes. In section 6 the box particle filter
is used for state estimation and fault detection; the results obtained with
the proposed method for target tracking in a sensor network are presented
in section 7. Conclusion and future work are overviewed in the last section.

2 Problem formulation
We consider nonlinear dynamic systems represented by discrete time
state-space models relating the state x(k) to the measured variables y(k)

$$ x(k+1) = f(x(k), u(k), v(k)) \tag{1} $$
$$ y(k) = h(x(k)) + e(k), \quad k = 0, 1, \ldots \tag{2} $$

where $f : \mathbb{R}^{n_x} \times \mathbb{R}^{n_u} \times \mathbb{R}^{n_v} \to \mathbb{R}^{n_x}$
and $h : \mathbb{R}^{n_x} \to \mathbb{R}^{n_y}$ are nonlinear functions,
$u(k) \in \mathbb{R}^{n_u}$ is the system input, $y(k) \in \mathbb{R}^{n_y}$
is the system output, $x(k) \in \mathbb{R}^{n_x}$ is the state-space vector,
$e(k) \in \mathbb{R}^{n_y}$ is a stochastic additive error that includes the
measurement noise and discretization error and is specified by its known
pdf $p_e$, and $v(k) \in \mathbb{R}^{n_x}$ is the process noise.
   In this work the process noise is assumed bounded, $|v_i(k)| \le \sigma_i$
with $i = 1, \ldots, n_x$, i.e., $p_v \sim U([V])$, where
$[V] = [-\sigma_1, \sigma_1] \times \cdots \times [-\sigma_{n_x}, \sigma_{n_x}]$.

2.1 Bayesian filtering
Given a vector of available measurements at instant k:
$Y(k) = \{y(i), i = 1, \ldots, k\}$, $Y(0) = y(0)$, the Bayesian




                                                                      67
                           Proceedings of the 26th International Workshop on Principles of Diagnosis


solution to compute the predicted distribution $p(x(k+1)|Y(k))$ of the state
vector at instant k + 1, given past observations Y(k), is given by
(Gustafsson 2002):

$$ p(x(k+1)|Y(k)) = \int_{\mathbb{R}^{n_x}} p(x(k+1)|x(k))\, p(x(k)|Y(k))\, dx(k) \tag{3} $$

where the posterior distribution $p(x(k)|Y(k))$ can be computed by

$$ p(x(k)|Y(k)) = \frac{1}{\alpha(k)}\, p(y(k)|x(k))\, p(x(k)|Y(k-1)) \tag{4} $$

where $\alpha(k)$ is a normalization constant, $p(y(k)|x(k))$ is the
likelihood function that can be computed from (2) as:

$$ p(y(k)|x(k)) = p_e(y(k) - h(x(k))) \tag{5} $$

and $p(x(k)|Y(k-1))$ is the prior distribution.
   Equations (5), (4) and (3) can be computed recursively given the initial
value of $p(x(k)|Y(k-1))$ for k = 0, denoted as $p(x(0))$, which represents
the prior knowledge about the initial state.

2.2 Objective
Considering the assumptions of our problem, we adopt a particle filtering
schema, which is well known for numerically solving complex dynamic
estimation problems involving nonlinearities. However, we propose to use
box particles and to base our method on the interval framework. Box
particle filters have been shown to be efficient, in particular to reduce
the number of particles that must be considered to reach a reasonable level
of approximation [2].
   Let us consider the current state estimate $X(k)$ as a set, denoted by
$\{X(k)\}$, that is approximated by $N_k$ disjoint boxes

$$ [x(k)]^i, \quad i = 1, \ldots, N_k \tag{6} $$

where $[x(k)]^i = [\underline{x}(k)^i, \overline{x}(k)^i]$, with
$\underline{x}(k)^i, \overline{x}(k)^i \in \mathbb{R}^{n_x}$. The width of
every box is smaller than or equal to a given accuracy for every component, i.e.,

$$ \overline{x}_j(k)^i - \underline{x}_j(k)^i \le \delta_j, \quad i = 1, \ldots, N_k, \quad j = 1, \ldots, n_x \tag{7} $$

where $\delta_j$ is the predetermined minimum accuracy for every component j.
   Moreover, every box $[x(k)]^i$ is given a prior probability denoted as

$$ P([x(k)]^i | Y(k-1)), \quad i = 1, \ldots, N_k \tag{8} $$

with

$$ \sum_{i=1}^{N_k} P([x(k)]^i | Y(k-1)) \ge \gamma \tag{9} $$

where $\gamma \in [0, 1]$ is a confidence threshold.
   Then, given a new output measurement y(k), the problem that we consider
in this paper is:
  • to compute the state estimate $X(k+1)$,
  • to decide about the number $N_{k+1}$ of disjoint boxes of the
    approximation of $X(k+1)$, each with accuracy smaller than or equal to
    $\delta_j$,
  • to provide the prior probabilities associated to the particles of the
    new state estimation set

$$ P([x(k+1)]^i | Y(k)), \quad i = 1, \ldots, N_{k+1} \tag{10} $$

3 Interval Bayesian formulation
This section deals with the evaluation of the Bayesian solution of the
state estimation problem considering bounded state boxes (6).

3.1 Measurement update
Whereas each particle is defined as a box by (6), the measurement is
tainted with stochastic uncertainty defined by the pdf $p_e$. The weight
$w(k)^i$ associated to a box particle is updated by the posterior
probability $P([x(k)]^i | Y(k))$:

$$ w(k)^i = \frac{1}{\Lambda(k)}\, P([x(k)]^i | Y(k-1))\, p_e(y(k) - h([x(k)]^i))
          = \frac{1}{\Lambda(k)}\, P([x(k)]^i | Y(k-1)) \int_{x(k) \in [x(k)]^i} p_e(y(k) - h(x(k)))\, dx(k),
   \quad i = 1, \ldots, N_k \tag{11} $$

where the normalization constant $\Lambda(k)$ is given by

$$ \Lambda(k) = \sum_{i=1}^{N_k} P([x(k)]^i | Y(k-1)) \int_{x(k) \in [x(k)]^i} p_e(y(k) - h(x(k)))\, dx(k) \tag{12} $$

so that

$$ \sum_{i=1}^{N_k} w(k)^i = 1 \tag{13} $$

   The derivation of the measurement update equation (11) from the particle
filtering update equation (4) is detailed in the Appendix for $n_x = 1$,
without loss of generality. The principle of the proof is that the point
particles are grouped into particle groups inside boxes; the posterior
probability of a box can then be approximated by the sum of the posterior
probabilities of its point particles when the number of these particles
tends to infinity.

3.2 State update
This step is similar to the state update step in [2] and [3]. Hence, we have:

$$ p(x(k+1)|Y(k)) \approx \sum_{i=1}^{N_k} w(k)^i\, U_{[f]([x(k)]^i, u(k), [v(k)])} \tag{14} $$

   The interval boxes $[x(k+1)|x(k)]^i$ are computed from (1) using interval
analysis as follows,

$$ [x(k+1)|x(k)]^i \approx [f]([x(k)]^i, u(k), [v(k)]) \tag{15} $$

   The updated interval boxes inherit the weights $w(k)^i$ of their parent
boxes $[x(k)]^i$, $i = 1, \ldots, N_k$.

4 Resampling
Once the updated boxes $[x(k+1)|x(k)]^i$ and their associated weights
$w(k)^i$ have been computed, the objective is to compute a new set of
disjoint boxes. This corresponds to the resampling step of the conventional
particle filter.
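
As an illustration of the measurement update (11)-(13), the following minimal sketch assumes a scalar state, an identity measurement function h(x) = x and zero-mean Gaussian measurement noise with standard deviation sigma, so that the integral of the noise pdf over each box has a closed form through the normal CDF. These assumptions, the function names and the numerical values are hypothetical and are not taken from the paper.

```python
import math

def normal_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def box_likelihood(box, y, sigma):
    """Integral of p_e(y - x) over the box [lo, hi].

    Assumes h(x) = x and Gaussian p_e with standard deviation sigma.
    """
    lo, hi = box
    return normal_cdf((hi - y) / sigma) - normal_cdf((lo - y) / sigma)

def measurement_update(boxes, priors, y, sigma):
    """Return (weights, Lambda) following equations (11) and (12)."""
    unnormalized = [p * box_likelihood(b, y, sigma)
                    for b, p in zip(boxes, priors)]
    lam = sum(unnormalized)  # normalization constant Lambda(k), eq. (12)
    if lam == 0.0:
        raise ValueError("no box is compatible with the measurement")
    return [u / lam for u in unnormalized], lam

# Three disjoint boxes with equal prior probability (illustrative values).
boxes = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0)]
priors = [1 / 3, 1 / 3, 1 / 3]
weights, lam = measurement_update(boxes, priors, y=1.4, sigma=0.5)
```

With these values the box containing the measurement receives the largest weight and the weights sum to one, as required by (13). For a nonlinear h or a non-Gaussian noise pdf, the box integral would generally have to be approximated numerically.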




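
The interval state update (15) can be sketched for a hypothetical scalar model x(k+1) = a x(k) + u(k) + v(k) with |v(k)| ≤ sigma_v. The model, the helper names and the numbers are illustrative assumptions. For such an affine f the natural interval extension is exact, whereas a general nonlinear f would need a proper interval arithmetic library and may give conservative enclosures.

```python
def interval_add(a, b):
    """Sum of two intervals given as (lo, hi) tuples."""
    return (a[0] + b[0], a[1] + b[1])

def interval_scale(c, a):
    """Product of a real constant c and an interval a."""
    lo, hi = c * a[0], c * a[1]
    return (min(lo, hi), max(lo, hi))

def predict_box(box, u, a, sigma_v):
    """Inclusion function [f]([x], u, [v]) for the assumed f = a*x + u + v."""
    noise = (-sigma_v, sigma_v)  # bounded process noise box [V]
    return interval_add(interval_add(interval_scale(a, box), (u, u)), noise)

def state_update(boxes, weights, u, a, sigma_v):
    """Propagate every box through [f]; each box keeps its weight."""
    predicted = [predict_box(b, u, a, sigma_v) for b in boxes]
    return predicted, list(weights)

boxes = [(0.0, 1.0), (1.0, 2.0)]
weights = [0.25, 0.75]
predicted, inherited = state_update(boxes, weights, u=1.0, a=-0.5, sigma_v=0.1)
```

In this example the two propagated boxes overlap even though the original boxes were disjoint, which is the situation the resampling step is meant to resolve.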


4.1 Repartitioning
We assume that the new boxes are of the same size, that they cover the
whole space defined by the union of the updated boxes $[x(k+1)|x(k)]^i$,
$i = 1, \ldots, N_k$, and that their weight is proportional to the weight
of the former boxes.
   For this purpose, a support box set Z is computed as the minimum box
such that

$$ Z \supseteq \bigcup_{i=1}^{N_k} [x(k+1)|x(k)]^i. \tag{16} $$

Z is partitioned into M disjoint boxes of the same size

$$ [z]^i, \quad i = 1, \ldots, M \tag{17} $$

where $[z]^i = [\underline{z}^i, \overline{z}^i]$,
$\underline{z}^i, \overline{z}^i \in \mathbb{R}^{n_x}$, and

$$ \overline{z}_j^i - \underline{z}_j^i = \varepsilon_j, \quad i = 1, \ldots, M, \quad j = 1, \ldots, n_x. \tag{18} $$

   The box component widths are computed as

$$ \varepsilon_j = \frac{\overline{Z}_j - \underline{Z}_j}{m_j}, \quad j = 1, \ldots, n_x \tag{19} $$

where $m_j$ is the number of intervals along dimension j, computed as

$$ m_j = \left\lceil \frac{\overline{Z}_j - \underline{Z}_j}{\delta_j} \right\rceil, \quad j = 1, \ldots, n_x \tag{20} $$

where $\lceil \cdot \rceil$ indicates the ceiling function and $\delta_j$
the minimum accuracy for every state component j defined in Section 2.2. In
this way, we guarantee that

$$ \varepsilon_j \le \delta_j, \quad j = 1, \ldots, n_x \tag{21} $$

   Finally, the number M of boxes of the uniform grid partition is given by

$$ M = \prod_{j=1}^{n_x} m_j \tag{22} $$

   Once the new boxes $[z]^i$ have been computed, the weight of the new
boxes $w_z^i$ can be computed as

$$ w_z^i = \sum_{j=1}^{N_k} \frac{\prod_{l=1}^{n_x} \left| [x_l(k+1)|x(k)]^j \cap [z_l]^i \right|}{\prod_{l=1}^{n_x} \left| [x_l(k+1)|x(k)]^j \right|}\, w(k)^j, \quad i = 1, \ldots, M \tag{23} $$

where $[v_l]^i$ refers to the l-th component of the vector $[v]^i$ and the
interval width $\overline{x}_l - \underline{x}_l$ is denoted by $|[x_l]|$
for compactness. The new weights fulfill

$$ \sum_{i=1}^{M} w_z^i = \sum_{i=1}^{N_k} w(k)^i = 1 \tag{24} $$

   The new weights $w_z^i$ in (23) can be computed efficiently using
Algorithm 1. For every updated box $[x(k+1)|x(k)]^j$, this algorithm finds
the $N_{inter}$ boxes of Z that intersect it. Then, the weight $w(k)^j$ is
distributed proportionally to the volume of the intersection between the
updated box $[x(k+1)|x(k)]^j$ and each of the $N_{inter}$ boxes of Z that
have a non-empty intersection with it.

Algorithm 1 Weights of the new boxes.
  Algorithm Weights-new-boxes(Z, $[x(k+1)|x(k)]^1, \ldots, [x(k+1)|x(k)]^{N_k}$, $w(k)^1, \ldots, w(k)^{N_k}$)
    $w_z^i \leftarrow 0$, $i = 1, \ldots, M$
    for $j = 1, \ldots, N_k$ do
      $[N_{inter}, V_{inter}] = \mathrm{intersec}([x(k+1)|x(k)]^j, Z)$
      for $h = 1, \ldots, N_{inter}$ do
        $i = V_{inter}(h)$
        $w_z^i = w_z^i + \frac{\prod_{l=1}^{n_x} |[x_l(k+1)|x(k)]^j \cap [z_l]^i|}{\prod_{l=1}^{n_x} |[x_l(k+1)|x(k)]^j|}\, w(k)^j$
      end for
    end for
    Return $(w_z^1, \ldots, w_z^M)$
  endAlgorithm

4.2 Controlling the number of boxes
Once the new disjoint boxes and their associated weights have been
computed, the associated weights can be used to select the set of boxes
that are worth pushing forward through the next iteration. This is
performed by selecting the boxes with highest weights and discarding the
others. In order to fulfill the confidence threshold criterion (9) proposed
in Section 2.2, Algorithm 2 is proposed. The set $W_z$ of weights $w_z^i$
associated to the boxes $[z]^i$ is defined as

$$ W_z = \{w_z^1, \ldots, w_z^M\}. \tag{25} $$

   Given a desired confidence threshold $\gamma$, the M disjoint boxes
$[z]^i$ that compose the uniform grid partition of Z and the vector $W_z$
with the associated weights, Algorithm 2 determines the minimum number
$N_{k+1}$ of boxes $[z]^i$ with highest weights $w_z^i$ that fulfill

$$ \sum_{i=1}^{N_{k+1}} w_z^i \ge \gamma \tag{26} $$

   The new state estimate $X(k+1)$ is approximated by this set of $N_{k+1}$
boxes and their prior probability by

$$ P([x(k+1)]^i | Y(k)) \approx W_{k+1}^i, \quad i = 1, \ldots, N_{k+1} \tag{27} $$

where $W_{k+1}^i$ are the $N_{k+1}$ highest weights of $W_z$ associated with
the disjoint boxes $[x(k+1)]^i$, $i = 1, \ldots, N_{k+1}$, that approximate
$X(k+1)$. $W_{k+1}^i$ can be referred to as the a priori weights.

Algorithm 2 State update at step k + 1 with confidence threshold $\gamma$.
  Algorithm State-update($[z]^1, \ldots, [z]^M$, $W_z$, $\gamma$)
    $\gamma_c \leftarrow 0$, $\{X(k+1)\} \leftarrow \{\emptyset\}$, $W_{k+1} \leftarrow \{\emptyset\}$, $N_{k+1} \leftarrow 0$
    while $\gamma_c < \gamma$ do
      $[value, pos] = \max(W_z)$
      addbox($X(k+1)$, $[z]^{pos}$)
      addelement($W_{k+1}$, value)
      $\gamma_c = \gamma_c + value$
      $W_z(pos) \leftarrow 0$
      $N_{k+1} \leftarrow N_{k+1} + 1$
    endwhile
    Return ($X(k+1)$, $W_{k+1}$, $N_{k+1}$)
  endAlgorithm
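
The repartitioning of equations (16)-(22) and the weight redistribution of equation (23) can be sketched as follows. This is a minimal illustration under stated assumptions: boxes are represented as lists of (lo, hi) pairs, one per state component, all boxes have non-zero width, and the inner loop simply visits every grid box (disjoint boxes contribute zero) instead of first searching for the intersecting ones as Algorithm 1 does. All function names and values are hypothetical.

```python
import itertools
import math

def support_box(boxes):
    """Smallest box Z enclosing all updated boxes, eq. (16)."""
    n = len(boxes[0])
    return [(min(b[l][0] for b in boxes), max(b[l][1] for b in boxes))
            for l in range(n)]

def uniform_grid(Z, delta):
    """Partition Z into M equal boxes with widths eps_j <= delta_j."""
    axes = []
    for (lo, hi), d in zip(Z, delta):
        m = max(1, math.ceil((hi - lo) / d))   # eq. (20)
        eps = (hi - lo) / m                    # eq. (19)
        axes.append([(lo + q * eps, lo + (q + 1) * eps) for q in range(m)])
    # Cartesian product of the per-axis intervals gives the M grid boxes.
    return [list(cell) for cell in itertools.product(*axes)]

def overlap(a, b):
    """Width of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def new_box_weights(boxes, weights, grid):
    """Weights w_z^i of the grid boxes, eq. (23)."""
    wz = [0.0] * len(grid)
    for box, w in zip(boxes, weights):
        volume = math.prod(hi - lo for lo, hi in box)
        for i, cell in enumerate(grid):
            shared = math.prod(overlap(xl, zl) for xl, zl in zip(box, cell))
            wz[i] += shared / volume * w
    return wz

# Two overlapping one-dimensional updated boxes (illustrative values).
boxes = [[(0.0, 1.0)], [(0.5, 2.0)]]
weights = [0.4, 0.6]
Z = support_box(boxes)
grid = uniform_grid(Z, delta=[0.5])
wz = new_box_weights(boxes, weights, grid)
```

Here Z = [0, 2] is split into four boxes of width 0.5, and the total weight is preserved by the redistribution, in agreement with (24).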




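
The box selection of Algorithm 2 can be sketched in a few lines. The version below is an illustration rather than the authors' implementation: it sorts the weights once instead of repeatedly searching for the maximum, and the optional cap n_max on the number of retained boxes is an assumed way of bounding the box count. Names and values are hypothetical.

```python
def select_boxes(grid, wz, gamma, n_max=None):
    """Keep the highest-weight boxes until their cumulative weight reaches gamma."""
    # Indices of the grid boxes ordered by decreasing weight.
    order = sorted(range(len(wz)), key=lambda i: wz[i], reverse=True)
    kept_boxes, kept_weights, cumulative = [], [], 0.0
    for i in order:
        if cumulative >= gamma:
            break  # confidence threshold reached, condition (26)
        if n_max is not None and len(kept_boxes) >= n_max:
            break  # optional bound on the number of boxes (assumption)
        kept_boxes.append(grid[i])
        kept_weights.append(wz[i])
        cumulative += wz[i]
    return kept_boxes, kept_weights, len(kept_boxes)

grid = [(0.0, 0.5), (0.5, 1.0), (1.0, 1.5), (1.5, 2.0)]
wz = [0.2, 0.4, 0.2, 0.2]
boxes, weights, n_next = select_boxes(grid, wz, gamma=0.7)
```

With gamma = 0.7 three boxes are retained, since the two heaviest boxes only accumulate a weight of 0.6. The retained weights are not renormalized here; they play the role of the a priori weights in (27).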


   This algorithm generates a set of state boxes $\{\mathcal{X}(k+1)\}$, a list of weights $W^i_{k+1}$, a cumulative weight variable $\gamma_c$, and a cardinality variable $N_{k+1}$. At the beginning of the algorithm, the state boxes and the weight list are initialized as empty sets, and the cumulative weight and cardinality variables are initialized to zero. The "while" loop operates as a sort: it eliminates the boxes with the smallest weights so that the cumulative sum of the boxes with the largest weights is greater than or equal to the threshold $\gamma$. If the state space is not bounded, the threshold $0 < \gamma < 1$ does not guarantee a bounded number of boxes in a worst-case scenario in which the measurements do not emphasize some particles against others. In this case, a maximum number of particles $N^{max}$ should be imposed.

5 State estimation and fault detection

5.1   State estimation

Once the set of $N_{k+1}$ disjoint boxes $[x(k+1)]^i$, $i = 1, \dots, N_{k+1}$, that approximate $\mathcal{X}(k+1)$ and their associated a priori weights $W^i_{k+1}$ have been computed, their measurement updated weights $w(k+1)^i$ are obtained using (11). Then, according to [2], the state at instant $k+1$ is approximated by

$$\hat{x}(k+1) = \sum_{i=1}^{N_{k+1}} w(k+1)^i \, x_0^i(k+1) \qquad (28)$$

where $x_0^i(k+1)$ is the center of the particle box $[x(k+1)]^i$.
  Algorithm 3 summarizes the whole state estimation procedure.

Algorithm 3 State estimation
  Algorithm State estimation
    Initialize $\mathcal{X}(0)$, $N_0$ and $P([x(k)]^i|\mathcal{Y}(k-1))_{k=0,\, i=1 \dots N_0}$
    for k = 1, . . . , end do
       Obtain Input/Output data $\{u(k), y(k)\}$
       Measurement update
          compute $\Lambda(k)$ using Eq. (12)
          compute $w(k)^i$ using Eq. (11), $i = 1 \dots N_0$
       State estimation
          compute $\hat{x}(k)$ using (28)
       State update
          compute $[x(k+1)|x(k)]^i$, $i = 1 \dots N_0$, using (15)
          compute $Z$ that fulfils (16)
          compute disjoint boxes $[z]^i$, $i = 1, \dots, M$ of (17)
          compute weights $w_z^i$ using Algorithm 1
          compute new state estimation using Algorithm 2
             $N_{k+1}$ disjoint boxes that approximate $\mathcal{X}(k+1)$
             Prior probabilities given by weights $W_{k+1}$
    end for
  endAlgorithm

5.2   Fault detection

In our framework, fault detection can be formulated as detecting inconsistencies based on the state estimation. To do so, we propose the two following indicators:
  • Abrupt changes in the state estimation provided by (28) from instant $k-1$ to instant $k$, i.e. abnormally high values of $\sqrt{(\hat{x}(k) - \hat{x}(k-1))(\hat{x}(k) - \hat{x}(k-1))^T}$.
  • Abnormally low sum of the unnormalized posterior probabilities of all the particles at instant $k$, which means that all the particles have been penalized by the current measurements. This abnormality can be checked by thresholding $\Lambda(k)$ defined in (12).

   If enough representative fault free data are available, the indicators defined above can be determined by means of thresholds computed with these data. For example, the threshold that defines the abnormal abrupt change in state estimation can be computed as

$$\Delta\hat{x}_{max} = \beta_1 \max_{i=2,\dots,L} \sqrt{(\hat{x}(i) - \hat{x}(i-1))(\hat{x}(i) - \hat{x}(i-1))^T} \qquad (29)$$

where $L$ is the length of the fault free scenario and $\beta_1 > 1$ is a tuning parameter. Then the fault detection test consists in checking at each instant $k$ if

$$\sqrt{(\hat{x}(k) - \hat{x}(k-1))(\hat{x}(k) - \hat{x}(k-1))^T} > \Delta\hat{x}_{max} \qquad (30)$$

   In a similar way, the threshold $\Lambda_{min}$ that defines the minimum expected unnormalized posterior probability can be computed as

$$\Lambda_{min} = \beta_2 \min_{i=2,\dots,L} (\Lambda(i)) \qquad (31)$$

where $\Lambda(i)$ is determined using (12) and $0 < \beta_2 < 1$ is a tuning parameter. Then the fault detection test consists in checking at each instant $k$ if

$$\Lambda(k) < \Lambda_{min} \qquad (32)$$

6 Application example

In this section, a target tracking example in a sensor network, presented in [8], is used to illustrate the state estimation method presented above. The problem consists of three sensors and one target moving in the horizontal plane. Each sensor can measure the distance to the target, and by combining these distances a position fix can be computed. Fig. 1 depicts a scenario with a trajectory and a certain combination of sensor locations ($S_1$, $S_2$ and $S_3$).

Figure 1: Target true trajectory and sensor positions in the bounded horizontal plane (axes: x1 (m), x2 (m); sensors S1, S2 and S3)

  The behaviour of the system can be described by the following discrete time state-space model:






$$\begin{pmatrix} x_1(k+1) \\ x_2(k+1) \end{pmatrix} = \begin{pmatrix} x_1(k) \\ x_2(k) \end{pmatrix} + T_s \begin{pmatrix} v_1(k) \\ v_2(k) \end{pmatrix} \qquad (33)$$

$$\begin{pmatrix} y_1(k) \\ y_2(k) \\ y_3(k) \end{pmatrix} = \begin{pmatrix} \sqrt{(x_1(k) - S_{1,1})^2 + (x_2(k) - S_{1,2})^2} \\ \sqrt{(x_1(k) - S_{2,1})^2 + (x_2(k) - S_{2,2})^2} \\ \sqrt{(x_1(k) - S_{3,1})^2 + (x_2(k) - S_{3,2})^2} \end{pmatrix} + \begin{pmatrix} e_1(k) \\ e_2(k) \\ e_3(k) \end{pmatrix} \qquad (34)$$

where $x_1(k)$ and $x_2(k)$ are the object coordinates bounded by $-1 \le x_1(k) \le 3$ and $-1 \le x_2(k) \le 4$ $\forall k \ge 0$. $T_s = 0.5$ s is the sampling time, $v_1(k)$ and $v_2(k)$ are the speed components of the target that are unknown but considered bounded by the maximum speed $\sigma_v = 0.4$ m/s ($|v_1(k)| \le \sigma_v$ and $|v_2(k)| \le \sigma_v$). $y_1(k)$, $y_2(k)$ and $y_3(k)$ are the distances measured by the sensors. $S_{i,j}$ denotes the component $j$ of the location of sensor $i$. $e_1(k)$, $e_2(k)$ and $e_3(k)$ are the stochastic additive measurement errors $p_{e_i} \sim N(0, \sigma_i)$ with $\sigma_1 = \sigma_2 = \sigma_3 = \sqrt{0.05}$ m.
   Fig. 2 shows the evolution of the real sensor distances and measurements in the target trajectory scenario depicted in Fig. 1.

Figure 2: Real and measured distances from the target to the sensors (panels: Distance 1 (m), Distance 2 (m) and Distance 3 (m) versus Time (s); curves: Real, Measured)

   In order to apply the state estimation methodology presented above, a minimum accuracy $\delta_1 = \delta_2 = \delta = 0.2$ m has been selected for both components. No a priori information has been used in the initial state. Then, a uniform grid of disjoint boxes with the same weights and component widths $\varepsilon_1 = \varepsilon_2 = \delta$ that covers all the bounded coordinates $-1 \le x_1 \le 3$ and $-1 \le x_2 \le 4$ has been chosen as initial state $\mathcal{X}(0)$. Posterior probabilities of the boxes have been approximated by weights $w(k)^i$ computed using the new sensor distance measurements in (4.1). The state update has been computed considering the speed bounds in (33). The new boxes have been rearranged considering the minimum accuracy $\delta$ and their associated weights have been computed using (4.1). Finally, Algorithm 2 with threshold $\gamma = 1$ has been applied to reduce the number of boxes.
   Figs. 3 and 4 depict the box weights and their contours using measurement $y_1(1)$ (up) and all the measurements at instant $k = 1$ ($y_1(1)$, $y_2(1)$ and $y_3(1)$) (down). Fig. 5 depicts the box weights and their contours using the measurements at hand at instant $k = 2$.

Figure 3: Box weights using measurement $y_1(k)$ (up) and measurements $(y_1(k), y_2(k), y_3(k))^T$ (down) (panel titles: Box particle filtering weight of boxes using measurement y1(1); Box particle filtering weight of boxes using measurements y1(1), y2(1) and y3(1))

Figure 4: Box weight contours using measurement $y_1(k)$ (up) and measurements $(y_1(k), y_2(k), y_3(k))^T$ (down) (legend: Real point, Estimated BPF)

   The real trajectory and the one estimated using (28) are shown in Fig. 6.
   Finally, different additive sensor faults have been simulated and satisfactory results of the fault detection tests (30) and (32) have been obtained for faults bigger than 0.5 m using thresholds $\Delta\hat{x}_{max}$ and $\Lambda_{min}$ computed with (29) and (31) with $L = 3200$, $\beta_1 = 1.1$ and $\beta_2 = 0.9$.
   Fig. 7 shows the real trajectory and the one estimated using (28) when an additive fault of +0.5 m affects sensor $S_1$ at time $k = 22$. The behaviour of fault detection tests (30) and (32) is depicted in Fig. 8. As seen in this figure, both thresholds are violated at time instant $k = 22$ and therefore the fault is detected at this time instant.
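The initial grid used in this example and the state estimate (28) can be sketched in a few lines of Python. This is a minimal sketch rather than the authors' code: the names `uniform_grid` and `estimate_state` are ours, boxes are stored as pairs of (low, high) intervals, and the weights passed to `estimate_state` are assumed to sum to one.

```python
from typing import List, Tuple

Interval = Tuple[float, float]
Box = Tuple[Interval, Interval]  # (x1 interval, x2 interval)


def uniform_grid(x1_bounds: Interval, x2_bounds: Interval,
                 delta: float) -> Tuple[List[Box], List[float]]:
    """Cover the bounded plane with square boxes of width delta and equal weights."""
    n1 = round((x1_bounds[1] - x1_bounds[0]) / delta)
    n2 = round((x2_bounds[1] - x2_bounds[0]) / delta)
    boxes: List[Box] = []
    for a in range(n1):
        for b in range(n2):
            lo1 = x1_bounds[0] + a * delta
            lo2 = x2_bounds[0] + b * delta
            boxes.append(((lo1, lo1 + delta), (lo2, lo2 + delta)))
    weights = [1.0 / len(boxes)] * len(boxes)  # no a priori information
    return boxes, weights


def estimate_state(boxes: List[Box], weights: List[float]) -> Tuple[float, float]:
    """Weighted sum of the box centers, as in (28)."""
    x1 = sum(w * 0.5 * (b[0][0] + b[0][1]) for b, w in zip(boxes, weights))
    x2 = sum(w * 0.5 * (b[1][0] + b[1][1]) for b, w in zip(boxes, weights))
    return x1, x2


# Bounds and accuracy of the example: -1 <= x1 <= 3, -1 <= x2 <= 4, delta = 0.2 m.
boxes, weights = uniform_grid((-1.0, 3.0), (-1.0, 4.0), 0.2)
x_hat = estimate_state(boxes, weights)
```

With these bounds the grid contains 20 x 25 = 500 boxes, and before any measurement is used the estimate is simply the center of the bounded region.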




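The threshold computation (29), (31) and the tests (30), (32) can be illustrated with the following Python sketch. The function names and the short fault-free record are hypothetical and serve only as an illustration; the default values of `beta1` and `beta2` are the ones quoted in the example.

```python
import math
from typing import List, Sequence, Tuple


def jump_norm(prev: Sequence[float], curr: Sequence[float]) -> float:
    """Euclidean norm of the change between two consecutive state estimates."""
    return math.sqrt(sum((c - p) ** 2 for p, c in zip(prev, curr)))


def compute_thresholds(estimates: List[Sequence[float]], lambdas: List[float],
                       beta1: float = 1.1, beta2: float = 0.9) -> Tuple[float, float]:
    """Thresholds (29) and (31) from a fault-free record of length L."""
    jumps = [jump_norm(estimates[i - 1], estimates[i])
             for i in range(1, len(estimates))]
    dx_max = beta1 * max(jumps)
    lambda_min = beta2 * min(lambdas[1:])  # i = 2, ..., L as in (31)
    return dx_max, lambda_min


def fault_tests(prev: Sequence[float], curr: Sequence[float], lam: float,
                dx_max: float, lambda_min: float) -> Tuple[bool, bool]:
    """Outcome of tests (30) and (32) at one instant."""
    return jump_norm(prev, curr) > dx_max, lam < lambda_min


# Hypothetical fault-free record (illustration only, not data from the paper).
fault_free_x = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.2), (0.2, 0.2)]
fault_free_lambda = [50.0, 40.0, 60.0, 45.0]
dx_max, lambda_min = compute_thresholds(fault_free_x, fault_free_lambda)
alarm = fault_tests((0.2, 0.2), (0.6, 0.2), 30.0, dx_max, lambda_min)
```

In this illustration both indicators raise an alarm: the estimate jumps by 0.4, above the threshold 0.22, and the unnormalized sum 30 falls below the threshold 36.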


Figure 5: Box weights (up) and box weight contours (down) at instant k = 2 (panel titles: Box particle filtering weight of boxes using available measurements at instant k=2; Box particle filtering weight contour of boxes using available measurements at instant k=2; legend: Real point, Estimated BPF)

Figure 6: Trajectories (axes: x1 (m), x2 (m); curves: Real trajectory, Box particle estimation; sensors S1, S2 and S3)

Figure 7: Trajectories in fault scenario (axes: x1 (m), x2 (m); curves: real, Box Particle Filter; sensors S1, S2 and S3; annotations: k=21, k=22, Fault Detection)

Figure 8: Fault indicators and thresholds in the fault scenario (panels: $\Delta\hat{x}(k)$ and $\Lambda(k)$ versus Time (Ts=0.5s); annotation: Fault Detection)

7 Conclusion and perspectives

A box particle algorithm has been proposed for estimation and fault detection in the case of nonlinear systems with stochastic and bounded uncertainties. Using this method in the case of a target tracking sensor network illustrates its feasibility. It has been shown how the measurement update step for the box particle is derived from the particle case. However, convergence and stability of this filter have yet to be proved. Resampling unfortunately drops information and waives the guaranteed results that characterize interval analysis based solutions. However, without resampling the particle filter suffers from sample depletion. This is the reason why resampling is a critical issue in particle filtering (Gustafsson 2002). This approach has to be compared to other PF variants which reduce the number of particles [2], and further investigations concerning resampling are required, in particular if we want to take better advantage of the interval based approach.

Acknowledgments
This work has been partially funded by the Spanish Ministry of Science and Technology through the Project ECOCIS (Ref. DPI2013-48243-C2-1-R) and Project HARCRICS (Ref. DPI2014-58104-R).

A Demonstration of Measurement update: "From particles to boxes"

A.1 Particle filtering

Consider the particles $\{x(k)^j\}_{j=1}^{N}$ uniformly distributed in $x(k)^j \in [\underline{x}(k), \overline{x}(k)]$ $\forall j = 1, \dots, N$ where $\underline{x}(k), \overline{x}(k) \in \mathbb{R}$. Then according to [1] the relative posterior probability for each particle is approximated by

$$P(x(k)^j|\mathcal{Y}(k)) \approx \frac{1}{c(k)} P(x(k)^j|\mathcal{Y}(k-1)) \, p_e(y(k) - h(x(k)^j)) \qquad (35)$$

with

$$c(k) = \sum_{j=1}^{N} P(x(k)^j|\mathcal{Y}(k)) \qquad (36)$$
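The particle measurement update (35) can be illustrated numerically with a short Python sketch. Several choices here are assumptions made for the illustration: the error density $p_e$ is taken as a zero-mean Gaussian with standard deviation `sigma`, the state is scalar, the names `gaussian_pdf` and `particle_update` are ours, and the constant $c(k)$ is computed as the sum of the unnormalised products so that the updated probabilities sum to one.

```python
import math
from typing import Callable, List


def gaussian_pdf(e: float, sigma: float) -> float:
    """Zero-mean Gaussian density, used here as the error density p_e."""
    return math.exp(-0.5 * (e / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def particle_update(particles: List[float], priors: List[float], y: float,
                    h: Callable[[float], float], sigma: float) -> List[float]:
    """Relative posterior probability of each particle after measuring y."""
    unnormalised = [p * gaussian_pdf(y - h(x), sigma)
                    for x, p in zip(particles, priors)]
    c = sum(unnormalised)  # normalising constant (assumption, see lead-in)
    if c == 0.0:
        raise ValueError("all particles have zero likelihood")
    return [u / c for u in unnormalised]


# Three equally likely scalar particles and a direct measurement h(x) = x.
posterior = particle_update([0.0, 1.0, 2.0], [1 / 3, 1 / 3, 1 / 3],
                            y=1.0, h=lambda x: x, sigma=0.5)
```

The particle located at the measured value receives the largest posterior probability, and the two particles at equal distance from it receive equal probabilities.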






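A.2 Grouping particles

If we group the $N$ particles in $N_g$ groups of $\Delta N$ elements

$$\{x(k)^j\}_{j=1}^{N} = \bigcup_{i=1}^{N_g} \{x(k)^l\}_{l=1+(i-1)\Delta N}^{i\Delta N} \qquad (37)$$

with $N_g = \frac{N}{\Delta N}$.
  If we select the groups of points in such a way that

$$\{x(k)^l\}_{l=1+(i-1)\Delta N}^{i\Delta N} \in [x(k)]^i \quad \forall i = 1, \dots, N_g \qquad (38)$$

where

$$[x(k)]^i = [\underline{x}(k) + (i-1)\Delta L, \; \underline{x}(k) + i\Delta L] \qquad (39)$$

with

$$\Delta L = \frac{\overline{x}(k) - \underline{x}(k)}{N_g} \qquad (40)$$

  If the number of particles $N \to \infty$ and therefore $\Delta N \to \infty$

$$P([x(k)]^i|\mathcal{Y}(k)) \approx \sum_{j=1+(i-1)\Delta N}^{i\Delta N} P(x(k)^j|\mathcal{Y}(k)) \qquad (41)$$

$$\sum_{j=1+(i-1)\Delta N}^{i\Delta N} p_e(y(k) - h(x(k)^j)) \, \Delta x(k) \approx \int_{(1+(i-1)\Delta N)\Delta x(k)}^{(i\Delta N)\Delta x(k)} p_e(y(k) - h(x(k))) \, dx(k) \approx \int_{x(k) \in [x(k)]^i} p_e(y(k) - h(x(k))) \, dx(k) \qquad (47)$$

   Finally, multiplying the numerator and denominator of equation (44) by $\Delta x$, we obtain the particle box measurement update equation

$$P([x(k)]^i|\mathcal{Y}(k)) \approx \frac{P([x(k)]^i|\mathcal{Y}(k-1)) \int_{x(k) \in [x(k)]^i} p_e(y(k) - h(x(k))) \, dx(k)}{\sum_{l=1}^{N_g} \left( P([x(k)]^l|\mathcal{Y}(k-1)) \int_{x(k) \in [x(k)]^l} p_e(y(k) - h(x(k))) \, dx(k) \right)} \qquad (48)$$

that corresponds to the equation (11) with

$$\Lambda(k) = \sum_{l=1}^{N_g} \left( P([x(k)]^l|\mathcal{Y}(k-1)) \int_{x(k) \in [x(k)]^l} p_e(y(k) - h(x(k))) \, dx(k) \right) \qquad (49)$$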
                                                                          References
                                                                          [1] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell,
according to (35)                                                             J. Jansson, R. Karlsson, and P.J. Nordlund. Particle
                                                                              filters for positioning, navigation, and tracking. Sig-
                                             P ([x(k)]i |Y(k)) ≈              nal Processing, IEEE Transactions on, 50(2):425–437,
    Pi∆N                    j                                j                2002.
      j=1+(i−1)∆N P (x(k) |Y(k − 1))pe (y(k) − h(x(k) ))
 PNg Pl∆N                                                                 [2] F. Abdallah, A. Gning, and P. Bonnifait. Box particle fil-
                               j                                  j
   l=1    j=1+(l−1)∆N P (x(k) |Y(k − 1))pe (y(k) − h(x(k) ))                  tering for nonlinear state estimation using interval anal-
                                                                (42)          ysis. Automatica, 44(3):807–815, 2008.
                                                                          [3] A. Doucet, N. De Freitas, and N. Gordon. An intro-
  If we consider the particles in the same group i have the
                                                                              duction to sequential Monte Carlo methods. Springer,
same prior probabilities, then:
                                                                              2001.
                                                                          [4] A. Gning, L. Mihaylova, and F. Abdallah. Mixture
                                       p(x(k)j |Y(k − 1)) =                   of uniform probability density functions for non linear
  P ([x(k)]i |Y(k − 1))                                                       state estimation using interval analysis. In Information
                            ∀j = 1 + (i − 1)∆N, . . . , i∆N                   Fusion (FUSION), 2010 13th Conference on, pages 1–
            ∆N
                                                               (43)           8. IEEE, 2010.
                                                                          [5] R.M. Fernández-Cantí, S. Tornil-Sin, J. Blesa, and
and (42) leads to                                                             V. Puig. Nonlinear set-membership identification and
                                                                              fault detection using a bayesian framework: Applica-
                                                 P ([x(k)]i |Y(k)) ≈          tion to the wind turbine benchmark. In Proceedings of
              i            Pi∆N                                  j            the IEEE Conference on Decision and Control, pages
    P ([x(k)] |Y(k − 1)) j=1+(i−1)∆N pe (y(k) − h(x(k) ))
 PNg                          P
                                                                              496–501, 2013.
                 l |Y(k − 1))    l∆N                                j )))
   l=1 (P ([x(k)]                j=1+(l−1)∆N p e (y(k) − h(x(k)           [6] J. Xiong, C. Jauberthie, L. Travé-Massuyès, and F. Le
                                                                (44)          Gall. Fault detection using interval kalman filtering en-
                                                                              hanced by constraint propagation. In Proceedings of the
  If the N particles are uniformly distributed in the interval                IEEE Conference on Decision and Control, pages 490–
[x(k), x(k)], i.e                                                             495, 2013.
                                                                          [7] A. Gning, B. Ristic, and L. Mihaylova. Bernoulli
      x(k)j − x(k)j−1 = ∆x(k) ∀j = 2, . . . , N                (45)           particle/box-particle filters for detection and tracking in
                                                                              the presence of triple measurement uncertainty. IEEE
where                                                                         Transactions on Signal Processing, 60(5):2138–2151,
                          x(k) − x(k)       ∆L                                2012.
                ∆x(k) =                  =                     (46)       [8] F. Gustafsson. Statistical sensor fusion. Studentlitter-
                                N          ∆N
  Then                                                                        atur, Lund, 2010.




Proceedings of the 26th International Workshop on Principles of Diagnosis








Minimal Structurally Overdetermined Sets Selection for Distributed Fault Detection

Hamed Khorasgani¹, Gautam Biswas¹ and Daniel Jung²
¹ Institute of Software Integrated Systems, Vanderbilt University, USA
e-mail: {hamed.g.khorasgani,gautam.biswas}@vanderbilt.edu
² Dept. of Electrical Engineering, Linköping University, Sweden
e-mail: daner@isy.liu.se

Abstract

This paper discusses a distributed diagnosis approach, where each subsystem diagnoser operates independently without a coordinator that combines local results and generates the correct global diagnosis. In addition, the distributed diagnosis algorithm is designed to minimize communication between the subsystems. A Minimal Structurally Overdetermined (MSO) set selection approach is developed as a Binary Integer Linear Programming (BILP) optimization problem for subsystem diagnoser design. For cases where a complete global model of the system may not be available, we develop a heuristic approach, where individual subsystem diagnosers are designed incrementally, starting with the local system MSOs and progressively extending the local set to include MSOs from the immediate neighbors of the subsystem. The inclusion of additional neighbors continues until the MSO set ensures correct global diagnosis results. A multi-tank system is used to demonstrate and validate the proposed methods.

1 Introduction

The Minimal Structurally Overdetermined (MSO) sets approach has been used extensively for designing model based fault detection and isolation (FDI) schemes for complex systems [Krysander et al., 2008a; Krysander et al., 2008b; Svard et al., 2012]. However, for large complex systems such as aircraft and other transportation systems, manufacturing processes, supply chain and distribution networks, and power generation and the power grid, it is becoming imperative to develop distributed approaches to monitoring and diagnosis to overcome the need for complete global models, while also addressing computational complexity and reliability problems for the diagnosers [Leger et al., 1999; Shum et al., 1988; Deb et al., 1998; Lanigan et al., 2011].

Unlike centralized approaches, distributed approaches are more reliable because they avoid single points of failure. In addition, they can reduce the problems of noise, corruption, and losses that can occur when transmitting signals from individual subsystems to a centralized fault diagnosis unit. Measurement noise and signal corruption can significantly affect diagnoser robustness and accuracy [Ferrari et al., 2012]. Transmission delays not only increase detection time, but can also affect the order of detection, which can further affect diagnostic accuracy. Detection time is important for the safe and reliable operation of safety-critical systems. Faster fault detection and isolation enables accompanying fault tolerant control units to react in a timely manner, thus reducing damage and down time of systems [Roychoudhury et al., 2009; Daigle et al., 2007; Duarte Jr and Nanya, 1998; Rish et al., 2005; Bregon et al., 2014]. The computational intractability of building centralized diagnosers for large systems is another important reason to develop distributed solutions for FDI problems.

In this paper, we formulate the distributed minimal structurally overdetermined set selection as a binary integer linear programming (BILP) problem [Wolsey, 1998]. The approach efficiently picks a minimal number of measurements from a subsystem and its neighboring subsystems to develop a local diagnoser for each subsystem of the larger, complex dynamic system. We start with an efficient algorithm designed by [Krysander et al., 2008a] for finding minimally overdetermined sets of constraints to generate the minimal structurally overdetermined (MSO) sets for designing the diagnoser. Other researchers have employed binary integer programming and binary linear integer programming for optimal sensor placement for fault detection and isolation [Sarrate et al., 2007; Rosich et al., 2009]. In this paper, we utilize BILP for distributed MSO selection to facilitate an efficient distributed diagnosis approach.

Our method is designed in a way that the subsystem diagnosers, once designed, can operate independently with no communication with the other subsystem diagnosers (other than a minimal number of shared measurements), but still provide globally correct diagnosis results. Unlike [Lafortune, 2007; Debouk et al., 2000; Indra et al., 2012], this method does not require the use of a centralized coordinator during on-line operations. Therefore, we avoid the single point-of-failure problem of centralized diagnosers. Our method assumes the availability of a global system model from which the set of MSOs for the system can be derived. The independent subsystem diagnosers are designed to minimize the sharing of measurements across subsystems, thus decreasing the cost and increasing the reliability of the overall system diagnosis.

However, global models of a complex system are hard to construct and may not be readily available. Subsystems are often provided by different manufacturers, who are not willing to pass along all of the intellectual property associated with the subsystem to the system integrator. Therefore, to avoid the unrealistic assumption that the complete model of the complex system is available for subsystem diagnoser design, we propose a second algorithm that constructs the individual subsystem diagnosers without assuming the availability of a global model. The modified algorithm is computationally more efficient, but we cannot guarantee that the number of measurements shared between the subsystems is minimal globally (i.e., across the entire system).

The rest of this paper is organized as follows. The background material, definitions and the running example, a four-tank system, are presented in Section 2. The distributed diagnosis problem formulation is presented in Section 3. Algorithm 1 for distributed MSO set selection is described in Section 4. The heuristic modification of Algorithm 1 for the case where the global model is not available is presented in Section 5 as the incremental algorithm. Section 6 discusses the contributions of the paper in relation to previous work, and presents the conclusion of the paper.

2 Background

This section introduces the basic concepts associated with MSO set selection for structural diagnosis of dynamic systems. The system model S is defined as follows.

Definition 1 (System model). A system model S is a four-tuple: (V, M, E, F), where V is the set of variables, M is the set of measurements, E is the set of equations and F is the set of system faults.

We use a configured four tank system, shown in Figure 1, as a running example throughout this paper to describe the problem, and to illustrate the algorithms for distributed MSO set selection. We assume each tank, and the outlet pipe to its right, constitute a subsystem. Therefore, this system has four subsystems. Two of the subsystems, 1 and 3, also have inflows into their tanks. We assume the subsystems are disjoint, i.e., they have no overlapping components. Associated with each subsystem is a set of measurements, shown as encircled variables in the figure.

[Figure 1: Running example: Four Tank System. Labels visible in the figure: inflows u1 and u2, tanks T1 to T4, pipes P1 to P4, measurements y1 to y6, subsystems S1 and S2.]

More generally, we assume the system S has k predefined subsystems, S1, S2, ..., Sk. Each subsystem model is defined as:

Definition 2 (Subsystem model). A subsystem model of system model S, Si (1 ≤ i ≤ k), is also a four-tuple: (Vi, Mi, Ei, Fi), where Vi ⊆ V, Mi ⊆ M, Ei ⊆ E and Fi ⊆ F. Also, S1 ∪ S2 ∪ ... ∪ Sk = S.

For illustration, the first subsystem in our running example is described by the following set of equations:

$$
\begin{aligned}
e_1 &: \dot{p}_1 = \frac{1}{C_{T1} + f_1}\,(q_{in1} - q_1) &\qquad e_4 &: q_{in1} = u_1 \\
e_2 &: q_1 = \frac{p_1 - p_2}{R_{P1} + f_2} &\qquad e_5 &: p_1 = y_1 \\
e_3 &: p_1 = \int \dot{p}_1\, dt &\qquad e_6 &: q_1 = y_2 .
\end{aligned}
\tag{1}
$$

Therefore, E1 = {e1, e2, e3, e4, e5, e6} defines the set of equations, V1 = {ṗ1, p1, p2, qin1, q1} defines the set of variables, M1 = {u1, y1, y2} defines the set of subsystem measurements, and F1 = {f1, f2} defines the set of faults associated with this subsystem model.

Similarly, the second subsystem model is defined by the following equations:

$$
\begin{aligned}
e_7 &: \dot{p}_2 = \frac{1}{C_{T2} + f_3}\,(q_1 - q_2) &\qquad e_{10} &: p_2 = y_3 \\
e_8 &: q_2 = \frac{p_2 - p_3}{R_{P2} + f_4} &\qquad e_{11} &: q_2 = y_4 . \\
e_9 &: p_2 = \int \dot{p}_2\, dt
\end{aligned}
\tag{2}
$$

For this subsystem the set of equations is E2 = {e7, e8, e9, e10, e11}, the set of variables is V2 = {ṗ2, p2, p3, q1, q2}, the set of measurements is M2 = {y3, y4}, and F2 = {f3, f4} is the set of faults.

In this paper, we assume there are no overlapping components among the subsystems. However, the subsystems may share variables at their interface. For example, the liquid flow rate at the outlet pipe of subsystem i, qi, equals qi′, the liquid flow rate at the input to the connected tank i + 1.

Definition 3 (First Order Connected Subsystems). Two subsystems, Si and Sj, are defined to be first order connected if and only if they have at least one shared variable.

In the running example, subsystems S1 and S2 are first order connected and their shared variables are V1 ∩ V2 = {p2, q1}. The two other subsystems in the running example are:

$$
\begin{aligned}
e_{12} &: \dot{p}_3 = \frac{1}{C_{T3}}\,(q_{in2} + q_2 - q_3) &\qquad e_{15} &: q_{in2} = u_2 \\
e_{13} &: q_3 = \frac{p_3 - p_4}{R_{P3} + f_5} &\qquad e_{16} &: q_3 = y_5 . \\
e_{14} &: p_3 = \int \dot{p}_3\, dt
\end{aligned}
\tag{3}
$$

$$
\begin{aligned}
e_{17} &: \dot{p}_4 = \frac{1}{C_{T4} + f_6}\,(q_3 - q_4) &\qquad e_{19} &: p_4 = \int \dot{p}_4\, dt \\
e_{18} &: q_4 = \frac{p_4}{R_{P4}} &\qquad e_{20} &: p_4 = y_6 .
\end{aligned}
\tag{4}
$$

In more general terms, ith order connected subsystem models are defined as follows.

Definition 4 (ith Order Connected Subsystems). Two subsystems, Sk and Sj, are defined to be ith order connected if and only if there exists a subsystem model Sm that is (i − 1)th order connected to Sk and first-order connected to Sj, or Sm is (i − 1)th order connected to Sj and first-order connected to Sk.
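Definitions 3 and 4 can be read as a shortest-path computation on a graph whose nodes are subsystems and whose edges join subsystems that share a variable. The sketch below illustrates that reading on the four tank system; it is not part of the authors' algorithms. It assumes that the "order" of interest is the smallest i for which Definition 4 holds, and the variable sets for S3 and S4 are read off equations (3) and (4) rather than stated in the text. The names `first_order_connected` and `connection_order`, and the labels `dp1` to `dp4` for the pressure derivatives, are hypothetical.

```python
from collections import deque

# Variable sets of the four subsystems. V1 and V2 are stated in the text;
# V3 and V4 are read off equations (3) and (4) (an assumption of this sketch).
# "dp1" stands for the time derivative of p1, and so on.
V = {
    "S1": {"dp1", "p1", "p2", "qin1", "q1"},
    "S2": {"dp2", "p2", "p3", "q1", "q2"},
    "S3": {"dp3", "p3", "p4", "qin2", "q2", "q3"},
    "S4": {"dp4", "p4", "q3", "q4"},
}


def first_order_connected(V, a, b):
    """Definition 3: two distinct subsystems share at least one variable."""
    return a != b and bool(V[a] & V[b])


def connection_order(V, a, b):
    """Smallest i for which a and b are i-th order connected, else None."""
    if a == b:
        raise ValueError("expects two distinct subsystems")
    dist = {a: 0}
    queue = deque([a])
    while queue:  # breadth-first search over first order connections
        s = queue.popleft()
        for t in V:
            if t not in dist and first_order_connected(V, s, t):
                dist[t] = dist[s] + 1
                if t == b:
                    return dist[t]
                queue.append(t)
    return None


print(connection_order(V, "S1", "S2"))  # 1
print(connection_order(V, "S1", "S3"))  # 2
print(connection_order(V, "S1", "S4"))  # 3
```

Breadth-first search is used because each application of Definition 4 adds exactly one first order connection, so the smallest order equals the number of edges on a shortest path between the two subsystems.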






For example, in the four tank system, S1 and S3 are second order connected because both of them are first order connected to S2.

In this paper, we use MSO sets [Krysander et al., 2008b] as the primary conceptual approach for fault detection and isolation. The formal definitions of Structurally Overdetermined (SO) and MSO sets are:

Definition 5. (Structural Overdetermined Set) Consider a set of equations and its associated variables, measurements, and faults: (E, V, M, F). This set of equations is structurally overdetermined (SO) if the cardinality of the set E is greater than the cardinality of the set V, i.e. |E| > |V|.

Definition 6. (Minimal Structurally Overdetermined Set) A set of overdetermined equations is minimal structurally overdetermined (MSO) if it has no proper subset of structurally overdetermined equations.

Consider subsystem S1 of the four tank system in equation (1). Using the software developed by [Krysander et al., 2008a], we can compute the only minimal structurally overdetermined set in this subsystem as MSO11 = (E11, V11, M11, F11), where E11 = {e1, e3, e4, e5, e6}, V11 = {ṗ1, p1, qin1, q1}, M11 = {u1, y1, y2} and F11 = {f1}. For the sake of brevity, in the rest of the paper we simply say that a specific equation, variable, measurement, or fault is a member of an MSO. For example, we say f1 ∈ MSO11.

MSOs represent the redundancies in the system and can form the basis for fault detection and isolation. Global and local fault detectability are defined as:

Definition 7. (Globally detectable fault) A fault f ∈ F is globally detectable in system S if there is a minimal structurally overdetermined set MSOi in the system, such that f ∈ MSOi.

Definition 8. (Locally detectable fault) A fault f ∈ Fi is locally detectable in subsystem Si if there is a minimal structurally overdetermined set MSOi in the subsystem, such that f ∈ MSOi.

Consider Definition 8 and equation (1). Fault f1 is locally detectable because f1 ∈ MSO11, but f2 is not locally detectable since there is no MSO in this subsystem that includes f2. To detect f2 locally, the diagnosis subsystem needs to include additional measurements. Global and local fault isolability are defined as:

Definition 9. (Globally isolable fault) A fault fi ∈ F is globally isolable from fault fj ∈ F if there exists a minimal structurally overdetermined set MSOi in the system S, such that fi ∈ MSOi and fj ∉ MSOi.

Definition 10. (Locally isolable fault) A fault fi ∈ Fi is locally isolable from fault fj ∈ F if there exists a minimal structurally overdetermined set MSOi in subsystem Si, such that fi ∈ MSOi and fj ∉ MSOi.

Note that if a fault fi is locally detectable in a subsystem Si, it is globally detectable too, and if a fault fi is locally isolable from a fault fj, it is globally isolable from fj as well. The problem of MSO selection is presented as a binary integer linear programming (BILP) problem in this paper. BILP is a special case of the integer linear programming problem (ILP), where the unknowns to be solved for are binary variables.1

1 See definition in Wikipedia: https://en.wikipedia.org/wiki/Integer_programming.

Definition 11. (Binary integer linear programming problem (BILP)) A binary integer linear programming problem is a special case of an integer linear programming (ILP) optimization problem in which some or all the unknown variables to be solved for are required to be binary, and the constraints in the problem and the objective function, like ILP, are linear.

The mathematical formulation of BILP is as follows.

$$
\begin{aligned}
&\min\ c^T x \\
&Ax \le b \\
&\exists\, x_b \subset x \\
&\forall x_k \in x_b \Rightarrow x_k \in \{0, 1\},
\end{aligned}
\tag{5}
$$

where vector c is the cost weights and matrix A and vector b define linear constraints, x represents the variables, and xb represents the binary variables [Wolsey and Nemhauser, 2014].

3 Problem Formulation

Designing a set of distributed diagnosers that together have the same diagnosability as a centralized diagnoser is the focus of our work in this paper. In the ideal case, each subsystem includes sufficient redundancies, such that its set of MSOs is sufficient to detect and isolate all of its faults, Fi, uniquely and unambiguously. In that case, we can associate an independent diagnoser Di with each subsystem Si; 1 ≤ i ≤ k, and each diagnoser operates with no centralized control, and no exchange of information with other diagnosers. If the independence among diagnosers does not hold, then the subsystems need to communicate some of their measurements to other subsystems to detect and isolate the faults. To address this problem in an efficient way, we derive an integrated approach to select a set of MSOs for each subsystem that guarantees full diagnosability and minimum exchange of measurements among subsystems.

Given subsystems Si; 1 ≤ i ≤ k, each with a set of local fault candidates Fi such that $\bigcup_{i=1}^{k} F_i = F$, we may need to augment each subsystem with additional measurements that are typically acquired from the (nearest) neighbors of the subsystem, such that all of the faults associated with the extended model of this subsystem are detectable and isolable. In the worst case, all of the measurements from another subsystem may have to be included to make the current subsystem diagnosable. When such a situation occurs, we say the two subsystems are merged and represented by a common diagnoser. Therefore, the total number of independent distributed diagnosers may be less than k.

Each MSO is sensitive to a set of faults and therefore can be used to detect them and isolate them from the other faults in the system. For each subsystem Si, our goal is to find a minimal set of MSOs that provides maximum detectability and isolability to that subsystem. A set of MSOs is minimal if there is no subset of MSOs that provides the same detectability and isolability. To achieve distributed fault diagnosis, we also want each subsystem to use the minimum number of measurements from the other subsystems. In other words, we want to minimize communication or the amount of data (measurements) to be transmitted between the subsystems. More formally, the problem for designing a diagnoser for a particular subsystem Si can be described as follows:






Consider MSO = {MSO1, MSO2, ..., MSOr} as the set of possible MSOs for the subsystem Si. We need to develop an algorithm to select a minimal subset of MSO that guarantees maximal structural detectability and isolability for faults Fi associated with the subsystem, and includes a minimum number of measurements from the other subsystems in the system to assure the equivalence of local and global diagnosability, i.e.,

$$
\begin{aligned}
&\forall S_i;\ 1 \le i \le k \\
&\text{Select } MSO_{S_i} \subset \mathbf{MSO} \\
&\text{s.t.}\quad \min_{M_o \subseteq M} |M_o| \\
&\phantom{\text{s.t.}\quad} D_i(M_i \cup M_o) = D_i(M), \\
&\phantom{\text{s.t.}\quad} I_i(M_i \cup M_o) = I_i(M),
\end{aligned}
\tag{6}
$$

where Mo represents the set of measurements we need to communicate to the subsystem Si along with the set of measurements Mi associated with the subsystem Si. M represents the set of all measurements in the system. For a given set of measurements X, Di(X) represents the set of detectable faults in Fi, and Ii(X) represents the set of faults in Fi that are isolable from the system faults F.

In the next section we formulate the problem as a BILP problem. Formulating the problem as a BILP enables us to use a number of well-developed tools like branch and bound algorithms [Land and Doig, 1960] and branch and cut algorithms [Mitchell, 2002] to solve the problem. However, much like integer linear programming, solving a general BILP takes exponential time in the worst case.

4 MSOs Selection for Distributed Fault

To formulate the problem (6) as a BILP problem we define a binary variable x(k): 1 ≤ k ≤ l, for measurement mk in the system as follows:

$$
x(k) =
\begin{cases}
1 & \text{if } m_k \in M_i \cup M_o \\
0 & \text{if } m_k \notin M_i \cup M_o ,
\end{cases}
\tag{7}
$$

where Mo is the answer to problem (6). We also define x(k + l): 1 ≤ k ≤ r, for MSO MSOk in the system as follows.

$$
x(k + l) =
\begin{cases}
1 & \text{if } MSO_k \in \mathbf{MSO}_i \\
0 & \text{if } MSO_k \notin \mathbf{MSO}_i .
\end{cases}
\tag{8}
$$

To minimize the number of measurements from the other subsystems, we develop the following cost function c:

$$
c(k) =
\begin{cases}
0 & \text{if } m_k \in M_i \\
1 & \text{if } m_k \in M \setminus M_i \\
0 & \text{if } l < k \le l + r,
\end{cases}
\tag{9}
$$

where l is the number of system measurements and r is the number of MSOs in the system. Using the algorithm proposed in [Krysander et al., 2008a], 165 MSOs are generated for the running example, the four tank system. Since there are 8 measurements in the system, c is a vector with 173 elements for this example.

Consider subsystem Si with local faults Fi and the set of system faults, F. Each local fault fj ∈ Fi has to be locally detectable. Given Definition 8, we can guarantee local detectability of all the faults fj ∈ Fi with the following constraints in the optimization problem (5).
               (
                    0        if            k 1, θk,bol is replaced by inf(θˆk (T0 ), θˆk (T0 )),                                 P2
  • if αk < 1, θk,bol is replaced by $\sup(\underline{\hat{\theta}}_k(T_0), \overline{\hat{\theta}}_k(T_0))$,
and one of the bounds of the components of S(θ)T1 remains equal to θk,eol.
   In the general case, when considering the inspection time Ti, S(θ)Ti is hence obtained with θ̂(Ti−1) as follows:

  • if αk > 1, $\inf(\underline{\hat{\theta}}_k(T_{i-1}), \overline{\hat{\theta}}_k(T_{i-1}))$ is replaced by $\inf(\underline{\hat{\theta}}_k(T_i), \overline{\hat{\theta}}_k(T_i))$,
  • if αk < 1, $\sup(\underline{\hat{\theta}}_k(T_{i-1}), \overline{\hat{\theta}}_k(T_{i-1}))$ is replaced by $\sup(\underline{\hat{\theta}}_k(T_i), \overline{\hat{\theta}}_k(T_i))$.

4.3 FRP Parameter Estimation for a Single Parameter

In this section, we consider one single parameter θ whose evolution is monotonically increasing. As an example, suppose that θ is a bearing friction coefficient that grows as the bearing wears and its environment clogs. In the general case, this kind of knowledge must be provided by an expert of the system and/or the manufacturer.
   Let us consider the first inspection time and the initial search space S(θ)T0 given by the domain value of the parameter Ω(θ) = [θbol, θeol]. The search space is partitioned into boxes, in our case intervals (cf. Fig. 2).
   The dynamic equation of Σ is integrated on the time window ti = t0, . . . , tH, where tH = T0, as many times as the number of intervals in the partition P1. The number of intervals is defined by the partition factor (P1), which equals 1/15 in our example (cf. Fig. 2). We start with [θ]1 = [θbol, θbol + pw], where pw = (P1)w(S(θ)T0) is the width of the partition intervals, then proceed with the subsequent intervals [θ]j. For each interval, we get an estimation of the state vector at times ti = t0, . . . , tH, denoted as x̂(t0 . . . tN)j, and obtain ŷ(t0 . . . tN)j thanks to the observation equation (1). The latter is tested for consistency against the measurements ym(t0 . . . tN).
   Depending on the output of the tests (4), (5), and (7), the parameter interval [θ] is rejected or added to the solution as feasible or undetermined (red-colored, green-colored, and yellow-colored parts, respectively, in Fig. 2).

  Figure 2 – Partition P1 and test results for this partition.

   The convex union of feasible and undetermined intervals provides a guaranteed estimation $\hat{\theta} = [\underline{\hat{\theta}}, \overline{\hat{\theta}}]$ of the admissible values for θ. We iterate the process by creating a new partition P2 of $[\underline{\hat{\theta}}, \overline{\hat{\theta}}]$ with a precision (P2) = 1/10 (cf. Fig. 3).

  Figure 3 – Partition P2 and test results for this partition.

   We iterate the process as long as the precision gain G(Pi+1/Pi) is greater than a given threshold, as shown in Fig. 4.

  Figure 4 – Test results for partition P3.

Remarks
The method can be easily generalized to a system whose parameter vector has dimension nθ > 1. The computing cost is proportional to the number of boxes that are tested, i.e. $\sum_{i=1}^{n_P} 1/(P_i)$, where nP is the number of partitions.
   Note that the partition may be non-regular. For example, for a slowly ageing parameter, one may choose small boxes for the values of θ that are close to θbol and larger ones for the values close to θeol. The result is guaranteed even if the partition has not been properly chosen or if the parameter has evolved in an unexpected way, although the computation cost may be higher.
   The convex union provides a poor result if the set of admissible values is made of several mutually disjoint connected sets, as shown in Fig. 5. The algorithm may also test some boxes that have already been rejected by the tests of the previous partition. This drawback could be addressed by defining the solution as a list of boxes whose labels (unfeasible, feasible, or undetermined) are inherited by the next partition boxes.

  Figure 5 – The returned solution is the convex hull of mutually disjoint connected intervals.

5 SM prognosis

The prognosis phase consists in calculating the number of cycles remaining before anomaly, which is also called the Remaining Useful Life or RUL. Optimally adapting this calculation to the system's life requires knowledge of the health status of the system at the current time, which was the topic of Section 4.
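Returning to the partition-and-test procedure of Section 4.3, the following minimal sketch illustrates one partition pass for a single parameter. It is not the FRP implementation: the guaranteed integration with VNODE-LP is replaced by a hypothetical toy model y(t) = θ·t whose interval image is trivial to compute, and the names `partition`, `classify`, `estimate` and `toy_predict` are illustrative assumptions. The three labels follow the verbal description of the tests (empty intersection: rejected; inclusion in the measurement envelope at all times: feasible; otherwise undetermined).

```python
def partition(lo, hi, factor):
    """Split [lo, hi] into round(1/factor) intervals of equal width."""
    n = round(1 / factor)
    width = (hi - lo) / n
    return [(lo + j * width, lo + (j + 1) * width) for j in range(n)]


def classify(box, times, measurements, predict):
    """Label a parameter interval as 'rejected', 'feasible' or 'undetermined'."""
    included = True
    for t, (m_lo, m_hi) in zip(times, measurements):
        y_lo, y_hi = predict(box, t)
        if y_hi < m_lo or y_lo > m_hi:
            return "rejected"      # empty intersection at one time point
        if y_lo < m_lo or y_hi > m_hi:
            included = False       # overlaps but leaves the measurement envelope
    return "feasible" if included else "undetermined"


def estimate(lo, hi, factor, times, measurements, predict):
    """Convex union of the non-rejected intervals, or None if all are rejected."""
    kept = [b for b in partition(lo, hi, factor)
            if classify(b, times, measurements, predict) != "rejected"]
    if not kept:
        return None
    return (min(b[0] for b in kept), max(b[1] for b in kept))


# Hypothetical toy model (assumption): y(t) = theta * t with t >= 0, so the
# image of a parameter interval is [theta_lo * t, theta_hi * t].
def toy_predict(box, t):
    return (box[0] * t, box[1] * t)


times = [1, 2, 3, 4, 5]
# Measurement envelopes built around theta = 2 with an error bound of 0.5.
measurements = [(2 * t - 0.5, 2 * t + 0.5) for t in times]
theta_hat = estimate(1.0, 4.0, 1 / 15, times, measurements, toy_predict)
```

A second pass would call `estimate` again on `theta_hat` with a new partition factor, which mirrors the refinement from P1 to P2.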






5.1 Component degradation

The global model (Σ + ∆) assumes that the parameters of the behavior model Σ given by (1) evolve in time, and that their evolution is represented by the degradation model ∆ given by the dynamic equation (2) that is recalled below:

              ∆ : θ̇(t) = g(t, θ(t), w, x(t)).

∆ provides the dynamics of the parameter vector as a function of the state of the system x(t) and of a degradation parameter vector w that allows one to tune the degradation for each of the considered parameters.
   The global model (Σ + ∆), in the form of a dynamic model with varying parameters, cannot be directly integrated by VNODE-LP. An original method that couples the two models Σ and ∆ iteratively is proposed in the following. The method is illustrated by Fig. 6 and used to determine the degradation suffered by each parameter during one unit cycle as defined in Section 3.2.
   Let us denote uC(t), t ∈ [τ, τ + dC], the system input stress during one unit cycle C. As shown in Fig. 6, the following steps are iteratively executed, every iteration corresponding to a computation step given by the sampling period δ:
  1. The normal behavior model Σ is used first with input u(t) = uC(τ) to compute the state x(τ) and the output y(τ);
  2. The parameters are updated with the degradation model ∆ using the value of the state determined previously, i.e. θ(τ) is computed;
  3. The parameters of the behavior model Σ are updated with θ(τ);
  4. The next stress input value uC(τ + δ) is considered, and so on until the end of the cycle, i.e. until the last value of the cycle uC(τ + dC) is reached.

Figure 6 – Computation of the degradation parameters during one unit cycle.

  The above algorithm defines the function:

\[
D : \mathbb{IR}^{n_\theta} \to \mathbb{IR}^{n_\theta}
\tag{11}
\]

where nθ is the number of parameters of the system. For a given cycle i, D maps θ^i into D(θ^i) = θ^{i+1}, which is the value of θ after one unit cycle.
   D is nonlinear. Thus the value of the parameter vector after one cycle, θ^{i+1}, depends on the initial value θ^i. Indeed, we know that a system generally degrades in a nonlinear fashion. We must hence compute θ^{i+1} for all possible values of the parameter vector θ^i.
   For this purpose, the domain value Ω(θk) of each parameter θk is partitioned into Nk intervals. Nk is chosen sufficiently large to reduce the conservatism of the interval function D. The domain value of the parameter vector θ is hence partitioned into $N_\Pi = \prod_{k=1}^{n_\theta} N_k$ possible boxes that must be fed as input to D. Let us for instance consider a two-parameter vector and its beginning-of-life and end-of-life values as follows:

\[
\theta = \begin{pmatrix} \theta_1 \\ \theta_2 \end{pmatrix},\quad
\theta_{bol} = \begin{pmatrix} 1 \\ 1 \end{pmatrix},\quad
\theta_{eol} = \begin{pmatrix} 4 \\ 9 \end{pmatrix}
\quad\text{and}\quad
N = \begin{pmatrix} 3 \\ 2 \end{pmatrix},
\tag{12}
\]

then, if we select the partition landmarks as {2, 3} for θ1 and {5} for θ2 (footnote 2), D must be run for the following 6 box values:

\[
\begin{aligned}
{[\theta]}_1 &= \begin{pmatrix} [1, 2] \\ [1, 5] \end{pmatrix}, &
{[\theta]}_2 &= \begin{pmatrix} [2, 3] \\ [1, 5] \end{pmatrix}, &
{[\theta]}_3 &= \begin{pmatrix} [3, 4] \\ [1, 5] \end{pmatrix}, \\
{[\theta]}_4 &= \begin{pmatrix} [1, 2] \\ [5, 9] \end{pmatrix}, &
{[\theta]}_5 &= \begin{pmatrix} [2, 3] \\ [5, 9] \end{pmatrix}, &
{[\theta]}_6 &= \begin{pmatrix} [3, 4] \\ [5, 9] \end{pmatrix}.
\end{aligned}
\tag{13}
\]

   For each of these box values taken as input for cycle i, i.e. θ^i = [θ]l, l = 1, . . . , 6, D returns the (box) value θ^{i+1} after one unit cycle. This computation is then projected on each dimension to obtain a set of nθ tables, Dθk, k = 1, . . . , nθ, that provide the degradation of each individual parameter θk after one unit cycle.

5.2 RUL determination

The RUL, understood as a RUL for the whole system, can now be determined by computing the number of cycles that are necessary for the parameters to reach the threshold defining the end-of-life (cf. Fig. 7).

Figure 7 – RUL computation (flowchart: diagnosis at inspection time Tk, then a Yes/No stopping test that yields RUL = i)

   For the cycle i = 0, θ^0 is initialized with θ̂, which is the result of the parameter estimation computed by the diagnosis engine. θ̂ is given as input to D, which returns D(θ̂) = θ^1. Then i is incremented by 1 and θ^1 is given as input to D, and so on until the set-membership test comparing θ^i with θeol is satisfied, which provides the stopping condition. This test may take several forms as explained in Section 5.3. If the test is true, then the index i is the number of cycles required to reach the degradation threshold, so RUL = i.
   For a given cycle i, the box value θ^i that must be given as input to D is not necessarily among the values [θ]l, l = 1, . . . , NΠ, of the partition. We propose to compute θ^{i+1} by assuming that the mapping between θ^i and θ^{i+1} is linear in every domain l of the partition. Considering p ∈ R^nθ, D(p) is approximated as follows:

\[
\forall \theta \in [\theta]_l,\quad D(\theta) \approx a \odot \theta + b,\qquad l = 1, \ldots, N_\Pi
\tag{14}
\]

where a = w(D([θ]l))./w([θ]l), b = D([θ]l) − a ⊙ [θ]l, and ⊙ is the term-by-term product of two vectors.
   Equation (14) is applied to the upper bound $\overline{\theta}^i$ and to the lower bound $\underline{\theta}^i$ to obtain an approximation of D(θ^i).

Footnote 2: Notice that the intervals issued from the partitioning are not required to be of equal length.
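As a small illustration of the partitioning in Section 5.1, the sketch below enumerates the boxes of the two-parameter example of Eqs. (12)-(13). It is only a hedged sketch: the helper names `intervals` and `boxes` are hypothetical, and the landmarks are taken as {2, 3} for θ1 and {5} for θ2, which is what the six boxes of Eq. (13) imply.

```python
from itertools import product


def intervals(lo, hi, landmarks):
    """Cut [lo, hi] at the given landmarks; the pieces may have unequal lengths."""
    bounds = [lo, *sorted(landmarks), hi]
    return list(zip(bounds[:-1], bounds[1:]))


def boxes(per_parameter_intervals):
    """Cartesian product of the per-parameter intervals.

    The inputs are reversed so that the first parameter varies fastest,
    which reproduces the ordering [theta]_1 ... [theta]_6 of Eq. (13).
    """
    combos = product(*reversed(per_parameter_intervals))
    return [tuple(reversed(combo)) for combo in combos]


theta1 = intervals(1, 4, [2, 3])   # N_1 = 3 intervals
theta2 = intervals(1, 9, [5])      # N_2 = 2 intervals
all_boxes = boxes([theta1, theta2])
```

Each element of `all_boxes` is one box that would be fed to D; the number of boxes is the product of the Nk, so it grows quickly with the number of parameters.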






5.3 Set-membership test for the RUL

The set-membership test, implemented with an order relation between θ^i and θeol, may take several forms. For instance, if the test is interpreted as:

\[
\exists k \in \{1, \ldots, n_\theta\} \;\big|\;
\overline{\theta}^{\,i}_k \ge \theta_{k,eol} \ \text{if } \alpha_k > 1
\quad\text{or}\quad
\underline{\theta}^{\,i}_k \le \theta_{k,eol} \ \text{if } \alpha_k < 1,
\tag{15}
\]

then it means that one bound of the interval value of at least one parameter θk has reached its end-of-life threshold value θk,eol (from below if αk > 1, from above if αk < 1). The RUL is then qualified as the "worst case RUL", which means that the RUL indicates the earliest cycle at which the system may fail.
   One can also test whether the whole interval value of one of the parameters has crossed its end-of-life threshold, that is to say:

\[
\exists k \in \{1, \ldots, n_\theta\} \;\big|\;
\underline{\theta}^{\,i}_k \ge \theta_{k,eol} \ \text{if } \alpha_k > 1
\quad\text{or}\quad
\overline{\theta}^{\,i}_k \le \theta_{k,eol} \ \text{if } \alpha_k < 1.
\tag{16}
\]

The RUL then represents the cycle at which it is certain that the system will fail.
  It is obviously possible to combine these different tests applied to the different individual parameters depending on their criticality.

6 Case study

6.1 Presentation

The case study is a shock absorber that consists of a moving mass connected to a fixed point via a spring and a damper as illustrated by Fig. 8. The movement of the mass takes place in the horizontal plane in order to eliminate the forces due to gravity. Aerodynamic friction forces are neglected.

            Figure 8 – Spring and damper system

  Newton's second law is written as:

\[
m\vec{a} = \Sigma\vec{F} = \vec{F}_k + \vec{F}_c + \vec{u}
\tag{17}
\]

where m is the mass, $\vec{a}$ is the acceleration, $\vec{F}_k$ is the spring biasing force, $\vec{F}_c$ is the friction force exerted by the damper and $\vec{u}$ is the force applied on the mass. Expressing the forces and the acceleration as a function of the position of the mass x(t), we get:

\[
\ddot{x}(t) + \frac{c}{m}\dot{x}(t) + \frac{k}{m}x(t) = u(t)
\tag{18}
\]

where k is the spring stiffness constant (N/m), m is the mobile mass (kg), and c is the damping coefficient (Ns/m). (18) is a second order ODE. Let us rewrite

\[
\frac{c}{m} = 2\zeta\omega_0 \quad\text{and}\quad \frac{k}{m} = \omega_0^2
\]

and we get

\[
\omega_0 = \sqrt{\frac{k}{m}} \quad\text{and}\quad \zeta = \frac{c}{2\sqrt{km}}.
\]

The impulse response of such a system depends on the value of ζ:
  • if ζ = 0, then the response is a sinusoid;
  • if 0 < ζ < 1, then the response is a damped sinusoid;
  • if ζ ≥ 1, then the response is a decreasing exponential.
  The state model is given by the equation:

\[
\begin{cases}
\dot{X}(t) = \begin{pmatrix} 0 & 1 \\ -\frac{k}{m} & -\frac{c}{m} \end{pmatrix} X(t) + \begin{pmatrix} 0 \\ 1 \end{pmatrix} U(t) \\[2ex]
Y(t) = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} X(t)
\end{cases}
\tag{19}
\]

with X(t) = [x(t), ẋ(t)]^T, and the transfer function is:

\[
\frac{X(p)}{U(p)} = \frac{1}{p^2 + \frac{c}{m}p + \frac{k}{m}}.
\tag{20}
\]

  An example of bounded error step response obtained with VNODE-LP with a sampling parameter δ = 0.1 s, c = 1, m = 2 and k = [3.9, 4.1] is shown in Fig. 9a. There, ζ ≃ 0.177 and the step response is a damped sinusoid. Because k is assumed to have an uncertain value bounded by an interval, the outputs are in the form of envelopes.

6.2 Unit cycle

In the case study, a unit cycle is defined by the application of a unit force for a given duration. The force is applied at time t0 + 5 s, where t0 is the cycle starting time. The force lasts 20 s and cancels at t0 + 25 s as shown by the red curve of Fig. 9b. The cycle ends at t0 + 50 s.
   Fig. 9b presents the system's response for a spring constant k = [3.9, 4.1] N/m, a mass m = 2 kg, a damping coefficient c = 10 Ns/m, and initial speed and position equal to zero. The response is a decreasing exponential.

6.3 Degradation model

The degradation model chosen is the ageing of the damper cylinder. It is represented by a reduction of the damping coefficient proportional to the velocity of the mass [15]:

\[
\dot{c} = \beta\dot{x},\quad \beta < 0.
\tag{21}
\]

The more the damper is used, the weaker it becomes, which is characterized by the change in the damping coefficient.

6.4 Diagnosis

The FRP parameter estimation method presented in Section 4 has been used with the measurements shown in Fig. 9c. These measurements were obtained for

\[
\theta = \begin{pmatrix} c \\ k \\ m \end{pmatrix} = \begin{pmatrix} 5 \\ 4 \\ 2 \end{pmatrix}
\tag{22}
\]

   The goal is to estimate the damping coefficient c and the stiffness constant k. The search space is defined by the interval [4, 9] for c and [3.5, 9] for k. The value of m is assumed to be known, m = 2. Using the notation introduced above, we have:

\[
\theta_{bol} = \begin{pmatrix} 4 \\ 3.5 \\ 2 \end{pmatrix},\quad
\theta_{eol} = \begin{pmatrix} 9 \\ 9 \\ 2 \end{pmatrix}
\tag{23}
\]

The partition P1 is achieved with a precision (P1) = 1/10 for the two parameters to be estimated, c and k. Fig. 10 presents two examples of prediction results with two parameter boxes of P1: [θ]i = ([4.1, 4.2], [4.7, 4.8], 2)^T on the








      (a) Step response for ζ = 0.177.            (b) Unit cycle for the case study.               (c) Measured input and output.

                                         Figure 9 – Case study simulation and data plots.

left and [θ]j = ([5, 5.1], [4, 4.1], 2)^T on the right. On the left figure, one can see that there is no intersection between the estimate and the measurement for the position, hence the box used for the simulation is rejected. On the right, there is an intersection between the measurement and the estimation for all time points, but the estimate is not included in the measurement envelope, hence the parameter box is considered undetermined.

Figure 10 – Estimation results with a rejected parameter box (left) and an undetermined box (right)

  The results for partition P1 are presented in Fig. 11a and we obtain a first estimation for θ:

\[
\hat{\theta} = \begin{pmatrix} [4.1, 5.8] \\ [3.7, 4.2] \\ 2 \end{pmatrix}.
\]

  The estimation precision for partition P1 is given by:

\[
\omega(P_1) = \mathrm{mid}(\hat{\theta}) \,./\, \big(\mathrm{mid}(\hat{\theta}) + w(\hat{\theta})/2\big) = \begin{pmatrix} 0.85 \\ 0.94 \\ 1 \end{pmatrix}
\tag{24}
\]

   The first estimation for θ is used as the search space for partition P2, whose precision is increased by a factor of 10, i.e. (P2) = 0.1. The obtained estimation results are shown in Fig. 11b.
   The estimation is refined as:

\[
\hat{\theta} = \begin{pmatrix} [4.51, 5.57] \\ [3.85, 4.14] \\ 2 \end{pmatrix}.
\]

   The precision is now ω(P2) = [0.9, 0.96, 1]^T, and the precision gain is G(P2/P1) = [0.056, 0.025, 0]^T. The values for the gain indicate that partitioning a third time might be quite inefficient. To confirm this, let us perform a third partition P3, whose precision is increased by a factor of 5, i.e. (P3) = 0.02 (cf. Fig. 11c). The new estimation for θ is θ̂ = ([4.548, 5.526], [3.872, 4.132], 2)^T, and the precision gain is G(P3/P2) = [0.0073, 0.0036, 0]^T. As expected, the gain is negligible with respect to the increase in computation time.

6.5 RUL computation

In this section we apply the set-membership method described in Section 5.2 to compute the RUL for the damping coefficient c.
   The damper is assumed to fail when c ≤ ceol = 2. The degradation model (21) with β = −0, 13 allows us to determine the degradation table Dc for the parameter c for a unit cycle:

     ci         D(ci) = ci+1
     [9, 10]    [8.917, 9.977]
     [8, 9]     [7.911, 8.978]
     [7, 8]     [6.898, 7.979]
     [6, 7]     [5.814, 6.982]
     [5, 6]     [4.859, 5.979]
     [4, 5]     [3.874, 4.977]
     [3, 4]     [2.863, 3.973]
     [2, 3]     [1.721, 2.97]
     [1, 2]     [0, 1.98]
     [0, 1]     [0, 0.9755]                              (25)

After proceeding to the linear interpolation given by (14), the graphical representation of ci+1 as a function of ci is given by Fig. 12.

Figure 12 – Approximated degradation of the damping coefficient c

   The number of elements of the partition has been chosen relatively small to better illustrate the method. In a real situation, this number should be high in order to obtain less conservative predictions.
   The value of c has been previously estimated and is

                    ĉ = [4.548, 5.526].

The graph of Fig. 12 allows us to approximate the predicted value after one unit cycle:

                    D(ĉ) = c1 = [4.4787, 5.4481].

The next iteration of the algorithm allows us to compute c2, etc. After 30 iterations, we obtain c30 = [1.7665, 3.4235].

Footnote 3: The coefficient β has been chosen arbitrarily to illustrate the approach; it does not represent the real ageing of a damper.
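The iteration of Section 6.5 can be sketched as follows, using the tabulated values of Dc from (25) and the affine approximation (14) on each interval of the partition. This is a hedged illustration rather than the authors' implementation: the function names, the treatment of values that fall exactly on a partition landmark, and the convention that cycle i means i applications of D are assumptions. Because the table entries are rounded, the cycle counts returned by the sketch need not coincide exactly with the values reported in the text.

```python
# Degradation table Dc of (25): (c_lo, c_hi) -> (D_lo, D_hi) after one unit cycle.
DC_TABLE = [
    ((0, 1), (0.0, 0.9755)),
    ((1, 2), (0.0, 1.98)),
    ((2, 3), (1.721, 2.97)),
    ((3, 4), (2.863, 3.973)),
    ((4, 5), (3.874, 4.977)),
    ((5, 6), (4.859, 5.979)),
    ((6, 7), (5.814, 6.982)),
    ((7, 8), (6.898, 7.979)),
    ((8, 9), (7.911, 8.978)),
    ((9, 10), (8.917, 9.977)),
]
C_EOL = 2.0  # the damper is assumed to fail when c <= 2


def degrade(c):
    """Affine approximation (14) applied to one bound c of the interval."""
    for (lo, hi), (d_lo, d_hi) in DC_TABLE:
        if lo <= c <= hi:          # assumption: a landmark belongs to the lower box
            a = (d_hi - d_lo) / (hi - lo)   # ratio of the interval widths
            b = d_lo - a * lo
            return a * c + b
    raise ValueError(f"{c} is outside the tabulated domain")


def rul_interval(c_lo, c_hi, max_cycles=1000):
    """Return (worst-case RUL, certain-failure RUL) for a decreasing parameter."""
    worst = certain = None
    for i in range(1, max_cycles + 1):
        c_lo, c_hi = degrade(c_lo), degrade(c_hi)
        if worst is None and c_lo <= C_EOL:
            worst = i              # lower bound has reached the threshold
        if c_hi <= C_EOL:
            certain = i            # whole interval is below the threshold
            break
    return worst, certain


c1 = (degrade(4.548), degrade(5.526))          # one-cycle prediction from c-hat
worst_case, certain = rul_interval(4.548, 5.526)
```

The first value returned corresponds to the worst-case test of Section 5.3 and the second to the certain-failure test, so together they bound the RUL by an interval.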








              (a) Partition P1                           (b) Partition P2                              (c) Partition P3

Figure 11 – Partitions and estimation results (red, yellow and green boxes are resp. rejected, undetermined, accepted param-
eters values).
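The precision values attached to the three partitions of Fig. 11 can be recomputed from formula (24). The sketch below does so component by component; the function name `precision` is an assumption, and the interval bounds are the reported estimates with the decimal commas of the extracted text read as decimal points.

```python
def precision(box):
    """omega = mid / (mid + w/2) for one interval (lo, hi), following (24)."""
    lo, hi = box
    mid = (lo + hi) / 2
    half_width = (hi - lo) / 2
    return mid / (mid + half_width)


# Estimates of (c, k, m) after each partition; m is known, hence a point interval.
estimates = {
    "P1": [(4.1, 5.8), (3.7, 4.2), (2.0, 2.0)],
    "P2": [(4.51, 5.57), (3.85, 4.14), (2.0, 2.0)],
    "P3": [(4.548, 5.526), (3.872, 4.132), (2.0, 2.0)],
}
omega = {name: [precision(b) for b in boxes] for name, boxes in estimates.items()}
```

Rounded to two decimals, `omega["P1"]` gives the vector of (24), and the precision of c and k increases only marginally from P2 to P3, in line with the small reported gain.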


Since inf(c30) < ceol, we get RUL = 30 cycles. After the 44th          [5] D. Gucik-Derigny, R. Outbib, and M. Ouladsine. Es-
iteration, we get c44 = [0.037591, 1.985928]. We then have                 timation of damage behaviour for model-based prog-
sup(c44) < ceol and hence RUL = 44 cycles. The RUL of the                  nostic. In Fault Detection, Supervision and Safety of
damper is hence given by:                                                  Technical Processes, pages 1444–1449, 2009.
                                                                      [6] X. Guan, Y. Liu, R. Jha, A. Saxena, J. Celaya, and
                  RUL = [30, 44] cycles.
                                                                           K. Goebel. Comparison of two probabilistic fatigue
                                                                           damage assessment approaches using prognostic per-
7 Conclusion                                                               formance metrics. International Journal of Prognos-
This paper addresses the condition-based monitoring and                    tics and Health Management, 1(005), 2011.
prognostic problems with a new focus that trades the tra-             [7] R.E. Moore. Automatic error analysis in digital com-
ditional statistical approach for an error-bounded approach.               putation. Technical report LMSD-48421, Lockheed
It proposes a two-stage method whose principle is to first                 Missiles and Space Co, Palo Alto, CA, 1959.
determine the health status of the system and then use this           [8] R.E. Moore. Interval Analysis. Prentice-Hall, Engle-
result to compute the RUL of the system. This study uses
                                                                           wood Cliffs, 1966.
advanced interval analysis tools to obtain guaranteed results
in the form of interval bounds for the RUL.                           [9] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Ap-
   The results for the case study demonstrate the feasibility              plied Interval Analysis, with examples in parameter
of the approach. The next step is to adapt the FRP-based                   and state estimation, Robust control and robotics.
SM parameter estimation algorithm in order to output a list                Springer, Londres, 2001.
of boxes instead of a single box given by the convex hull             [10] L. Jaulin and E. Walter. Set inversion via interval anal-
of the boxes. The convex hull is indeed a very conservative                ysis for nonlinear bounded-error estimation. Automat-
approximation when the solution set is not convex.                         ica, 29:1053–1064, 1993.
   The second stream of work is to consider contextual con-
                                                                      [11] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Ap-
ditions and their associated uncertainties. Environmental
conditions, like weather, different usage, etc. may indeed                 plied Interval Analysis, with examples in parameter
significantly affect the stress input and prognostics results.             and state estimation, Robust control and robotics.
                                                                           Springer, Londres, 2001.
                                                                      [12] C. Jauberthie, N. Verdière, and L. Travé-Massuyès.
References
                                                                           Fault detection and identification relying on set-
[1] Indranil Roychoudhury and Matthew Daigle. An inte-                     membership identifiability. Annual Reviews in Con-
    grated model-based diagnostic and prognostic frame-                    trol, 37:129–136, 2013.
    work. In Proceedings of the 22nd International Work-              [13] N. Nedialkov. VNODE-LP, a validated solver for ini-
    shop on Principle of Diagnosis (DX’11). Murnau,
                                                                           tial value problems for ordinary differential equations.
    Germany, 2011.
                                                                      [14] R. J. Lohner. Enclosing the solutions of ordinary initial
[2] Matthew J Daigle and Kai Goebel. A model-based
                                                                           and boundary value problems. In E. W. Kaucher, U. W.
    prognostics approach applied to pneumatic valves. In-
                                                                           Kulisch, and C. Ullrich, editors, Computer Arithmetic:
    ternational Journal of Prognostics and Health Man-
                                                                           Scientific Computation and Programming Languages,
    agement, 2:84, 2011.
                                                                           pages 255–286. Wiley-Teubner, Stuttgart, 1987.
[3] J. Luo, K. R Pattipati, L. Qiao, and S. Chigusa. Model-           [15] Ian M Hutchings. Tribology: friction and wear of
    based prognostic techniques applied to a suspension                    engineering materials. Butterworth-Heinemann Ltd,
    system. Systems, Man and Cybernetics, Part A: Sys-                     1992.
    tems and Humans, IEEE Transactions on, 38(5):1156–
    1168, 2008.
[4] Q. Gaudel, E. Chanthery, and P. Ribot. Hybrid par-
    ticle petri nets for systems health monitoring under
    uncertainty. International Journal of Prognostics and
    Health Management, 6(022), 2015.




                        Proceedings of the 26th International Workshop on Principles of Diagnosis




Configuration as Diagnosis: Generating Configurations with Conflict-Directed A*
                 - An Application to Training Plan Generation -

                                         Florian Grigoleit, Peter Struss
                                    Technische Universität München, Germany
                                       email: {struss, grigolei}@in.tum.de


                        Abstract

    Although many approaches to knowledge-based configuration have been developed, the
    generation of optimal configurations is still an open issue. This paper describes work that
    addresses this problem in a general way by exploiting an analogy between configuration and
    diagnosis. Based on a problem representation consisting of a set of ranked goals and a
    catalog of components, which can contribute in combination to their satisfaction,
    configuration is formulated as a finite constraint satisfaction problem. Configuration is
    then solved by state-search, in which a problem solver selects components to be included in
    an appropriate configuration. A variant of Conflict-Directed A* has been implemented to
    generate optimal configurations. To demonstrate its feasibility, the concept was applied,
    among other domains, to personalized automatic training plan generation for fitness studios.

1   Introduction

Besides diagnosis, the task of configuration has been one of the earliest application areas of
work on knowledge-based systems, initially in the form of rule-based “expert systems”, for
instance in [1]. Today, systems for automated configuration have reached maturity for practical
applications, as shown in [2], [3], and [4]. Despite this success, developing algorithms for
computing optimal or optimized configurations with general applicability still deserves more
research effort.
   Driven by a number of different configuration tasks, we developed GECKO (Generic
constraint-based Konfiguration), a generic solution to the configuration problem that can be
specialized to different application domains and that, among other objectives, aims at supporting
the generation of optimal configurations.
In a nutshell, the solution exploits an analogy:
      • The configuration task can be seen as searching for an assignment of active or non-active
        to the components in a given repository, representing whether or not a component is
        included in the configuration, such that it achieves some goals in an optimal way.
      • Diagnosis has been formalized as a search for an assignment of behavior modes (normal or
        fault_x) to a set of system components such that it is optimally compliant with a set of
        observations.
Based on this analogy, we exploit a search technique that was developed for consistency-based
diagnosis, see [5], and later generalized to optimal constraint satisfaction as conflict-directed
A*, see [6].
   In the following section, we discuss related work on configuration systems. In section 3, we
present some examples of configuration problems that we tackled using GECKO and that will serve
for illustration purposes. Next, we introduce our formalization of the configuration task and the
key concepts of GECKO. In section 5, we discuss the analogy between diagnosis and configuration,
the application of CDA*, variants of utility functions and how they relate to different types of
configuration applications. The results are shown in section 6. Finally, our current work and some
of the open issues are discussed.

2   Knowledge-based Configuration
   Applications of configuration are immensely diverse, but they all share a number of common
problems, such as compliance with domain knowledge, size of the solution space, and the resulting
complexity of the problem-solving task. Such tasks require knowledge-based approaches to support
problem-solving activities such as product configuration or variability management, see [3] and
[4].
   Current research on configuration, especially for large applications, tends to neglect global
optimization, focusing on local optimization, user interaction, or aiming at producing “good”
solutions, see [3] and [7].
   The focus of this paper is a generic, constraint-based configurator (GECKO) for solving optimal
configuration problems. The core of GECKO is a variant of Brian Williams’ Conflict-Directed A*
(CDA*, [6]). The solution works on a generic representation of configuration knowledge and tasks.
We consider the task of generating configurations as similar to consistency-based diagnosis.
Instead of assigning modes for fault identification as in [5], GECKO assigns the activity to
components contributing to goals. A configuration is consistent if all task-relevant goals are
satisfied. The quality of a configuration is given by the level of goal satisfaction and the
amount of resource consumption. Our approach allows the arbitrary selection of optimization
criteria, like minimal resource consumption or maximal goal contribution. In the presented case
study, our aim was to maximize the number of satisfied goals under consideration of available
resources.
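
To make the search view concrete, the following Python sketch enumerates every activity assignment over a tiny, invented catalog and keeps the one that satisfies the most goals within a resource budget. It is a brute-force illustration only, not GECKO's algorithm. The catalog, the goals, the budget, and the rule that a goal counts as satisfied as soon as any active component contributes to it are all assumptions made for this example.

```python
from itertools import product

# Hypothetical toy catalog: component -> (cost, goals it contributes to).
CATALOG = {
    "bench_press": (15, {"chest", "arms"}),
    "squats": (20, {"legs"}),
    "rowing": (25, {"back", "arms", "legs"}),
}
GOALS = {"chest", "arms", "legs", "back"}
BUDGET = 40  # assumed available resource, e.g. minutes of training time


def best_configuration(catalog, goals, budget):
    """Enumerate all active/inactive assignments and keep the best feasible one."""
    names = sorted(catalog)
    best = None
    for flags in product([False, True], repeat=len(names)):
        active = [n for n, on in zip(names, flags) if on]
        cost = sum(catalog[n][0] for n in active)
        if cost > budget:
            continue  # violates the resource limit
        satisfied = set().union(*(catalog[n][1] for n in active)) & goals
        key = (len(satisfied), -cost)  # more goals first, then lower cost
        if best is None or key > best[0]:
            best = (key, active, satisfied)
    return best[1], best[2]


active, satisfied = best_configuration(CATALOG, GOALS, BUDGET)
```

Exhaustive enumeration grows exponentially with the size of the catalog, which is why the paper turns to a best-first search that exploits conflicts.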






3     Application Examples

Configuration problems are almost ubiquitous in modern life, with applications as different as
creating a customized computer, as done by R1 in [1], and adapting the system functions of a car,
see [8]. To illustrate the versatility of GECKO, we present three applications.

3.1 Car Configuration
Today, car manufacturers offer a vast number of models, model variants, and equipment options to
their customers. The resulting complexity not only prohibits a comprehensive exploration of the
solution space, but is also likely to leave customers with sub-optimal car variants. A domain
model for car configuration was created and mapped to the GECKO concepts, which are presented in
section 4.2 ([9]).

3.2 User Interface Configuration
The Beam Instrumentation group at CERN is responsible for the design and implementation of
particle beam measurement systems. These systems are specifically built for each case, resulting
in extensive work on constructing them. While the generation of the GUIs, that is the
implementation, is automated, the configuration is not. This task currently requires an expert to
select libraries, graphical elements, and data sources and to parameterize them. Such tasks are
typical configuration tasks, so the configuration of the GUIs can be automated with GECKO ([10]).

3.3 Training Planning in Sport Science
At first glance, training planning may appear to be a typical scheduling task rather than a
configuration problem. A closer look shows that it mainly involves activities we consider the core
of configuration: selecting, parameterizing, and arranging components to satisfy goals, whereas
assigning time slots to the selected exercises is, in general, fairly straightforward.
   A trainer has to analyze the biometric state of the trainee, such as fitness or age, to consider
constraints on the created training plan, for example duration or available equipment, and to
select and order appropriate exercises.
   The sheer number of existing exercises and the size of the solution space show that training
planning includes optimization. In general, a trainer tries to maximize the training effect within
the available time and under consideration of the trainee’s goal and abilities. The specialization
of GECKO to training planning is described in section 5.

4     GECKO - Foundations

4.1 Intuition
With GECKO, we aim at developing a generic solution to configuration problems, which can be
tailored towards a particular domain by specializing some basic classes and creating a knowledge
base in terms of domain-specific constraints. Its design is driven by the following objectives:
      • supporting both automatic and interactive configuration;
      • enabling the use of the system without deep domain knowledge, esp. about how high-level
        goals of the user break down to more detailed and technical ones;
      • handling also soft domain constraints and user preferences, and
      • offering support to the user by providing explanations for generated parts of the
        configuration and for unavailable options and by suggesting revisions to resolve
        inconsistencies.
   However, this paper focuses on the basis, a generic problem solver for (optimal) configuration.
Determining the solution (the configuration) means selecting a set of instances of given types of
elements (components), perhaps with certain attribute values and organized in a particular
structure. The configuration has to
      1. satisfy a set of high-level user goals,
      2. be compliant with particular attributes and restrictions supplied by the user,
      3. be realizable in principle (i.e. not violating domain-specific restrictions on valid
         configurations),
      4. be realizable under consideration of available resources, and
      5. be optimal (or near optimal) according to a criterion that reflects the degree of
         fulfilling the goals and the amount of resources consumed.
   Configurations can be physical devices, such as turbines, communication systems, and computers,
abstract ones like a curriculum or a company structure, or a software system. In contrast to a
design task involving the creation of new types of components, configuration assumes that all
required Components are instances of component types from a repository ([11]). This leads to
different kinds of reasoning: innovative design has to verify that its result satisfies the goals
by inferring that they are achieved by the system behavior based on behavior models of the
components, whereas for a configuration task, it is assumed that behavioral implications of
aggregated components have been compiled into explicit interdependencies of Goals and Components.
As a result, software systems for configuration are typically based on knowledge encoded as
constraints or rules, as in [1] and [2], and do not require the exploitation of behavior models.

4.2 Core Concepts
   The core concepts of GECKO are derived from the description above, as depicted in Fig. 1:
      • Goals express the achievements expected from a specific configuration. They may have an
        associated priority dependent on the task and different criteria for goal satisfaction.
      • Components are the building blocks of the Configuration. They may be organized in a type
        hierarchy (for example, Lithium battery is a voltage source). In addition, there may be
        Components that are aggregations of lower level components.
      • A Task specifies the requirements on a configuration from the user’s perspective. It is
        split into three kinds of restrictions:

   Task = TaskGoals ∪ TaskParameters ∪ TaskRestrictions.
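
The core concepts map naturally onto a few record types. The Python sketch below is only an illustration under assumed names: `Goal`, `Component`, `Task` and all of their fields are hypothetical and not taken from GECKO. It represents a Task as the combination of its three parts and models restrictions as predicates over the list of active components.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Goal:
    name: str
    priority: int = 0  # task-dependent priority (integer scale is an assumption)


@dataclass
class Component:
    name: str
    # Path in the type hierarchy, most general type first (assumed encoding).
    type_path: List[str] = field(default_factory=list)
    # Lower-level components aggregated by this one.
    parts: List["Component"] = field(default_factory=list)
    cost: float = 0.0  # resource attribute such as money or time

    def is_a(self, type_name: str) -> bool:
        """True if the component belongs to the given type or a subtype of it."""
        return type_name in self.type_path


@dataclass
class Task:
    task_goals: List[Goal]
    task_parameters: Dict[str, str]
    task_restrictions: List[Callable[[List[Component]], bool]]

    def restrictions_hold(self, active: List[Component]) -> bool:
        """Check every restriction against the list of active components."""
        return all(check(active) for check in self.task_restrictions)


battery = Component("LithiumBattery", ["VoltageSource", "LithiumBattery"], cost=20.0)
task = Task(
    task_goals=[Goal("power_supply", priority=1)],
    task_parameters={"country": "DE"},
    task_restrictions=[lambda active: sum(c.cost for c in active) <= 50.0],
)
```

Keeping restrictions as plain predicates is a design shortcut for the example; a constraint-based configurator would represent them as constraints that a solver can reason about.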






   TaskGoals are a collection of Goals the user is aware of                 ComponentConstraints establish interdependen-
and which can be (de)activated or prioritized by the user.                   cies among components (and their attributes): a
Each TaskGoaln is stated as a restriction Task-                              component may be dependent on or incompatible
Goaln.Satisfied=T in the Task description.                                   with the presence of another component in the
   While TaskGoals represent objectives a user requires,                     configuration
TaskParameters associate values to properties of the                        TaskParameterComponentConstraints may in-
Task, hence have the form TaskParameter k=valuekj. For                       clude or exclude certain components based on
instance, in vehicle configuration, the target country may                   TaskParameter values
have an influence on daytime running lights being manda-
tory. However, these implications are not drawn by the                 A fundamental constraint type is
user (who only provides the country information), but by               Requires (x, y)
the domain knowledge represented in the system.                     which is defined by
   In contrast, TaskRestrictions refer explicitly to the               x.active=T  y.active=T
choice of Components and their attributes, e.g. that for the        and used to express dependencies among goals (e.g. refin-
user, a convertible is not an option or that the engine             ing a goal to a set of mandatory sub-goals) and compo-
should be a Diesel engine.                                          nents (e.g. cruise control requires automatic transmission)
A specific, and often essential, TaskRestriction can be in-         and as the fundamental coupling between goals and com-
cluded:                                                             ponents (to achieve high-speed driving, an engine of a cer-
      A ResourceConstraint limits the cost of the con-             tain power is needed). Furthermore, in order to express that
         figuration, which may be indeed money (car con-            several goals or components provide some partial contribu-
         figuration) or time (in training plan configura-           tions that jointly result in the satisfaction of a goal (or es-
         tion), but also computer memory etc. Components            tablish the preconditions of a component), we introduce
         have to have an attribute that allows calculating          the concept of a choice, which can also fill the role of y in
         the resources needed for the entire configuration          a Requires-constraint. A choice is given by a relation
         (often as the sum).                                           GoalChoice  Goals  ContributionDom
                                                                    or
4.3 Constraints on Configurations                                      ComponentChoice  Components  ContributionDom,
The configuration knowledge of a particular application             where ContributionDom specifies a set of values for quan-
comprises the domain-specific specialization and instantia-         tifying how much a goal or component contributes to the
tion of Goals, Components (possibly including component             satisfaction of the choice and needs a zero element and an
attributes and their domains), and relevant TaskParameters          operator  to add up contributions (e.g. addition of inte-
and their domains as well as constraints that capture inter-        gers). The idea behind choices is implemented by three
dependencies among these instances. Dependent on which              kinds of constraints. The degree of the satisfaction of a
kinds of objects are related, we distinguish between the            (component) choice is given by the combined contribu-
following (illustrated in Fig. 1):                                  tions of the active components of the choice:
                                                                       Choice.satLevel =  Choice.goal.actContribution
                                                                    and
                                                                       Choice.goal.actContribution =
                                                                            Choice.goal.contribution IF goal.active=T
                                                                            zero                           IF goal.sctive=F .
                                                                       The choice is satisfied, if the satLevel lies in a specified
                                                                    range, satThreshold:
                                                                       Choice.active = T 
                                                                            Choice.satLevel  Choice.satThreshold .
                                                                       This allows implementing not only a minimum level as
                                                                    a precondition for the satisfaction of a choice, but also a
                                                                    maximum. Preventing “over-satisfaction” may not be a
                                                                    common requirement, but in the fitness domain, one may
                                                                    want to restrict the set of exercises that impose a load on a
              Fig. 1 Task constraints in GECKO
                                                                    particular muscle group.
                                                                    Another predefined general type of constraint is
        TaskParameterGoalConstraints express that
                                                                       Excludes (x, y)
         certain TaskParameter values may exclude or re-
                                                                    defined by
         quire certain goals                                           x.active=T  y.active=F
        GoalConstraints relate goals to each other, in
                                                                    to express conflicting goals, incompatible components, and
         particular for refinement of higher-level (esp.
                                                                    TaskParameterGoal/ComponentConstraints (e.g. high
         TaskGoals) to lower-level ones, such as goals re-          body weight may rule out certain exercises).
         lated to various muscle groups that should be ex-
                                                                       The application-specific configuration knowledge is,
         ercized, although the user is not aware of this
                                                                    thus, basically encoded as a set of the constraints explained
        GoalComponentConstraints capture essential                 above. This, together with the domain-specific ontology
         configuration knowledge, namely whether and
                                                                    (as a specialization of the basic GECKO concepts, includ-
         how the available components contribute to the
                                                                    ing choices, and associated attributes) and, perhaps, specif-
         achievements of goals                                      ic contribution domains and operators, establishes the con-




                                                               93
                        Proceedings of the 26th International Workshop on Principles of Diagnosis


figuration knowledge base, called ConfigKB in the fol-               Criteria 5, optimality, will be discussed in the following
lowing.                                                              section.
We make some reasonable fundamental assumptions about
ConfigKB:                                                            5   Generating (Near) Optimal Configura-
      Each potential TaskGoal is supported: it is the
         starting node of a connected hyper graph of Re-                 tions
         quires constraints that includes components, i.e. it
         actually needs a (partial) configuration in order to
                                                                     5.1 Configuration as Diagnosis
         be satisfied (which does not mean it can actually           The current version of GECKO is based on the assumption
         be satisfied).                                              that there exists a finite set of components, COMPS, as a
      Closure assumption: the encoded interdepend-                  repository for all configurations. This means, no new in-
         encies, esp. the Requires constraints, are com-             stances of components types are created during configura-
         plete. In other words, if all constraints Requires          tion and, more specifically, a component will not be dupli-
         (x, y) associated with x are satisfied by a configu-        cated if it is included in the configuration due to several
         ration, then x is satisfied.                                constraints. In this case, determining ACTCOMPS of a
      It is consistent.                                             complete configuration can be seen as an activity assign-
                                                                     ment
4.4 Definition of the Configuration Task                                AA: COMPS  {active, inactive} ,
The goal is to select an appropriate subset of the available         indicating the inclusion in or exclusion from the configura-
components, which we call the active ones, and possibly              tion, and the consistency test of Definition 2 becomes
                                                                        AA  ConfigKB  Task ⊭ .
determine or restrict their attributes.
                                                                     This representation shows the analogy to the consistency-
Definition 1 (Complete Configuration)                                based formalization of component-oriented diagnosis: an
  A configuration                                                    assignment MA of modes (i.e. nominal or faulty behavior)
     PARCONFIG = (ACTCOMPS, COMPATTR)                                to a set of components,
  is complete if includes exactly the active components:                MA  {OK, fault1, fault2, …}
                                                                     characterizes a diagnosis, if it is consistent with the do-
      comp  ACTCOMPS  comp.Active = T.                             main knowledge (a library of behavior models and a struc-
GECKO has to generate a configuration PARCONFIG that                 tural description), called system description, SD, and a set
satisfies the criteria stated in section 4.1.                        of observations, OBS:
                                                                        MA  SD  OBS ⊭ .
Definition 2 (Solution to a Configuration Task)                      In both cases, the assignments to the components
A configuration task is a pair                                          AA  MA
    (ConfigKB, Task)                                                 are checked for consistency with a fixed set of constraints
                                                                     representing the domain knowledge
(as specified in sections 4.3 and 4.2, respectively), and a
                                                                        ConfigKB  SD ,
complete configuration PARCONFIG is a solution to it, if
                                                                     and a set of constraints representing a specific problem
it is consistent with the ConfigKB and the Task,
                                                                     instance
   PARCONFIG  ConfigKB  Task ⊭ .                                     Task  OBS .
This may seem too weak, because criterion 1 in section 4.1              In consistency-based diagnosis, theories and algorithms
requires the entailment of the satisfaction of the TaskGoals         have been developed to determine diagnostic solutions,
in Task.                                                             which can be exploited for the configuration task based on
Proposition 1                                                        the analogy outlined above.
If PARCONFIG is a solution to a configuration task (Con-
                                                                     5.2 Conflict-directed A*
figKB, Task), then
    PARCONFIG ∪ ConfigKB ⊨                                           Based on the above formalization, many implementations
               ∀goal ∈ TaskGoals: goal.Satisfied = T.                of consistency-based diagnosis exploit a best-first search
                                                                     for consistent mode assignments, using probabilities of
   This follows from the closure assumption: Since for the           individual behavior modes as a utility function (and usual-
chosen TaskGoals, Satisfied=T is explicitly introduced in            ly making the assumption that faults occur independently)
Task, it follows that all Requires constraints related to            as SHERLOCK does [12]. Classical A* search has been
them are satisfied, and, hence, they are not only consistent,        extended and improved by pruning the search space based
but entailed. As for the other criteria of section 4.1:              on inconsistent partial mode assignments that have been
     2. Compliance with specific application require-                previously detected during the search (called conflicts),
          ments is guaranteed by consistency with the                exploiting a truth-maintenance system (TMS, such as the
          TaskParameters under the TaskParameter-                    assumption-based TMS [13]) as a dependency recording
          Goal/ComponentConstraints in ConfigKB and                  mechanism that delivers conflicts. Starting from these
          with TaskRestrictions in Task                              diagnostic solutions, the approach was later generalized as
     3. Realizability is established by consistency with             conflict-directed A* search (CDA*), see [6].
          ComponentConstraints
     4. The ResourceConstraint is also consistent.






                                                                        In diagnosis, it is possible to check partial mode as-
Procedure CDASTAR                                                    signments to detect useful conflicts. In configuration, we
                                                                     have to consider complete variable assignments, which, in
1) Terminate=F
                                                                     assigning T or F to activity variables of all components,
2) Solutions=∅
                                                                     correspond to complete configurations. The reason is that,
3) Conflicts=∅
                                                                     as illustrated by the above trivial example, the constraints
4) VA=VAinitial
                                                                     related to a choice deliver important conflicts based on
5) DO WHILE Terminate=F
                                                                     components being not active. A partial configuration, e.g.
6)      Apply Constraints(VA)
                                                                     assigning active=T to, say, C1 only, is consistent with the
7)      Check consistency of VA
                                                                     respective choice; that this configuration does not satisfy
8)      IF consistent
                                                                     G1 is detected only, if all other components are assumed to
9)         THEN append VA to Solutions
                                                                     be inactive (Of course, if the satThreshold has an upper
10)           Terminate=Solutions.Terminate
                                                                     limit, we obtain conflicts involving too large sets of active
11)     ELSE
                                                                     components, as well). This observation is related to anoth-
12)        Conflicts=APPEND(Conflicts, newConflicts)
                                                                     er difference:
13)     END IF
                                                                     (NON-)Locality of the Domain Theory
14)     VA=Conflicts.BestCandidateResolvingConflicts
                                                                     In diagnosis, the domain theory is as modular as the de-
15) END DO WHILE
                                                                     vice: it consists of constraints that represent the local inter-
16) RETURN Solutions
                                                                     action of components and constraints that capture the local
   The effectiveness of the pruning of the search space
                                                                     behavior of components under certain modes. Checking
based on previously detected inconsistencies (highlighted
                                                                     the consistency of a partial mode assignment requires ap-
in the above pseudo code) grows with the number of (non-
                                                                     plying the directly related constraints only. In contrast,
redundant) conflicts that are extracted. Achieving this,
                                                                     constraints representing configuration knowledge are al-
however, can be computationally expensive and may have
                                                                     most by definition non-local: they are meant to relate many
to be traded off against the computational cost of the con-
                                                                     components across the entire configuration, e.g. as choices.
sistency test and/or the optimality of the solution. We will
                                                                     If choices play a major role and are large, this can be a
get back to this issue below.
                                                                     source of severe problems.
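The control flow of the CDASTAR procedure can be illustrated with a minimal Python sketch. It is not the GECKO implementation: the exhaustive candidate enumeration, the additive `cost` table, the `find_conflicts` callback, and the toy checker `make_choice_checker` (unit contributions, lower thresholds only) are assumptions introduced here for illustration.

```python
from itertools import combinations

def all_candidates(components):
    """Yield every set of active components (exhaustive; toy sizes only)."""
    comps = sorted(components)
    for size in range(len(comps) + 1):
        for combo in combinations(comps, size):
            yield frozenset(combo)

def cdastar(components, cost, find_conflicts, max_solutions=1):
    """Best-first search over active-component sets, pruned by recorded conflicts.

    A conflict is a set of components that must not all be inactive.
    """
    solutions, conflicts, tried = [], set(), set()
    candidate = frozenset()                      # VAinitial: nothing active
    while candidate is not None:
        tried.add(candidate)
        new_conflicts = find_conflicts(candidate)
        if not new_conflicts:                    # consistent: record a solution
            solutions.append(candidate)
            if len(solutions) >= max_solutions:
                break
        else:
            conflicts.update(new_conflicts)
        # Cheapest untried candidate that activates a member of every conflict
        open_list = [c for c in all_candidates(components)
                     if c not in tried and all(c & k for k in conflicts)]
        candidate = min(open_list,
                        key=lambda c: (sum(cost[x] for x in c), sorted(c)),
                        default=None)
    return solutions, conflicts

def make_choice_checker(choices):
    """Toy consistency test: each choice is (members, lower threshold)."""
    def find_conflicts(active):
        found = []
        for members, lower in choices:
            inactive = sorted(set(members) - active)
            size = len(members) - lower + 1      # this many inactive members violate the choice
            if len(inactive) >= size:
                found.extend(frozenset(s) for s in combinations(inactive, size))
        return found
    return find_conflicts

components = ["C1", "C2", "C3", "C4", "C5"]
cost = {"C1": 1, "C2": 1, "C3": 3, "C4": 1, "C5": 1}   # hypothetical resource costs
checker = make_choice_checker([(("C1", "C2", "C3"), 2), (("C3", "C4", "C5"), 1)])
solutions, conflicts = cdastar(components, cost, checker)
print(sorted(solutions[0]), len(conflicts))
```

In this sketch the empty configuration yields four conflicts at once, and the cheapest candidate resolving all of them is then found to be consistent. Real implementations would generate candidates incrementally instead of enumerating all subsets.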
   The straightforward mapping of the configuration prob-
                                                                        The training plan generation application forms an ex-
lem to CDA* is obtained by representing configurations as
                                                                     treme example: choices may involve in the order of 100
variable assignments:
                                                                     components, because many exercises may be related to a
   VARS={ Compi.active | Compi ∈ COMPS}
                                                                     particular muscle group, while only a handful of them to-
   DOM(Compi.active)={T, F} .
                                                                     gether satisfy the goal. In addition, exercises are challeng-
   To illustrate how the algorithm works using a simple
                                                                     ing several muscle groups. If the lower boundary of the
example, assume that goal G1 depends on a component
                                                                     satisfactionThreshold of a choice is k and the size of the
choice that involves 3 components, Ci, each with a contri-
                                                                     choice is n, then (assuming a contribution 1 for each com-
bution of 1 in this choice, which has a satisfactionThresh-
                                                                     ponent), the number of resulting minimal conflicts will be
old (2,3), i.e. it is satisfied if at least two of the compo-
nents are active. Search starts with an empty configuration                         $\binom{n}{k-1}$
(active=F for all components) which leads to an incon-
sistency with the constraints related to the choice. Each               – prohibitively large in the training application. This has
pair of inactive components establishes a (minimal) con-             an impact on the algorithm, as discussed in section 5.5.
flict:                                                               First, we have to introduce appropriate utility functions to
   { C1.active=F, C2.active=F },                                     measure the quality of a configuration.
   { C1.active=F, C3.active=F },
   { C2.active=F, C3.active=F }.
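The following Python sketch enumerates the minimal conflicts of such a choice and the minimal sets of components whose activation resolves them, for G1 and for the second choice G2 (components C3, C4, C5, threshold (1, 3)) considered below. The helper names `minimal_conflicts` and `minimal_resolutions` are hypothetical, and the brute-force enumeration is an assumption made for readability, not the hitting-set algorithm used in GECKO.

```python
from itertools import combinations
from math import comb

def minimal_conflicts(choice, k):
    """Minimal conflicts of a choice with unit contributions and lower threshold k.

    The choice fails once n - k + 1 members are inactive, so every subset of
    that size is a minimal conflict.
    """
    n = len(choice)
    return [frozenset(s) for s in combinations(sorted(choice), n - k + 1)]

def minimal_resolutions(components, conflicts):
    """Subset-minimal sets of components to activate so that no conflict stays fully inactive."""
    found = []
    comps = sorted(components)
    for size in range(len(comps) + 1):           # smallest sets first
        for combo in combinations(comps, size):
            cand = frozenset(combo)
            hits_all = all(cand & c for c in conflicts)
            if hits_all and not any(f <= cand for f in found):
                found.append(cand)
    return found

g1 = minimal_conflicts({"C1", "C2", "C3"}, k=2)
g2 = minimal_conflicts({"C3", "C4", "C5"}, k=1)
assert len(g1) == comb(3, 2 - 1)                 # n over k-1

after_g1 = minimal_resolutions(["C1", "C2", "C3"], g1)
after_g2 = minimal_resolutions(["C1", "C2", "C3", "C4", "C5"], g1 + g2)
print([sorted(s) for s in after_g1])
print([sorted(s) for s in after_g2])
```

The same count explains why large choices are problematic: for n = 100 and a hypothetical threshold k = 5, `comb(100, 4)` already gives 3,921,225 minimal conflicts.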
                                                                     5.4 Utility Functions
Configurations resolving these conflicts are the ones with              The utility of a configuration should essentially reflect
active components                                                             the degree of fulfillment of the relevant goals
   { C1, C2}, { C1, C3}, or { C2, C3},                                            and
and the best one would be checked further. If this is done                    the amount of resources required.
against another choice for a goal G2, which is based on                 A measure of the former may also consider priorities of
components C3, C4, C5 (again all with contribution 1) and a          goals. The same holds for individual components. Since
threshold (1, 3), then a new conflict                                inactive components neither make contributions nor con-
   { C3,active=F, C4,active=F, C5,active=F }                         sume resources, it is plausible to assume that the utility of
is detected, and the configurations resolving all conflicts          a configuration depends on its active components only.
are those with active components                                        In the following, it is assumed that
   { C1, C3}, { C2, C3}, { C1, C2, C4}, or { C1, C2, C5}.                  the contribution of a configuration is obtained
                                                                               solely as a combination of contributions of the
5.3 Diagnosis vs. Configuration                                                active components included in the configuration
Despite the mentioned basic commonality, there are some                        and otherwise independent of the type of proper-
important distinctions at a conceptual level, but with a po-                   ties of the components,
tentially strong impact on the computational complexity.                   we can define a subtraction “-” of contributions,
Partial vs. complete assignments                                           the cost of the contribution is given as the sum of
                                                                               the cost of the involved active components and
                                                                               will usually be numerical,






          we can define a ratio “/” of contributions and re-         3)    Priority=max(actGoals.Priority)
           sources,                                                   4)    DO WHILE Priority >=1
      there is a function that maps priority of goals to a           5)       ApplyConstraints(Constraints(
           weight of the contributions and a kind of multi-                      GoalPriorityClass(Priority)))
           plication,                                                 6)       VA=VA(ActComps,COMPS\ActComps)
            *: DOM(weight) × DOM(contribution)                        7)       NewActComps=
                    → DOM(contribution),                                        GECKO.CDASTAR(VA).ActComps
         is defined.                                                  8)       ActComps = NewActComps.Commit
   Then the following specifies a family of utility functions         9)       Priority=Priority-1
(where we simplify the notation by writing                            10) END DO WHILE
                                                                      11) RETURN ActComps
Goalj.SatThreshold instead of Choicej.SatThreshold etc.):                In line 8, the algorithm fixes the components added to
   For an active Goal Goalj, the TotalContribution of a               satisfy the recently considered goals. This means, when
Configuration is                                                      trying to satisfy further goals (with lower priority) they
   Configuration.TotalContribution(Goalj):=                           will not be de-activated. This heuristic aims at satisfying as
    ⊕_{Compi ∈ Configuration.ACTCOMPS} Compi.Contribution(Goalj)     many goals as possible with the given resources in the or-
where ⊕ denotes the Combine operation, the ActualCon-                 der of their priority, but, obviously, may miss a globally
tribution is given as                                                 optimal solution.
   Configuration.ActualContribution(Goalj):=
   max(Configuration.TotalContribution(Goalj),                        6      Case Study: Training Plan Generation
           Goalj.Combine(Goalj.SatThreshold,
                         posTolerance)),                              We are working on the realization of the three applications
and a penalty (for over-satisfaction) as                              presented in section 3. To demonstrate the specialization of
   Configuration.Penalty(Goalj):=                                     GECKO concepts and the capabilities of the GECKO algo-
   max (0,                                                            rithm, we selected the fitness training example. From the
         Configuration.TotalContribution(Goalj)                       three examples, fitness is best suited to illustrate the ad-
       - Goalj.Combine(Goalj.SatThreshold,posTolerance)).             vantages of CDA* in configuration.
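As a worked illustration of TotalContribution and Penalty, the sketch below uses a few exercise contributions from the case study in section 6. It assumes that Combine is plain addition cut off at 500 (as stated for the fitness domain) and that Goalj.Combine(SatThreshold, posTolerance) adds the two values; the function names and the tolerance of 20 are hypothetical.

```python
def total_contribution(active, contributions, goal, cap=500):
    """Combine the contributions of all active components to one goal.

    Assumption: Combine is addition, cut off at `cap`.
    """
    total = sum(contributions[comp].get(goal, 0) for comp in active)
    return min(total, cap)

def penalty(total, sat_threshold, pos_tolerance):
    """Over-satisfaction beyond SatThreshold combined (here: added) with posTolerance."""
    return max(0, total - (sat_threshold + pos_tolerance))

# Contributions of three exercises to muscle goals (values from the case study)
contributions = {
    "Biceps Curl": {"Biceps": 100},
    "Pull up": {"Biceps": 100, "Latissimus": 80},
    "Dips": {"Triceps": 100, "Latissimus": 20},
}
active = ["Biceps Curl", "Pull up"]
biceps = total_contribution(active, contributions, "Biceps")
latissimus = total_contribution(active, contributions, "Latissimus")
print(biceps, penalty(biceps, sat_threshold=80, pos_tolerance=20))
print(latissimus, penalty(latissimus, sat_threshold=80, pos_tolerance=20))
```

With a threshold of 80, the two active exercises over-satisfy the Biceps goal, whereas Latissimus stays within the tolerance interval and incurs no penalty.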
Based on this, we define the utility function as
   Configuration.Utility(ACTGOALS):=                                  6.1 Domain Theory
    Σ_{Goalj ∈ Configuration.ACTGOALS} (                              In fitness, trainees perform exercises, like push-ups or run-
      weight(Goalj.Priority)                                          ning, to train body parts under certain aspects (endurance,
         * Configuration.ActualUtility(Goalj)                         muscle gain). To train means to improve physical abilities,
      + f * Configuration.Penalty(Goalj) )                            like endurance, and to influence biometric parameters,
      / Σ_{Compi ∈ Configuration.ACTCOMPS} Compi.Resource.            such as weight. In configuration terms: exercises contrib-
   The factor f determines whether or not excessive contri-           ute to a set of fitness goals. Hence, we created the domain
butions are penalized (by the excessive amount); the                  theory for training planning using the concepts specified in
weight can emphasize contributions to Goals with high                 section 4.2. Table 1 contains an overview on the most im-
priority, and the tolerance interval can express how exactly          portant specializations.
the intended SatThreshold has to be hit.                              The result may appear straightforward to outsiders, but it is
                                                                      actually the result of several months of analyses carried out
5.5 GECKO Algorithm                                                   jointly with experts from sports sciences, which took us through
For the GECKO variant of CDA* we modified CDA* by                     several versions and revisions of the model.
activating only the constraints needed at a specific stage,                       Table 1: Specialization of GECKO Concepts
thereby reducing the number of occurring conflicts signifi-
cantly.                                                                    GECKO Concept       Fitness Concept        Example
GECKO characterizes a stage in the problem solving pro-                    Goal                TraineeGoal        Muscle Gain
cess and hence the criteria for constraint activation as a                                     TrainingGoal       Strength
pair                                                                                           TargetGoal         Biceps
          S = (GOALS, configuration),                                      Component           Exercise           Push-up
that is a set of goals that are considered and a configuration             Task                Trainee            -
to be checked for consistency. This allows for search strat-               TaskRestriction                        TrainingDuration
egies that do not consider all active goals from the begin-                TaskParameter       TrainingProperty   Equipment
                                                                                               TraineeProperty    Fitnesslevel
ning. Therefore, the constraints to be applied are not only
determined by the variable assignment, but also by the                Task
goals. In our first application, goals are activated in a de-         A GECKO Task in fitness is a trainee, or more precisely
scending order, according to their priority.                          the request of a training plan by a trainee. A trainee has
   To determine the hitting sets of the conflicts we use dif-         expectations regarding the result of the training, represent-
ferent algorithms from [14], depending on the domain. In              ed by TraineeGoals. The Trainee also has a set of Train-
BestCandidateResolvingConflicts, the next-best solution is            eeProperties, like Fitnesslevel, and sets the TrainingProp-
generated.                                                            erties. Furthermore, a trainee has to specify the desired
Procedure GECKO Configuration Algorithm                               TrainingDuration.
1) ApplyConstraints(Constraints(Initial))                                Special among the TraineeProperties are the FitnessTar-
2) ActComps=ACTCOMPS0                                                 gets and FitnessCategories. A FitnessTarget has to be






trained by an Exercise, such as legs. FitnessCategories are           2), 12 exercises (Table 3), and 2 TaskParameters, namely
the main abilities of a Trainee, such as strength.                    Equipment and a general Fitnesslevel – thus omitting the
                                                                      consideration of different Fitnesslevels related to the spe-
Goals                                                                 cific FitnessTarget, as done in the application system. Fur-
The domain theory contains three types of goals:                      thermore, we set the duration of all exercises to require 5
      TraineeGoal: The only TaskGoal in fitness, de-                 minutes.
          scribing the expected effect of the fitness training           Using this reduced knowledge-base, we applied both the
      TrainingGoal: Abstract goals, specifying the type              basic GECKO algorithms and the goal-focused variant.
          of physical ability to be improved e.g. strength            The results are described in the following subsection.
      TargetGoal: Body part the training has to stimu-
          late.                                                                  Table 2: Exemplary muscle goals with priorities
     To capture the structure of the human body and the               ID   MuscleGoal   Priority: MuscleGain   Priority: GeneralFitness
     differences in fitness categories, we decompose the              G1   Biceps       1                      2
     TargetGoals into three levels:                                   G2   Triceps      1                      2
                o RegionGoal                                          G3   Latissimus   2                      3
                o MuscleGroupGoal
                o MuscleGoal                                                     Table 3: Exercises and parameters
   Reflecting the FitnessTargets, TargetGoals are struc-              ID    Exercise              Contributions                 Required Equipment   Required Fitness level
tured in a goal tree. Because FitnessTargets are trained at           C1    Biceps Curl           Biceps: 100                   None                 1
different levels, the tree is unbalanced. For example En-             C2    Dips                  Triceps: 100, Latissimus: 20  None                 1
durance is generally trained for the whole body, while                C3    Lat-Pull              Biceps: 20, Latissimus: 100   Machines             1
Strength is trained at a muscular level.                              C4    Rev. Butterfly        Triceps: 40                   Machines             1
                                                                      C5    Pushup                Triceps: 80                   None                 2
Components                                                            C6    Pushup on knees       Triceps: 60                   None                 1
All components in fitness are exercises. Each exercise is             C7    Shoulder press        Triceps: 80                   Machines             1
related to a FitnessCategory, e.g., pushup is a StrengthEx-           C8    Rowing                Biceps: 40, Latissimus: 80    Machines             1
ercise. Exercises can contribute to multiple TargetGoals,             C9    Pull up               Biceps: 100, Latissimus: 80   None                 2
but only TargetGoals of their own FitnessCategory. For                C10   Triceps Pulldown      Triceps: 100                  Machines             2
example, a StrengthExercise can only contribute to Tar-               C11   Pull up (supported)   Biceps: 20, Latissimus: 80    None                 1
getGoals related to strength.                                         C12   Rowing one-armed      Biceps: 40, Latissimus: 100   Machines             2
   Exercises comprise a set of fixed attributes, such as re-
quiredEquipment or requiredFitnesslevel, as well as a set
of unspecified attributes, like TrainingWeight or Dura-
tions. The values of such volatile attributes depend on the
selected TraineeGoal, because they define how an exercise
affects a FitnessTarget – an increase in strength is achieved
by a small number of slow repetitions with very high
weight, while fat is burnt best with many fast repetitions
with little weight.

Utility
The utility of a configuration in SmartFit depends on the
contributions of the active components to required Choices
DOM(compi.contributioni) ={20,40,60,80,100}
The satThreshold of the Choices depends on the priority of
the associated goal
satThreshold = combine(Goali.Priority,normThreshold),
with DOM(Priority) ={1,2,3,4,5}.                                      To compare the results of different tasks, we conducted two
   For the example in 6.2, we simply multiplied the priori-           experiments with different TraineeGoals and TaskParame-
ties with the normThreshold =80.                                      ter values. For the basic algorithm, we used the Tasks
The domain of the combined contribution is from 0 to 500              shown in Table 4.
in steps of 20. In case of contributions larger than 500, the                        Table 4: Task for experiments A and B
overshoot is cut, and the value set to 500.
   The utility for fitness training is given by the following         Variable                            Values A          Values B
equation:                                                             TaskGoal                            General Fitness   Muscle Gain
   Config.Utility (ACTGOALS):=                                        TaskParameter: FitnessLevel         Untrained (1)     Trained (2)
    Σ_{Goalj ∈ Config.ACTGOALS}                                       TaskParameter: Equipment            Machines          none
   (weight (Goalj.Priority) * Config.ActualUtility (Goalj))           TaskRestriction: TrainingDuration   15 minutes        30 minutes
   / Σ_{Compi ∈ Config.ACTCOMPS} Compi.Resource.

6.2 Simplified Example
To make the capabilities of GECKO more tangible, we
                                                                      The results of the configuration with the basic GECKO
present a small experiment. For brevity and clarity, we use
a reduced knowledge-base, with three MuscleGoals (Table               algorithm are shown in Tables 5 and 6.






    Table 5: Configuration results basic GECKO Algorithm               technical level (the constraint system) will also be ex-
                                                                       plored.
             Experiment A           Experiment B                          Furthermore, we are currently preparing an application
               Lat-Pull                 Pull up                        to configuration of automation systems for collaborative,
                                         Dips                          flexible manufacturing and modular multi-purpose vehi-
                  Rowing          Pull up (supported)                  cles. This application of GECKO is likely to require
                                   Pushup on knees                     stronger spatial and also temporal constraints for structur-
               Shoulder press         Biceps Curl                      ing a configuration.
  Table 6: State of the Goals after running the basic Algorithm
                                                                       Acknowledgments
     Muscle Goal           Value A               Value B
                                                                       We would like to thank our project partners for providing
  Biceps                   Partially Satisfied   Satisfied             their domain knowledge and their assistance, esp. Florian
  Triceps                  Satisfied             Satisfied             Kreuzpointner and Florian Eibl. Special thanks to Oskar
  Latissimus               Satisfied             Partially Satisfied   Dressler (OCC’M Software) for providing the constraint
                                                                       system (CS3 or Raz’r). The project was funded by the
                                                                       German Federal Ministry of Economics and Technology
Application Evaluation                                                 under the ZIM program (KF2080209DB3).
The results indicate that the GECKO algorithms are capa-
ble of generating optimal solutions to configuration prob-             References
lems. In experiment B, it can be seen that GECKO was not
                                                                       [1] J. P. McDermott, “R1: an Expert in the Computer
able to satisfy G3 completely, since there were not enough
consistent exercises available. Thus, the less important                    Systems Domain”, Artificial Intelligence, 1980.
goals were satisfied, but not the important one. In experi-            [2] U. Junker, D. Mailharro, “The logic of ILOG(J)
ment A on the other hand, the algorithm was able to fully                   Configurator: Combining Constraint Programming
satisfy G3 but not G1, since the duration resource was only                 with a Description Logic”, IJCAI, 2003.
sufficient for three exercises.                                        [3] A. Felfernig, L. Hotz, C. Bagley, and J. Tiihonen,
                                                                            “Knowledge-based Configuration: From Research to
7 Discussion and Outlook                                                    Business Cases.”
The results shown above indicate that treating configura-              [4] D. Sabin, R. Weigel, “Product configuration
tion as a diagnostic problem, and solving it with tech-                     frameworks – a survey.” IEEE Intelligent Systems,
niques from consistency-based diagnosis is a promising                      1998.
approach to user-oriented configurators for optimal con-               [5] J. de Kleer, B. C. Williams, “Diagnosing Multiple
figuration problems.                                                        Faults,” Artificial Intelligence, 1987
   The analysis of different application domains, including
the ones mentioned in section 3, triggers the insight that             [6] B. C. Williams, R. J. Ragno, “Conflict-directed A* and its
variations of the search algorithm may be required in order                 role in model-based embedded systems”, Discrete
to reflect the specific requirements and structure of the                   Applied Mathematics, 2007.
problems. This is particularly true for applications that              [7] M. Stumptner, G. Friedrich, A. Haselböck,
involve a high level of interaction, such as leaving choices                “Generative constraint-based configuration of large
to the user, providing explanations for system decisions,                   technical systems“, Artificial Intelligence for
and allowing them to modify their decisions in an in-                       Engineering Design, Analysis and Manufacturing,
formed way. Retracting decisions and also generating ex-                    1998.
planations can be supported by the ATMS, which also                    [8] G. Weiß, F. Grigoleit, P. Struss, “Context Modeling
produces conflicts.
                                                                            for Dynamic Configuration of Automotive Functions”,
   The conceptual and algorithmic solution to configura-
                                                                            ITSC, 2013
tion generation presented in this paper could certainly be
implemented using other techniques that have been pro-                 [9] C. Richter, “Development of an interactive car
posed and used for configuration. However, our choice of                    configuration system”, Master’s Thesis, Tech. Univ.
an ATMS-based solution (and CDA*) was strongly moti-                        of. Munich,
vated by the overall objectives stated in section 4.1: we              [10] A. Verikios, “A tool for the Configuration of CERN
intend to base explanation facilities (“which user inputs                   Particle Beam Measurement Systems”, Master’s
and domain restriction prevent option x to be viable?”),                    Thesis, Tech. Univ. of. Munich,
preferences and soft constraints, and the possibility to re-
                                                                       [11] U. Junker, “Configuration.” Handbook of Constraint
tract input and explore several alternative solutions on ca-
pabilities of the ATMS.                                                     Programming, Configuration, p. 837-868, 2006.
   A goal of our work is to extract features from the case             [12] J. De Kleer, BC Williams,”Diagnosis with Behavioral
studies that can support a classification of configuration                  Model”, IJCAI, 1993.
applications as a basis for selection from a set of prede-             [13] J De Kleer, “An assumption-based TMS” Artificial
fined algorithm variants and strategies for man-machine                     intelligence, 1986.
interaction.
                                                                       [14] J De Kleer, “Hitting Set Algorithms for Model-based
   Other options, such as compiling (parts of) the con-
                                                                            Diagnosis”, Principles of Diagnosis, 2011.
straint network and moving search heuristics to a lower








    Decentralised Fault Diagnosis of Large-scale Systems: Application to Water
                              Transport Networks
                                    Vicenç Puig and Carlos Ocampo-Martinez
                               Universitat Politècnica de Catalunya - BarcelonaTech
                              Institut de Robòtica i Informàtica Industrial, CSIC-UPC
                                   Llorens i Artigas, 4-6, 08028 Barcelona, Spain
                                  e-mail: {vpuig,cocampo}@iri.upc.edu

Abstract

In this paper, a decentralised fault diagnosis approach for large-scale systems is proposed. The approach is based on obtaining a set of local diagnosers using analytical redundancy relations (ARRs). It starts by obtaining the set of ARRs of the system, which is then represented as an equivalent graph. From that graph, the graph partitioning problem is solved, yielding a set of ARRs for each local diagnoser. Finally, a decentralised fault diagnosis strategy is proposed and applied over the resultant set of partitions and ARRs. In order to illustrate the application of the proposed approach, a case study based on the Barcelona drinking water network (DWN) is used.

1 Introduction

Large-scale systems (LSS) present new challenges due to the large size of the plant and its resultant model [1, 2]. Traditional supervision methods for LSS (including diagnosis and fault-tolerant control) have mostly been developed assuming a centralised scheme with access to the full information. In the same way, a global dynamical model of the system is considered to be available for supervision design (off-line). Moreover, all measurements must be collected in one location in a centralised way. When considering LSS, the centrality assumption usually fails to hold, either because gathering all measurements in one location is not feasible, or because a centralised high-performance computing unit is not available. These difficulties have recently led to research in fault diagnosis (and fault-tolerant control) algorithms that operate in either a decentralised or a distributed way. Depending on the degree of interaction of the diagnosers associated with the subsystems and their diagnosis process, they can be classified into decentralised and distributed diagnosis categories.

In decentralised diagnosis, both a central coordination module and a local diagnoser for each subsystem that forms the whole supervision system run in parallel. Some examples were presented in [3, 4, 5], where local diagnosers communicate with a coordination process (supervisor) to obtain a global diagnosis. On the other hand, in the distributed approach, a set of local diagnosers share information by means of some communication protocol instead of requiring a global coordination process as in a decentralised approach. In the related literature, there are several proposals where there is no centralised control structure or coordination process among diagnosers [6, 7, 8]. Every diagnoser shares information with the neighbouring diagnosers. In these systems the model is distributed, the diagnosis is locally generated and the consistency among the subsystems should be satisfied.

The main contribution of this paper is the development of a decentralised fault diagnosis approach for LSS based on analytical redundancy relations (ARRs) and graph theory. The algorithm starts from a set of ARRs and derives an equivalent graph. From that graph, the problem of graph partitioning is then solved. The resultant partitioning consists of a set of non-overlapping subgraphs whose numbers of vertices are as similar as possible and for which the number of interconnecting edges is minimal. To achieve this goal, the partitioning algorithm applies a set of procedures based on identifying the highly connected subgraphs with a balanced number of internal and external connections in order to minimise the degree of coupling among the resulting partitions (diagnosers). This algorithm is especially useful in systems where there is no clear functional decomposition. Finally, a decentralised fault diagnosis strategy is introduced and applied over the resultant set of partitions, in a similar way to the one introduced in [5]. In order to illustrate the application of the proposed approach, a case study based on the Barcelona drinking water network (DWN) is used.

The remainder of this paper is organised as follows. Section 2 presents and discusses the overall problem statement. Section 3 presents the ARR graph partitioning methodology. Section 4 describes the proposed decentralised fault diagnosis approach. Section 5 shows both the considered case study and the way of implementing the proposed decentralised fault diagnosis approach. Finally, Section 6 draws the main conclusions.

2 Problem Statement

2.1 Fault Diagnosis using ARRs

Consider a dynamical system represented in general form by the state-space model

\[ x^{+} = g(x, u, d), \tag{1a} \]
\[ y = h(x, u, d), \tag{1b} \]

where $x \in \mathbb{R}^n$ and $x^{+} \in \mathbb{R}^n$ are, respectively, the vectors of the current and successor system states (that is, at time instants $k$ and $k + 1$, respectively, if the model is expressed in discrete time), $u \in \mathbb{R}^m$ is the system input vector, $d \in$






$\mathbb{R}^p$ is the vector containing a bounded process disturbance and $y \in \mathbb{R}^q$ is the system output vector. Moreover, $g : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R}^n$ is the state mapping function and $h : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \mapsto \mathbb{R}^q$ is the output mapping function.

The design of a model-based diagnosis system is based on utilising the system model (1) in the construction of the diagnosis tests. According to [9], by means of the structural analysis tool and the perfect matching algorithm, a set of ARRs, namely $R$, can be derived from (1). ARRs are constraints that only involve measured variables $(y, u)$ and known parameters $\theta$. The set of ARRs can be represented as

\[ R = \{ r_i \mid r_i = \Psi_i(y_k, u_k, \theta_k),\ i = 1, \dots, n_r \}, \tag{2} \]

where $\Psi_i$ is the ARR mathematical expression and $n_r$ is the number of obtained ARRs. Then, fault diagnosis is based on identifying the set of consistent ARRs

\[ R_0 = \{ r_i \mid r_i = \Psi_i(y_k, u_k, \theta_k) = 0,\ i = 1, \dots, n_r \}, \tag{3} \]

and inconsistent ARRs,

\[ R_1 = \{ r_i \mid r_i = \Psi_i(y_k, u_k, \theta_k) \neq 0,\ i = 1, \dots, n_r \}, \tag{4} \]

at time instant $k$ when some inconsistency in (2) is detected [10]. The fault isolation task starts by obtaining the observed fault signature, where each single fault signal indicator $\phi_i(k)$ is defined as follows:

\[ \phi_i(k) = \begin{cases} 0 & \text{if } r_i(k) \in R_0, \\ 1 & \text{if } r_i(k) \in R_1. \end{cases} \tag{5} \]

Fault isolation is based on the knowledge about the binary relation between the considered fault hypothesis set $\{ f_1(k), f_2(k), \dots, f_{n_f}(k) \}$ and the fault signal indicators $\phi_i$, which is stored in the fault signature matrix $M$. An element of this matrix, namely $m_{ij}$, is equal to 1 if the fault hypothesis $f_j$ is expected to affect the residual $r_i$, such that the related fault signal $\phi_i$ is equal to 1 when this fault is affecting the monitored system. Otherwise, the element $m_{ij}$ is zero-valued. A column of this matrix is known as a theoretical fault signature. Then, the fault isolation task involves matching the observed fault signature with one of the theoretical fault signatures.

2.2 Partitioning the Set of ARRs

In order to design a decentralised fault diagnosis system following the ARR approach recalled above, the set of ARRs in (2) should be decomposed into subsets with a minimal degree of coupling. Each subset of ARRs allows a local diagnoser to be implemented. With this aim, a graph representation of $R$ in (2) is determined. The graph $G(V, E)$ representing the set of ARRs is obtained considering that

   • the ARRs are the graph vertices collected in a set $V$, and
   • the measured input/output variables are the graph edges collected in a set $E$.

The graph incidence matrix $I_M$ is obtained considering that, without loss of generality, the directionality of the edges is derived from the relation between ARRs (rows of $I_M$) and input/output variables (columns of $I_M$), in a way analogous to that proposed by [11] (and references therein) for the partitioning of LSS¹. Once $I_M$ has been obtained from the ARR graph, the problem consists in partitioning this graph into subgraphs. Since such partitioning is oriented to the application of decentralised fault diagnosis, it is convenient that the resultant subgraphs have the following features:

   • nearly the same number of vertices;
   • few connections between the subgraphs.

¹ There are alternative matrix representations for a graph such as the adjacency matrix and the Laplacian matrix (see [12]), which are related to the matrix representation used in this paper.

These features guarantee that the obtained subgraphs have a similar size, which balances computations between local diagnosers and minimises communications with a supervisory diagnoser. Hence, the partitioning of the ARR graph can be more formally established following the dual problem proposed in [13], as stated here in Problem 1.

Problem 1 (ARR Graph Partitioning Problem). Given a graph $G(V, E)$ obtained from a set of ARRs, where $V$ denotes the set of vertices, $E$ is the set of edges, and $p \in \mathbb{Z}_{\geq 1}$, find $p$ subsets $V_1, V_2, \dots, V_p$ of $V$ such that

   1. $\bigcup_{i=1}^{p} V_i = V$,
   2. $V_i \cap V_j = \emptyset$, for $i \in \{1, 2, \dots, p\}$, $j \in \{1, 2, \dots, p\}$, $i \neq j$,
   3. $\#V_1 \approx \#V_2 \approx \dots \approx \#V_p$,
   4. the cut size, i.e., the number of edges with endpoints in different subsets $V_i$, is minimised.

Remark 2.1. Conditions 3 and 4 of Problem 1 are of high interest from the point of view of a decentralised scheme since they are related to the degree of interconnection between resultant subsystems and their size balance.

Remark 2.2. The inclusion of additional specifications directly related to the FDI performance of each subsystem diagnoser will be addressed as a future extension of the proposed partitioning approach.

Remark 2.3. The partitioning approach starts from a given set of ARRs obtained using the perfect matching algorithm. Selecting the best ARRs from the set of all possible ARRs (that could be obtained using the available sensors and system structure), such that applying the partitioning algorithm produces a set of diagnosers with good FDI performance, could be considered as an additional future improvement.

In general, graph partitioning problems are considered $\mathcal{NP}$-complete [2]. However, they can be solved in polynomial time for $\#V_i = 2$ (Kernighan-Lin algorithm); see, e.g., [14]. Since the latter condition is quite restrictive for large-scale graphs, alternatives for graph partitioning based on fundamental heuristics are widely accepted and broadly discussed.

3 Proposed Partitioning Approach

Starting from the system ARR graph obtained as described in Section 2, this section proposes a partitioning algorithm through which a decomposition of the set of system ARRs can be performed. This decomposition allows the splitting of a centralised diagnoser into local diagnosers. The philosophy of the proposed approach comes from the partitioning methodology reported in [13], where a dynamic system is decomposed into several subsystems following certain criteria towards fulfilling a set of design conditions. For completeness and full understanding of the proposed diagnosis






methodology, that approach is explained below and suitably adapted where needed.

The algorithm is divided into the main kernel and auxiliary routines in order to refine the final result according to the nature of the system and the given criteria depending on the case. Here, the ARR graph is decomposed into subgraphs in the same way as a system would be divided into subsystems.

3.1 Main Kernel

This part performs the central task of defining how the equivalent ARR graph of the LSS is split into subgraphs. The steps of the algorithm are followed in the form of subroutines towards reaching the main goals outlined in Problem 1. Notice that the whole algorithm is used off-line, i.e., the partitioning of the ARR graph is not carried out dynamically on-line. Ongoing research is focused on adapting the proposed algorithm such that the partitioning could be performed on-line when some structural change of the network occurs. The different subroutines are briefly described next.

  • The start-up routine, which requires the matrix-based definition of the graph, e.g., via the incidence matrix, in order to state the connections between the graph vertices.
  • The preliminary partitioning routine, which performs a clustering-like procedure where all graph vertices are assigned to a particular subset according to predefined indices related to the resultant subgraph and its internal weight (defined as the number of vertices of a subgraph), its external weight (defined as the number of shared edges between subgraphs) and other statistical measures. The resultant number of partitions at this stage is automatically obtained.
  • The uncoarsening routine, which is applied for reducing the number of resultant subgraphs if their internal weight is unbalanced, which would produce partitions with large differences in the number of vertices. This routine defines a design parameter $\varphi_{\max}$ for determining the variance of the internal weight for all the resultant subgraphs.
  • The refining routine, which aims at reducing the cut size of the resultant subgraphs, i.e., the number of edges they share. This routine is based on the connectivity of the vertices of a subgraph with other vertices in the same subgraph and in neighbouring subgraphs².

² Two subgraphs are called neighbours if they are contiguous and share edges (see, e.g., [15] among many others).

Applying the aforementioned routines to the entire ARR graph, the expected result consists of a set of subgraphs that determines a particular decomposition. This set $P$ is finally defined as

\[ P = \left\{ G_i,\ i = 1, 2, \dots, p : \bigcup_{i=1}^{p} G_i = G \right\}. \tag{6} \]

3.2 Auxiliary Routines

Although the decomposition algorithm yields an automatic partitioning of a given graph, this does not imply that the resultant set $P$ meets the pre-established requirements stated in Problem 1. Therefore, complementary routines enhance the partitioning routine depending on their tuning for the particular case study. Additional auxiliary routines might be designed in such a way that the diagnosis performance that would be achieved when used in decentralised or distributed fault diagnosis is taken into account. These auxiliary routines are:

  • The pre-filtering routine, which lightens the start-up routine by merging every vertex with a single connection into the vertex to which it is connected. This yields a smaller initial graph and hence faster clustering of vertices.
  • The post-filtering routine, which adds a tolerance parameter $\delta$ such that the uncoarsening routine yields fewer subgraphs when two of them may be conveniently merged but the numerical constraints do not allow it. This routine might increase the complexity since the internal weight of some subgraphs would also increase, unbalancing the resultant set of partitions.
  • The anti-oscillation routine, which addresses a possible issue when the refining (external balance) routine is run by defining a maximum number of iterations $\rho$ for which the refining routine is executed.

4 Decentralised Fault Diagnosis

Once a partitioned set of ARRs has been obtained by means of the algorithm presented in Section 3, the decentralised fault diagnosis approach is introduced. In order to explain how the proposed fault diagnosis approach works, the discussion concentrates on faults affecting the sensors that measure the input/output variables involved in the ARRs. The approach could easily be extended to other types of faults, but, to keep the explanation simple, the discussion is restricted to this set of faults. In this way, a fault can be associated with each measured input/output variable.

Each subset of ARRs allows a local diagnoser $D_i$ to be implemented in the way described in Section 2.1. The ARRs associated with a local diagnoser can be split into two groups. The first group, named local ARRs in the following, is composed of ARRs that do not involve variables shared with ARRs in a different local diagnoser. The second group, named shared ARRs, is composed of ARRs that involve shared variables. Figure 1 shows two sets of ARRs associated with two local diagnosers, named $D_2$ and $D_4$. These two diagnosers share some variables (in this case only outputs, but they can be both inputs and outputs). This set of shared variables defines the set of shared ARRs, named $D_C$ in the figure. The remaining ARRs, which do not share variables, are local ARRs.

Similarly, faults in the fault signature matrix $M$ of the local diagnoser that only involve local ARRs can be locally diagnosed. Thus, the local diagnoser works in a decentralised manner regarding those faults. On the other hand, faults that involve ARRs with shared variables in different subgraphs cannot be locally diagnosed. Instead, a global diagnoser that evaluates the involved ARRs is used. This diagnoser has a fault signature matrix $M$ collecting the involved ARRs with shared variables between local diagnosers and the faults that should be globally diagnosed. When local diagnosers evaluate an ARR composed of shared variables, they send the result of the consistency check to the global diagnoser, which proceeds with the global diagnosis using a fault signature matrix that contains the involved ARRs. As





a result of the global diagnosis based on the involved ARRs with shared variables, a fault in these variables could be diagnosed or alternatively excluded. In case of exclusion, local diagnosers sharing a given ARR whose shared variable has been considered non-faulty continue reasoning with all ARRs, i.e., all the involved ones, proposing a fault candidate using the local fault signature.

[Figure 1 placeholder: block layout with ARRs as rows (ARR1,S4 to ARR32,S4 and ARR1,S2 to ARR12,S2) and variables as columns (y1,S4 to y32,S4 and y1,S2 to y12,S2); blocks labelled D4, DC and D2.]
Figure 1: Subsets of ARRs of two local diagnosers sharing some variables

5 Application to a Case Study

This section briefly describes a case study in order to exemplify the application of the proposed decentralised diagnosis approach in a real LSS. In particular, the transport infrastructure of the Barcelona Drinking Water Network (DWN) is used.

5.1 Case Study Description

The Barcelona DWN, managed by Aguas de Barcelona, S.A. (AGBAR), supplies drinking water to Barcelona city and its metropolitan area through four drinking water treatment plants: the Abrera and Sant Joan Despí plants, which extract water from the Llobregat river, the Cardedeu plant, which extracts water from the Ter river, and the Besòs plant, which treats the underground flows from the aquifer of the Besòs river. All sources together provide a total flow of around 7 m³/s. The water flow from each source is limited, which implies different water prices depending on water treatments and legal extraction canons. See [16] for further information about this system and [17] for further details about its modelling and management criteria.

through the $m$ actuators (pumps and valves), $d \in \mathbb{R}^q$ corresponds to the vector of the $q$ water demands (consumption sectors) and $y \in \mathbb{R}^n$ is the vector of measured water volumes of the $n$ tanks. In this case, the difference equations in (7a) describe the dynamics of the storage tanks, the algebraic equations in (7b) describe the static relations (i.e., mass balance at junction nodes) in the network and those in (7c) describe the relation between the physical and measured tank volumes. Moreover, $A$, $B$, $B_p$, $C$, $E_1$ and $E_2$ are system matrices of suitable dimensions dictated by the network topology.

5.3 Implementation of the Proposed Approach

This section discusses the way the proposed decentralised fault diagnosis approach is implemented in the considered real case study. Figure 2 corresponds to the aggregate model of the Barcelona DWN, which is a simplification of the complete model, where groups of elements have been aggregated (not discarded) in single nodes to reduce the size of the whole network model. Using this aggregate model, the ARR graph of the Barcelona DWN has been derived after generating the set of ARRs from the mathematical model (7) by using the perfect matching algorithm [9], which aims to find a causal assignment that associates unknown system variables with the system constraints from which they can be calculated. Applying the partitioning algorithm to this graph, five groups of ARRs are obtained, which correspond to five diagnosers, each monitoring a different part of the Barcelona DWN, represented with different colors in Figure 2. Table 1 collects the descriptions of the resultant subgraphs, their number of ARRs and shared variables (manipulated flows through actuators), the latter represented using circles in Figure 2. At this point it should be recalled that one of the goals of the partitioning algorithm is to reduce as much as possible the number of shared edges between subgraphs, obtaining a graph decomposition that is as weakly interconnected as possible and has a similar number of vertices for each subsystem (internal weight). This allows an easier global diagnosis configuration, not only with respect to the number of distributed diagnosers but also with respect to the complexity of each local diagnoser $D_i$. Thus, the application of the approach to the Barcelona DWN implies the design of five decentralised diagnosers together with a centralised/supervisory one, which is in charge of the coupled relations within the corresponding fault signature matrix of the whole system.
5.2 Monitoring-oriented Model
In order to obtain a monitoring-oriented model of the DWN,                                           Table 1: Barcelona DWN subsystems and number of both
the constitutive network elements (i.e., tanks, actuators, wa-                                       shared elements and ARRs
ter demand sectors, nodes and sources) as well as their basic                                            Number Color # ARRs # Shared variables
relationships should be stated [16].
                                                                                                             1       green        4                1
   By considering the mass balance at tanks and the static
                                                                                                             2        red         5                5
relations at α network nodes, the monitoring-oriented
                                                                                                             3       yellow       8                6
discrete-time state-space model of the DWN can be written
                                                                                                             4        blue        8               16
as
                                                                                                             5       purple       5                5
                         xk+1 = Axk + Γνk ,                                              (7a)
                         E1 νk = E2 ,                                                    (7b)
                                                                                                        For this example, it is important to highlight that ARRs
                            yk = Cxk ,                                                   (7c)        have been obtained by considering the following assump-
with Γ = [B Bp ], νk = [uTk dTk ]T , where x ∈ Rn is the                                             tion.
state vector corresponding to the water volumes of the n                                             Assumption 5.1. Fault in actuators are only taken into ac-
tanks, u ∈ Rm represents the vector of manipulated flows                                             count. Sensors are supposed to operate properly.        
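As a rough numerical illustration of the structure of model (7a)-(7c), the following minimal sketch steps a toy network with one tank, two manipulated flows and two demands. The matrices, the sampling time and all flow values are assumptions chosen for illustration only; they are not data from the Barcelona DWN, and the helper names `matvec` and `step` are hypothetical.

```python
# Toy instance of (7a)-(7c): one tank, two manipulated flows, two demands.
# All numeric values are illustrative assumptions, not Barcelona DWN data.

DT = 3600.0  # sampling time in seconds (assumed)

A = [[1.0]]             # the tank volume carries over between time steps
B = [[DT, DT]]          # both manipulated flows u1, u2 fill the tank
BP = [[-DT, 0.0]]       # demand d1 drains the tank; d2 is served at a node
GAMMA = [B[0] + BP[0]]  # Gamma = [B  Bp], acting on nu = [u; d]
C = [[1.0]]             # the tank volume is measured directly

# Static node balance (7b), E1 * nu = E2, here: u1 - u2 - d2 = 0
E1 = [[1.0, -1.0, 0.0, -1.0]]
E2 = [0.0]


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def step(x, u, d, tol=1e-9):
    """Advance the toy model one step and return (x_next, y_next)."""
    nu = u + d  # stack manipulated flows and demands
    node = matvec(E1, nu)
    if any(abs(n - e) > tol for n, e in zip(node, E2)):
        raise ValueError("node balance (7b) is violated")
    x_next = [a + g for a, g in zip(matvec(A, x), matvec(GAMMA, nu))]
    return x_next, matvec(C, x_next)


if __name__ == "__main__":
    # Flows in m^3/s, volume in m^3 (assumed units).
    x_next, y_next = step([1000.0], [0.3, 0.1], [0.25, 0.2])
    print(x_next, y_next)
```

With these assumed values the tank gains 3600·(0.3 + 0.1 − 0.25) = 540 m^3 in one step, so the volume moves from 1000 m^3 to about 1540 m^3. Flow combinations that break the node balance are rejected before the state update.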






Figure 2: ARR Partitioning of the Barcelona DWN
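The size and coupling of the five subsystems in this partition can be tallied with a few lines of code. The sketch below uses the ARR and shared-variable counts reported in the subsystem table of this section; the summary indicators (mean size, largest deviation from the mean, most coupled subsystem) are assumptions chosen for illustration and are not the cost function of the partitioning algorithm.

```python
# Subsystem data transcribed from the Barcelona DWN subsystem table:
# (number, colour, number of ARRs, number of shared variables)
SUBSYSTEMS = [
    (1, "green", 4, 1),
    (2, "red", 5, 5),
    (3, "yellow", 8, 6),
    (4, "blue", 8, 16),
    (5, "purple", 5, 5),
]


def partition_summary(subsystems):
    """Return simple, illustrative balance and coupling indicators."""
    arrs = [s[2] for s in subsystems]
    mean_arrs = sum(arrs) / len(arrs)
    return {
        "total_arrs": sum(arrs),
        "mean_arrs": mean_arrs,
        # Largest gap between a subsystem size and the mean size.
        "max_deviation": max(abs(a - mean_arrs) for a in arrs),
        # Subsystem number with the most shared variables.
        "most_coupled": max(subsystems, key=lambda s: s[3])[0],
    }


if __name__ == "__main__":
    print(partition_summary(SUBSYSTEMS))
```

The shared-variable counts are deliberately not summed: a variable shared by two subsystems is counted once in each of them, so the sum would not equal the number of distinct shared variables.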


Table 2: Barcelona DWN subsystems and number of both shared elements and ARRs

    Number   Color    # ARRs   # Shared variables
    1        green    4        1
    2        red      5        5
    3        yellow   8        6
    4        blue     8        16
    5        purple   5        5

To illustrate how the proposed decentralised fault diagnosis approach works, the explanation focuses on subsystems S1 and S4, marked with red lines in Figure 3, which correspond to the green (S1) and blue (S4) subsystems in Figure 2. In particular, considering the set of ARRs corresponding to S1 as

    r^{S1}_{1,k} = y_{1,k} − y_{1,k−1} − ∆t [u_{1,k−1} + u_{2,k−1} − d_{1,k−1}],
    r^{S1}_{2,k} = u_{1,k} − u_{2,k} − d_{2,k},
    r^{S1}_{3,k} = y_{2,k} − y_{2,k−1} − ∆t [u_{5,k−1} − d_{3,k−1}],
    r^{S1}_{4,k} = u_{3,k} − u_{4,k} − u_{5,k} − u_{6,k},

the fault signature matrix presented in Table 3 can be obtained. From this table, it is possible to identify the shadowed part, which corresponds to the faults that the local diagnoser D_1 is able to isolate when a fault activates any of the ARRs r^{S1}_{i,k}, i = 1, 2, 3, since those ARRs only involve local variables. However, if the residual r^{S1}_{4,k} is activated, a global diagnoser must interact with D_1 to discriminate whether the corresponding ARR in S4, defined here as r^{S4}_{1,k}, was also activated. If this is the case, the element u_6 is faulty and hence isolated. Otherwise, D_1 can decide locally (then isolating u_3, u_4 or u_5).

Figure 3: Decentralised diagnoser scheme for the Barcelona DWN: resultant subsystems and their number of shared variables

Table 3: Fault signature matrix of S1

    ARR            f_y1   f_u1   f_u2   f_y2   f_u5   f_u3   f_u4   f_u6
    r^{S1}_{1,k}   x      x      x
    r^{S1}_{2,k}          x      x
    r^{S1}_{3,k}                        x      x
    r^{S1}_{4,k}                               x      x      x      x

In Table 4, the fault signature matrix for the ARRs that






contain shared variables between both S1 and S4 is presented. There, r^{S1}_{4,k} corresponds to the fourth ARR of S1 (last row of Table 3), while

    r^{S4}_{1,k} = x_{3,k} − x_{3,k−1} − ∆t [u_{7,k−1} + u_{8,k−1} + u_{6,k−1} − u_{9,k−1}]

corresponds to the first ARR defined for S4. Notice that the global diagnoser should decide by looking at the ARR activations that occurred in this fault signature matrix and then interact with the different local diagnosers if needed.

Table 4: Part of the fault signature matrix accounting for shared variables between S1 and S4

    ARR            ...   f_u5   f_u6   f_u7   ...
    r^{S1}_{4,k}         x      x
    r^{S4}_{1,k}                x      x

[4] Y. Pencolé and M.-O. Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(1-2):121–170, 2005.
[5] S. Indra, L. Travé-Massuyès, and E. Chanthery. Decentralized diagnosis with isolation on request for spacecraft. In Fault Detection, Supervision and Safety of Technical Processes, pages 283–288, México, 2012.
[6] F. Boem, R.M.G. Ferrari, T. Parisini, and M. M. Polycarpou. Distributed fault diagnosis for continuous-time nonlinear systems: The input-output case. Annual Reviews in Control, 37(1):163–169, 2013.
[7] J. Biteus, E. Frisk, and M. Nyberg. Distributed diagnosis using a condensed representation of diagnoses with application to an automotive vehicle. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 41(6):1262–1267, November 2011.
[8] I. Roychoudhury, G. Biswas, and X. Koutsoukos. Designing distributed diagnosers for complex continuous systems. IEEE Transactions on Automation Science and Engineering, 6(2):277–290, April 2009.

6 Conclusions
In this paper, a decentralised fault diagnosis approach for
large-scale systems based on graph-theory has been pre-           [9] M. Blanke, M. Kinnaert, J. Lunze, and
sented. The algorithm starts with the translation of the sys-          M. Staroswiecki.         Diagnosis and Fault-Tolerant
tem model into a graph representation. Then, applying the              Control. Springer-Verlag, Berlin, Heidelberg, second
perfect matching algorithm, a set of analytical redundancy             edition, 2006.
relations is obtained. From the analytical redundancy rela-       [10] S. Tornil-Sin, C. Ocampo-Martinez, V. Puig, and
tion graph, the problem of graph partitioning is then solved.          T. Escobet. Robust fault diagnosis of nonlinear sys-
The resultant partition consists of a set of non-overlapped            tems using interval constraint satisfaction and analyt-
subgraphs whose number of vertices is as similar as possi-             ical redundancy relations. IEEE Transactions on Sys-
ble and the number of interconnecting edges between them               tems, Man, and Cybernetics: Systems, 44(1):18–29,
is minimal. To achieve this goal, the partitioning algorithm           Jan 2014.
applies a set of procedures based on identifying the highly
                                                                  [11] A. I. Zečević and D. D. Šiljak. Control of Complex Sys-
connected subgraphs with balanced number of internal and
external connections. Finally, a decentralised fault diagno-           tems: Structural Constraints and Uncertainty. Com-
sis strategy is introduced and applied over the resultant set          munications and Control Engineering. Springer, 2010.
of partitions. In order to illustrate and discuss the use and     [12] J.A. Bondy and U.S.R. Murty. Graph Theory, vol-
application of the proposed approach, a case study based on            ume 244 of Graduate Series in Mathematics. Springer,
the Barcelona DWN has been used. As further research, the              2008.
partitioning algorithm will be improved by acting directly        [13] C. Ocampo-Martinez, S. Bovo, and V. Puig. Parti-
on the system model and not on the set of ARRs in order                tioning approach oriented to the decentralised predic-
to generate a set of ARRs for each local diagnoser with en-            tive control of large-scale systems. Journal of Process
hanced fault diagnosis properties.                                     Control, 21(5):775–786, 2011.
                                                                  [14] T.N. Bui and B.R. Moon. Genetic algorithm and
Acknowledgements                                                       graph partitioning. IEEE Transactions on Computers,
This work has been partially supported by the EFFINET                  45(7):841–855, 1996.
grant FP7-ICT-2012-318556 of the European Commission              [15] L. Addario-Berry, K. Dalal, and B. Reed. Degree-
and the Spanish project ECOCIS (Ref. DPI2013-48243-C2-                 constrained subgraphs. Discrete Applied Mathematics,
1-R).                                                                  156(7):1168–1174, 2008.
                                                                  [16] C. Ocampo-Martinez, V. Puig, G. Cembrano,
References
                                                                       R. Creus, and M. Minoves. Improving water manage-
[1] J. Lunze. Feedback Control of Large-Scale Systems.                 ment efficiency by using optimization-based control
    Prentice Hall, Great Britain, 1992.                                strategies: the Barcelona case study. Water Science &
[2] D.D. Šiljak. Decentralized control of complex systems.             Technology: Water supply, 9(5):565–575, 2009.
    Academic Press, 1991.                                         [17] C. Ocampo-Martinez, V. Puig, G. Cembrano, and
[3] L. Console, C. Picardi, and D. Theseider Duprè. A                  J. Quevedo. Application of predictive control strate-
    framework for decentralized qualitative model-based                gies to the management of complex networks in the
    diagnosis. In International Joint Conference on Artifi-            urban water cycle [applications of control]. IEEE Con-
    cial Intelligence (IJCAI), pages 286–291, Hyderabad,               trol Systems Magazine, 33(1):15–41, 2013.
    India, 2007.








              Self-Healing as a Combination of Consistency Checks
                      and Conformant Planning Problems

                                             Alban Grastien
                                   Optimisation Research Group, NICTA
                    Artificial Intelligence Group, The Australian National University
                                 Canberra Research Laboratory, Australia


                       Abstract

    We introduce the problem of self healing, in which a system is asked to self diagnose and
    self repair. The two problems of computing the diagnosis and the repair are often solved
    separately. We show in this paper how to tie these two tasks together: a planner searches
    for a prospective plan on a sample of the belief state; a diagnoser verifies the
    applicability of the plan and returns a state of the belief state (added to the sample) in
    which the plan is not applicable. This decomposition of the self healing process avoids the
    explicit computation of the belief state. Our experiments demonstrate that it scales much
    better than the traditional approach.

1    Introduction
Autonomous systems are subject to faults and require regular repair actions; systems capable of
performing such tasks are called self healing. Finding the optimal repair involves solving a
diagnosis problem (what may the current system state be?) together with a planning problem (what
optimal/near optimal course of actions, applicable in all of the possible states, leads to an
acceptable state?). In large, partially observable, systems computing an explicit “belief state”
can be intractable; finding a plan applicable in all elements of this belief state can also be
intractable.
   In this paper we propose a method that avoids these two intractable problems. This method relies
on the intuition that the full belief state is not necessary to find the appropriate repair. For
instance, if a self-healing problem requires making sure that n given machines are turned off and
if the status (on or off) of these machines is unknown, then the belief state comprises $2^n$
states. However the optimal plan (press the stop button on every machine) happens to be the optimal
plan of the state where none of the machines has been shut: this single state is “representative”
of all the states in the belief state.
   Our approach uses a planner to compute an optimal plan for a small sample of the belief state
(at most dozens of elements); the plan is applicable in all these states and leads to the goal
state. In order to validate the plan for the full belief state we search for an element of the
belief state in which the plan is not applicable. To this end we define a new type of diagnoser
that solves the following problem: find a possible behaviour of the system (that agrees with the
model and the observations) that ends up in a state q in which the plan is not correct; this state
q is added to the sample of the belief state so that the planner finds a more suitable repair plan
at the next iteration. Failure on the part of the diagnoser to find such a behaviour proves that
the plan is indeed correct. In practice the problem of verifying the correctness of a plan is
reduced to a propositional satisfiability (SAT) problem that is unsatisfiable iff the plan is
applicable in all states and that returns a counterexample if not.
   The contributions of this paper are i) a formal definition of the self-healing problem, ii) the
solving of self-healing as a combination of diagnosis and planning steps, and iii) the reduction of
each step to SAT.
   This work is performed in the context of discrete event systems [Cassandras and Lafortune,
1999]. As opposed to supervisory control, where actions (either active or passive, such as
forbidding some events) are performed while the system is running, we follow the work from Cordier
et al. [2007] and assume that the repair is being performed whilst the system is inactive.
   The paper is organised as follows. The next section defines the self-healing problem formally.
Section 3 presents the proposed algorithm with a set-based perspective. The SAT implementation is
presented in Section 4. Experimental validation is given in Section 5. A comparison with other
problems and approaches is given in Section 6.

2   Problem Definition
The problem we are addressing is illustrated in Figure 1. We are concerned with finding the most
appropriate repair for a partially observed system that has been running freely.
   We assume that the system can run in two different modes: the “active” (and useful) mode in
which the system is free to operate (left half of the figure) and the “repair” mode in which the
system state is being re-adjusted (right half). The system behaves quite differently in the two
modes. In the active mode, the system is partially observable but uncontrolled. In the repair mode,
the system is not observed albeit controlled; the state changes only through explicit application
of actions; and special attention must be paid to their







applicability/effects. One reason for assuming that the system does not run freely in the repair
mode is that we do not want to consider scenarios where faults can occur during the repair, which
would increase the overall complexity of the problem. We believe that this limitation, essentially
the fact that the repair actions have deterministic effects, can be lifted.

[Figure 1 (diagram): from the initial state $q_0$, the partially observed, uncontrolled behaviour
$e_1, \dots, e_n$ leads to the unknown current state $q_n = q_0'$; the repair plan
$a_1, \dots, a_k$ (the problem solution) then leads through $q_1', \dots, q_{k-1}'$ to the goal
state $q_k'$.]

Figure 1: Schematic description of the self-healing problem: find a repair plan that returns the
state in the goal set.

2.1        Explicit Model
We are considering discrete event systems (DES, [Cassandras and Lafortune, 1999]). The system is
modeled as a finite state machine, i.e., a finite set Q of states together with a set T of
transitions labeled with finitely-many events/actions.

Definition 1 An explicit self-healing system model is a tuple
$M = \langle Q, I, \Sigma, \Sigma_o, \Sigma_a, T, G, U \rangle$ where

  • $Q$ is a finite set of states, $I \subseteq Q$ is a set of initial states, $G \subseteq Q$ is
     a set of goal states, $U \subseteq Q$ is a set of unstable states,
  • $\Sigma$ is a finite set of events, $\Sigma_o \subseteq \Sigma$ is the set of observable
     events, $\Sigma_a \subseteq \Sigma$ is the set of actions, and
  • $T \subseteq (Q \times \Sigma \times Q)$ is the set of transitions $\langle q, e, q' \rangle$
     also denoted $q \xrightarrow{e} q'$.

    In the active mode the system takes a path
$\rho = q_0 \xrightarrow{e_1} \cdots \xrightarrow{e_n} q_n$ such that
$\{e_1, \dots, e_n\} \subseteq \Sigma \setminus \Sigma_a$, $q_0 \in I$ and $q_n \notin U$. This
last condition is used to prevent situations where a fault happened right before the repair is
applied, i.e., before any observation of this fault was made. This assumption is similar to the one
made, e.g., by Lamperti and Zanella that the system is quiescent (no more event is about to happen)
when diagnosis is performed [Lamperti and Zanella, 2003]. This assumption can be removed by
assuming $U = \emptyset$. Finally the observation $O = obs(\rho)$ of this path is the projection of
$e_1, \dots, e_n$ on the observable events $\Sigma_o$ (i.e., all non-observable events are
eliminated from the sequence).
    In the repair mode a sequence of actions, called a plan $\pi = a_1, \dots, a_k$, is applied
($\{a_1, \dots, a_k\} \subseteq \Sigma_a$). From state $q_0' \in Q$, the application of $\pi$ leads
to the (single) state $q_k' = \pi(q_0')$ such that
$q_0' \xrightarrow{a_1} \cdots \xrightarrow{a_k} q_k'$. We assume that every action is applicable
in every state (if this is not the case, a non-goal sink state can be created to which all
inapplicable actions lead) and has deterministic effects. If $\pi$ leads $q_0'$ to a goal state, we
say that $\pi$ is correct for $q_0'$.
   Notice that a plan is a simple sequence: we do not assume that additional observations are
available at run-time. There is no probing action available. After non deterministic action
effects, the use of conditional plans is a second natural extension of this work.

Definition 2 The self-healing problem is a pair $P = \langle M, O \rangle$ where $M$ is a model and
$O$ is an observation. A repair plan for $P$ is a plan that is guaranteed to be correct in the
current state. Formally a repair plan is a plan $\pi$ such that

$$\forall \rho = q_0 \xrightarrow{e_1} \cdots \xrightarrow{e_n} q_n.\ (q_0 \in I \land obs(\rho) = O \land q_n \notin U) \Rightarrow \pi(q_n) \in G. \tag{1}$$

The set of repair plans is denoted $\Pi(M, O)$ or simply $\Pi$.
   Given a cost function on sequences of actions, the objective of the self-healing problem is to
find a cost-minimal repair plan (for simplicity we assume that such a plan exists):

$$\pi^* = \arg\min_{\pi \in \Pi} cost(\pi).$$

This definition assumes a cost function that provides a total order on the plans. In practice we
will try to minimise the number of actions (all actions have the same cost, the cost is cumulative)
and break ties at random.
   We see two main categories of self-healing problems, namely i) a recurring situation where the
system is stopped regularly, which provides a good opportunity to perform corrective actions on the
system; ii) a situation where a diagnoser/monitor detects an anomaly on the system and triggers a
self-healing procedure. The present work is independent from how the problem was prompted.

2.2     Solving the Problem Explicitly
This paper works under the assumption that the system model is very large and that it is
impractical to manipulate sets of states. We discuss this issue here and present some notations.
   The simplest way to solve the self-healing problem is to compute the belief state and then
compute the optimal plan for this set of states.
   Given a model $M$ and the observation $O$, the belief state $B^O$ is defined as the set of
states that the system






could be in:

$$B^O = \{ q \in Q \mid \exists \rho = q_0 \xrightarrow{e_1} \cdots \xrightarrow{e_n} q_n.\ q_0 \in I \land obs(\rho) = O \land q_n \notin U \land q = q_n \}.$$

Notice that the definition of the belief state matches the first part of Equation (1).

   Finally we look at a formulation of the planning problem that is complementary to the
computation of the belief state. Assume that a plan $\pi$ is given and we want to compute the set
of states $B^\pi$ in which the plan $\pi$ is correct: $B^\pi = \{q \in Q \mid \pi(q) \in G\}$.
Lemma 1 Plan $\pi$ is a correct plan iff $B^O \subseteq B^\pi$.
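
As a rough illustration of the belief state and of Lemma 1, the following Python sketch computes $B^O$ on an explicit model and checks the inclusion $B^O \subseteq B^\pi$ by enumeration. The `Model` container, the function names and the small toy model are assumptions made for this example only (they are not taken from the paper). Following the convention of the Figure 3 caption, an action without an outgoing transition is assumed to leave the state unchanged.

```python
from collections import namedtuple

# Hypothetical container for the parts of the explicit model used below.
Model = namedtuple("Model", "initial observable actions transitions goal unstable")


def belief_state(model, observation):
    """Return B^O: the states q_n reachable from I by a path of non-action
    events whose observable projection is `observation`, with q_n not in U."""
    # A search node pairs a state with the number of observations matched so far.
    frontier = [(q, 0) for q in model.initial]
    seen = set(frontier)
    while frontier:
        q, i = frontier.pop()
        for (src, event, dst) in model.transitions:
            if src != q or event in model.actions:
                continue  # repair actions never occur in the active mode
            if event in model.observable:
                if i == len(observation) or observation[i] != event:
                    continue  # this event would contradict O
                node = (dst, i + 1)
            else:
                node = (dst, i)  # unobservable events leave the projection unchanged
            if node not in seen:
                seen.add(node)
                frontier.append(node)
    return {q for (q, i) in seen if i == len(observation) and q not in model.unstable}


def apply_plan(model, plan, q):
    """Apply the actions of `plan` from q (deterministic effects; an action
    without a transition is assumed to leave the state unchanged)."""
    for a in plan:
        q = next((dst for (src, e, dst) in model.transitions if src == q and e == a), q)
    return q


def is_correct(model, plan, belief):
    """Lemma 1 by enumeration: every state of B^O must belong to B^pi."""
    return all(apply_plan(model, plan, q) in model.goal for q in belief)


# Toy model (assumed): "f" and "g" are unobservable, "tick" is observable, "reset" is an action.
model = Model(
    initial={"s0"},
    observable={"tick"},
    actions={"reset"},
    transitions={
        ("s0", "tick", "s0"), ("s0", "f", "s1"),
        ("s1", "tick", "s1"), ("s1", "g", "s2"),
        ("s1", "reset", "s0"),
    },
    goal={"s0"},
    unstable={"s2"},
)
belief = belief_state(model, ["tick"])
```

Here `belief` contains `s0` and `s1` but not the unstable state `s2`; the empty plan is not correct for it, whereas the plan `["reset"]` is.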
Writing $\overline{B^\pi} \stackrel{\mathrm{def}}{=} Q \setminus B^\pi$ the set of states for which
$\pi$ is not correct, plan $\pi$ is a correct plan iff $B^O \cap \overline{B^\pi} = \emptyset$.

   A conformant plan for the set of states $B^O$ is a plan $\pi$ that is correct for all states of
$B^O$: $\forall q \in B^O.\ \pi(q) \in G$ (cf. Figure 2). Compared to the general definition of a
conformant plan (a more detailed comparison is given in Section 6) we only deal with uncertainty on
the initial state and we assume that actions have deterministic effects. Conformant planning is
provably PSPACE-hard for explicit models.
   We consider the conformant planning problem from the initial set of states $B^O$ and use
$b = |B^O|$ to denote the size of $B^O$. The problem can be solved by considering the finite state
machine $M'$ where each state of $M'$ is a set of states of the original model and each transition
from state $S$ labeled by action $a$ leads to
$S' = \{q' \in Q \mid \exists q \in S.\ \langle q, a, q' \rangle \in T\}$. The initial state of
$M'$ is $B^O$; a state $S$ of $M'$ is a goal state if it satisfies $S \subseteq G$. A plan $\pi$ is
a sequence of actions such that $\pi(B^O)$ (in $M'$) is a goal state. Because the original model is
deterministic the transition $\langle S, a, S' \rangle$ is such that $S'$ is no larger than $S$.
The number of states in $M'$ is bounded by the sum of binomial coefficients

$$\binom{|Q|}{1} + \cdots + \binom{|Q|}{b}.$$

[Figure 2 (diagram): one row of states $q_0^\ell, q_1^\ell, \dots, q_{k-1}^\ell, q_k^\ell$ per
element of the initial states $B$ ($\ell = 1, \dots, b$), each row ending in the goal states.]

Figure 2: Solving conformant problems; the vertical lines mean that the transitions are labeled by
the same action.

   The model $M'$ presented before cannot be easily expressed in planning modeling languages such
as STRIPS or PDDL, or implemented in SAT. Another reduction, to $M''$, can be introduced whose
states are tuples (with $b$ elements) of states from the original model: $Q'' = Q^b$. A tuple state
is a goal state if all its elements are in the goal: $G'' = G^b$. The transitions in $M''$
correspond to the parallel execution of the same action in each state of the tuple (represented by
the vertical lines on Figure 2).
   In general $M''$ is larger than $M'$. The model also contains symmetries that efficient
implementations might need to address explicitly: for instance in model $M''$ states
$\langle q_1, q_2 \rangle$ and $\langle q_2, q_1 \rangle$ are different while they would be the
same in $M'$: $\{q_1, q_2\} = \{q_2, q_1\}$.
   Clearly this type of approach is only applicable if $B^O$ comprises no more than a few dozen
elements.

3    Set Formulation of Self-Healing
We first present a formulation of our solution that is based on sets and that does not consider
implementation issues (presented in the next section).
   We propose a lazy approach to self-healing. In this approach we search for a correct plan for a
sample of the belief state (a “belief sample”) and then search for a state of the belief state in
which the plan is not applicable; this state is added to the sample and the procedure is iterated
again until a robust plan has been found.
   We first give the theoretical results that justify the algorithm presented at the end of the
section.
   In the following we use the notations $B^O$ and $B$ to represent sets of states such that
$B \subseteq B^O$. $B^O$ will represent the belief state and $B$ a small subset (a few elements) of
$B^O$. $S$, $S'$ will represent any set of states.
   Let $\Pi(q)$ be the set of repair plans that are correct for state $q$. Let $\Pi(S)$ be the set
of repair plans that are correct whichever is the current state from $S$. Then
$\Pi(S) = \bigcap_{q \in S} \Pi(q)$. Notice that $\Pi = \Pi(B^O)$.
   A trivial result is:

$$S \subseteq S' \Rightarrow \Pi(S) \supseteq \Pi(S').$$

A consequence of this proposition is that the optimal repair for $B^O$ is a correct plan for $B$.
Computing the optimal repair plan for the latter may therefore yield the optimal plan for the
former. Let $\pi^*(S)$ be the optimal plan for a set of states. The next proposition determines how
to characterize that an optimal plan was found:

$$S \subseteq S' \land (\pi^*(S) \in \Pi(S')) \Rightarrow \pi^*(S) = \pi^*(S').$$

   This result can be derived from the previous proposition. $\pi^*(S')$ belongs to $\Pi(S)$ since
$S \subseteq S'$; therefore $\pi^*(S)$ is better than (or equal to) $\pi^*(S')$. However, if
$\pi^*(S) \in \Pi(S')$ and yet $\pi^*(S') \neq \pi^*(S)$, then $\pi^*(S')$ must be strictly better
than $\pi^*(S)$, which contradicts what was just said.
   Applied to $S = B$ and $S' = B^O \supseteq B$, this means that $\pi^*(B) \in \Pi(B^O)$ implies
$\pi^*(B) = \pi^*(B^O)$.

   We reuse the notation $B^\pi$ for the set of states in which the plan $\pi$ is correct, and
$\overline{B^\pi} = Q \setminus B^\pi$ for the set of states in which it is not. With this
notation, $\pi^*(B) \in \Pi(B^O)$ is equivalent to $B^O \cap \overline{B^{\pi^*(B)}} = \emptyset$.
   Assume that there exists a procedure verify_applicability$(S, \pi)$ that extracts a state
$q \in S \cap \overline{B^\pi}$ if such a state exists, and returns $\bot$ otherwise. Then, for
$S \subseteq S'$, the following results are trivial:
   • verify_applicability$(S', \pi^*(S)) = \bot \Rightarrow \pi^*(S) = \pi^*(S')$;






   • let $q$ = verify_applicability$(S', \pi^*(S)) \neq \bot$ be a state where $\pi^*(S)$ is not
     applicable, then $q \notin S$ and $\pi^*(S \cup \{q\}) \neq \pi^*(S)$ (and
     $cost(\pi^*(S \cup \{q\})) > cost(\pi^*(S))$; see footnote 1).

The first proposition shows that verify_applicability can be used to check whether the plan
$\pi^*(B)$ is correct for $B^O$. The second proposition indicates how a better prospective plan can
be computed if $\pi^*(B)$ is not correct: the addition of $q$ to $S$ guarantees that a different
plan will be generated.

[Figure 3 (diagram): state-transition graph over the states A to I, with transitions labelled by
the events o1, o2, u and the actions a1, a2.]

Figure 3: System example with two initial states (A and B), two goal states (A and G), one unstable
state (B), two observable events ($o_1$ and $o_2$), and two actions ($a_1$ and $a_2$; an action
affects the system state only if there is a transition).

   These results lead to the procedure presented in Algorithm 1. In this procedure, find_plan$(B)$
is a method that computes a conformant plan from $B$ as defined at the end of the previous section
(and described next section). The procedure computes the optimal plan for a belief sample $B$. If
verify_applicability finds a state $q \in B^O$ in which this plan is not correct, then this state
is added to the belief sample and a new optimal plan is generated and tested.
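
The lazy procedure just described can be sketched in Python on explicit sets of states. This is only an illustration under stated assumptions, not the implementation of the paper: `verify_applicability` enumerates the belief state explicitly (which the paper avoids by resorting to SAT), `find_plan` is a plain breadth-first search over tuples of sample states, the cost of a plan is its number of actions with ties broken by search order, and the model is the machine example of the Introduction with a hypothetical n = 3.

```python
from collections import deque
from itertools import product


def find_plan(sample, actions, step, is_goal):
    """Shortest plan driving every state of `sample` to a goal state
    (breadth-first search over tuples of states)."""
    start = tuple(sample)
    queue, seen = deque([(start, [])]), {start}
    while queue:
        states, plan = queue.popleft()
        if all(is_goal(q) for q in states):
            return plan  # the empty sample yields the empty plan
        for a in actions:
            nxt = tuple(step(q, a) for q in states)  # same action applied to each state
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [a]))
    raise ValueError("no conformant plan exists for this sample")


def verify_applicability(belief, plan, step, is_goal):
    """Return a state of `belief` for which `plan` fails, or None (bottom)."""
    for q in belief:
        end = q
        for a in plan:
            end = step(end, a)
        if not is_goal(end):
            return q
    return None


def self_heal(belief, actions, step, is_goal):
    """Lazy loop: grow the belief sample until the plan is correct for all of `belief`."""
    sample = []
    while True:
        plan = find_plan(sample, actions, step, is_goal)
        q = verify_applicability(belief, plan, step, is_goal)
        if q is None:
            return plan, sample
        sample.append(q)


# Assumed toy model: N machines with unknown on/off status, one stop action per machine.
N = 3
belief = list(product([False, True], repeat=N))  # all 2^N on/off configurations


def step(state, machine):
    """Pressing the stop button of `machine` switches it off (deterministic effect)."""
    return tuple(on and i != machine for i, on in enumerate(state))


def is_goal(state):
    """Goal: every machine is off."""
    return not any(state)


plan, sample = self_heal(belief, range(N), step, is_goal)
```

In this run the returned plan stops each of the three machines once, and the belief sample ends with three states although the belief state has eight.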

Algorithm 1 Diagnosis algorithm for the self-healing             that neither A nor D from B O were explicitly generated
problem without enumerating the belief state B O                 during the procedure.
  B := ∅
  loop                                                           4       SAT Formulation of Self-Healing
    π := find plan (B)                                           In this section we show how Algorithm 1 can be im-
    q := verify applicability (B O , π)                          plemented using sat. This implementation assumes
    if q = ⊥ then                                                a symbolic representation of the model, i.e., a repre-
       return π                                                  sentation where states and transitions are not enumer-
    else                                                         ated but are, instead, implicitly defined by a set V of
       B := B ∪ {q}                                              Boolean state variables (aka fluents) as can be found,
    end if                                                       e.g., in a strips model.
  end loop
                                                                 4.1      Computing a Conformant Plan for B
  Because i) each loop iteration adds an element to B            The procedure we use to compute the optimal plan for
and ii) B O is finite, this procedure is guaranteed to ter-      a belief sample relies on a sat solver and follows the
minate. The number of iteration is, in the worst case,           schematic representation of Figure 2. In planning by
the size of B O ; we expect however that a handful of            sat [Kautz and Selman, 1996], given a horizon k and a
calls to find plan (·) will be sufficient to find the opti-      planning problem a propositional formula Φ is defined
mal plan.                                                        that is satisfiable iff there exists a sequence of actions
                                                                 of length k that solves the planning problem.2 Fur-
Example                                                          thermore Φ is defined over k + 1 copies of the state
We illustrate Algorithm 1 with the example of Figure 3.          variables (the state sat variables p0 to pk where p is a
Assume that the observations are O = [o1 , o2 ]. Accord-         state variable) and k copies of the actions (the action
ing to the model, the belief state is B O = {A, D, F, H}         sat variables a0 to ak−1 where a is an action). Φ is
(state B is unstable, so the system cannot be in this            defined such that a solution to the planning problem
state). The state needs to be returned to a subset of            can be trivially extracted from the satisfying assign-
{A, G}.                                                          ment (for instance, if ai evaluates to true, then the ith
   Since the belief sample B0 is initially empty, Algo-          action of the plan is a). If, for instance, action a sets
rithm 1 first generates the empty plan π0 = ε. The               state variable p to false, Φ will be defined such that for
procedure verify applicability exhibits state F such that        all i ∈ {1, . . . , k}
$B \xrightarrow{u} D \xrightarrow{o_1} E \xrightarrow{o_2} F$ could explain O and such that
                                                                       $\Phi \equiv (a_{i-1} \rightarrow \neg p_i) \wedge \cdots$
plan π0 does not lead to a goal state when applied from
F . The optimal plan for B1 = {F } is π1 = a1 . This             We refer the reader to the literature on planning by sat
time verify applicability extracts state H which also be-        for more details on this reduction.
longs to the belief state and for which the application            Given a sample B of b states we create b copies of the
of a1 leads to sink state I. The belief sample B2 now            state sat variables: $p_i^1, \ldots, p_i^b$; the variables $p_i^\ell$ model
equals {F, H} and the optimal conformant plan for B2             the effects of applying the plan on the state qℓ ∈ B.
is π2 = a2 , a1 (remember that unobservable transition           We stick to a single set of action sat variables and
$F \xrightarrow{u} H$ cannot trigger after the execution of a2). This          each copy of the state sat variables is linked to this
plan is correct for all elements in the belief state. Notice
                                                                     2
                                                                    The value of k is initialized to 0 and incremented until
   1
       Remember that no two plans have the same cost.            Φ becomes satisfiable.
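
The loop of Algorithm 1 can be sketched as follows. This is a minimal illustration and not the implementation evaluated in the paper: the transition table TRANS, the sets BELIEF and GOALS, and the helper names are hypothetical, plan cost is approximated by plan length, and both find_plan and verify use brute-force enumeration where the paper uses reductions to sat.

```python
from itertools import product

# Hypothetical toy repair model: (state, action) -> next state.
# Pairs that are not listed leave the state unchanged.
TRANS = {("F", "a1"): "G", ("H", "a1"): "I", ("D", "a1"): "A", ("H", "a2"): "F"}
ACTIONS = ("a1", "a2")
BELIEF = {"A", "D", "F", "H"}   # stands in for the belief state B^O
GOALS = {"A", "G"}


def step(state, action):
    """Deterministic effect of one repair action on one state."""
    return TRANS.get((state, action), state)


def apply_plan(state, plan, step):
    """Apply a sequence of actions to a single state."""
    for action in plan:
        state = step(state, action)
    return state


def find_plan(sample, actions, step, goals, max_len=4):
    """Shortest plan that drives every sampled state to a goal, or None.

    Brute force replaces the sat-based conformant planner; length stands
    in for cost (an assumption of this sketch).
    """
    for k in range(max_len + 1):
        for plan in product(actions, repeat=k):
            if all(apply_plan(q, plan, step) in goals for q in sample):
                return plan
    return None


def verify(belief, plan, step, goals):
    """Return a belief state on which the plan fails, or None if it is correct."""
    for q in sorted(belief):
        if apply_plan(q, plan, step) not in goals:
            return q
    return None


def belief_sample_repair(belief, actions, step, goals):
    """Grow the sample with counterexamples until the plan is conformant."""
    sample = set()
    while True:
        plan = find_plan(sample, actions, step, goals)
        if plan is None:
            return None, sample
        counterexample = verify(belief, plan, step, goals)
        if counterexample is None:
            return plan, sample
        sample.add(counterexample)


plan, sample = belief_sample_repair(BELIEF, ACTIONS, step, GOALS)
print(plan, sorted(sample))
```

On this toy model the loop stops with a sample that is strictly smaller than the belief state, which is the behaviour the method relies on.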






set. The formula Φ presented in the example above                 plan is finally computed the conformant planning re-
will therefore now translate as                                   duction to sat is satisfiable while the reduction of the
                                                               applicability function is not.
  $\Phi \equiv (a_{i-1} \rightarrow \neg p_i^1) \wedge \cdots \wedge (a_{i-1} \rightarrow \neg p_i^b) \wedge \cdots$
4.2    Verifying Correctness of a Plan                            5                Experiments
Like the plan generation, plan correctness is imple-              We ran some experimental evaluation of the approach
mented in sat. This time it matches the representation            presented in this paper.
of Figure 1.                                                         Since the problem presented here is new, we had to
   A plan is proved incorrect if an explanation of the            build new benchmarks. We propose a variant of the
observations can be found in which the application of             benchmark presented by Grastien et al. [2007] which
the plan leads to a non-goal final state (remember that all       will be made available to the community. The system
plans are applicable).                                            comprises 20 components interconnected in a torus
   Once again a propositional formula is defined that is          shape. Each component contains eight states, including
satisfiable iff such an explanation exists. This formula          two unstable states and one goal state. The behaviour
contains two parts: sat variables pi∈{0,...,n} represent          on each component can affect its neighbour and the
the state of the system in the active mode while                  local observations do not allow anything to be determined
variables p′i∈{0,...,k} represent the state in the repair mode.3  about the local behaviour: the full system needs to
The formula is the conjunction of the formulas:                   be monitored in order to understand the system state.
   • Φactive a propositional formula that is satisfiable          Repair actions can also be local or affect several com-
      iff there exists an explanation to the observa-             ponents.
      tions (whose final state is represented by the vari-           We built 100 problem instances on this system. We
      ables pn ); this type of reduction is quite standard        restricted ourselves to totally ordered observations, but
      [Grastien and Anbulagan, 2013];                             notice that one of the benefits of using diagnostic tech-
                                                                  niques is to be able to handle partially-ordered obser-
   • Φ′repair a propositional formula that is satisfiable         vations (observations where the order of the observed
      iff there exists a state in which the proposed plan         events is only partially known because the delay be-
      is not correct (this state is represented by the vari-      tween their reception is small compared to the trans-
      ables p′0 );                                                mission/processing delay).
   • $\bigwedge_{p \in V} (p_n \leftrightarrow p'_0)$, where p ranges over the state          We compare our approach to a symbolic approach
      variables, the formula that links the final state of        that uses BDDs (specifically the buddy package) to
      the active phase and the initial state of the repair        track the belief state and then uses A* to find the op-
      phase.                                                      timal repair plan. The heuristic used by A* is imple-
   Intuitively, the assignments of the variables pn that          mented as follows: a state of the system is extracted
are consistent with Φactive are a symbolic representa-            from the BDD and the optimal repair is computed for
tion of B O . Formally let V be the set of variables that         this state using sat; the length of this optimal repair is
appear in Φactive ; then ∃(V \ {pn | p ∈ V }). Φactive is         used as a lower bound for the optimal repair from the
logically equivalent to the symbolic representation of            current search node.
B O . Similarly the variables p′0 of Φ′repair represent B π .        Our belief sample method uses glucose_static 4.0
                                                                  [Audemard and Simon, 2009]. glucose is heavily based
   As a consequence any other representation of B O or            on the minisat solver [Eén and Sörensson, 2003].
B π could be used if such representations are more                   The experiments were run on a 4-core 2.5GHz CPU with
convenient (e.g., if they are more compact or if they help        4GB RAM, under GNU/Linux Mint 16 “petra”. A ten-minute
the sat solver).                                                  (600s) timeout was imposed.
Difference Between the Two Reductions                             [Figure 4 plot: y-axis “Time (s)”, logarithmic, 1 to 1000;
The first reduction aims at finding a plan of length k            x-axis “Problems”, 0 to 100; series BuDDy and Belief Sample.]
that is applicable in b states. Therefore it includes b × k
copies of the state variables and k copies of the action          Figure 4: Runtime in seconds required to solve
variables.                                                        self-healing problem instances; sorted.
   The second reduction aims at finding a plan composed
of two parts: a trajectory in the active space and
a trajectory in the repair space. Therefore it includes
n + k copies of the state variables and n copies of the
events (there could be k copies of the actions but the
value of these variables is known in advance since the
plan is an input of this reduction).
   An interesting difference between the two reductions
is that the trajectories of the former should lead to goal
states while the trajectory of the latter should lead to            The results are summarized in Figure 4. The instances
a non goal state. As a consequence when the repair                are sorted in increasing runtime, meaning that
                                                                  the instance at position x for one implementation may
   3                                                              be different from the instance at the same position for
     It is assumed that the length of the explanation can be      the other. The approach based on the generation of the
bounded by a known value n; k is the length of the plan           belief state only saw 64 instances solved before timeout,
being tested.                                                     against 83 for our approach. In general our approach






is two orders of magnitude faster than A*, although            generated [Micalizio, 2014].
we would need more benchmarks and comparisons to
understand better the strength of this approach.               7   Conclusion and Extensions
   Out of the 87 instances solved by the Belief                In this paper we presented a method to solve the
Sample method, 82 could be solved by exhibiting only           self-healing problem. The problem consists in finding a
one element of the belief state. Another three instances       repair plan that can lead back to a goal state a sys-
could be solved with a sample of two elements, and             tem whose execution has been partially observed. We
two required a sample of three elements to generate a          avoid computing the belief state. Instead we propose a
conformant plan.                                               method whereby plans are computed on a sample of the
                                                               belief state whilst a diagnoser verifies their correctness
6   Discussion                                                 and generates an element of the belief state (added to
                                                               the sample) if the plan is not correct. Both the plan-
The objective of connecting the diagnostic and plan-
                                                               ning and the diagnosis problems are reduced to sat
ning tasks is quite ambitious. From the diagnostic per-
                                                               problems. We show that non trivial problems can be
spective, and since the seminal work from Sampath et
                                                               easily solved by this approach.
al. [1995] the problem has generally been the detection
of specific events or patterns of events [Jéron et al.,
2006]. The main inspiration of the present work is the            There are many possible extensions to this work.
self-healability question asked by Cordier et al. [2007];      One issue is that enforcing a conformant plan may be
the aforementioned work is one of the first attempts to        too restrictive. We want to avoid prohibitive repairs
frame diagnosis as the problem of finding the optimal          in situations where the system is healthy. This is a
repair plan, although the complexity of computing the          common problem in diagnosis of dynamic systems: the
plan is not addressed. In static contexts similar              state of the system can never be precisely determined
questions have been asked where the problem was framed as      at the current time; it is often conceivable that
finding the optimal balance between increasing the cost        a fault just happened on the system and has not had
of gathering information (observations) and improving          time to develop into a visible faulty trace. The issue
the precision of diagnosis (and, consequently, reducing        here is that conformant plans must provide for such
the cost of planning) [Torta et al., 2008].                    contingencies even when there is no evidence for them.
   Supervisory control [Ramadge and Wonham, 1989] is           An implicit assumption of our work is that unhealthy
a problem very similar to self-healing. The goal is to         system behaviours can be detected to a large extent.
control some actions (forbid their occurrence) in order        The set of unstable states serves this purpose: they are
to meet some specification. The main difference with           useful to model the fact that any “failure” in the sys-
our work is the fact that control applies continuously         tem will lead to abnormal observations before a repair
while we assume that self-healing is performed when the        action is performed.
system is not active (either because the repair process           We see two avenues to handle situations where the
is expensive, since it might require stopping the system for   instability feature cannot address the problem
instance, or because it can only be performed at some          presented before. First probabilities can be incorporated
time, every night for instance). Furthermore control           into the model, which allows for chance-constrained
tries to be as unobtrusive as possible: it merely forbids      planning [Santana and Williams, 2014]. Issues with this
some transitions and generally does not choose actions         approach include the problem of building large models
to perform.                                                    with meaningful probabilities and the problem of ex-
   Conformant planning [Smith and Weld, 1998] is the           tending the sat reduction to deal with probabilities
problem of finding a sequence of actions that is guar-         (as well as scaling up to large models). A second, qual-
anteed to lead to the specified goal, despite uncertainty      itative, possibility is to ignore contingencies that are
on the initial state and nondeterministic action effects.      supported by no strong evidence. For instance failures
Solutions to conformant planning have been proposed            that are not part of a minimal diagnosis might be ig-
that compute the belief state and run heuristic search         nored.
[Bonet and Geffner, 2000] or that represent the belief            Another restriction of the current approach is that
state symbolically [Cimatti and Roveri, 2000]. More            the goal G is assumed to be known explicitly. Speci-
similar to our work Hoffmann and Brafman [2006] pro-           fication of goal states may however be more complex:
posed Conformant-FF in which the belief state is rep-          Ciré and Botea [2008] have proposed to define goals
resented implicitly by the set of initial states and the       as properties of states defined in linear temporal logic
sequence of actions leading to the current state; at           (LTL). Another relevant goal property is diagnosability
every time step, a sat solver is used to determine the state   [Sampath et al., 1995], i.e., the property that the
variable values that can be inferred with certainty. This      observations on the system allow detecting/identifying the
approach is similar to ours in the way it avoids comput-       important system failures. A related issue is the incre-
ing belief states. More generally, we would like to adapt      mental aspect: how to handle a repair after an active
our method to solve conformant planning problems.              period following a first repair. A simple solution is to
   The combination of planning and diagnosis has also          assume that the initial state after the repair is the goal
been studied in the context of plan repair. There, a           state.
(possibly conformant) plan is computed that assumes
that contingencies are unlikely to happen. The plan            Acknowledgments
execution is then monitored and if the outcome of              NICTA is funded by the Australian Government
execution does not match the predictions, a new plan is        through the Department of Communications and the






[Figure 5 diagram: states N0, N1, N2, F1, F2, R0, R1, R2;                          event/action     neighbour event/action
transitions labelled f, nf, reb and back.]                                              f                    nf
                                                                                        t                     z
Figure 5: Active model for one component (observable
events are reb and back).                                                          Table 1: Synchronised events

                                                                                   N (no fault); it moves to F when a fault occurs and R
                                                                                   when it recovers. The second part of the state is
                                                                                   generally 0 (the component is running) and moves to 1 when
                                                                                   it needs to reboot and to 2 when it is rebooting. A fault
                                                                                   on a component forces its neighbours to reboot. One
                                                                                   difficulty of diagnosis for this type of system is that the
                                                                                   observations (reb and back) do not point precisely to
                                                                                   the faulty component.
                                                                                      The repair consists in returning to state N0 . Most
                                                                                   states require action t to return to state N0 but this
                                                                                   action can move the neighbours of the component to
                                                                                   state N2 . Therefore finding the optimal repair requires
                                                                                   ordering the actions carefully.

[Figure 6 diagram: states N0, N1, N2, F1, F2, R0, R1, R2;                          References
transitions labelled z, s and t.]                                                  [Audemard and Simon, 2009] G. Audemard and L. Simon.
                                                                                     Predicting learnt clauses quality in modern
Figure 6: Repair model for one component (no transition                              SAT solver. In 21st International Joint Conference
means that the state is not affected by action).                                     on Artificial Intelligence (IJCAI-09), 2009.
                                                                                   [Bonet and Geffner, 2000] B. Bonet and H. Geffner.
                                                                                     Planning with incomplete information as heuristic
                                                                                     search in belief space. In Fifth International
                                                                                     Conference on AI Planning and Scheduling (AIPS-00),
                                                                                     pages 52–61, 2000.
                                                                                   [Cassandras and Lafortune, 1999] C. Cassandras and
                                                                                     S. Lafortune. Introduction to discrete event systems.
                                                                                     Kluwer Academic Publishers, 1999.
                                                                                   [Cimatti and Roveri, 2000] A. Cimatti and M. Roveri.
                                                                                     Conformant planning via symbolic model checking.
                                                                                     Journal of Artificial Intelligence Research (JAIR),
Australian Research Council through the ICT Centre                                   13:305–338, 2000.
of Excellence Program.
                                                                                   [Ciré and Botea, 2008] A. Ciré and A. Botea. Learn-
                                                                                     ing in planning with temporally extended goals and
A         Problem Benchmark                                                          uncontrollable events. In Eighteenth European Con-
We now present the system we used in the experi-                                     ference on Artificial Intelligence (ECAI-08), pages
ments.4                                                                              578–582, 2008.
   The system includes 20 components ci,j where i                                  [Cordier et al., 2007] M.-O. Cordier, Y. Pencolé,
ranges between 0 and 3 and j between 0 and 4. The                                    L. Travé-Massuyès, and T. Vidal. Self-healability
component ci,j is connected to ci′ ,j ′ iff the total                                difference |i − i′ | + |j − j ′ | is at most one (where i and j are                 = diagnosability + repairability. In Eighteenth
taken modulo 4 and 5, respectively, so that the grid wraps                           International Workshop on Principles of Diagnosis
around into a torus). For instance, c0,1 is connected                                (DX-07), pages 251–258, 2007.
to four components c0,0 , c0,2 , c3,1 , and c1,1 .
   The model of one component for the active mode is                               [Eén and Sörensson, 2003] N. Eén and N. Sörensson.
given in Figure 5 and the model for the repair mode                                  An extensible SAT-solver. In Sixth Conference
is given in Figure 6. The connections between                                        on Theory and Applications of Satisfiability Testing
components imply forced transitions when some events                                 (SAT-03), pages 333–336, 2003.
occur; these are summarised in Table 1. For instance,                              [Grastien and Anbulagan, 2013] A. Grastien and
when event f occurs on component c0,1 , event nf oc-                                 A. Anbulagan. Diagnosis of discrete event systems
curs on every one of its four neighbours.                                            using satisfiability algorithms: a theoretical and
   A component state contains two types of informa-                                  empirical study. IEEE Transactions on Automatic
tion: whether a failure occurred on the component and                                Control (TAC), 58(12):3070–3083, 2013.
whether it is running. The first part of the state is initially
                                                                                   [Grastien et al., 2007] A. Grastien, A. Anbulagan,
    4
    The benchmark is available at this address:                                      J. Rintanen, and E. Kelareva. Diagnosis of discrete-
http://www.grastien.net/ban/data/bench-dx15.tar.gz.                                  event systems using satisfiability algorithms. In
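
The topology described in this appendix can be checked with a short sketch. It assumes a 4 × 5 torus (i in 0..3, j in 0..4, with wrap-around in both dimensions), which is consistent with the example of c0,1; the function names and the SYNC dictionary are illustrative only and are not taken from the benchmark files.

```python
ROWS, COLS = 4, 5  # i ranges over 0..3 and j over 0..4, giving 20 components

# Synchronised events of Table 1: event on a component -> event forced on neighbours.
SYNC = {"f": "nf", "t": "z"}


def neighbours(i, j, rows=ROWS, cols=COLS):
    """Return the four torus neighbours of component c_{i,j}."""
    return {
        ((i - 1) % rows, j),
        ((i + 1) % rows, j),
        (i, (j - 1) % cols),
        (i, (j + 1) % cols),
    }


def forced_events(i, j, event):
    """Map each neighbour of c_{i,j} to the event forced on it, if any."""
    if event not in SYNC:
        return {}
    return {n: SYNC[event] for n in neighbours(i, j)}


components = [(i, j) for i in range(ROWS) for j in range(COLS)]
print(sorted(neighbours(0, 1)))
print(forced_events(0, 1, "f"))
```

Here event f on c0,1 yields event nf on each of its four neighbours, matching the example given in the text.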






   22nd Conference on Artificial Intelligence (AAAI-
   07), pages 305–310, 2007.
[Hoffmann and Brafman, 2006] J.         Hoffmann     and
   R. Brafman. Conformant planning via heuristic for-
   ward search: a new approach. Artificial Intelligence
   (AIJ), 170:507–541, 2006.
[Jéron et al., 2006] T. Jéron, H. Marchand, S. Pinchi-
   nat, and M.-O. Cordier. Supervision patterns in
   discrete-event systems diagnosis. In Seventeenth
   International Workshop on Principles of Diagnosis
   (DX-06), pages 117–124, 2006.
[Kautz and Selman, 1996] H. Kautz and B. Selman.
   Pushing the envelope : planning, propositional logic,
   and stochastic search. In Thirteenth Conference on
   Artificial Intelligence (AAAI-96), pages 1194–1201,
   1996.
[Lamperti and Zanella, 2003] G.        Lamperti      and
   M. Zanella. Diagnosis of active systems. Kluwer
   Academic Publishers, 2003.
[Micalizio, 2014] R. Micalizio. Plan repair driven by
   model-based agent diagnosis. Intelligenza Artificiale,
   8(1):71–85, 2014.
[Ramadge and Wonham, 1989] P.           Ramadge      and
   W. Wonham. The control of discrete event systems.
   Proceedings of the IEEE: special issue on Dynamics
   of Discrete Event Systems, 77(1):81–98, 1989.
[Sampath et al., 1995] M. Sampath, R. Sengupta,
   S. Lafortune, K. Sinnamohideen, and D. Teneket-
   zis.      Diagnosability of discrete-event systems.
   IEEE Transactions on Automatic Control (TAC),
   40(9):1555–1575, 1995.
[Santana and Williams, 2014] P.         Santana      and
   B. Williams.          Chance-constrained consistency
   for probabilistic temporal plan networks. In 24th
   International Conference on Automated Planning
   and Scheduling (ICAPS-14), 2014.
[Smith and Weld, 1998] D. Smith and D. Weld. Con-
   formant graphplan. In Fifteenth Conference on Ar-
   tificial Intelligence (AAAI-98), pages 889–896, 1998.
[Torta et al., 2008] G. Torta, D. Theseider Dupré, and
   L. Anselma. Hypothesis discrimination with abstrac-
   tions based on observation and action costs. In Nine-
   teenth International Workshop on Principles of Di-
   agnosis (DX-08), pages 189–196, 2008.








                         Implementing Troubleshooting with Batch Repair

                           Roni Stern1 and Meir Kalech1 and Hilla Shinitzky 1
                                     1
                                       Ben Gurion University of the Negev
                   e-mail: roni.stern@gmail.com, kalech@bgu.ac.il, hillash@post.bgu.ac.il



                         Abstract                                   repair overhead. Instead, an efficient BRP algorithm would
                                                                    repair all the faulty components in a single repair action.
     Recent work has raised the challenge of efficient              More generally, we expect an intelligent BRP algorithm to
     automated troubleshooting in domains where re-                 weigh the cost of repairing batches of components as well
     pairing a set of components in a single repair ac-             as the repair overhead. Some discussion on repairing mul-
     tion is cheaper than repairing each of them sepa-              tiple components together was done in prior work on self
     rately. This corresponds to cases where there is a             healability [5].
     non-negligible overhead to initiating a repair ac-
     tion and to testing the system after a repair ac-                 Due to the repair overhead, repairing a single component,
     tion. In this work we propose several algorithms               even if it is the component most likely to be faulty, can
     for choosing which batch of components to repair,              be wasteful. This is especially wasteful in cases where all
     so as to minimize the overall repair costs.                    the found diagnoses consist of multiple faulty components,
     Experimentally, we show the benefit of these algorithms        thus suggesting that repairing a single component would not
     over repairing components one at a time (and not               fix the problem. Alternatively, one may choose to repair the
     as a batch).                                                   components in the most likely diagnoses. This may also
                                                                    be wasteful, especially if there are several diagnoses which
                                                                    have similar likelihood. It might be worthwhile to repair
1 Introduction                                                      by a single repair action a set of components that “covers”
Troubleshooting algorithms, in general, plan a sequence of          more than a single diagnosis. This may reduce the number
actions that are intended to fix an abnormally behaving sys-        of repair actions until the system is fixed, thus saving repair
tem. Fixing a system includes repairing faulty components.          overhead costs. The downside in this approach is that the
Such repair actions incur a cost. These costs can be parti-         component repair costs can be high, as more healthy com-
tioned into two types of repair cost. The first, referred to as     ponents may be repaired.
the component repair cost, is the cost of repairing a               [Figure 1 diagram: components A (p({A})=0.6) and B
component. The second, referred to as the repair overhead, is       (p({B})=0.4); labels in1=1, in2=1, out1=1.]
the cost of preparing the system to perform repair actions          Figure 1: An example where repairing components one at
(e.g., halting the system may be required), and the cost of         a time is wasteful.
testing the system after performing a repair action.
   This paper considers the case where the repair overhead             For example, consider the small system described in
is not negligible and is potentially more expensive than a          Figure 1. It is a logical circuit whose output is faulty.
component repair cost (of a single component). Therefore,           Assume that the “OR” gate is known to be healthy
it may be more efficient to repair a batch of components            and there are only two possible diagnoses:
in a single repair action. We call the problem of choosing          either A is faulty or B is faulty,
which batch of components to repair the Batch Repair                where the probability that
Problem (BRP). BRP is an optimization problem, where the task       A and B are faulty is

is to minimize the total repair costs, which is the sum of the      0.6 and 0.4, respectively. There are three possible repair
repair overheads and component repair costs incurred by all         actions: to repair A, to repair B, and to repair A and
the repair actions performed until the system is fixed.             B. Assume the repair overhead costs 10, and repairing a
   Note that in this paper we use the term “repair” for a sin-      component costs 1. If A is repaired, there is a 0.4 chance
gle or a set of components and the term “fix” to refer to           that the system would not be fixed and another repair action
the entire system. Thus, repairing components eventually            would be needed (repairing B). Thus, the expected total
causes the system to be fixed, and a system is only fixed if it     repair cost of repairing A first is 15.4. Similarly, the total
returned to its nominal behavior.                                   repair cost for repairing B first is 17.6. The best option is
   Most previous work assumed that components are re-               thus to repair A and B together in a single repair action,
paired one at a time [1; 2; 3; 4]. This approach can be             incurring a total repair cost of 12.
wasteful for BRP. For example, if a diagnosis engine infers            Recent work [6] proposed two high-level approaches to
that multiple faulty components need to be repaired to fix          solve BRP: as a planning under uncertainty problem, or as a
the system, then it would be wasteful to repair these com-          combinatorial optimization problem. When modeling BRP
ponents one at a time since each repair action incurring its        as a planning under uncertainty problem the task is to find a




repair policy, mapping a state of the system to the repair action that minimizes the expected
total repair costs. This approach, while attractive theoretically, quickly becomes infeasible
in non-trivial scenarios.

In this work we focus on the second high-level approach proposed for BRP, in which BRP is
modeled as a combinatorial optimization problem, searching in the combinatorial space of
possible repair actions for the best repair action. There are two challenges in implementing
this approach: first, how to measure the quality of a repair action, and second, how to
efficiently search for the repair action that maximizes this measure. There are many
efficient heuristic search algorithms in the literature, and thus the main challenge
addressed in this work is in proposing several heuristics for estimating the merit of a
repair action.

The contributions of this work are practical. A range of heuristic objective functions are
proposed and analyzed, and we evaluate their effectiveness experimentally on a standard
benchmark. A clear observation from the results is that considering batch repair actions can
indeed save repair costs significantly. Moreover, the most effective heuristics provide a
tunable tradeoff between computation time and resulting repair costs.

2 Problem Definition

A classical MBD input $\langle SD, COMPS, OBS \rangle$ is assumed, where $SD$ is a model of
the system, $COMPS$ represents the components in the system, and $OBS$ is the observed
behavior of the system. Every component can be either normal or abnormal. The assumption that
a component $c \in COMPS$ is abnormal is represented by the abnormal predicate $AB(c)$.

A batch repair problem (BRP) arises when the assumption that all components are normal is not
consistent with the system description and observations. Formally,

\[ SD \wedge OBS \wedge \bigwedge_{c \in COMPS} \neg AB(c) \ \text{is not consistent} \]

In such a case, at least one component must be repaired.

Definition 1 (Repair Action). A repair action can be applied to any subset of components and
results in these components becoming normal. Applying a repair action to a set of components
$\gamma$ is denoted by $Repair(\gamma)$.

Definition 1 assumes that repair actions always succeed, i.e., a component is normal after it
is repaired.

After a repair action, the system is tested to check if it has been fixed. We assume that the
system inputs in this test are the same as in the original observations ($OBS$). The observed
system outputs are then compared to the expected system outputs of a healthy system. Thus,
the result of a repair action is either that the system is fixed, or a new observation that
may help in choosing future repair actions.

Repairing a set of components incurs a cost, composed of a repair overhead and component
repair costs. The repair overhead is denoted by $cost_{repair}$, and the component repair
cost of a component $c \in COMPS$ is denoted by $cost_c$.

Definition 2 (Repair Costs). Given a set of components $\gamma \subseteq COMPS$, applying a
repair action $Repair(\gamma)$ incurs a cost:

\[ cost(Repair(\gamma)) = cost_{repair} + \sum_{c \in \gamma} cost_c \]

We assume that all repair costs are positive and non-zero, i.e., $cost_{repair} > 0$ and
$cost_c > 0$ for every component $c \in COMPS$. As defined earlier, the task in BRP is to fix
a system with minimum total repair cost.

As shown in Figure 1, an efficient BRP solver should consider the possibility of repairing a
set of components in a single repair action. Thus, the potential number of repair actions is
$2^{|COMPS|}$. Therefore, from a complexity point of view, BRP is an extremely hard problem.

3 Preliminaries

Next, we provide background and definitions required for describing the BRP algorithms we
propose.

$SD$ describes the behavior of the diagnosed system, and in particular the behavior of each
component. The term behavior mode of a component refers to a state of the component that
affects its behavior. $SD$ describes for every component one or more behavior modes. For
every component, at least one of the behavior modes must represent the nominal behavior of
the component.

A mode assignment $\omega$ is an assignment of behavior modes to components. Let
$\omega^{(+)}$ be the set of components assigned a nominal (i.e., normal) behavior mode and
$\omega^{(-)}$ be the set of components assigned one of the other modes.

Definition 3 (Diagnosis). A mode assignment $\omega$ is called a diagnosis if
$\omega \wedge OBS \wedge SD$ is satisfiable.

A model-based diagnosis engine (MBDE) accepts as input $SD$, $OBS$, and $COMPS$ and outputs a
set of diagnoses $\Omega$. Although a diagnosis is consistent with $SD$ and $OBS$, it may be
incorrect. A diagnosis $\omega$ is correct if by repairing the set of components in
$\omega^{(-)}$ the system is fixed. Some diagnosis algorithms return, in addition to
$\Omega$, a measure of the likelihood that each diagnosis is correct [7; 8]. Let
$p : \Omega \to [0, 1]$ denote this likelihood measure. We assume that $p(\omega)$ is
normalized so that $\sum_{\omega \in \Omega} p(\omega) = 1$ and use it to approximate the
probability that $\omega$ is correct.

A common way to estimate the likelihood of diagnoses assumes that each component has a prior
on the likelihood that it would fail and that component failures are independent. Therefore,
if $p(c)$ represents the likelihood that a component $c$ would fail, then the diagnosis
likelihood can be computed as

\[ p(\omega) = \frac{\prod_{c \in \omega^{(-)}} p(c)}{\sum_{\omega' \in \Omega} \prod_{c \in \omega'^{(-)}} p(c)} \qquad (1) \]

where the denominator is a normalizing factor. We assume in the rest of this paper that
diagnosis likelihoods are computed according to Equation 1. Other methods for computing the
likelihood of diagnoses also exist [9].

3.1 System Repair Likelihood

If the MBDE returns a single diagnosis $\omega$ that is guaranteed to be correct, then the
optimal solution to BRP would be to perform a single repair action: $Repair(\omega^{(-)})$.
This, however, is rarely the case, and more often a possibly very large set of diagnoses is
returned by diagnosis algorithms. This introduces uncertainty as to whether a repair action
would actually fix the system. We define this uncertainty as follows:

Definition 4 (System Repair Likelihood). The System Repair Likelihood of a set of components
$\gamma \subseteq COMPS$, denoted $SystemRepair(\gamma)$, is the probability that
$Repair(\gamma)$ would fix the system.
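
To make Equation 1 and Definition 2 concrete, the following minimal Python sketch recomputes the costs quoted for the Figure 1 example. The function names, the representation of a diagnosis as a frozenset of its faulty components, and the component priors 0.6 and 0.4 are illustrative assumptions; the sequential-cost helper additionally assumes that exactly one single-component diagnosis is correct, as in Figure 1.

```python
from math import prod


def diagnosis_likelihoods(diagnoses, prior):
    """Equation 1: product of component priors per diagnosis, normalized over all diagnoses."""
    weights = {d: prod(prior[c] for c in d) for d in diagnoses}
    total = sum(weights.values())
    return {d: w / total for d, w in weights.items()}


def repair_cost(gamma, cost_repair, cost_c):
    """Definition 2: one repair overhead plus the repair cost of every component in gamma."""
    return cost_repair + sum(cost_c[c] for c in gamma)


def expected_cost_one_at_a_time(order, p, cost_repair, cost_c):
    """Expected total cost of repairing single components in the given order.

    Assumption (holds for Figure 1): every diagnosis is a single component and
    exactly one diagnosis is correct, so repairs stop once that component is repaired.
    """
    expected, p_not_fixed = 0.0, 1.0
    for c in order:
        expected += p_not_fixed * repair_cost({c}, cost_repair, cost_c)
        p_not_fixed -= p[frozenset({c})]
    return expected


# Figure 1 example: overhead 10, component repair cost 1 (priors are assumed values).
p = diagnosis_likelihoods(
    [frozenset({"A"}), frozenset({"B"})], prior={"A": 0.6, "B": 0.4}
)
cost_c = {"A": 1, "B": 1}
print(f"{expected_cost_one_at_a_time(['A', 'B'], p, 10, cost_c):.1f}")  # 15.4
print(f"{expected_cost_one_at_a_time(['B', 'A'], p, 10, cost_c):.1f}")  # 17.6
print(repair_cost({"A", "B"}, 10, cost_c))  # 12
```

The three printed values correspond to repairing A first, repairing B first, and repairing A and B in one batch.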




Consider the relation between $p(\omega)$ and $SystemRepair(\omega^{(-)})$. If $\omega$ is
correct, then repairing all components that are faulty, meaning $\omega^{(-)}$, would fix the
system. Therefore, the likelihood that repairing $\omega^{(-)}$ causes the system to be fixed
is at least $p(\omega)$, i.e.,

\[ SystemRepair(\omega^{(-)}) \geq p(\omega) \]

Moreover, if $\omega$ is correct then repairing any superset of $\omega^{(-)}$ would also fix
the system. Thus, $SystemRepair(\omega^{(-)})$ may be larger than $p(\omega)$. On the other
hand, repairing any set of components that is not a superset of $\omega^{(-)}$ would not fix
the system, as there would still be faulty components in the system. Therefore, a repair
action $Repair(COMPS')$ would fix the system if and only if
$\omega^{*(-)} \subseteq COMPS'$, where $\omega^*$ is the correct diagnosis. While we do not
know $\omega^*$, we can compute $SystemRepair(\gamma)$ from $\Omega$ and $p(\cdot)$:

\[ SystemRepair(\gamma) = \sum_{\omega \in \Omega \,\wedge\, \omega^{(-)} \subseteq \gamma} p(\omega) \]

For example, in the logical circuit depicted in Figure 1, there are two diagnoses, {A} and
{B}, such that p({A}) = 0.6 and p({B}) = 0.4. Thus, SystemRepair({A}) = 0.6,
SystemRepair({B}) = 0.4, and SystemRepair({A, B}) = p({A}) + p({B}) = 1.

4 BRP as a Combinatorial Search Problem

As mentioned in the introduction, the approach for solving BRP that we pursue in this paper
formulates BRP as a combinatorial search problem. The search space is the space of possible
repair actions, i.e., every subset of the set of components that were not repaired yet. The
search problem is to find the repair action that maximizes a utility evaluation function
$u(\cdot)$ that maps a repair action to a real value that estimates its merit.

The effectiveness of this search-based approach for BRP depends on the search algorithm used
and how the utility function $u(\cdot)$ is defined. There are many existing heuristic search
algorithms for searching large combinatorial search spaces [10; 11]. Thus, in this work we
propose and evaluate a set of possible utility functions. Note that for some of the utility
functions described next it is possible to find the best repair action without searching the
entire search space of possible actions, while others are more computationally intensive.

4.1 k Highest Probability

A key source of information for all the utility functions described below is the set of
diagnoses $\Omega$ and their likelihoods ($p(\cdot)$). We assume that this information is
obtained by using a diagnosis engine over the observations of the current state of the
system. The set of returned diagnoses may be very large. The first utility function we
propose is based on the system's health state, which has been recently proposed as a method
for aggregating information from a set of diagnoses [12].

Definition 5 (Health State). A health state is a mapping $F : COMPS \to [0, 1]$ where

\[ F(c) = \sum_{\omega \in \Omega \ \text{s.t.}\ c \in \omega^{(-)}} p(\omega) \]

$F(c)$ is an estimate of the likelihood that component $c$ is faulty given a set of diagnoses
$\Omega$ and their likelihoods. Based on the system's health state, we propose the following
utility function, denoted $u_{HP}$:

\[ u_{HP}(\gamma) = \sum_{c \in \gamma} F(c) \]

where $\gamma$ is any subset of the components in $COMPS$ that have not been repaired yet.

The repair action that maximizes $u_{HP}$ is trivial: repair all components. This would
result in the system being fixed, but of course may repair many components that are likely to
be healthy. To mitigate this effect, we propose the k highest probability repair algorithm
(k-HP), which limits the number of components that can be repaired in a single repair action
to $k$, where $k$ is a user-defined parameter. Note that computing k-HP does not need any
exhaustive search: simply sort the components in descending order of their $F(\cdot)$ values
and repair the first $k$ components.

The k-HP repair algorithm has two clear disadvantages. First, the user needs to define $k$.
Second, k-HP does not consider repair costs (neither component repair costs nor overhead
costs). The next set of utility functions and corresponding repair algorithms address these
disadvantages.

4.2 Wasted Costs Utilities

Before describing the next set of proposed utility functions we explain the over-arching
reasoning behind it. Repairing a system requires performing repair actions. Some repair costs
are inevitable. These are the repair overhead of a single repair action, and the component
repair costs of repairing the faulty components. We propose a family of utility functions
that try to estimate the expected total repair costs beyond these inevitable costs. We refer
to these costs as wasted costs and to utility functions of this family as wasted cost
functions.

We model these wasted costs as being composed of two parts.
   • False positive costs ($cost_{FP}$). These are the costs incurred by repairing components
     that are not really faulty.
   • False negative costs ($cost_{FN}$). These are the overhead costs incurred by future
     repair actions.

It is clear why the false positive costs are wasted costs: these are repair costs incurred
on repairing healthy components. The false negative costs are wasted costs because if one
knew upfront which components are faulty, then the optimal repair algorithm would repair all
these components in a single batch repair action, incurring no further overhead costs. Thus,
future overhead costs represent wasted costs.

We borrow the terminology of false positive and false negative from the machine learning
literature, but use it in a somewhat different manner. To explain this choice of terminology,
assume that positive and negative mean faulty and healthy components, respectively. Choosing
to repair a faulty component is regarded as a true positive, and not repairing a healthy
component is regarded as a true negative. Thus, the wasted costs incurred by repairing
healthy components are costs incurred due to false positives, and the wasted costs incurred
by not repairing a faulty component are costs incurred due to false negatives. While this is
not a perfect match in terminology, we believe that it helps clarify the underlying intention
of $cost_{FP}$ and $cost_{FN}$.
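
The quantities defined so far, namely the system repair likelihood (Section 3.1), the health state (Definition 5) and the k-HP selection rule (Section 4.1), can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, and each diagnosis is represented as a frozenset of its faulty components mapped to its likelihood.

```python
def system_repair(gamma, p):
    """SystemRepair(gamma): summed likelihood of diagnoses whose faulty set lies within gamma."""
    return sum(prob for faulty, prob in p.items() if faulty <= gamma)


def health_state(components, p):
    """Definition 5: F(c) is the summed likelihood of all diagnoses that contain c."""
    return {
        c: sum(prob for faulty, prob in p.items() if c in faulty)
        for c in components
    }


def k_hp(components, p, k):
    """k-HP: pick the k components with the highest health-state value F(c).

    No search is needed; the components are sorted once by F(c).
    Ties keep the input order because Python's sort is stable.
    """
    F = health_state(components, p)
    ranked = sorted(components, key=lambda c: F[c], reverse=True)
    return set(ranked[:k])


# Figure 1 example: two single-component diagnoses.
p = {frozenset({"A"}): 0.6, frozenset({"B"}): 0.4}
components = ["A", "B"]
print(system_repair({"A"}, p))       # likelihood that repairing only A fixes the system
print(system_repair({"A", "B"}, p))  # likelihood that repairing both fixes the system
print(k_hp(components, p, k=1))      # the single most suspicious component
```

Note that `k_hp` never looks at repair costs, which is the second disadvantage of k-HP mentioned above.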




The Wasted Cost Utility Function

For a given set of components $\gamma$, we denote by $cost_{FP}(\gamma)$ and
$cost_{FN}(\gamma)$ the false positive costs and false negative costs, respectively, incurred
by performing a batch repair action of repairing all the components in $\gamma$. Given
$cost_{FP}(\gamma)$ and $cost_{FN}(\gamma)$, we propose the following general formula for
computing the expected wasted costs, denoted by $C_{WC}$:

\[ C_{WC}(\gamma) = cost_{FP}(\gamma) + (1 - SystemRepair(\gamma)) \cdot cost_{FN}(\gamma) \]

The first term of the formula is the false positive costs. The second term is the false
negative costs, multiplied by the probability that the system will not be fixed by repairing
the components in $\gamma$. Thus, the formula gives the total expected wasted costs. We
define $u_{WC} = -C_{WC}$ as the wasted cost utility function.

The wasted cost utility function is a theoretical utility function, since one does not know
upfront the values of $cost_{FP}$ and $cost_{FN}$. Next, we propose several ways to estimate
$u_{WC}$ by proposing ways to estimate $cost_{FP}$ and $cost_{FN}$.

Estimating the False Positives Cost

We propose to estimate the false positive costs by considering the system's health state
(Definition 5), as follows.

\[ \widehat{cost}_{FP}(\gamma) = \sum_{c \in \gamma} (1 - F(c)) \cdot cost_c \]

This estimate of the false positive costs can be understood as an expectation over the false
positive costs. The cost of a repaired component $c \in \gamma$ is part of the false positive
costs only if $c$ is in fact healthy. The probability of this occurring is $(1 - F(c))$.
Thus, $(1 - F(c)) \cdot cost_c$ is the expected false positive cost due to repairing
component $c$.

False Negatives Cost

Correctly estimating $cost_{FN}$ is more problematic than $cost_{FP}$, as it requires
considering the future actions of the repair algorithm. In the best case, only one additional
repair action would be needed. This would incur a single additional overhead cost. We call
this the optimistic $cost_{FN}$, or simply $cost^{o}_{FN}$, which is equal to
$cost_{repair}$. The other extreme assumes that every component not repaired so far would be
repaired in its own repair action, each incurring an overhead cost. We experimented with a
slightly less extreme estimate, in which we assume that only faulty components will be
repaired in the future, but each will be repaired in its own repair action, incurring one
$cost_{repair}$ per faulty component. Since we do not know the number of faulty components,
we use the expected number of faulty components according to the health state:
$\sum_{c \notin \gamma} F(c)$. The resulting estimate, referred to as the pessimistic
estimate of $cost_{FN}$ and denoted by $cost^{p}_{FN}$, is thus computed as:

\[ cost^{p}_{FN}(\gamma) = cost_{repair} \cdot \sum_{c \notin \gamma} F(c) \]

Summarizing all the above, we propose two utility functions from the wasted cost utility
function family: a pessimistic wasted cost function, which uses $\widehat{cost}_{FP}$ and
$cost^{p}_{FN}$ to estimate $cost_{FP}$ and $cost_{FN}$, and an optimistic wasted cost
function, which uses $\widehat{cost}_{FP}$ and $cost^{o}_{FN}$. The corresponding repair
algorithms search in the combinatorial space of all possible sets of components to find the
set of components that maximizes $u_{WC}$.

4.3 Handling the Computational Complexity

The search space is very large: it is the power set of all components that were not repaired
so far. We explored two simple ways to handle this. The first approach is to only consider
subsets of components with up to $k$ components, where $k$ is a parameter. This approach is
referred to as Powerset-based search.

The second approach we considered is to consider only supersets of the diagnoses in $\Omega$.
The intuition is that at least one of these diagnoses is supposed to be true (according to
the known observation), and thus a repair algorithm should aim at fixing the problem in the
next repair action. Thus, in this approach, the search for the best repair action considers
every set of components that is a union of at most $k$ diagnoses, where $k$ is a parameter.
This approach is referred to as the Union-based search.

For both powerset-based search and union-based search, increasing $k$ results in a larger
search space. This means higher computational complexity, but also increases the range of
repair actions considered, and thus using a higher $k$ can potentially find better repair
actions than using lower $k$ values. This provides an often desired tradeoff of computation
vs. solution quality. Experimentally, we observed that the union-based search approach yields
much better results and thus we only show results for it in the experimental results below.

5 Experimental Results

We evaluated the proposed batch selection algorithms on two standard Boolean circuits: 74283
and 74182. We experimented on 21 observations for system 74283 and 23 observations for system
74182. These observations were selected randomly from Feldman et al.'s [13] set of
observations. For each observation, all subset minimal diagnoses were found using exhaustive
search.

5.1 Baseline Repair Algorithms

The main hypothesis of this line of work is that performing a batch repair action can save
repair costs. To evaluate if the proposed batch repair algorithms are able to do so, we
compare them with two repair algorithms that do not consider batch repair actions. These
baseline repair algorithms, named “Best Diagnosis” (BD) and “Highest Probability” (HP), are
inspired by previous work on test planning [14] and work as follows. BD chooses to repair a
single component from the most preferred diagnosis in $\Omega$ (that with the highest
$p(\cdot)$ value). From the set of components in the most probable diagnosis, BD chooses to
repair the one with the lowest repair costs. The HP repair algorithm chooses to repair the
component that is most likely to be faulty, as computed by the system's health state
($F(\cdot)$).

Another baseline repair algorithm we evaluated experimentally repairs all components of the
most likely diagnosis in a single batch repair action. Note that this algorithm, denoted
Batch Best Diagnosis, ignores repair costs, and serves as an extreme alternative to the BD
algorithm that repairs a single component from the most likely diagnosis.

Table 1 shows the average repair costs incurred until the system was fixed for the proposed
repair algorithms. The average was over all the observations we used for system 74182. The
rows labeled BD, HP, 2-HP, and 3-HP show the
                         Overhead cost
    Algorithm       10       15       20       25
    BD & HP       83.5    111.3    139.1    167.0
    2-HP          61.5     77.8     94.1    110.4
    3-HP          53.0     65.0     77.0     88.9
    Opt.(1)       55.2     68.9     82.6     96.3
    Opt.(2)       53.0     65.0     75.2     86.7
    Opt.(3)       55.2     66.5     72.6     83.7
    Pes.(1)       55.0     68.9     81.3     96.1
    Pes.(2)       52.8     59.8     63.7     70.0
    Pes.(3)       49.6     50.4     55.9     64.6

    Table 1: Average repair costs for the 74182 system.

                         Overhead cost
    Algorithm       10       15       20       25
    BD           116.4    155.2    194.0    232.9
    HP           109.3    145.7    182.1    218.6
    2-HP          81.2    102.1    123.1    144.0
    3-HP          70.5     85.7    101.0    116.2
    Opt.(1)       76.0     95.7    115.2    134.8
    Opt.(2)       72.9     89.8    102.4    111.7
    Pes.(1)       75.2     95.7    114.0    134.8
    Pes.(2)       72.4     84.8     93.6     96.0

    Table 2: Average repair costs for the 74283 system.

tomated troubleshooting were proposed in previous works. Heckerman et al. [1] proposed the
decision-theoretic troubleshooting (DTT) algorithm, which uses a decision theoretic approach
for deciding which components to observe in order to identify the faulty component. Later
work also applied a decision theoretic approach that integrated planning and diagnosis to a
real world troubleshooting application [3; 15]. Torta et al. [4] proposed using model
abstractions for troubleshooting while taking into account the cost of repair actions. None
of these works considered the possibility of repairing a set of components together; they
allow only repair actions that repair a single component at a time.

Our current paper on BRP does not consider applying further diagnostic actions such as
probing and testing, which are considered by previous troubleshooting algorithms. Thus, our
work on BRP could be integrated in previous troubleshooting frameworks so as to consider both
batch repair actions and diagnostic actions. This is left to future work.

Friedrich and Nejdl [2] discussed the relation between diagnoses and repair, in an effort to
minimize the breakdown costs. Breakdown costs roughly correspond to a penalty incurred for
every faulty output in the system, for every time step until the system is fixed. In BRP, the
goal is to minimize costs until the system is fixed, and there is no partial credit for
repairing only some of the system outputs.

results for the BD, HP, and k-HP repair algorithms (for k=2 and 3). The rows Opt.(1),
Opt.(2), and Opt.(3) show the results for the union-based search repair algorithm using the
wasted cost utility function with $\widehat{cost}_{FP}$ to estimate $cost_{FP}$ and
$cost^{o}_{FN}$ to estimate $cost_{FN}$. The rows Pes.(1), Pes.(2), and Pes.(3) show results
for the same configuration, except for using $cost^{p}_{FN}$ to estimate $cost_{FN}$ instead
of $cost^{o}_{FN}$. The repair cost of a single component was arbitrarily set to 5 and the
cost of the overhead ($cost_{repair}$) was varied (10, 15, 20, 25). Each column represents
results for a different value of $cost_{repair}$. In this domain, the results of HP and BD
were virtually the same, and thus we grouped them into a single row.

The results clearly show the benefit of considering batch repair actions. The best performing
repair algorithm is Pes.(3), which required between roughly 39% and 59% of the repair costs
needed for BD and HP (Table 1), which do not consider batch repair.

7 Conclusion and Future Work

We addressed the problem of troubleshooting with the possibility of performing a batch repair
action, that is, a repair action in which more than a single component is repaired. Batch
repair makes sense only if repairing a set of components in a single repair action is cheaper
than repairing each of them separately. We proposed several algorithms for selecting which
batch of components to repair. Experimental results clearly show the benefit of batch repair
over single repair actions, and the benefit of the algorithms we suggested for choosing this
set of components to repair. Future work will investigate when batch repair should be
considered, and how to detect such cases upfront. Additionally, expanding beyond Boolean
circuits is also needed, as well as addressing uncertainty on the outcome of repair actions.
This supports the main hypothesis of this paper: batch re-          References
pair actions can save significant amount of repair costs. As        [1] David Heckerman, John S Breese, and Koos Rom-
expected, the gain of batch repair actions increases as the             melse. Decision-theoretic troubleshooting. Commu-
repair overhead (costrepair ) increases. Also note that for             nications of the ACM, 38(3):49–57, 1995.
Pes.(k) we observe the desired trend of increasing k result-        [2] Gerhard Friedrich and Wolfgang Nejdl. Choosing ob-
ing in lower repair costs. This is also observed for the k-HP           servations and actions in model-based diagnosis/repair
repair algorithm (note that the HP algorithm is in fact 1-HP),          systems. KR, 92:489–498, 1992.
but is not always the case for Opt.(k), where for lower over-
head cost k = 2 yielded lower repair costs than k = 3. This         [3] Anna Pernestål, Mattias Nyberg, and Håkan Warn-
suggests that the optimistic estimate of costF N is not robust.         quist. Modeling and inference for troubleshooting with
Computationally, increasing k required much more runtime,               interventions applied to a heavy truck auxiliary brak-
and we could not run experiments with k = 4 on our cur-                 ing system. Engineering Applications of Artificial In-
rent machines in reasonable time. Table 2 shows the results             telligence, 25(4):705–719, June 2012.
for the 74283 system. The trends observed are the same as           [4] Gianluca Torta, Luca Anselma, and Daniele Theseider
those discussed above for the results of 74182 system.                  Dupré. Exploiting abstractions in cost-sensitive abduc-
                                                                        tive problem solving with observations and actions. AI
6 Related Work                                                          Commun., 27(3):245–262, 2014.
BRP is a troubleshooting problem, where the goal is to per-         [5] Marie-Odile Cordier, Yannick Pencolé, Louise Travé-
form repair actions so as to fix a system. Algorithms for au-           Massuyès, and Thierry Vidal. Self-healablity = diag-




                                                              117
                        Proceedings of the 26th International Workshop on Principles of Diagnosis


     nosability + repairability. In the International Work-
     shop on Principles of Diagnosis (DX), pages 251–258,
     2007.
[6] Roni Stern and Meir Kalech. Repair planning with
     batch repair. In International Workshop on Principles
     of Diagnosis (DX), 2014.
[7] Brian C Williams and Robert J Ragno. Conflict-
     directed A* and its role in model-based embedded sys-
     tems. Discrete Applied Mathematics, 155(12):1562–
     1595, 2007.
[8] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van
     Gemund. Simultaneous debugging of software faults.
     Journal of Systems and Software, 84(4):573–586,
     2011.
[9] O.J. Mengshoel, M. Chavira, K. Cascio, S. Poll,
     A. Darwiche, and S. Uckun. Probabilistic model-based
     diagnosis: An electrical power system case study. Sys-
     tems, Man and Cybernetics, Part A: Systems and Hu-
     mans, IEEE Transactions on, 40(5):874–885, 2010.
[10] Stuart J. Russell and Peter Norvig. Artificial Intelli-
     gence - A Modern Approach (3. internat. ed.). Pearson
     Education, 2010.
[11] Stefan Edelkamp and Stefan Schroedl. Heuristic
     search: theory and applications. Elsevier, 2011.
[12] Roni Stern, Meir Kalech, Shelly Rogov, and Alexan-
     der Feldman. How many diagnoses do we need? In
     AAAI, 2015.
[13] Alexander Feldman, Gregory Provan, and Arjan van
     Gemund. Approximate model-based diagnosis using
     greedy stochastic search. Journal of Artificial Intelli-
     gence Research (JAIR), 38:371, 2010.
[14] Tom Zamir, Roni Stern, and Meir Kalech. Using
     model-based diagnosis to improve software testing. In
     AAAI (to appear), 2014.
[15] Håkan Warnquist, Jonas Kvarnström, and Patrick Do-
     herty. Planning as heuristic search for incremental
     fault diagnosis and repair. In Scheduling and Planning
     Applications Workshop (SPARK) at the International
     Conference on Automated Planning and Scheduling
     (ICAPS), 2009.








       Formulating Event-Based Critical Observations in Diagnostic Problems

                                   Cody James Christopher1,2 and Alban Grastien2,1
                       1 Artificial Intelligence Group, The Australian National University.
                                        2 Optimisation Research Group, NICTA∗

   ∗ NICTA is funded by the Australian Government through the Department of Communications and the
     Australian Research Council through the ICT Centre of Excellence Program.

                           Abstract

     We claim that in scenarios involving a human operator with responsibility over systems being
     monitored by a diagnoser, presenting said operator with a concise set of observations
     capturing the essence of a failure improves the operator’s understanding of the diagnosis.
     We take this in the context of Discrete Event Systems and demonstrate how the idea can be
     applied to systems utilising event-based observations, which can contain implicit
     information. We introduce the notion of an abstracted event stream, called a sub-observation,
     that makes the implicit information explicit for the operator and allows a diagnoser to
     arrive at the same diagnosis. We call the most abstract of these the critical observation.
     We provide relevant definitions, properties, and a procedure for computing the critical
     observation in a diagnosis problem.

1 Introduction

Diagnosis problems are concerned with the detection and identification of occurrences of specific
events in a system, generally called faults or failures. These occurrences are difficult to detect
as the fault events are typically not directly observable; however, they can be inferred from the
system model (a description of the system behaviour) and the observations produced by the system.
   Diagnosis is the first step in the fault recovery process. Once a fault has been detected and
identified, the appropriate actions can be taken to mitigate its effects. The issue, however, is
that this procedure acts as a black box; given a model and a sequence of observations, a diagnoser
asserts a fault by claiming that there is no possible nominal execution of the system that would
produce the observation sequence.
   The present work is written under the assumption that a diagnosis procedure is fundamentally
built for a human operator in charge of taking actions after a fault is identified. In this
scenario, a black box approach does not allow for the presentation of the information relevant to
the diagnosis. We assume that providing the operators with explanatory evidence is useful in
convincing them of the validity of the diagnosis, in addition to providing information as to the
causes of the fault.
   Further, we assume that a more concise explanation is strictly preferred to a more verbose
explanation, and consequently that there is merit to isolating the “smallest” amount of supporting
evidence, or what we call the critical observations. In cognitive psychology, the seminal paper on
the topic of working memory in humans supports this view, giving the average working memory
capacity as 7 ± 2 distinct pieces of information [1]. Providing only the observations critical to
the diagnosis also has the additional benefit of ameliorating privacy concerns in systems where
privacy is considered important.
   We extend the results of Christopher et al. [2] to event-based observations. We first present
preliminary theory and notation, before going on to show that event-based observations contain
implicit information. We then introduce what we call sub-observations that can capture this
implicit information and make it available for use in diagnosis procedures. We then provide formal
definitions of sufficiency and criticality in addition to several important properties that allow
for a terminating algorithm. We present an algorithm for computing the critical observation and
discuss its complexity. A discussion of alternate ways of defining sub-observations precedes a
brief discussion of related work and a conclusion.

2 Preliminaries and Notations
The present work takes place in the context and standard framework of discrete event systems
(DES) [3]. We denote as $\Sigma$ the set of events that can take place on the system. A system run
is a finite sequence of events, $w = e_1 e_2 \dots e_k$, and the system is modeled as the
prefix-closed language $L_M \subseteq \Sigma^\star$ that represents all possible runs.
   The set of events is partitioned into observable events $\Sigma_o$ (events that are recorded)
and unobservable events $\Sigma_u$ (those that are not). The observation $o$ generated by run
$w = e_1 e_2 \dots e_k$, hereafter called the trace of $w$, is the projection of $w$ on the set of
observable events (i.e., all unobservable events of the run are deleted):

$$o = P_{\Sigma_o}(w) = \begin{cases}
  \varepsilon & \text{if } k = 0 \\
  e_1\, P_{\Sigma_o}(e_2 \dots e_k) & \text{if } k > 0 \text{ and } e_1 \in \Sigma_o \\
  P_{\Sigma_o}(e_2 \dots e_k) & \text{otherwise.}
\end{cases}$$

   The observed language of a trace $o$, denoted $L_o$, is the set of finite sequences of events
that could produce the observed sequence:
$L_o = P_{\Sigma_o}^{-1}(o) = \{w \in \Sigma^\star \mid P_{\Sigma_o}(w) = o\}$.
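
As a small illustration of the projection and of membership in the observed language, here is a minimal Python sketch. The function names `project` and `in_observed_language` and the sample events are assumptions made for this example only; events are represented as plain strings.

```python
from typing import Sequence, Set, Tuple


def project(run: Sequence[str], observable: Set[str]) -> Tuple[str, ...]:
    """P_{Sigma_o}(w): delete every unobservable event, keeping the order."""
    return tuple(e for e in run if e in observable)


def in_observed_language(run: Sequence[str], trace: Sequence[str],
                         observable: Set[str]) -> bool:
    """w is in L_o exactly when its projection equals the trace o."""
    return project(run, observable) == tuple(trace)


# Illustrative alphabet (an assumption for this sketch).
SIGMA_O = {"a", "b", "c", "d", "e"}
run = ["a", "f1", "b", "f2", "c"]   # f1 and f2 are unobservable here

print(project(run, SIGMA_O))
print(in_observed_language(run, ["a", "b", "c"], SIGMA_O))
```

The iterative comprehension computes the same result as the recursive case definition: the empty run projects to the empty trace, and each event is kept only if it is observable.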






  The set of unobservable events includes a subset of fault events, $\Sigma_f \subseteq \Sigma_u$.
With slight abuse of notation we write $f \in w$ as short for $w \in \Sigma^\star f \Sigma^\star$
(or “f appears in w”) and $F \cap w$ as short for $\{f \in F \mid f \in w\}$ (or “the subset of
events from F that appear in w”).
  A set $\delta \subseteq \Sigma_f$ of faults is consistent with the model $L_M$ and the trace $o$
if there exists a run $w \in L_M$ that would produce this trace ($P_{\Sigma_o}(w) = o$) and that
exhibits exactly these faults ($w \cap \Sigma_f = \delta$). The diagnosis of trace $o$, denoted
$\Delta(o)$, is the collection of all consistent sets of faults:

$$\Delta(o) = \{\, \delta \subseteq \Sigma_f \mid \exists w \in L_M .\ P_{\Sigma_o}(w) = o \wedge \delta = w \cap \Sigma_f \,\} \qquad (1)$$

   Hereafter we use the hat notation (ˆ) to indicate that the given symbol represents what actually
occurred. Given a run $\hat{w}$, $\hat{\delta} = \hat{w} \cap \Sigma_f$ is the set of faults that
occurred during the run; then the following result is trivial:
$\hat{w} \in L_M \Rightarrow \hat{\delta} \in \Delta(P_{\Sigma_o}(\hat{w}))$. (The premise,
completeness of the model, is assumed.)
   We find it more convenient to define the diagnosis in terms of emptiness of languages. Let
$L_\delta$ be the language that represents all sequences that contain exactly $\delta$:

$$L_\delta = \{w \in \Sigma^\star \mid w \cap \Sigma_f = \delta\} = \bigcap_{f \in \delta} \Sigma^\star f \Sigma^\star \;\cap \bigcap_{f \in \Sigma_f \setminus \delta} (\Sigma \setminus \{f\})^\star$$

  That is, $L_\delta$ represents the set of all runs containing all of the faults of $\delta$,
intersected with all possible runs where the faults not in $\delta$ never occur; the result is the
set of all runs where the only faults that occur are those in $\delta$. With $L_\delta$ defined, we
can equivalently express the diagnosis as an emptiness of languages problem:

$$\delta \in \Delta(o) \iff L_M \cap L_o \cap L_\delta \neq \emptyset. \qquad (2)$$

3 Sub-Observations
We first discuss event-based observations, and in particular that event-based observations contain
implicit information that must be taken into consideration when performing diagnosis. We then
introduce the notion of sub-observations, providing formal definitions and an explanatory example.
Once this has been established, a procedure is given for diagnosing with sub-observations.

3.1 Event-Based Diagnosis and Implicit Information
Event-based diagnosis, contrasted with state-based diagnosis, comes with a subtlety; specifically,
there is a type of implicit information encoded in the trace. Take for example the repeated
observation of a window being closed without there ever being an observation of the window opening;
in this case, the fact that we never observed an open event is distinctly relevant to a diagnosis
procedure.
   To further illustrate this, we provide a simple abstract example in the form of a DES: Take
$\Sigma = \{a, b, c, d, e, f_1, f_2\}$, with $\Sigma_o = \{a, b, c, d, e\}$,
$\Sigma_u = \Sigma_f = \{f_1, f_2\}$. We provide the system model in the form of a NFA in Figure 1
and consider some example traces over it:

                          [Figure 1: Example DES]

$o_1 = abababc$. The model specifies that $f_2$ must have occurred in strings containing a followed
by c. In this case, the intervening sequence is long (babab), and could be much longer. The
important information, however, is that a was at some point followed by c. Reporting in some
abstract sense that a was followed by c is enough to convince an operator of the correctness of the
diagnosis.
$o_2 = ababaa$. The model specifies that $f_1$ must have occurred for there to be two a events that
are not separated by another observable event. More specifically, the lack of an intervening event
is the crucial piece of information that determines the fault. In this case, reporting in some
abstract sense that multiple a occurred consecutively is enough to indicate the fault convincingly.

3.2 Framework
We first present a general framework for sub-observation, which is then further specified for our
particular choice of implementation.

General Definition
Definition 1 We define a framework for sub-observations as a tuple $\langle O, \preceq, sub \rangle$:
  1. A sub-observation, $\theta$, is an abstraction over a trace that represents an intentional
     relaxation (or weakening) of the concrete knowledge contained in the trace.
  2. $O$ is the space of possible sub-observations.
  3. The symbol $\preceq$ is a binary relation and partial order over $O$ and relates two
     sub-observations $\theta, \theta'$ such that $\theta' \preceq \theta$ iff $\theta'$ is a more
     abstracted form of $\theta$.
  4. $sub$ is an injective function, mapping traces to maximal (w.r.t. $\preceq$)
     sub-observations $\theta \in O$:
                                  $sub : \Sigma_o^{*} \to O$
A sub-observation $\theta$ implicitly represents the set of traces of which it is a more abstract
form:
                 $\psi(\theta) = \{o \in \Sigma_o^{*} \mid \theta \preceq sub(o)\}$
Therefore, $\theta' \preceq \theta \Rightarrow \psi(\theta') \supseteq \psi(\theta)$.

  The language of a sub-observation, denoted $L_\theta$, represents the set of all possible runs
$\theta$ could represent. However, these runs are already captured by $L_o$, and so $L_\theta$ can
be expressed as the union of the languages of the traces it is a more abstract form of:

$$L_\theta = \bigcup_{o \in \psi(\theta)} L_o \qquad (3)$$

Specific Definition
For the purposes of our specific definition of sub-observations, it is necessary to distinguish
between what we call hard and soft events. A hard event is a singleton observable event,
$x \in \Sigma_o$, and represents the firm occurrence of an






event in the system. A soft event is a subset of observable events, $y \subseteq \Sigma_o$, of
which any number (including zero) may have occurred, along with any number of unobservable events.
   We now explicitly characterize our construction of sub-observations based on the general
framework presented in Definition 1:
Definition 2 A sub-observation, $\theta$, is a strict time-ordered alternating sequence of soft and
hard events, commencing and ending with a soft event: $\theta = y_0 x_1 y_1 \dots x_n y_n$. We
denote $O(o)$ the space of sub-observations for a given trace $o$. $\theta \in O$ has length
$|\theta| = n$. For readability, sub-observations may occasionally be written as a comma separated
list. The language of $\theta$ can then also be expressed:

$$L_\theta = (y_0 \cup \Sigma_u)^{*}\, x_1\, (y_1 \cup \Sigma_u)^{*} \dots x_n\, (y_n \cup \Sigma_u)^{*}$$

   By way of example, take the sub-observation $\theta = (\{b, d\}, a, \emptyset, c, \{a\})$. In
this case, we say the singleton events $x_1 = a$ and $x_2 = c$ are hard and occurred in the
specified order. The first soft event, $y_0 = \{b, d\}$, represents the possibility of any number
of b or d events in any order having occurred before the first hard event. Similarly,
$y_1 = \emptyset$ indicates that no events occurred between the hard events $x_1$ and $x_2$, and
$y_2 = \{a\}$ that any number of a events could have occurred after the final hard event. There are
multiple traces $\hat{o}$ that this could represent: ac is the simplest, but so are traces such as
ddacaa or bac, and indeed infinitely many others (or finitely many, if the trace length is bounded).
Definition 3 The function $sub$ generates a sub-observation in $O$ from a given trace by inserting
empty soft events at the head of the trace, and after every hard event:
                   For $o = e_1 \dots e_n$,
                   $sub(o) = \emptyset x_1 \emptyset \dots x_n \emptyset \in O$,
                   where $\forall i : x_i = e_i$.

Definition 4 The relation $\preceq$ over $O$ is defined such that $\theta' \preceq \theta$ if and
only if there exists a mapping function $f$:

     Given $|\theta'| = n$, $|\theta| = m$,
     $f : \{0, \dots, n + 1\} \to \{0, \dots, m + 1\}$ such that
     $f(i) < f(i + 1)$, $f(0) = 0$, $f(n + 1) = m + 1$,
     $x'_i = x_{f(i)}$,
     $y'_i \supseteq \bigcup_{f(i) \le j \le f(i+1)-1} y_j \;\cup \bigcup_{f(i) < j < f(i+1)} x_j$

      [Figure 2: An example map satisfying $\preceq$. Upper row: {ac} b {cd} a {c} d {c} a ∅;
       lower row: {abcd} a {bcd} a ∅]

3.3 Diagnosis of Sub-Observations
We now formalize the usage of sub-observations in a diagnosis procedure by extending the procedure
introduced for event-based diagnosis presented in §2. This involves checking the consistency of a
set of possible faults.
   We therefore provide the construction of the diagnosis of $\theta$, $\Delta(\theta)$, the set of
faults consistent with a given sub-observation:
Definition 5 The diagnosis of a sub-observation $\theta$ is the union of the diagnoses of the
traces of which $\theta$ is a more abstract form, represented by $\psi(\theta)$ as given in
Definition 1:

$$\Delta(\theta) = \bigcup_{o \in \psi(\theta)} \Delta(o)$$

   From Definition 5 we note that, given $\hat{\delta} \in \Delta(\hat{o})$, if
$\theta \preceq sub(\hat{o})$ then $\hat{\delta} \in \Delta(\theta)$. That is, the actual diagnosis
$\hat{\delta}$ of the actual trace $\hat{o}$ will by definition be in $\Delta(\theta)$ if $\theta$
is an abstraction of $\hat{o}$.
   First, we observe the following lemma:
Lemma 3.1 The traces permitted by the language of a more abstracted sub-observation contain all the
permitted traces of all its ascendants:
                        $\theta' \preceq \theta \implies L_\theta \subseteq L_{\theta'}$
Proof This is a direct consequence of Equation 3.

   Equation 2 provided a formulation of the diagnosis as a question of emptiness in the
intersection of languages: is there some run that is simultaneously possible according to the
system model, the observations, and the faults that occurred during the run? This extends to a
similar question for sub-observations. As $L_\theta$ is defined in Definition 2, $\Delta(\theta)$
can be equivalently expressed:

$$\Delta(\theta) \equiv \{\delta \mid L_M \cap L_\delta \cap L_\theta \neq \emptyset\} \qquad (4)$$

     38.
[17] G. Lamperti, F. Vivenzi, and M. Zanella, “On subsumption, coverage, and relaxation of
     temporal observations in reuse-based diagnosis of discrete-event systems: a unifying
     perspective,” in 20th International Workshop on Principles of Diagnosis (DX-09), 2009,
     pp. 353–360.
[18] J. Kurien and P. Nayak, “Back to the future for consistency-based trajectory tracking,” in
     Conference on Artificial Intelligence, 2000, pp. 370–377.
[19] F. Cassez, J. Dubreil, and H. Marchand, “Synthesis of opaque systems with static and dynamic
     masks,” Formal Methods in System Design, vol. 40, no. 1, pp. 88–115, 2012.

8 Appendix
We provide proof sketches that will not be included in the final version of the paper.

Proof of Lemma 4.3
The proof is three-part:
 a) proving that the event-softening operation produces only children;
 b) proving that the collapse operation produces only children;
 c) proving that there is no other child.

Event-Softenings It is easy to see that $\theta_2 \overset{def}{=} es(\theta_1, i, e) \prec \theta_1$.
    Assume now that $\theta_2 \preceq \theta_3 \preceq \theta_1$ and let $f_{23}$ and $f_{31}$ be
the two mapping functions, as presented in Definition 4, used to verify the two ordering relations.
    By definition of $\preceq$, $|\theta_2| \le |\theta_3| \le |\theta_1|$. However, since
$|\theta_2| = |\theta_1|$ (by definition of event-softening), the sizes of all three
sub-observations are equal and $f_{23}$ and $f_{31}$ are both the identity function.
    As a consequence, $x^3_j = x^2_j = x^1_j$ for all $j$. Furthermore
$y^2_j \supseteq y^3_j \supseteq y^1_j$ for all $j$. In particular, if $j \neq i$, since
$y^2_j = y^1_j$, then $y^3_j = y^2_j = y^1_j$. For $i$, $y^2_i = y^1_i \cup \{e\}$, meaning that
either $y^3_i = y^2_i$ or $y^3_i = y^1_i$.
    Therefore either $\theta_3 = \theta_2$ or $\theta_3 = \theta_1$.
Collapse Similarly, it is easy to see that $\theta_2 \overset{def}{=} coll(\theta_1, i) \prec \theta_1$.
   Again assume that $\theta_2 \preceq \theta_3 \preceq \theta_1$ and let $f_{23}$ and $f_{31}$ be
the functions defined as before.
   The size of $\theta_3$ now either equals that of $\theta_2$ or $\theta_1$; let
$\ell \in \{1, 2\}$ denote the index such that $|\theta_3| = |\theta_\ell|$. Notice that either
$f_{23}$ or $f_{31}$ is the identity function.

$f(i)+1$ (such an index exists if the two sizes differ). If $y_{i+1} \setminus y_i \neq \emptyset$,
then let $\theta'' = es(\theta, i, e)$ (where $e \in y_{i+1} \setminus y_i$) be the sub-observation
obtained by softening $y_i$ with $e$; then, $\theta' \prec \theta'' \prec \theta$. Similarly if
$y_i \supseteq y_{i+1}$ with $\theta'' = es(\theta, i + 1, e)$ (where
$e \in y_i \setminus y_{i+1}$). Lastly the same applies if $y_i = y_{i+1}$ with
$\theta'' = coll(\theta, i)$.
   If $\theta$ and $\theta'$ have the same size, then all $x'_i$ equal the corresponding $x_i$, and
all the $y'_i$ are supersets of the corresponding $y_i$. Let $i$ be an index such that
$y'_i \neq y_i$ (if no such index exists, then $\theta' = \theta$). Let
$\theta'' = es(\theta, i, e)$ where $e \in y'_i \setminus y_i$. Then
$\theta' \prec \theta'' \prec \theta$.

Complexity of FINDCRITICALOBSERVATION
We show that the number of $\Delta(\cdot)$ calls in FINDCRITICALOBSERVATION could be in the order
of $\frac{n^2 m^2}{4}$ where $n$ is the length of the trace and $m$ the number of observable events.

      [Figure 4: Example of a system: a fault is diagnosed if there are more cs than bs after the
       occurrence of the last $a_i$ ($A$ stands for $\{a_1, \dots, a_{m-2}\}$).]

   We use the example of Figure 4 which involves faulty event $f$ and observable events
$\{a_1, \dots, a_{m-2}, b, c\}$. Consider the trace of (odd) length $n$:

$$\hat{o} = \underbrace{a_1 \dots a_1}_{n/2}\ \underbrace{bc \dots bc}_{n/2}\ c.$$

   Clearly the trace reveals a faulty system since the number of cs exceeds the number of bs in
this instance. The critical observation here is:

$$\Sigma_o\, a_1\, \{c\}\, b\, \{c\}\, c\, \{c\} \dots \{c\}\, b\, \{c\}\, c\, \{c\}\, c\, \Sigma_o,$$

i.e., all the second half of the trace needs to be kept.
   We assume that FINDCRITICALOBSERVATION always tries to perform event-softening from the end of
the sub-observation first, and only tries to collapse when no softening is possible. Neglecting the
first steps where the c softenings are successful, the algorithm will need to make
$U = \frac{n}{2} \times (m - 1)$ calls to $\Delta(\cdot)$, unsuccessfully trying to soften the
second half of the sub-observation. The number of successful softenings however is
$S = \frac{n}{2} \times m$ (all the first half of the sub-observation), meaning that the number of
$\Delta(\cdot)$ calls will be at least $U \times S = \frac{n^2 m(m-1)}{4}$ calls.
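
To make the counting argument concrete, the following minimal Python sketch evaluates U, S and their product for given n and m, and checks the product against the closed form. The function name `delta_call_lower_bound` is an assumption of this sketch, and exact fractions are used because the text treats n/2 loosely for odd n.

```python
from fractions import Fraction


def delta_call_lower_bound(n: int, m: int) -> Fraction:
    """Return U * S, the lower bound on the number of Delta(.) calls.

    n is the trace length and m the number of observable events.
    """
    half = Fraction(n, 2)
    unsuccessful = half * (m - 1)   # U: failed softenings on the second half
    successful = half * m           # S: successful softenings on the first half
    return unsuccessful * successful


# The product agrees with the closed form n^2 * m * (m - 1) / 4.
for n, m in [(10, 4), (11, 3), (101, 12)]:
    assert delta_call_lower_bound(n, m) == Fraction(n * n * m * (m - 1), 4)

print(delta_call_lower_bound(10, 4))
```

For fixed m the bound grows quadratically in the trace length, and for fixed n roughly quadratically in the number of observable events, which matches the stated order of n²m²/4.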








                     A Framework For Assessing Diagnostics Model Fidelity

                                  Gregory Provan (1) and Alex Feldman (2)
                (1) Computer Science Department, University College Cork, Cork, Ireland
                                           e-mail: g.provan@cs.ucc.ie
                                  (2) PARC Inc., Palo Alto, CA 94304, USA
                                           e-mail: afeldman@parc.com


Abstract

“All models are wrong but some are useful” [1]. We address the problem of identifying which diagnosis models are more useful than others. Models are critical to diagnostics inference, yet little work exists on how to compare models. We define the role of models in diagnostics inference, propose metrics for models, and apply these metrics to a tank benchmark system. Given the many approaches possible for model metrics, we argue that only information-theoretic methods address how well a model mimics real-world data. We focus on some well-known information-theoretic modelling metrics, demonstrating the trade-offs that can be made on different models for a tank benchmark system.

1 Introduction
A core goal of Model-Based Diagnostics (MBD) is to accurately diagnose a range of systems in real-world applications. There has been significant progress in developing algorithms for systems of increasing complexity. A key area where further work is needed is scaling up to real-world models, as multiple-fault diagnostics algorithms are currently limited by the size and complexity of the models to which they can be applied. In addition, there is still a great need for defining metrics to measure diagnostics accuracy, and to measure the computational complexity of inference and of the models’ contribution to inference complexity.
   This article addresses the modeling side of MBD: we focus on methods for measuring the size and complexity of MBD models. We explore the role that diagnostics model fidelity can play in being able to generate accurate diagnostics. We characterise model fidelity and examine the trade-offs of fidelity and inference complexity within the overall MBD inference task.
   Model fidelity is a crucial issue in diagnostics [2]: models that are too simple can be inaccurate, yet highly detailed and complex models are expensive to create, have many parameters that require significant amounts of data to estimate, and are computationally intensive to perform inference on. There is an urgent need to incorporate inference complexity within modelling, since even relatively simple models, such as some of the combinational ISCAS-85 benchmark models, pose computational challenges to even the most advanced solvers for multiple-fault tasks. In addition, higher-fidelity models can actually perform worse than lower-fidelity models on real-world data, as can be explained using over-fitting arguments within a machine learning framework.
   To our knowledge, there is no theory within Model-Based Diagnostics that relates notions of model complexity, model accuracy, and inference complexity. To address these issues, we explore several of the factors that contribute to model complexity, as well as a theoretically sound approach for selecting models based on their complexity and diagnostics performance, i.e., their accuracy in diagnosing faults.
   Our contributions are as follows:
   • We characterise the task of selecting a diagnosis model of appropriate fidelity as an information-theoretic model selection task.
   • We propose several metrics for assessing the quality of a diagnosis model, and derive approximate versions of a subset of these metrics.
   • We use a dynamical systems benchmark model to demonstrate how the metrics assess models relative to the accuracy of diagnostics output based on using the models.

2 Related Work
This section reviews work related to our proposed approach.
   Model-Based Diagnostics: There is some seminal work on modelling principles within the Model-Based Diagnosis (MBD) community, e.g., [2; 3]; this early work adopts an approach based on logic or qualitative physics for model specification. However, this work provides no means for comparing models in terms of diagnostics accuracy. More recent work ([4]) provides a logic-based specification of model fidelity. There is also work specifying metrics for diagnostics accuracy, e.g., [5].
   However, none of this work defines precise metrics for computing both diagnostics accuracy and model complexity, and their trade-offs. This article adopts a theoretically well-founded approach for integrating multiple MBD metrics.
   Multiple Fidelity Modeling: There is limited work describing the use of models of multiple levels of fidelity. Examples of such work include [6; 7; 8]. In this article we focus on methods for evaluating multi-fidelity models and their impact on diagnostics accuracy, as opposed to developing methodologies for modelling at multiple levels of fidelity.
   Multiple-Mode Modeling: One approach to MBD is to use a separate model for every failure mode, rather than to






define a model containing all failure modes. Examples of this approach include [9; 10; 11; 12]. Note that this work does not specify metrics for computing both diagnostics accuracy and model complexity, or their trade-offs.
   Model Selection: The metrics that we adopt and extend have been used extensively to compare different models, e.g., [13]. The metrics are used to compare simulation performance of models only. In contrast, we extend this framework to examine diagnostics performance. In the process, we explore the use of multiple loss functions for penalising models, in addition to the standard penalty functions based on number of model parameters.
   Model-Order Reduction: Model-order reduction [14] aims to reduce the complexity of a model while limiting the performance losses of the reduced model. The reduction methods are theoretically well-founded, although they are highly domain-specific. In contrast to this approach, we assume a model-composition approach from a component library containing hand-constructed models of multiple levels of fidelity.

3 Diagnostics Modeling and Inference
This section formalises the notion of diagnostics model within the process of diagnostics inference. We first introduce the task, and then define it more precisely.

3.1 Diagnosis Task
Assume that we have a system $S$ that can operate in a nominal state, $\xi_N$, or a faulty state, $\xi_F$, where $\Xi$ is the set of possible states of $S$. We further assume that we have a discrete vector of measurements, $\tilde{Y} = \{\tilde{y}_1, \ldots, \tilde{y}_n\}$, observed at times $t = \{1, \ldots, n\}$, that summarizes the response of the system $S$ to control variables $U = \{u_1, \ldots, u_n\}$. Let $Y_\phi = \{y_1, \ldots, y_n\}$ denote the corresponding predictions from a dynamic (nonlinear) model, $\phi$, with parameter values $\theta$: this can be represented by $Y_\phi = \phi(x_0, \theta, \xi, \tilde{U})$, where $x_0$ signifies the initial states of the system at $t_0$.
   We assume that we have a prior probability distribution $P(\Xi)$ over the states $\Xi$ of the system. This distribution denotes the likelihood of the failure states of the system.
   We define a residual vector $R(\tilde{Y}, Y_\phi)$ to capture the difference between the actual and model-simulated system behaviour. An example of a residual vector is the mean-squared-error (MSE). We assume a fixed diagnosis task $T$ throughout this article, e.g., computing the most likely diagnosis, or a deterministic multiple-fault diagnosis.
   The classical definition of diagnosis is as a state estimation task, whose objective is to identify the system state that minimises the residual vector:

$$\xi^* = \operatorname*{argmin}_{\xi \in \Xi} R(\tilde{Y}, Y_\phi) \tag{1}$$

Since this is a minimisation task, we typically need to run multiple simulations over the space of parameters and modes to compute $\xi^*$. We can abstract this process as performing model-inversion, i.e., computing some $\xi^* = \phi^{-1}(x_0, \theta, \xi, \tilde{U})$ that minimises $R(\tilde{Y}, Y_\phi)$.
   During this diagnostics inference task, a model $\phi$ can play two roles: (a) simulating a behaviour to estimate $R(\tilde{Y}, Y_\phi)$; (b) enabling the computation of $\xi^* = \phi^{-1}(x_0, \theta, \xi, \tilde{U})$. It is clear that diagnostics inference requires a model that has good fidelity and is computationally efficient for performing these two roles.
   We generalise that notion to incorporate inference efficiency as well as accuracy. We can define an inference complexity measure as $C(\tilde{Y}, \phi)$. We can then define our diagnosis task as jointly minimising a function $g$ that incorporates the accuracy (based on the residual function) and the inference complexity:

$$\xi^* = \operatorname*{argmin}_{\xi \in \Xi} g\left(R(\tilde{Y}, Y_\phi),\, C(\tilde{Y}, \phi)\right). \tag{2}$$

Here $g$ specifies a loss or penalty function that induces a non-negative real-valued penalty based on the lack of accuracy and computational cost.
   In forward simulation, a model $\phi$, with parameters $\theta$, can generate multiple observations $\tilde{Y} = \{\tilde{y}_1, \ldots, \tilde{y}_n\}$. The diagnostics task involves performing the inverse operation on these observations. Our objective thus involves optimising the state estimation task over a future set of observations, $\tilde{Y} = \{\tilde{Y}_1, \ldots, \tilde{Y}_n\}$. Our model $\phi$ and inference algorithm $A$ have different performance based on $\tilde{Y}_i$, $i = 1, \ldots, n$: for example, [15] shows that both inference accuracy and inference time vary based on the fault cardinality. As a consequence, to compute $\xi^*$ we want to optimise the mean performance over future observations. This notion of mean performance optimisation has been characterised using the Bayesian model selection approach, which we examine in the following section.

3.2 Diagnosis Model
We specify a diagnosis model as follows:

Definition 1 (Diagnosis Model). We characterise a Diagnosis Model $\phi$ using the tuple $\langle V, \theta, \Xi, E \rangle$, where
   • $V$ is a set of variables, consisting of variables denoting the system state ($X$), control ($U$), and observations ($Y$).
   • $\theta$ is a set of parameters.
   • $\Xi$ is a set of system modes.
   • $E$ is a set of equations, with a subset $E_\xi \subseteq E$ for each mode $\xi \in \Xi$.
   We will assume that we can use a physics-based approach to hand-generate a set $E$ of equations to specify a model. Obtaining good diagnostics accuracy, given a fixed $E$, entails estimating the parameters $\theta$ to optimise that accuracy.

3.3 Running Example: Three-Tank Benchmark
In this paper, we use the three-tank system shown in Fig. 1 to illustrate our approach. The three tanks are denoted as $T_1$, $T_2$, and $T_3$. Each tank has the same area $A_1 = A_2 = A_3$. For $i = 1, 2, 3$, tank $T_i$ has height $h_i$, a pressure sensor $p_i$, and a valve $V_i$ that controls the flow of liquid out of $T_i$. We assume that gravity $g = 10$ and the liquid has density $\rho = 1$.
   Tank $T_1$ gets filled from a pipe, with measured flow $q_0$. Using Torricelli’s law, the model can be described by the following non-linear equations:

$$\frac{dh_1}{dt} = \frac{1}{A_1}\left[-\kappa_1 \sqrt{h_1 - h_2} + q_0\right], \tag{3}$$
$$\frac{dh_2}{dt} = \frac{1}{A_2}\left[\kappa_1 \sqrt{h_1 - h_2} - \kappa_2 \sqrt{h_2 - h_3}\right], \tag{4}$$
$$\frac{dh_3}{dt} = \frac{1}{A_3}\left[\kappa_2 \sqrt{h_2 - h_3} - \kappa_3 \sqrt{h_3}\right]. \tag{5}$$
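As an illustration only (it is not part of the experiments reported in this paper), the three-tank equations (3) to (5) can be integrated numerically and the MSE residual of Section 3.1 evaluated on the resulting pressure traces. The sketch below is a minimal Python example under stated assumptions: the forward-Euler integrator, the step size, the unit values chosen for the κi, Ai and q0, the sign-preserving square root used when a downstream tank is higher than its upstream neighbour, and all function names are our own illustrative choices and do not come from the benchmark itself.

```python
import math

G = 10.0  # gravity constant from the text, used in p_i = g * h_i


def signed_sqrt(x):
    """Square root that keeps the sign of x, so a flow can reverse (assumption)."""
    return math.copysign(math.sqrt(abs(x)), x)


def derivatives(h, q0, kappa, area):
    """Right-hand sides of equations (3)-(5) for heights h = (h1, h2, h3)."""
    h1, h2, h3 = h
    k1, k2, k3 = kappa
    a1, a2, a3 = area
    q12 = k1 * signed_sqrt(h1 - h2)    # flow from T1 to T2
    q23 = k2 * signed_sqrt(h2 - h3)    # flow from T2 to T3
    q3 = k3 * math.sqrt(max(h3, 0.0))  # outflow from T3
    return ((q0 - q12) / a1, (q12 - q23) / a2, (q23 - q3) / a3)


def simulate(h0, q0, kappa, area, dt=0.01, steps=1000):
    """Forward-Euler integration; returns a list of (h1, h2, h3) triples."""
    h = tuple(h0)
    trajectory = [h]
    for _ in range(steps):
        dh = derivatives(h, q0, kappa, area)
        # Heights are clamped at zero: a tank cannot hold a negative level.
        h = tuple(max(hi + dt * dhi, 0.0) for hi, dhi in zip(h, dh))
        trajectory.append(h)
    return trajectory


def pressures(trajectory, g=G):
    """Map heights to the observed pressures p_i = g * h_i."""
    return [tuple(g * hi for hi in h) for h in trajectory]


def mse(observed, predicted):
    """Mean-squared-error residual between two equally sized traces."""
    count = sum(len(row) for row in observed)
    total = sum((o - p) ** 2
                for ro, rp in zip(observed, predicted)
                for o, p in zip(ro, rp))
    return total / count


if __name__ == "__main__":
    area = (1.0, 1.0, 1.0)   # assumed tank areas
    start = (3.0, 2.0, 1.0)  # steady state for kappa = (1, 1, 1), q0 = 1
    model = pressures(simulate(start, 1.0, (1.0, 1.0, 1.0), area))
    # Hypothetical "measured" trace: same system with a smaller kappa_1.
    measured = pressures(simulate(start, 1.0, (0.5, 1.0, 1.0), area))
    print("MSE residual:", mse(measured, model))
```

With κ1 = κ2 = κ3 = 1, A1 = A2 = A3 = 1 and q0 = 1, the heights (3, 2, 1) are a fixed point of equations (3) to (5), which gives a simple sanity check for the derivative function. A stiffer or more accurate integrator could be substituted without changing the residual computation.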





Figure 1: Diagram of the three-tank system (diagram labels: inflow q0; valves V1, V2, V3; pressure sensors p1*, p2*, p3*).

   In eq. 3, the coefficient $\kappa_1$ denotes a parameter that captures the product of the cross-sectional area of the tank $A_1$, the area of the drainage hole, a gravity-based constant ($\sqrt{2g}$), and the friction/contraction factor of the hole. $\kappa_2$ and $\kappa_3$ can be defined analogously.
   Finally, the pressure at the bottom of each tank is obtained from the height: $p_i = g\,h_i$, where $i$ is the tank index ($i \in \{1, 2, 3\}$).
   We emphasize the use of the $\kappa_i$, $i = 1, 2, 3$, because we will use these parameter values as a means for “diagnosing” our system in terms of changes in $\kappa_i$. Consider a physical valve $R_1$ between $T_1$ and $T_2$ that constrains the flow between the two tanks. We can say that the valve changes proportionally the cross-sectional drainage area of $q_1$ and hence $\kappa_1$. The diagnostic task will be to compute the true value of $\kappa_1$, given $p_1$, and from $\kappa_1$ we can compute the actual position of the valve $R_1$.
   We now characterise our nominal model in terms of Definition 1:
  • variables $V$ consist of variables denoting the system state ($X = \{h_1, h_2, h_3\}$), control ($U = \{q_0, V_1, V_2, V_3\}$), and observations ($Y = \{p_1, p_2, p_3\}$).
  • $\theta = \{\{A_1, A_2, A_3\}, \{\kappa_1, \kappa_2, \kappa_3\}\}$ is the set of parameters.
  • $\Xi$ consists of a single nominal mode.
  • $E$ is a set of equations, given by equations 3 through 5.
   Note that this model has a total of 6 parameters.
   Fault Model: In this article we focus on valve faults, where a valve can have a blockage or a leak. We model this class of faults by including in equations 3 to 5 an additive parameter $\beta$, which is applied to the parameter $\kappa$, i.e., as $\kappa_i(1 + \beta_i)$, $i = 1, 2, 3$, where $-1 \leq \beta_i \leq \frac{1}{\kappa_i} - 1$. $\beta > 0$ corresponds to a leak, such that $\beta \in (0, 1/\kappa - 1]$; $\beta < 0$ corresponds to a blockage, such that $\beta \in [-1, 0)$. The fault equations can be written as:

$$\frac{dh_1}{dt} = \frac{1}{A_1}\left[-\kappa_1 (1 + \beta_1) \sqrt{h_1 - h_2} + q_0\right], \tag{6}$$
$$\frac{dh_2}{dt} = \frac{1}{A_2}\left[\kappa_1 (1 + \beta_1) \sqrt{h_1 - h_2} - \kappa_2 (1 + \beta_2) \sqrt{h_2 - h_3}\right],$$
$$\frac{dh_3}{dt} = \frac{1}{A_3}\left[\kappa_2 (1 + \beta_2) \sqrt{h_2 - h_3} - \kappa_3 (1 + \beta_3) \sqrt{h_3}\right].$$

The fault equations allow faults for any combination of the valves $\{V_1, V_2, V_3\}$, resulting in system modes $\Xi = \{\xi_N, \xi_1, \xi_2, \xi_3, \xi_{12}, \xi_{13}, \xi_{23}, \xi_{123}\}$, where $\xi_N$ is the nominal mode, and $\xi_\cdot$ is the mode where $\cdot$ denotes the combination of valves (taken from a combination of $\{1, 2, 3\}$) which are faulty. This fault model has 9 parameters.

4 Modelling Metrics
This section describes the metrics that can be applied to estimate properties of a diagnosis model. We describe two types of metrics, dealing with accuracy (fidelity) and complexity.

4.1 Model Accuracy
Model accuracy concerns the ability of a model to mimic a real system. From a diagnostics perspective, this translates to the use of a model to simulate behaviours that distinguish nominal and faulty behaviours sufficiently well that appropriate fault isolation algorithms can identify the correct type of fault when it occurs. As such, a diagnostics model needs to be able to simulate behaviours for multiple modes with “appropriate” fidelity.
   Note that we distinguish model accuracy from diagnosis inference accuracy. As noted above, model accuracy concerns the ability of a model to mimic a real system through simulation, and to assist in diagnostics isolation. Diagnosis inference accuracy concerns being able to isolate the true fault given an observation and the simulation output of a model.
   A significant challenge for a diagnosis model is the need to simulate behaviours for multiple modes. Two approaches that have been taken are to use a single model with multiple modes explicitly defined (a multi-mode approach), or to use multiple models [9; 16; 17], each of which is optimised for a single or small set of modes (a multi-model approach).
   The AI-based MBD approach typically uses a single model $\phi$ with multiple modes explicitly defined [18], or a single model with just nominal behaviour [19]. From a diagnostics perspective, accuracy must be defined with respect to the task $T$. We adopt here the task of computing the most-likely diagnosis.
   Given evidence suggesting that model fidelity for a multi-mode approach varies depending on the mode, it is important to explicitly consider the mean performance of $\phi$ over the entire observation space $Y$ (the space of possible observations of the system).
   In this article we adopt the expected residual approach, i.e., given a space $Y = \{\tilde{Y}_1, \ldots, \tilde{Y}_n\}$ of observations, the expected residual is the average over the $n$ observations, e.g., as given by: $\bar{R} = \frac{1}{n} \sum_{i=1}^{n} R(\tilde{Y}_i, Y_\phi)$.

4.2 Model Complexity
At present, there is no commonly-accepted definition of model complexity, whether the model is used purely for simulation or if it is used for diagnostics or control. Defining the complexity of a model is inherently tricky, due to the number of factors involved.
   Less complex models are often preferred either due to their low computational simulation costs [20], or to minimise model over-fitting given observed data [21; 22]. Given the task of simulating a variable of interest conditioned by certain future values of input (control) variables, overfitting can lead to high uncertainty in creating accurate simulations. Overfitting is especially severe when we have limited observation variables for generating a model representing the underlying process dynamics. In contrast, models with low






parameter dimensionality (i.e., fewer parameters) are considered less complex and hence are associated with low prediction uncertainty [23].
   Several approaches have been used, based on issues like (a) number of variables [24], (b) model structure [25], (c) number of free parameters [23], (d) number of parameters that the data can constrain [26], (e) a notion of model weight [27], or (f) type and order of equations for a non-linear dynamical model [14], where type corresponds to non-linear, linear, etc.; e.g., order for a non-linear model is such that a $k$-th order system has $k$-th derivatives in $E$.
   Factors that contribute to the true cost of a model include: (a) model-generation; (b) parameter estimation; and (c) simulation complexity, i.e., the computational expense (in terms of CPU-time and memory) needed to simulate the model given a set of initial conditions. Rather than try to formulate this notion in terms of the number of model variables or parameters, or a notion of model structural complexity, we specify model complexity in terms of a measure based on parameter estimation, and inference complexity, assuming a construction cost of zero.
   A thorough analysis of model complexity will need to take into consideration the model equation class, since model complexity is class-specific. For example, for non-linear dynamical models, complexity is governed by the type and order of equations [14]. In contrast, for linear dynamical models, which have only matrices and variables in equations (no derivatives), it is the order of the matrices that determines complexity. In this article, we assume that models are of appropriate complexity, and hence do not address model-order reduction techniques [14], which aim to generate lower-dimensional systems that trade off fidelity for reduced model complexity.

4.3 Diagnostics Model Selection Task
The model in this model selection problem corresponds to a system with a single mode. Given a space $\Phi$ of possible models, we can define this model selection task as follows:

$$\phi^* = \operatorname*{argmin}_{\phi \in \Phi} \left[ g_1\big(R(\tilde{Y}, Y_\phi)\big) + g_2\big(C(\tilde{Y}, \phi)\big) \right], \tag{7}$$

adopting the simplifying assumption that our loss function $g$ is additively decomposable.

4.4 Information-Theoretic Model Complexity
The Information-Theoretic (or Bayesian) model complexity approach, which is based on the model likelihood, measures whether the increased “complexity” of a model with more parameters is justified by the data. The Information-Theoretic approach chooses a model (and a model structure) from a set of competing models (from the set of corresponding model structures, respectively) such that the value of a Bayesian criterion is maximized (or prediction uncertainty in choosing a model structure is minimized).
   The Information-Theoretic approach addresses prediction uncertainty by specifying an appropriate likelihood function. In other words, it specifies the probability with which the observed values of a variable of interest are generated by a model. The marginal likelihood of a model structure, which represents a class of models capturing the same processes (and hence having the same parameter dimensionality), is obtained by integrating over the prior distribution of model parameters; this measures the prediction uncertainty of the model structure [28].
   Statistical model selection is commonly based on Occam’s parsimony principle (ca. 1320), namely that hypotheses should be kept as simple as possible. In statistical terms, this is a trade-off between bias (distance between the average estimate and truth) and variance (spread of the estimates around the truth).
   The idea is that by adding parameters to a model we obtain improvement in fit, but at the expense of making parameter estimates “worse” because we have less data (i.e., information) per parameter. In addition, the computations typically require more time. So the key question is how to identify how complex a model works best for a given problem.
   If the goal is to compute the likelihood of a given model $\phi(x_0, \theta, \xi, U)$, then $\theta$ and $U$ are nuisance parameters. These parameters affect the likelihood calculation but are not what we want to infer. Consequently, these parameters should be eliminated from the inference. We can remove nuisance parameters by assigning them prior probabilities and integrating them out to obtain the marginal probability of the data given only the model, that is, the model likelihood (also called integrative, marginal, or predictive likelihood). In equational form, this looks like:

$$P(Y \mid \phi) = \int_\theta \int_U P(Y \mid \phi, \theta, U)\, P(\theta, U \mid \phi)\, d\theta\, dU.$$

However, this multi-dimensional integral can be very difficult to compute, and it is typically approximated using computationally intensive techniques like Markov chain Monte Carlo (MCMC).
   Rather than try to solve such a computationally challenging task, we adopt an approximation to the multidimensional integral. In the statistics literature several decomposable approximations have been proposed.
   Spiegelhalter et al. [26] have proposed a well-known decomposable framework of this kind, termed the Deviance Information Criterion (DIC), which measures the number of model parameters that the data can constrain: $\mathrm{DIC} = \bar{D} + p_D$, where $\bar{D}$ is a measure of fit (expected deviance), and $p_D$ is a complexity measure, the effective number of parameters. The Akaike Information Criterion (AIC) [29; 30] is another well-known measure: $\mathrm{AIC} = -2L(\hat{\theta}) + 2k$, where $L(\hat{\theta})$ is the log-likelihood evaluated at $\hat{\theta}$, the Maximum Likelihood Estimate (MLE) of $\theta$, and $k$ is the number of parameters.
   To compensate for small sample size $n$, a variant of AIC, termed $\mathrm{AIC}_c$, is typically used:

$$\mathrm{AIC}_c = -2L(\hat{\theta}) + 2k + \frac{2k(k + 1)}{n - k - 1} \tag{8}$$

   Another computationally more tractable approach is the Bayesian Information Criterion (BIC) [31]: $\mathrm{BIC} = -2L(\hat{\theta}) + k \log n$, where $k$ is the number of estimable parameters, and $n$ is the sample size (number of observations). BIC was developed as an approximation to the log marginal likelihood of a model, and therefore, the difference between two BIC estimates may be a good approximation to the natural log of the Bayes factor. Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability. BIC assumes that the (parameters’) prior is the unit information prior (i.e., a multivariate normal prior with mean at the maximum likelihood estimate and variance equal to the expected information matrix for one observation).
   Wagenmakers [32] shows that one can convert the BIC






metric to

$$\mathrm{BIC} = n \log \frac{SSE}{SS_{total}} + k \log n,$$

where $SSE$ is the sum of squares for the error term. In our experiments, we assume that the non-linear model is the “correct” model (or the null hypothesis $H_0$), and either the linear or the qualitative model is the competing model (or alternative hypothesis $H_1$). Hence we use BIC to compare the non-linear model to each of the competing models.
   Suppose that we obtain the BIC values for the alternative and the correct models, using the relevant SS terms. When computing $\Delta \mathrm{BIC} = \mathrm{BIC}(H_1) - \mathrm{BIC}(H_0)$, note that both the null ($H_0$) and the alternative hypothesis ($H_1$) models share the same $SS_{total}$ term (both models attempt to explain the same collection of scores), although they differ with respect to $SSE$. The $SS_{total}$ term common to both BIC values cancels out in computing $\Delta \mathrm{BIC}$, producing

$$\Delta \mathrm{BIC} = n \log \frac{SSE_1}{SSE_0} + (k_1 - k_0) \log n, \tag{9}$$

where $SSE_1$ and $SSE_0$ are the sum of squares for the error terms in the alternative and the null hypothesis models, respectively.

   Fault Model: The fault model introduces a parameter $\beta_i$ associated with $\kappa_i$, i.e., we replace $\kappa_i$ with $\kappa_i(1 + \beta_i)$, $i = 1, 2, 3$, where $-1 \leq \beta_i \leq \frac{1}{\kappa_i} - 1$. This model has 7 parameters, adding parameters $\beta_1, \beta_2, \beta_3$.

Qualitative Model
Nominal Model: For the model we replace the non-linear sub-function $\sqrt{h_i - h_j}$ with the qualitative sub-function $M^+(h_i - h_j)$, where $M^+$ is the set of reasonable functions $f$ such that $f' > 0$ on the interior of its domain [34].
   The tank-heights are constrained to be non-negative, as are the parameters $\kappa_i$. As a consequence, we can discretize the $h_i$ to take on values $\{+, 0\}$, which means that $M^+(h_i - h_j)$ can take on values $\{+, 0, -\}$. The domain for $\frac{dh_1}{dt}$ must be $\{+, 0, -\}$, since the qualitative version of $q_0$, $Q$, is non-negative (domain of $\{+, 0\}$) and each $M^+(h_i - h_j)$ can take on values $\{+, 0, -\}$. We see that this model has no parameters to estimate.
   Fault Model:
   The qualitative fault model has different $M^+$ functions for the modes where the valve is passing and blocked. We derive these functions as follows. From a qualitative perspective, the domain of $\beta_i$ is $\{0, +\}$ for a passing valve, and $\{-, 0\}$ for a blocked valve. To create a new $M^+$ function for
                                                                    the cases of passing and blocked valve, we qualitatively ap-
5 Experimental Design                                               ply these corresponding domains to the standard M + func-
                                                                    tion with domain {-,0,+} to obtain fault-based M + func-
This section compares three tank benchmark models accord-           tions : MP+ (hi − hj ) denotes the M + function when the
ing to various model-selection measures. We adopt as our            valve is passing, and MB+ (hi − hj ) denotes the M + func-
“correct" model the non-linear model. We will examine the           tion when the valve is blocked.
fidelity and complexity tradeoffs of two simpler models over
a selection of failure scenarios.                                   5.2 Simulation Results
   The diagnostic task will be to compute the fault state
of the system, given an injected fault, which is one of             We have compared the simulation performance of the mod-
(ξN , ξB , ξP ), denoting nominal blocked and passing valves,       els under nominal and faulty conditions, considering faults
respectively. This translates to different tasks given the dif-     to individual valves V1 , V2 and V3 , as well as double-fault
ferent models.                                                      combinations of the valves. In the following we present
                                                                    some plots for simulations of faults and fault-isolation for
non-linear model estimate the true value of κ1 given p1 ,           different model types.
    which corresponds to a most-likely failure mode as-                Figure 2 shows the results from a single-fault scenario,
    signment of one of (ξN , ξB , ξP ).                             where valve V1 is stuck at 50%) at t = 250, based on the
linear model estimate the true value of κ1 given p1 , which         non-linear model. The plot from this simulation show that
     corresponds to a most-likely failure mode assignment           at the time of the fault injection, the water level in tank T1
     of one of (ξN , ξB , ξP ).                                     starts increasing while the water level at tanks T2 and T3
                                                                    start decreasing due to the lower inflow.
qualitative model estimate the failure mode assignment of
    one of (ξN , ξB , ξP ).                                                   200                                          p_1
                                                                                                                           p_2



5.1 Alternative Models
                                                                                                                           p_3

                                                                              150


This section describes the two alternative models that we                     100
compare to the non-linear model, a linear and a qualitative
model.                                                                         50



Linear Model                                                                    0

We compare the non-linear model with a linearised version.                          0   100   200
                                                                                                    time [s]
                                                                                                               300   400


We can perform this linearised process in a variety of ways
[33]. In this simple tank example, we can perform the lin-          Figure 2: Simulation with non-linear model for the scenario
earisation directly through replacement of non-linear and           of a fault in valve 1 at t = 250 s
linear operators, as shown below.
   Nominal Model We can linearise the the non-linear                  Table 1 shows the simulation error-difference between the
3-tank
p        model by replacing the non-linear sub-function             non-linear and linear models, for the nominal case and the
   hi − hj with the linear sub-function γij (hi − hj ), where       faulty case (where valve 1 is faulted). Given that we mea-
γij is a parameter (to be estimated) governing the flow be-         sure the pressure levels for p1 , p2 and p3 every second, we
tween tanks i and j. The linear model has 4 parameters,             use the difference in these outputs to identify the sum-of-
γ12 , γ12 , γ23 , γ3 .                                              squared-error (SSE) values for the simulations.
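The SSE bookkeeping described above, together with the ∆BIC comparison of Eq. (9), can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the code used in the experiments: the function names `sse`, `total_sse` and `delta_bic`, the toy traces, the parameter counts in the usage lines and the use of the natural logarithm are all assumptions introduced here.

```python
import math


def sse(reference, candidate):
    """Sum of squared differences between two equally sampled traces."""
    if len(reference) != len(candidate):
        raise ValueError("traces must have the same number of samples")
    return sum((r - c) ** 2 for r, c in zip(reference, candidate))


def total_sse(reference_outputs, candidate_outputs):
    """Add up the per-output SSE values (one entry per measured output)."""
    return sum(
        sse(reference_outputs[name], candidate_outputs[name])
        for name in reference_outputs
    )


def delta_bic(sse_alt, sse_null, k_alt, k_null, n):
    """Eq. (9): n * log(SSE1 / SSE0) + (k1 - k0) * log(n), natural log assumed."""
    return n * math.log(sse_alt / sse_null) + (k_alt - k_null) * math.log(n)


# Toy traces (hypothetical values, not data from the paper).
reference = {"p1": [1.0, 2.0, 3.0], "p2": [0.0, 1.0, 2.0]}
candidate = {"p1": [1.5, 2.0, 2.0], "p2": [0.0, 1.0, 4.0]}
print(total_sse(reference, candidate))  # 1.25 + 4.0 = 5.25

# Hypothetical comparison: the alternative model has twice the error
# but two fewer parameters, over n = 100 samples.
print(delta_bic(sse_alt=200.0, sse_null=100.0, k_alt=7, k_null=9, n=100))
```

Because ∆BIC is defined as BIC(H1) − BIC(H0), a positive value means the alternative model has the larger BIC; under the usual lower-is-better reading of BIC, the null model would then be preferred.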






                p1        p2        p3       Total
Nominal       2600.3     316.2     118.1    3034.6
V1-fault      2583.1     347.5     137.2    3067.8

Table 1: SSE values for simulations using non-linear and linear representations, given two scenarios: nominal and faulty (valve $V_1$ at 50% after 250 s)

   Figure 3 shows the results for diagnosing the $V_1$-fault using the non-linear model. We can see that the diagnostic accuracy is high, as $P(V_1)$ converges to almost 1 with little time lag.

Figure 3: Simulation of fault isolation of fault in valve 1 with non-linear model. The figure depicts the probability of valve 1 being faulty.

   In contrast, Figure 4 shows the diagnostic accuracy and isolation time with a linear model. First, note that there is a false-positive identified early in the simulation, and the model incorrectly identifies both valves 2 and 3 as being faulty. This linear model thus delivers both poor diagnostic accuracy (classification errors) and poor isolation time (there is a lag between when the fault occurs and when the model identifies the fault). After the fault injection at t = 250 s, the predictive accuracy improves and the correct fault becomes the most likely fault.

Figure 4: Simulation of fault isolation of fault in valve 1 with linear model. The figure depicts the probability of valves 1, 2 and 3 being faulty.

   Figure 5 depicts the diagnostic performance with a mixed linear/non-linear model ($T_1$ is non-linear, while $T_2$ and $T_3$ are linear). The diagnostic accuracy is almost the same as that of the non-linear model (cf. Figure 3), except for a false-positive detection at the beginning of the scenario.

Figure 5: Simulation of fault isolation of fault in valve 1 with mixed non-linear/linear model ($T_1$ non-linear and both $T_2$ and $T_3$ linear). The figure depicts the probability of valves 1, 2 and 3 being faulty.

6 Experimental Results

This section describes our experimental results, summarising the data first and then discussing the implications of the results.

6.1 Model Comparisons

We have empirically compared the diagnostics performance of several multi-tank models. In our first set of experiments, we ran a simulation over 500 seconds, and induced a fault (valve $V_1$ at 50%) after 250 s. The model combinations involved a non-linear (NL) model, a model (denoted M) with tank $T_1$ being linear (and other tanks non-linear), a fully linear model (denoted L), and a qualitative model (denoted Q).
   To compare the relative performance of the models, we compute a measure of diagnostics error (or loss), using the difference between the true fault (which is known for each simulation) and the computed fault. We denote the true fault existing at time t using the pair $(\omega, t)$; the computed fault at time t is denoted using the pair $(\hat{\omega}, \hat{t})$. The inference system that we use, LNG [35], computes an uncertainty measure associated with each computed fault, denoted $P(\hat{\omega})$. Hence, we define a measure of diagnostics error over a time window $[0, T]$ using

$$\gamma_1^{D} = \sum_{t=0}^{T} \sum_{\xi \in \Xi} \left| P(\hat{\omega}_t) - \omega_t \right|, \qquad (10)$$

where $\Xi$ is the set of failure modes for the model, and $\omega_t$ denotes $\omega$ at time t.
   Our second metric covers the fault latency, i.e., how quickly the model identifies the true fault $(\omega, t)$: $\gamma_2 = t - \hat{t}$.
   Table 2 summarises our results. The first columns compare the number of parameters for the different models, followed by comparisons of the error ($\gamma_1$) and the CPU-time ($\gamma_2$). The data show that the error ($\gamma_1$) does not grow very much as we increase model size, but it increases as we decrease model fidelity from non-linear through to qualitative models. In contrast, the CPU-time (a) increases as we increase model size, and (b) is proportional to model fidelity, i.e., it decreases as we decrease model fidelity from non-linear through to qualitative models.
   In a second set of experiments, we focused on multiple model types for a 3-tank system, with simulations running over 50 s, and we induced a fault (valve $V_1$ at 50%) after 25 s. The model combinations involved a non-linear (NL) model, a model with tank 3 linear (and other tanks non-linear), a model with tanks 2 and 3 linear and tank 1 non-linear, a fully linear model, and a qualitative model. Table 3 summarises our results.
   The data show that, as model fidelity decreases, the error $\gamma_1$ increases significantly and the inference times $\gamma_2$ decrease modestly. If we examine the outputs from AICc, we see that the best model is the mixed model ($T_3$-linear). BIC






indicates the qualitative model as the best; it is worth noting that BIC typically will choose the simplest model.

                 Tanks        2         3         4
# Parameters     NL           7         9        11
                 M            6         8        10
                 L            5         7         9
                 Q            2         3         4
γ1               NL         242       242       242
                 M          997      1076      1192
                 L         1236      1288      1342
                 Q         3859      3994      4261
γ2               NL       10.59      23.7      39.5
                 M         8.52     17.96      34.6
                 L         6.11     10.57      32.0
                 Q         4.64      7.31      26.4

Table 2: Data for 2-, 3-, and 4-tank models using Non-linear (NL), Mixed (M), Linear (L) and Qualitative (Q) representations

                      γ1        γ2      AICc      BIC
Non-Linear           0.97     23.7     29.45     43.7
T3-linear            3.12    17.96     26.77     42.9
T2, T3-linear       21.96    13.21     31.12    39.56
Linear              77.43    10.57     35.76    37.55
Qualitative        304.41     9.74     43.01    29.13

Table 3: Data for 3-tank model, using Non-linear, Mixed, Linear and Qualitative representations, given a fault (valve $V_1$ at 50%) after 25 s

6.2 Discussion

Our results show that MBD is a complex task with several conflicting factors.

  • The diagnosis error $\gamma_1$ is inversely proportional to model fidelity, given a fixed diagnosis task.
  • The error $\gamma_1$ increases with fault cardinality.
  • The CPU-time $\gamma_2$ increases with model size (i.e., number of tanks).

   This article has introduced a framework that can be used to trade off the different factors governing MBD "accuracy". We have shown how one can extend a set of information-theoretic metrics to combine these competing factors in diagnostics model selection. Further work is necessary to identify how best to extend the existing information-theoretic metrics to suit the needs of different diagnostics applications, as it is likely that the "best" model may be domain- and task-specific.
   It is important to note that we conducted experiments with un-calibrated models, and we have ignored the cost of calibration in this article. The literature suggests that linear models can be calibrated to achieve good performance, although performance inferior to that of calibrated non-linear models. This class of qualitative models does not possess calibration factors, so calibration will not improve their performance.

7 Conclusions

This article has presented a framework for evaluating the competing properties of models, namely fidelity and computational complexity. We have argued that model performance needs to be evaluated over a range of future observations, and hence we need a framework that considers the expected performance. As such, information-theoretic methods are well suited.
   We have proposed some information-theoretic metrics for MBD model evaluation, and conducted some preliminary experiments to show how these metrics may be applied. This work thus constitutes a start to a full analysis of model performance. Our intention is to initiate a more formal analysis of modeling and model evaluation, since there is no framework in existence for this task. Further, the experiments are only preliminary, and are meant to demonstrate how a framework can be applied to model comparison and evaluation.
   Significant work remains to be done, on a range of fronts. In particular, a thorough empirical investigation is needed on diagnostics modeling. Second, the real-world utility of our proposed framework needs to be determined. Third, a theoretical study of the issues of mode-based parameter estimation and its use for MBD is necessary.

References

[1] George EP Box. Science and statistics. J Am Stat Assoc, 71:791–799, 1976.
[2] Peter Struss. What's in SD? Towards a theory of modeling for diagnosis. Readings in model-based diagnosis, pages 419–449, 1992.
[3] Peter Struss. Qualitative modeling of physical systems in AI research. In Artificial Intelligence and Symbolic Mathematical Computing, pages 20–49. Springer, 1993.
[4] Nuno Belard, Yannick Pencolé, and Michel Combacau. Defining and exploring properties in diagnostic systems. System, 1:R2, 2010.
[5] Alexander Feldman, Tolga Kurtoglu, Sriram Narasimhan, Scott Poll, and David Garcia. Empirical evaluation of diagnostic algorithm performance using a generic framework. International Journal of Prognostics and Health Management, 1:24, 2010.
[6] Steven D Eppinger, Nitin R Joglekar, Alison Olechowski, and Terence Teo. Improving the systems engineering process with multilevel analysis of interactions. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 28(04):323–337, 2014.
[7] Sanjay S Joshi and Gregory W Neat. Lessons learned from multiple fidelity modeling of ground interferometer testbeds. In Astronomical Telescopes & Instrumentation, pages 128–138. International Society for Optics and Photonics, 1998.
[8] Roxanne A Moore, David A Romero, and Christiaan JJ Paredis. A rational design approach to gaussian process modeling for variable fidelity models. In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 727–740. American Society of Mechanical Engineers, 2011.






[9] Peter D Hanlon and Peter S Maybeck. Multiple-model adaptive estimation using a residual correlation Kalman filter bank. Aerospace and Electronic Systems, IEEE Transactions on, 36(2):393–406, 2000.
[10] Redouane Hallouzi, Michel Verhaegen, Robert Babuška, and Stoyan Kanev. Model weight and state estimation for multiple model systems applied to fault detection and identification. In IFAC Symposium on System Identification (SYSID), Newcastle, Australia, 2006.
[11] Amardeep Singh, Afshin Izadian, and Sohel Anwar. Fault diagnosis of Li-Ion batteries using multiple-model adaptive estimation. In Industrial Electronics Society, IECON 2013-39th Annual Conference of the IEEE, pages 3524–3529. IEEE, 2013.
[12] Amardeep Singh Sidhu, Afshin Izadian, and Sohel Anwar. Nonlinear Model Based Fault Detection of Lithium Ion Battery Using Multiple Model Adaptive Estimation. In World Congress, volume 19, pages 8546–8551, 2014.
[13] Aki Vehtari, Janne Ojanen, et al. A survey of bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228, 2012.
[14] Athanasios C Antoulas, Danny C Sorensen, and Serkan Gugercin. A survey of model reduction methods for large-scale systems. Contemporary mathematics, 280:193–220, 2001.
[15] Alexander Feldman, Gregory M Provan, and Arjan JC van Gemund. Computing observation vectors for max-fault min-cardinality diagnoses. In AAAI, pages 919–924, 2008.
[16] Amardeep Singh, Afshin Izadian, and Sohel Anwar. Nonlinear model based fault detection of lithium ion battery using multiple model adaptive estimation. In 19th IFAC World Congress, Cape Town, South Africa, 2014.
[17] Youmin Zhang and Jin Jiang. An interacting multiple-model based fault detection, diagnosis and fault-tolerant control approach. In Decision and Control, 1999. Proceedings of the 38th IEEE Conference on, volume 4, pages 3593–3598. IEEE, 1999.
[18] Peter Struss and Oskar Dressler. "Physical negation": integrating fault models into the general diagnostic engine. In IJCAI, volume 89, pages 1318–1323, 1989.
[19] Johan De Kleer, Alan K Mackworth, and Raymond Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2):197–222, 1992.
[20] Elizabeth H Keating, John Doherty, Jasper A Vrugt, and Qinjun Kang. Optimization and uncertainty assessment of strongly nonlinear groundwater models with high parameter dimensionality. Water Resources Research, 46(10), 2010.
[21] Saket Pande, Mac McKee, and Luis A Bastidas. Complexity-based robust hydrologic prediction. Water Resources Research, 45(10), 2009.
[22] G Schoups, NC Van de Giesen, and HHG Savenije. Model complexity control for hydrologic prediction. Water Resources Research, 44(12), 2008.
[23] S Pande, L Arkesteijn, HHG Savenije, and LA Bastidas. Hydrological model parameter dimensionality is a weak measure of prediction uncertainty. Natural Hazards and Earth System Sciences Discussions, 11, 2014.
[24] Martin Kunz, Roberto Trotta, and David R Parkinson. Measuring the effective complexity of cosmological models. Physical Review D, 74(2):023503, 2006.
[25] Gregory M Provan and Jun Wang. Automated benchmark model generators for model-based diagnostic inference. In IJCAI, pages 513–518, 2007.
[26] David J Spiegelhalter, Nicola G Best, Bradley P Carlin, and Angelika Van Der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.
[27] Jing Du. The "weight" of models and complexity. Complexity, 2014.
[28] Jasper A Vrugt and Bruce A Robinson. Treatment of uncertainty using ensemble methods: Comparison of sequential data assimilation and bayesian model averaging. Water Resources Research, 43(1), 2007.
[29] Hirotugu Akaike. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716–723, 1974.
[30] Hirotugu Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3–14, 1981.
[31] G. Schwarz. Estimating the dimension of a model. Ann. Statist., 6:461–466, 1978.
[32] Eric-Jan Wagenmakers. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804, 2007.
[33] Pol D Spanos. Linearization techniques for non-linear dynamical systems. PhD thesis, California Institute of Technology, 1977.
[34] Benjamin Kuipers and Karl Åström. The composition and validation of heterogeneous control laws. Automatica, 30(2):233–249, 1994.
[35] Alexander Feldman, Helena Vicente de Castro, Arjan van Gemund, and Gregory Provan. Model-based diagnostic decision-support system for satellites. In Proceedings of the IEEE Aerospace Conference, Big Sky, Montana, USA, pages 1–14, March 2013.








                         Posters












         A General Process Model: Application to Unanticipated Fault Diagnosis

                    Jiongqi WANG (1), Zhangming HE (2), Haiyin ZHOU (3) and Shuxing LI (1)

(1) College of Science, National University of Defense Technology, Changsha, Hunan, P. R. China
    email: wjq_gfkd@163.com
(2) Institute for Automatic Control and Complex Systems, University of Duisburg-Essen, Duisburg, Germany
    email: hezhangming2008@sina.com
(3) Beijing Institute of Control Engineering, Beijing, P. R. China
    email: gfkd_zhy@sina.com
(1) College of Science, National University of Defense Technology, Changsha, Hunan, P. R. China
    email: lishuxingok@163.com
Abstract

Improving the detection and diagnosis capability for unanticipated faults is a trend in the
research and application of fault diagnosis. In this paper, some notions and the basic principles
of unanticipated fault detection and diagnosis are given. A general process model for the
diagnosis of unanticipated faults is designed. It adopts a three-layer progressive structure
comprising an inherent detection layer, an unanticipated isolation layer and an unanticipated
recognition layer. Several key problems in the general process model are analyzed. The model and
methods proposed in this paper are purely data-driven, and they can detect and diagnose
unanticipated faults. The approach is evaluated on an example of a satellite’s attitude control
system, and excellent results have been obtained.

1   Introduction

At present, in the field of fault diagnosis, the great majority of proposed methods are based on
the premise of a perfect fault pattern database, and fault detection and diagnosis are carried out
for anticipated faults (AF) [1-3]. However, due to the high complexity and uncertainty of the
technical structure, the process environment, the working state of the system and so on, the
occurrence of some faults that cannot be anticipated in advance (unanticipated faults, UF) is
inevitable in practice [4]. A UF is not included in the anticipated fault database, and its
occurrence affects the normal operation of the system and may even lead to complete failure of the
system. Improving the unanticipated fault detection and diagnosis (UFDD) capability is a difficult
issue as well as a developing direction in the research and application of fault diagnosis [5-8].

Existing research has paid rather little attention to UF detection and diagnosis. Therefore, no
mature solution has taken shape for either the problem itself or its technical realization [9-12].
Most research on the UF focuses on recognizing and matching patterns based on the known fault
pattern database [13-14]. For example, Tom Brotherton and Tom Johnson (2001) [15] proposed a neural
network anomaly detector, which was essentially a single neural network classifier and could not
identify the UF. Z. H. Duan (2006) [16] proposed carrying out UF diagnosis with a particle filter
for incomplete patterns. As the transmission mechanism of the UF could not be obtained in advance,
the UF diagnosis could not be realized based on model inference. George Vachtsevanos et al. (2008)
[17] proposed a robust UF detection method; however, isolation of the UF could not be realized.
Furthermore, Z. M. He (2012) [18] proposed a one-class principal component analysis (OC-PCA)
method, which could only be used for systems with stable data in the normal pattern and did not
address UF diagnosis at all. The majority of currently published articles involve only UF
detection; fault isolation between the UF and the AF, as well as recognition (i.e. identification)
of the UF, has not yet been performed.

For an actual system, effects such as nonlinearity, uncertainty and external interference are
inevitable in operation, which makes it difficult to set up a precise model of the system.
Consequently, the application of fault detection and diagnosis methods based on model inference is
very limited [19-20]. With the development of sensor technology, the input and output data or the
system’s status under real-time monitoring are easier to obtain. The data are redundant, real-time
and reliable. As a result, the fault diagnosis approach of extracting information from data instead
of establishing a system model will play a positive role.

This paper proposes a data-driven fault diagnosis method for the UF. Following the fault diagnosis
process, a general process model (GPM) is put forward, which comprises an inherent detection layer
(IDL), an unanticipated isolation layer (UIL) and an unanticipated recognition layer (URL).
Firstly, according to the characteristics of the monitoring data, the corresponding residual
statistics are built and a detection criterion is provided for fault detection in the IDL.
Secondly, an angle similarity statistic is constructed on the basis of the fault feature direction,
and the isolation between the UF and the AF is realized in the UIL. Finally, in the URL, the UF is
recognized by means of the contribution factor. The method, being purely data-driven, is capable of
detection, isolation and recognition of the UF.

The paper is organized as follows. In Section 2, some notions and the basic principles of UF and
UFDD are discussed. A three-layer GPM for UFDD is introduced in Section 3. Section 4 analyzes some
key problems in the GPM and presents the corresponding solutions. In Section






5, performance evaluation of the proposed GPM and methods on the satellite’s attitude control
system is presented. Conclusions are drawn in Section 6.

2   Notions and Basic Principles for UFDD

2.1 Notion of UF

Faults can be divided into anticipated faults (AF) and unanticipated faults (UF).

   Explanation 1: An anticipated fault (AF) is a fault that has already been recognized by people
and exists in the fault pattern database together with the relevant monitoring data and the
processing strategy.

   Explanation 2: An unanticipated fault (UF) is a fault for which prior knowledge is lacking, with
no fault samples or only a few fault data. A UF does not exist in the fault pattern database, and
the corresponding elimination strategy for it has not been established.

A perfect fault pattern database should be a set including all AF patterns and UF patterns.
However, for objective reasons, acquiring a perfect fault pattern database is extremely difficult.
AFs rarely occur, and most faults that occur in actual operation are UFs [21]. At present,
detecting the UF, and moreover diagnosing it, is one of the most difficult issues in the field of
fault diagnosis, and it is also a great challenge for fault diagnosis technology.

2.2 Notion for UF Detection

   Explanation 3: UF detection is the process of judging whether a UF occurs.

The tasks of UF detection and AF detection are different. Both apply previous normal monitoring
data to train a discriminator, and the current monitoring data is then input into the discriminator
as testing data to judge whether the current status is a fault. UF detection, however, is carried
out after fault detection is complete, and it further judges whether the detected fault is a UF.
AF detection always assumes that all faults are anticipated. Consequently, if a UF occurs, it is
misjudged as some anticipated fault.

2.3 Notion for UF Diagnosis

   Explanation 4: UF diagnosis is the process of determining whether a UF occurs (i.e. UF
detection). In addition, UF diagnosis includes the isolation and the recognition of the UF after UF
detection is completed.

Compared with AF diagnosis, because prior knowledge of the UF is lacking, the mapping from fault
data to the faulty part cannot be found (essentially, the fault pattern is a function between fault
data and faulty part). Therefore, the key to UF diagnosis is to establish a cognition process
quickly. The cognition comprises the recognition of superficial data characteristics or the
recognition of the mapping from data to the physical layer. Being based on a purely data-driven
fault diagnosis method, this paper focuses on the recognition of superficial data characteristics.

3   General Process Model (GPM) for UFDD

By combining the notions and basic principles of the UF and the UFDD, this paper proposes a
multi-layer general process model (GPM) for UF diagnosis on the basis of a purely data-driven
method. The structure of the GPM is shown in Figure 1. The first layer is the IDL, which
establishes a detection discriminator for fault detection. The second layer is the UIL, which uses
the detection residual to establish a fault feature direction and builds an isolation discriminator
to realize the isolation of the AF and the UF. The third layer is the URL, which applies a
contribution factor to find the variable most relevant to the current UF and realizes the fault
recognition based on superficial data characteristics.

Figure 1 The GPM for UFDD

3.1 Inherent Detection Layer (IDL)

The first issue that a diagnosis system faces is normal/abnormal recognition for a feature vector
of the monitoring data. The task of the IDL is to determine whether the monitoring data is normal
or abnormal. The detection discriminator reflects the characteristics of the normal system. Given a
threshold, the testing data is input to the detection discriminator to judge whether a fault
exists. If the value of the discriminator is smaller than the given threshold, the system is
considered normal; otherwise, a fault is considered to have occurred. In that case the occurrence
time (fault time) and the feature direction of the fault (current fault direction) are determined,
and the testing data is passed to the UIL.

Essentially, the IDL is a single discriminator, which captures the characteristics of the system in
the normal pattern and completes the detection and discrimination of the testing data. Two key
problems are involved: the first is residual generation and the second is residual evaluation. The
specific techniques are given in Section 4.1.
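The IDL decision rule described in Section 3.1 can be illustrated with a minimal sketch. It assumes that the discriminator values and the residuals have already been computed for each sample (their construction is the subject of Section 4.1); the function name `idl_detect` and its arguments are hypothetical and are not taken from the paper.

```python
from typing import List, Optional, Sequence, Tuple


def idl_detect(
    stat_values: Sequence[float],
    residuals: Sequence[Sequence[float]],
    threshold: float,
) -> Optional[Tuple[int, List[float]]]:
    """Hypothetical IDL decision rule.

    A sample is treated as normal while its discriminator value is smaller
    than the threshold. The first sample that is not smaller is reported as
    the fault time, together with its residual as the current fault
    direction. None is returned when every sample looks normal.
    """
    for t, (value, residual) in enumerate(zip(stat_values, residuals)):
        if value >= threshold:
            # Fault time and current fault direction, to be handed to the UIL.
            return t, list(residual)
    return None
```

With discriminator values `[0.5, 0.8, 3.2, 4.0]` and a threshold of `3.0`, the sketch reports index 2 as the fault time. It deliberately ignores the persistence requirement that Section 4.1 adds to suppress false alarms.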






3.2 Unanticipated Isolation Layer (UIL)

The task of the UIL is to perform the isolation between the UF and the AF. After a fault is
detected, it is judged to be either an AF or a UF. If it is an AF, it is classified as one of the
known AF patterns. All AF patterns are saved in the AF pattern database. The isolation
discriminator matches the feature of the current fault pattern against those of all AF patterns in
turn, so as to realize the isolation between the UF and the AF. If the feature of the current fault
cannot be matched with any AF pattern, a UF has occurred, and the testing data is passed to the
URL. The key problem of the UIL lies in establishing an isolator and designing an isolation
criterion. The specific techniques are given in Section 4.2.

3.3 Unanticipated Recognition Layer (URL)

The task of the URL is to perform online learning and analysis of the UF data, so as to generate
the fault pattern. The function of the URL is to learn and summarize the pattern found in the
unknown pattern. Unlike for the AF, it is difficult to find the mapping from the fault data to the
faulty part for the UF. Therefore, the key point of recognition lies in establishing the
correspondence between the data and the unknown fault. Due to insufficient knowledge of the UF and
the lack of historical information and prior knowledge, it is usually more difficult to establish
the mapping on the physical layer. This paper focuses on UF recognition on the superficial data
layer. According to the contribution factor, the variable most relevant to the current UF can be
found, which completes the UF recognition. The specific techniques are given in Section 4.3.

4   Some Key Problems in GPM

In the above section, a basic framework for UF diagnosis is provided. The task of UF diagnosis is
to detect, isolate and recognize the UF. Detection is the starting point of fault diagnosis, and
its target is to judge whether the UF occurs; isolation is the core of fault diagnosis; and
recognition is the end point of fault diagnosis. Additionally, recognition is also the starting
point of fault-tolerant control (fault processing). The specific techniques for detecting,
isolating and recognizing the UF are given below.

4.1 Detection Statistic Construction

As Section 3 shows, the basic task of the IDL is to judge whether the testing data is normal. If a
fault exists, its occurrence time and feature direction are determined at the same time. The key
point of the IDL lies in residual generation and residual evaluation. The detection statistic is
established from the residual, and fault detection is performed according to the given criterion.
Different residual generation approaches exist for different monitoring data, including simple
$T^2$ detection [18, 22], baseline data smoothing detection [23], and time-series modeling and
prediction detection [24-25].

The characteristics of the monitoring system and the monitoring data determine the choice of
detection method. The simple $T^2$ statistic detection is applied to stable data [22]. The baseline
data smoothing detection is suitable for systems for which baseline data can be obtained; its
calculation amount is small, its detection speed is fast, and its detection effect is the best
[23]. The time-series modeling prediction is suitable for systems with continuous output and
without input; it is also suitable for iterative updating of the pattern, while its drawback is
that the prediction horizon is short [25]. In practical applications, the detection method is
selected according to these characteristics.

Besides, the three methods analyzed above consider only the characteristics of the output data.
However, for some systems (such as the satellite’s attitude control system), the object of fault
detection always comprises the control input as well as the measured output, and the control input
has a certain response relationship with the measured output. When there is no baseline training
data, an input-output system identification method is needed to find a model structure for the
system, so that fault detection on both the control input and the measured output is performed in
the IDL.

Assume that $(U_{n-1}, Y_{n-1}) \in (R^{(n-1) \times p}, R^{(n-1) \times m})$ are respectively the
system input and the system output before the $n$th time period. They are taken as the training
data, and $(u_n, y_n) \in (R^{1 \times p}, R^{1 \times m})$ is the current testing data. The
purpose of training is to find the model structure of the system, usually according to the rule

$$ \min_f \left\| Y_{n-1} - f(U_{n-1}) \right\| \qquad (1) $$

Let $\hat{Y}_{n-1} = f(U_{n-1})$ be the tendency term,
$\tilde{Y}_{n-1} = Y_{n-1} - \hat{Y}_{n-1} = Y_{n-1} - f(U_{n-1})$ the residual term,
$\hat{y}_n = f(u_n, U_{n-1}, Y_{n-1})$ the one-step prediction, and $r_n = y_n - \hat{y}_n$ the
prediction residual. The key point of the minimization problem in (1) is then to construct the
function $f$ between the system input and the system output.

If a mathematical model of the system equation can be obtained from the physical mechanism, the
estimation of $f$ can be converted into parameter estimation (Gray-Box Model); if there is no
physical background, $f$ can be estimated only from experiments and system identification
(Black-Box Model). Common linear black-box models comprise an autoregressive model (AR Model) with
external input, an autoregressive moving average model (ARMA Model) with external input, an output
error model (OE Model), a Box-Jenkins model (BJ Model) and a prediction error minimized model (PEM
Model); common nonlinear black-box models comprise a nonlinear autoregressive moving average model
(NLARMA Model) and a nonlinear Hammerstein-Wiener model (NLHW Model) [26-29] with external input.

After the prediction residual is obtained, the detection statistic is

$$ T^2(y_n) = r_n^T \, \mathrm{cov}(\tilde{Y})^{-1} \, r_n \qquad (2) $$

where $\mathrm{cov}(\tilde{Y})$ is the covariance of the residual term $\tilde{Y}$, and the
judging threshold is set to

$$ T_\alpha^2 = \frac{m \, n \, (n-2)}{(n-1)(n-1-m)} \, F_{1-\alpha}(m, n-1-m) \qquad (3) $$
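Equations (2) and (3) can be illustrated with a short numerical sketch. It assumes that the training residuals are stored row-wise in an array of shape (n-1, m), that the covariance is the usual sample covariance (the paper does not state the normalization), and that the caller supplies the (1-α) quantile of the F distribution with (m, n-1-m) degrees of freedom, for instance from a statistics library. The function names are hypothetical.

```python
import numpy as np


def t2_statistic(r_n, residual_train):
    """Equation (2): T^2 = r_n^T cov(Y~)^{-1} r_n.

    residual_train holds the training residuals, one row per time step.
    The sample covariance (divisor: rows - 1) is an assumption of this sketch.
    """
    cov = np.atleast_2d(np.cov(np.asarray(residual_train, dtype=float), rowvar=False))
    r = np.asarray(r_n, dtype=float).ravel()
    # Solve cov * x = r instead of forming the explicit inverse.
    return float(r @ np.linalg.solve(cov, r))


def t2_threshold(m, n, f_quantile):
    """Equation (3): m n (n-2) / ((n-1)(n-1-m)) * F_{1-alpha}(m, n-1-m).

    f_quantile is the (1 - alpha) quantile of F(m, n-1-m), supplied by the caller.
    """
    return m * n * (n - 2) / ((n - 1) * (n - 1 - m)) * f_quantile


def is_alarm(r_n, residual_train, f_quantile):
    """Raise an alarm for the current sample when T^2 exceeds the threshold."""
    n_minus_1, m = np.asarray(residual_train).shape
    return t2_statistic(r_n, residual_train) > t2_threshold(m, n_minus_1 + 1, f_quantile)
```

Solving the linear system rather than inverting the covariance is a numerical convenience of the sketch and does not change the value of the statistic.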






In Equation (3), $F_{1-\alpha}(m, n-1-m)$ denotes the $(1-\alpha)$ quantile of the F distribution
with $(m, n-1-m)$ degrees of freedom, and $\alpha$ is the significance level.

If $T^2(y_n) > T_\alpha^2$, $y_{n-1}$ is considered as the fault point. However, false alarms are
inevitable because of noise, so a more reliable detection criterion is needed, as follows.

   Criterion 1: If $T^2(y_n) > T_\alpha^2$ holds continuously for $W$ times, then the fault has
really happened, where $W$ is called the time threshold. The $W$-th alarm time is considered as the
fault time ($t_f$) (i.e. the occurrence time of the fault), and the residual $r$ at the fault time
is called the current fault direction or current direction (i.e. the feature direction of the
fault).

The detection statistic threshold is determined by Equation (3). The time threshold, which is used
to avoid false alarms, should not be too large (usually 2 to 4). A larger time threshold gives a
more reliable decision, but it causes a detection delay that may harm the system. The current fault
direction is the key information of each fault, and it is the basis for fault isolation. According
to Criterion 1, the current fault is detectable if and only if

$$ \| r_n \| > T_\alpha^2 \left( r_n^T \, \mathrm{cov}(\tilde{Y})^{-1} \, r_n \right)^{-1} \qquad (4) $$

In the IDL, fault detection is realized by the input-output system identification method. Moreover,
the occurrence time and the feature direction of the fault are also obtained.

The input-output system identification method has all the advantages of the time-series modeling
prediction method. It is particularly suitable for systems whose input and output are both
discontinuous. Its drawbacks are that the calculation amount is large and the iteration process is
relatively difficult.

4.2 Directional Similarity and Isolation Criterion

The basic task of the UIL is to use the feature direction of the fault obtained in the IDL to
establish the isolation discriminator, and then to realize the isolation between the AF and the UF.
The key point lies in establishing the isolator. Here the concept of directional similarity is
introduced, and a fault isolation criterion is given. Criterion 1 defines the current fault
direction or current direction (i.e. the feature direction of a fault). The true fault feature
direction defined below is adopted as the characteristic of the fault pattern on the superficial
data layer.

   Explanation 5: The true (fault) direction of a fault pattern is defined as the unified mean of
all possible current fault directions from the same pattern.

The relationship between the current directions and the true direction is like that between a
discrete random variable and its expectation. It is easy to see that

$$ \xi = \lim_{n \to \infty} \frac{\frac{1}{n} \sum_{i=1}^{n} r_i}{\left\| \frac{1}{n} \sum_{i=1}^{n} r_i \right\|_2} \qquad (5) $$

$$ r = \| r \| \, \xi + \varepsilon \qquad (6) $$

where $\{r_i\}_{i=1}^{n}$ are all possible current directions from the same pattern, $\varepsilon$
is the noise and $\| r \|$ is the magnitude of the current direction.

Figure 2 shows that there are two opposite true directions for each fault pattern. For example, the
true direction $\xi_1$ is at the center of a symmetric cone, around which are the current
directions from the same pattern. $\xi_2$ is another true direction, corresponding to another fault
pattern. The origin of the coordinates can be regarded as the true direction of the normal pattern.

Figure 2 True directions and current directions

Let $\theta(r, \xi)$ denote the angle between the current direction and the true direction.
$D_{disc}(r, \xi) = 1 - \cos(\theta(r, \xi))$ is called the directional discrepancy between them.
If they are from the same pattern, $D_{disc}(r, \xi)$ is small; otherwise, it is large.

Suppose that $\varepsilon \sim N(0, \Omega)$, the current direction is
$r = \varepsilon + \| r \| \, \xi$, $\{\xi_i\}_{i=1}^{q}$ are all anticipated true directions, and
$\xi_{i_0} = \arg\min_{\xi_i} \{ 1 - \cos(r, \xi_i) \}_{i=1}^{q}$. Then the isolation statistic is
given as follows

$$ Iso(r) = \frac{\| r \| \left( 1 - \cos(r, \xi_{i_0}) \right)}{\sqrt{\xi_{i_0}^T \Omega \, \xi_{i_0}}} \qquad (7) $$

   Theorem 1: If $Iso(r)$ is defined as in Equation (7), then

$$ Iso(r) \sim N(0, 1) \qquad (8) $$

   Proof: Suppose that the current direction is $r = \varepsilon + \| r \| \, \xi$, where $\xi$ is
the true direction, $\varepsilon$ is the observation noise, and $\varepsilon \sim N(0, \Omega)$.
According to Explanation 5 we have $\| \xi \| = 1$. If $\cos(r, \xi) \ge 0$, we approximately obtain

$$ \cos(r, \xi) = \frac{\xi^T r}{\| \xi \| \, \| r \|} = \frac{\xi^T \varepsilon}{\| r \|} + 1 \sim N\left( 1, \| r \|^{-2} \, \xi^T \Omega \, \xi \right) \qquad (9) $$

i.e. $\cos(r, \xi)$ follows a truncated normal distribution. Thus

$$ \| r \| \left( 1 - \cos(\xi_{i_0}, r) \right) \sim N\left( 0, \xi_{i_0}^T \Omega \, \xi_{i_0} \right) \qquad (10) $$

Similarly, if $\cos(r, \xi) < 0$, we can prove that

$$ \| r \| \left( 1 + \cos(\xi, r) \right) \sim N\left( 0, \xi^T \Omega \, \xi \right) \qquad (11) $$

According to Equation (10) and Equation (11), we obtain

$$ \| r \| \left( 1 - | \cos(\xi, r) | \right) \sim N\left( 0, \xi^T \Omega \, \xi \right) \qquad (12) $$

Then

$$ Iso(r) = \frac{\| r \| \left( 1 - \cos(r, \xi_{i_0}) \right)}{\sqrt{\xi_{i_0}^T \Omega \, \xi_{i_0}}} \sim N(0, 1) \qquad (13) $$

and thus the theorem is proved. Therefore, the threshold for $Iso(r)$ is $\Phi_{1-\alpha}$, where
$\alpha$ is the significance level and $\Phi$ is the inverse of the normal cumulative distribution
function. We provide the isolation criterion as follows.
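Before the criterion is stated, the isolation statistic of Equation (7) and its threshold can be illustrated with a minimal sketch. It assumes that the anticipated true directions are stored as the rows of an array, that the noise covariance Ω is known, and that the decision compares the statistic with the (1-α) quantile of the standard normal distribution, as in Criterion 2. The function names are hypothetical and do not come from the paper.

```python
from statistics import NormalDist

import numpy as np


def isolation_statistic(r, true_directions, omega):
    """Equation (7): ||r|| (1 - cos(r, xi_i0)) / sqrt(xi_i0^T Omega xi_i0).

    Returns the statistic and the index i0 of the closest anticipated direction.
    """
    r = np.asarray(r, dtype=float)
    omega = np.asarray(omega, dtype=float)
    xis = np.asarray(true_directions, dtype=float)
    # True directions are unit vectors (Explanation 5); normalise defensively.
    xis = xis / np.linalg.norm(xis, axis=1, keepdims=True)
    norm_r = np.linalg.norm(r)
    cosines = xis @ r / norm_r
    i0 = int(np.argmax(cosines))  # largest cosine minimises 1 - cos(r, xi_i)
    xi0 = xis[i0]
    iso = norm_r * (1.0 - cosines[i0]) / np.sqrt(xi0 @ omega @ xi0)
    return float(iso), i0


def is_unanticipated(iso, alpha=0.01):
    """Decision rule: unanticipated when Iso(r) exceeds the normal quantile."""
    return iso > NormalDist().inv_cdf(1.0 - alpha)
```

With anticipated directions (1, 0) and (0, 1) and Ω equal to the identity, a current direction of (10, -10) gives a statistic of about 4.14 and is flagged as unanticipated at α = 0.01, whereas (3, -3) points the same way but yields about 1.24 and is not flagged. The second case shows that a fault of small magnitude may fail to be isolated.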






   Criterion 2: If Iso(r ) > Φ1−α holds true, the current fault             The monitoring data comprises not only the output
is unanticipated; otherwise, it is anticipated.                          data of the measuring mechanism, but also the control input
Criterion 2 indicates that UF with too small a magnitude                 of the execution mechanism. The dimension of the data
cannot be isolated. If the current fault is unanticipated, a             output by the measuring mechanism is m = 7. The
new fault pattern is found and the unified current direction             dimension of the data input by the execution mechanism is p = 4,
is regarded as its true direction. If the current fault is an-           which can be seen in Table 1. There are altogether 10
ticipated, then the current direction should be added to the             batches of monitoring data, which can be seen in Table 2.
corresponding AF direction database in UIL of the GPM,                   The first batch is the normal data, and the normal pattern
and the true direction shall be updated.                                 data is discontinuous and unstable (Figure 3). The subse-
                                                                         quent 9 batches are used for testing, and different fault
4.3 Calculation for Contribution Factor                                  patterns (a sudden-change fault, a gradual-change fault and
The basic task of the URL is to carry out online learning and            so on) are given. In Figure 3, the comparison of the moni-
analysis for UF data. The key point of recognition or iden-              toring data in the fault with drift-increasing of gyro at roll
tification is to establish the corresponding relationship from           axis and the normal pattern is given. The time of each batch
the monitoring data to the unknown fault or the character-               of data is 45000s-48000s; each piece data is collected per
istics of the unknown fault. The UF diagnosis discussed in               second, and the data length n = 3000 .
this paper is an approach driven by pure data, thus the                   Additionally, the public parameters used in the simulation
characteristic recognition on the data layer is more focused.            are assigned as follows: The significance level α = 0.01
According to the contribution factor, the variant which is               and the time threshold defined in Criterion 1 is W=3.
most relevant to the current UF can be found, and then the               Table 1 Data explain of attitude control system
UF recognition is completed.
   Known from Criterion 1 that after the residual detection              Variable
                                                                                                     Code                          Sensor
statistic is established, if T 2 ( yn ) > Tα2 , it is thought that a     subscript
                                                                             1                     Wheel1         Output of the first momentum wheel
fault occurs at time period n-1. For the system with the
                                                                             2                     Wheel2         Output of the second momentum wheel
control input and measure output, firstly a residual covari-                           Input
                                                                             3                     Wheel3         Output of the third momentum wheel
ance matrix R (i.e. cov(Y ) in Equation (2)) is subjected to                 4                     Wheel4         Output of the fourth momentum wheel
the singular value decomposition, which is                                   1        Output       EarthPhi       Output of earth sensor at roll axis
                        R = P T diag ( λ) P                   (14)           2                    EarthTheta      Output of earth sensor at pitch axis
where λ = ( λ1 , … , λm ) , P = ( p1 , … , pm ) , pi indicates the           3                     SunPhi         Output of sun sensor at roll axis
                                                                             4                    SunTheta        Output of sun sensor at pitch axis
ith column of P , and p ji indicates the jth component of
 pi . Let ti = r T pi , and rj indicates the jth component of               5                       GeoPhi        Output of gyro at roll axis
the current fault feature direction r, where 1 ≤ j ≤ m .                    6                      GeoTheta       Output of gyro at pitch axis
   Explanation 6: The contribution factor of the jth variant                7                       GeoPsi        Output of gyro at yaw axis
to the current fault feature direction r is                              Table 2 Batch number of monitoring data
                    Cont ( j ) = ∑ ( ti rj p ji / λi )
                                   m
                                                               (15)       Batch                                                                   Fault
                                  i =1                                                                 Data description
                                                                         number                                                                   time
From the aspect of characteristic recognition in the data                   1        Normal data                                                  Null
layer, the variant with the largest contribution factor is the              2        Sudden-change fault data of earth sensor at roll axis       46000s
fault variant. If it is a sensor fault, the sensor corresponding            3        Gradual-change fault data of earth sensor at roll axis      46000s
to the variant with the largest contribution factor is the                  4        Sudden-change fault data of earth sensor at pitch axis      46000s
sensor hardware with the fault.                                             5        Gradual-change fault data of earth sensor at pitch axis     46000s
                                                                            6        Loss fault data of sun sensor at roll axis                  46000s
5    Simulation and Performance Evaluation                                  7        Loss fault data of sun sensor at pitch axis                 46000s
                                                                            8        Drift-increasing fault data of gyro at roll axis            46000s
The effectiveness of the proposed GPM and the corre-                        9        Drift-increasing fault data of gyro at pitch axis           46000s
sponding UF fault detection, isolation and recognition                      10       Drift-increasing fault data of gyro at yaw axis             46000s
method are demonstrated in this section through a satellite’s
attitude control system model.                                           5.2 Performance Evaluation
5.1 Input and Output of Satellite Control System                         The monitoring data are relatively more complex, com-
                                                                         prising of the output data of the measuring mechanism and
The satellite’s attitude control system is a main part of a              the control input of the execution mechanism (seen in Table
satellite, which consists of four main parts: a satellite body,          1). The normal pattern data is discontinuous and unstable
a controller, an execution mechanism and a measuring                     (seen in Figure 3), and the fault pattern is diversified (with
mechanism [30].                                                          sudden-change fault, gradual-change fault and so on).
   As the complexity of the satellite’s attitude control sys-            Therefore, the normal pattern data is difficult to be dis-
tem, faults particularly for the measuring mechanism and                 criminated from the fault pattern data (seen from Figure 3).
the execution mechanism occur rather frequently.                            With the input-output system identification method, the
   Here on consideration of the monitoring data for the sat-             Hammerstein-Wiener model (NLHW) is adopted. Equation
ellite’s attitude control system. The monitoring data are                (1) is optimized, and the responding function f between the
provided by China Aerospace Science and Technology                       input and output is estimated. Similarly, for the same data
Corporation (CASA).                                                      (Drift-increasing fault data of gyro at roll axis (the batch
                                                                         number is 8) in Table 2), the detection result of the IDL is




                                                                   141
                             Proceedings of the 26th International Workshop on Principles of Diagnosis


given in Figure 4, which can be seen that the fault detection                                                             tion is delayed caused by the time threshold, W = 3 .
is timely, the detection effect is remarkable, and 4s detec-
[Figure 3 plot omitted. Fourteen panels plotted against time (×10^4 s, 4.5 to 4.8): es-x, es-y, ss-x, ss-y, w-x, w-y, w-z, T-wheel-1, T-wheel-2, T-wheel-3, T-wheel-4, x-esti-attitude, y-esti-attitude, z-esti-attitude.]
Figure 3 Drift-increasing fault of gyro at roll axis (the blue line shows the output in the normal pattern, while the green line shows the output in the fault pattern)

Figure 4 The detection result (with input-output system identification method) for drift-increasing fault data of gyro at roll axis [plot omitted; annotation on the plot: t_f2: 1004, ln(T2): 5.483]

   By adopting the input-output system identification method, the detection results in the IDL for the data in Table 2 are obtained as shown in Table 3. The fault detection is timely, and the detection effect is more obvious (both the false alarm probability (FAP) and the missing alarm probability (MAP) are much lower).
   In the IDL, fault detection is realized, and the fault time and the current fault direction are also determined. In the UIL, Criterion 2 is adopted to isolate the UF from the AF. In the initial stage, the AF pattern set is assumed to be empty; therefore, when the second batch of data in Table 2 is fed into the UIL, the detected fault must be a UF, and the isolation result is then transferred to the URL. When the third batch of data in Table 2 is fed into the IDL, the fault time is t = 1001 s, the statistic of the directional similarity is $r\,(1-\cos(r,\xi_1))/\xi_1^{T}R\,\xi_1 = 7.3179$, and the isolation threshold of the UF is again $\Phi_{0.99} = 2.3263$. Since $r\,(1-\cos(r,\xi_1))/\xi_1^{T}R\,\xi_1 > \Phi_{0.99}$, the current fault pattern is different from the first fault pattern, and a UF occurs. The UF is then transferred to the URL. The fault isolation results for all the tested data in Table 2 are given in Table 4. Table 4 shows that the isolator based on the fault feature direction and the direction similarity is valid, and that the isolation between the UF and the AF can be realized.
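
The threshold comparison in the paragraph above can be checked numerically. The following is a minimal sketch, assuming only that the threshold is the (1 − α) quantile of the standard normal distribution, as stated for Criterion 2; the function names `isolation_threshold` and `is_unanticipated` are hypothetical, and the value 7.3179 is simply the statistic reported in the text for the third batch, not something computed here.

```python
from statistics import NormalDist


def isolation_threshold(alpha: float) -> float:
    """Return the (1 - alpha) quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1.0 - alpha)


def is_unanticipated(iso_statistic: float, alpha: float = 0.01) -> bool:
    """Criterion 2: the fault is unanticipated if the statistic exceeds the threshold."""
    return iso_statistic > isolation_threshold(alpha)


threshold = isolation_threshold(0.01)   # significance level used in the simulation
print(round(threshold, 4))              # 2.3263
print(is_unanticipated(7.3179))         # True: reported statistic for batch 3
```

With α = 0.01 the quantile agrees with the reported threshold of 2.3263, so the reported statistic of 7.3179 is classified as an unanticipated fault.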

Table 3 Unanticipated fault diagnosis: Inherent Detection Layer (IDL)

Batch    Normal     FAP   MAP   Fault      Current fault direction (components 1 to 7)
number   or Fault   (%)   (%)   time (s)
1        N          5                       0        0        0        0        0        0        0
2        F          3     2     1000+2      0.9876  -0.0042   0.041   -0.053    0.0453  -0.1342   0.0678
3        F          4     1     1000+1     -0.9997   0.0005  -0.034    0.049    0.0049  -0.0036   0.0222
4        F          5     1     1000+2     -0.1510  -0.9747  -0.0097   0.0105   0.0442  -0.1550   0.0345
5        F          4     1     1000+2     -0.0018   1.0000   0.0007   0.0006  -0.0009  -0.0022  -0.0077
6        F          5     1     1000+2      0.0086  -0.0093  -0.9752   0.0046  -0.0007   0.0003   0.0008
7        F          3     2     1000+3     -0.0067   0.0052   0.0016  -0.9925  -0.1553   0.0028  -0.0016
8        F          5     1     1000+4     -0.0769   0.0051   0.0037   0.0018   0.9682  -0.0139  -0.0549
9        F          3     1     1000+2     -0.0742   0.0215  -0.0029   0.0016   0.0454  -0.9968   0.0447
10       F          3     1     1000+2      0.0627  -0.0201  -0.0079   0.0086  -0.0476  -0.0441  -0.9849
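
Table 3 reports FAP and MAP without restating how they are computed. The sketch below assumes the usual definitions: FAP is the percentage of normal samples that raise an alarm, and MAP is the percentage of faulty samples that raise none. The function name `alarm_rates` and the ten-sample sequences are hypothetical toy data, not the satellite data. A rate is returned as `None` when its denominator is empty, which is one plausible reading of the blank MAP entry for the all-normal batch 1.

```python
from typing import Optional, Sequence, Tuple


def alarm_rates(
    alarms: Sequence[bool], faulty: Sequence[bool]
) -> Tuple[Optional[float], Optional[float]]:
    """Return (FAP, MAP) in percent under the assumed definitions."""
    if len(alarms) != len(faulty):
        raise ValueError("alarms and faulty must have the same length")
    n_normal = sum(1 for f in faulty if not f)
    n_faulty = len(faulty) - n_normal
    # False alarm: alarm raised on a normal sample.
    false_alarms = sum(1 for a, f in zip(alarms, faulty) if a and not f)
    # Missed alarm: no alarm raised on a faulty sample.
    missed = sum(1 for a, f in zip(alarms, faulty) if f and not a)
    fap = 100.0 * false_alarms / n_normal if n_normal else None
    map_ = 100.0 * missed / n_faulty if n_faulty else None
    return fap, map_


# Toy example: 5 normal samples followed by 5 faulty samples.
faulty = [False] * 5 + [True] * 5
alarms = [False, True, False, False, False, False, True, True, True, True]
print(alarm_rates(alarms, faulty))  # (20.0, 20.0)
```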






Table 4 Unanticipated fault diagnosis: Unanticipated Isolation Layer (UIL)

Batch    Anticipated or   Fault pattern   Updated true fault direction (components 1 to 7)
number   Unanticipated    code
1        Null             0                0        0        0        0        0        0        0
2        U                1                1       -0.0043   0.0415  -0.0537   0.0459  -0.1359   0.0687
3        U                2               -1        0       -0.0340   0.049    0        0        0.0223
4        U                3               -0.1549  -1       -0.01     0.0108   0.0453  -0.1590   0.0354
5        U                4               -0.0018   1        0        0        0       -0.0022  -0.0077
6        U                5                0.0088  -0.0095  -1        0.0047   0        0        0
7        U                6               -0.0068   0.0052   0.0016  -1       -0.1565   0.0028  -0.0016
8        U                7               -0.0794   0.0053   0.0038   0.0019   1       -0.0144  -0.0567
9        U                8               -0.0744   0.0216   0.0029   0.0016   0.0455  -1        0.0447
10       U                9                0.0637  -0.0204  -0.0080   0.0087  -0.0483  -0.0448  -1
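
The contribution factor of Equation (15) can be illustrated with a short sketch. The residual covariance matrix R of the satellite data is not given in the paper, so the example assumes R is the identity matrix (P = I, all λ_i = 1); this is purely an illustrative assumption. The function name `contribution_factors` and the normalisation of the factors to shares are likewise hypothetical. The direction r is the batch 2 row of Table 3.

```python
from typing import List, Sequence


def contribution_factors(
    r: Sequence[float], P: Sequence[Sequence[float]], lam: Sequence[float]
) -> List[float]:
    """Cont(j) = sum_i t_i * r_j * p_ji / lambda_i, with t_i = r^T p_i.

    P[j][i] is the j-th component of the i-th column p_i of P.
    """
    m = len(r)
    t = [sum(r[k] * P[k][i] for k in range(m)) for i in range(m)]
    return [
        sum(t[i] * r[j] * P[j][i] / lam[i] for i in range(m))
        for j in range(m)
    ]


# Current fault direction of batch 2 (Table 3).
r = [0.9876, -0.0042, 0.041, -0.053, 0.0453, -0.1342, 0.0678]
m = len(r)
# ASSUMPTION: identity covariance, so P = I and every lambda_i = 1.
P = [[1.0 if j == i else 0.0 for i in range(m)] for j in range(m)]
lam = [1.0] * m

cont = contribution_factors(r, P, lam)
share = [c / sum(cont) for c in cont]            # normalised shares (assumption)
largest = max(range(m), key=lambda j: cont[j]) + 1  # 1-based variable subscript
print(largest, round(100 * share[0]))            # 1 97
```

Under this simplifying assumption each factor reduces to the square of the corresponding direction component, and the first variable (the earth sensor at the roll axis) carries about 97 percent of the total. With the actual covariance matrix the values would generally differ.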


   After isolating the UF, the recognition of the UF should be carried out on the data layer. For the second batch of data in Table 2, the recognition result is as follows: the fault feature direction is $(0.9876, -0.0042, 0.041, -0.053, 0.0453, -0.1342, 0.0678)^{T}$. The variable with the largest contribution factor is the first dimension. According to Explanation 6, the contribution factor reaches 97 percent, which indicates that the fault occurs in the earth sensor at the roll axis. Similarly, the results of the UF recognition in the URL for the other batches of data are shown in Table 5. Table 5 shows that the fault variable recognized for each UF is correct, so the UF recognition in the data layer is achieved.

Table 5 Unanticipated fault diagnosis: Unanticipated Recognition Layer (URL)

Batch    Anticipated or   Fault pattern   Variable subscript
number   Unanticipated    code            in Table 3
1        Null             0               0
2        U                1               1
3        U                2               1
4        U                3               2
5        U                4               2
6        U                5               3
7        U                6               4
8        U                7               5
9        U                8               6
10       U                9               7

6    Conclusion
This paper takes the UF as its main diagnosis object and studies a data-driven detection and diagnosis method for UFs. The GPM for UF diagnosis has been designed; it is composed of the IDL, the UIL and the URL and provides a supporting framework for UF diagnosis. For systems with both control input and measured output, the system identification detection method corresponding to the IDL has been provided. The current fault feature direction and the feature direction of the AF pattern have been used to establish the statistic of directional similarity, and the isolation between the AF and the UF has been realized in the UIL. Based on the singular value decomposition, the fault contribution factor of each variable has been obtained, and the fault recognition in the data layer has been completed. The application to fault diagnosis of the satellite’s control system has demonstrated its validity.
   Our research will be extended in two directions. First, based on the framework of the GPM, fault detection, isolation and recognition methods founded on model inference will be studied. Second, the GPM and the methods will be applied to the diagnosis of other complex systems for both military and civil use.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61304119. We would also especially like to thank China Aerospace Science and Technology Corporation (CASA) for providing the satellite control system data.

References
[1] P. Nomikos, J. F. MacGregor. Monitoring of batch processes using multiway principal component analysis. AIChE J, 1994, 40(8): 1361-1375.
[2] R. Isermann. Supervision, fault detection and fault diagnosis methods: an introduction. Journal of Control Engineering Practice, 1997, 5(5): 639-652.
[3] D. M. J. Tax. One-class classification. Ph.D., Delft University of Technology, Holland, 2001.
[4] S. Gayaka, B. Yao. Accommodation of unknown actuator faults using output feedback-based adaptive robust control. International Journal of Adaptive Control and Signal Processing, 2008, 25(11): 965-982.






[5] P. Smyth. Markov monitoring with unknown states. IEEE Journal on Selected Areas in Communications, 1994, 12(9): 1600-1610.
[6] Hofbaur, B.C. Williams. Hybrid diagnosis with unknown behavioral modes. Proceedings of the 13th International Workshop on Principles of Diagnosis (DX02), May, 2002.
[7] V.J. Hodge, J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review. Kluwer Academic Publishers, Vol. 22, 2004, 85-124.
[8] Patcha, J. M Park. An overview of anomaly detection techniques: existing solutions and latest technology trends. Computer Networks, 2007, 51: 3448-3470.
[9] K. Kojima, K. Ito. Autonomous learning of novel patterns by utilizing chaotic dynamics. IEEE International Conference on Systems, Man, and Cybernetics, IEEE SMC '99, 1999, 1: 284-289.
[10] Petra Perner. Concepts for Novelty Detection and Handling Based on a Case-Based Reasoning Process Scheme. Springer-Verlag Berlin Heidelberg, 2007.
[11] Satnam Singh, Haiying Tu, William Donat. Anomaly detection via feature-aided tracking and Hidden Markov Models. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 2009, 39(1): 144-159.
[12] Ching-Fang Lin. Predictive fault diagnosis system for intelligent and robust health monitoring. AIAA Infotech@Aerospace, 20-22, April, 2010, Atlanta, Georgia.
[13] E. Sobhani-Tehrani, H. A. Talebi, K. Khorasani. Neural parameter estimators for hybrid fault diagnosis and estimation in nonlinear systems. IEEE International Conference on Systems, Man and Cybernetics, Montreal, 7-10, Oct, 2007, 3171-3176.
[14] Amitabh Barua. Hierarchical fault diagnosis and health monitoring in satellites formation flight. IEEE Transactions on Systems, Man and Cybernetics-Part C: Applications and Reviews, 2011, 41(2): 223-239.
[15] B. Tom and J. Tom. Anomaly detection for advanced military aircraft using neural networks. Aerospace Conference, IEEE Proceedings, 2001, 6: 3113-3134.
[16] Z. H. Duan. Theoretic and methodological research on fault diagnosis of mobile robots based on adaptive particle filters. Ph.D., Central South University, 2007, 63-89 [in Chinese].
[17] B. Zhang, Chris Sconyers, Carl Byington, Romano Patrick, Marcos Orchard. Anomaly detection: A robust approach to detection of unanticipated faults. International Conference on Prognostics and Health Management, 6-9, Oct, Denver, CO, 2008: 1-8.
[18] Z. M He, H. Y. Zhou, J. Q. Wang. Model for Unanticipated Fault Detection by OCPCA. Advanced Materials Research, Vols. 591-593, 2012: 2108-2113.
[19] J. Chen, R.J. Patton. Robust model-based fault diagnosis for dynamic systems. Boston: Kluwer Academic Publishers, 1999.
[20] B. Zhang, S. Chris, B. Carl. A probabilistic fault detection approach: application to bearing fault detection. IEEE Transactions on Industrial Electronics, 2010, 58(5): 2011-2018.
[21] Pierre Sens. An unreliable failure detector for unknown and mobile networks. OPODIS 2008, LNCS 5401, 2008, 555-559.
[22] Anna M. Bartkowiak. Anomaly, novelty, one-class classification: a short introduction. Computer Information Systems and Industrial Management Applications (CISIM), 2010 International Conference, Wrocław, Poland, 8-10, Oct, 2010, 1-6.
[23] F. N. Zhou. Extended DCA method for unknown multiple faults diagnosis. Huazhong Univ. of Sci. & Tech. (Natural Science Edition), 2009, 37(4): 84-94 [in Chinese].
[24] N. Gebraeel, J. Pan. Prognostic degradation models for computing and updating residual life distributions in a time-varying environment. IEEE Trans. Rel., 2008, 57(4): 539-550.
[25] Wang Z M, Yi D Y, Duan X J. Measurement data modeling and parameter estimation. CRC Press, 2011.
[26] Adrian Wills, Brett Ninness. On gradient-based search for multivariable system estimates. IEEE Trans. Automat. Control, 2008, 53(1): 298-306.
[27] E. Wernholt, S. Moberg. Nonlinear gray-box identification using local models applied to industrial robots. Automatica, 2011, 4(47): 650-660.
[28] Lennart Ljung. System identification: Theory for the User. Linkoping University, Sweden Published, 1998.
[29] Goethals, K. Pelckmans, J. A. K. Suykens, B. De Moor. Subspace identification of Hammerstein systems using least squares support vector machines. IEEE Transactions on Automatic Control, 2005, 50(10): 1509-1519.
[30] Tu S C. Satellite attitude dynamics and control. Beijing: Chinese Astronautic Publishing House, 2003; 125-168 [in Chinese].




                           Proceedings of the 26th International Workshop on Principles of Diagnosis




                       A SCADA Expansion for Leak Detection in a Pipeline∗

                           Rolando Carrera∗, Cristina Verde∗ and Raúl Cayetano∗∗
                       ∗ Universidad Nacional Autónoma de México, Instituto de Ingeniería
                                   e-mail: rcarrera@unam.mx, verde@unam.mx
                       ∗∗ Universidad Nacional Autónoma de México, Posgrado de Ingeniería
                                   e-mail: rcayetanos@ii.unam.mx


                           Abstract

A solution for expanding an existing pipeline SCADA with real-time leak detection is presented. The work consisted of attaching an FDI scheme to an industrial SCADA that regulates liquid distribution from its source to the end user. To isolate the leak, a lateral extraction is proposed instead of the traditional pressure profile of the pipeline. The friction value is a function of the physical parameters of the pipe, but on-line friction estimation achieved better results. Important aspects of integrating the FDI scheme into the SCADA were the lack of synchrony of the pipeline variables (flow, pressure) and their accessibility, which led to data extrapolation and the use of database techniques. The vulnerability of the location algorithm to sensor bandwidth and sensitivity is shown, which underlines the importance of sensor selection. The FDI scheme was programmed in LabVIEW and executed on a personal computer.

1 Introduction

Leak detection and isolation in pipelines is an old problem that has attracted the attention of the scientific community for decades. A paradigmatic example is the oil leakage in the Siberian region [1], where the effects on the surrounding nature have been disastrous. In Mexico, a semi-desert country, water must be transported to the population over long distances via aqueducts; this requires complex supervision systems that detect leakages early. There is also a complex network of pipelines that transport oil and its by-products; in this network, besides the leakage problem, there is also illegal extraction of the transported product, so the distribution system must have a leak detection and location monitoring system.

Since the 1970s several works have been published that are fundamental for the detection and location of leakages, such as that of Siebert [2], where simple expressions based on correlations are derived from the steady-state pressure profile along the pipeline to detect and locate a leakage. Later, Isermann [3] published a survey of the state of the art in fault detection using the plant model and parameter identification. Recently, Verde published a book [4] emphasizing signal processing, pattern recognition and analytical models for failure diagnosis.

All of the above is purely academic, however; our aim here is to share some of the practical experience acquired during a re-engineering project that consisted of adding a real-time leak detection and location layer to an existing SCADA. The original objectives of that SCADA were the administration and delivery of products through pipelines from the source to the end user. As this was our first attempt at integrating an FDI scheme into an existing SCADA and we had no experience on the subject, we proposed a solution based on simple algorithms for detecting and locating a leak. In future work we will use more elaborate algorithms, such as dedicated observers or the detection of two simultaneous leaks.

To show how we met the targets of the project, we divided the solution into five major parts (presented in Sections 2 to 6 below). Some of them are taken from available theory, such as the dynamical model of flow in a pipe and the expression for leak location; others result from the experience gained in our laboratory facilities, such as the calculation of pipe friction and the choice of sensors; and the last one, data acquisition, was imposed by the nature of the available SCADA.

Delivering a fluid to clients implies steady operation, so our solution required a suitable model for that condition; Section 2 describes how to obtain a simple steady-state model of a pipeline. Once the model is at hand, an appropriate expression for leak location is needed; Section 3 presents a simple method for locating a leak. In our experience, pipe friction plays a fundamental role in the exact location of the leak, and a friction value estimated in real time is better than a constant one fixed beforehand; an on-line expression for calculating the pipeline friction is shown in Section 4. In this project we did not have the option of choosing the sensors, but we consider it appropriate to share our experience in this matter: a comparative study of how different types of sensors affect the leak location is presented in Section 5. The data acquisition system of the SCADA is based on a MODBUS system and a database holding the pipe variables. We had no access rights to the MODBUS, only to the database; Section 6 shows how the issue of indirectly measuring the pipe variables was solved using Ethernet and databases, and how data missing between sample times were extrapolated. Finally, the concluding remarks are presented in Section 7.

   ∗ Supported by II-UNAM and IT100414-DGAPA-UNAM.






2 Pipeline steady state model

Most applications require a dynamical model of the system, but here the pipeline operates steadily, so a steady-state model is more suitable. Moreover, the pipeline lies buried in the field and has an irregular topography, but it is possible to derive a model that handles it as a horizontal one. This model is simpler, as will be shown.

In the following we transform the model of a pipeline with the topographical profile shown in Figure 1 into one with a straight piezometric head profile, where the pressure variable depends on a reference value $h$, namely the height over sea level along the pipeline. Consider the simplified one-dimensional flow model of a pipeline with $n$ sections [5],

\[ \frac{1}{A^i}\frac{\partial Q^i(z^i,t)}{\partial t} + g\,\frac{\partial H^i(z^i,t)}{\partial z^i} + \frac{f^i\,Q^i(z^i,t)\,|Q^i(z^i,t)|}{2D^i(A^i)^2} + g\sin\alpha^i = 0 \tag{1} \]

\[ \frac{\partial H^i(z^i,t)}{\partial t} + \frac{b^2}{gA^i}\frac{\partial Q^i(z^i,t)}{\partial z^i} = 0 \tag{2} \]

which assumes that the fluid is slightly compressible, the pipe walls are slightly deformable and convective changes in velocity are negligible. $Q$ is the volumetric flow, $H$ the pressure head, $A$ the pipe cross-sectional area, $g$ the gravity acceleration, $f$ the D'Arcy-Weissbach friction¹ [6], $b$ the velocity of the pressure wave, $D$ the pipe diameter, $z$ the distance variable and $t$ the time. The superscript $i = 1, 2, \ldots, n$ indicates the pipeline section, characterized by its slope with angle $\alpha^i$; $n$ is the total number of sections.

   ¹ This friction characterizes the shear stress exerted by the conduit walls on the flowing fluid.

Figure 1: 60 km pipeline topographical layout (height over sea level [m] versus length [m]; sensor locations marked)

We start from the following hypothesis: the system works in steady state and the pipeline lies on a horizontal surface. Therefore we need a steady-state model that takes these conditions into account.

To describe the behaviour of the pressure head $H^i(z^i,t)$ along a section without branches, steady-state flow is assumed, so from (2) one gets

\[ \frac{\partial Q^i(z^i,t)}{\partial z^i} = 0 \;\Rightarrow\; Q^i \text{ constant} \tag{3} \]

Combining (1) and (3), with $\partial Q^i/\partial t = 0$ in steady state and dividing by $g$,

\[ \frac{dH^i(z^i)}{dz^i} + M^i(Q^i) = 0, \tag{4} \]

with

\[ M^i(Q^i) = \mu^i Q^i|Q^i| + \sin(\alpha^i) = m^i(Q^i) + \sin(\alpha^i) \tag{5} \]

which is independent of the spatial coordinate $z^i$, and $\mu^i := f^i / \left(2D^i(A^i)^2 g\right)$. Then the solution of (4) reduces to

\[ H^i(z^i) = -M^i(Q^i)\,z^i + H^i(0) \quad \text{for } 0 \le z^i \le L^i \tag{6} \]

with $H^i(0)$ the pressure head at the beginning of section $i$. Defining the boundary conditions for section $i$ in terms of the pressure at its ends:

\[ H^i(z^i = 0) := H^i_{in}, \qquad H^i(z^i = L^i) := H^i_{out}. \tag{7} \]

Substituting (7) in (6), we obtain

\[ H^i_{in} - H^i_{out} = M^i(Q^i)L^i = m^i(Q^i)L^i + \Delta H_i, \tag{8} \]

where $\Delta H_i = L^i\sin(\alpha^i)$ is the height difference between the section ends.

It is reported in [7] and [8] that the pressure head

\[ H^i(z^i) = \frac{P^i(z^i)}{\rho g} \tag{9} \]

can be written in terms of the piezometric head $\tilde{H}^i(z^i)$, which depends on a height $h$ that can be referred to sea level, i.e.

\[ \tilde{H}^i(z^i) = H^i(z^i) + h(z^i), \tag{10} \]

with $h(z^i)$ in m over the reference datum or sea level, and $\rho$ the fluid density. Then the pressure profile (8) is equivalent to

\[ \tilde{H}^i_{in} - \tilde{H}^i_{out} = m^i(Q^i)L^i \tag{11} \]

for section $i$ with height $h(z^i)$ over sea level along the section. Finally, considering that the boundary conditions of consecutive sections are related by

\[ \tilde{H}^i_{out} = \tilde{H}^{i+1}_{in}, \tag{12} \]

from this equation and (11) one gets

\[ \tilde{H}^1_{in} - \tilde{H}^n_{out} = \sum_{i=1}^{n} L^i m^i(Q^i) \tag{13} \]

which is a function of the piezometric head for a pipeline with $n$ sections without branches.

The profile of Figure 1 corresponds to the topography of the pipeline under study. The pressure head $H(z)$ and the resulting piezometric head $\tilde{H}(z)$ are shown in Figures 2 and 3, respectively. Note the uniformity of $\tilde{H}(z)$, similar to that of a horizontal pipeline. The reference datum was the height of the first sensor location.

As a consequence, if $\tilde{H}^1_{in} = \tilde{H}_{in}$ and $\tilde{H}^n_{out} = \tilde{H}_{out}$, and if in addition $m^i(Q^i) = m(Q) = M(Q)$ for all $i$, then Equation (13) becomes

\[ \tilde{H}_{in} - \tilde{H}_{out} = L\,M(Q) \tag{14} \]

where $L = \sum_{i=1}^{n} L^i$ is the total length of the pipeline. Equation (14) is the steady-state piezometric model of the pipeline viewed as a horizontal one.

Figure 2: Pipeline pressure head profile $H(z)$ (pressure head [m] versus length [m])

Figure 3: Profile of the piezometric head $\tilde{H}(z)$ (piezometric head versus length [m])

3 Leak location

We consider a leakage as an outlet pipe at the leak location, as shown in Figure 4. A branch or lateral pipe in section $i$ breaks the continuity of the variables $Q(z,t)$ and $H(z,t)$; therefore new boundary conditions must be satisfied [9]. In particular, the union of three pipes is associated with the geometry shown in Figure 4, and the corresponding conditions that describe the separation of the flow reduce to

\[ H_2 = H_1 + \kappa_{12}(H_2, H_1) \tag{15} \]
\[ H_3 = H_1 + \kappa_{13}(H_3, H_1) \tag{16} \]

where $H_2$ and $H_3$ are the pressures at the beginning of pipes 2 and 3, and the functions $\kappa_{1\eta}(\cdot,\cdot)$ with $\eta = 2, 3$ represent the losses caused by friction and by the change of flow direction. To estimate the order of magnitude of these functions, flow simulations were run with Pipelinestudio [10] using the topology of the case study shown in Figure 1. The simulations showed that the terms $\kappa_{12}$ and $\kappa_{13}$ were negligible, so $H_1 = H_2 = H_3$. Thereafter only the flow balance

\[ Q_1 - Q_2 - Q_3 = 0, \tag{17} \]

was included in the study; as a consequence,

\[ Q_1 = Q_{in}, \qquad Q_3 = Q_{out} \tag{18} \]

with $Q_{in}$ and $Q_{out}$ the flows at the ends of the pipeline. So the differential equation (4) becomes the two equations

\[ \begin{aligned} \frac{dH^1(z)}{dz} + M(Q_1) &= 0 \quad \text{for } 0 \le z \le z_b \\ \frac{dH^3(z)}{dz} + M(Q_3) &= 0 \quad \text{for } z_b < z \le L, \end{aligned} \tag{19} \]

describing the pressure head along the section with a branch at point $z_b$. As equations (19) have the same form as (4), their solutions also have the same form as (6). Therefore, with the boundary conditions:

  1. $H^1(z = 0) = H_{in}$,
  2. $H^3(z = L) = H_{out}$,
  3. $Q_{in} = Q_{out} + Q_{z_b}$ and
  4. $H(z_b - \epsilon) = H(z_b + \epsilon)$ with $\epsilon \to 0$,

and assuming that all pipes have the same diameter, the solutions of (19) evaluated at the ends reduce to

\[ \begin{aligned} \frac{H_{in} - H_{z_b}}{z_b} - M(Q_{in}) &= 0 \\ \frac{H_{z_b} - H_{out}}{L - z_b} - M(Q_{out}) &= 0. \end{aligned} \tag{20} \]

Solving for the variable $z_b$ associated with the position of the branch gives

\[ z_b = \frac{M(Q_{out})L + H_{out} - H_{in}}{M(Q_{out}) - M(Q_{in})} = \frac{L\sin\alpha + m(Q_{out})L + H_{out} - H_{in}}{m(Q_{out}) - m(Q_{in})}, \tag{21} \]

or, in terms of the piezometric head,

\[ z_b = \frac{m(Q_{out})L + \tilde{H}_{out} - \tilde{H}_{in}}{m(Q_{out}) - m(Q_{in})}. \tag{22} \]

Equation (22) is the key for leak isolation. To assess the performance of this leak location method, experiments were carried out on our pipe prototype [11], an iron pipe 200 m long and 4 inches in diameter with six valves attached to it for leak simulation. Table 1 shows the percent deviations in locating the leak position. In each experiment one valve was fully open. Coriolis sensors were used.
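As an illustration of how Equations (5), (9), (10) and (22) fit together, the following minimal Python sketch computes the leak position from the two end flows and the two end piezometric heads. It is not the LabVIEW implementation used in the project; the function names and the numerical values of friction, flows and inlet head are assumptions chosen only for the example, while the pipe length and diameter follow the prototype described above.

```python
import math

G = 9.81  # gravity acceleration [m/s^2]


def mu(f, D):
    """Friction coefficient mu = f / (2 D A^2 g), as defined after Eq. (5)."""
    A = math.pi * D ** 2 / 4.0  # cross-sectional area of a circular pipe
    return f / (2.0 * D * A ** 2 * G)


def m(Q, f, D):
    """Friction head loss per unit length, m(Q) = mu * Q * |Q| (Eq. 5)."""
    return mu(f, D) * Q * abs(Q)


def piezometric_head(P, rho, h):
    """Eqs. (9)-(10): piezometric head = P / (rho g) + h."""
    return P / (rho * G) + h


def leak_position(Q_in, Q_out, Ht_in, Ht_out, L, f, D):
    """Eq. (22): z_b from end flows and end piezometric heads."""
    denom = m(Q_out, f, D) - m(Q_in, f, D)
    if denom == 0.0:
        raise ValueError("Q_in equals Q_out: no leak can be located")
    return (m(Q_out, f, D) * L + Ht_out - Ht_in) / denom


if __name__ == "__main__":
    # Illustrative values (assumed, not measured data from the paper).
    L, D, f = 200.0, 0.1016, 0.02      # 200 m pipe, 4 inch diameter
    Q_in, Q_out = 0.012, 0.010         # flows [m^3/s] upstream/downstream
    z_true = 80.0                      # synthetic leak position [m]
    Ht_in = 20.0
    # Build a consistent outlet head with the two-segment solution of (19).
    Ht_zb = Ht_in - m(Q_in, f, D) * z_true
    Ht_out = Ht_zb - m(Q_out, f, D) * (L - z_true)
    # The estimate recovers z_true up to floating-point rounding.
    print(leak_position(Q_in, Q_out, Ht_in, Ht_out, L, f, D))
```

Because the expression works on piezometric heads, the slope term of Equation (21) does not appear explicitly; the estimate becomes ill-conditioned when the inlet and outlet flows are nearly equal, which is why the sketch rejects a zero denominator.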
Figure 4: Union of three branches at point $z_b$ of the pipeline, with cross-sectional areas $A_1$, $A_2$ and $A_3$ (diagram labels: Pipe 1 with $Q_1$, $H_1$ over the length $z_b$ from $z = 0$; lateral branch with $Q_2$, $H_2$; Pipe 3 with $Q_3$, $H_3$ over the length $L - z_b$ up to $z = L$)

4 Pipeline friction

The D'Arcy-Weissbach friction is a function of the pipe parameters [6], [12] and of the operating conditions, such as the Reynolds number. For practical purposes the friction $f$ is obtained from tables provided by the pipe manufacturers. However, we observed that this value differs from the real one of a working pipeline where, even in steady state, the value is affected by noise caused by imperfections of the pipe inner surface and by attachments (nipples, elbows, etc.). Therefore a previously fixed value of $f$ is of no use in Equation (1).






ods are based on processing a residual that is a flow difference. Owing to our lack of experience, and at the suggestion of a supplier, we started our flow measurements with a paddle wheel flow sensor [13]. Later, as ultrasonic sensors are widely used in the field, we decided to switch to them [14], thinking that our measurements would improve. Finally, we concluded that success in leak detection and location depends strongly on sensor quality (make and sensing principle), so we acquired sensors based on the Coriolis effect [15].

An experiment carried out on our pipe prototype was to cause a leakage (outflow at an extraction point) and estimate its location with the measurements of the three sensors. Figure 6 shows the deviation of the calculated location depending on the type of sensor. Oscillations are observed around the operating point, which makes signal filtering necessary in the diagnosis process. Table 2 shows the leak location error: the paddle wheel and Coriolis sensors have similar errors, but the standard deviation is larger with the paddle wheel. For comparison, the fourth column presents the accuracy of the instruments; note that the standard deviation of the Coriolis error is about seventy times larger than the sensor accuracy. The observation here is that the quality of the results depends more on the behavior of the flow than on the accuracy of the instrument used.

[Plot: Leak location (real position = 49.8 m); vertical axis: Location [m]; series: Coriolis, Paddle wheel, Ultrasonic]

 Table 1: Location error in percentage of total pipe length
       Experiment   ∆zb [%]   Valve position [m]
            1        1.66          11.54
            2        2.93          49.83
            3        0.135         80.36
            4        0.54         118.37
            5        0.375        148.93
            6        3.42         186.95
          Mean       1.0

To overcome the problem of not knowing the right friction value, we proposed on-line friction estimation. In the following we show how to calculate this friction. We start from the steady-state momentum equation, Equation (4). Returning to the original parameters we get

\[ g\,\frac{dH}{dz} + \frac{f}{2DA^2}\,Q\,|Q| + g\sin\alpha = 0 \tag{23} \]

Solving the integral, with $H_0$ and $H_L$ the pressure heads at the beginning and at the end of the pipeline and $L$ its length, results in

\[ g\,(H_L - H_0) = -\left(\frac{f}{2DA^2}\,Q_\infty^2 + g\sin\alpha\right)L \tag{24} \]

where $Q_\infty$ is the volumetric flow in steady state; the absolute value disappears when the flow goes in one direction only. The friction then has the following expression

\[ f = \frac{2DA^2 g}{L}\,\frac{H_0 - H_L - L\sin\alpha}{Q_\infty^2} \tag{25} \]
   Equation (25) is used to calculate on line the friction
value, as is shown in Figure 5, experiment realized in our                                        10
pipeline prototype. The calculated friction has a consider-
                                                                                                   0
able amount of noise, but this noise can be attenuated via
weighted mean value with forgetting factor (MVFF, contin-                                        −10
                                                                                                    0     1000   2000   3000   4000      5000    6000     7000
uous line in figure). Actually, we are working on the use of                                                              Samples
recursive identification procedures for a better friction esti-
mated.                                                                                            Figure 6: Leak location with the three sensors
               0.026
                                                          Unfiltered
                                                          MVFF
                                                                                                          Table 2: Leak location errors
               0.025
                                                                                                 Sensor          Error Error STD Accuracy
                                                                                                                 [%]        [%]         [% FS]
                                                                                                 Paddle wheel -0.28         3.36         0.50
    Friction




               0.024
                                                                                                 Ultrasonic      2.12       1.39         2.00
                                                                                                 Coriolis        0.28       0.84         0.05
               0.023


                                                                                 One of our goals in the SCADA expansion project was to
               0.022
                                                                              deliver results in real time. For this, sensors experiments
                   0        100           200       300           400         were performed to determine which one would have the
                                        Samples                               faster response. An index to take into account is the time
                                                                              response, it can be appreciated in Figure 6 but is practically
                Figure 5: Friction estimated, raw and filtered                the same, therefore we measured the settling time from the
                                                                              moment when the leakage valve is opened. In Figures 7,
                                                                              8 and 9 the flow development is observed, dotted line indi-
                                                                              cates the time when the leakage valve is opened to 100%. In
5 Influence of sensors on location                                            Table 3 are the measured times, being the ultrasonic sensor
Flow measurement in a pipeline is fundamental for leak lo-                    which requires more time (this by the number of points used
cation, in view that most of the pipeline leak detection meth-                to calculate a mean value).
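The settling time measured from the valve opening can be sketched as follows. This is only an illustration: the text does not state the criterion used, so the tolerance band (2% of the final value), the number of tail samples used to estimate the final value, and the function name `settling_time` are all assumptions.

```python
def settling_time(times, values, t_open, band=0.02, tail=10):
    """Time from t_open until the signal stays inside a band around its final value.

    band and tail are assumed parameters, not values taken from the paper.
    """
    if len(times) != len(values) or len(values) < tail:
        raise ValueError("need matching sequences with at least `tail` samples")
    # Estimate the final (settled) value from the last `tail` samples.
    final = sum(values[-tail:]) / tail
    tol = band * abs(final)
    # Find the last sample after the valve opening that is outside the band.
    last_outside = None
    for t, v in zip(times, values):
        if t >= t_open and abs(v - final) > tol:
            last_outside = t
    if last_outside is None:
        return 0  # never left the band after the opening
    # The signal counts as settled at the first sample after that excursion.
    later = [t for t in times if t > last_outside]
    if not later:
        return None  # record ends before the signal settles
    return later[0] - t_open
```

With a 1 s sampling period, as in these experiments, the result is directly in seconds and is resolved only to the nearest sample.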




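As a sketch of the on-line friction estimate of Equation (25) and of the smoothing shown in Figure 5, the code below evaluates the formula sample by sample and then filters it. Two points are assumptions rather than statements from the text: the cross-section A is taken as that of a circular pipe, A = πD²/4, and the MVFF is written as a recursive exponentially weighted mean with a hypothetical forgetting factor of 0.95. SI units are assumed (heads in m, flow in m³/s).

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def friction_factor(h0, hl, q_inf, length, diameter, alpha=0.0, g=G):
    """Equation (25): f = 2 D A^2 g (H0 - HL - L sin(alpha)) / (L Q_inf^2)."""
    if q_inf == 0:
        raise ValueError("steady-state flow must be non-zero")
    area = math.pi * diameter ** 2 / 4.0  # assumed circular cross-section
    head = h0 - hl - length * math.sin(alpha)
    return 2.0 * diameter * area ** 2 * g * head / (length * q_inf ** 2)


def mvff(samples, forgetting=0.95):
    """Weighted mean with forgetting factor (assumed recursive form)."""
    filtered = []
    mean = None
    for x in samples:
        # Old information is weighted by `forgetting`, the new sample by the rest.
        mean = x if mean is None else forgetting * mean + (1.0 - forgetting) * x
        filtered.append(mean)
    return filtered
```

A forgetting factor closer to 1 gives a smoother but slower estimate; the value used in the prototype is not given in the text.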
Table 3: Sensors settling time
Sensor          ts [s]
Paddle wheel       3
Ultrasonic        35
Coriolis           4

[Figure 7 plot: flow [L/s] vs. time [min]; series: Qe, Qs; leak start marked]
Figure 7: Flow measurement at the pipe ends, paddle wheel sensors

[Figure 8 plot: flow [L/s] vs. time [min]; series: Qe, Qs; leak start marked]
Figure 8: Flow measurement at the pipe ends, ultrasonic sensors

[Figure 9 plot: flow [L/s] vs. time [min]; series: Qe, Qs; leak start marked]
Figure 9: Flow measurement at the pipe ends, Coriolis sensors

   Considering the settling time and the noise in the measurements (taking the STD as its
measure), the Coriolis sensor has the best performance. The experiments shown in this section
were made with a 1 s sampling period.

6 Asynchronous data and data bases
In academia, we are used to working with benchmark systems or laboratory facilities with ad
hoc data acquisition systems, sufficient sensors, controlled environments, etc. But these
conditions do not necessarily hold in practice, as was the case of the SCADA expansion, where
the flow and pressure sensors of the pipeline could not be accessed directly, but only
through a database. So the solution adopted was as follows:

  1. The leak locator runs on a dedicated computer, independent of the system that regulates
     the distribution of the fluid; it connects to the database server, see Figure 10, via
     intranet or VPN (Virtual Private Network) connection in a LAN (Local Area Network)
     system.
  2. With proper permission, a program built with the Visual Studio 2010 tool runs every
     minute (it is a program without GUI, Graphical User Interface, that runs silently),
     brings system data and creates a database with the pipeline flow and pressure
     information required by the locator for proper operation.
  3. The locator program (made in the LabVIEW platform, [16]) periodically takes data
     (through the Microsoft SQL data server), applies the detection algorithm and, when it
     detects a leak, proceeds to locate it, displays the location of the leak on the screen
     (Figures 12 and 13), generates a visual warning and creates a file with the leak data.

Figure 10: Communication scheme between leak locator and database

   But the data acquisition system of the SCADA does not meet the condition of sampling the
system variables with a constant sampling period. The nominal sampling period was 3 min, but
in reality it varies from one to several tens of minutes. On the other hand, the locator was
assigned a sampling period of 3 min, determined by the condition that nominally SCADA
performs a polling of all measuring stations in that time span. To have a value of flow and
pressure of each station at every sampling time, an algorithm was added to the locator that
extrapolates the data when it is missing. Two algorithms were tested: one that retains the
last value during the following sampling periods and one that generates a straight line
through the last two available values; when the value of the variable brought from the
database is not a new one, the value given by the straight line is used. In order to compare
results with both proposals a simulation with real data with
three leaks was carried out; Figure 11 shows the real and extrapolated input flow data. It
can be seen that at certain intervals the extrapolation by a straight line delivers values
that may be beyond the normal range of measurements. This situation is exacerbated in long
intervals without data, as the line grows monotonically and delivers data outside the region
of validity. Figure 12 shows the location of a leak when extrapolated data are used and
Figure 13 when retained data are used. The pipe length is about 20 km and the original leak
location was about 10 km, so retention has outperformed extrapolation, since the latter
yields locations beyond the length of the pipe.

[Figure 11 plot: flow [m3/min] vs. time [min]; series: Extrapolation, Retained, Real]
Figure 11: Graphics with original, extrapolated and retained data of input flow with three
leaks

Figure 12: Leaks location with extrapolated data

Figure 13: Leaks location with retained data

6.1 Alternate database communication
As part of the project requirements, an alternate way of communication with the SCADA
database was tested. In the previous section the communication between leak locator and
database was direct through a LAN system; the alternate way was through a third party via
internet and VPN connection. Figure 14 shows the principal elements of this scheme.
   The client is the computer with the locator program built in the LabVIEW platform, which
performs basically two activities: leak detection and location, and requesting and sending
data to the communications broker using JSON strings. The remote client interface is a Java
process that runs locally and handles communication, authentication, data formatting,
encryption and security of the communication with the data server. It connects to the
database in the SCADA through TCP sockets and VPN.

Figure 14: Communications between client and database

   For data handling the JSON format is used, which is broadly used for information
interchange through the internet. JSON (JavaScript Object Notation) is a text format for
data interchange, easy for humans to read and write [17]. A JSON object is a collection of
pairs {variable name : value}, realized in different languages as an object, record,
structure, dictionary, hash table, keyed list, or associative array; see Figure 15 for an
object example.

Figure 15: JSON data format for an object

   An example of a JSON string for reporting a leak is the following:
{"service":"event",
"options": {
  "action":"new",
  "vector": {
     "Module":XXX,
     "EventID":XXX,
     "Quantity":XXX,
     "PipeID":XXX,
     "Location":XXX,
     "TimeEvent":"yyyymmddhhmmss"}
}}

   The communications broker attends client requests (the leak locator is not the only
client) and also SCADA requests. The database attached to the broker contains not only
pipeline data but also data generated by the other clients. Finally, the SCADA has an
interface in which information on leakage events is displayed.
   Figure 16 shows a test run with real data but off line. That experience showed that the
locator did not always receive answers from the broker. This communications scheme is still
in development.
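A leak report with the structure of the JSON string above could be assembled as in the following sketch. The field names come from the example; the value types, the sample values and the function name `build_leak_event` are hypothetical, since the text only shows `XXX` placeholders.

```python
import json
from datetime import datetime


def build_leak_event(module, event_id, quantity, pipe_id, location, when):
    """Serialize a leak event in the layout of the example message."""
    message = {
        "service": "event",
        "options": {
            "action": "new",
            "vector": {
                "Module": module,
                "EventID": event_id,
                "Quantity": quantity,
                "PipeID": pipe_id,
                "Location": location,
                # Timestamp in the yyyymmddhhmmss layout of the example.
                "TimeEvent": when.strftime("%Y%m%d%H%M%S"),
            },
        },
    }
    return json.dumps(message)


# Hypothetical values, for illustration only.
text = build_leak_event(1, 42, 0.5, 7, 10.2, datetime(2015, 8, 31, 12, 0, 0))
```

Building the message from a dictionary and serializing it with a JSON library avoids hand-written quoting errors in the string sent to the broker.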




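The two strategies compared in Section 6 for filling missing samples, retention of the last value and straight-line extrapolation through the last two available values, can be sketched as below. Marking a missing sample with `None` and the function names are assumptions made for this illustration; the locator itself is a LabVIEW program and its code is not shown in the text.

```python
def fill_retained(samples):
    """Replace each missing sample (None) by the last value received."""
    filled = []
    last = None
    for x in samples:
        if x is not None:
            last = x
        filled.append(last)
    return filled


def fill_extrapolated(samples):
    """Replace missing samples by a straight line through the last two real values."""
    filled = []
    known = []  # (index, value) pairs of the two most recent real samples
    for i, x in enumerate(samples):
        if x is not None:
            known = (known + [(i, x)])[-2:]
            filled.append(x)
        elif len(known) == 2:
            (i0, v0), (i1, v1) = known
            slope = (v1 - v0) / (i1 - i0)
            filled.append(v1 + slope * (i - i1))
        elif len(known) == 1:
            filled.append(known[0][1])  # only one point so far: hold it
        else:
            filled.append(None)  # nothing received yet
    return filled


flow = [16.0, 16.5, None, None, None]
held = fill_retained(flow)
line = fill_extrapolated(flow)
```

In this small case the retained series stays at 16.5 while the extrapolated one keeps growing by 0.5 per period, which is the monotonic drift outside the valid range described for Figure 11.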
Figure 16: Off line experiment with real data. Detail of the graph, y axis is leak location
in km

7 Conclusions
An interesting result is that a pipeline with a certain topography may be analyzed as a
horizontal pipe in which the piezometric head is a sum of measurements and terrain heights,
Equation (10), as seen in Section 2.
   Compared with traditional methods for locating a leak in a pipe, the method shown here,
Equation (22), requires less computational effort and has a simple expression.
   Another relevant result is the expression for on-line calculation of the pipeline
friction, Equation (25), for which it is enough to measure the pressure at the ends and the
steady-state flow. The value of friction was found to be a key parameter for the exact
location of the leak. It should be noted that when a leak occurs the pressures change,
modifying the friction value; in order to avoid a wrong location of the leak we keep a
delayed value of friction that is frozen when the leak alarm occurs.
   On the other hand, we highlight the importance of choosing the appropriate sensor. It is
not enough to choose a sensor capable of measuring a certain physical variable; the purpose
for which the measurements are needed must also be included in the selection process.
   The world of measurements for control targets is not limited to direct measurement of the
physical variable; it is possible to achieve the control objectives with indirect
measurements, as was the case of reading the plant variables via the network from a
database. Also, with the partial absence of data we cannot use the plant model to predict
data, so the use of extrapolation methods proves to be a powerful tool that helped to
achieve the goal of this project; in this paper we use two simple methods, but this is an
area that we continue to explore.
   The experience with JSON format strings showed that it is easier to work with text
characters than with specialized database commands and that, regardless of the VPN
connection and data encryption, the scheme depends strongly on internet conditions. If the
internet fails, the leak detection scheme fails, a situation that scarcely appears when the
locator connects to the database through a LAN system.
   At the time this paper was written our FDI system is in the test stage at the SCADA
facilities and we are waiting for field results.

8 Acknowledgments
The authors are very thankful to Jonathán Velázquez, who helped us by solving the database
issues that emerged in this project.

References
[1] I. Bazilescu and B. Lyhus. Russia oil spill.
     http://www1.american.edu/ted/KOMI.HTM, 1994.
[2] H. Siebert and R. Isermann. Leckerkennung und -lokalisierung bei pipelines durch
     on-line-korrelation mit einem prozessrechner. Regelungstechnik, 25:69–74, 1977.
[3] R. Isermann. Process fault detection based on modeling and estimation methods - a
     survey. Automatica, 20(4):387–404, July 1984.
[4] C. Verde, S. Gentil, and R. Morales. Monitoreo y diagnóstico automático de fallas en
     sistemas dinámicos. Trillas, 2013.
[5] M. Hanif Chaudhry. Applied Hydraulic Transients. Springer, third edition, 2014.
[6] C. F. Colebrook and C. M. White. Experiments with fluid friction in roughened pipes.
     Proceedings of the Royal Society of London, 1937.
[7] J. Saldarriaga. Hidráulica de acueductos. McGraw-Hill, 2003.
[8] R. Bansal. Fluid mechanics and hydraulic machines. Laxmi Publications (P) LTD, 2005.
[9] H. Mahgerefteh, A. Oke, and O. Atti. Modelling outflow following rupture in pipeline
     networks. Chemical Engineering Science, (61):1811–1818, 2006.
[10] PipelineStudio. Software in energy solutions international.
     http://www.energy-solutions.com/, 2010.
[11] R. Carrera and C. Verde. Prototype for leak detection in pipelines: Users Manual.
     Instituto de Ingeniería, UNAM, Ciudad Universitaria, D.F., Noviembre 2010. In Spanish.
[12] G. Papaevangelou, C. Evangelides, and C. Tzimopoulos. A new explicit relation for the
     friction coefficient in the Darcy-Weisbach equation. In PRE10: Protection and
     Restoration of the Environment, Corfu, 05-09 July, 2010.
[13] G. Fisher. Signet 2540 high performance flow sensor. Georg Fisher Signet LLC,
     El Monte, CA, 2004.
[14] Panametrics. Two-Channel TransPort Mod. 2PT868 Portable Flowmeter. User’s Manual.
     PANAMETRICS, Inc., Waltham, MA, USA, 1997.
[15] E+H. Proline Promass 83 Operating Instructions. Endress + Hauser Flowtec,
     Greenwood, IN, USA, 2010.
[16] National Instruments. LabVIEW user manual. National Instruments Corporation,
     Austin, 2013.
[17] JSON Organization. Ecma-404 the json data interchange standard.
     http://www.json.org/.




             Automatic Model Generation to Diagnose Autonomous Systems

               Jorge Santos Simón1 and Clemens Mühlbacher1 and Gerald Steinbauer1
                                   1
                                     Institute for Software Technology
                            e-mail: {jsantos, cmuehlba, gstein}@ist.tugraz.at



                        Abstract
     Autonomous systems’ dependability can be improved by performing diagnosis during
     run-time. This can be achieved through model-based diagnosis (MBD) techniques. The
     required models of the system are for the most part handcrafted. This task is time
     consuming and error prone. To overcome this issue, we propose a framework to generate
     formal models out of natural language documents, such as technical requirements or
     FMEA, using natural language processing (NLP) tools and techniques from the knowledge
     representation and reasoning (KRR) domain. With this, we aim to enable the usage of
     MBD in autonomous systems with little extra burden. In doing so, we expect a
     significant increase in the usage of MBD techniques on real-world systems.

1 Introduction
Dependability is a key feature of modern autonomous systems. It can be achieved by sound
design and implementation, thorough testing and runtime diagnosis. To date, all these
processes are still not completely automated and need substantial manual work. However, all
these fields can greatly benefit from the use of model-based techniques. Design and
implementation can be greatly improved through model-driven engineering, as stated in [1].
Model-based testing (MBT) has been demonstrated [2] to outperform traditional testing
techniques in both invested time and number of errors found. Model-based diagnosis (MBD) is
the main target of this work. It has been successfully used in industrial settings [3],
reducing the need for human intervention. Although it has been increasingly adopted in
recent years, we believe that its full potential is still to be developed.
   All model-based techniques require appropriate models of the system. As stated in [4; 5],
creating these models is the most prevalent limiting factor for their adoption. To overcome
this barrier, we propose a method that automates model creation from the documents used
during the system design. These comprise requirements documents, architectural designs, FMEA
and FTA, among others. The content of these documents is often given in natural language and
in semi-structured form and lacks a common semantics. Thus, the contained information is not
accessible for a computer. However, advances in natural language processing (NLP) and the
availability of common sense and domain-specific knowledge bases (e.g. Cyc [6], RoboEarth
[7]) make semi-automated derivation of models possible. Despite recent advances in this area
[8; 9; 10], most techniques focus on very specific applications of the generated formal
models. Thus, we pose the problem of generating a common knowledge base as an intermediate
representation with a well defined semantics out of documents used during the system design
process. From this central repository, different algorithms can extract different formal
models for particular needs. We believe that this work can increase the acceptance of
model-based techniques and broaden their use.
   The motivation for this work came during the development of a model-based diagnosis and
repair (MBDR) system for an industrial application. The aim is to improve the dependability
of a fleet of robots that automatically deliver goods in a warehouse. As stated in [11],
even minor failures often prevent a robot from accomplishing its task, decreasing the
overall performance of the system. Moreover, the frequent need for human intervention
increases costs and customer dissatisfaction. Using MBDR techniques, many of these failures
can be automatically handled, allowing the robot to remain in service, perhaps with its
capabilities gracefully degraded [12; 13]. In extreme cases, diagnosing a failure in time
can prevent robot behaviors harmful for humans, the robot itself or other elements in the
environment.
   Confronted with the lack of any formal model of the system, we were forced to manually
code the models we need. However, this is both a time-consuming and error prone task, and
also imposes a maintenance burden as the system evolves. Accordingly, we believe that a
mostly automated approach is not only convenient for the intended project but can also help
extend the use of MBDR techniques to other projects and domains. Following this idea, we
propose a framework that, in a first step, gathers the information from the project together
with domain and common-sense knowledge in a machine-understandable knowledge base. Then, a
suite of algorithms can extract formal models from this knowledge base for particular
purposes. Though our aim is to automate the process as much as possible, human assistance
will be requested whenever some pieces of information are missing or contradictory [14; 15].
   The novelty of our proposal is two-fold: first, we emphasize the usability of the
resulting models for MBD. Second, we aim to integrate all the sources of information
typically available in an industrial development process, such as requirements,
architecture, and failure modes. As a result, we expect to boost the range and applicability
of the automatically generated
models. To better illustrate the proposed                         3 Framework overview
framework, we will use a small running example extracted          We propose the framework depicted in Figure 1 to transform
from a real-world application. It concerns the robot’s box        informal documents and knowledge into models suitable for
loading operation, performed by the robot’s load handling         MBD. The informal inputs (white squares with solid lines)
device (LHD).                                                     are processed into intermediate representations (light gray
   The remainder of the paper is organized as follows: Re-        squares with dashed lines) using techniques from NLP and
lated research on model generation is discussed in Section        KRR, as well as ontologies (e.g. Cyc). We condense them
2. Section 3 provides an overview of the proposed process.        into a knowledge base together with all our knowledge about
Section 4 describes the inputs used, while Section 5 de-          the system and its domain. Finally, a variety of algorithms
scribes the proposed NLP and KRR tool-chain to interpret          can produce formal models suitable for MBD (gray squares
them. Section 6 provides an example of an output model            with dot-dash lines).
and its use for MBD. Finally, Section 7 summarizes the pre-
sented framework and discusses future work.                       4 Sources of information
                                                                  The proposed framework takes artifacts from the design
                                                                  phase as inputs. We propose the use of the following four in-
2 Related research                                                puts, though additional sources can be incorporated if avail-
                                                                  able:
We start the brief discussion of related research with the          1. Requirements document: The technical requirements
work using NLP methods to derive models. The work of                    document describes the expected system behavior.
[9] uses NLP methods to derive a formal model out of re-                Therefore, it is a mandatory input. The models’ quality
quirements. This formal model can afterwards be trans-                  and so the resulting MBD will heavily depend on the
formed into different representations to test or synthesize             quality of the requirements. Thus, iterative improve-
the system. The method proposed in [10] uses NLP meth-                  ment of the requirements and models is used, as pro-
ods to derive design documents (class diagrams, etc.) out               posed in [15]. For our running example, we have taken
of requirements. These design documents can afterwards be               four requirements that describe the box loading process
used to implement the system. The authors of [8] propose                of a robot:
a method to extract action recipes from websites. These                  (a) When the robot is docked, it lowers the barrier.
action recipes comprise the desired behavior in order to                 (b) When the robot is ready to load, the load handling
achieve a given goal. The method uses how-to instructions                      device starts rotating backward.
and NLP tools to derive an action recipe which can be ex-                (c) The load handling device stops rotating back-
ecuted by a robot. Missing parts are inferred with the help                    wards when the laser beam is triggered.
of common sense knowledge about actions. In contrast to
all these approaches, we propose a framework which incor-                (d) After stopping the load handling device the barrier
porates different information sources to get a better under-                   is raised.
standing of the system. Furthermore, our framework gen-             2. Domain knowledge: This is the most fuzzy input, as
erates different models out of an internal formal description           it is available not as an artifact but as the knowledge
depending on the needs of the intended diagnosis and testing            and experience of the engineers involved. We dis-
tasks.                                                                  tinguish three kinds of knowledge. Common sense
   Besides NLP methods, machine learning can also be used               knowledge can be provided by existing ontologies such as
to generate a model of the system. The work in [4] pre-                 Cyc [16]. Generic knowledge about the autonomous
sented a method to statistically learn the model of the sys-            systems domain can be provided by dedicated ontolo-
tem under nominal conditions. The model describes the                   gies such as KnowRob [17]. Particular knowledge about the
static interaction of the system components. In contrast, the           targeted system itself can be partially inferred from the
method proposed in [5] learns the behavior of a system. The             system architecture, though other parts must be pro-
method infers from observed events similar/different states             vided by the project engineers. The use of ontologies
and merges similar ones. Furthermore, the variables in the              ranges from providing meaning to natural language con-
system for each state are estimated. Both methods are only              cepts to inferring missing pieces of information.
applicable if the system is already built. Instead, we create       3. Architecture: The architecture of the system defines its
a model during the design phase, and so the model can be                composing elements plus the relations between them.
used right at the first stages of the life-cycle.                       It is typically described as a set of diagrams generated
   Missing or contradicting information must be detected                during the design phase of the system. For our run-
and handled when generating models. The method in [15]                  ning example, we use the architecture excerpt depicted
tries to avoid faults in the requirements document. This is             in Figure 2. It states that a robot consists of a LHD
done through the transformation of the requirements into so-            and other unspecified elements. Furthermore, the LHD
called boilerplates. Through this semi-structured text, am-             consists of a laser beam, rollers and a barrier.
biguities are removed and a consistent naming is enforced.          4. Failure Modes and Effects Analysis: FMEA looks at
A different approach was proposed in [14] to diagnose a                 all potential failure modes, their effects and causes and
knowledge base for consistency. If the knowledge base is                determines a risk priority factor. FMEA can be used to
inconsistent, the user is asked as an oracle to pinpoint the            determine which potential errors are critical, how they
problem. Afterwards, the user needs to fix this issue. In our           can be pinpointed, and how the effects thereof can be
framework, we will use ideas from both methods to derive a              avoided [18]. We incorporate the failure modes into
consistent knowledge base of the system.                                the resulting behavior models to diagnose these known








Figure 1: Abstract work-flow for the proposed framework. Starting from left with inputs in natural language, we generate
models that can be applied for diagnosis (right).




Figure 2: Robot architecture excerpt. The figure shows re-
lations of the type part of for components of the Robot.


     failures. For our running example, we include the two
     failure modes that can occur during the load operation,
     depicted in Table 1.                                          Figure 3: Sample syntax tree of the first sentence (a) of the
  The biggest challenge for handling all these inputs is to        running example.
understand semi-structured information. So, we will depict
an NLP/KRR tool-chain using state-of-the-art techniques in
the following section.                                             Note for example that the 3rd person “s” has been removed
                                                                   from the verbs. Furthermore, complex terms such as “load
5 NLP/KRR tool chain                                               handling device” have been replaced by lhd. Finally, the
                                                                   order of the propositions is rearranged into a consistent structure.
The process generates three intermediate artifacts: semi-
formal text (boilerplates), syntax trees and semantic cate-        5.2 Syntax trees
gories. As a showcase, we will concentrate on the require-         A syntax tree comprises the information of the type of each
ments of our running example, though these techniques can          word in the sentence, e.g. “lower” is a verb. Furthermore,
be extended to other textual inputs, as we will see at the end     the tree specifies how the sentence is constructed with these
of this section.                                                   words. For example, the syntax tree of the first require-
                                                                   ment in our running example is depicted in Figure 3. In this
5.1 Boilerplates                                                   syntax tree we can identify that “robot” is a noun and “the
This is a semi-formal representation where most of the             robot” is a so-called noun phrase. An example of a tool to
spelling errors, poor grammar and ambiguities have been            extract syntax trees is the probabilistic context-free grammar
removed. Boilerplates also enforce the use of a consistent         parser, described in [20].
naming scheme. There exist tools such as [19] to perform
this task semi-automatically. In our example, the four re-         5.3 Semantic categories
quirements become the four equivalent boilerplates:                The semantic categories conceptually describe our system,
(a) when the robot is docked, it lower the barrier.                e.g. a transition describing the motion of an actuator. These
                                                                   semantic categories are hierarchical in nature, as more com-
(b) when the robot is ready to load, the lhd start rotating        plex and abstract concepts are composed of simpler ones,
    backward.                                                      e.g. a transition is composed of an action, pre and post
(c) when the lb is triggered, the lhd stop backward rotation.      conditions, etc. We obtain the semantic categories by pars-
(d) after stopping the lhd, the barrier is raised.                 ing the syntax trees and applying transformation rules in a
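The normalization steps named in Section 5.1 can be sketched in Python. This is only a minimal illustration and not the tool of [19]: the lookup tables TERMS and VERBS and the function to_boilerplate are hypothetical names introduced here, and the reordering of propositions seen in boilerplate (c) is not attempted.

```python
import re

# Hypothetical lookup tables for the running example. A real tool would
# draw these from a domain ontology rather than hard-coded dictionaries.
TERMS = {"load handling device": "lhd", "laser beam": "lb"}
VERBS = {"lowers": "lower", "starts": "start", "stops": "stop"}


def to_boilerplate(requirement: str) -> str:
    """Lower-case a requirement, shorten known terms, strip 3rd-person 's'."""
    text = requirement.strip().lower()
    for term, short in TERMS.items():
        text = text.replace(term, short)
    # Replace whole words only, so a verb never matches inside another word.
    pattern = r"\b(" + "|".join(map(re.escape, VERBS)) + r")\b"
    return re.sub(pattern, lambda m: VERBS[m.group(1)], text)


if __name__ == "__main__":
    print(to_boilerplate("When the robot is docked, it lowers the barrier."))
```

Under these assumptions, requirements (a) and (b) map to the boilerplates listed in Section 5.1; requirements (c) and (d) would need additional rewriting rules.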






                       Component                     Failure             Observations
           Failure 1   Barrier                       Barrier stuck up    Barrier stuck up regardless of commands
           Failure 2   Load Handling Device (LHD)    Rotation fail       Laser beam not triggered

                                         Table 1: FMEA from the running example.


                                                                    1. Relations representing a direct transition, as depicted
                                                                       in Figure 4. Such a transition can be directly mapped
                                                                       into a transition on the automaton, as can be seen in
                                                                       Figure 5 through the transitions from state 1 to 2.
                                                                    2. Relations representing an action with a duration. Such
                                                                       a relation must be translated into several transitions:
                                                                       the start of the action, the termination event and a tran-
                                                                       sition to a final state. Such a transformed relation is de-
                                                                       picted in Figure 5 through the transition from state 2 to
                                                                       5.
                                                                    3. Relations representing a failure of the system. The
                                                                       failure event is represented as a divergent path from
Figure 4: Concepts created from the syntax tree in Figure 3.           a normal transition. Thus, the start state is the same
The word in quotes is the word as it appears in the sentence.          as the one of the normal transition. Afterwards, we
The word in parentheses is the Cyc concept it belongs to.              need a state representing the failure. Finally, we need
                                                                       an observation transition that leads to a final state rep-
                                                                       resenting a general failure of the system. The observ-
bottom-up fashion, following [8]. We start at the leaves of            able transition is due to the fact that we use a fault
the syntax tree, containing single words. Each word is                 model which is derived from the FMEA. Thus every
assigned a part-of-speech (POS) label describing its gram-             fault has an observable discrepancy to the real system.
matical role in the sentence. Furthermore, each word has an            Additionally, it is important to notice that the state rep-
additional label with its WordNet [21] synset, used to de-             resenting the general failure is a state where the system
rive its semantics from the common sense knowledge base.               can exhibit arbitrary behavior. Thus we can model the
From the leaves, higher-level transformations can be applied           lack of knowledge about the impact the fault has on the
to create more complex semantic categories. For example,               system. The transformed failure is depicted in Fig-
in our running example we create a semantic category for               ure 5 through the transitions from state 2 to 9.
each word in the sentence “lower the barrier”. Then, we
can derive that “lower” is an action acting on something.           4. Relations representing a failure of a system compo-
We can after that use the semantic category of the word to-            nent. The failure event is represented as a divergent
gether with its position in the syntax tree to apply further           path from a normal transition. To determine all the
transformation rules. This process is repeated till the root           possible affected transitions, we must perform an infer-
node is reached. Then, a new semantic category is assigned             ence of the effects each transition has. This inference is
to the sentence capturing its semantics. For the running ex-           based on common sense and domain knowledge. In our
ample, the semantic category for “lower the barrier” is a              running example, we can infer that lowering the barrier
transition. A transition must contain a precondition, a post           causes the barrier to be finally down. A failure such
condition, an action and optionally an object of the action.           as barrier_stuck_up can prevent this transition, and so
The semantic category specifies that the action “lower” is             they can share a common source state. Then, as before
performed on the object “barrier”. With the help of com-               we need an observation transition that leads to a final
mon sense (Cyc ontology [16]) we can reason that this ac-              state representing a general failure of the system. Such
tion moves the “barrier” from state “up” to state “down”.              a sequence is depicted in Figure 5 through the transi-
Thus, we can infer the pre and post conditions of “lower”.             tions from state 1 to 9 through the states 7 and 8.
Finally, the semantic category together with the reasoning
results are packed into statements in our knowledge base,         7 Conclusion and future work
as it is depicted in Figure 4.
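The derivation of a transition for “lower the barrier” can be sketched in Python. This is a flattened stand-in for the bottom-up rules of Section 5.3: it works on POS-tagged leaves only rather than on a full syntax tree, and the table ACTION_EFFECTS is a hypothetical substitute for the common sense knowledge that the paper obtains from Cyc. The names Transition and build_transition, and the Penn Treebank style tags, are assumptions made for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Transition:
    """Semantic category with the four parts named in Section 5.3."""
    action: str
    obj: Optional[str]
    precondition: str
    postcondition: str


# Hypothetical stand-in for common sense knowledge: the state an object
# is in before and after an action is applied to it.
ACTION_EFFECTS = {"lower": ("up", "down"), "raise": ("down", "up")}


def build_transition(tagged_words: List[Tuple[str, str]]) -> Transition:
    """Build a transition from (word, POS tag) leaves of one phrase."""
    # The first verb is taken as the action, the first noun as its object.
    action = next(w for w, pos in tagged_words if pos.startswith("VB"))
    obj = next((w for w, pos in tagged_words if pos.startswith("NN")), None)
    if obj is None:
        raise ValueError("this sketch only handles actions with an object")
    # A missing entry raises KeyError; in the framework this is where
    # human assistance would be requested.
    before, after = ACTION_EFFECTS[action]
    return Transition(action, obj, f"{obj} is {before}", f"{obj} is {after}")


if __name__ == "__main__":
    leaves = [("lower", "VB"), ("the", "DT"), ("barrier", "NN")]
    print(build_transition(leaves))
```

The frozen dataclass keeps each derived statement immutable once it is stored, which is one possible design choice for knowledge base entries.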
   We can incorporate other documents into the knowledge          In this paper we propose a framework to automatically gen-
base by using a similar NLP tool chain. However, how the          erate formal models out of documents represented in semi-
information is treated depends heavily on the context inher-      structured form and natural language (requirements, domain
ent to each document type.                                        knowledge, architecture, failure modes, etc.). The parsed in-
                                                                  formation is gathered together with domain knowledge in a
                                                                  knowledge base. Accessing this common repository, a va-
6 Model generation for behavior diagnosis                         riety of algorithms can generate different kinds of models
To illustrate how the framework can be used to diagnose           for different purposes. Our main target is to derive models
the behavior of the robot, we create an automaton as output       suitable for state-of-the-art MBD techniques applied to au-
model. To use techniques such as [22], the automaton must         tonomous systems. We plan to implement this framework
describe both nominal and faulty behaviors of the system.         to assist us in creating the models required for MBD. Do-
To generate this automaton from the knowledge base, we            ing so, we expect to improve the dependability in the indus-
use four different relations stated in it as transitions:         trial application of a fleet of transport robots in a warehouse.
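The mapping from the four relation types of Section 6 to automaton transitions can be sketched in Python. This is a minimal illustration under stated assumptions: all state and event names are hypothetical, the automaton does not reproduce the state numbering of Figure 5, and each failure uses a single intermediate state. Relation types 3 and 4 share one method here, because the inference that selects the affected source state is not modeled.

```python
from collections import defaultdict

GENERAL_FAILURE = "general_failure"  # assumed name for the catch-all state


class Automaton:
    def __init__(self, initial):
        self.initial = initial
        self.edges = defaultdict(dict)  # state -> {event label: next state}
        self.final = set()

    def add(self, src, label, dst):
        self.edges[src][label] = dst

    def add_direct(self, src, action, dst):
        """Relation type 1: one relation maps to one transition."""
        self.add(src, action, dst)

    def add_durative(self, src, action, end_event, finish, dst):
        """Relation type 2: start, termination event, move to a final state."""
        running, ended = f"{action}_running", f"{action}_ended"
        self.add(src, f"start_{action}", running)
        self.add(running, end_event, ended)
        self.add(ended, finish, dst)
        self.final.add(dst)

    def add_failure(self, src, fault, observation):
        """Relation types 3 and 4: diverge from src, then observe the fault."""
        faulty = f"{fault}_occurred"
        self.add(src, fault, faulty)
        self.add(faulty, observation, GENERAL_FAILURE)
        self.final.add(GENERAL_FAILURE)

    def run(self, events):
        """Follow events from the initial state; None if one is not enabled."""
        state = self.initial
        for event in events:
            if state == GENERAL_FAILURE:
                continue  # arbitrary behavior: implicit self-loops on all labels
            state = self.edges[state].get(event)
            if state is None:
                return None
        return state


def build_example():
    """Running example with hypothetical state and event names."""
    a = Automaton("docked")
    a.add_direct("docked", "lower_barrier", "ready_to_load")
    a.add_durative("ready_to_load", "rotate_backward", "lb_triggered",
                   "raise_barrier", "loaded")
    a.add_failure("docked", "barrier_stuck_up", "barrier_still_up")
    a.add_failure("ready_to_load", "rotation_fail", "lb_not_triggered")
    return a
```

Calling build_example().run(...) on a nominal event sequence ends in the state named "loaded" here, while a sequence containing a fault and its observation ends in the general failure state and stays there.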






                                                                           tory Automation (ETFA), 2011 IEEE 16th Conference
                                                                           on, pages 1–9. IEEE, 2011.
                                                                      [6] Cynthia Matuszek, John Cabral, Michael Witbrock,
                                                                           and John Deoliveira. An introduction to the syn-
                                                                           tax and content of Cyc. In Proceedings of the 2006
                                                                           AAAI Spring Symposium on Formalizing and Com-
                                                                           piling Background Knowledge and Its Applications to
                                                                           Knowledge Representation and Question Answering,
                                                                           pages 44–49, 2006.
                                                                      [7] Markus Waibel, Michael Beetz, Raffaello D’Andrea,
                                                                           Rob Janssen, Moritz Tenorth, Javier Civera, Jos
                                                                           Elfring, Dorian Gálvez-López, Kai Häussermann,
                                                                           J.M.M. Montiel, Alexander Perzylo, Björn Schießle,
                                                                           Oliver Zweigle, and René van de Molengraft.
                                                                           RoboEarth - A World Wide Web for Robots. Robotics
                                                                           & Automation Magazine, 18(2):69–82, 2011.
                                                                      [8] Moritz Tenorth, Daniel Nyga, and Michael Beetz. Un-
                                                                           derstanding and executing instructions for everyday
Figure 5: Automaton generated from the running example.                    manipulation tasks from the world wide web. In
Shaded states are reached through some fault. Double cir-                  Robotics and Automation (ICRA), 2010 IEEE Interna-
cled states represent final states. State number 9 is the gen-             tional Conference on, pages 1486–1491. IEEE, 2010.
eral failure state; for readability, the self-loops with all possi-  [9] Shalini Ghosh, Daniel Elenius, Wenchao Li, Patrick
ble labels are omitted.                                                    Lincoln, Natarajan Shankar, and Wilfried Steiner.
                                                                           Automatically extracting requirements specifi-
                                                                           cations from natural language.          arXiv preprint
Besides this immediate result, we expect that the proposed
                                                                           arXiv:1403.3142, 2014.
framework will ease the creation of formal models for other
applications. Thus, we hope to contribute to the widespread           [10] Sven J Körner and Mathias Landhäußer. Semantic en-
use of MBD techniques, with the consequent improvement of au-              riching of natural language texts with automatic the-
tonomous systems dependability.                                            matic role annotation. In Natural Language Process-
                                                                           ing and Information Systems, pages 92–99. Springer,
Acknowledgments                                                            2010.
                                                                      [11] Gerald Steinbauer. A survey about faults of robots
The research presented in this paper has received funding
                                                                           used in robocup. In Xiaoping Chen, Peter Stone,
from the Austrian Research Promotion Agency (FFG) under
                                                                           LuisEnrique Sucar, and Tijn van der Zant, editors,
grant 843468 (Guaranteeing Service Robot Dependability
                                                                           RoboCup 2012: Robot Soccer World Cup XVI, volume
During the Entire Life Cycle (GUARD)).
                                                                           7500 of Lecture Notes in Computer Science, pages
                                                                           344–355. Springer Berlin Heidelberg, 2013.
References                                                            [12] Gerald Steinbauer, Franz Wotawa, et al. Detecting and
[1] Stuart Kent. Model driven engineering. In Michael                      locating faults in the control software of autonomous
    Butler, Luigia Petre, and Kaisa Sere, editors, Inte-                   mobile robots. In IJCAI, pages 1742–1743, 2005.
    grated Formal Methods, volume 2335 of Lecture Notes               [13] Mathias Brandstötter, Michael Hofbaur, Gerald Stein-
    in Computer Science, pages 286–298. Springer Berlin
                                                                           bauer, and Franz Wotawa. Model-based fault diagnosis
    Heidelberg, 2002.
                                                                           and reconfiguration of robot drives. In 2007 IEEE/RSJ
[2] Mark Utting and Bruno Legeard. Practical model-                        International Conference on Intelligent Robots and
    based testing: a tools approach. Morgan Kaufmann,                      Systems (IROS), San Diego, CA, USA, 2007.
    2010.                                                             [14] Kostyantyn Shchekotykhin, Gerhard Friedrich, Patrick
[3] Peter Struss, Raymond Sterling, Jesús Febres, Um-                      Rodler, and Philipp Fleiss. A direct approach to
    breen Sabir, and Marcus M. Keane. Combining engi-                      sequential diagnosis of high cardinality faults in
    neering and qualitative models to fault diagnosis in air               knowledge-bases. In International Workshop on Prin-
    handling units. In European Conference on Artificial                   ciples of Diagnosis (DX), Graz, Austria, 2014.
    Intelligence (ECAI) - Prestigious Applications of Intel-          [15] Bernhard K Aichernig, Klaus Hormaier, Florian Lor-
    ligent Systems (PAIS 2014), pages 1185–1190, 2014.                     ber, Dejan Nickovic, Rupert Schlick, Didier Si-
[4] Safdar Zaman and Gerald Steinbauer. Automated Gen-                     moneau, and Stefan Tiran. Integration of Require-
    eration of Diagnosis Models for ROS-based Robot                        ments Engineering and Test-Case Generation via
    Systems. In International Workshop on Principles of                    OSLC. In Quality Software (QSIC), 2014 14th Inter-
    Diagnosis (DX), Jerusalem, Israel, 2013.                               national Conference on, pages 117–126. IEEE, 2014.
[5] Dennis Klar, Michaela Huhn, and J Gruhser. Symp-                  [16] Stephen L Reed, Douglas B Lenat, et al. Mapping
    tom propagation and transformation analysis: A prag-                   ontologies into Cyc. In AAAI 2002 Conference Work-
    matic model for system-level diagnosis of large au-                    shop on Ontologies For The Semantic Web, pages 1–6,
    tomation systems. In Emerging Technologies & Fac-                      2002.






[17] Moritz Tenorth, Alexander Clifford Perzylo, Reinhard
     Lafrenz, and Michael Beetz. The roboearth language:
     Representing and exchanging knowledge about ac-
     tions, objects, and environments. In Robotics and Au-
     tomation (ICRA), 2012 IEEE International Conference
     on, pages 1284–1289. IEEE, 2012.
[18] Hongkun Zhang, Wenjun Li, and Jun Qin. Model-
     based functional safety analysis method for automo-
     tive embedded system application. In International
     Conference on Intelligent Control and Information
     Processing, 2010.
[19] Stefan Farfeleder, Thomas Moser, Andreas Krall, Tor
     Stålhane, Herbert Zojer, and Christian Panis. Dodt:
     Increasing requirements formalism using domain on-
     tologies for improved embedded systems develop-
     ment. In Design and Diagnostics of Electronic Cir-
     cuits & Systems (DDECS), 2011 IEEE 14th Interna-
     tional Symposium on, pages 271–274. IEEE, 2011.
[20] Dan Klein and Christopher D. Manning. Accurate
     unlexicalized parsing. In Proceedings of the 41st
     Annual Meeting on Association for Computational
     Linguistics-Volume 1, pages 423–430. Association for
     Computational Linguistics, 2003.
[21] George Miller and Christiane Fellbaum. Wordnet: An
     electronic lexical database, 1998.
[22] Meera Sampath, Raja Sengupta, Stéphane Lafortune,
     Kasim Sinnamohideen, and Demosthenis Teneket-
     zis. Diagnosability of discrete-event systems. Au-
     tomatic Control, IEEE Transactions on, 40(9):1555–
     1575, 1995.








     Methodology and Application of Meta-Diagnosis on Avionics Test Benches

                 R. Cossé (1,2), D. Berdjag (2), S. Piechowiak (2), D. Duvivier (2), C. Gaurel (1)
           (1) AIRBUS HELICOPTERS, Marseille International Airport, 13725 Marignane France
                               {ronan.cosse, christian.gaurel}@airbus.com
           (2) LAMIH UMR CNRS 8201, University of Valenciennes, 59313 Valenciennes France
               {denis.berdjag, sylvain.piechowiak, david.duvivier}@univ-valenciennes.fr


                         Abstract                                   we call a meta-diagnosis.
                                                                    Many diagnosis approaches have been proposed to deal with
     This paper addresses Model Based Diagnosis for                 specific avionics problems. Two different classes of repre-
     the test of avionics systems that combines aero-               sentation are applied: data-based diagnosis or model-based
     nautic computers with simulation software. Just                diagnosis. The first one, as studied by Berdjag et al. [3] is
     like the aircraft, those systems are complex since             used to recognize faulty behaviors of an Inertial Reference
     additional tools, equipment and simulation soft-               System (IRS) thanks to normal or faulty categories of in-
     ware are needed to be consistent with the test re-             put/output data. In this work, data fusion of sensor outputs
     quirements. We propose a structural diagnostic                 is computed to eliminate faulty sources. In [2], the time
     framework based on the lattice concept to reduce               dependency is introduced in data of failure messages to im-
     the time of unscheduled maintenance when the                   prove problem detection.
     tests cannot be performed. Here, we also describe              In Model Based Diagnosis (MBD), Kuntz et al. [4] have
     a diagnosis algorithm that is based on the formal              studied an avionics system using minimal cuts notions. Be-
     lattice description and designed for test systems.             lard et al. have defined a new approach based on the MBD
     The benefit is to capture the system structure and             hypotheses called Meta-Diagnosis in [5] dealing with mod-
     communication specificities to diagnose the con-               el issues. Berdjag et al. [6] present an algebraic decompo-
     figuration, the equipment, the connections, and                sition of the model to reduce the complexity of the required
     the simulation software.                                       model-based diagnosers. Giap [7] has proposed a formalism
                                                                    of an iterative process to give a solution when models are not
1 Introduction                                                      complete, but it lacks applications to more complex in-
                                                                    dustrial systems. Nevertheless, it gives clues for an iterative
Avionics systems are complex since tens of subsystems and
                                                                    diagnosis. Another diagnostic software has been developed
components interact to achieve required functions. Exist-
                                                                    by Pulido et al. in [8] to perform consistency-based diagno-
ing devices for aircraft fault monitoring are based on ded-
                                                                    sis of dynamic system simulating diagnosis scenarios. The
icated avionics functions but the existing solutions are in-
                                                                    architecture is quite novel and is applied to the three-tank
sufficiently flexible for test systems and can be improved.
                                                                    system.
In [1], the framework of an health management algorithms
                                                                    Structural approaches as graph theory are also popular
for maintenance is described and implemented on an air-
                                                                    for MBD to describe the structure of the system as with
craft. In [2], the diagnostic of avionics equipments is per-
                                                                    Bayesian Networks in [9]. They enable us to incorporate
formed through dynamic fault trees. To prevent important
                                                                    the system complexity as with the lattice concept to inte-
failures on the aircraft, avionics systems are checked on rigs
                                                                    grate the sub-models dependencies. For example, in [10],
called Avionics Test Bench (ATB) composed of the avionics
                                                                    the lattice model represents fault modes to compute testable
equipments and flight simulation software.
                                                                    subsystems from redundancy equations. We want to get the
The environment of the ATB needs to be compliant with the
                                                                    main ideas that will serve our proposal. To our knowledge,
configuration of the avionics equipments. Faults of the ATB
                                                                    there is no method for the diagnostic of test systems based
can concern the avionics equipments, their configurations,
                                                                    on embedded softwares behaviour. Moreover, our proposi-
or the ATB itself i.e the movable connections and the simu-
                                                                    tion has been adapted from embedded systems to the ATB
lation software. Since it does not exist monitoring functions
                                                                    behaviour. Its complexity is relevant to the objectives of
of the ATB itself, a new method needs to be applied to pre-
                                                                    the avionics embedded systems certification, as for exam-
vent long periods of unavailability. In fact, during the devel-
                                                                    ple high levels of safety requirements, or the simulation of
opment of embedded softwares, its architecture and the test
                                                                    specific test conditions. In our model, we must consider the
environment surrounding the ATB are redesigned by adapt-
                                                                    fact that our representation must put forward the ATB be-
ing the test means to the specification’s requirements. Since
                                                                    haviour in case of failures concerning embedded systems,
the ATB is a test system, and the main knowledge are based
                                                                    connections, communications, simulation softwares and all
on its embedded systems, we need a new approach to deal
                                                                    settings to configure the test. Considering those features, the
with the ATB issues. As the embedded systems are already
                                                                    high number of needed ATB reconfigurations, it is proposed
tested on the ATB, and the test results are used to focus on
                                                                    a structural representation associated with hierarchical ver-
the ATB issues thanks to a new representation based on the
                                                                    ifications that reduce the faulty candidates. The motiva-
model of the test system, the diagnosis of the ATB is what




                                                              159
                        Proceedings of the 26th International Workshop on Principles of Diagnosis


tion of the proposed meta-diagnosis approach was presented          2.2 Diagnostic function
in [11]. Here, we propose an extended diagnosis methodol-           A basic diagnostic function is defined to help the diagno-
ogy originally defined by De Kleer, Williams [12], [13] and         sis: the check function. Depending on the granularity, the
Davis [14] and we present a software implementation run-            check function is applied on a component, a subsystem or
ning on a real ATB. It differs from the Belard et al.’s meta-       a partition. First, the checkC function is used to deter-
diagnosis definition because the ATB is still defined as the        mine if a component is faulty or not. However, we do not
main system under study. Here, we extend the diagnostic-            know precisely how a unique component behaves regarding
world tools for a specific system and due to the lack of            a fault. So we need to define the checkS function of a sub-
knowledge and data in case of issues, our proposal is based         system. The behaviour of a faulty subsystem may also not
on a MBD representation with a structural and functional            be sufficient to explain a fault. In fact, subsystems are inter-
decomposition without fault models.                                 connected making the system structure and the partitioning
First, we describe the diagnostic framework, the lattice-           concept allows us to focus on different levels of abstrac-
based representation used to model the ATB system and the           tion that we call granularities. In our study, we only focus
diagnostic algorithm. In the third section, we provide a de-        on faults with observable and measurable symptoms. These
scription of the ATB and the application of the lattice con-        faults can only be localized by testing a functionality on a
cept. In the fourth section, we illustrate the approach with a      specific architecture. That is why, functional and structural
case study of the ATB. In the final section, we describe the        partitions are used to decompose the system into testable
development of a software application to perform automati-          partitions.
cally the ATB diagnosis.
                                                                    Definition 3. The checkC function of a component ci is
                                                                    defined by:
2 Diagnostic framework                                              checkC : COM P S → {0, 1, −1} s.a checkC(c) = 0 if
                                                                    the component c is faulty, checkC(c) = 1 if the component
2.1 System representation                                           c is unfaulty and checkC(c) = −1 if the component state is
                                                                    unknown.
The system is composed of several subsystems that inter-
                                                                    Definition 4. The checkP function of a partition P is de-
act together to achieve a global function. The decomposi-
                                                                    fined by:
tions into subsystems is guided by the communication be-
                                                                    checkP : P → {0, 1, −1} s.a checkP (P ) = 1 ⇔
tween components to fulfill this goal. Partitions are used
                                                                    ∀σi ∈ P, checkS(σi ) = 1, checkP (P ) = 0 ⇔ ∃σi ∈
to decompose the system into functional and communica-
                                                                    P, checkS(σi ) = 0, and checkP (P ) = −1 ⇔ the checked
tions categories. So, there are two classes of partitions: the
                                                                    value is unknown.
partitions that represent the structure and the connections of
                                                                    Some partitions cannot be checked. The set of pos-
the system; and the partitions that represent the functions of
                                                                    sible checked partitions is Cons. It defined a con-
the system. As an example, P1 is associated with a func-
                                                                    straint. A constraint Cons is a subset of P s.a: ∀P ∈
tionality of the system P1 = {σ1 ; σ2 }, σ1 = {C1 } and
                                                                    Cons, checkP (P ) 6= −1.
σ2 = {C2 , C3 }. If a problem appears, i.e the functionality
is not performed, then a fault is detected for this partition P        Once the checkP value of a partition is known, we have
and symptoms are seen and linked to subsystems σ.                   to define the checkS function of subsystems that are not sin-
In the following paragraphs, we use the following notation:         gletons σi 6= {ci }. If the partition is faulty, either it exists
P for a partition, σ for a subsystem and ci for a compo-            a component ci ∈ σi such as checkC(ci ) = 0, or the com-
nent. S = {ci , i ∈ [1, n]} is the set of all the n components      munication between the components in σi is faulty. This
of a system. We note Σ the set of all subsystems, i.e the           is modeled by checkCom(σi ) = 0. If the partition is un-
power set of components. A partition P is a set of np sub-          faulty, then all communications between the components in
systems σi ∈ Σ: P = {σi , i ∈ [1, np ]|∀i 6= j; σi ∩ σj =           σi 6= {ci } are unfaulty and all singletons σi = {ci } are
         n
         Sp                                                         unfaulty.
∅, and      σi = S}. We note P the set of all partitions.
        i=1                                                         Definition 5. The checkCom function of a subsystem σi ⊆
We recall the definition 1 of inclusion relation between par-       COM P S is defined by:
titions and the definition 2 of multiplication.                     checkCom : Σ → {0, 1, −1} s.a checkCom(σi ) = 1 ⇔
                                                                    the communication between components in σi is unfaulty;
Definition 1. Two partitions P1 and P1 are said to be in            checkCom(σi ) = 0 ⇔
inclusion relation P1 ⊆ P2 if and only if every subsystems          the communications between components in σi is faulty.
of P1 is contained in a subsystem of P2 . The relation ⊆
means that P1 is a sub-partition of P2 .                               To help the diagnosis of the system, we decompose it
                                                                    into subsystems and we introduce the checkS function of a
Definition 2. The subsystems σk of the multiplication of two        subsystem σi ⊆ COM P S defined by:
partitions P = {σi , i ∈ [1, np ]} and Q = {σj , i ∈ [1, nq ]}
are defined by: ∀σk ∈ P × Q, ∃σi ∈ P, ∃σj ∈ Q, σk =                 Definition 6. checkS : Σ → {0, 1, −1} s.a checkS(σi ) =
σi ∩ σj .                                                           1 ⇔ ∀ci ∈ σi , checkC(ci ) = 1 ∧ checkCom(σi ) =
   This operation is used to order subsystems with respect to       1 ; checkS(σi ) = 0 ⇔ ∃ci ∈ σi , checkC(ci ) = 0 ∨
the proposed diagnostic algorithm. The inclusion relation ⊆         checkCom(σi ) = 0 and checkS(σi ) = −1 ⇔ ∃ci ∈
is used to organize the components with the lattice concept         σi , checkC(ci ) = −1 ∧ checkCom(σi ) = −1.
L (Σ, ⊆) with a partial ordering relation. It is different from        With the above definitions, it is now time to define the
the concept of partially ordered set (poset) because the ar-        diagnosis problem. Given a system representation with the
rangement of elements is not based on sets but on partitions.       lattice concept L (Σ, ⊆) and the set of constraints Cons =






{P ∈ 𝒫, checkP(P) ≠ −1}, the problem is defined by the consistency between L(Σ, ⊆), which contains the system representation, and Cons, which describes system issues.

Definition 7. The problem formulation is to find the faulty components whose current state may explain the constraints. It is defined as a function DIAG(L(Σ, ⊆)) under the constraints Cons.

There are two kinds of faults: the fault of a component Ci, modeled with checkC(Ci) = 0, and the communication fault of a subsystem σi = {Ci, Cj, ...}, modeled with checkCom(σi) = 0. With the P1 partition, suppose that C2 and C3 are linked with an ARINC 429 link that is not working. The constraint is checkP(P1) = 0 because the global function is broken. The reason is that checkCom(σ2) = 0. Knowing that checkCom(σ2) = 0 for the P1 functionality gives the information needed to fix the system.

2.3 Diagnostic algorithm

It is now necessary to introduce a diagnostic method whose aim is to solve the above problem. The algorithm is based on the following proposition that extends the verification from the multiplication of partitions to partitions, see Proposition 1. Then, a functional verification is propagated from partitions to subsystems, and from subsystems to components.

Proposition 1. ∀(P, Q) ∈ 𝒫², checkP(P × Q) = 0 ⇒ checkP(P) = 0 ∧ checkP(Q) = 0.

In order to increase the readability of the algorithm, it has been split into three parts. DIAG(L(Σ, ⊆)) is the main algorithm; it initializes the framework with the partitions of the system {pi, i ∈ [1, n]} and the constraints Cons = {P ∈ 𝒫, checkP(P) ≠ −1}.

FindFaultyElements checks the partitions that are defined as a constraint. If the checked value of a partition pmult is faulty (resp. unfaulty), we add it to the set of faulty (resp. unfaulty) partitions P− (resp. P+), and every subsystem σi of the partition is possibly faulty (resp. unfaulty), so we add it to Σ− (resp. Σ+). If another partition pmult can help to get more faulty or unfaulty components, a new constraint is proposed and added to NCons.

Verification is used to check the components that may be faulty, i.e. those included in Fc, with the checkC function, and the communication of the subsystems in Σ− with the checkCom function.

Two functions have been introduced: the checkP(pi) value of a partition pi and the checkCom(σi) value of a subsystem. Their values can be automatically computed thanks to a program developed on the system to automate the diagnosis. This is performed by the GET function, whose purpose is to model the computation of checkP(pi) or checkCom(σi).

Algorithm 1: DIAG(L(Σ, ⊆))
  Input: d = {pi, i ∈ [1, n]}, Cons = {consi}
  Output: ∆ (Diagnosis)
  Global variables: End, Fc (faulty components), Uc (unfaulty components),
    Σ− (faulty subsystems), Σ+ (unfaulty subsystems),
    P− (faulty partitions), P+ (unfaulty partitions)
  ∆, Fc, Uc, P+, P−, Σ−, Σ+ ← {}; End ← false;
  NCons ← {};
  while ¬End do
      FindFaultyElements(d, Cons);
      Verification(Fc, Σ−);
      if ¬End then
          foreach pi ∈ NCons do
              GET checkP(pi)
              Cons ← Cons ∪ {pi}

Algorithm 2: FindFaultyElements
  Input: d = {pi}, Cons = {consi}
  Outputs: Fc, P−, Σ−, Σ+
  foreach (pj, pk) ∈ 𝒫² : pj ≠ pk do
      pmult ← pj × pk
      if pmult ∈ Cons then
          if checkP(pmult) = 0 then
              P− ← P− ∪ {pmult}
              foreach σi ∈ pmult do
                  foreach ck ∈ Uc do
                      σi ← σi \ {ck}
                  if σi = {ci} then
                      Fc ← Fc ∪ σi
                  else if σi ∉ Σ+ then
                      Σ− ← Σ− ∪ {σi}
          if checkP(pmult) = 1 then
              P+ ← P+ ∪ {pmult}
              foreach σi ∈ pmult do
                  if σi = {ci} then
                      Uc ← Uc ∪ σi
                  else
                      Σ+ ← Σ+ ∪ {σi}
      if pmult ∉ Cons then
          if ∃{ci} ∈ pmult then
              if ¬(ci ∈ Uc ∪ Fc) then
                  NCons ← NCons ∪ {pmult}

2.4 Formal example

In order to illustrate the problem formulation and the diagnostic algorithm, a formal example is provided. It is composed of eight components {Ci, i ∈ [1, 8]} organized into three partitions:
P1 = {{C1, C2, C3, C4}, {C5, C6, C7, C8}},
P2 = {{C1, C2}, {C3, C4, C5, C6, C7, C8}},
P3 = {{C1}, {C2, C4, C6, C8}, {C3, C5, C7}}.
P3 describes the topology of the system. P1 and P2 describe functionalities. We set the C2 component as faulty. The idea is to combine the topology of the system with its functionalities to find the faulty component or subsystem. A choice function is introduced to choose the next topology and the next functionality to be tested. It is guided by the minimum number of tests to perform in order to fix the system. For a set of partitions 𝒫, we define Choose : {𝒫} → 𝒫 × 𝒫.

As the two functionalities are modeled by P1 and P2, and the topology is modeled by P3, we have two possibilities. We assume that P2 is prior to P1; the first iteration is defined with Choose(𝒫) = (P1, P3). We begin with checkP(P1 × P3) = 0, s.t. P1 × P3 = {{C1}, {C2, C4}, {C3}, {C6, C8}, {C5, C7}}. The possible faulty components are C1 and C3. We check the C1 and C3 components and






find them as unfaulty, see Table 1. The possible faulty subsystems are {C2, C4}, {C6, C8} and {C5, C7} and they are unfaulty. The diagnosis is not sufficient; we must relax the constraint P2 × P3.

  Components   CheckC
  C1            1
  C2           −1
  C3            1
  C4           −1
  C5           −1
  C6           −1
  C7           −1
  C8           −1
Table 1: Diagnostic results for components in P1 × P3

The second iteration is defined with Choose(𝒫) = (P2, P3), s.t. P2 × P3 = {{C1}, {C2}, {C4, C6, C8}, {C3, C5, C7}}. We get checkP(P2 × P3) = 0; the possible faulty components are C1 and C2, but C1 has already been checked in the previous iteration. So, the possible faulty subsystems are {C3, C5, C7} and {C4, C6, C8}. We check the C2 component and find it faulty. For this example, the faulty component computed is C2 in P2 × P3, see Table 2. If no component had been found faulty, the upper topological level would be treated, i.e. the subsystems {C2, C4}, {C6, C8}, {C5, C7}, {C4, C6, C8} and {C3, C5, C7}. Here, they are unfaulty.

  Components   CheckC
  C1            1
  C2            0
  C3            1
  C4           −1
  C5           −1
  C6           −1
  C7           −1
  C8           −1
Table 2: Diagnostic results for components in P2 × P3

The method has quickly detected the faulty component using a functional partition and a structural partition. Thanks to this result, possible faults regarding either the topology or the functionality are checked.

Algorithm 3: Verification
  Inputs: Fc
  Outputs: ∆, Fc, Uc, End
  Initialization: σ+, σ− ← I;
  foreach ci ∈ Fc do
      if checkC(ci) = 0 then
          ∆ ← ∆ ∪ {ci}
          End ← true
      else
          Fc ← Fc \ {ci}
          Uc ← Uc ∪ {ci}
  foreach Σi ∈ Σ− do
      GET checkCom(Σi)
      if checkCom(Σi) = 0 then
          ∆ ← ∆ ∪ {Σi}
          End ← true
      else
          Σ− ← Σ− \ {Σi}
          Σ+ ← Σ+ ∪ {Σi}

3 The Automatic Test Benchmark

3.1 Avionics system

The avionics system of the NH90 helicopter is designed to support multiple hardware and software platforms from more than twelve national customers in over twenty different basic helicopter configurations. The NH90 Avionics System consists of two major subsystems: the CORE System and the MISSION System. A computer is the bus controller and manages the communications of each subsystem: the Core Management Computer (CMC) for the CORE System and the Mission Tactical Computer (MTC) for the MISSION System. Each computer is connected to one or both subsystems via a multiplex data bus (MIL-STD-1553), point-to-point connections (ARINC429) and serial RS-485 lines. Additional redundant computers are used as backup. One of the two CMC is the Bus Controller (BC) of the CORE multiplex data bus. The avionics system of the ATB is composed of fourteen computers and the above connections: two CMC: c1 = CMC1 and c2 = CMC2; two Plant Management Computers (PMC): c3 = PMC1 and c4 = PMC2; five Multifunction Displays (MFD): c5 = MFD1, c6 = MFD2, c7 = MFD3, c8 = MFD4, c9 = MFD5; two Display and Keyboard Units (DKU): c10 = DKU1, c11 = DKU2; two IRS: c12 = IRS1, c13 = IRS2; one Radio Altimeter (RA): c14 = RA. Formally, COMPS_ATB = {ci, i ∈ [1, 14]}.

The avionics system under test COMPS_SUT is a subsystem of COMPS_ATB. It is described in Figure 1. COMPS_SUT = {c1, c2, c3, c4, c5, c10, c12, c14}. For the rest of the article, COMPS_SUT will be the primary system under study.

Figure 1: Architecture of the avionics subsystem

  From    To      Messages   Subsystems
  DKU1    CMC1    Mode on    σSerial1
  CMC1    IRS1    Mode on    σMIL
  IRS1    RA      Mode on    σNAV; σARINC
  RA      IRS1    Alert      σNAV; σARINC
  IRS1    CMC1    Alert      σMIL; σNAV
  CMC1    DKU1    Alert      σSerial1; σNAV
Table 3: Messages
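The component sets of this section and the message flow of Table 3 can be written down as plain data. The following is a minimal sketch, assuming that the rows of Table 3 are listed in transmission order; the Python names (`COMPS_ATB`, `COMPS_SUT`, `MESSAGES`, `route`, `subsystems_on_route`) are hypothetical and are not part of the authors' tool.

```python
# Illustrative encoding of the Section 3.1 data; all names are hypothetical.

# The fourteen ATB computers, indexed as c1..c14 in the text.
COMPS_ATB = {
    1: "CMC1", 2: "CMC2", 3: "PMC1", 4: "PMC2",
    5: "MFD1", 6: "MFD2", 7: "MFD3", 8: "MFD4", 9: "MFD5",
    10: "DKU1", 11: "DKU2", 12: "IRS1", 13: "IRS2", 14: "RA",
}

# The system under test: {c1, c2, c3, c4, c5, c10, c12, c14}.
COMPS_SUT = {COMPS_ATB[i] for i in (1, 2, 3, 4, 5, 10, 12, 14)}

# Rows of Table 3: (from, to, message, subsystems involved).
# Assumption: rows appear in the order in which the messages are sent.
MESSAGES = [
    ("DKU1", "CMC1", "Mode on", ("Serial1",)),
    ("CMC1", "IRS1", "Mode on", ("MIL",)),
    ("IRS1", "RA", "Mode on", ("NAV", "ARINC")),
    ("RA", "IRS1", "Alert", ("NAV", "ARINC")),
    ("IRS1", "CMC1", "Alert", ("MIL", "NAV")),
    ("CMC1", "DKU1", "Alert", ("Serial1", "NAV")),
]


def route(message):
    """Ordered list of components traversed by one message of Table 3."""
    hops = [(src, dst) for src, dst, msg, _ in MESSAGES if msg == message]
    if not hops:
        return []
    return [hops[0][0]] + [dst for _, dst in hops]


def subsystems_on_route(message):
    """Names of the subsystems that Table 3 associates with one message."""
    return {s for _, _, msg, subs in MESSAGES if msg == message for s in subs}


if __name__ == "__main__":
    print(route("Mode on"))
    print(route("Alert"))
    print(sorted(subsystems_on_route("Alert")))
```

Keeping the table as a list of tuples makes it easy to check that every sender and receiver belongs to the system under test before any partition is built from it.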






   The PMC is used to monitor the status of all the avion-
ics computers. It displays the alert informations on the
MFD. We define the performances partition pP ERF =
{σP ERF ,σ¬P ERF } with:
σP ERF = {P M C1,P M C2,RA,IRS1,M F D1}
σ¬P ERF = {CM C1,CM C2,DKU 1} and the navigation                  Figure 2: Navigation func-      Figure 3:      Performance
partition pN AV = {σN AV ,σ¬N AV } with:                          tion decomposition with         function     decomposition
σN AV = { RA,IRS1,M F D1}                                         dprotocol                       with dprotocol
σ¬N AV = {CM C1,CM C2,DKU 1,P M C1,P M C2}.
The test consists in the simulation of a high roll. Normally
the RA should be deactivated above the value of forty de-         DKU 1}; {P M C1, P M C2}; {M F D1, IRS1, RA}};
grees. The procedure contains the following actions: en-          pN AV.ARIN C = pN AV × pARIN C = {{M F D1, IRS1,
gage the RA with the DKU 1; simulating a roll of 50 de-           RA}; {CM C1, CM C2, P M C1, P M C2}; {DKU 1}}.
grees; check that the RA functionality is deactivated on the        The performance function can give insights about the
DKU 1. Several messages are sent to achieve this func-            fault. We compute the partitions with this functionality:
tionality, see Table 3, defining a data-flow for two mes-         pP ERF.M IL = pP ERF ×pM IL = { {M F D1,RA};
sages : "Mode on" and "Alert" messages: from DKU 1                {DKU 1}; {CM C1,CM C2}; {P M C1,P M C2,IRS1} }
to CM C1 via serial communication to activate the radioal-        pP ERF.Serial =pP ERF ×pSerial = { {CM C1,CM C2,
timeter’s specific mode ("Mode on" message); from CM C1           DKU 1}; {P M C1,P M C2}; {M F D1,IRS1,RA} }
to IRS1 via MIL-STD-1553 communication to relay the               pP ERF.ARIN C = pP ERF ×pARIN C = { { P M C1, P M C2,
activation information; from IRS1 to RA via ARINC com-            M F D1, IRS1, RA};{CM C1, CM C2}; {DKU 1} }.
munication to send a request to the RA to get the roll angle;       Those partitions will serve to improve the diagnosis.
from RA to IRS1 via ARINC communication to send the               3.3 Outlooks about the decompositions
response to the IRS that compute the angle; from IRS1 to
CM C1 via ARINC communication, from CM C to DKU                   We describe an iterative method to update the diagnostic re-
via serial communication to display the alert and disable the     sult by providing new topologies of the system. We need to
functionality ("Alert" message).                                  get precise observations to find the faulty components. The
                                                                  subsystems are computed with the framework of the previ-
3.2 System Under Test (SUT) decomposition                         ous section.
                                                                  Given the components, the messages sent between them,
The ATB is used to perform the realization of the avionics        and the protocol of these messages, we can obtain an
functions with the necessary equipments and a simulated en-       overview of the system decomposition: pSU T can be
vironment needed to check the system specification.               decomposed into dprotocol = {pSU T × pM IL ; pSU T ×
   The ATB is described as a structural decomposition with        pSerial ; pSU T × pARIN C }. This hierarchical structure is
components subsets. These sets provide partitions of the          provided with a dependency graph, see Figures 2 and 3.
whole system. We define subsystems σi and the partitions             The following partitions are used:
pi with regards to the connections of the avionics system of      σcom1 = {{DKU 1, CM C1, IRS1, RA}};
Figure 1, the serial communication:                               σ¬com1 = {{M F D1, CM C2, P M C1, P M C2}};
  σSerial1 = {CM C1, CM C2, DKU 1}                                pcom1 = {σcom1 , σ¬com1 }.
  σSerial2 = {P M C1, P M C2}                                        The path of the informations "RA mode on" and "RA
  σ¬Serial = {MFD1, IRS1, RA}
  pSerial  = {σSerial1; σSerial2; σ¬Serial}
the ARINC communications:
  σARINC  = {CMC1, CMC2, PMC1, PMC2, MFD1, IRS1, RA}
  σ¬ARINC = {DKU1}
  pARINC  = {σARINC; σ¬ARINC}
the MIL-STD-1553 communications:
  σMIL  = {CMC1, CMC2, PMC1, PMC2, IRS1}
  σ¬MIL = {MFD1, DKU1, RA}
  pMIL  = {σMIL; σ¬MIL}
The above partitions describe the topology of the problem. We classify the partitions into two
categories: functional partitions and communication partitions. The functional partitions contain
the subsystems that compute and send the information. The communication partitions contain the
subsystems that relay this information. In our example, the navigation functionality is tested.
The functional partitions are {pNAV, pPERF}; the connection partitions are {pMIL, pSerial, pARINC}.
We need to define additional partitions that can be checked with the check function on the system
thanks to this representation:
  pNAV.MIL = pNAV × pMIL = {{MFD1, RA}; {IRS1}; {CMC1, CMC2, PMC1, PMC2}; {DKU1}};
  pNAV.Serial = pNAV × pSerial = {{CMC1, CMC2,

alert" on copilot side defines another decomposition: σcom2 = {{CMC2, IRS1, RA, DKU1}};
σ¬com2 = {{MFD1, CMC1, PMC1, PMC2}}; pcom2 = {σcom2, σ¬com2}.
   We describe the decomposition dcom = {pcom1, pcom2} in Figures 4 and 5. We compute partitions
with the navigability functionality and this structural decomposition:
  pNAV.com1  = pNAV × pcom1  = {{RA, IRS1}; {MFD1}; {CMC1, DKU1}; {CMC2, PMC1, PMC2}};
  pNAV.com2  = pNAV × pcom2  = {{RA, IRS1}; {DKU1, CMC2}; {MFD1}; {CMC1, PMC1, PMC2}};
  pPERF.com1 = pPERF × pcom1 = {{RA, IRS1}; {CMC2}; {CMC1, DKU1}; {MFD1, PMC1, PMC2}};
  pPERF.com2 = pPERF × pcom2 = {{RA, IRS1}; {DKU1, CMC2}; {CMC1}; {MFD1, PMC1, PMC2}}.

4 Illustration of the Meta-Diagnostic Approach

4.1 Application of the meta-diagnosis approach
An iterative approach is very helpful for distributed systems, since the diagnosis can use new
subsystems and partitions. The results of the diagnosis are re-injected in the upper system to
refine the results.
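The partition products used in this section can be illustrated with a short sketch. The Python snippet below is not part of the paper's C++/Java software. It assumes that the product of two partitions is the set of non-empty pairwise intersections of their blocks, and it reproduces pNAV.MIL and pNAV.com2 as listed above. The blocks of pNAV are an assumption: they are not stated in this passage and were inferred from the listed products. The helper names partition and partition_product are hypothetical.

```python
from itertools import product


def partition(*blocks):
    """Build a partition as a set of frozensets (hypothetical helper)."""
    return {frozenset(block) for block in blocks}


def partition_product(p, q):
    """Product of two partitions: all non-empty pairwise block intersections."""
    blocks = {a & b for a, b in product(p, q)}
    blocks.discard(frozenset())  # drop empty intersections
    return blocks


# ASSUMPTION: blocks of pNAV inferred from the products listed in the text.
p_nav = partition({"MFD1", "IRS1", "RA"},
                  {"CMC1", "CMC2", "PMC1", "PMC2", "DKU1"})

# pMIL = {sigma_MIL; sigma_notMIL} and pcom2 = {sigma_com2; sigma_notcom2}, as given in the text.
p_mil = partition({"CMC1", "CMC2", "PMC1", "PMC2", "IRS1"},
                  {"MFD1", "DKU1", "RA"})
p_com2 = partition({"CMC2", "IRS1", "RA", "DKU1"},
                   {"MFD1", "CMC1", "PMC1", "PMC2"})

p_nav_mil = partition_product(p_nav, p_mil)
p_nav_com2 = partition_product(p_nav, p_com2)

for name, blocks in (("pNAV.MIL", p_nav_mil), ("pNAV.com2", p_nav_com2)):
    print(name, sorted(sorted(block) for block in blocks))
```

Representing blocks as frozensets keeps the product independent of the order in which components and blocks are written, which matches the set notation used for the partitions.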






Figure 4: Navigation function decomposition with dcom

Figure 5: Performance function decomposition with dcom

      pi           checkP(pi)   Uc                    Fc
      pNAV.com1    0            {DKU1, IRS1}          {RA, MFD1}
      pNAV.com2    1            {DKU1, IRS1, MFD1}    {RA}

Table 7: Iterations of CheckMultiplicationPartition with dcom

The first symptom is the misbehavior of the navigation functionality. We describe the iterations
of the algorithms with two topologies. We launched the meta-diagnostic algorithm with the
topologies dNAV.protocol = {pNAV.MIL, pNAV.ARINC, pNAV.SERIAL} and dNAV.com = {pNAV.com1,
pNAV.com2}. The constraint is CONS = {checkP(pi), ∀pi ∈ dNAV.protocol ∪ dNAV.com}. The iterations
of the algorithms are described in Tables 4 and 5.

      pi             checkP(pi)   Uc   Fc
      pNAV.ARINC     0            ∅    {DKU1}
      pNAV.SERIAL    1            ∅    {DKU1}
      pNAV.MIL       0            ∅    {IRS1, DKU1}

Table 4: Iterations of CheckMultiplicationPartition with dprotocol

   The third step gives the components of the Fc set that can be faulty: DKU1 and IRS1 in
Table 5. If these components are faulty, this may explain the system behavior and the algorithm
ends. At the same time, the communications of subsystems in Σ− can be faulty. They are checked in
Table 6.

      ci      checkC(ci)   Fc        Uc
      DKU1    1            {IRS1}    {DKU1}
      IRS1    0            {IRS1}    {DKU1}

Table 5: Iterations of the CheckComponents with dprotocol

      Subsystems                   checkCom   Partition
      {MFD1, RA}                   1          pNAV.ARINC
      {CMC1, CMC2, PMC1, PMC2}     1          pNAV.ARINC

Table 6: Diagnostic results for subsystems

   The IRS1 is not faulty, so the algorithm is relaunched with Uc = {DKU1, IRS1} and the other
decomposition dcom = {pNAV.com1, pNAV.com2}. The algorithm iterations are described in Tables 7
and 8.

      Subsystems             checkCom   Partition
      {RA, IRS1}             1          pNAV.com1
      {CMC1, DKU1}           1          pNAV.com1
      {CMC2, PMC1, PMC2}     1          pNAV.com1

Table 8: Diagnostic results of subsystems with pNAV.com1

   Once checkP(pNAV.com2) = 1, we deduce that MFD1 is not faulty, see Table 7. At this step, the
unfaulty components are {DKU1, IRS1, MFD1}, and the diagnosis is {RA}.
   Here the RA is faulty with pNAV.com1, and the algorithm ends. The solution is RA for
pNAV.com1. The data flow of the messages is checked, as are the impacted connections, wiring, and
routing. The system specificities of the communication modeled with com1 give clues about the
possible faults. Thanks to the impacted functionality, we know that only messages concerning the
IRS roll are concerned. At this stage, the simulation of the message or the bad connection of the
IRS are the two main solutions.

4.2 Application with updated constraints
We describe a new problem: the navigation functionality and the performance function do not
behave normally. The new constraint is CONS = {checkP(pi), ∀pi ∈ dNAV.protocol ∪ dNAV.com ∪
dPERF.protocol ∪ dPERF.com}. The algorithm is loaded from CheckMultiplicationPartition with the
decomposition dcom. The algorithm iterations are described in Table 9. Once checkP(pPERF.com2) = 1,
we deduce that CMC1 is not faulty. We continue with dprotocol, knowing that CMC1 is not faulty,
in Table 10. We deduce that we have to check DKU1 and CMC2.

      pi            checkP(pi)   Uc        Fc
      pPERF.com1    0            ∅         {CMC2}
      pPERF.com2    1            {CMC1}    {CMC2}

Table 9: Algorithm 2's iterations with dcom

      pi              checkP(pi)   Uc        Fc
      pPERF.ARINC     0            {CMC1}    {DKU1, CMC2}
      pPERF.SERIAL    1            {CMC1}    {DKU1, CMC2}
      pPERF.MIL       0            {CMC1}    {DKU1, CMC2}

Table 10: Iterations of CheckMultiplicationPartition with dprotocol

   At this stage, we check the components on the system. Since repairing CMC2 fixed the problem,
we conclude that CMC2 was faulty. We also checked the DKU1 configuration and found nothing. The
diagnosis is ∆ = {CMC2}.
   The evolution of the number of faulty and unfaulty components is shown in Figure 6. As
expected, the number of unfaulty components increases with new tests, i.e. tests








Figure 6: Evolution of the number of faulty and unfaulty
components

of partitions. It reveals that the algorithm is converging to a
solution because the number of components is limited.
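
As a small illustration of this trend, the following Python sketch replays the Uc and Fc sets reported in Table 7 and counts them after each partition test. It only tabulates values already given in that table and does not reproduce Figure 6 or the diagnostic algorithm itself. The variable names are hypothetical.

```python
# Uc (unfaulty) and Fc (faulty candidate) sets after each test, copied from Table 7.
iterations = [
    ("pNAV.com1", {"DKU1", "IRS1"}, {"RA", "MFD1"}),
    ("pNAV.com2", {"DKU1", "IRS1", "MFD1"}, {"RA"}),
]

# Count unfaulty and candidate faulty components after each test.
counts = [(name, len(uc), len(fc)) for name, uc, fc in iterations]

# The unfaulty count should never decrease from one test to the next.
unfaulty_grows = all(prev[1] <= cur[1] for prev, cur in zip(counts, counts[1:]))

for name, n_unfaulty, n_faulty in counts:
    print(f"{name}: |Uc| = {n_unfaulty}, |Fc| = {n_faulty}")
print("unfaulty count non-decreasing:", unfaulty_grows)
```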
5 Software implementation
5.1 Diagnostic software architecture
The algorithms are implemented in a spy software for the ARINC and MIL-STD-1553 buses, see
Figure 7. They are developed in C++ for effective diagnosis and to be implemented in the AIRBUS
software. The user interfaces are developed with Java 1.7 and the Swing Graphical User Interface
(GUI) widget toolkit. The architecture of the diagnostic framework has been adapted to the ATB
specificities, as described with the Model-View-Controller (MVC) paradigm in Figure 8. Three main
objects are defined for the Model: the Component, the Set, and the Partition objects. Four main
objects are defined in the View to define specific panels: the diagnosisPanel, the
constraintsPanel, the initialStatePanel and the resultsPanel objects. The model is implemented
with the ArrayList class, which is used to define the list of components, the subsystems and the
list of partitions. eXtensible Markup Language (XML) files are used to describe the system
structure. The Controller dispatches the user requests and selects the panels for presentation.
The diagnosis algorithm is implemented in it. A GUI is provided for handling user inputs such as
partition check values and component observation values.

       Figure 7: Data flow of the diagnosis software

       Figure 8: Architecture of the diagnosis software

5.2 User interfaces
The panels are displayed one after the other for each step of the algorithm defined in the
Controller. The initialStatePanel panel (Figure 9) defines the status of the equipment before
launching the diagnosis and provides a button to run the algorithm. The check values computed by
the algorithm defined in the Controller are provided to the operator in Figure 11. The
constraintsPanel panel lets the operator edit and update constraints, see Figure 10. The result of
the diagnostic algorithm is provided in Figure 11. It gives the faulty components (observation
equal to zero) and the impacted functionality. If a component is suspected, the data flow of the
functional chain described by the partition must be checked. As described in the case study, it
gives insights about the possible connections, wiring, and routing that can be wrong.
   We compute the results ∆ = {IRS1, DKU1, CMC2, RA} and display them in Figure 11. If some
components are unfaulty, we can update their status in Figure 9. The algorithm is relaunched using
the "GO" button in Figure 9. The good diagnosis rate is evaluated in Figure 12. It is defined as
the number of faulty components that the operator has to fix over the number of proposed faulty
components.

       Figure 9: Initial state of the diagnosis

       Figure 10: State of the constraints

       Figure 11: Diagnosis results

5.3 Discussion
We have proposed a solution for the diagnosis of a complex system in aeronautics based on the MBD
paradigm and the






lattice concept. It is another solution for the meta-diagnosis problem as described in [5], since
we consider the test system environment as the main system. Belard has extended the framework;
here we use the original one with the lattice concept to represent the system description. We also
provide a diagnostic algorithm implemented on the system to evaluate our method. Since hundreds of
diagnoses are possible on the ATB and it is not possible to check all those possibilities, we have
introduced a methodology for the ATB diagnosis that reduces the number of iterations needed to get
the diagnosis. We have upgraded the applications of MBD for avionics systems evaluated in [4] and
[2]. We propose the integration and evaluation of a diagnostic algorithm for an ATB, taking the
test system environment into account. It differs from other applications of MBD like [8] because
the model decomposition is driven by the test system specificities that are represented with the
lattice concept.

               Figure 12: Good diagnosis rate

6 Conclusion
This paper extends the MBD approach to propose a diagnostic software that is developed for the
diagnosis of test systems. The current framework is based on the lattice decomposition and is used
to model a test system. First, the lattice decomposition has been used to decompose the system
into its functionalities and connections. The second contribution consists in the proposal of an
algorithm that reduces the diagnostic ambiguity. The lattice description has been implemented with
Java native packages. The software architecture and diagnostic iterations are provided for a
formal example and an industrial case study. The diagnostic algorithm has been shown to reduce the
number of faulty candidates. The result is either a faulty equipment or a group of equipments with
the associated system functionality that is unable to meet its goal. Together, they are sufficient
to point out the repairs that will fix the system. The tests on the Avionics Test Systems in
AIRBUS HELICOPTERS have shown good results. The development of models may confront our solution
with many other real problems. In future work, the algorithms will be improved with adaptable
decompositions and automatic tests. Furthermore, as the method is generic, we want to demonstrate
the validity of our method for other test systems used in AIRBUS HELICOPTERS.

References
[1] Canh Ly, Kwok Tom, Carl S. Byington, Romano Patrick, and George J. Vachtsevanos. Fault
    Diagnosis and Failure Prognosis for Engineering Systems: A Global Perspective. In Proceedings
    of the Fifth Annual IEEE International Conference on Automation Science and Engineering,
    CASE'09, pages 108–115, Piscataway, NJ, USA, 2009. IEEE Press.
[2] Arnaud Lefebvre, Zineb Simeu-Abazi, Jean-Pierre Derain, and Mathieu Glade. Diagnostic of the
    avionic equipment based on dynamic fault tree. In Proceedings of the IFAC-CEA conference,
    October 2007.
[3] Denis Berdjag, Jérôme Cieslak, and Ali Zolghadri. Fault detection and isolation of aircraft
    air data/inertial system. pages 317–332. EDP Sciences, 2013.
[4] Fabien Kuntz, Stéphanie Gaudan, Christian Sannino, Éric Laurent, Alain Griffault, and Gérald
    Point. Model-based diagnosis for avionics systems using minimal cuts. DX 2011 22nd
    International Workshop on Principles of Diagnosis, 2011.
[5] Nuno Belard, Yannick Pencolé, and Michel Combacau. A theory of meta-diagnosis: reasoning about
    diagnostic systems. In Proceedings of the Twenty-Second international joint conference on
    Artificial Intelligence, IJCAI'11, pages 731–737, Barcelona, Catalonia, Spain, 2011.
[6] Denis Berdjag, Vincent Cocquempot, Cyrille Christophe, Alexey Shumsky, and Alexey Zhirabok.
    Algebraic approach for model decomposition: Application for fault detection and isolation in
    discrete-event systems. International Journal of Applied Mathematics and Computer Science
    (AMCS), 21(1):109–125, March 2011.
[7] Quang-Huy Giap, Stephane Ploix, and Jean-Marie Flaus. Managing Diagnosis Processes with
    Interactive Decompositions. In Artificial Intelligence Applications and Innovations III, IFIP
    International Federation for Information Processing, pages 407–415. 2009.
[8] Belarmino Pulido, Carlos Alonso-González, Anibal Bregon, Alberto Hernández Cerezo, and David
    Rubio. DXPCS: A software tool for consistency-based diagnosis of dynamic systems using
    Possible Conflicts. 25th Annual Workshop Proceedings, DX-14, 2014.
[9] Veronique Delcroix, Mohamed-Amine Maalej, and Sylvain Piechowiak. Bayesian Networks versus
    Other Probabilistic Models for the Multiple Diagnosis of Large Devices. International Journal
    on Artificial Intelligence Tools, 16(3):417–433, 2007.
[10] Mattias Krysander, Jan Aslund, and Erik Frisk. A Structural Algorithm for Finding Testable
    Sub-models and Multiple Fault Isolability Analysis. 21st Annual Workshop Proceedings, DX-10,
    2010.
[11] Ronan Cossé, Denis Berdjag, David Duvivier, Sylvain Piechowiak, and Christian Gaurel.
    Meta-Diagnosis for a Special Class of Cyber-Physical Systems: the Avionics Test Benches. In
    The 28th International Conference on Industrial, Engineering & Other Applications of Applied
    Intelligent Systems, [Accepted], IEA/AIE 2015, Seoul, Korea, 2015.
[12] Johan de Kleer and B.C. Williams. Diagnosing multiple faults. Artificial Intelligence,
    32(1):97–130, 1987.
[13] Johan de Kleer, Alan K. Mackworth, and Raymond Reiter. Characterizing diagnoses and systems.
    Artificial Intelligence, 56(2-3):197–222, 1992.
[14] Randall Davis and Walter C. Hamscher. Model-Based Reasoning: Troubleshooting. pages 297–346,
    July 1988. San Francisco, CA, USA.








                                     SAT-Based Abductive Diagnosis

                                       Roxane Koitz¹* and Franz Wotawa¹
                                    ¹Graz University of Technology, Graz, Austria
                                        e-mail: {rkoitz, wotawa}@ist.tugraz.at

                                    *Authors are listed in alphabetical order.

Abstract

Increasing complexity and magnitude of technical systems demand an accurate fault localization in
order to reduce maintenance costs and system down times. Resting on solid theoretical foundations,
model-based diagnosis provides techniques for root cause identification by reasoning on a
description of the system to be diagnosed. Practical implementations in industries, however, are
sparse due to the initial modeling effort and the computational complexity. In this paper, we
utilize a mapping function automating the modeling process by converting fault information
available in practice into propositional Horn logic sentences to be used in abductive model-based
diagnosis. Furthermore, the continuing performance improvements of SAT solvers motivated us to
investigate a SAT-based approach to abductive diagnosis. While an empirical evaluation did not
indicate a computational benefit over an ATMS-based algorithm, the potential to diagnose more
expressive models than Horn theories encourages future research in this area.

1 Introduction

Fault identification of technical systems is becoming increasingly difficult due to their rising
complexity and scale. Economic and safety considerations have not only put accurate diagnosis into
research focus but have also led to a growing interest in practice.
   Model-based diagnosis has been presented as a method to derive root causes for observable
anomalies utilizing a description of the system to be diagnosed [1, 2]. Reiter [1] proposed a
component-oriented model encompassing the correct system behavior and structure. Discrepancies,
i.e. conflicts, arise when the observed and expected system performance diverge. Based on the
minimal conflict sets, root causes for the inconsistencies are obtained by hitting set
computation. Hence, fault diagnosis is a two step process, where first contradicting assumptions
on component health, given a set of symptoms and the model, are identified. Then the sets
intersecting all conflict sets are computed which constitute the diagnoses. At the same time [2]
presents the General Diagnosis Engine (GDE) for multiple fault identification, drawing on the
connection between inconsistencies and causes as well. Their approach employs an assumption-based
truth maintenance system (ATMS) to detect conflicts and thereon compute diagnoses. Over the years
much work has concentrated on model-based diagnosis applications in various domains, such as space
probes [3] or the automotive industry [4].
   Besides the consistency-based approach, a second method emerged within the field of model-based
diagnosis, which exploits the concept of entailment to infer explanations for given observables.
While related to the more traditional technique based on consistency, abductive model-based
diagnosis requires a system formalization representing faults and their manifestations [5].
   Even though based on a well-defined theory, a widespread acceptance of model-based diagnosis
among industries has not been achieved yet. Two main contributing factors can be identified: the
initial model development and the computational complexity of diagnosis [6]. In order to diminish
the modeling effort, [7] formulates a conversion of failure assessments available in practice into
a propositional logic representation suitable for abductive diagnosis. Failure mode and effect
analysis (FMEA) is an established reliability evaluation method utilized in various industrial
fields. It considers possible component faults as well as their implications on the system's
behavior [8]. Whereas there has been extensive research on the automatic generation of FMEAs from
system models [9], we argue in favor of the inverse process. As these assessments report on
failures and how they reveal themselves in the artifact's behavior, they provide knowledge
requisite for abductive reasoning. In this paper, we present a compilation of FMEAs to models
which can be used in abductive diagnosis.
   Apart from discovering inconsistencies, an ATMS is capable of inferring abductive diagnoses.
However, it may face computational challenges and is restricted to operate on propositional Horn
clauses. In the case of the models we are extracting from the FMEAs, this is not a limitation so
far. Nevertheless, as we anticipate exploiting more expressive representations, a different
approach is required.
   The performance of Boolean satisfiability (SAT) solvers has improved immensely over the last
years and






several applications of SAT solvers in practice have proven successful. Furthermore, we are able
to encode a greater variety of models in SAT. Thus, we propose a SAT-based approach to abductive
diagnosis and empirically compare its performance to a procedure dependent on an ATMS.
   The remainder of this paper is structured as follows. After formally providing the theoretical
background on abductive diagnosis as well as relevant definitions in the context of SAT, we
formulate the modeling process based on FMEAs and give information on the properties of the
obtained system descriptions. In Section 5 we describe our SAT-based approach to abductive
diagnosis and present an algorithm computing explanations for a given abduction problem. An
empirical evaluation comparing our method to an ATMS-based diagnosis engine follows in Section 6.
Subsequently, we provide some concluding remarks and give an outlook on future research
possibilities.

2   Related Work

Mechanizing logic-based abduction has been an active research field for several decades with
different approaches for generating explanations emerging, such as proof tree completion [10] and
consequence finding [11]. While the former exploits a refutation proof involving hypotheses, the
latter computes causes as logical consequences of the theory. As resolution is not consequence
finding complete, [12] devised a procedure based on linear resolution which is sound and complete
for consequence finding for propositional as well as first order logic.
   While the number of practical applications in the context of abductive model-based diagnosis is
rather small, in [13] the authors describe abductive reasoning in environmental decision support
systems.
   Most recently [14] present a SAT encoding for consistency-based diagnosis. The system
description is compiled into a Boolean formula, such that the formula's satisfying assignments
correspond to the solutions of the diagnosis problem. Based on the encoding, a SAT solver directly
computes the diagnoses. In order to improve the solver's performance, the authors utilize several
preprocessing techniques. An empirical comparison of their approach to other model-based diagnosis
algorithms indicates that their SAT encoding yields performance benefits. Contrasting these
results, [15] propose a translation to Max-SAT which could not outperform the stochastic
model-based diagnosis algorithm SAFARI [16].

calls for existing methods as well as a novel algorithm for MCSes computation.
   As stated by [20] the complexity of abduction precludes a polynomial-time transformation to
SAT. Thus, in their work the authors present a fixed-parameter tractable transformation from
propositional abduction to SAT exploiting backdoors and describe how to use their transformations
to enumerate all solutions for a given abduction instance.

3     Preliminaries

This section provides a brief introduction to abductive model-based diagnosis. In particular, we
describe the propositional Horn clause abduction problem (PHCAP) which provides the basis for our
research. Note that throughout the paper we consider the closed-world assumption. In addition to
the background on abductive model-based diagnosis, we formally define MUSes and MCSes.

3.1    Abductive Diagnosis

In contrast to the traditional consistency-based approach, abductive model-based diagnosis depends
on a stronger relation between faults and observable symptoms, namely entailment. Hence, whereas
consistency-based diagnosis reasons on the description of the correct system operation, abductive
reasoning requires the model to capture the behavior in presence of a fault. By exploiting the
notion of entailment and the causal links between defects and their corresponding effects, we can
reason about explanations for observed anomalies. In general, abductive diagnosis is an NP-hard
problem. However, there are certain subsets of logic, such as propositional definite Horn theory,
which are tractable [21]. On these grounds we consider the PHCAP as defined in [22], which
represents the connections between causes and effects as propositional Horn sentences. Similar to
[22], we define a knowledge base as a set of Horn clauses over a finite set of propositional
variables.

Definition 1 (Knowledge base (KB)). A knowledge base (KB) is a tuple (A, Hyp, Th) where A denotes
the set of propositional variables, Hyp ⊆ A the set of hypotheses, and Th the set of Horn clause
sentences over A.

   The set of hypotheses contains the propositions, which can be assumed to either be true or
false and refer to possible causes. In order to form an abduction problem, a set of observations
has to be considered for which explanations are to be computed.

In [17] the authors present an algorithm which ties
constraint solving to diagnosis, thus renders the detec-        Definition 2. (Propositional Horn Clause Abdu-
tion of inconsistencies and subsequent hitting set com-         ction Problem (PHCAP)) Given a knowledge base
putation unnecessary. Another direct approach by [18]           (A,Hyp,Th) and a set of observations Obs ⊆ A then
computes minimal diagnoses for over-constrained prob-           the tuple (A,Hyp,Th,Obs) forms a Propositional Horn
lems by finding the sets of constraints to be relaxed           Clause Abduction Problem (PHCAP).
in order to restore consistency. For Boolean formu-
las, those relaxations correspond to Minimal Correc-            Definition 3 (Diagnosis; Solution of a PHCAP).
tion Subsets (MCSes). Their hitting set dual, mini-             Given a PHCAP (A,Hyp,Th,Obs). A set ∆ ⊆ Hyp is
mal unsatisfiable subsets (MUSes), constitute the set           a solution if and only if ∆ ∪ Th |= Obs and ∆ ∪ Th
of subformulas explaining the unsatisfiability, i.e. refer              ⊭ ⊥. A solution ∆ is parsimonious or minimal if and
to conflicts. While there are several algorithms for                    only if no set ∆′ ⊂ ∆ is a solution.
efficiently computing MCSes, most recently [19] develop                    A solution to a PHCAP is equivalent to an
three techniques for reducing the number of SAT solver                  abductive diagnosis, as it comprises the set of hypotheses






explaining the observations. Even though Definition 3 does not impose the constraint of
minimality on a solution, in practice only parsimonious explanations are of interest. Hence,
we refer to minimal diagnoses simply as diagnoses. Notice that finding solutions for a given
PHCAP is NP-complete [22].
   As mentioned above, an ATMS derives abductive explanations for propositional Horn theories;
thus, it can be utilized to find solutions to a PHCAP. In a graph structure where hypotheses,
observations, and contradiction are represented as nodes, the Horn clause sentences defined in
Th determine the directed edges. Each node is assigned a label containing the set of hypotheses
said node can be inferred from. By updating the labels, the ATMS maintains consistency.
   Algorithm abductiveExplanations exploits an ATMS and returns consistent abductive
explanations for a set of observations [23]. In case the observation consists of a single
effect, the label of the corresponding proposition already contains the abductive diagnoses.
To account for multiple observables, i.e. Obs = {o1, o2, ..., on}, an individual implication is
added, such that o1 ∧ o2 ∧ ... ∧ on → obs, where obs is a new proposition not yet considered in
A. Every set contained in the label of obs constitutes a solution to the particular PHCAP.

Algorithm 1 abductiveExplanations [23]
  procedure abductiveExplanations (A, Hyp, Th, Obs)
     Add Th to ATMS
     Add ⋀_{o∈Obs} o → obs to ATMS          ▷ obs ∉ A
     return the label of obs
  end procedure

3.2    Minimal Unsatisfiable Subset and Minimal Correction Subset
We assume standard definitions for propositional logic [24]. A propositional formula φ in CNF,
defined over a set of Boolean variables X = {x1, x2, ..., xn}, is a conjunction of m clauses
(C1, C2, ..., Cm). A clause Ci = (l1, l2, ..., lk) is a disjunction of literals, where each
literal l is either a Boolean variable or its complement. A truth assignment is a mapping
µ : X → {0, 1} and a satisfying assignment for φ is a truth assignment µ such that φ evaluates
to 1 under µ. Given a formula φ, the decision problem SAT consists of deciding whether there is
a satisfying assignment for the formula.
   In case φ is unsatisfiable there are subsets of φ which are of special interest in the
diagnosis context, namely the MUSes and MCSes. A Minimal Unsatisfiable Subset (MUS) comprises a
subset of clauses which cannot be satisfied simultaneously. Notice that every proper subset of
an MUS is satisfiable. A Minimal Correction Subset (MCS) is a set of clauses which corrects the
unsatisfiable formula, i.e. by removing any MCS the formula becomes satisfiable.
   Given an unsatisfiable formula φ, an MUS and MCS are defined as follows [25]:
Definition 4. (Minimal Unsatisfiable Subset (MUS)) A subset U ⊆ φ is an MUS if U is
unsatisfiable and ∀Ci ∈ U, U \ {Ci} is satisfiable.
Definition 5. (Minimal Correction Subset (MCS)) A subset M ⊆ φ is an MCS if φ \ M is
satisfiable and ∀Ci ∈ M, φ \ (M \ {Ci}) is unsatisfiable.
   Since an MCS is a set of clauses correcting the unsatisfiable formula when removed, a single
clause of an MUS is an MCS for this MUS. Note that the hitting set duality of MUSes and MCSes
has been established [26].
   Example. Consider the unsatisfiable formula φ in CNF.

\[
\varphi = \overbrace{(\neg a \lor \neg b \lor c)}^{C_1} \land \overbrace{(\neg c \lor d)}^{C_2}
          \land \overbrace{(c)}^{C_3} \land \overbrace{(\neg d)}^{C_4}
\]

It is apparent that the combination of clauses C2, C3 and C4 results in φ being unsatisfiable,
hence

\[
\mathit{MUSes}(\varphi) = \{\{C_2, C_3, C_4\}\}.
\]

By hitting set computation we arrive at the following set of MCSes:

\[
\mathit{MCSes}(\varphi) = \{\{C_2\}, \{C_3\}, \{C_4\}\}.
\]

Removing any MCS of φ results in the formula being satisfiable.
   It is worth noticing that utilizing subsets of unsatisfiable formulas has been proposed in
regard to consistency-based diagnosis. In this context, a diagnosis is defined as the set of
components which, assumed faulty, retains the consistency of the system. Thus, a
consistency-based diagnosis corresponds to an MCS. For instance, [18] presents a direct
diagnosis method computing MCSes for over-constrained systems. In conflict-directed algorithms,
as proposed by Reiter [1], the minimal conflicts, arising from the deviations of the modeled to
the experienced behavior, equate to the MUSes. In Section 5 we discuss our abductive diagnosis
approach based on MUSes and MCSes.

4   Modeling Methodology
As mentioned before, model-based diagnosis depends on a formal description of the system to be
examined. The generation of appropriate models, however, is still an issue preventing a wide
industrial adoption, since the modeling process is time-consuming and typically demanding for
system engineers.
   Therefore, we present a modeling methodology relying on FMEAs available in practice. An FMEA
comprises a systematic component-oriented analysis of possible faults and the way they manifest
themselves in the artifact's behavior and functionality [8]. This type of assessment is gaining
importance and has become a mandatory task in certain industries, especially for systems that
require a detailed safety analysis. Due to the knowledge capturing the causal dependencies
between specific fault modes and symptoms, an FMEA provides information suitable for abductive
reasoning [7].
Definition 6 (FMEA). An FMEA is a set of tuples (C, M, E) where C ∈ COMP is a component,
M ∈ MODES is a fault mode, and E ⊆ PROPS is a set of effects.
   Running Example. In order to illustrate our modeling process, we use the converter of an
industrial wind turbine as our running example [27]. Table 1






illustrates a simplified FMEA neglecting all parts affiliated with reliability analysis, such
as severity ratings. Each row specifies a particular failure mode (i.e. Corrosion,
Thermo-mechanical fatigue (TMF) or High-cycle fatigue (HCF)) of a subsystem and determines its
corresponding symptoms, such as P turbine referring to a deviation between expected and
measured turbine power output.

  Component    Fault Mode    Effect
  Fan          Corrosion     T cabinet, P turbine
  Fan          TMF           T cabinet, P turbine
  IGBT         HCF           T inverter cabinet, T nacelle, P turbine

   Table 1: Excerpt of the FMEA of the converter

   Consider the FMEA of the converter in Table 1. We can map the columns to their corresponding
representations from Definition 6. The entries in the column Component constitute the elements
of COMP, the entries in Fault Mode those of MODES, and PROPS subsumes the entries of Effect.

\[
\mathit{COMP} = \{\mathit{Fan}, \mathit{IGBT}\}
\]
\[
\mathit{MODES} = \{\mathit{Corrosion}, \mathit{TMF}, \mathit{HCF}\}
\]
\[
\mathit{PROPS} = \{\text{T cabinet},\ \text{P turbine},\ \text{T inverter cabinet},\ \text{T nacelle}\}
\]

Through Definition 6 we obtain

\[
\mathit{FMEA}_{Converter} = \left\{
\begin{array}{l}
(\mathit{Fan}, \mathit{Corrosion}, \{\text{T cabinet}, \text{P turbine}\}), \\
(\mathit{Fan}, \mathit{TMF}, \{\text{T cabinet}, \text{P turbine}\}), \\
(\mathit{IGBT}, \mathit{HCF}, \{\text{T inverter cabinet}, \text{T nacelle}, \text{P turbine}\})
\end{array}
\right\}
\]

   Since the FMEA already represents the relation between defects and their manifestations, the
conversion to a suitable abductive KB is straightforward. It is worth noting that FMEAs usually
consider single faults; thus, the resulting diagnostic system adopts the single fault
assumption. Let HC be the set of Horn clauses. We define a mapping function M : 2^FMEA ↦ HC
generating a corresponding propositional Horn clause for each entry of the FMEA [7].
Definition 7 (Mapping function M). Given an FMEA, the function M is defined as follows:

\[
\mathcal{M}(\mathit{FMEA}) =_{def} \bigcup_{t \in \mathit{FMEA}} \mathcal{M}(t)
\]

where

\[
\mathcal{M}(C, M, E) =_{def} \{\, \mathit{mode}(C, M) \rightarrow e \mid e \in E \,\}.
\]

We utilize the proposition mode(C, M) to denote that component C experiences fault mode M.
Thus, the set of component-fault mode couples forms the set of hypotheses.

\[
\mathit{Hyp} =_{def} \bigcup_{(C,M,E) \in \mathit{FMEA}} \{\mathit{mode}(C, M)\}.
\]

In regard to the running example the following elements compose the set Hyp:

\[
\mathit{Hyp} = \left\{
\begin{array}{l}
\mathit{mode}(\mathit{Fan}, \mathit{Corrosion}), \\
\mathit{mode}(\mathit{Fan}, \mathit{TMF}), \\
\mathit{mode}(\mathit{IGBT}, \mathit{HCF})
\end{array}
\right\}
\]

The set of propositional variables A is defined as the union of all effects stored in the FMEA
as well as all hypotheses, that is the set of component-fault mode pairs, i.e.:

\[
A =_{def} \bigcup_{(C,M,E) \in \mathit{FMEA}} E \cup \{\mathit{mode}(C, M)\}
\]

Continuing our converter example:

\[
A = \left\{
\begin{array}{l}
\text{T cabinet},\ \text{P turbine}, \\
\text{T inverter cabinet},\ \text{T nacelle}, \\
\mathit{mode}(\mathit{Fan}, \mathit{Corrosion}), \\
\mathit{mode}(\mathit{Fan}, \mathit{TMF}), \\
\mathit{mode}(\mathit{IGBT}, \mathit{HCF})
\end{array}
\right\}
\]

Applying M results in the following set of propositional Horn clauses representing Th and thus
completing KB_Converter:

\[
\mathit{Th} = \left\{
\begin{array}{l}
\mathit{mode}(\mathit{Fan}, \mathit{Corrosion}) \rightarrow \text{T cabinet}, \\
\mathit{mode}(\mathit{Fan}, \mathit{Corrosion}) \rightarrow \text{P turbine}, \\
\mathit{mode}(\mathit{Fan}, \mathit{TMF}) \rightarrow \text{T cabinet}, \\
\mathit{mode}(\mathit{Fan}, \mathit{TMF}) \rightarrow \text{P turbine}, \\
\mathit{mode}(\mathit{IGBT}, \mathit{HCF}) \rightarrow \text{T inverter cabinet}, \\
\mathit{mode}(\mathit{IGBT}, \mathit{HCF}) \rightarrow \text{T nacelle}, \\
\mathit{mode}(\mathit{IGBT}, \mathit{HCF}) \rightarrow \text{P turbine}
\end{array}
\right\}
\]

On account of the mapping function M and the underlying structure of the FMEAs, the compiled
models feature a certain topology. First, the sets of hypotheses and symptoms are disjoint.
Second, since there is a causal link from faults to effects but not vice versa, the
descriptions exhibit a forward and acyclic structure. Specifically, each implication connects
one hypothesis to one effect and is thus a bijunctive clause. In order to account for
impossible observations, we append additional implications to KB stating that an effect and its
negation cannot occur simultaneously, i.e. e ∧ ¬e |= ⊥.
   The question remains whether the generated models are suitable for the diagnostic task.
Abductive explanations are consistent by definition and complete given an exhaustive search.
Thus, the appropriateness of the system description is determined by whether a single fault
diagnosis can be obtained given that all necessary information is available.
Definition 8. (One Single Fault Diagnosis Property (OSFDP)) Given a KB (A, Hyp, Th), the KB
fulfills the OSFDP if the following holds:
 ∀m ∈ Hyp : ∃Obs ⊆ A : {m} is a diagnosis of (A, Hyp, Th, Obs) and ¬∃m′ ∈ Hyp : m′ ≠ m such
 that {m′} is a diagnosis for the same PHCAP.
   The property ensures that, under the assumption that enough knowledge is available, all
single fault diagnoses can be distinguished and subsequently unnecessary replacement activities
are avoided. To verify whether the OSFDP holds or not, we compute the set of propositions δ(h)
implied by each hypothesis h and the theory. It is not fulfilled if we can record for two or
more hypotheses the same δ(h). [7] describes a polynomial algorithm testing for the property.
Note that the OSFDP check can be performed on the FMEA itself before compiling the model. This
is advantageous as the absence of the property indicates that internal variables or
observations have not been considered in the FMEA.
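The mapping of Definition 7 and the δ(h) comparison described above can be illustrated with a short Python sketch. It is not the authors' implementation: the function names (`compile_kb`, `delta`, `indistinguishable_groups`), the tuple encoding `(h, e)` for a clause h → e, and the underscored effect names are assumptions made for illustration only.

```python
from collections import defaultdict

# FMEA rows from Table 1 as (component, fault mode, effects) tuples.
# Underscored effect names are an assumption made for this sketch.
FMEA = [
    ("Fan", "Corrosion", {"T_cabinet", "P_turbine"}),
    ("Fan", "TMF", {"T_cabinet", "P_turbine"}),
    ("IGBT", "HCF", {"T_inverter_cabinet", "T_nacelle", "P_turbine"}),
]


def mode(component, fault_mode):
    """Name of the proposition mode(C, M)."""
    return f"mode({component}, {fault_mode})"


def compile_kb(fmea):
    """Mapping of Definition 7: one clause mode(C, M) -> e per effect e."""
    props, hyp, theory = set(), set(), set()
    for component, fault_mode, effects in fmea:
        h = mode(component, fault_mode)
        hyp.add(h)
        props |= effects | {h}
        theory |= {(h, e) for e in effects}  # (h, e) encodes h -> e
    return props, hyp, theory


def delta(h, theory):
    """Effects implied by h; a single lookup suffices for clauses h -> e."""
    return frozenset(e for body, e in theory if body == h)


def indistinguishable_groups(hyp, theory):
    """Groups of hypotheses that share the same non-empty delta(h)."""
    by_effects = defaultdict(set)
    for h in hyp:
        by_effects[delta(h, theory)].add(h)
    return [group for d, group in by_effects.items() if d and len(group) > 1]


A, Hyp, Th = compile_kb(FMEA)
groups = indistinguishable_groups(Hyp, Th)
violates_osfdp = bool(groups)  # identical delta(h) for two hypotheses rules out the OSFDP
```

For the three converter rows this yields seven propositions, three hypotheses and seven clauses, and it reports the two fan hypotheses as sharing the same δ(h). Grouping by δ(h) only detects the violation named in the text (identical effect sets); it is not meant as a complete decision procedure for Definition 8.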






   Assume the set of hypotheses {h1 , h2 , . . . , hn } share      of SAT solvers and their application to a vast number
the same δ(h). We cannot distinguish h1 , h2 , . . . , hn          of different AI problems and industrial domains have
from one another and thus all corresponding compo-                 motivated us to consider a SAT-based approach for ab-
nents have to be repaired or replaced in case they are             ductive diagnosis.
part of the diagnosis. Therefore, we can treat them                        Recall Definition 3 of a diagnosis: ∆ is an
as a unit by replacing h1, h2, ..., hn with a new                       abductive explanation if ∆ ∪ Th |= Obs and ∆ ∪ Th ⊭ ⊥.
hypothesis h′. Once all indistinguishable hypotheses have               Through logical equivalence we recast the first
been removed, the KB satisfies the OSFDP. Regarding                     condition to ∆ ∪ Th ∪ {¬Obs} |= ⊥, where {¬Obs} denotes
the hypotheses, which cannot be differentiated, as one             the set containing the complement of each observation
cause during diagnosis has an effect on the computa-               in Obs, i.e. ∀o ∈ Obs : ¬o ∈ {¬Obs} [10]. In general, we
tional effort as fewer hypotheses are to be considered.            can state the relation as follows: given the theory and
   Algorithm distinguishHypotheses replaces all in-                assuming the hypotheses to be true whereas stating the
distinguishable causes and ensures that after termi-               absence of a set of observations, results in an inconsis-
nation the given KB satisfies the OSFDP. Evidently,                tency due to the fact that the causes entail the effects,
the algorithm’s complexity is determined by the three                   i.e. Hyp ∪ Th ∪ {¬Obs} |= ⊥. Thus, we draw on this
nested loops, hence O(|Hyp|² |A − Hyp|). Since there                    relationship and reformulate the problem of generating
is a finite number of hypotheses and effects possibly              minimal abductive explanations for a set of observa-
included in δ(h) the algorithm must terminate.                     tions to computing minimal unsatisfiable subformulas.
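The recast of the entailment condition as an inconsistency check can be tried out on the running example. The sketch below is a brute-force illustration, not the procedure used in the paper: `entails` decides ∆ ∪ Th |= Obs by forward chaining, `refutes` tests by exhaustive search whether ∆ ∪ Th together with the negated conjunction of the observations has no model, and the short proposition names are assumptions. The consistency condition ∆ ∪ Th ⊭ ⊥ is omitted because this theory contains no clause deriving ⊥.

```python
from itertools import combinations, product

# Running example theory: (hypothesis, effect) encodes hypothesis -> effect.
# Short proposition names are assumptions made for this sketch.
TH = [("corr", "T_cab"), ("corr", "P_turb"), ("tmf", "T_cab"), ("tmf", "P_turb"),
      ("hcf", "T_inv"), ("hcf", "T_nac"), ("hcf", "P_turb")]
HYP = ["corr", "tmf", "hcf"]
ATOMS = sorted({a for clause in TH for a in clause})


def entails(delta, obs):
    """Delta ∪ Th |= Obs, decided by forward chaining over the Horn clauses."""
    derived = set(delta)
    changed = True
    while changed:
        changed = False
        for body, head in TH:
            if body in derived and head not in derived:
                derived.add(head)
                changed = True
    return set(obs) <= derived


def refutes(delta, obs):
    """True if Delta ∪ Th ∪ {¬o1 ∨ ... ∨ ¬on} has no model (exhaustive search)."""
    for values in product([False, True], repeat=len(ATOMS)):
        m = dict(zip(ATOMS, values))
        if (all(m[h] for h in delta)
                and all((not m[b]) or m[e] for b, e in TH)
                and any(not m[o] for o in obs)):
            return False  # a model exists, so there is no contradiction
    return True


obs = ["P_turb", "T_cab"]
subsets = [set(c) for r in range(len(HYP) + 1) for c in combinations(HYP, r)]
agree = all(entails(d, obs) == refutes(d, obs) for d in subsets)
solutions = [d for d in subsets if entails(d, obs)]
minimal = [d for d in solutions if not any(s < d for s in solutions)]
```

Both tests agree on every subset of the three hypotheses, and the subset-minimal solutions for these two observations are the two fan fault modes.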
                                                                      Since MUSes contain several unsatisfiable subsets
Algorithm 2 distinguishHypotheses                                       irrelevant for the diagnostic task, we define the set
    procedure distinguishHypotheses (A, Hyp, Th)
                                                                        MUSesHyp, which only contains subset minimal MUS
       Ψ[|Hyp|] ← Hyp                                                   comprising clauses referring to hypotheses:
       for all h1 ∈ Ψ do                                                Definition 9. (MUSesHyp) Let MUSes be the set of
          for all h2 ∈ Ψ do                                             MUSes of Hyp ∪ Th ∪ {¬Obs}, then ∀M ∈ MUSesHyp :
              if h1 ≠ h2 then                                           ∃U ∈ MUSes : M = U ∩ Hyp and ¬∃M′ ∈
                  if δ(h1) = δ(h2) and δ(h1) ≠ ∅ then                   MUSesHyp : M′ ⊂ M.
                      Create new hypothesis h′      ▷ h′ ∉ Hyp
                      Add h′ to Ψ                                       Corollary 1. Given a PHCAP (A, Hyp, Th, Obs), let
                      Add h′ to A                                       MUSesHyp be the set of interesting MUSes. A set
                      for all e ∈ δ(h1) do                              ∆ ⊆ Hyp is a minimal abductive diagnosis if ∃M ∈
                          Add (h′ → e) to Th                            MUSesHyp : ∆ = M and ∆ ∪ Th ⊭ ⊥.
                          Remove (h1 → e) from Th
                          Remove (h2 → e) from Th
                      end for                                           Proof. We can restate the problem of computing
                      Remove h1 ∧ h2 from Ψ                             inconsistencies to finding the set of prime implicates of
                      Remove h1 ∧ h2 from A                             Th ∧ Hyp ∧ {¬Obs}. By definition, the prime implicates
                  end if                                                are equivalent to the MUSes of said formula.
              end if
          end for                                                          Deriving a minimal abductive explanation
       end for                                                          corresponds to computing a minimal subset of the
       return KB(A, Ψ, Th)                                              hypotheses, which cannot be simultaneously satisfied with the
    end procedure                                                       theory and the negation of observations.
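A possible Python rendering of the idea behind Algorithm 2 is sketched below. It is a variant rather than a transcription: instead of editing Ψ inside two nested loops, it first groups hypotheses by δ(h) and then merges each group, which also handles more than two indistinguishable hypotheses at once. The `(h, e)` clause encoding, the `merged(...)` naming scheme and the underscored effect names are assumptions of this sketch.

```python
def delta(h, theory):
    """Effects implied by hypothesis h in a theory of clauses h -> e."""
    return frozenset(e for body, e in theory if body == h)


def distinguish_hypotheses(props, hyp, theory):
    """Merge hypotheses with identical non-empty delta(h) into one new hypothesis."""
    groups = {}
    for h in sorted(hyp):
        groups.setdefault(delta(h, theory), []).append(h)

    props, hyp, theory = set(props), set(hyp), set(theory)
    for effects, members in groups.items():
        if not effects or len(members) < 2:
            continue  # nothing to merge for this effect set
        merged = "merged(" + "; ".join(members) + ")"  # hypothetical naming scheme
        for h in members:
            hyp.discard(h)
            props.discard(h)
            theory -= {(h, e) for e in effects}
        hyp.add(merged)
        props.add(merged)
        theory |= {(merged, e) for e in effects}
    return props, hyp, theory


# Converter knowledge base; (h, e) encodes the clause h -> e.
TH = {("mode(Fan, Corrosion)", "T_cabinet"), ("mode(Fan, Corrosion)", "P_turbine"),
      ("mode(Fan, TMF)", "T_cabinet"), ("mode(Fan, TMF)", "P_turbine"),
      ("mode(IGBT, HCF)", "T_inverter_cabinet"), ("mode(IGBT, HCF)", "T_nacelle"),
      ("mode(IGBT, HCF)", "P_turbine")}
HYP = {h for h, _ in TH}
A = HYP | {e for _, e in TH}

A2, HYP2, TH2 = distinguish_hypotheses(A, HYP, TH)
```

On the converter knowledge base the two fan hypotheses are replaced by a single merged hypothesis, leaving two hypotheses and five clauses. Grouping with a dictionary avoids the pairwise comparison of Algorithm 2, at the cost of deviating from its exact control flow.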
                                                                      We devised the algorithm satAB, which computes the
   Our running example of the converter does not                        set of abductive diagnoses for a given PHCAP based on
fulfill the OSFDP, since mode(Fan, Corrosion) and                       MUS enumeration. First, in order to take advantage of
mode(Fan, TMF) are not distinguishable. By                              the MUSes, which correspond to the solutions of the
removing both hypotheses and introducing h′ =                           PHCAP, we create an unsatisfiable CNF encoding of
mode((Fan, Corrosion), (Fan, TMF)) the property is                      the problem. Since the Th consists of Horn clauses
fulfilled.                                                              a conversion into CNF is straightforward. Note that
   Notice that abductive diagnosis is premised on the              we are, however, not limited to Horn clause models, as
assumption that the model is complete; thus, we pre-               we can create a CNF representation based on Tseitin
sume that all significant fault modes for each con-                transformation [28]. We refer to the set of clauses as-
tributing part of the system have been contemplated                sociated with the theory as T . For each h ∈ Hyp we
in the FMEA. Furthermore, we expect on the one hand                create a single clause assuming h to be true. Addition-
that the symptoms described within the FMEA are de-                ally, we generate a disjunction containing the negated
tectable in order to constitute observations. On the               observations. The resulting unsatisfiable formula is re-
other hand, the automated mapping demands a consis-                ferred to as φ. ∆ − Set is the set of diagnoses obtained
tent effect denotation throughout the analysis.                    from the PHCAP.
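The encoding described above can be sketched for the running example as follows. This is an illustrative brute-force version under stated assumptions, not the satAB implementation: literals are `(atom, polarity)` pairs, `satisfiable` enumerates all assignments instead of calling a SAT solver, the proposition names are shortened, and `minimal_conflicting_hypotheses` searches directly for the subset-minimal hypothesis sets that are inconsistent with T and the negated observations, which is what Corollary 1 relates to the minimal diagnoses.

```python
from itertools import combinations, product

# Literals are (atom, polarity) pairs; a clause is a frozenset of literals.
# Short proposition names are assumptions made for this sketch.
TH = [("corr", "T_cab"), ("corr", "P_turb"), ("tmf", "T_cab"), ("tmf", "P_turb"),
      ("hcf", "T_inv"), ("hcf", "T_nac"), ("hcf", "P_turb")]
HYP = ["corr", "tmf", "hcf"]
OBS = ["P_turb", "T_cab"]

T = [frozenset({(h, False), (e, True)}) for h, e in TH]   # h -> e as (¬h ∨ e)
hyp_clause = {h: frozenset({(h, True)}) for h in HYP}      # one unit clause per hypothesis
obs_clause = frozenset((o, False) for o in OBS)            # (¬o1 ∨ ... ∨ ¬on)
phi = T + list(hyp_clause.values()) + [obs_clause]


def satisfiable(clauses):
    """Exhaustive satisfiability test; only viable for tiny formulas."""
    atoms = sorted({a for c in clauses for a, _ in c})
    for values in product([False, True], repeat=len(atoms)):
        m = dict(zip(atoms, values))
        if all(any(m[a] == pol for a, pol in c) for c in clauses):
            return True
    return False


def minimal_conflicting_hypotheses():
    """Subset-minimal S ⊆ Hyp such that T ∪ S ∪ {obs_clause} is unsatisfiable."""
    found = []
    for r in range(len(HYP) + 1):
        for subset in combinations(HYP, r):
            s = set(subset)
            if any(f <= s for f in found):
                continue  # a smaller conflicting set is already contained in s
            if not satisfiable(T + [hyp_clause[h] for h in subset] + [obs_clause]):
                found.append(s)
    return found


phi_unsat = not satisfiable(phi)
diagnoses = minimal_conflicting_hypotheses()
```

With the observations P turbine and T cabinet, φ is unsatisfiable and the two minimal conflicting hypothesis sets are the single fan fault modes. The theory and the observation clause alone remain satisfiable, so the conflict stems from the assumed hypotheses.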
                                                                      The diagnostic task consists in computing the sets of
5     Abductive Diagnosis via SAT                                       hypotheses which are responsible for the
Although an ATMS derives abductive diagnoses, it is                     unsatisfiability of φ, i.e. MUSesHyp(φ). Since finding satisfiable
limited to propositional Horn theories and subject to              subsets is an NP-hard problem whereas UNSAT resides
performance issues. Both problems have been accom-                 in Co-NP, we employ an MCSes enumeration algorithm
modated through ATMS extensions and focus strate-                  on the unsatisfiable formula and then derive the diag-
gies. Nevertheless, the advances in the development                noses via hitting set computation [25]. As we are only








                                         Figure 1: SAT encoding of the running example

Algorithm 3 satAB                                                          Hence the abductive diagnoses are
  procedure satAB (A, Hyp, Th, Obs)                                     ∆1 = {mode(Fan, Corrosion)} and ∆2 =
     MCSes ← ∅                                                          {mode(Fan, TMF)}.
     MCSesHyp ← ∅
     T ← CNF(Th)                  ▷ CNF representation of Th            6    Empirical Evaluation
     φ ← T ∪ Hyp ∪ ⋁_{o∈Obs} ¬o                                         To determine whether computing abductive diagnoses
     MCSes ← MCSes(φ)             ▷ MCS enumeration algorithm           via SAT yields any computational advantages in the
     for all m ∈ MCSes do                                               case of our models, we conducted an empirical
         if m ⊆ Hyp and m ∪ Th is consistent then                       evaluation, comparing abductiveExplanations to satAB
            MCSesHyp ← {m} ∪ MCSesHyp                                   on several instances of FMEAs. In case of the former
         end if                                                         we employed a Java implementation of an unfocused
     end for                                                            ATMS. The algorithm satAB exploits on the one hand
     ∆ − Set ← MHS(MCSesHyp)      ▷ Minimal hitting set algorithm       an MCS enumeration procedure and on the other hand
     return ∆ − Set                                                     an implementation of a hitting set algorithm. We
  end procedure                                                         utilized the MCSLS tool by [19] to compute the MCSes.
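The steps of Algorithm 3 can be traced on the running example with the following Python sketch. It is a toy version under stated assumptions: MCS enumeration and hitting set computation are done by brute force instead of MCSLS and the Binary Hitting Set Tree, the clause numbering C1 to C11 (theory, then hypotheses, then the negated observations) and the short proposition names are chosen for this sketch, and the consistency test m ∪ Th is skipped because the example theory contains no clause deriving ⊥.

```python
from itertools import combinations, product

# Clauses C1..C11 of the running example; literals are (atom, polarity) pairs.
# The numbering and the short proposition names are assumptions of this sketch.
TH = [("corr", "T_cab"), ("corr", "P_turb"), ("tmf", "T_cab"), ("tmf", "P_turb"),
      ("hcf", "T_inv"), ("hcf", "T_nac"), ("hcf", "P_turb")]
HYP = ["corr", "tmf", "hcf"]
OBS = ["P_turb", "T_cab"]

clauses = {}
for h, e in TH:                                   # C1..C7: h -> e as (¬h ∨ e)
    clauses[f"C{len(clauses) + 1}"] = [(h, False), (e, True)]
hyp_of = {}
for h in HYP:                                     # C8..C10: unit clauses for Hyp
    name = f"C{len(clauses) + 1}"
    clauses[name] = [(h, True)]
    hyp_of[name] = h
clauses[f"C{len(clauses) + 1}"] = [(o, False) for o in OBS]   # C11: negated observations
ATOMS = sorted({a for c in clauses.values() for a, _ in c})


def satisfiable(names):
    """Exhaustive satisfiability test over the named clauses."""
    for values in product([False, True], repeat=len(ATOMS)):
        m = dict(zip(ATOMS, values))
        if all(any(m[a] == pol for a, pol in clauses[n]) for n in names):
            return True
    return False


def enumerate_mcses():
    """Smallest-first search for clause sets whose removal restores satisfiability."""
    names, found = list(clauses), []
    for r in range(1, len(names) + 1):
        for cand in map(frozenset, combinations(names, r)):
            if any(m <= cand for m in found):
                continue  # contains a smaller correction set, hence not minimal
            if satisfiable([n for n in names if n not in cand]):
                found.append(cand)
    return found


def minimal_hitting_sets(sets):
    """Smallest-first search for sets intersecting every member of `sets`."""
    elems, found = sorted(set().union(*sets)), []
    for r in range(1, len(elems) + 1):
        for cand in map(frozenset, combinations(elems, r)):
            if any(f <= cand for f in found):
                continue
            if all(cand & s for s in sets):
                found.append(cand)
    return found


mcses = enumerate_mcses()
mcses_hyp = [m for m in mcses if m <= set(hyp_of)]          # keep MCSes within Hyp
diagnoses = [{hyp_of[c] for c in hs} for hs in minimal_hitting_sets(mcses_hyp)]
```

Under this numbering the sketch finds eleven MCSes, of which only {C8, C9} consists solely of hypothesis clauses; its minimal hitting sets give the two single-fault diagnoses for the fan. The exhaustive search is exponential in the number of clauses and only serves to make the data flow of the algorithm visible.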
                                                                        MCSLS is written in C++, employs MiniSat 2.2 as the
                                                                        SAT solver, and provides the possibility to apply
interested in the conflicts stemming from the                           several MCS enumeration algorithms. We opted for the
assumptions that all hypotheses are true, we select each MCS            CLD approach of MCSLS, which takes advantage of
only containing clauses referring to explanations. For                  disjoint unsatisfiable cores and showed the best
this reason, we create the set MCSesHyp such that                       overall performance in a preliminary experimental set-up.
∀m ∈ MCSesHyp : m ⊆ Hyp. This has one                                   Regarding the hitting set computation, we engaged a
practical rationale: it diminishes the number of sets to be             Java implementation of the Binary Hitting Set Tree
considered by the hitting set algorithm. The                            algorithm [29] which performed well in a comparison of
corresponding MUSes derived via hitting set computation                 minimal hitting set algorithms [30]. All the numbers
of MCSesHyp already constitute the abductive                            presented in this section were obtained from a Lenovo
diagnoses.                                                              ThinkPad T540p Intel Core i7-4700MQ processor (2.60
   Consider again our running example of the converter.                 GHz) with 8 GB RAM running Ubuntu 14.04 (64-bit).
We already obtained the KB via the mapping function                         Several publicly available as well as project internal
M. Let us assume that the condition monitoring sys-                      FMEAs provide the basis for our evaluation. They
tem of the wind turbine encountered that the turbine’s                   cover various technical systems and subsystems with
power output is lower than expected (P turbine) and                      different underlying structures. In particular they de-
that the cabinet temperature exceeds a certain thresh-                   scribe faults in electrical circuits, a connector system by
old (T cabinet), i.e. Obs = {P turbine, T cabinet}. In                   Ford (FCS), the Focal Plane Unit (FPU) of the Hetero-
Figure 1 we depict the CNF representation φ of the ab-                   dyne Instrument for the Far Infrared (HIFI) built for
duction problem. Clauses C1 to C7 refer to T , C8 to                     the Herschel Space Observatory, printed circuit boards
C10 to the set Hyp and clause C11 contains the negation                  (PCB), the Anticoincidence Detector (ACD) mounted
of the set of observations.                                              on the Large Area Telescope of the Fermi Gamma-ray
   Computing the M CSes of φ we obtain: M CSes =                         Space Telescope, the Maritim ITStandard (MiTS), and
(                                                                )       rectifier, inverter, transformer, backup components, as
    {C11 } , {C1 , C3 } , {C1 , C9 } , {C3 , C8 } , {C9 , C8 } ,
       {C4 , C7 , C2 } , {C4 , C10 , C2 } , {C4 , C7 , C8 } ,      .     well as main bearing of an industrial wind turbine. By
       {C4 , C10 , C8 } , {C2 , C9 , C7 } , {C2 , C9 , C10 }             applying the mapping function M, we generated the
                                                                         corresponding abductive knowledge bases KB for each
   Extracting the MCSes, which only contain clauses                      FMEA. Table 2 provides an overview of the FMEAs’
from Hyp and are consistent with regard to the theory,                   structure and the evaluation results. It is worth noting
results in                                                               that the FMEAs vary in the number of hypotheses, i.e.
                                                                         component-fault mode couples, the number of effects,
                 M CSesHyp = {{C9 , C8 }}.                               and the number of rules, i.e. the links between faults
By computing the hitting set of M CSesHyp , we obtain                    and symptoms. Due to T h of an abductive KB com-
the set of MUSes solely referring to explanations, which                 prising Horn clauses, a conversion into a CNF represen-
is in fact the set of diagnoses:                                         tation, suitable for the MCSLS tool, is straightforward.
                                                                         We do not address the model compilation times, since
                  ∆ − Set = {{C9 } , {C8 }}.                             the system description would be compiled offline and






the mapping execution consumed less than one second for the examples we utilized so far.

Figure 2: Cumulative runtimes of abductiveExplanations and satAB for the FMEA instances

   Table 2 shows that none of the original models, except the one resulting from the transformer's
FMEA, satisfies the OSFDP. Therefore, we compiled a second set of models fulfilling the property
by exchanging each set of indistinguishable hypotheses with a new single hypothesis representing
said set. For example, Algorithm distinguishHyp ensures that the resulting KB satisfies the OSFDP.
In Table 2 the original models are identified accordingly, and the adapted models are provided
with the label OSFDP. Note that the number of hypotheses and rules diminishes for the adapted
models.
   In the experiments, we computed the abductive explanations for |Obs| from one to the maximum
number of effects possible. The observations were generated randomly; however, the same set was
used for satAB and abductiveExplanations on the original as well as the adapted model. The results
reported in Table 2 have been obtained from ten trials, and both algorithms faced a runtime limit
of 200 seconds. Although some of the small runtimes are debatable because they were measured in
the millisecond range, Table 2 reveals that satAB (Mean = 703.73 ms, SD = 8432.07 ms, Median =
0.59 ms, Skewness = 18.61) does not outperform abductiveExplanations (Mean = 3.08 ms, SD =
16.38 ms, Median = 1 ms, Skewness = 12.68) in general. From the statistical data we can infer that
the runtime distributions of both algorithms are highly right-skewed, i.e. the bulk of the values
lies towards the lower runtimes. We can even observe that for certain instances the SAT-based
approach performs rather poorly. Among these are the models of an inverter and a rectifier of an
industrial wind turbine. satAB exceeded the given timeout four times for the former. Note that in
all these cases the MCS generation alone already reached the time limit. According to [19], CLD
requires |φ| − p + 1 SAT solver calls, where p refers to the size of the smallest MCS of φ. In our
case p = 1, as the clause representing the set of negated observations always constitutes an MCS.
Thus, |φ| SAT solver calls are necessary, where |φ| is determined by |Th| + |Hyp| + 1, with 1
referring to the clause containing the observations. Unsurprisingly, the larger FMEAs are more
computationally demanding. It is worth mentioning that in the majority of cases the hitting set
computation accounted for a negligible fraction of the total runtime.
   Figure 2 illustrates the cumulative log runtimes for satAB and abductiveExplanations on the
FMEA models generated. Although abductiveExplanations performs better on average, the first model
requires a longer computation time for both algorithms. Moreover, the illustration reveals the
high computational effort necessary for satAB to compute the diagnoses for the model of the
inverter. As expected, we observe particularly high runtimes when the set of observations contains
effects corresponding to different hypotheses. This has a greater impact on satAB than on the ATMS
implementation. For the section from the models FCS to PCB in Figure 2, however, we can see that
the cumulative runtime for abductiveExplanations rises at a steeper angle. Generally, the data
gathered in the experiment do not suggest a performance benefit of the SAT-based approach over an
ATMS implementation.

7   Conclusion and Future Work
In this paper, we presented a mapping from available failure assessments to propositional Horn
clause models. The modeling methodology relies on FMEAs as they comprise information on faults and
their symptoms. Hence, they provide a suitable source for model compilation. Although in our case
an ATMS can be used to compute abductive diagnoses, it is limited to propositional Horn theories.
We proposed a SAT-based approach to abductive model-based diagnosis which allows us to reason on
more expressive representations. Our method is based on computing conflict sets, i.e. MUSes,
resulting from a rewritten, unsatisfiable system description. Subsets of these unsatisfiable cores
constitute the minimal abductive explanations. Since the computation of MUSes is computationally
demanding, our proposed algorithm exploits their hitting set dual, MCSes, in order to derive
minimal diagnoses.
   We empirically compared an implementation of a diagnosis engine employing an ATMS to our
SAT-based algorithm. The results indicate that while the algorithm performs well for some of the
models, in general we could not observe a performance advantage. Particular examples even led to
longer computation times than the ATMS-based implementation. Despite the fact that the data
provided no evidence of a computational benefit in employing a SAT-based approach, we believe that
the possibility to utilize more expressive models provides an interesting incentive for future
research in this area.
   Since the evaluation results did not indicate a superiority of the SAT-based approach on
grounds of MCSes enumeration, we are currently investigating direct conflict generation methods.
Additionally, due to the model structure and the experiment data, we are planning on employing
compilation methods [31, 32] in order to shift some of the computational burden to the model
generation process.

Acknowledgments
The work presented in this paper has been supported by the FFG project Applied Model Based
Reasoning (AMOR) under grant 842407. We would further like to express our gratitude to our
industrial partner, Uptime Engineering GmbH.
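
The running converter example can be replayed in a few lines as a sanity check. The sketch below is not the authors' implementation: it takes the MCSes of φ exactly as listed in the text (no SAT solver is called), filters them down to MCSesHyp, and enumerates the minimal hitting sets by brute force, which stands in for the Binary Hitting Set Tree algorithm [29]. The consistency check against the theory is not modelled. It also evaluates the call count |φ| − p + 1 quoted from [19] for this example, assuming that C1 to C7 are the theory clauses. All function and variable names are illustrative.

```python
from itertools import combinations

# Clause labels follow the running example. Assumption: C1..C7 encode the
# theory Th, C8..C10 encode Hyp, and C11 holds the negated observations.
TH = {f"C{i}" for i in range(1, 8)}
HYP = {"C8", "C9", "C10"}

# MCSes of phi, copied from the text (not computed here).
MCSES = [
    {"C11"}, {"C1", "C3"}, {"C1", "C9"}, {"C3", "C8"}, {"C9", "C8"},
    {"C4", "C7", "C2"}, {"C4", "C10", "C2"}, {"C4", "C7", "C8"},
    {"C4", "C10", "C8"}, {"C2", "C9", "C7"}, {"C2", "C9", "C10"},
]


def restrict_to_hyp(mcses, hyp):
    """Keep only the MCSes that consist solely of hypothesis clauses."""
    return [m for m in mcses if m <= hyp]


def minimal_hitting_sets(sets):
    """Enumerate all subset-minimal hitting sets by brute force."""
    universe = sorted(set().union(*sets))
    found = []
    for size in range(1, len(universe) + 1):
        for candidate in combinations(universe, size):
            cand = set(candidate)
            if any(h <= cand for h in found):
                continue  # contains a smaller hitting set, so not minimal
            if all(cand & s for s in sets):
                found.append(cand)
    return found


def cld_sat_calls(num_clauses, smallest_mcs_size):
    """Evaluate the call count |phi| - p + 1 quoted from [19]."""
    return num_clauses - smallest_mcs_size + 1


mcses_hyp = restrict_to_hyp(MCSES, HYP)
delta_set = minimal_hitting_sets(mcses_hyp)

num_clauses = len(TH) + len(HYP) + 1   # |Th| + |Hyp| + 1
p = min(len(m) for m in MCSES)         # size of the smallest MCS
sat_calls = cld_sat_calls(num_clauses, p)

print("MCSesHyp: ", [sorted(m) for m in mcses_hyp])
print("Delta-Set:", sorted(sorted(d) for d in delta_set))
print("SAT calls:", sat_calls)
```

For this example the filter leaves the single set {C8, C9}, the minimal hitting sets are {C8} and {C9}, and the formula gives 11 − 1 + 1 = 11 solver calls. Brute-force enumeration is exponential in the number of hypothesis clauses and is only meant for inputs of this size.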








                                Model Structure              #Diagnoses                                        Runtime [ms]
Component          Model     #Hyp  #Effects  #Rules    MAX     AVG   SF   DF   TF  Algorithm              MIN       MAX       AVG
Electrical circuit Original    32        17      52    792  197.15   11   11   66  abductiveExplanations   <1       425     27.87
                                                                                   satAB                   <1    181.33     76.05
                   OSFDP       15        17      35      1       1    1    1    1  abductiveExplanations   <1         8      0.33
                                                                                   satAB                   <1      1.91      0.16
FCS                Original    17        17      51     18    2.93    3    6   18  abductiveExplanations   <1         1      0.42
                                                                                   satAB                   <1      6.41      1.28
                   OSFDP       15        17      49     18    2.75    3    6   18  abductiveExplanations   <1        61      2.04
                                                                                   satAB                   <1      4.73      0.56
ACD                Original    13        16      41     15    2.89    5   15   15  abductiveExplanations   <1        84      1.38
                                                                                   satAB                   <1      2.89      0.35
                   OSFDP       12        16      39     10    2.04    5   10   10  abductiveExplanations   <1         1      0.29
                                                                                   satAB                   <1     2.435      0.28
Main bearing       Original     3         5      20      3    2.54    3    0    0  abductiveExplanations   <1         1      0.16
                                                                                   satAB                   <1         1      0.09
                   OSFDP        2         5      15      2    1.54    2    0    0  abductiveExplanations   <1         1      0.12
                                                                                   satAB                   <1      0.61      0.03
HIFI - FPU         Original    17        11      36     63    8.64    3    7   21  abductiveExplanations   <1        86      2.54
                                                                                   satAB                   <1      8.33         3
                   OSFDP        9        11      27      6    1.55    2    2    3  abductiveExplanations   <1         1      0.15
                                                                                   satAB                   <1         1      0.09
MiTS 1             Original    18        21      48     24    8.40    3    2    6  abductiveExplanations   <1        94      3.40
                                                                                   satAB                   <1      3.02      0.39
                   OSFDP       13        21      43      1       1    1    1    1  abductiveExplanations   <1       100      1.54
                                                                                   satAB                   <1      2.15      0.16
MiTS 2             Original    22        15      48    288   39.98    4    8   18  abductiveExplanations   <1       109      4.49
                                                                                   satAB                   <1     15.16      3.43
                   OSFDP       14        15      37      5    2.02    1    5    2  abductiveExplanations   <1         1      0.33
                                                                                   satAB                   <1      1.68      0.20
PCB                Original    10        11      24      2    1.49    2    2    2  abductiveExplanations   <1         1      0.21
                                                                                   satAB                   <1      1.49       0.1
                   OSFDP        9        11      23      1       1    1    1    1  abductiveExplanations   <1         1      0.11
                                                                                   satAB                   <1         1       0.1
Inverter           Original    30        38     144    450   23.73   19    5   50  abductiveExplanations   <1       107      6.15
                                                                                   satAB                   <1    166593   5007.37
                   OSFDP       23        38     124     66    5.89   14    3    6  abductiveExplanations   <1        94      1.67
                                                                                   satAB                   <1   1110.82     38.23
Rectifier          Original    20        17      93     88   10.83    8   24   32  abductiveExplanations   <1         6      1.07
                                                                                   satAB                   <1   24236.9   1070.88
                   OSFDP       14        17      66     22    3.06    5   18    8  abductiveExplanations   <1         1      0.63
                                                                                   satAB                   <1     44.74      4.88
Transformer        Original     5         8      22      2    1.06    2    2    1  abductiveExplanations   <1         1      0.16
                                                                                   satAB                   <1      1.69      0.06
                   OSFDP        5         8      22      2    1.06    2    2    1  abductiveExplanations   <1         1      0.13
                                                                                   satAB                   <1      1.91      0.08
Backup components  Original    25        30     114    252   23.06    8   12   21  abductiveExplanations   <1       138      5.24
                                                                                   satAB                   <1     41.98     12.89
                   OSFDP       19        30      95     48    3.29    7    7   10  abductiveExplanations   <1         4      0.79
                                                                                   satAB                   <1     10.06      3.09


Table 2: Features of the FMEAs and experimental results. For each component we conducted the experiment
using an implementation of abductiveExplanations and satAB. The columns SF, DF, TF display the maximum
number of single faults, double faults, and triple faults, respectively.
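
The AVG runtime column of Table 2 can be tallied per model to see how the heavy tail arises. The following sketch is an illustration only: the values are transcribed from Table 2, the helper name `tally_faster` is hypothetical, and the tally compares per-model averages, many of which lie in the sub-millisecond range that the text itself treats with caution. The mean computed here is a mean over the 24 per-model averages and is therefore not the per-instance mean reported in the text.

```python
from statistics import mean, median

# (component, model) -> (AVG runtime of abductiveExplanations,
#                        AVG runtime of satAB), in ms, transcribed from Table 2.
AVG_MS = {
    ("Electrical circuit", "Original"): (27.87, 76.05),
    ("Electrical circuit", "OSFDP"): (0.33, 0.16),
    ("FCS", "Original"): (0.42, 1.28),
    ("FCS", "OSFDP"): (2.04, 0.56),
    ("ACD", "Original"): (1.38, 0.35),
    ("ACD", "OSFDP"): (0.29, 0.28),
    ("Main bearing", "Original"): (0.16, 0.09),
    ("Main bearing", "OSFDP"): (0.12, 0.03),
    ("HIFI - FPU", "Original"): (2.54, 3.0),
    ("HIFI - FPU", "OSFDP"): (0.15, 0.09),
    ("MiTS 1", "Original"): (3.40, 0.39),
    ("MiTS 1", "OSFDP"): (1.54, 0.16),
    ("MiTS 2", "Original"): (4.49, 3.43),
    ("MiTS 2", "OSFDP"): (0.33, 0.20),
    ("PCB", "Original"): (0.21, 0.1),
    ("PCB", "OSFDP"): (0.11, 0.1),
    ("Inverter", "Original"): (6.15, 5007.37),
    ("Inverter", "OSFDP"): (1.67, 38.23),
    ("Rectifier", "Original"): (1.07, 1070.88),
    ("Rectifier", "OSFDP"): (0.63, 4.88),
    ("Transformer", "Original"): (0.16, 0.06),
    ("Transformer", "OSFDP"): (0.13, 0.08),
    ("Backup components", "Original"): (5.24, 12.89),
    ("Backup components", "OSFDP"): (0.79, 3.09),
}


def tally_faster(avg_ms):
    """Count, per algorithm, the models on which its AVG runtime is lower."""
    counts = {"abductiveExplanations": 0, "satAB": 0, "tie": 0}
    for atms, sat in avg_ms.values():
        if atms < sat:
            counts["abductiveExplanations"] += 1
        elif sat < atms:
            counts["satAB"] += 1
        else:
            counts["tie"] += 1
    return counts


counts = tally_faster(AVG_MS)
atms_avgs = [atms for atms, _ in AVG_MS.values()]
sat_avgs = [sat for _, sat in AVG_MS.values()]
worst_sat = max(AVG_MS, key=lambda key: AVG_MS[key][1])

print(counts)
print("abductiveExplanations: mean", round(mean(atms_avgs), 2),
      "median", round(median(atms_avgs), 2))
print("satAB: mean", round(mean(sat_avgs), 2),
      "median", round(median(sat_avgs), 2))
print("largest satAB AVG:", worst_sat)
```

With these values satAB has the lower average on 15 of the 24 models and abductiveExplanations on 9, while the largest satAB average belongs to the original inverter model. A handful of expensive models thus dominates the satAB mean, which is in line with the right-skewed distributions described in the evaluation.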






References
[1]  Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence,
     32(1):57–95, 1987.
[2]  Johan de Kleer and Brian C Williams. Diagnosing Multiple Faults. Artificial Intelligence,
     32(1):97–130, 1987.
[3]  Brian C Williams and P Pandurang Nayak. A model-based approach to reactive self-configuring
     systems. In Proceedings of the National Conference on Artificial Intelligence, pages 971–978,
     1996.
[4]  Peter Struss, Andreas Malik, and Martin Sachenbacher. Case studies in model-based diagnosis
     and fault analysis of car-subsystems. In Proc. 1st Int'l Workshop Model-Based Systems and
     Qualitative Reasoning, pages 17–25, 1996.
[5]  Luca Console, Daniele Theseider Dupre, and Pietro Torasso. On the Relationship Between
     Abduction and Deduction. Journal of Logic and Computation, 1(5):661–690, 1991.
[6]  Peter Zoeteweij, Jurryt Pietersma, Rui Abreu, Alexander Feldman, and Arjan JC Van Gemund.
     Automated fault diagnosis in embedded systems. In Secure System Integration and Reliability
     Improvement, 2008. SSIRI'08. Second International Conference on, pages 103–110. IEEE, 2008.
[7]  Franz Wotawa. Failure mode and effect analysis for abductive diagnosis. In Proceedings of the
     International Workshop on Defeasible and Ampliative Reasoning (DARe-14), volume 1212. CEUR
     Workshop Proceedings, ISSN 1613-0073, 2014. http://ceur-ws.org/Vol-1212/.
[8]  Peter G. Hawkins and Davis J. Woollons. Failure modes and effects analysis of complex
     engineering systems using functional models. Artificial Intelligence in Engineering,
     12:375–397, 1998.
[9]  Chris Price and Neil Taylor. Automated multiple failure fmea. Reliability Engineering &
     System Safety, 76:1–10, 2002.
[10] Sheila A McIlraith. Logic-based abductive inference. Knowledge Systems Laboratory, Technical
     Report KSL-98-19, 1998.
[11] Pierre Marquis. Consequence finding algorithms. In Handbook of Defeasible Reasoning and
     Uncertainty Management Systems, pages 41–145. Springer, 2000.
[12] Katsumi Inoue. Linear resolution for consequence finding. Artificial Intelligence,
     56(2):301–353, 1992.
[13] Franz Wotawa, Ignasi Rodriguez-Roda, and Joaquim Comas. Environmental decision support
     systems based on models and model-based reasoning. Environmental Engineering and Management
     Journal, 9(2):189–195, 2010.
[14] Amit Metodi, Roni Stern, Meir Kalech, and Michael Codish. A novel SAT-based approach to model
     based diagnosis. Journal of Artificial Intelligence Research, pages 377–411, 2014.
[15] Alexander Feldman, Gregory Provan, Johan de Kleer, Stephan Robert, and Arjan van Gemund.
     Solving model-based diagnosis problems with Max-SAT solvers and vice versa. In DX-10,
     International Workshop on the Principles of Diagnosis, 2010.
[16] Alexander Feldman, Gregory M Provan, and Arjan JC van Gemund. Computing minimal diagnoses by
     greedy stochastic search. In AAAI, pages 911–918, 2008.
[17] Iulia Nica and Franz Wotawa. ConDiag-computing minimal diagnoses using a constraint solver.
     In International Workshop on Principles of Diagnosis, pages 185–191, 2012.
[18] Alexander Felfernig and Monika Schubert. Fastdiag: A diagnosis algorithm for inconsistent
     constraint sets. In Proceedings of the 21st International Workshop on the Principles of
     Diagnosis (DX 2010), Portland, OR, USA, pages 31–38, 2010.
[19] Joao Marques-Silva, Federico Heras, Mikolás Janota, Alessandro Previti, and Anton Belov. On
     computing minimal correction subsets. In Proceedings of the Twenty-Third international joint
     conference on Artificial Intelligence, pages 615–622. AAAI Press, 2013.
[20] Andreas Pfandler, Stefan Rümmele, and Stefan Szeider. Backdoors to abduction. In Proceedings
     of the Twenty-Third international joint conference on Artificial Intelligence, pages
     1046–1052. AAAI Press, 2013.
[21] Gustav Nordh and Bruno Zanuttini. What makes propositional abduction tractable. Artificial
     Intelligence, 172:1245–1284, 2008.
[22] Gerhard Friedrich, Georg Gottlob, and Wolfgang Nejdl. Hypothesis classification, abductive
     diagnosis and therapy. In Expert Systems in Engineering Principles and Applications, pages
     69–78. Springer, 1990.
[23] Franz Wotawa, Ignasi Rodriguez-Roda, and Joaquim Comas. Abductive Reasoning in Environmental
     Decision Support Systems. In AIAI Workshops, pages 270–279, 2009.
[24] Chin-Liang Chang and Richard Char-Tung Lee. Symbolic logic and mechanical theorem proving.
     Academic press, 1973.
[25] Mark H Liffiton and Karem A Sakallah. Algorithms for computing minimal unsatisfiable subsets
     of constraints. Journal of Automated Reasoning, 40(1):1–33, 2008.
[26] Elazar Birnbaum and Eliezer L Lozinskii. Consistent subsets of inconsistent systems:
     structure and behaviour. Journal of Experimental & Theoretical Artificial Intelligence,
     15(1):25–46, 2003.
[27] Christopher S Gray, Roxane Koitz, Siegfried Psutka, and Franz Wotawa. An abductive diagnosis
     and modeling concept for wind power plants. In International Workshop on Principles of
     Diagnosis, 2014.
[28] Gregory Tseitin. On the complexity of proofs in propositional logics. In Seminars in
     Mathematics, volume 8, pages 466–483, 1970.
[29] Li Lin and Yunfei Jiang. The computation of hitting sets: review and new algorithms.
     Information Processing Letters, 86(4):177–184, 2003.
[30] Ingo Pill, Thomas Quaritsch, and Franz Wotawa. From conflicts to diagnoses: An empirical
     evaluation of minimal hitting set algorithms. In 22nd Int. Workshop on the Principles of
     Diagnosis, pages 203–210, 2011.
[31] Adnan Darwiche. Decomposable negation normal form. Journal of the ACM (JACM), 48(4):608–647,
     2001.
[32] Pietro Torasso and Gianluca Torta. Computing minimum-cardinality diagnoses using OBDDs. In
     KI 2003: Advances in Artificial Intelligence, pages 224–238. Springer, 2003.








             Fault Tolerant Control for a 4-Wheel Skid Steering Mobile Robot

                George K. Fourlas¹, George C. Karras² and Kostas J. Kyriakopoulos²
      ¹ Department of Computer Engineering, Technological Educational Institute (T.E.I.) of Central
                                         Greece, Lamia, Greece
                                        email: gfourlas@teiste.gr
    ² Control Systems Laboratory, School of Mechanical Eng., National Technical University of Athens
                                        (NTUA), Athens, Greece
                           email: karrasg@mail.ntua.gr, kkyria@mail.ntua.gr
                         Abstract
     This paper studies a fault tolerant control strategy for a four wheel skid steering mobile
     robot (SSMR). In this work the fault diagnosis procedure is accomplished using a structural
     analysis technique, while fault accommodation is based on a Recursive Least Squares (RLS)
     approximation. The goal is to detect faults as early as possible and recalculate command
     inputs in order to achieve fault tolerance, which means that despite the occurrence of faults
     the system is able to recover its original task with the same or degraded performance. Fault
     tolerance can be considered to consist of two basic tasks, fault diagnosis and control
     redesign. In our research, using the diagnosis approach presented in our previous work, we
     mainly address the second task, proposing a framework for fault tolerant control which allows
     retaining acceptable performance under system faults. In order to prove the efficacy of the
     proposed method, an experimental procedure was carried out using a Pioneer 3-AT mobile robot.

1    Introduction
The growing demand for more reliable performance in modern robotic systems has necessitated the
development of appropriate fault diagnosis methods. The appearance of faults is inevitable in all
systems, such as wheeled robots, either because their elements are worn out or because the
environment in which they operate presents unanticipated situations [4].
In a large number of applications, such as search and rescue, planetary exploration, nuclear waste
cleanup or mine decommissioning, the wheeled robots operate in environments where human
intervention can be very costly, slow or even impossible. They can move freely in such dynamic
environments. It is therefore essential for the robots to monitor their behavior so that faults
may be addressed before they result in catastrophic failures.

                   Figure 1. 4-Wheel Skid Steering Mobile Robot.

Fault diagnosis and accommodation for wheeled mobile robots is a complex problem due to the large
number of faults that can be present, such as faults of sensors and actuators [10]–[20].
Model based fault detection and isolation is a method to perform fault diagnosis using a certain
model of the system. The goal is to detect faults as early as possible in order to provide a
timely warning [8]. The aim of timely handling of a fault occurrence is to accommodate its
consequences so that the system remains functional. This can be achieved with fault tolerance.
In cases where a fault cannot be tolerated, it is necessary to use redundant hardware. In practice
there exist two different approaches for fault tolerant control, static redundancy and dynamic
redundancy [8].
In [10] and [16], the research is focused only on the problem of fault detection and
identification in a mobile robot, and different approaches related to state estimation were
introduced. In [9] and [15], the research interest is focused only on the problem of fault
detection, which is a separate problem in the fault diagnosis domain. The research efforts in [7]
and [12]–[14] are primarily intended to detect faults in the sensors of a wheeled robot.
Concerning the research area of detection and accommodation on wheeled robots, there is also a
small number of efforts [18] with different approaches and methodologies.
                                                                   As a fault, it can be considered any unpermitted deviation
A wheeled mobile robot is usually an embedded control
                                                                   from the normal behavior of a system. Fault diagnosis is
platform, which consists of an on-board computer, power,
                                                                   the procedure of determination of the component which is
motor control system, communications, sonars, cameras,
                                                                   faulty. Consequently, the aim of fault diagnosis is to pro-
laser radar system and sensors such as gyroscope, encod-
                                                                   duce the suitable fault statement regarding the malfunction
ers, accelerometers etc, Fig. 1.
                                                                   of a wheeled robot.






Fault diagnosis includes fault detection, which is the indication that something is going
wrong in the system, and fault isolation, which follows fault detection and is the
determination of the magnitude of the fault by evaluating symptoms. Fault detection and
isolation tasks together are referred to as fault diagnosis (FDI - Fault Detection and
Isolation).
Among the various methods for the design of a residual generator, only a few deal with
nonlinear systems. Structural analysis is a technique that provides feasible solutions to the
residual generation of nonlinear systems.
Structural analysis methods are used in research publications [2] and [6]. Paper [3] presents
a structural analysis for complex systems such as a ship propulsion benchmark. In [13] and
[14] the authors discuss how the structural analysis technique is applied to an unmanned
ground vehicle for residual generation.
In this research, a model based fault diagnosis for a four wheel skid steering mobile robot
(SSMR) is presented. The basic idea is to use a structural analysis based technique in order
to generate residuals. For this purpose we use the kinematic model of the mobile robot, which
serves for the design of the structural model of the system. This technique provides the
parity equations which can be used as residual generators. The advantage of the proposed
method is that it offers a feasible solution to the residual generation of nonlinear systems.
Additionally, we propose a fault accommodation technique based on RLS approximation in order
to provide recalculated control inputs in the case that the left or right set of the robot
tires becomes flat.
The mobile robot is supposed to be equipped with two high resolution optical quadrature shaft
encoders mounted on reversible-DC motors, which provide the rotational speeds of the left and
right wheels $\omega_L$ and $\omega_R$ respectively, and an inertial measurement unit (IMU),
which provides the forward linear acceleration and the angular velocity as well as the angle
θ between the mobile robot axle and the x axis of the mobile robot. The absolute pose
(horizontal position and orientation) of the robot is available via a camera system mounted
on the workspace of the robot. A distinctive marker is placed at the top side of the robot.
The paper is organized as follows. We start by presenting the mathematical model of a Pioneer
3-AT mobile robot in Section 2. Section 3 describes the fault diagnosis procedure. Section 4
describes the methodology of fault accommodation. In Section 5 we present the application
results of the proposed method to the robotic platform. Conclusions and directions for future
work are presented in Section 6.

2   Mathematical Model of Pioneer 3-AT Mobile Robot

In this work, the mobile robot Pioneer 3-AT was used as a robotic platform. This robot is a
four wheel skid-steering vehicle actuated by two motors, one for the left sided wheels and
the other for the right sided wheels. The wheels on the same side are mechanically coupled
and thus have the same velocity. Also, they are equipped with encoders and the angular
readings are available through routine calls.
The kinematic model describes the motion constraints of the system, as well as the
relationship of the sensor measurements with the system states, and it is crucial for the
fault diagnosis procedure.

2.1 Kinematic Model
The geometry of the robot is presented in Fig. 2. To consider the model of the four wheel
skid steering mobile robot (SSMR) it is assumed that the robot is placed on a plane surface
where $(X_I, Y_I)$ is the inertial reference frame and $(X, Y)$ is a local coordinate frame
fixed on the robot at its center of mass (COM). The position of the COM is $(x, y)$ with
respect to the inertial frame and $\vartheta$ is the orientation of the local coordinate
frame with respect to the inertial frame.

                    Figure 2. Mobile Robot Geometry.

As depicted in Fig. 2, a is the distance between the center of mass and the front wheels axle
along X, b is the distance between the center of mass and the rear wheels axle along X, c is
half the distance between the wheels along Y and $R_L$, $R_R$ are the radii of the left and
right wheels respectively. The coordinates of the instantaneous center of rotation (ICR) are
$(x_{ICR}, y_{ICR})$.
Assuming that the robot moves on a horizontal plane, the linear velocity with respect to the
local frame is given by

$$\upsilon = \begin{bmatrix} \upsilon_x \\ \upsilon_y \\ 0 \end{bmatrix} \tag{1}$$

and its angular velocity is given by

$$\omega = \begin{bmatrix} 0 \\ 0 \\ \omega \end{bmatrix} \tag{2}$$

The state vector with respect to the inertial frame is

$$q = \begin{bmatrix} x \\ y \\ \vartheta \end{bmatrix} \tag{3}$$




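As a small illustration of the quantities defined in (1)-(3), the following sketch stores the pose $q = (x, y, \vartheta)$ and expresses a local-frame velocity $(\upsilon_x, \upsilon_y, \omega)$ in the inertial frame through a planar rotation by $\vartheta$. The names `Pose` and `body_to_inertial` are hypothetical and are not part of the original work; the sketch only assumes planar motion, as stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Pose:
    """State vector q = (x, y, theta) in the inertial frame, as in (3)."""
    x: float
    y: float
    theta: float  # orientation of the local frame, in radians


def body_to_inertial(pose, v_x, v_y, omega):
    """Rotate a local-frame velocity (v_x, v_y, omega) into the inertial frame.

    A rotation about the vertical axis leaves the angular rate unchanged,
    so only the two linear components are rotated by pose.theta.
    """
    cos_t, sin_t = math.cos(pose.theta), math.sin(pose.theta)
    x_dot = cos_t * v_x - sin_t * v_y
    y_dot = sin_t * v_x + cos_t * v_y
    return x_dot, y_dot, omega


# Example: a robot heading along the inertial Y axis and driving forward
# at 1 m/s moves along Y in the inertial frame.
x_dot, y_dot, theta_dot = body_to_inertial(Pose(0.0, 0.0, math.pi / 2), 1.0, 0.0, 0.3)
```

The function is kept free of any robot-specific parameters, so it can be reused with whichever velocity source (encoders, IMU or camera) is available.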


The time derivative of (3) denotes the robot's velocity vector and is given by

$$\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\vartheta} \end{bmatrix} =
\begin{bmatrix} \cos\vartheta & -\sin\vartheta & 0 \\ \sin\vartheta & \cos\vartheta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} \upsilon_x \\ \upsilon_y \\ \omega \end{bmatrix} \tag{4}$$

Assuming that longitudinal slip between the wheels and the surface can be neglected, we have
the following equation,

$$\upsilon_{ix} = R_i \omega_i \tag{5}$$

where $\upsilon_{ix}$ is the longitudinal component of the total velocity vector $\upsilon_i$
of the i-th wheel expressed with respect to the local frame and $R_i$ is the rolling radius
of that wheel.
If we take into account all wheels (Fig. 2), the following relationships between the wheels
can be obtained [11],

$$\begin{aligned}
\upsilon_L &= \upsilon_{1x} = \upsilon_{2x} \\
\upsilon_R &= \upsilon_{3x} = \upsilon_{4x} \\
\upsilon_F &= \upsilon_{1y} = \upsilon_{4y} \\
\upsilon_B &= \upsilon_{2y} = \upsilon_{3y}
\end{aligned} \tag{6}$$

where $\upsilon_L$ refers to the longitudinal coordinates of the left wheels velocities,
$\upsilon_R$ refers to the longitudinal coordinates of the right wheels velocities,
$\upsilon_F$ refers to the lateral coordinates of the front wheels velocities and
$\upsilon_B$ refers to the lateral coordinates of the rear wheels velocities.
Unlike other mobile robots, the lateral velocities of the four wheel skid steering mobile
robot are generally nonzero, since its mechanical structure makes lateral skidding necessary
whenever the robot changes its orientation. Therefore, in order to complete the kinematic
model, the following nonholonomic constraint in Pfaffian form is introduced

$$\begin{bmatrix} -\sin\vartheta & \cos\vartheta & x_{ICR} \end{bmatrix}
\begin{bmatrix} \dot{x} \\ \dot{y} \\ \dot{\vartheta} \end{bmatrix} = A(q)\,\dot{q} = 0 \tag{7}$$

Then we have

$$\dot{q} = S(q)\,\eta \tag{8}$$

where

$$S(q) = \begin{bmatrix} \cos\vartheta & x_{ICR}\sin\vartheta \\ \sin\vartheta & -x_{ICR}\cos\vartheta \\ 0 & 1 \end{bmatrix} \tag{9}$$

$$\eta = \begin{bmatrix} \upsilon_x \\ \omega \end{bmatrix} \tag{10}$$

$S(q)$ is a full rank matrix, whose columns are in the null space of $A(q)$,

$$S^T(q)\,A^T(q) = 0 \tag{11}$$

It is noted that since $\dim(\eta) = 2 < \dim(q) = 3$, equation (8) describes the kinematics
of an underactuated robot with the nonholonomic constraint given by (7).

We suppose that the mobile robot localization is calculated via the following measurement
devices:
    • two high resolution optical quadrature shaft encoders mounted on reversible-DC motors,
      which provide the rotational speeds of the left and right wheels $\omega_L$ and
      $\omega_R$ respectively,
    • an Inertial Measurement Unit (IMU), which provides the forward linear acceleration and
      the angular velocity as well as the angle $\vartheta$ between the mobile robot axle and
      the x axis of the mobile robot,
    • a camera system, which calculates the pose of the robot by tracking a marker placed at
      the top side of it.
In this work we are only interested in abrupt faults which occur in the actuators of the
mobile robot and, as a consequence, we make the following assumptions.
    • Assumption 1: When the mobile robot starts functioning, all its components are in
      normal mode.
    • Assumption 2: The magnitude of the noise is assumed to be significantly smaller than
      the magnitude of the faults.
    • Assumption 3: Regarding the wheel radius, the following inequalities are satisfied:
      $R_R + \delta R_R > 0$ and $R_L + \delta R_L > 0$
According to the last assumption, faults that result in the complete loss of the wheel are
not considered.

3   Fault Detection and Isolation
Among the several techniques for generating residuals, only a limited number concern
nonlinear systems. One of them is structural analysis. Using this method we can extract
information about system components that we are not able to measure. We can also obtain the
parity equations that allow generating residuals.
The structure of the mobile robot is described using the following sets of constraints C and
variables V

$$C = \{c_1, c_2, \ldots, c_9\} \tag{12}$$

$$V = X \cup K \tag{13}$$

where X is the subset of unknown variables and K is the subset of known variables, which are
measurements and inputs.
The above subsets are

$$X = \{\dot{x}, \dot{y}, \dot{\vartheta}, \upsilon_x, \upsilon_y\} \tag{14}$$

$$K = \{x, y, \vartheta, \ddot{x}, \ddot{y}, \omega, \omega_L, \omega_R\} \tag{15}$$

The constraint set of the mobile robot is

$$c_1: \dot{x} = \cos\vartheta\,\upsilon_x - \sin\vartheta\,\upsilon_y \tag{16}$$

$$c_2: \dot{y} = \sin\vartheta\,\upsilon_x + \cos\vartheta\,\upsilon_y \tag{17}$$

$$c_3: \dot{\vartheta} = \omega \tag{18}$$

$$c_4: \ddot{x} = \frac{d\dot{x}}{dt} \tag{19}$$






                                                 dy                                                                         t
                              c5 : 
                                   y=                                          (20)                                r2= ϑ − ∫ ω dt                    (32)
                                                 dt                                                                          0
                                             t
                           c6 : ϑ = ∫ ω dt                                     (21)                                              dx
                                             0
                                                                                                                =r3     ∫ xdt − dt                 (33)

                                           r                                                                                     dy
                        υx
                    c7 :=                    (ωR + ωL )                        (22)                             =r4     ∫ ydt − dt                 (34)
                                           2
                                                 dx
                              c8 : x =                                        (23)         4   Fault Accommodation
                                                 dt
                                                                                            Fault accommodation is the phase that follows the fault
                                                 dy                                         diagnosis. One of the most important issues to consider for
                              c9 : y =                                        (24)
                                                 dt                                         the design of fault tolerant control is relative to the per-
                                                                                            formance and functionality of the system under considera-
Through the above technique we create the following inci-                                   tion. More specific it should take into consideration, the
dence matrix that describes the robot structure, Table 1.                                   degree of performance degradation that is acceptable.
                   Table 1. Incidence Matrix                                                There are two aspects of system performance, dynamic and
                                                                                            steady state. In our approach we take into account the sec-
                   KNOWN                                        UNKNOWN                     ond one. We also use the aforementioned fault diagnosis
                                                                                υy
                                                                                            method to monitor the system. The goal is to have the nec-
      x   y    θ    x      y       ω    ωL ωR          x   y   ϑ   υx
                                                                                            essary information about the fault occurrence for timely
 c1            1                                           1              1      1          counteraction. Figure 3 shows the overall structure of the
 c2                                                             1                           proposed fault tolerant mechanism. It consists of two parts:
               1                                                          1      1
                                                                                            i) the fault detection module which accepts as inputs the
 c3                                    1                             1                      measurement of the linear and angular velocity of the
 c4                 1                                      1                                SSMR and decides about the type of fault according to the
                                                                                            method described in Section 3, and ii) the fault accommo-
 c5                         1                                   1                           dation module which accepts as inputs the type of fault as
 c6            1                                                     1                      well as the measurement of the linear and angular velocity
 c7                                          1         1                  1                 and recalculates accordingly the command inputs in order
                                                                                            to compensate for the fault.
 c8   1                                                    1
 c9       1                                                     1


Applying matching algorithm [1] to the incidence matrix,
we take out the following matched M and unmatched U
constrains
                    M = {c1 , c2 , c3 , c4 , c7 }                              (25)

                          U = {c5 , c6 , c8 , c9 }                             (26)

In order to have residual generators we use the following
parity equations
                              c5 ( y , 
                                        y) = 0                                 (27)                 Figure 3. Fault Tolerance System Architecture.

                                   (
                              c6 ϑ , ϑ = 0  )                                 (28)
                                                                                            When a fault occurs the appropriate action is undertaken
                              c8 ( x, x ) = 0                                 (29)         (e.g. maintenance, repair, reconfiguration, stop operation)
                                                                                            in such a way to prevent system failures. In that level the
                              c9 ( y, y ) = 0                                 (30)         performance degradation that is acceptable is relative to
                                                                                            the minimum requirements that ensure the system func-
By starting from the unknown variables through backtrack-                                   tionality. There is always the case that the malfunction
ing to known variables, the residuals are:                                                  may cause hazard for the process or the environment, and a
                                                                                            decision for stopping the operation is unavoidable.
                          d        r                                                       In this work, we propose a fault accommodation technique
              r1 =
                  y−          sin ϑ (ωR + ωL ) +                                           which is employed when either the left or the right set of
                          dt       2
                                                                                            tires becomes flat during the operation of a SSMR. It is
                            r                                               (31)
                       cos ϑ (ωR + ωL ) − ∫ 
                                             xdt                                          obvious that when a flat tire fault occurs, the total nominal
              + cos ϑ                           
                             2
                                                                                            radius RNOM (rim and tire) of the fault wheel changes to
                                sin ϑ              
                                                                                           RF , where RF < RNOM . The proposed fault accommoda-
                                                   




                                                                                      180
                           Proceedings of the 26th International Workshop on Principles of Diagnosis


tion strategy relays on the online estimation of the new                  3. Update the estimate Rˆ Fk and the covariance Pk of the
radius RF , in order to correct the commanded rotational
                                                                                                           estimation error sequentially according to:
speeds of the faulty wheel and compensate for the fault
                                                                       K k Pk −1 H kT ( H k Pk −1 H kT + Rk )
which otherwise will inevitable lead the vehicle to diverge                                                                                                       −1
                                                                      =
from its nominal course.
As explained in [11], the kinematic model of the SSMR
can be consider equivalent with the unicycle differential
                                                                                                                      Rˆ Fk =               (
                                                                                                                             Rˆ Fk −1 + K k ωk − H k Rˆ Fk −1     )        (39)

drive one, mainly due to the existence of a single motor                                                              P=
                                                                                                                       k    ( I − K k H k ) Pk −1
drive and a transmission belt for each set of wheels (left
and right), which impose the same rotational speed for                       where ωk is the actual measurement of the body an-
each set of wheels. According to this assumption we can                      gular velocity as delivered by the IMU sensor.
safely assume that:                                                       4. Using the estimated wheel radius Rˆ Fk we correct the
                    u x  1  c c  u L                                                                 commanded wheel angular velocity as follows:
                     ω  = 2c  −1 1  u              (35)
                                    R                                                                                              −2ωk c + ωR RR
                                                                                                                           ωL _ cor =                                      (40)
 =
where   υ L ω=  L RL , υ R ωR RR are the equivalent linear                                                                                    Rˆ  Fk

velocities of the left and right wheels respectively in rela-
                                                                                                                                         2ωk c + ωL RL
tion to the rotational speeds and radii. If we consider that                                                                ωR _ cor =                                     (41)
the fault will occur only at the one set of the wheels (left or                                                                               Rˆ  Fk
right), we may consider only the angular velocity equation
for the accommodation. Thus, only the angular velocity                in case there is a left or a right wheel fault respectively.
measurement is needed. The fault accommodation is based on the online estimation of the new radius $R_F$ by means of a Recursive Least Squares (RLS) algorithm. More specifically, we consider the following linear equation for the measurement of the mobile robot's body angular velocity in case a left side fault occurs:

$$\hat{\omega}_k - \frac{0.5}{c}\,\omega_R R_R = H_k \hat{R}_{F_k} + v_k, \qquad H_k = H_{L_k} = -\frac{0.5}{c}\,\omega_L, \qquad v_k \sim (0, R_k) \qquad (36)$$

while in the case of a right side fault:

$$\hat{\omega}_k + \frac{0.5}{c}\,\omega_L R_L = H_k \hat{R}_{F_k} + v_k, \qquad H_k = H_{R_k} = \frac{0.5}{c}\,\omega_R, \qquad v_k \sim (0, R_k) \qquad (37)$$

Having defined the measurement model of the robot angular velocity in the body frame, we proceed to the online estimation of the faulty wheel radius employing the following Recursive Least Squares approximation algorithm:

 1. Initialize the estimator:

    $$\hat{R}_{F_0} = E(R_F) = R_{L|R}, \qquad P_0 = E\big((R_F - \hat{R}_{F_0})^2\big) \qquad (38)$$

    where $R_{L|R}$ is the nominal radius of the left or right wheel set.

 2. Obtain a new measurement $\omega_k$, assuming that it is given by equation (36) or (37).

5 Application Results
The proposed method has been implemented and tested experimentally on a Pioneer 3-AT mobile robot. All experiments have been performed indoors. We consider a faulty situation where the right wheel set (forward and backward wheels) is flat. We apply a command of $\omega_L = \omega_R = 5$ rad/s to both sets of wheels. In the nominal situation (no faults) the robot should move (almost) straight forward without any deviation. The robot starts from the origin of the inertial frame and moves for 2.5 m. The time interval dt between successive IMU measurements is 2.5 ms. The nominal radius of the wheels (proper inflation) is $R_L = R_R = 0.115$ m.
In the first experiment (Fig. 4), the fault accommodation algorithm is not enabled and, as we can observe from the trajectory of the vehicle, the SSMR significantly diverges from its nominal course to the right.

Figure 4. Robot’s position while the right wheel set is flat. (Plot: SSMR position along the Y-axis (m) versus SSMR position along the X-axis (m).)

The fault detection algorithm is enabled and, as we can see from Fig. 5, the fault was successfully detected by the proposed structural analysis algorithm.
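The recursive least squares estimation of the faulty wheel radius from the measurement models (36) and (37) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the gain and variance updates use the textbook scalar RLS form, which is assumed here because the update step is not reproduced in this passage; the geometric parameter c = 0.2 m, the noise variance, the initial variance p0 and the simulated faulty radius of 0.09 m are made-up values; and `compensated_speed` is a hypothetical accommodation rule that simply rescales the wheel command so that the wheel's linear speed is preserved.

```python
import random


def rls_update(r_hat, p, y, h, meas_var):
    """One scalar RLS step for the model y = h * R_F + v, Var(v) = meas_var."""
    gain = p * h / (h * p * h + meas_var)   # estimator gain
    r_hat = r_hat + gain * (y - h * r_hat)  # correct the estimate with the innovation
    p = (1.0 - gain * h) * p                # updated estimation-error variance
    return r_hat, p


def estimate_faulty_radius(omega_meas, omega_l, omega_r, r_nominal, c,
                           side="right", p0=1e-2, meas_var=1e-4):
    """Run RLS over body angular-velocity measurements; return the radius estimate."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    r_hat, p = r_nominal, p0                # step 1: initialise as in (38)
    for omega_k in omega_meas:              # step 2: process each new measurement
        if side == "right":                 # measurement model (37)
            y = omega_k + 0.5 / c * omega_l * r_nominal
            h = 0.5 / c * omega_r
        else:                               # measurement model (36)
            y = omega_k - 0.5 / c * omega_r * r_nominal
            h = -0.5 / c * omega_l
        r_hat, p = rls_update(r_hat, p, y, h, meas_var)
    return r_hat


def compensated_speed(omega_cmd, r_nominal, r_hat):
    """Hypothetical accommodation rule: keep the wheel's linear speed unchanged."""
    return omega_cmd * r_nominal / r_hat


if __name__ == "__main__":
    random.seed(0)
    c, r_nom, r_true = 0.2, 0.115, 0.09     # c and r_true are made-up values
    omega_l = omega_r = 5.0                 # commanded wheel speeds (rad/s)
    # Simulated body angular velocity with a flat right wheel set, plus sensor noise.
    omega_body = 0.5 / c * (omega_r * r_true - omega_l * r_nom)
    noisy = [omega_body + random.gauss(0.0, 0.01) for _ in range(400)]
    r_est = estimate_faulty_radius(noisy, omega_l, omega_r, r_nom, c, side="right")
    print(f"estimated radius: {r_est:.4f} m")
    print(f"compensated right-wheel speed: {compensated_speed(omega_r, r_nom, r_est):.2f} rad/s")
```

With equal wheel commands and a smaller right radius, the simulated body angular velocity is negative, which matches the drift to the right reported for the first experiment.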






Figure 5. Fault signal as the right wheel set is flat.

In the second experiment we impose the same control inputs on the SSMR ($\omega_L = \omega_R = 5$ rad/s), but this time not only the fault detection but also the proposed fault accommodation algorithm is enabled. As we can see in Fig. 6, the online estimation algorithm quickly converges to the new radius of the faulty wheel set and, consequently, the fault accommodation algorithm provides modified inputs to the right wheel set (Fig. 7).

Figure 6. On line estimation of faulty radius.

Figure 7. Recalculated input from the fault accommodation algorithm.

As we can observe from Fig. 8, the trajectory of the SSMR was successfully kept in an almost straight line.

Figure 8. SSMR Corrected Planar Trajectory.

6 Conclusion
Fault tolerant control for a 4-wheel skid steering mobile robot is an important problem, since the appearance of faults is inevitable in such systems. The most significant challenge arises from the complexity of the system. In this paper we have introduced the underlying concepts of our approach to fault tolerant control for mobile robots, focusing mainly on control reconfiguration. Concerning fault diagnosis, a structural analysis based technique is used to generate residuals. We use the kinematic model of the mobile robot to develop the structural model of the system. This technique provides the parity equations, which can be used as residual generators, since the model based fault diagnosis approach is based on residuals. The advantage of this method is that it offers a feasible solution to residual generation for nonlinear systems. The fault accommodation procedure targets the case where one of the two wheel tire sets becomes flat. The proposed accommodation method is based on an RLS approximation of the new faulty wheel radius; using this information, a new control input is calculated in order to compensate for the fault.
The efficacy of the proposed method is demonstrated through an extensive experimental procedure using a Pioneer 3-AT mobile robot.

Acknowledgments
This research is implemented through the Operational Program "Education and Lifelong Learning" and is co-financed by the European Union (European Social Fund) and Greek national funds. The work is part of the research project entitled «DIAGNOR - Fault Diagnosis and Accommodation for Wheeled Mobile Robot» of the Act "Archimedes III - Strengthening Research Groups in TEI Lamia".






References
[1] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, “Diagnosis and Fault Tolerant Control”, Heidelberg, Springer-Verlag, 2003.
[2] M. Blanke, H. Niemann, and T. Lorentzen, “Structural analysis – a case study of the Rømer satellite”, in Proc. of IFAC Safeprocess 2003, Washington, DC, USA, 2003.
[3] M. Blanke, V. Cocquempot, R. I. Zamanabadi, and M. Staroswiecki, “Residual generation for the ship benchmark using structural approach”, in Proc. of Int. Conference on CONTROL’98, Swansea, UK, Sep 1998.
[4] Steven X. Ding, “Model-based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools”, Springer-Verlag Berlin, 2008.
[5] G.K. Fourlas, K.J. Kyriakopoulos, N.J. Krikelis, “Fault Diagnosis of Hybrid Systems”, Proceedings of the 2005 IEEE International Symposium on Intelligent Control, 13th IEEE Mediterranean Conference on Control and Automation, Limassol, Cyprus, 2005.
[6] G.K. Fourlas, “Theoretical Approach of Model Based Fault Diagnosis for a 4-Wheel Skid Steering Mobile Robot”, 21st IEEE Mediterranean Conference on Control and Automation (MED '13), Platanias-Chania, Crete, Greece, June 25-28, 2013.
[7] G. Ippoliti, S. Longhi, A. Monteriù, “Model-based sensor fault detection system for a smart wheelchair”, IFAC 2005.
[8] R. Isermann, “Fault Diagnosis Systems – An Introduction from Fault Detection to Fault Tolerance”, Springer-Verlag Berlin Heidelberg, 2006.
[9] B. Halder and N. Sarkar, “Robust Fault Detection of Robotic Systems: New Results and Experiments”, Proceedings of the 2006 IEEE International Conference on Robotics and Automation, Orlando, Florida, May 2006.
[10] Z. Kira, “Modeling Cross-Sensory and Sensorimotor Correlations to Detect and Localize Faults in Mobile Robots”, Proceedings of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, Oct 29 - Nov 2, 2007.
[11] K. Kozlowski and D. Pazderski, “Modeling and Control of a 4-Wheel Skid-Steering Mobile Robot”, Int. J. Appl. Math. Comput. Sci., 2004, vol. 14, no. 4, 477-496.
[12] Y. Morales, E. Takeuchi and T. Tsubouchi, “Vehicle Localization in Outdoor Woodland Environments with sensor fault detection”, 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, May 19-23, 2008.
[13] A. Monteriù, P. Asthan, K. Valavanis and S. Longhi, “Model-Based Sensor Fault Detection and Isolation System for Unmanned Ground Vehicles: Theoretical Aspects (part I)”, 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10-14 April 2007.
[14] A. Monteriù, P. Asthan, K. Valavanis and S. Longhi, “Model-Based Sensor Fault Detection and Isolation System for Unmanned Ground Vehicles: Experimental Validation (part II)”, 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, 10-14 April 2007.
[15] P. Sundvall and P. Jensfelt, “Fault detection for mobile robots using redundant positioning systems”, Proceedings of the 2006 IEEE International Conference on Robotics and Automation, Orlando, Florida, May 2006.
[16] C. Valdivieso and A. Cipriano, “Fault Detection and Isolation System Design for Omnidirectional Soccer-Playing Robots”, Proceedings of the 2006 IEEE Conference on Computer Aided Control Systems Design, Munich, Germany, October 4-6, 2006.
[17] R. Izadi-Zamanabadi, “Structural Analysis Approach to Fault Diagnosis with Application to Fixed-wing Aircraft Motion”, in Proc. of American Control Conference, USA, 2002.
[18] D. Zhuo-hua, CAI Zi-xing, YU Jin-xia, “Fault Diagnosis and Fault Tolerant Control for Wheeled Mobile Robots under Unknown Environments: A Survey”, Proceedings of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, April 2005.
[19] S. Zaman, G. Steinbauer, J. Maurer, P. Lepej, and S. Uran, “An integrated model-based diagnosis and repair architecture for ROS-based robot systems”, IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 2013.
[20] D. Portugal and Rui P. Rocha, “Scalable, Fault-Tolerant and Distributed Multi-Robot Patrol in Real World Environments”, 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), November 3-7, Tokyo, Japan, 2013.








               Data-Driven Monitoring of Cyber-Physical Systems
   Leveraging on Big Data and the Internet-of-Things for Diagnosis and Control
          Oliver Niggemann (1,3), Gautam Biswas (2), John S. Kinnebrew (2), Hamed Khorasgani (2),
                                 Sören Volgmann (1) and Andreas Bunte (3)
          (1) Fraunhofer Application Center Industrial Automation, Lemgo, Germany
              e-mail: {oliver.niggemann, soeren.volgmann}@iosb-ina.fraunhofer.de
          (2) Vanderbilt University and Institute for Software Integrated Systems, Nashville, TN, USA
              e-mail: {john.s.kinnebrew, hamed.g.khorasgani, gautam.biswas}@vanderbilt.edu
          (3) Institute Industrial IT, Lemgo, Germany
              e-mail: {andreas.bunte}@hs-owl.de
Abstract
The majority of projects dealing with monitoring and diagnosis of Cyber Physical Systems (CPSs) rely on models created by human experts. But these models are rarely available, are hard to verify and to maintain, and are often incomplete. Data-driven approaches are a promising alternative: they leverage the large amount of data which is collected nowadays in CPSs; this data is then used to learn the necessary models automatically. For this, several challenges have to be tackled, such as real-time data acquisition and storage solutions, data analysis and machine learning algorithms, task-specific human-machine interfaces (HMI) and feedback/control mechanisms. In this paper, we propose a cognitive reference architecture which addresses these challenges. This reference architecture should both ease the reuse of algorithms and support scientific discussions by providing a comparison schema. Use cases from different industries are outlined and support the correctness of the architecture.

1 Motivation
The increasing complexity and the distributed nature of technical systems (e.g. power generation plants, manufacturing processes, aircraft and automobiles) have provided traction for important research agendas, such as Cyber Physical Systems (CPSs) [1; 2], the US initiative on the “Industrial Internet” [3] and its German counterpart “Industrie 4.0” [4]. In these agendas, a major focus is on self-monitoring, self-diagnosis and adaptivity to maintain both operability and safety, while also taking into account humans-in-the-loop for system operation and decision making. Typical goals of such self-diagnosis approaches are the detection and isolation of faults and anomalies, identifying and analyzing the effects of degradation and wear, providing fault-adaptive control, and optimizing energy consumption [5; 6].
   So far, the majority of projects and papers on analysis and diagnosis have relied on manually created diagnosis models of the system’s physics and operations [6; 7; 8]: if a drive is used, this drive is modeled; if a reactor is installed, the associated chemical and physical processes are modeled. However, the last 20 years have clearly shown that such models are rarely available for complex CPSs; when they do exist, they are often incomplete and sometimes inaccurate, and it is hard to maintain the effectiveness of these models during a system’s life-cycle.
   A promising alternative is the use of data-driven approaches, where monitoring and diagnosis knowledge can be learned by observing and analyzing system behavior. Such approaches have only recently become possible: CPSs now collect and communicate large amounts of data (see Big Data [9]) via standardized interfaces, giving rise to what is now called the Internet of Things [10]. This large amount of data can be exploited for the purpose of detecting and analyzing anomalous situations and faults in these large systems: the vision is to develop CPSs that can observe their own behavior, recognize unusual situations during operations, inform experts, who can then update operations procedures, and also inform operators, who use this information to modify operations or plan for repair and maintenance.
   In this paper, we take on the challenge of proposing a common data-driven framework to support monitoring, anomaly detection, prognosis (degradation modeling), diagnosis, and control. We discuss the challenges for developing such a framework, and then discuss case studies that demonstrate some initial steps toward data-driven CPSs.

2 Challenges
In order to implement data-driven solutions for the monitoring, diagnosis, and control of CPSs, a variety of challenges must be overcome to enable the learning pathways illustrated in Figure 1:
Data Acquisition: All data collected from distributed CPSs, e.g. from sensors, actuators, software logs, and business data, must be acquired under real-time requirements and include time synchronization and spatial labeling when relevant. Often sensors and actuators operate at different rates, so data alignment, especially for high-velocity data, becomes an issue. Furthermore, data must be annotated semantically to allow for a later data analysis.
Data Storage, Curation, and Preprocessing: Data will be stored and preprocessed in a distributed way. Environmental factors and the actual system configuration (e.g., for the current product in a production system) must also be stored. Depending on the applications, a relational database format, or increasingly distributed noSQL technologies [11], may






need to be adopted, so that the right subsets of data may be retrieved for different analyses. Real-world data can also be noisy, partially corrupted, and have missing values. All of these need to be accommodated in the curation, storage, and pre-processing applications.

Figure 1: Challenges for the analysis of CPSs. (Diagram elements: Cyber Physical System with controllers and network, Data Acquisition, Distributed Data Storage, Machine Learning, Abstracted Diagnosis Knowledge, Usage and Editing of Knowledge, Task-specific Human-Machine-Interface for Condition Monitoring, Diagnosis and Energy Analysis, Feedback mechanisms and control.)

Data Analysis and Machine Learning: Data must be analyzed to derive patterns and abstract the data into condensed usable knowledge. For example, machine learning algorithms can generate models of normal system behavior in order to detect anomalous patterns in the data [12]. Other algorithms can be employed to identify root causes of observed problems or anomalies. The choice and design of appropriate analyses and algorithms must consider factors like the ability to handle large volumes and sometimes high velocities of heterogeneous data. At a minimum, this generally requires machine learning, data mining, and other analysis algorithms that can be executed in parallel, e.g., using the Spark [13], Hadoop [14], and MapReduce [15] architectures. In some cases, this may be essential to meet real-time analysis requirements.
Task-specific Human-Machine-Interfaces: Tasks such as condition monitoring, energy management, predictive maintenance or diagnosis require specific user interfaces [16]. One set of interfaces may be more tailored for offline analysis to allow experts to interact with the system. For example, experts may employ information from data mining and analytics to derive new knowledge that is beneficial to the future operations of the system. Another set of interfaces would be appropriate for system operators and maintenance personnel. For example, appropriate operator interfaces would be tailored to provide analysis results in interpretable and actionable forms, so that the operators can use them to drive decisions when managing a current mission or task, as well as to determine future maintenance and repair.
Feedback Mechanisms and Control: As a reaction to recognized patterns in the data or to identified problems, the user may initiate actions such as a reconfiguration of the plant or an interruption of the production for the purpose of maintenance. In some cases, the system may react without user interactions; in this case, the user is only informed.

3 Solutions
As Section 4 will show, the challenges from Section 2 reappear in the majority of CPS examples. While details, such as the machine learning algorithms employed or the nature of data and data storage formats, can vary, the primary steps are about the same. Most CPS solutions re-implement all of these steps and even employ different solution strategies, raising the overall effort, preventing any reuse of hardware/software and impeding a comparison between solutions.
   To achieve better standardization, efficiency, and repeatability, we suggest a generic cognitive reference architecture for the analysis of CPSs. Please note that this architecture is a pure reference architecture which does not constrain later implementations and the introduction of application-specific methods.
   Figure 2 shows its main components:

Figure 2: A cognitive architecture as a solution for the analysis of CPSs. (Diagram elements, bottom to top: Cyber Physical System with controllers and network, Big Data Platform, Learning and Adaptation, Conceptual Layer, Task-Specific HMI, User; connected by interfaces I/F 1 to I/F 7, with a System Analysis path and a System Synthesis path.)

Big Data Platform (I/F 1 & 2): This layer receives all relevant system data, e.g., configuration information as well as raw data from sensors and actuators. This is done by means of domain-dependent, often proprietary interfaces, here called interface 1 (I/F 1). This layer then integrates, often in real-time, all of the data, time-synchronizes them and annotates them with meta-data that will support later analysis and interpretation. For example, sensor meta-data may consist of the sensor type, its position in the system and its precision. This data is provided via I/F 2, which, therefore, must comprise the data itself and also the meta-data (i.e., the semantics). A possible implementation approach for I/F 2 may be the mapping onto existing Big Data platforms, such as Spark or Hadoop, for storing the data and the Data Distribution Service (DDS) for acquiring the data (and meta-data).
Learning Algorithms (I/F 2 & 3): This layer receives all






data via I/F 2. Since I/F 2 also comprises meta-data, the ma-      4 Case Studies
chine learning and diagnosis algorithms need not be imple-         We present a set of case studies that cover the manufacturing
mented specifically for a domain but may adapt themselves          and process industries, as well as complex CPS systems,
to the data provided. In this layer, unusual patterns in the       such as aircraft.
data (used for anomaly detection), degradation effects (used
for condition monitoring) and system predictions (used for         4.1 Manufacturing Industry
predictive maintenance) are computed and provided via I/F          The modeling and learning of discrete timing behavior for
3. Given the rapid changes in data analysis needs and capa-        manufacturing industry (e.g., automotive industry) is a new
bilities, this layer may be a toolbox of algorithms where new      field of research. Due to the intuitive interpretation, Timed
algorithms can be added by means of plug-and-play mecha-           Automata are well-suited to model the timing behavior of
nisms. I/F 3 might again be implemented using DDS.                 these systems. Several algorithms have been introduced to
Conceptual Layer (I/F 3 & 4): The information provided             learn such Timed Automata, e.g. RTI+ [17] and BUTLA
by I/F 3 must be interpreted according to the current task         [18]. Please note that the expert still has to provide struc-
at hand, e.g. computing the health state of the system.
                                                                   tural information about the system (e.g. asynchronous sub-
Therefore, the provided information about unusual patterns,        systems) and that only the temporal behavior is learned.
degradation effects and predictions are combined with do-
main knowledge to identify faults, their causes and rate them
according to the urgency of repair. A semantic notation will       [Figure 3, first automaton: states 0 to 4; transitions:
be added to the information, e.g. the time for next main-          Aspirator on [25…2500], Muscle on [8…34], Silo empty
tenance or a repair instruction, which will be provided at         [8…34], Muscle off [8…34], Muscle off [7…35], Aspirator
I/F 4 in a human understandable manner. From a computer            off [2200…2500]]
science perspective, this layer provides reasoning capabili-
ties on a symbolic or conceptual level and adds a semantic
context to the results.
Task-Specific HMI (I/F 4 & 5): The user is in the center
of the architecture presented here, and, therefore, requires
task-, context- and role-specific Human-Machine-Interfaces
(HMIs). This HMI uses I/F 4 to get all needed analysis
results and presents them to the user. Adaptive interfaces,
rather than always showing the results of the same set of
analyses, could allow a wider range of information to be           [Figure 3, second automaton: states 0 to 4; transitions:
provided, while maintaining efficiency and preventing in-          Silo empty [8…3400], Conveyor off [8…25], Silo full [1000…34000]]
formation overload. Beyond obvious dynamic capabilities
like alerts for detected problems or anomalies, the interfaces
could further adapt the information displayed to be more            Figure 3: Learned Timed Automata for a manufacturing plant.
relevant to the current user context (e.g. the user’s loca-
tion within a production plant, recognition of tasks the user         The data acquisition for this solution (I/F 1 in Figure 2)
may be engaged in, observed patterns of the user’s previous        has been implemented using a direct capturing of Profinet
information-seeking behavior, and knowledge of the user’s          signals including an IEEE 1588 time-synchronization. The
technical background). If the user decides to influence the        data is offered via OPC UA (I/F 2). On the learning layer,
system (e.g. shutdown of a subsystem or adaptation of the          timed automata are learned from historical data and com-
system behavior), I/F 5 is used to communicate this deci-          pared to the observed behavior. Also, the sequential behav-
sion to the conceptual layer. Again, I/F 4 and I/F 5 might be      ior of the observed events as well as the timing behavior
implemented using DDS.                                             is checked; anomalies are signaled via I/F 3. On the con-
Conceptual Layer (I/F 5 & 6): The user decisions will be           ceptual layer it is decided whether an anomaly is relevant.
received via I/F 5. The conceptual layer will use the knowl-       Finally, a graphical user interface is connected to the con-
edge to identify actions which are needed to carry out the         ceptual layer via OPC UA (I/F 4).
users’ decisions. For example, a decision to decrease the             Figure 3 shows learned automata for a manufacturing
machine’s cycle time by 10 % could lead to actions such as         plant: The models correspond to modules of the plant; tran-
decreasing the robot speed by 10 % and the conveyor speed          sitions are triggered by control signals and are annotated
by 5 % or the decision to shutdown a subsystem. These ac-          with a learned timing interval.
tions are communicated via I/F 6 to the adaption layer.
Adaption (I/F 6 & 7): This layer receives system adaption          4.2 Energy Analysis In Process Industry
commands on the conceptual level via I/F 6—which again             Analyzing the energy consumption in production plants has
might be based on DDS. Examples are the decrease of robot          some special challenges: Unlike the discrete systems de-
speed by 10 % or a shutdown of a subsystem. The adap-              scribed in Section 4.1, also continuous signals such as the
tion layer takes these commands on the conceptual level            energy consumption must be learned and analyzed. But also
and computes, in real-time, the corresponding changes to           the discrete signals must be taken into consideration because
the control system. For example, a subsystem shutdown              continuous signals can only be interpreted with respect to
might require a specific network signal or a machine’s tim-        the current system’s status, e.g. it is crucial to know whether
ing is changed by adapting parameters of the control algo-         a valve is open or whether a robot is turned on. And the
rithms, again by means of network signals. I/F 7 therefore         system’s status is usually defined by the history of discrete
uses domain-dependent interfaces.                                  control signals.
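
The timing check from Section 4.1 (comparing observed events against a learned Timed Automaton) can be sketched roughly as follows. This is only an illustrative sketch, not the RTI+ or BUTLA implementation: the class names, the chain-shaped wiring of the states and the time unit are assumptions; only the event labels and intervals are borrowed from Figure 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    source: int
    event: str
    target: int
    t_min: float  # learned lower bound of the delay since the previous event
    t_max: float  # learned upper bound of that delay


class TimedAutomaton:
    """Hypothetical container for a learned automaton (not from the paper)."""

    def __init__(self, transitions, initial_state=0):
        self.initial_state = initial_state
        self._by_key = {(t.source, t.event): t for t in transitions}

    def check(self, observations):
        """Return (index, kind, event) for each anomaly in a list of (event, delay) pairs."""
        anomalies = []
        state = self.initial_state
        for index, (event, delay) in enumerate(observations):
            transition = self._by_key.get((state, event))
            if transition is None:
                # Sequential anomaly: the event is not allowed in this state.
                anomalies.append((index, "sequence", event))
                break  # the automaton cannot follow the trace any further
            if not transition.t_min <= delay <= transition.t_max:
                # Timing anomaly: the event is allowed but arrives too early or too late.
                anomalies.append((index, "timing", event))
            state = transition.target
        return anomalies


# Assumed wiring: a simple chain 0 -> 1 -> 2 -> 3 using labels from Figure 3.
plant_module = TimedAutomaton([
    Transition(0, "Aspirator on", 1, 25, 2500),
    Transition(1, "Muscle on", 2, 8, 34),
    Transition(2, "Silo empty", 3, 8, 34),
])

print(plant_module.check([("Aspirator on", 100), ("Muscle on", 50)]))
```

In such a sketch, the returned anomaly list would be what the learning layer hands to the conceptual layer via I/F 3, where it is decided whether an anomaly is relevant.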






                                                                   the production cycles. In Figure 6 the architecture of the big
                                                                   data platform is depicted.

                                                                   [Figure 6 diagram: Cyber-Physical System (controllers, network);
                                                                   Hadoop Ecosystem (HDFS, HBase, OpenTSDB); Grafana web visualisation]

                                                                       Figure 6: Data Analysis Platform in Manufacturing

                                                                      The CPS is connected through OPC UA (I/F 1 in Figure 2)
                                                                   with a Hadoop ecosystem. Hadoop itself is a software
                                                                   framework for scalable distributed computing. The process
                                                                   data is stored in a non-relational database (HBase) which is
                                                                   based on a distributed file-system (HDFS). On top of HBase,
                                                                   a time-series database, OpenTSDB, is used as an interface
                                                                   to explore and analyze the data (I/F 2 in Figure 2). Through
   Figure 4: A learned hybrid automaton modeling a pump.           this database it is possible to do simple statistics such as
                                                                   mean-values, sums or differences, which is usually not pos-
                                                                   sible within the non-relational data stores.
                                                                      Using the interfaces of OpenTSDB or Hadoop, it be-
                                                                   comes possible to analyze the data directly on the storage
                                                                   system. Hence, the volume of a historical dataset need not
                                                                   be loaded into a single computer system; instead, the algo-
                                                                   rithms can work distributively on the data. A web interface
                                                                   can be used to visualize the data as well as the computed re-
                                                                   sults. In Figure 6, Grafana is used for data visualization. In
                                                                   the SmartFactoryOWL this big data platform is currently be-
                                                                   ing connected to the application scenarios from Sections 4.1
                                                                   and 4.2.
   In [19], an energy anomaly detection system is de-
scribed which analyzes three production plants. EtherCAT
and Profinet are used for I/F 1 and OPC UA for I/F 2. The col-
lected data is then condensed on the learning layer into hy-
brid timed automata. Also on this layer, the current energy
consumption is compared to the energy prediction. Anoma-
lies in the continuous variables are signaled to the user via
mobile platforms using web services (I/F 3 and 4).
   In Figure 4, a pump is modeled by means of such au-
tomata using the flow rate and switching signals. The three
states S0 to S2 are separating the continuous function into
three linear pieces which can then be learned automatically.       4.4 Anomaly Detection in Aircraft Flight Data
   Figure 5 shows a typical learned energy consumption
(here for bulk good production).                                   Fault detection and isolation schemes are designed to detect
                                                                   the onset of adverse events during operations of complex
                                                                   systems, such as aircraft and industrial processes. In other
                                                                   work, we have discussed approaches using machine learn-
                                                                   ing classifier techniques to improve the diagnostic accuracy
                                                                   of the online reasoner on board of the aircraft [20]. In this
                                                                   paper, we discuss an anomaly detection method to find pre-
                                                                   viously undetected faults in aircraft systems [21].
                                                                      The flight data used for improving detection of existing
                                                                   faults and discovering new faults was provided by Honey-
                                                                   well Aerospace and recorded from a former regional airline
                                                                   that operated a fleet of 4-engine aircraft, primarily in the
                                                                   Midwest region of the United States. Each plane in the fleet
                                                                   flew approximately 5 flights a day and data from about 37
                                                                   aircraft was collected over a five year period. This produced
Figure 5: A measured (black line) and a learned power consump-     over 60,000 flights. Since the airline was a regional carrier,
tion (red line).                                                   most flight durations were between 30 and 90 minutes. For
                                                                   each flight, 182 features were recorded at sample rates that
                                                                   varied from 1Hz to 16Hz. Overall this produced about 0.7
4.3 Big Data Analysis in Manufacturing Systems                     TB of data.
Analyzing historical process data during the whole produc-            Situations may occur during flight operations, where the
tion cycle requires new architectures and platforms for han-       aircraft operates in previously unknown modes that could be
dling the enormous volume, variety and velocity of the data.       attributed to the equipment, the human operators, or envi-
Data analysis pushes the classical data acquisition and stor-      ronmental conditions (e.g., the weather). In such situations,
age up to its limits, i.e. big data platforms are needed.          data-driven anomaly detection methods [12], i.e., finding
   In the assembly line of the SmartFactoryOWL, a small            patterns in the operations data of the system that were not
factory used for production and research, a big data platform      expected before, can be applied. Sometimes, anomalies
is established to acquire, store and visualize the data from       may represent truly aberrant, undesirable and faulty behav-
                                                                   ior; however, in other situations they may represent behav-






iors that are just unexpected. We have developed unsuper-            4.5 Reliability and Fault Tolerant Control
vised learning or clustering methods for off-line detection          Most complex CPSs are safety-critical systems that operate
of anomalous situations. Once detected and analyzed, rele-           with humans-in-the-loop. In addition to equipment degrada-
vant information is presented to human experts and mission           tion and faults, humans can also introduce erroneous deci-
controllers to interpret and classify the anomalies.                 sions, which becomes a new source of failure in the system.
   Figure 7 illustrates our approach. We started with cu-            Figure 8 represents possible faults and cyber-attacks that can
rated raw flight data (layer ”Big Data Platform” in Figure           occur in a CPS.
2), transforming the time series data associated with the dif-          There are several model-based fault tolerant control
ferent flight parameters to a compressed vector form using           strategies for dynamic systems in the literature (see for ex-
wavelet transforms. The next step included building a dis-           ample [23] and [24]). Research has also been conducted to
similarity matrix of pairwise flight segments using the Eu-          address network security and robust network control prob-
clidean distance measure, followed by a subsequent step              lems (see for example [25] and [26]). However, these meth-
where the pairwise between-flight distances were used to             ods need mathematical models of the system, which may
run a ‘complete link’ hierarchical clustering algorithm [22]         not exist for large scale complex systems. Therefore, data
(layer ”Learning” in Figure 2). Run on the flight data, the          driven control [27] and data driven fault tolerant control [28]
algorithm produced a number of large clusters that we con-           have become an important research topic in recent years.
sidered to represent nominal flights, and a number of smaller        For CPSs, there are more aspects of the problem that need
clusters and outlier flights that we initially labeled as anoma-     to be considered. As it is shown in Figure 8, there are many
lous. By studying the feature value differences between the          sources of failure in these systems.
larger nominal and smaller anomalous clusters with the help             We propose a hybrid approach that uses an abstract model
of domain experts, we were able to interpret and explain the         of the complex system and utilizes the data to ensure the
anomalous nature (”Conceptual Layer” in Figure 2).                   compatibility between model and the complex system. Data
   These anomalies or faults represented situations that the         abstraction and machine learning techniques are employed
experts had not considered before; therefore, this unsuper-          to extract patterns between different control configurations
vised or semi-supervised data driven approach provided a             and system outputs unit by computing the correlation be-
mechanism for learning new knowledge about unanticipated             tween control signals and the physical subsystems outputs.
system behaviors. For example, when analyzing the aircraft           The highly correlated subsystems (layer ”Learning” in Fig-
data, we found a number of anomalous clusters. One of                ure 2) become candidates for further study of the effects of
them turned out to be situations where one of the four en-           failure and degradation at the boundary of these interacting
gines of the aircraft was inoperative. On further study of ad-       subsystems. For complex systems, all possible interactions
ditional features, the experts concluded that these were test        and their consequences are hard to pre-determine, and data-
flights conducted to test aspects of the aircraft and, there-        driven approaches help fill this gap in knowledge to support
fore, represented known situations rather than interesting           more informed decision-making and control. A case-based
anomalies. A second group of flights were in-                        reasoning module can be designed to provide input on past
terpreted to be take offs, where the engine power was set            successes and failed opportunities, which can then be trans-
much higher than most flights in the same take off condition.        lated by human experts into operational monitoring, fault di-
Further analysis of environmental features related to this           agnosis, and control situations (”Conceptual Layer” in Fig-
set of take-offs revealed that these were take-offs from a           ure 2). Some of the control paradigms that govern appro-
high altitude airport at 7900 feet above sea level.                  priate control configurations, such as modifying sequence
   A third cluster provided a more interesting situation. The        of mission tasks and switching between different objectives
experts when checking on the features that had significantly         or changing the controller parameters (layer ”Adaption” in
different values from the nominal flights realized that the          Figure 2) are being studied in a number of labs including
auto throttle disengaged in the middle of the aircraft’s climb       ours [29].
trajectory. The automatic throttle is designed to maintain              Example: Fault Tolerant Control of Fuel Transfer Sys-
either constant speed during takeoff or constant thrust for          tem. The fuel system supplies fuel to the aircraft engines.
other modes of flight. This was an unusual situation where           Each individual mission will have its own set of require-
the auto throttle switched from maintaining speed for a              ments. However, common requirements such as saving the
takeoff to a setting that applied constant thrust, implying          aircraft Center of Gravity (CG), safety, and system relia-
that the aircraft was on the verge of a stall. This situation        bility are always critical. A set of sensors is included in the
was verified by the flight path acceleration sensor shown in         system to measure different system variables such as the
Figure 7. By further analysis, the experts determined that in        fuel quantity contained in each tank, engine fuel flow rates,
such situations the automatic throttle would switch to a pos-        boost pump pressures and valve positions.
sibly lower thrust setting to level the aircraft and compensate         There are several failure modes such as the total loss or
for the loss in velocity. By examining the engine parame-            degradation in the electrical pumps or a leakage in the tanks
ters, the expert verified that all the engines responded in an       or connecting pipes in the system. Using the data and the ab-
appropriate fashion to this throttle command. Whereas this           stract model we can detect and isolate the fault and estimate
analysis did not lead to a definitive conclusion other than the      its parameters. Then, based on the fault type and its severity,
fact the auto throttle, and therefore, the aircraft equipment,       the system reconfiguration unit chooses the proper control
responded correctly, the expert determined that further anal-        scenario from the control library. For example, in a normal sit-
ysis was required to answer the question “why did the air-           uation the transfer pumps and valves are controlled to main-
craft accelerate in such a fashion and come so close to a            tain a transfer sequence to keep the aircraft center of gravity
stall condition?”. One initial hypothesis to explain these           within limits. This control includes maintaining a balance
situations was pilot error.                                          between the left and right sides of the aircraft. When there






                        [Figure 7 diagram labels: Raw Flight Data; Wavelet Transform; Flight Dissimilarity Matrix (Dii, Dij, …, Din); Hierarchical Clustering; Anomalous; max(dAN); Current Flight]

                                         Figure 7: Data Driven Anomaly Detection Approach for Aircraft Flights
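
The pipeline of Figure 7 (wavelet compression, pairwise Euclidean dissimilarity matrix, complete-link hierarchical clustering) might be sketched as below. This is a minimal sketch under stated assumptions and not the authors' implementation: the text does not name the wavelet family, so a Haar approximation is assumed; the function names, the merge threshold, the minimum cluster size and the toy flight series are all hypothetical.

```python
import math


def haar_approximation(signal, levels):
    """Compress a series by keeping only Haar approximation coefficients (assumed wavelet)."""
    coeffs = list(signal)
    for _ in range(levels):
        if len(coeffs) < 2:
            break
        # Scaled pairwise sums; a trailing unpaired sample is dropped.
        coeffs = [(coeffs[i] + coeffs[i + 1]) / math.sqrt(2)
                  for i in range(0, len(coeffs) - 1, 2)]
    return coeffs


def dissimilarity_matrix(vectors):
    """Pairwise Euclidean distances between equally long compressed vectors."""
    n = len(vectors)
    return [[math.dist(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


def complete_link_clusters(dist, threshold):
    """Agglomerative clustering; merge while the complete-link distance <= threshold."""
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete link: distance of the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def flag_small_clusters(clusters, min_size):
    """Indices of flights in clusters smaller than min_size (candidate anomalies)."""
    return sorted(i for c in clusters if len(c) < min_size for i in c)


# Toy data (hypothetical): three similar flights and one that deviates.
flights = [
    [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    [1.1, 0.9, 1.0, 1.0, 1.1, 0.9, 1.0, 1.0],
    [0.9, 1.1, 1.0, 1.0, 0.9, 1.1, 1.0, 1.0],
    [1.0, 1.0, 5.0, 5.0, 5.0, 5.0, 1.0, 1.0],
]
vectors = [haar_approximation(f, levels=1) for f in flights]
clusters = complete_link_clusters(dissimilarity_matrix(vectors), threshold=1.0)
anomalous = flag_small_clusters(clusters, min_size=2)
print(clusters, anomalous)
```

As in the text, flights flagged this way are only candidates: whether a small cluster is a fault or merely an unexpected but known situation is left to the domain experts.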


                                                                                                                      adapted frequently.
                                                                                                                         In Sections 4.1 and 4.2, structural information about the
                                                                                                                      plant is imported from the engineering chain and the tempo-
                                                                                                                      ral behavior is learned in form of timed automata. In Section
                                                                                                                      4.5, an abstract system model describing the input/output
                                                                                                                      structure and the main failure types is provided and again the
                                                                                                                      behavior is learned. These approaches are typical because in
                                                                                                                      most applications structural information can be gained from
                                                                                                                      earlier engineering phases while behavior models hardly exist
                                                                                                                      and are almost never validated with the real system.
                                                                                                                         Looking at the learning phase, the first thing to notice
                                                                                                                      is that all described approaches work and deliver good re-
[Figure 8 diagram, box "Cyber Physical System": Operator, Controller Parameters, Controller,                          sults: For CPSs, data-driven approaches have moved into
Controller Library, System Reconfiguration, Actuators, Physical System, Sensors, Communication network.               the focus of research and industry. And they are well suited
Fault sources shown: Human error, Cyber-attack, Actuator faults, System and actuator faults,                          for CPSs: They adjust automatically to new system config-
Sensor fault, Communication error and noise]                                                                          urations, they do not need manual engineering efforts and
                  Figure 8: Possible faults in a CPS.                                                                 they make use of the now available large number of data
                                                                                                                      signals—connectivity being a typical feature of CPSs.
                                                                                                                         Another common denominator of the described appli-
is a small leak, normally the system can tolerate it depend-                                                          cation examples is that the focus is on anomaly detec-
ing on where the leak is, but the leak usually grows over                                                             tion rather than on root cause analysis: for data-driven ap-
time. Therefore we need to estimate the leakage rate and re-                                                          proaches it is easier to learn a model of the normal behav-
configure the system to move the fuel from the tank or close                                                          ior than learning erroneous behavior. And it is also typi-
the pipe before a critical situation arises.                                                                          cal that the only root cause analysis uses a case-based ap-
                                                                                                                      proach (Section 4.5), case-based approaches being suitable
5 Conclusions                                                                                                         for data-driven solutions to diagnosis.
Data-driven approaches to the analysis and diagnosis of                                                                  Finally, the examples show that the proposed cognitive
Cyber-Physical Systems (CPSs) are always inferior to clas-                                                            architecture (Figure 2) matches the given examples:
sical model-based approaches, where models are created                                                                Big Data Platform: Only a few examples (e.g. Section 4.3)
manually by experts: Experts have background knowledge                                                                make usage of explicit big data platforms, so-far solutions
which can not be learned from models and experts automat-                                                             often use proprietary solutions. But with the growing size of
ically think about a larger set of system scenarios than can                                                          the data involved, new platforms for storing and processing
be observed during a system’s normal lifetime.                                                                        the data are needed.
   So the question is not whether data-driven or expert-                                                              Learning:        All examples employ machine learning
driven approaches are superior. The question is rather                                                                technologies—with a clear focus on unsupervised learning
which kind of models can we realistically expect to ex-                                                               techniques which require no a-priori knowledge such as
ist in real-world applications—and which kind of models                                                               clustering (Section 4.4) or automata identification (Sections
must therefore be learned automatically. This becomes es-                                                             4.1, 4.2).
pecially important in the context of CPSs since these sys-                                                            Conceptual Layer: In all examples, the learned models are
tems adapt themselves to their environment and show there-                                                            evaluated on a conceptual or symbolic level: In Section 4.4,
fore a changing behavior, i.e. models would also have be                                                              clusters are compared to new observations and data-cluster




distances are used for decision making. In Sections 4.1 and 4.2, model predictions are compared to observations. And again, deviations are decided on by a conceptual layer.
Task-Specific HMI: None of the given examples works completely automatically; in all cases the user is involved in the decision making.
Adaption: In most cases, reactions to detected problems are up to the expert. The use case from Section 4.5 is an example of an automatic reaction and of the use of analysis results for the control mechanism.

   Using such a cognitive architecture would bring several benefits to the community: First of all, algorithms and technologies in the different layers can be changed quickly and can be re-used. For example, learning algorithms from one application field can be put on top of different big data platforms. Furthermore, most existing approaches currently mix the different layers, making the comparison of approaches to the analysis of CPSs difficult. Finally, such an architecture helps to clearly identify open issues for the development of smart monitoring systems.

   Acknowledgments The work was partly supported by the German Federal Ministry of Education and Research (BMBF) under the project "Semantics4Automation" (funding code: 03FH020I3), under the project "Analyse großer Datenmengen in Verarbeitungsprozessen (AGATA)" (funding code: 01IS14008 A-F) and by NASA NRA NNL09AA08B from the Aviation Safety program. We also acknowledge the contributions of Daniel Mack, Dinkar Mylaraswamy, and Raj Bharadwaj on the aircraft fault diagnosis work.

References

[1] E. A. Lee. Cyber physical systems: Design challenges. In Object Oriented Real-Time Distributed Computing (ISORC), 2008 11th IEEE International Symposium on, pages 363–369, 2008.
[2] Ragunathan (Raj) Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-physical systems: The next computing revolution. In Proceedings of the 47th Design Automation Conference, DAC ’10, pages 731–736, New York, NY, USA, 2010. ACM.
[3] Peter C. Evans and Marco Annunziata. Industrial internet: Pushing the boundaries of minds and machines. Technical report, GE, 2012.
[4] Promotorengruppe Kommunikation. Im Fokus: Das Industrieprojekt Industrie 4.0, Handlungsempfehlungen zur Umsetzung. Forschungsunion Wirtschaft-Wissenschaft, March 2013.
[5] L. Christiansen, A. Fay, B. Opgenoorth, and J. Neidig. Improved diagnosis by combining structural and process knowledge. In Emerging Technologies Factory Automation (ETFA), 2011 IEEE 16th Conference on, Sept 2011.
[6] Rolf Isermann. Model-based fault detection and diagnosis - status and applications. In 16th IFAC Symposium on Automatic Control in Aerospace, St. Petersburg, Russia, 2004.
[7] Johan de Kleer, Bill Janssen, Daniel G. Bobrow, Tolga Kurtoglu, Bhaskar Saha, Nicholas R. Moore, and Saravan Sutharshana. Fault augmented Modelica models. In The 24th International Workshop on Principles of Diagnosis, pages 71–78, 2013.
[8] D. Klar, M. Huhn, and J. Gruhser. Symptom propagation and transformation analysis: A pragmatic model for system-level diagnosis of large automation systems. In Emerging Technologies Factory Automation (ETFA), 2011 IEEE 16th Conference on, pages 1–9, Sept 2011.
[9] GE. The rise of big data - leveraging large time-series data sets to drive innovation, competitiveness and growth - capitalizing on the big data opportunity. Technical report, General Electric Intelligent Platforms, 2012.
[10] A. Katasonov, O. Kaykova, O. Khriyenko, S. Nikitin, and V. Terziyan. Smart semantic middleware for the internet of things. In 5th International Conference on Informatics in Control, Automation and Robotics (ICINCO), 2008.
[11] Michael Stonebraker. SQL databases v. NoSQL databases. Communications of the ACM, 53(4):10–11, 2010.
[12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–72, Sept 2009.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: cluster computing with working sets. In Proceedings of the 2nd USENIX conference on Hot topics in cloud computing, page 10, June 2010.
[14] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proceedings 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.
[15] M. Jayasree. Data mining: Exploring big data using Hadoop and MapReduce. International Journal of Engineering Sciences Research-IJESR, 4(1), 2013.
[16] Friedhelm Nachreiner, Peter Nickel, and Inga Meyer. Human factors in process control systems: The design of human–machine interfaces. Safety Science, 44(1):5–26, 2006.
[17] Sicco Verwer. Efficient Identification of Timed Automata: Theory and Practice. PhD thesis, Delft University of Technology, 2010.
[18] Oliver Niggemann, Benno Stein, Asmir Vodenčarević, Alexander Maier, and Hans Kleine Büning. Learning behavior models for hybrid timed systems. In Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), pages 1083–1090, Toronto, Ontario, Canada, 2012.
[19] Bjoern Kroll, David Schaffranek, Sebastian Schriegel, and Oliver Niggemann. System modeling based on machine learning for anomaly detection and predictive maintenance in industrial plants. In 19th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Sep 2014.
[20] D.L.C. Mack, G. Biswas, X. Koutsoukos, and D. Mylaraswamy. Learning Bayesian network structures to augment aircraft diagnostic reference model, “to appear”. IEEE Transactions on Automation Science and Engineering, 17:447–474, 2015.




[21] Daniel LC Mack. Anomaly Detection from Complex
     Temporal Spaces in Large Data. PhD thesis, Vander-
     bilt University, Nashville, TN. USA, 2013.
[22] Stephen C Johnson. Hierarchical clustering schemes.
     Psychometrika, 32(3):241–254, 1967.
[23] Jiang Jin. Fault tolerant control systems - an intro-
     ductory overview. Acta Automatica Sinica, 31(1):161–
     174, 2005.
[24] M. Blanke, M. Kinnaert, J. Lunze, and
     M. Staroswiecki.        Diagnosis and fault-tolerant
     control. Springer-Verlag, Sep 2003.
[25] L. Schenato, B. Sinopoli, M. Franceschetti, K. Poolla,
     and S. S. Sastry. Foundations of control and estimation
     over lossy networks. Proceedings of the IEEE,
     volume 95, pages 163–187, Jan 2007.
[26] B. Schneier. Security monitoring: Network security
     for the 21st century. Computers & Security, 2001.
[27] Zhong-Sheng Hou and Zhuo Wang. From model-
     based control to data-driven control: Survey, classi-
     fication and perspective. Information Sciences, 235:3–
     35, 2013.
[28] Hong Wang, Tian-You Chai, Jin-Liang Ding, and
     Martin Brown. Data driven fault diagnosis and fault
     tolerant control: some advances and possible new
     directions. Acta Automatica Sinica, 35(6):739–747,
     2009.
[29] Z. S. Hou and J. X. Xu. On data-driven control theory:
     the state of the art and perspective. Acta Automatica
     Sinica, 35:650–667, 2009.






           Diagnosing Advanced Persistent Threats: A Position Paper
  Rui Abreu and Danny Bobrow and Hoda Eldardiry and Alexander Feldman and
     John Hanley and Tomonori Honda and Johan de Kleer and Alexandre Perez
                                 Palo Alto Research Center
                                    3333 Coyote Hill Rd
                                Palo Alto, CA 94304, USA
  {rui,bobrow,hoda.eldardiry,afeldman,john.hanley,tomo.honda,dekleer,aperez}@parc.com
                                       Dave Archer and David Burke
                                                 Galois, Inc.
                                        421 SW 6th Avenue, Suite 300
                                          Portland, OR 97204, USA
                                          {dwa,davidb}@galois.com
                      Abstract

     When a computer system is hacked, analyzing the root cause (for example, the entry point of penetration) is a diagnostic process. An audit trail, as defined in the National Information Assurance Glossary, is a security-relevant chronological (set of) record(s), and/or destination and source of records that provide evidence of the sequence of activities that have affected, at any time, a specific operation, procedure, or event. After detecting an intrusion, system administrators manually analyze audit trails to both isolate the root cause and assess the damage caused by the attack. Due to the sheer volume of information and low-level activities in the audit trails, this task is cumbersome and time intensive. In this position paper, we discuss our ideas to automate the analysis of audit trails using machine learning and model-based reasoning techniques. Our approach classifies audit trails into the high-level activities they represent, and then reasons about those activities and their threat potential in real-time and forensically. We argue that, by using the outcome of this reasoning to explain complex evidence of malicious behavior, we are equipping system administrators with the proper tools to promptly react to, stop, and mitigate attacks.

1 Introduction
Today, enterprise system and network behaviors are typically “opaque”: stakeholders lack the ability to assert causal linkages in running code, except in very simple cases. At best, event logs and audit trails can offer some partial information on temporally and spatially localized events as seen from the viewpoint of individual applications. Thus current techniques give operators little system-wide situational awareness and no viewpoint informed by a long-term perspective. Adversaries have taken advantage of this opacity by adopting a strategy of persistent, low-observability operation from inside the system, hiding effectively through the use of long causal chains of system and application code. We call such adversaries advanced persistent threats, or APTs.
   To address current limitations, this position paper discusses a technique that aims to track causality across the enterprise and over extended periods of time, identify subtle causal chains that represent malicious behavior, localize the code at the roots of such behavior, trace the effects of other malicious actions descended from those roots, and make recommendations on how to mitigate those effects. By doing so, the proposed approach aims to enable stakeholders to understand and manage the activities going on in their networks. The technique exploits both current and novel forms of local causality to construct higher-level observations of long-term causality in system information flow. We propose to use a machine learning approach to classify segments of low-level events by the activities they represent, and to reason over these activities, prioritizing candidate activities for investigation. The diagnostic engine investigates these candidates, looking for patterns that may represent the presence of APTs. Using pre-defined security policies and related mitigations, the approach explains discovered APTs and recommends appropriate mitigations to operators. We plan to leverage models of APT and normal business logic behavior to diagnose such threats. Note that the technique is not constrained by the availability of human analysts, but can benefit from human-on-the-loop assistance.
   The approach discussed in the paper will offer unprecedented capability for observation of long-term, subtle system-wide activity by automatically constructing such global, long-term causality observa-




tions. The ability to automatically classify causal chains of events in terms of abstractions such as activities will provide operators with a unique capability to orient to long-term, system-wide evidence of possible threats. The diagnostic engine will provide a unique capability to identify whether groups of such activities likely represent active threats, making it easier for operators to decide whether long-term threats are active, and where they originate, even before those threats are identified by other means. Thus, the approach will pave the way for the first automated, long-horizon, continuously operating system-wide support for an effective defender Observe, Orient, Decide, and Act (OODA) loop.

2 Running Example
The methods proposed in this article are illustrated on a realistic running example. The attackers in this example use sophisticated and recently discovered exploits to gain access to the victim’s resources. The attack is remote and does not require social engineering or opening a malicious email attachment. The methods that we propose, however, are not limited to this class of attacks.

      Figure 1: Network topology for the attack (labels in the figure: hacker, Internet, router, victim’s local network, web server front-end, data storage back end, system administrator)

   The network topology used for our running example is shown in Figure 1. The attack is executed over several days. It starts by (1) compromising the web server front-end, followed by (2) a reconnaissance phase and (3) compromising the data storage back end and ultimately extracting and modifying sensitive information belonging to the victim.
   Both the front-end and the back end in this example run unpatched Ubuntu 13.10 Linux OS on an Intel® Sandy Bridge™ architecture.
   What follows is a detailed chronology of the events:
  1. The attacker uses the Apache httpd server, a cgi-bin script, and the Shellshock vulnerability (a GNU bash exploit registered in the Common Vulnerabilities and Exposures (CVE) database as CVE-2014-6271, see https://nvd.nist.gov/) to gain remote shell access to the victim’s front-end. It is now possible for the attacker to execute processes on the front-end as the non-privileged user www-data.
  2. The attacker notices that the front-end is running an unpatched Ubuntu Linux OS version 13.10. The attacker uses the nc Linux utility to copy an exploit for obtaining root privileges. The particular exploit that the attacker uses relies on the x32 recvmmsg() kernel vulnerability registered in the CVE database as CVE-2014-0038. After running the copied binary for a few minutes, the attacker gains root access to the front-end host.
  3. The attacker installs a root-kit utility that intercepts all input to ssh.
  4. A system administrator uses the compromised ssh to connect to the back-end, revealing his back-end password to the attacker.
  5. The attacker uses the compromised front-end to bypass firewalls and uses the newly acquired back-end administrator’s password to access the back-end.
  6. The attacker uses a file-tree traversing utility on the back-end that collects sensitive data and consolidates it in an archive file.
  7. The attacker sends the archive file to a third-party hijacked computer for analysis.

3 Auditing and Instrumentation
Almost all computing systems of a sufficiently high level (with the exception of some embedded systems) leave detailed logs of all system and application activities. Many UNIX variants such as Linux log via the syslog daemon, while Windows™ uses the event log service. In addition to the usual logging mechanisms, there is a multitude of projects related to secure and detailed auditing. An audit log is a more detailed trail of any security or computation-related activity such as file or RAM access, system calls, etc.
   Depending on the level of security we would like to provide, there are several methods for collecting security-related input information. On one extreme, it is possible to use the existing log files. On the other extreme, there are applications for collecting detailed information about the application execution. One such approach [1] runs the processes of interest through a debugger and logs every memory read and write access.
   It is also possible to automatically inject logging calls in the source files before compiling them, allowing us to have static or dynamic logging or a combination of the two. Log and audit information can be signed, encrypted and sent in real-time to a remote server to make system tampering and activity-hiding more difficult. All these configuration decisions impose different trade-offs in security versus computational and RAM load [2] and depend on the organizational context.
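The idea of signing log records so that tampering becomes harder to hide can be sketched as follows. This is a minimal illustration, not the scheme used by the authors: it assumes an HMAC-SHA256 chain in which each tag also covers the previous tag, and the function names `sign_records` and `verify_records` are hypothetical. Encryption, key management, and transport to the remote server are left out.

```python
import hashlib
import hmac


def sign_records(records, key):
    """Return (record, tag) pairs; each tag covers the record and the previous tag."""
    prev_tag = b""
    signed = []
    for record in records:
        tag = hmac.new(key, prev_tag + record.encode("utf-8"), hashlib.sha256).digest()
        signed.append((record, tag))
        prev_tag = tag
    return signed


def verify_records(signed, key):
    """Return the index of the first record that fails verification, or None."""
    prev_tag = b""
    for index, (record, tag) in enumerate(signed):
        expected = hmac.new(key, prev_tag + record.encode("utf-8"), hashlib.sha256).digest()
        if not hmac.compare_digest(expected, tag):
            return index
        prev_tag = tag
    return None
```

Because every tag depends on its predecessor, modifying or deleting a record in the middle of the trail invalidates the chain from that point on, which is the property that makes activity-hiding harder.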


    ...
    front_end.secure_access_log:11.239.64.213 - [22/Apr/2014 06:30:24 +0200] "GET /cgi-bin/test.cgi HTTP/1.1" 401 381
    ...
    front_end.rsyslogd.log:recvmsg(3, msg_name(0) = NULL, msg_iov(1) = ["29/Apr/2014 22:15:49 ...", 8096], msg_controllen = 0, msg_flags = MSG_CTRUNC, MSG_DONTWAIT) = 29
    ...
    back_end:auditctl:type = SYSCALL msg = audit(1310392408.506:36): arch = c000003e syscall = 2 success = yes exit = 3 a0 = 7fff2ce9471d a1 = 0 a2 = 61f768 a3 = 7fff2ce92a20 items = 1 ppid = 20478 pid = 21013 auid = 1000 uid = 0 gid = 0 euid = 0 suid = 0 fsuid = 0 egid = 0 sgid = 0 fsgid = 0 ses = 1 comm = "grep" exe = "/bin/grep"
    ...

Figure 2: Part of log files related to the attack from the running example

   Figure 2 shows part of the logs collected for our running example. The first entry is recorded when the attacker exploits the Shellshock vulnerability through a CGI script of the web server. The second entry shows a syslog, strace-like message resulting from the kernel escalation. Finally, the attacker uses the grep command on the back-end server to search for sensitive information, and the call is recorded by the audit system.
   It is often the case that the raw system and security log files are preprocessed and initial causal links are computed. If we trace the exec, fork, and join POSIX system calls, for example, it is possible to add graph-like structure to the log files by computing provenance graphs. Another method for computing local causal links is to consider shared resources, e.g., two threads reading and writing the same memory address [1].

4 Activity Classification
The Activity Classifier continuously annotates audit trails with semantic tags describing the higher-order activity they represent. For example, ‘remote shell access’, ‘remote file overwrite’, and ‘intra-network data query’ are possible activity tags. These tags are used by the APT Diagnostics Engine to enable higher-order reasoning about related activities, and to prioritize activities for possible investigation.

4.1 Hierarchical semantic annotation of audit trails
A key challenge in abstracting low-level events into higher-order activity patterns that can be reasoned about efficiently is that such patterns can be described at multiple levels of semantic abstraction, all of which may be useful in threat analysis. Indeed, higher-order abstractions may be composed of lower-order abstractions that are in turn abstractions of low-level events. For example, a sequential set of logged events such as ‘browser forking bash’, ‘bash initiating Netcat’, and ‘Netcat listening to new port’ might be abstracted as the activity ‘remote shell access’. The set of activities ‘remote shell access’ and ‘escalation of privilege’ can be abstracted as the activity ‘remote root shell access’.
   We approach activity annotation as a supervised learning problem that uses classification techniques to generate activity tags for audit trails. Table 1 shows multiple levels of activity classifications for the above APT example.
   Table 1 represents one possible classification-enriched audit trail for such an APT. There can be many relatively small variations. For example, obscuring the password file could be done using other programs. A single classifier only allows for a single level of abstraction, and a single leap from low-level events to very abstract activities (for example, from the ‘bash execute perl’ level to the ‘extracting modified file’ level) will have higher error caused by these additional variations.
   To obtain several layers of abstraction for reasoning over, and thus reduce overall error in classification, we use a multi-level learning strategy that models information at multiple levels of semantic abstraction using multiple classifiers. Each classifier solves the problem at one abstraction level, by mapping from a lower-level (fine) feature space to the next higher-level conceptual (coarse) feature space.
   The activity classifier relies on both a vocabulary of activities and a library of patterns describing these activities, both of which will initially be defined manually. This vocabulary and pattern set reside in a Knowledge Base.
   In our training approach, results from training lower-level classifiers are used as training data for higher-level classifiers. In this way, we coherently train all classifiers by preventing higher-level classifiers from being trained with patterns that will never be generated by their lower-level precursors. We use an ensemble learning approach to achieve accurate classification. This involves stacking together both bagged and boosted models to reduce both variance and bias error components [3]. The classification algorithm will be trained using an online-learning technique and integrated within an Active Learning Framework to improve classification of atypical behaviors.
   Generating Training Data for Classification. To build the initial classifier, training data is generated using two methods. First, an actual deployed system is used to collect normal behavior data, and a Subject Matter Expert manually labels it. Second, a testing platform is used to generate data in a controlled environment, particularly platform-dependent vulnerability-related behavior. In addition, to generate new training data of previously unknown behavior, we use an Active Learning framework as described in Section 5.
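The two-level abstraction from Section 4.1 (low-level events to activities, activities to higher-order activities) can be illustrated with a small sketch. It is a rule-based stand-in and not the learned, ensemble-based classifiers the paper proposes: the pattern library is hand-written, the names `LEVEL1`, `LEVEL2`, and `annotate` are hypothetical, and the two events under ‘escalation of privilege’ are invented for illustration only.

```python
def contains_in_order(events, pattern):
    """True if the items of `pattern` occur in `events` in this order (gaps allowed)."""
    remaining = iter(events)
    # `step in remaining` consumes the iterator up to the first match,
    # so later steps are only searched for after earlier ones.
    return all(step in remaining for step in pattern)


# Level 1: ordered low-level events -> activity tag (first pattern from the text).
LEVEL1 = {
    "remote shell access": [
        "browser forking bash",
        "bash initiating Netcat",
        "Netcat listening to new port",
    ],
    # Hypothetical events, for illustration only.
    "escalation of privilege": [
        "bash executing exploit binary",
        "process uid changed to 0",
    ],
}

# Level 2: set of lower-level activities -> higher-order activity tag.
LEVEL2 = {
    "remote root shell access": {"remote shell access", "escalation of privilege"},
}


def annotate(events):
    """Return (level-1 tags, level-2 tags) for a list of low-level events."""
    level1 = [tag for tag, pattern in LEVEL1.items() if contains_in_order(events, pattern)]
    level2 = [tag for tag, parts in LEVEL2.items() if parts <= set(level1)]
    return level1, level2
```

The second level only ever sees tags produced by the first level, which mirrors the training constraint described above: a higher-level classifier is not exposed to patterns its lower-level precursor cannot generate.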






Table 1: Sample classification problem for running example

Activity 1                    Activity 2                        Activity 3
----------------------------  --------------------------------  ------------------------------
Remote Shell Access           Remote File Overwrite             Modified File Download
Shell Shock                   Trojan Installation               Password Exfiltration
Browser (Port 80) fork bash   Netcat listen to Port 8443        Netcat listen to Port 8443
bash fork Netcat              Port 8443 receive binary file     Port 8443 fork bash
Netcat listen to port 8080    binary file overwrites libns.so   bash execute perl
                                                                Perl overwrite /tmp/stolen_pw
                                                                Port 8443 send /tmp/stolen_pw


5 Prioritizer

As the Activity Classifier annotates audit trails with activity descriptors, the two (parallel)
next steps in our workflow are to 1) prioritize potential threats to be referred to the Diagnostic
Engine (see Section 6) for investigation, and 2) prioritize emergent activities that (after
suitable review and labeling) are added to the activity classifier training data. This module
prioritizes activities by threat severity and confidence level. This prioritization process
presents three key challenges.

5.1 Threat-based rank-annotation of activities

One challenge in ranking activities according to their threat potential is the complex (and
dynamic) notion of what constitutes a threat. Ranking based on matching against known prior
threats is necessary, but not sufficient. An ideal ranking approach should take known threats
into account, while also proactively considering the unknown threat potential of new kinds of
activities. Another such challenge is that risk may be assessed at various levels of activity
abstraction, so the overall ranking must be computed by aggregating risk assessments at multiple
abstraction levels.

We implement two ranking approaches: a supervised ranker based on previously known threats and an
unsupervised ranker that considers unknown potential threats.

Supervised ranking using APT classification to catch known threats. The goal of APT
classification is to provide the diagnostic engine with critical APT-related information such as
APT phase, severity of attack, and the confidence level associated with APT tagging, for threat
prioritization. Since the audit trails are annotated hierarchically into actions of different
granularities, multiple classifiers will be built to consider each hierarchical level separately.
APT classifiers are used to identify entities that are likely to be instances of known threats or
phases of an APT attack. Two types of classifiers are used. The first classifier is hand-coded
and the second classifier is learned from training data.

The hand-coded classifier is designed to have high precision, using hand-coded rules, mirroring
SIEM and IDS systems. Entities tagged by this classifier are given the highest priority for
investigation. The second classifier, which is learned from training data, will provide higher
recall at the cost of precision. Activities are ranked according to their threat level by
aggregating a severity measure (determined by classified threat type) and a confidence measure.
We complement the initial set of training data to calibrate our classifiers by using an Active
Learning Framework, which focuses on improving the classification algorithm through occasional
manual labeling of the most critical activities in the audit trails.

Unsupervised ranking using normalcy characterization to catch unknown threats. The second
component of the prioritizer is a set of unsupervised normalcy rankers, which rank entities based
on their statistical "normalcy". Activities identified as unusual will be fed to the Active
Learning framework to check if any of them are "unknown" APT activities. This provides a
mechanism for detecting "unknown" threats while also providing feedback to improve the APT
classifier.

5.2 Combining Multiple Rankings

One of the key issues with combining the outputs of multiple risk rankers is dealing with
two-dimensional risk (severity, confidence) scores that may be on very different scales. A
diverse set of score normalization techniques has been proposed [4; 5; 6] to deal with this
issue, but no single technique has been found to be superior to all the others. An alternative to
combining scores is to combine rankings [7]. Although converting scores to rankings does lose
information, it remains an open question whether the loss in information is compensated for by
the convenience of working with the common scale of rankings.

We will develop combination techniques for weighted risk rankings based on probabilistic rank
aggregation methods. This approach builds on our own work [8] that shows the robustness of the
weighted ranking approach. We also build on principled methods for combining ranking data found
in the statistics and information retrieval literature.

Traditionally, the goal of rank aggregation [9; 10] is to combine a set of rankings of the same
candidates into a single consensus ranking that is "better" than the individual rankings. We
extend the traditional approach to accommodate the specific context of weighted risk ranking.
First, unreliable rankers will be identified and either ignored or down-weighted, lest their
rankings decrease the quality of the overall consensus [7; 10]. Second, we will discount
excessive correlation among rankers, so that a set of






highly redundant rankers do not completely outweigh the contribution of other alternative
rankings. To address these two issues, we will associate a probabilistic latent variable Z_i with
the i'th entity of interest, which indicates whether the entity is anomalous or normal. Then, we
will build a probabilistic model that allows us to infer the posterior distribution over the Z_i
based on the observed rankings produced by each of the input weighted risk rankings. This
posterior probability of Z_i being normal will then be used as the weighted risk rank. Our model
will make the following assumptions to account for both unreliable and correlated rankers:
1) anomalies are ranked lower than all normal instances and these ranks tend to be concentrated
near the lower rankings of the provided weighted risk rankings, and 2) normal data instances tend
to be uniformly distributed near the higher rankings of the weighted risk rankings.

There are various ways to build a probabilistic model that reflects the above assumptions and
allows for the inference of the Z_i variables through Expectation-Maximization [11]. In addition
to these assumptions, we will explore allowing other factors to influence the latent Z_i
variables, such as features of the entities as well as feedback provided by expert analysts.

6 Diagnosis

We view the problem of detecting, isolating, and explaining complex APT campaign behavior from
rich activity data as a diagnostic problem. We will use AI-based diagnostic reasoning to guide
the global search for possible vulnerabilities that enabled the breach. Model-based diagnosis
(MBD) [12] is a particularly compelling approach as it supports reasoning over complex causal
networks (for example, having multiple conjunctions, disjunctions, and negations) and identifies
often subtle combinations of root causes of the symptoms (the breach).

6.1 An MBD approach for APT detection and isolation: Motivation

Attack detection and isolation are two distinct challenges. Often diagnostic approaches use
separate models for detection and isolation [13]. MBD, however, uses a single model to combine
these two kinds of reasoning. The security model contains both part of the security policy (that
communicating with certain blacklisted hosts may indicate an information leak) and information
about the possible locations and consequences of a vulnerability (a privilege escalation may lead
to an information leak). The security model also contains abstract security constraints, such as:
if a process requires authentication, a password must be read and compared against.

The diagnostic approach takes into consideration the bootstrapping of an APT, which we consider
the root cause of the attack. What enables a successful APT is either a combination of software
component vulnerabilities or the combined use of social engineering and insufficiency of the
organizational security policies. We use MBD for computing the set of simultaneously exploited
vulnerabilities that allowed the deployment of the APT. Computing such explanations is possible
because MBD reasons in terms of multiple faults [14]. In our running example this set would
include both the fact that the web server has been exploited due to the Shellshock vulnerability
and that the attacker gained privileged access on the front-end due to the use of the X64_32
escalation vulnerability.

The abstract security model is used to gather information about types of attacks the system is
vulnerable to, and to aid in deciding the set of actions required to stop an APT campaign (policy
enforcement). Various heuristics exist to find the set of meaningful diagnosis candidates. As an
example, one might be interested in the minimal set of actions to stop the attack [15; 16] or
select those candidates that capture significant probability mass [17]. In the rest of this
section, for illustration purposes, we use minimality as the heuristic of interest. MBD is the
right tool for dealing with computation of diagnosis candidates as it offers several ways to
address the modeling and computational complexity [18; 19].

6.2 Detection and Isolation of Attacks from Abstract Security Model and Sensor Data

The abstract security model provides an abstraction mechanism that is originally missing in the
audit trails. More precisely, what is not in the audit trails and what is in the security model
is how to connect (possibly disconnected) activities for the purpose of global reasoning. The
abstract security model and the sensor data collected from the audit trails are provided as
inputs to an MBD algorithm that performs the high-level reasoning about possible vulnerabilities
and attacks, similar to what a human security officer would do.

The information in the "raw" audit trails is of too high fidelity [2] and too low abstraction to
be used by a "crude" security model. That is the reason the diagnostic engine needs the machine
learning module to temporally and spatially group nodes in the audit trails and to provide
semantically rich variable/value sensor data about actions, suitable for MBD. Notice that in this
process, the audit trail structure is translated to semantic categories, i.e., the diagnostic
engine receives as observations time-series of sensed actions.

The listing that follows shows an abstract security model for the running example in the LYDIA
language [20]. This bears some resemblance to PROLOG, except that LYDIA is a language for
model-based diagnosis of logical circuits while PROLOG is for Horn-style reasoning. The use of
LYDIA is for illustration purposes only; in reality computer systems can be much more easily
modeled as state machines. There is a significant body of literature dealing with diagnosis of
discrete-event systems [21; 22; 23], to name just a few.
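
To make the multiple-fault, minimality-based reasoning of this section concrete, the sketch below enumerates minimal diagnoses by brute force in Python. It is an illustration only, not how LYDIA or any MBD engine works internally: the constraints are transcribed from the LYDIA model of the running example, the function names are hypothetical, and the observation values (`comm` false, `access_passwd` true) are assumptions, since the observation file is not reproduced in the paper.

```python
from itertools import combinations, product

# Assumable (health) variables: True means the vulnerability was exploited.
VULNS = ("httpd_shell_vuln", "buffer_overflow_vuln", "escalation_vuln")
# Internal variables that the reasoner is free to assign.
INTERNAL = ("httpd_shell", "root_shell", "leak_passwd", "know_root_password")


def implies(a, b):
    return (not a) or b


def consistent(faulty, obs):
    """True if some internal assignment satisfies every model constraint."""
    h = {v: (v in faulty) for v in VULNS}
    for values in product((False, True), repeat=len(INTERNAL)):
        s = dict(zip(INTERNAL, values))
        ok = (
            implies(not h["httpd_shell_vuln"], not s["httpd_shell"])
            and implies(not h["escalation_vuln"], not s["root_shell"])
            and implies(not h["buffer_overflow_vuln"], not s["leak_passwd"])
            and implies(not obs["access_passwd"], not s["leak_passwd"])
            and implies(
                s["know_root_password"],
                (s["httpd_shell"] or s["leak_passwd"]) and s["root_shell"],
            )
            and implies(not s["know_root_password"], obs["comm"])
        )
        if ok:
            return True
    return False


def minimal_diagnoses(obs):
    """Subset-minimal sets of exploited vulnerabilities consistent with obs."""
    found = []
    for size in range(len(VULNS) + 1):
        for cand in combinations(VULNS, size):
            cand = frozenset(cand)
            if any(d <= cand for d in found):
                continue  # a smaller diagnosis already explains the observation
            if consistent(cand, obs):
                found.append(cand)
    return found


if __name__ == "__main__":
    # Assumed observation: a forbidden communication was seen (comm is false)
    # and the password file was accessed.
    for d in minimal_diagnoses({"access_passwd": True, "comm": False}):
        print(sorted(d))
```

Under these assumed observations the sketch finds two minimal candidates, each pairing the escalation vulnerability with one way of obtaining initial access. Exhaustive enumeration is exponential in the number of variables, which is why practical MBD engines rely on conflict-directed or stochastic search [18; 19].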






    system front_end(bool know_root_password)
    {
        bool httpd_shell_vuln;       // vulnerability
        bool buffer_overflow_vuln;   // vulnerability
        bool escalation_vuln;        // vulnerability

        bool httpd_shell;
        bool root_shell;
        bool leak_passwd;

        // weak-fault models
        if (!httpd_shell_vuln) {     // if healthy
            !httpd_shell;            // forbid shells via httpd
        }

        if (!escalation_vuln) {      // if healthy
            !root_shell;             // no root shell is possible
        }

        if (!buffer_overflow_vuln) { // if healthy
            !leak_passwd;            // passwords don't leak
        }

        bool access_passwd;
        attribute observable(access_passwd) = true;

        !access_passwd => !leak_passwd;

        /**
         * Knowing the root password can be explained
         * by a root shell (for example there is a
         * password sniffer).
         */
        know_root_password =>
            ((httpd_shell || leak_passwd) && root_shell);
    }

    system back_end(bool know_root_password)
    {
        bool comm;
        attribute observable(comm) = true;

        /**
         * Normal users can only communicate with a
         * list of permitted hosts.
         */
        if (!know_root_password) {
            comm == true;
        }
    }

    system main()
    {
        bool know_root_password;

        system front_end fe(know_root_password);
        system back_end be(know_root_password);
    }

LYDIA translates the model to an internal propositional logic formula. Part of this internal
representation is shown in figure 3, which uses the standard VLSI [24] notation to denote
AND-gates, OR-gates, and NOT-gates. Wires are labeled with variable names. Boolean circuits
(matching propositional logic), however, have limited expressiveness and modeling security
constraints in them is notoriously difficult, hence we plan to create or use a specialized modal
logic similar to the one proposed in [25].

[Figure 3: Part of the abstract security model for the running example. Wire labels recoverable
from the figure: know root password, root shell, httpd shell, leak pw, buffer overflow vuln, p,
q, r. Legend: 1 = assumable variable, 2 = internal variable.]

Notice that the format of the Boolean circuit shown in figure 3 is very close to the one used in
a Truth Maintenance System (TMS) [26]. The only assumable variable in figure 3 is
buffer_overflow_vuln and its default value is false (i.e., there is no buffer overflow
vulnerability in the web server process).

We next show how a reasoning engine can discover a conflict through forward and backward
propagation. Looking at figure 3, it is clear that r must be true because it is an input to an
AND-gate whose output is set to true. Therefore either p or q (or both) must be true. This means
that either buffer_overflow_vuln or leak_pw must be false. If we say that leak_pw is assumed to
be true (measured or otherwise inferred), then leak_pw and buffer_overflow_vuln are together part
of a conflict. It means that the reasoning engine has to change one of them to resolve the
contradiction.

Based on the observation from our running example and a TMS constructed from the security model
shown in figure 3, the hitting set algorithm computes two possible diagnostic hypotheses: (1) the
attacker gained shell access through a web-server vulnerability and the attacker performed
privilege escalation or (2) the attacker injected binary code through a buffer overflow and the
attacker performed privilege escalation.

If we use LYDIA to compute the set of diagnoses for the running example, we get the following two
(ambiguous) diagnoses for the root cause of the penetration:

    $ lydia example.lm example.obs
    d1 = { fe.escalation_vuln, fe.httpd_shell_vuln }
    d2 = { fe.buffer_overflow_vuln, fe.escalation_vuln }

MBD uses probabilities to compute a sequence of possible diagnoses ordered by likelihood. This
probability can be used for many purposes: to decide which diagnosis is more likely to be the
true fault explanation, whether there is a need to consider further evidence from the logs, or
to limit the number of diagnoses that need to be identified. Many policies exist to compute these
probabilities [27; 28].

For illustration purposes we consider that the diagnoses for the running example are ambiguous.
Before we discuss methods for dealing with this ambiguity, we address the major research
challenge of model generation.

6.3 Model Generation

The abstract vulnerability model can either be constructed manually or semi-automatically. The
challenge with modeling is that an APT campaign generally exploits unknown vulnerabilities.
Therefore, our approach to address this issue is to construct a model which captures the expected
behavior (known goods) of the system. Starting from generic parameterized vulnerability models
and security objectives, the abstract vulnerability model can be extended with information
related to known vulnerabilities (known bads).

We will explore avenues to generate this model manually, which requires significant knowledge
about potential security vulnerabilities, while being error prone and not detailed enough.
Amongst company-specific requirements, we envisage the abstract vulnerability model to capture
the most common attacks that target software systems, as described in the Common Attack Pattern
Enumeration and Classification (CAPEC, http://capec.mitre.org/). The comprehensive list of known
attacks has been designed to better understand the perspective of an attacker exploiting the
vulnerabilities and, from this knowledge, devise appropriate defenses.

As modeling is challenging, we propose to explore semi-automatic approaches to construct models.
The semi-automatic method is suitable for addressing the modeling challenge because in security,
similarly to diagnosis, there are (1) component models and (2) structure. While it is difficult
to automate the building of component models (this may even require natural language parsing of
databases such as CAPEC), it is feasible to capture diagnosis-oriented information from structure
(physical networking or network communication).

Yet another approach to semi-automatically generate the model is to learn it from executions of
the system (e.g., during regression testing, just before deployment). This approach to system
modeling is inspired by work in automatic software debugging [29], where modeling of program
behavior is done in terms of abstractions of program traces (known as spectra [30]), abstracting
from modeling specific components and data dependencies.

The outlined approaches to construct the abstract vulnerability model entail different costs and
diagnostic accuracies. As expected, manually building the model is the most expensive one. Note
that building the model is a time-consuming and error-prone task. The two semi-automatic ways
also entail different costs: one exploits the available, static information and the other
requires the system to be executed to compute a meaningful set of executions. We will investigate
the trade-offs between modeling approaches and their diagnostic accuracy in the context of
transparent computing.

7 Conclusions

Identifying the root cause and performing damage impact assessment of advanced persistent threats
can be framed as a diagnostic problem. In this paper, we discuss an approach that leverages
machine learning and model-based diagnosis techniques to reason about potential attacks.

Our approach classifies audit trails into high-level activities, and then reasons about those
activities and their threat potential in real-time and forensically. By using the outcome of this
reasoning to explain complex evidence of malicious behavior, system administrators are provided
with the proper tools to promptly react to, stop, and mitigate attacks.

References

[1] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. High accuracy attack provenance via binary-based
    execution partition. In Proceedings of the 20th Annual Network and Distributed System
    Security Symposium, San Diego, CA, February 2013.
[2] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. LogGC: Garbage collecting audit log. In
    Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pages
    1005–1016, Berlin, Germany, 2013.
[3] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles
    for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE
    Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews,
    42(4):463–484, July 2012.
[4] Charu C Aggarwal. Outlier ensembles: Position paper. ACM SIGKDD Explorations Newsletter,
    14(2):49–58, 2013.
[5] Jing Gao and Pang-Ning Tan. Converting output scores from outlier detection algorithms into
    probability estimates. In Proceedings of the Sixth International Conference on Data Mining,
    pages 212–221. IEEE, December 2006.
[6] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying
    outlier scores. In Proceedings of the Eleventh SIAM International Conference on Data Mining,
    pages 13–24, April 2011.
[7] Erich Schubert, Remigius Wojdanowski, Arthur Zimek, and Hans-Peter Kriegel. On evaluation of
    outlier rankings and outlier scores. In Proceedings of the Twelfth SIAM International
    Conference on Data Mining, pages 1047–1058, April 2012.






[8] Hoda Eldardiry, Kumar Sricharan, Juan Liu,                      sis using greedy stochastic search. Journal of Ar-
     John Hanley, Bob Price, Oliver Brdiczka, and                   tificial Intelligence Research, 38:371–413, 2010.
     Eugene Bart. Multi-source fusion for anomaly              [19] Nuno Cardoso and Rui Abreu. A distributed
     detection: using across-domain and across-time                 approach to diagnosis candidate generation. In
     peer-group consistency checks.          Journal of             Progress in Artificial Intelligence, pages 175–
     Wireless Mobile Networks, Ubiquitous Com-                      186. Springer, 2013.
     puting, and Dependable Applications (JoWUA),
                                                               [20] Alexander Feldman, Jurryt Pietersma, and Ar-
     5(2):39–58, 2014.
                                                                    jan van Gemund.          All roads lead to fault
[9] Yoav Freund, Raj D. Iyer, Robert E. Schapire,                   diagnosis: Model-based reasoning with LYDIA.
     and Yoram Singer. An efficient boosting al-                    In Proceedings of the Eighteenth Belgium-
     gorithm for combining preferences. Journal of                  Netherlands Conference on Artificial Intelli-
     Machine Learning Research, 4(Nov):933–969,                     gence (BNAIC’06), Namur, Belgium, October
     2003.                                                          2006.
[10] Ke Deng, Simeng Han, Kate J Li, and Jun S Liu.            [21] Meera Sampath, Raja Sengupta, Stephane Lafor-
     Bayesian aggregation of order-based rank data.                 tune, Kasim Sinnamohideen, and Demosthenis C
     Journal of the American Statistical Association,               Teneketzis. Failure diagnosis using discrete-
     109(507):1023–1039, 2014.                                      event models. Control Systems Technology, IEEE
[11] Arthur P Dempster, Nan M Laird, and Donald B                   Transactions on, 4(2):105–124, 1996.
     Rubin. Maximum likelihood from incomplete                 [22] Alban Grastien, Marie-Odile Cordier, and Chris-
     data via the EM algorithm. Journal of the royal                tine Largouët. Incremental diagnosis of discrete-
     statistical society. Series B, 39(1):1–38, 1977.               event systems. In DX, 2005.
[12] Johan de Kleer, Olivier Raiman, and Mark                  [23] Alban Grastien, Patrik Haslum, and Sylvie
     Shirley. One step lookahead is pretty good. In                 Thiébaux. Conflict-based diagnosis of discrete
     Readings in Model-Based Diagnosis, pages 138–                  event systems: theory and practice. 2012.
     142. Morgan Kaufmann Publishers, San Fran-
                                                               [24] Behrooz Parhami. Computer Arithmetic: Algo-
     cisco, CA, 1992.
                                                                    rithms and Hardware Designs. Oxford Univer-
[13] Alexander Feldman, Tolga Kurtoglu, Sriram                      sity Press, Inc., New York, NY, USA, 2nd edi-
     Narasimhan, Scott Poll, David Garcia, Johan                    tion, 2009.
     de Kleer, Lukas Kuhn, and Arjan van Gemund.
                                                               [25] Janice Glasgow, Glenn Macewen, and Prakash
     Empirical evaluation of diagnostic algorithm
                                                                    Panangaden. A logic for reasoning about secu-
     performance using a generic framework. In-
                                                                    rity. ACM Transactions on Computer Systems,
     ternational Journal of Prognostics and Health
                                                                    10(3):226–264, August 1992.
     Management, pages 1–28, 2010.
                                                               [26] Kenneth Forbus and Johan de Kleer. Building
[14] Johan de Kleer and Brian Williams. Diagnosing
                                                                    Problem Solvers. MIT Press, 1993.
     multiple faults. Artificial Intelligence, 32(1):97–
     130, 1987.                                                [27] Johan de Kleer. Diagnosing multiple persistent
[15] Oleg Sheyner, Joshua Haines, Somesh Jha,                       and intermittent faults. In Proceeding of the 2009
                                                                    International Joint Conference on Artificial In-
     Richard Lippmann, and Jeannette M Wing. Au-
                                                                    telligence, pages 733–738, July 2009.
     tomated generation and analysis of attack graphs.
     In Proceeding of the 2002 IEEE Symposium                  [28] Rui Abreu, Peter Zoeteweij, and Arjan J. C.
     on Security and Privacy, pages 273–284. IEEE,                  Van Gemund. A new bayesian approach to multi-
     May 2002.                                                      ple intermittent fault diagnosis. In Proceeding of
[16] Seyit Ahmet Camtepe and Bülent Yener. Mod-                     the 2009 International Joint Conference on Arti-
                                                                    ficial Intelligence, pages 653–658, July 2009.
     eling and detection of complex attacks. In Pro-
     ceeding of the Third International Conference on          [29] Rui Abreu, Peter Zoeteweij, and Arjan JC
     Security and Privacy in Communications Net-                    Van Gemund. Spectrum-based multiple fault lo-
     works, pages 234–243, September 2007.                          calization. In Proceedings of the 24th IEEE/ACM
[17] Rui Abreu and Arjan JC van Gemund. A                           International Conference on Automated Soft-
                                                                    ware Engineering, pages 88–99, November
     low-cost approximate minimal hitting set algo-
                                                                    2009.
     rithm and its application to model-based diagno-
     sis. In Proceedings of the Eighth Symposium on            [30] Mary Jean Harrold, Gregg Rothermel, Kent
     Abstraction, Reformulation and Approximation,                  Sayre, Rui Wu, and Liu Yi. An empirical in-
     pages 2–9, July 2009.                                          vestigation of the relationship between spectra
[18] Alexander Feldman, Gregory Provan, and Arjan                   differences and regression faults. Software Test-
                                                                    ing Verification and Reliability, 10(3):171–194,
     van Gemund. Approximate model-based diagno-
                                                                    2000.








A Structural Model Decomposition Framework for Hybrid Systems Diagnosis
Matthew Daigle (1), Anibal Bregon (2), and Indranil Roychoudhury (3)
(1) NASA Ames Research Center, Moffett Field, CA 94035, USA
    e-mail: matthew.j.daigle@nasa.gov
(2) Department of Computer Science, University of Valladolid, Valladolid, 47011, Spain
    e-mail: anibal@infor.uva.es
(3) Stinger Ghaffarian Technologies Inc., NASA Ames Research Center, Moffett Field, CA 94035, USA
    e-mail: indranil.roychoudhury@nasa.gov

Abstract

Many practical systems in aerospace and industrial environments are best represented as hybrid
systems that consist of discrete modes of behavior, each defined by a set of continuous dynamics.
These hybrid dynamics make the online fault diagnosis task very challenging. In this work, we
present a new modeling and diagnosis framework for hybrid systems. Models are composed from sets of
user-defined components using a compositional modeling approach. Submodels for residual generation
are then generated for a given mode, and reconfigured efficiently when the mode changes. Efficient
reconfiguration is achieved by exploiting causality information within the hybrid system models.
The submodels can then be used for fault diagnosis based on residual generation and analysis. We
demonstrate efficient causality reassignment, submodel reconfiguration, and residual generation for
fault diagnosis using an electrical circuit case study.

1 Introduction

Robust and efficient fault diagnosis plays an important role in ensuring the safe, correct, and
efficient operation of complex engineering systems. Many engineering systems are modeled as hybrid
systems that have both continuous and discrete-event dynamics, and for such systems, the complexity
of fault diagnosis methodologies increases significantly. In this paper, we develop a new modeling
framework and structural model decomposition approach that enable efficient online fault diagnosis
of hybrid systems.

During the last few years, different proposals have been made for hybrid systems diagnosis,
focusing on hybrid modeling, such as hybrid automata [1–3], on hybrid state estimation [4], or on a
combination of online state tracking and residual evaluation [5]. However, in all these cases, the
proposed solutions involve modeling and pre-enumeration of the set of all possible system-level
discrete modes, which grows exponentially with the number of switching components. Both steps are
computationally very expensive or infeasible for hybrid systems with complex interacting
subsystems.

One solution to the mode pre-enumeration problem is to build hybrid system models in a
compositional way, where discrete modes are defined at a local level (e.g., at the component
level), so that the system-level mode is defined implicitly by the local component-level modes.
Since this allows the modeler to focus on the discrete behavior only at the component level, the
pre-enumeration of all the system-level modes can be avoided [6, 7]. Additionally, building models
in a compositional way facilitates reusability and maintenance, and allows the components to be
validated individually before they are composed to create the global hybrid system model.

In a system model, the effects of mode changes in individual components may force other components
to reconfigure their computational structures, or causality, during the simulation process, which
requires developing efficient online causality reassignment procedures. As an example of this kind
of approach, Hybrid Bond Graphs (HBGs) [8] have been used by different authors [9, 10], and
efficient causality reassignment has been developed previously for such models [11]. However, the
main limitation of HBGs is that the set of possible components is restricted (e.g., resistors,
capacitors, 0-junctions, etc.), with each component having to conform to a certain set of
mathematical constraints, and modelers do not have the liberty to define and use their own
components. Another example is that of [7], which uses a more general modeling framework, and
tackles the causality reassignment problem from a graph-theoretic perspective.

In this work, we propose a compositional modeling approach for hybrid systems, where models are
made up of sets of user-defined components. Here, a component is constructed by defining a set of
discrete modes, with a different set of mathematical constraints describing the continuous dynamics
in each mode. Then, we borrow ideas for efficient causality reassignment in HBGs [11], and propose
algorithms for efficient causality assignment in our component-based models, extending and
generalizing those from HBGs. We then apply structural model decomposition [12] to compute minimal
submodels for the initial mode of the system. These submodels are used for fault diagnosis based on
residual generation and analysis. Based on efficient causality reassignment, submodels can be
reconfigured efficiently upon mode changes. Using an electrical circuit as a case study, we
demonstrate efficient causality reassignment and submodel reconfiguration and show that these
submodels can correctly compute system outputs for residual generation in the presence of known
mode changes.

The paper is organized as follows. Section 2 presents the modeling approach and introduces the case
study. Section 3 presents the overall approach for hybrid systems fault diagnosis based on
structural model decomposition. Section 4 develops the causality analysis and assignment
algorithms. Section 5 presents the structural model decomposition approach. Section 6 describes
efficient causality reassignment. Section 7 demonstrates the approach for the electrical case
study. Section 8 reviews the related work and current approaches for hybrid systems fault
diagnosis. Finally, Section 9 concludes the paper.

2 Compositional Hybrid Systems Modeling

We define hybrid system dynamics in a general compositional way, where the system is made up of a
set of components. Each component is defined by a set of discrete modes, with a different set of
constraints describing the continuous dynamics of the component in each mode. Here, system-level
modes are defined implicitly through the composition of the component-level modes. Because the
number of system-level modes is exponential in the number of switching components, we want to avoid
generating and reasoning over the system-level hybrid model, instead working directly with the
component models.

To illustrate our proposal, throughout the paper we will use a circuit example, shown in Fig. 1.
The components of the circuit are a voltage source, V, two capacitors, C1 and C2, two inductors, L1
and L2, two resistors, R1 and R2, and two switches, Sw1 and Sw2, as well as components for series
and parallel connections. Sensors measure the current or voltage in different locations ($i_3$,
$v_8$, and $i_{11}$, as indicated in Fig. 1). Because each switch has two modes (on and off), there
are four total modes in the system.

[Figure 1: Electrical circuit running example. The schematic shows the voltage source v(t), the
switches Sw1 and Sw2, the capacitors C1 and C2, the resistors R1 and R2, the inductors L1 and L2,
and the sensed quantities i3, v8, and i11.]

In the following, we present the main details of our hybrid system modeling framework, which
extends the modeling approach described in [12] with the notion of components and with hybrid
system dynamics.

2.1 System Modeling

At the basic level, the continuous dynamics of a component in each mode are modeled using a set of
variables and a set of constraints. A constraint is defined as follows:

Definition 1 (Constraint). A constraint $c$ is a tuple $(\varepsilon_c, V_c)$, where
$\varepsilon_c$ is an equation involving variables $V_c$.

A component is defined by a set of constraints over a set of variables. The constraints are
partitioned into different sets, one for each component mode. A component is then defined as
follows:

Definition 2 (Component). A component $\delta$ with $n$ discrete modes is a tuple
$\delta = (V_\delta, C_\delta)$, where $V_\delta$ is a set of variables and $C_\delta$ is a set of
constraint sets, where $C_\delta$ is defined as
$C_\delta = \{C_\delta^1, C_\delta^2, \ldots, C_\delta^n\}$, with a constraint set, $C_\delta^m$,
defined for each mode $m \in \{1, \ldots, n\}$.

The components of the circuit are defined in Table 1 (first three columns).

Example 1. Consider the component R1 ($\delta_6$). It has only a single mode with a single
constraint $v_5 = i_5 \cdot R_1$ over variables $\{v_5, i_5, R_1\}$.

Example 2. Consider the component Sw2 ($\delta_{10}$). It has two modes: on and off. In the off
mode, it has three constraints setting each of its currents ($i_9$, $i_{10}$, $i_{11}$) to 0. In
the on mode, it also has three constraints, which set the three currents equal to each other and
require that the voltages sum, $v_9 = v_{10} + v_{11}$ (it acts like a series connection when in
the on mode).

We can define a system model by composing components:

Definition 3 (Model). A model $M = \{\delta_1, \delta_2, \ldots, \delta_d\}$ is a finite set of $d$
components for $d \in \mathbb{N}$.

Example 3. The model of the electrical system is made up of the components detailed in Table 1,
i.e., $M = \{\delta_1, \delta_2, \ldots, \delta_{15}\}$. For each component, the variables and
constraints are defined for each component mode (third column).

Note that the set of variables for a model does not change with the mode, so a component needs only
a single variable set, and not a set of variable sets as with constraints. The set of variables for
a model, $V_M$, is simply the union of all the component variable sets, i.e., for $d$ components,
$V_M = V_{\delta_1} \cup V_{\delta_2} \cup \ldots \cup V_{\delta_d}$. The interconnection structure
of the model is captured using shared variables between components: two components $\delta_i$ and
$\delta_j$ are connected if they share a variable, i.e., if
$V_{\delta_i} \cap V_{\delta_j} \neq \emptyset$. $V_M$ consists of five disjoint sets, namely, the
set of state variables, $X_M$; the set of parameters, $\Theta_M$; the set of inputs (variables not
computed by any constraint), $U_M$; the set of outputs (variables not used to compute any other
variables), $Y_M$; and the set of auxiliary variables, $A_M$. Parameters, $\Theta_M$, include
explicit model parameters that are used in the model constraints (e.g., fault parameters).
Auxiliary variables, $A_M$, are additional variables that are algebraically related to the state,
parameter, and input variables, and are used to simplify the structure of the equations.

Example 4. In the circuit model, we have $X_M = \{i_3, v_6, i_8, v_{11}\}$,
$\Theta_M = \{L_1, R_1, C_1, L_2, R_2, C_2\}$, $U_M = \{u_v\}$, and
$Y_M = \{i_3^*, i_{11}^*, v_8^*\}$. The remaining variables belong to $A_M$. Here, the $*$
superscript is used to denote a measured value of a physical variable, e.g., $i_3 \in X_M$ is the
current and $i_3^* \in Y_M$ is the measured current. Since $i_3$ is used to compute other
variables, like $i_2$, it cannot belong to $Y_M$ and a separation of the variables is required.
Connected components are identified by shared variables, e.g., R1 and Series Connection1 are
connected because they share $i_5$ and $v_5$.

The model constraints, $C_M$, are a union of the component constraints over all modes, i.e.,
$C_M = C_{\delta_1} \cup C_{\delta_2} \cup \ldots \cup C_{\delta_d}$, where
$C_{\delta_i} = C_{\delta_i}^1 \cup C_{\delta_i}^2 \cup \ldots \cup C_{\delta_i}^n$ for $n$ modes.
Constraints are exclusive to components, that is, a constraint $c \in C_M$ belongs to exactly one
$C_\delta$ for $\delta \in M$.

To refer to a particular mode of a model we use the concept of a mode vector. A mode vector $m$
specifies the current mode of each of the components of a model. So, the constraints for a mode $m$
are denoted as $C_M^m$.

Example 5. Consider a model with five components. If $m = [1, 1, 3, 2, 1]$, then components
$\delta_1$, $\delta_2$, and $\delta_5$ use the constraints of their mode 1, component $\delta_3$
uses the constraints of its mode 3, and component $\delta_4$ uses the constraints of its mode 2.
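Definitions 1 to 3 and the mode vector map naturally onto small data structures. The following is a minimal Python sketch, not the authors' implementation: the class and function names are hypothetical, equations are stored only as display strings, and each component's variable set is derived from its constraints as a convenience (an assumption; the definitions take the variable set as given). The component contents are copied from Table 1.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Constraint:
    """Definition 1: an equation together with the variables it involves."""
    equation: str          # textual form of the equation, for display only
    variables: frozenset   # V_c


@dataclass
class Component:
    """Definition 2: one variable set plus one constraint set per mode."""
    name: str
    variables: frozenset   # V_delta, identical in every mode
    modes: dict            # mode index (1-based) -> tuple of Constraint


def constraint(equation, *variables):
    return Constraint(equation, frozenset(variables))


def component(name, modes):
    # Assumption: V_delta is taken as the union of variables over all modes.
    variables = frozenset(v for cs in modes.values() for c in cs for v in c.variables)
    return Component(name, variables, modes)


def model_variables(model):
    """V_M: union of the component variable sets."""
    return frozenset().union(*(comp.variables for comp in model))


def connected(a, b):
    """Two components are connected if they share at least one variable."""
    return bool(a.variables & b.variables)


def active_constraints(model, mode_vector):
    """C_M^m: constraints selected by a mode vector (one mode per component)."""
    return [c for comp, m in zip(model, mode_vector) for c in comp.modes[m]]


def mode_vectors(model):
    """All mode vectors; their number is the product of the per-component mode counts."""
    return list(product(*(sorted(comp.modes) for comp in model)))


# Components copied from Table 1.
R1 = component("R1", {1: (constraint("v5 = i5 * R1", "v5", "i5", "R1"),)})
SERIES1 = component("Series Connection1", {1: (
    constraint("i4 = i5", "i4", "i5"),
    constraint("i4 = i6", "i4", "i6"),
    constraint("i4 = i7", "i4", "i7"),
    constraint("v4 = v5 + v6 + v7", "v4", "v5", "v6", "v7"),
)})
SW1 = component("Sw1", {
    1: (constraint("i1 = 0", "i1"), constraint("i2 = 0", "i2")),
    2: (constraint("i1 = i2", "i1", "i2"), constraint("v1 = v2", "v1", "v2")),
})
SW2 = component("Sw2", {
    1: (constraint("i9 = 0", "i9"), constraint("i10 = 0", "i10"),
        constraint("i11 = 0", "i11")),
    2: (constraint("i9 = i10", "i9", "i10"), constraint("i9 = i11", "i9", "i11"),
        constraint("v9 = v10 + v11", "v9", "v10", "v11")),
})

if __name__ == "__main__":
    print(connected(R1, SERIES1))      # R1 and Series Connection1 share i5 and v5
    print(mode_vectors([SW1, SW2]))    # the four mode vectors of the two switches
    for c in active_constraints([SW1, SW2], (1, 2)):
        print(c.equation)              # Sw1 off, Sw2 on
```

Only the component-level constraint sets are stored. The constraints of any system-level mode are assembled on demand from a mode vector, which is what lets the compositional approach avoid pre-enumerating system-level modes.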







Table 1: Components of the electrical circuit. Each A column lists, per constraint, the variable computed by that constraint in the corresponding causal assignment; "-" marks a constraint that does not appear in it.

Component                   Mode  Constraint          A^[1 2]       A^[1 2]_i3*   A^[1 2]_v8*   A^[1 2]_i11*  A^[2 1]       A^[2 1]_i3*   A^[2 1]_v8*   A^[2 1]_i11*
δ1: V                       1     v1 = uv             v1            v1            -             -             v1            v1            v1            -
δ2: Sw1                     1     i1 = 0              i1            -             -             -             -             -             -             -
                                  i2 = 0              i2            i2            i2            i2            -             -             -             -
                            2     i1 = i2             -             -             -             -             i1            -             -             -
                                  v1 = v2             -             -             -             -             v2            v2            v2            -
δ3: Parallel Connection1    1     v2 = v3             v3            v3            -             -             v3            v3            -             -
                                  v2 = v4             v2            v2            -             -             v4            -             v4            -
                                  i2 = i3 + i4        i4            i4            i4            i4            i2            -             -             -
δ4: L1                      1     i̇3 = v3 / L1        i̇3            i̇3            -             -             i̇3            i̇3            -             -
                                  i3 = ∫[t0,t] i̇3     i3            i3            -             -             i3            i3            -             -
δ5: Series Connection1      1     i4 = i5             i5            i5            -             -             i5            -             i5            -
                                  i4 = i6             i6            i6            -             -             i6            -             i6            -
                                  i4 = i7             i7            -             i7            i7            i4            -             i4            -
                                  v4 = v5 + v6 + v7   v4            v4            -             -             v7            -             v7            -
δ6: R1                      1     v5 = i5 * R1        v5            v5            -             -             v5            -             v5            -
δ7: C1                      1     v̇6 = i6 / C1        v̇6            v̇6            -             -             v̇6            -             v̇6            -
                                  v6 = ∫[t0,t] v̇6     v6            v6            -             -             v6            -             v6            -
δ8: Parallel Connection2    1     v7 = v8             v8            v7            v8            -             v8            -             v8            -
                                  v7 = v9             v7            -             v7            -             v9            -             -             -
                                  i7 = i8 + i9        i9            -             i9            i9            i7            -             i7            -
δ9: L2                      1     i̇8 = v8 / L2        i̇8            -             i̇8            i̇8            i̇8            -             i̇8            -
                                  i8 = ∫[t0,t] i̇8     i8            -             i8            i8            i8            -             i8            -
δ10: Sw2                    1     i9 = 0              -             -             -             -             i9            -             i9            -
                                  i10 = 0             -             -             -             -             i10           -             -             -
                                  i11 = 0             -             -             -             -             i11           -             -             i11
                            2     i9 = i10            i10           -             i10           -             -             -             -             -
                                  i9 = i11            i11           -             -             i11           -             -             -             -
                                  v9 = v10 + v11      v9            -             v9            -             -             -             -             -
δ11: R2                     1     v10 = i10 * R2      v10           -             v10           -             v10           -             -             -
δ12: C2                     1     v̇11 = i11 / C1      v̇11           -             v̇11           -             v̇11           -             -             -
                                  v11 = ∫[t0,t] v̇11   v11           -             v11           -             v11           -             -             -
δ13: Current Sensor11       1     i11* = i11          i11*          -             i11           i11*          i11*          -             -             i11*
δ14: Voltage Sensor8        1     v8* = v8            v8*           v8            v8*           v8            v8*           -             v8*           -
δ15: Current Sensor3        1     i3* = i3            i3*           i3*           i3            i3            i3*           i3*           -             -
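Read top to bottom, one A column of Table 1 gives a computation order: each listed constraint is solved for the variable shown in its cell (the causal assignment of Section 2.2). As a hedged illustration, the sketch below evaluates the column for the measured current i3* in mode [2 1] (Sw1 on, Sw2 off), which uses only the constraints of δ1, δ2, δ3, δ4, and δ15. The function name, the forward Euler step used for the integration constraint, and all numeric values are assumptions made for this example and are not taken from the paper.

```python
def simulate_i3_submodel(u_v, L1, i3_0, dt, steps):
    """Evaluate the i3* column for mode [2 1] of Table 1 in causal order.

    Returns the sequence of computed measurements i3*, one per time step.
    Integration uses forward Euler, an assumption of this sketch.
    """
    i3 = i3_0
    trace = []
    for _ in range(steps):
        v1 = u_v              # delta_1:  v1 = uv          computes v1
        v2 = v1               # delta_2:  v1 = v2 (mode 2) computes v2
        v3 = v2               # delta_3:  v2 = v3          computes v3
        di3 = v3 / L1         # delta_4:  di3/dt = v3 / L1 computes the derivative
        i3_measured = i3      # delta_15: i3* = i3         computes i3*
        trace.append(i3_measured)
        i3 = i3 + dt * di3    # delta_4:  integration constraint computes i3
    return trace


if __name__ == "__main__":
    # Hypothetical values: 2 V source, 4 H inductor, zero initial current.
    print(simulate_i3_submodel(u_v=2.0, L1=4.0, i3_0=0.0, dt=0.5, steps=3))
    # prints [0.0, 0.25, 0.5]
```

Comparing such a computed i3* against the sensor reading gives a residual. The other columns can be evaluated the same way, with measured values (for example v8*) acting as inputs where the table shows a sensor constraint computing the physical variable.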


For shorthand, we will refer only to the modes of the components with multiple modes. So, for the
circuit, we will refer only to components $\delta_2$ and $\delta_{10}$, and we will have four
possible mode vectors, [1 1], [1 2], [2 1], and [2 2].

The switching behavior of each component can be defined using a finite state machine or a similar
type of control specification. The state transitions may be attributed to controlled or autonomous
events. However, for the purposes of this paper, we view the switching behavior as a black box
where the mode change event is given, and refer the reader to the approaches already proposed in
the literature for modeling the switching behavior [1, 8].

2.2 Causality

Given a constraint $c$, which belongs to a specific mode of a specific component, the notion of a
causal assignment is used to specify a possible computational direction, or causality, for the
constraint $c$. This is done by defining which $v \in V_c$ is the dependent variable in equation
$\varepsilon_c$.

Definition 4 (Causal Assignment). A causal assignment $\alpha_c$ to a constraint
$c = (\varepsilon_c, V_c)$ is a tuple $\alpha_c = (c, v_c^{out})$, where $v_c^{out} \in V_c$ is
assigned as the dependent variable in $\varepsilon_c$. We use $V_c^{in}$ to denote the independent
variables in the constraint, where $V_c^{in} = V_c - \{v_c^{out}\}$.

In general, the set of possible causal assignments for a constraint $c$ is as big as $V_c$, because
each variable in $V_c$ can act as $v_c^{out}$. However, in some cases some causal assignments may
not be possible, e.g., if we have noninvertible nonlinear constraints. Also, if we assume integral
causality, then state variables must always be computed via integration, and so derivative
causality is not allowed. Further, when placed in the context of a model, additional causalities
may not be applicable, because the causal assignments of other constraints may limit the potential
causal assignments. To denote this concept, we use $\mathcal{A}_c$ to refer to the set of
permissible causal assignments of a constraint $c$.

For a given mode, we have the set of (specific) causal assignments over the entire model in its
mode, denoted using $\mathcal{A}^m$. So, some $\alpha \in \mathcal{A}^m$ would refer to the causal
assignment of some constraint in some component of the model in its correct mode. The consistency
of the causal assignments $\mathcal{A}^m$ is defined as follows.

Definition 5 (Consistent Causal Assignments). Given a mode $m$, we say that a set of causal
assignments $\mathcal{A}^m$ for a model $M$ is consistent if (i) for all $v \in U_M \cup \Theta_M$,
$\mathcal{A}^m$ does not contain any $\alpha$ such that $\alpha = (c, v)$, i.e., input or parameter
variables cannot be the dependent variables in the causal assignment; (ii) for all $v \in Y_M$,
$\mathcal{A}^m$ does not contain any $\alpha = (c, v_c^{out})$ where $v \in V_c^{in}$, i.e., an
output variable can only be used as the dependent variable; and (iii) for all
$v \in V_M - U_M - \Theta_M$, $\mathcal{A}^m$ contains exactly one $\alpha = (c, v)$, i.e., every
variable that is not an input or






parameter is computed by only one (causal) constraint.              from one constraint to the related constraints (those sharing
   With causality information, we can efficiently derive a set      a variable with the fixed causality constraint). This will help
of submodels for residual generation [12].                          to reduce the number of permissible causal assignments. For
                                                                    example, if we again assume integral causality, then any
                                                                    constraint involving a state variable cannot be in a causal
3 Hybrid Systems Diagnosis Approach                                 assignment where the state variable is the dependent/output
We propose a hybrid systems diagnosis approach based on             variable, because the integration constraint is the one that
structural model decomposition. In this approach, we gener-         must compute it. For such a constraint, 1 < |Ac | < |Vc |.
ate submodels for the purpose of computing residuals. Resid-           Given a system model and a set of outputs, Algorithm 1
uals can then be used for diagnosis.                                searches over the model constraints to reduce the set of per-
   For hybrid systems, however, the problem is that these sub-      missible causal constraints based on system-level informa-
models may change as the result of a mode change. That is,          tion.1 First, it determines which constraints are mode-variant,
we may obtain two different submodels when decomposing              i.e., they can appear/disappear from the model depending on
the model in two different modes. There are two approaches          the mode (so belong to components with multiple modes),
to this problem. One is to find a set of submodels that work        and which are mode-invariant, i.e., they are present in all
for all modes, and can be easily reconfigured by executing          system modes (so belong to components with a single mode).
only local mode changes within the submodels [10]. This             It is only the mode-invariant constraints for which causal
approach requires the least online effort, with some offline        assignments can be removed. We then construct a queue
effort in finding these submodels, which exist only in limited      of variables from which to propagate. This queue contains
cases. The other approach is to generate submodels for the          the inputs and parameters (which must always be indepen-
current mode, and when a mode change occurs, reconfigure            dent/input variables in constraints), and the outputs (which
the submodels to be consistent with the new system mode.            must always be dependent/output variables in constraints).
This is the approach we develop in this paper.                      We create a variable set V that refers to the variables that are
   In order to execute this type of approach, however, we           resolved, i.e., either they are inputs/parameters or there is a
must be able to efficiently reconfigure submodels online. In        constraint with a single causal assignment that will compute
order to do this, we take advantage of causality in two ways.       the variable. So, V is initially set to include UM and ΘM .
First, we perform an offline model analysis to determine            Further, for any mode-invariant constraint that has only a
which causalities of the hybrid system model are not per-           single causal assignment, the output variable is added to V ,
missible, i.e., they will never be used in any mode of the          and all variables of the constraint added to the queue.
system (determine AM for a model). Second, we use an effi-             The main idea is to analyze the causality restrictions im-
cient causality reassignment algorithm, so that the causality       posed by variables in the queue, which will be propagated
of a hybrid systems model is updated incrementally when             throughout the model. While the queue is nonempty, we pop
a mode changes (given A for the previous mode, compute              a variable v off the queue. We then count the number of con-
it for the new mode). Since causal changes usually only             straints involving v that have no set causal assignment yet,
propagate in a local area in the model, causality does not          including constraints that are both mode-variant and mode-
need to be reassigned at the global model level. Together,          invariant. We then go through all mode-invariant constraints
these algorithms reduce the number of potential causalities to      involving v, and remove causal assignments that will never
search within the model decomposition algorithm and allow           be possible. There are three conditions in which this holds:
efficient submodel reconfiguration.                                 a causal assignment is not possible in any system mode if (i)
                                                                    the output variable is already computed by another constraint,
4 Causality Assignment                                              or is an input/parameter (i.e., in V ), (ii) any of the input vari-
In order to compute minimal submodels for residual genera-          ables are in the model outputs (i.e., in Y ), or (iii) v is not yet
tion, we need a model M with a valid causal assignment Am .         computed by any constraint (i.e., not in V ), there is only one
As described in Section 2, causality assignment can only be         noncausal constraint involving v remaining, and v is not the
defined for a given mode. However, there are some causal            output in this causality (in this case, v needs to be computed
assignments that are independent of the system mode, i.e.,          by some constraint and there is only one option left, so this
they are valid for all system modes. We capture this through        constraint must only be in the causality computing v). These
the notion of permissible causal assignments, introduced as         causal assignments are removed. If only one is left, then
AM in Section 2.                                                    we add the output for that causal assignment to V , and add
   Given a model with a number of modes, some constraints           the constraint’s variables to the queue. The algorithm stops
will always have the same causal assignment in all modes,           when causalities can no longer be removed, i.e., there are
and we say these constraints are in fixed causality.                not enough restrictions imposed by the current permissible
                                                                    causalities to reduce AM further.
Definition 6 (Fixed Causality). A constraint cδ is in fixed
causality if (i) component δ has only a single mode, i.e.,          Example 6. For the circuit, we assume integral causality, so
|Cδ | = 1, and (ii) for cδ in the single C ∈ Cδ , it always has     all constraints with the state variables are limited to causal
the same causal assignment in all system modes.                     assignments in which the states are computed via integration.
                                                                    Further, the constraint with uV is also fixed so that uV is the
   If a constraint is in fixed causality, then |Ac | = 1, i.e.,
                                                                    independent variable. For any specified outputs, AM is also
there is only one permissible causal assignment. For ex-
ample, if we make the integral causality assumption, then               1
                                                                          For structural model decomposition, some output variables may
constraints computing state variables will always be in the         become input variables and so the causal assignments permitting
integral causality, and thus they are in fixed causality.           that must be retained. Therefore, the algorithm only reduces the
   Additionally, when the constraint is viewed in the context       permissible set of causal assignments for a given set of outputs
of the model, the concept of fixed causality can be propagated      Y ⊆ YM .




reduced so that they can appear only as dependent variables.

Algorithm 1 A_M ← ReduceCausality(M, A_M, Y)
    C_invariant ← ∅
    C_variant ← ∅
    for all δ ∈ M do
        if |C_δ| = 1 then
            C_invariant ← C_invariant ∪ C_δ^1
        else
            C_variant ← C_variant ∪ ⋃_{C ∈ C_δ} C
    Q ← U_M ∪ Θ_M ∪ Y
    V ← U_M ∪ Θ_M
    for all c ∈ C_invariant do
        if |A_c| = 1 then
            (c, v) ← A_c(1)
            Q ← Q ∪ V_c
            V ← V ∪ {v}
    while |Q| > 0 do
        v ← pop(Q)
        n_noncausal ← 0
        for all c ∈ C_invariant(v) do
            if |A_c| > 1 or (|A_c| = 1 and v_{A_c(1)} ∉ V) then
                n_noncausal ← n_noncausal + 1
        for all c ∈ C_variant(v) do
            n_noncausal ← n_noncausal + 1
        for all c ∈ C_invariant(v) do
            if |A_c| > 1 or (|A_c| = 1 and v_{A_c(1)} ∉ V) then
                for all (c′, v′) ∈ A_c do
                    if v′ ∈ V then
                        A_c ← A_c − (c′, v′)
                    if (V_c − {v}) ∩ Y ≠ ∅ then
                        A_c ← A_c − (c′, v′)
                    if n_noncausal = 1 and v′ ∉ V and v′ ≠ v then
                        A_c ← A_c − (c′, v′)
                if |A_c| = 1 then
                    (c′, v′) ← A_c(1)
                    Q ← Q ∪ (V_c′ − V)
                    V ← V ∪ {v′}

   With A_M defined, we can perform causality assignment for a given mode, m. Because A_M was reduced as much as possible, causality assignment (and, later, reassignment) will be more efficient than otherwise. Algorithm 2 describes the causality assignment process for a model given a mode. Here, the model is assumed to not have an initial causal assignment. Causal assignment works by propagating causal restrictions throughout the model. The process starts at inputs, which must always be independent variables in constraints; outputs, which must always be the dependent variables in constraints; and variables involved in fixed causality constraints. From these variables, we should be able to propagate throughout the model and compute a valid causal assignment for the model in the given mode. For the purposes of this paper, we assume integral causality and that the model possesses no algebraic loops.² In this case, there is only one valid causal assignment (this is a familiar concept within bond graphs) [13].

² If algebraic loops exist, the algorithm will terminate before all constraints have been assigned a causality. Extending the algorithm to handle algebraic loops is similar to that for bond graphs; a constraint without a causality assignment is assigned one arbitrarily, and then effects of this assignment are propagated until nothing more is forced. This process repeats until all constraints have been assigned causality.

Algorithm 2 A ← AssignCausality(M, m, A)
    A ← ∅
    V ← U_M ∪ Θ_M
    Q ← U_M ∪ Θ_M ∪ Y_M
    for all c ∈ C_{M_m} do
        if |A_c| = 1 then
            (c, v) ← A_c(1)
            Q ← Q ∪ {v}
    while |Q| > 0 do
        v ← pop(Q)
        for all c ∈ C_{M_m}(v) do
            if c ∉ {c : (c, v) ∈ A} then
                α* ← ∅
                for all α ∈ A_c do
                    if V_c − {v_α*} ∪ V ≠ ∅ then
                        α* ← α
                    else if α_v ∈ Y then
                        α* ← α
                    else if v_α* = v and |C_{M_m}(v)| − |{c′ : (c′, v′) ∈ A ∧ v ∈ v_c}| = 1 then
                        α* ← α
                if α* ≠ ∅ then
                    A ← A ∪ {α*}
                    Q ← Q ∪ (V_c − V)
                    V ← V ∪ {v_α*}

   Specifically, the algorithm works as follows. Similar to Algorithm 1, we keep a queue of variables to propagate causality restrictions, Q, and a set of variables that are computed in the current causality, V. Initially, V is set to U and Θ, because these variables are not to be computed by any constraint. Q is set to U, Θ, and Y, since the causality of constraints is restricted to U and Θ variables being independent variables and Y variables being dependent variables. We also add to Q any variables involved in constraints that have only one permissible causal assignment, because this will also restrict other causal assignments. The set of causal assignments is maintained in A.
   The algorithm goes through the queue, inspecting variables. For a given variable, we obtain all constraints it is involved in, and for each one that does not yet have a causal assignment (in A), we go through all permissible causal assignments, and determine if the causality is forced into one particular causal assignment, α*. If so, we assign that causality and propagate by adding the involved variables to the queue. A causal assignment α = (c, v) is forced in one of three cases: (i) v is in Y, (ii) all variables other than v of the constraint are already in V, and (iii) v is not yet in V, and all but one of the constraints involving v have an assigned causality, in which case no constraint is computing v and there is only one remaining constraint that must compute v.

Example 7. Consider the mode m = [1 2]. Here, A_[1 2] is given in column 4 of Table 1, denoted by the v_c^out in the causal assignment. In this mode, the first switch is off, so i1 and i2 act as inputs. Given the integral causality assumption, a unique causal assignment to the model exists and is specified in the column.

Example 8. Consider the mode m = [2 1]. Here, A_[2 1] is given in column 8 of Table 1. In this mode, the second switch is off, so i9, i10, and i11 act as inputs. Given the integral causality assumption, a unique causal assignment to the model exists and is specified in the column. Note that some causal assignments are the same as in m = [1 2], while others are different. In changing from one mode to another, an efficient causality reassignment should be able to determine which constraints need to change causality, and do the work for only that portion of the model.³ Causal assignments that do not change from mode to mode are in fixed causality and found by Algorithm 1.

³ Note that this particular circuit was carefully chosen so that causality does propagate across much of the circuit, in order to demonstrate the causality reassignment algorithm.
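The propagation idea behind causality assignment can be illustrated with a small sketch. It is not the authors' implementation: it follows the three forcing cases stated in the prose, represents each causal assignment (c, v) simply by its output variable v, and additionally assigns constraints with a single permissible assignment up front (an assumption made here so that integral causality seeds the propagation). The function name, the data layout, and the three-constraint toy model are all hypothetical.

```python
from collections import deque

def assign_causality(constraints, permissible, inputs, outputs):
    """constraints: name -> set of variables.
    permissible: name -> set of variables the constraint may compute."""
    assigned = {}                      # constraint name -> variable it computes
    resolved = set(inputs)             # V: inputs plus variables already computed
    queue = deque(sorted(set(inputs) | set(outputs)))

    def unassigned_with(v):
        return [c for c, vs in constraints.items() if v in vs and c not in assigned]

    def assign(c, v):
        assigned[c] = v
        resolved.add(v)
        queue.extend(sorted(constraints[c]))   # revisit neighbours of c

    # Assumption: fixed-causality constraints are assigned before propagation.
    for c in sorted(constraints):
        if len(permissible[c]) == 1:
            (v,) = permissible[c]
            assign(c, v)

    while queue:
        v = queue.popleft()
        for c in unassigned_with(v):
            for w in sorted(permissible[c]):
                if w in resolved:
                    continue                   # w is already computed elsewhere
                forced = (
                    w in outputs                          # case (i)
                    or constraints[c] - {w} <= resolved   # case (ii)
                    or unassigned_with(w) == [c]          # case (iii)
                )
                if forced:
                    assign(c, w)
                    break
    return assigned

# Toy model (hypothetical): dx = u - x, x = integral of dx, y = x.
constraints = {"c1": {"dx", "u", "x"}, "c2": {"x", "dx"}, "c3": {"y", "x"}}
permissible = {"c1": {"dx", "x"}, "c2": {"x"}, "c3": {"y", "x"}}
result = assign_causality(constraints, permissible, inputs={"u"}, outputs={"y"})
print(sorted(result.items()))
```

In this toy model the integrator fixes x, which then forces c1 to compute dx and c3 to compute y, so every constraint receives exactly one causal assignment.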




5 Structural Model Decomposition
For a given causal model in a given mode, we have the equivalent of a continuous systems model for the purpose of structural model decomposition, and we can compute minimal submodels using the GenerateSubmodel algorithm described in our previous work [12]. The algorithm finds a submodel, which computes a set of local outputs given a set of local inputs, by searching over the causal model. It starts at the local outputs, and propagates backwards through the causal constraints, finding which constraints and variables must be included in the submodel. When possible, causal constraints are inverted in order to take advantage of local inputs. Additional information and the pseudocode are provided in [12].
   In the context of residual generation, we set the local output set to a single measured value, and the local inputs to all other measured values and the (known) system inputs. That is, we exploit the analytical redundancy provided by the sensors in order to find minimal submodels to compute estimated values of sensor outputs. In this framework, we consider one submodel per sensor, each producing estimated values for that sensor.
   Assuming that the set of sensors does not change from mode to mode, then for a hybrid system we have one submodel for each sensor.⁴ However, since the set of constraints changes from mode to mode, the result of the GenerateSubmodel algorithm will also change. When a mode changes, we first reassign causality to the model for the new mode. Then, we generate new updated submodels for that mode using GenerateSubmodel. In order to reduce the work performed by this algorithm when a mode changes, we use an efficient causality reassignment algorithm. That, coupled with the reduced set A_M, significantly reduces the work of the algorithm compared to a naive approach, where the submodels are completely regenerated for a new mode. Additionally, when the system transitions to a new mode, the causal assignments for the previous mode can be stored, so that when the system changes to a mode that has already been visited, it just takes the causal assignments that were stored previously. Similarly, submodels generated in previously visited modes can be saved and reused when the mode reappears.

Example 9. The causal assignments for the submodels in the different modes are shown in Table 1. For example, consider the submodel for i11* in m = [2 1]. Here, i11 is zero, since Sw2 is off, and therefore we have just two constraints needed to compute i11*. In mode m = [1 2], i3* can be computed using 16 constraints, where v8* is used as a local input to the submodel.

   Note that a submodel for an output may have different states in two different modes (e.g., in moving from m = [2 1] to m = [1 2], the i3* submodel adds state v6). In order to continue tracking, new states must be initialized. For the purposes of this paper, we assume that in any one system mode, all states are included in at least one submodel.⁵ Therefore, a submodel that gets a state added in a new mode can initialize using the estimated value from another submodel in the previous mode.

⁴ By assuming that the sensor set does not change, we mean only that sensors are not added/removed to/from the physical system upon a mode change. They are still allowed to be connected/disconnected, but still appear in the system model even when disconnected. For example, if a disconnected sensor outputs 0, then that needs to still be in the model.
⁵ If this is not the case, then a state is not observable in some mode. Estimation techniques to handle that situation are outside the scope of this paper.

Algorithm 3 A_m′ ← ReassignCausality(M, m, A_m, A)
    A_m′ ← ∅
    for all (c, v) ∈ A_m do
        if c ∈ C_{M_m} then
            A_m′ ← A_m′ ∪ {(c, v)}
    V ← ∅
    Q ← ∅
    for all δ ∈ M where m_δ ≠ m′_δ do
        Q ← Q ∪ V_δ
    while |Q| > 0 do
        v ← pop(Q)
        for all c ∈ C_{M_m}(v) do
            if c ∉ {c : (c, v) ∈ A_m′} then
                α* ← ∅
                for all α ∈ A_c do
                    if V_c − {v_α*} ∪ V ≠ ∅ then
                        α* ← α
                    else if α_v ∈ Y then
                        α* ← α
                    else if v_α* = v and |C_{M_m}(v)| − |{c′ : (c′, v′) ∈ A ∧ v ∈ v_c}| = 1 then
                        α* ← α
                if α* ≠ ∅ then
                    if ∃α ∈ A_m′ where v_α = v_α* then
                        A_m′ ← A_m′ − {α*}
                        Q ← Q ∪ (V_{c_α*} − V)
                    A_m′ ← A_m′ ∪ {α*}
                    Q ← Q ∪ (V_c − V)
                    V ← V ∪ {v_α*}
            else if V_c − V = {v} then
                V ← V ∪ {v}
                Q ← Q ∪ {v}

6 Online Causality Reassignment
As we mentioned before, from the initial mode in the system with a valid set of causal assignments, we compute minimal submodels. However, when the system transitions to a different mode, any submodel containing constraints of a switching component will no longer be consistent, and must be recomputed. In order to do this, we need to know the causal assignments for the new mode. We can reassign causality in an efficient incremental process to avoid having to reassign causality to the whole model, as causal changes typically propagate only to a small local area in the model [11].
   Algorithm 3 presents the causality reassignment procedure. The main ideas are based on the hybrid sequential causality assignment procedure (HSCAP) developed for hybrid bond graphs in [11]. In our more general modeling framework, we find that similar ideas apply. Essentially, we start with a causal model in a given mode. We then switch to a different mode, so for the switching components we have a new set of constraints in the model. We need to find causal assignments for these constraints. It is likely that some of the necessary causal assignments will conflict with causal assignments from the old mode; therefore, we have to resolve the conflict and propagate the change. The change will propagate only as far as it needs to in order to obtain a valid causal assignment for the model in the new mode. Here, propagation stops along a computational path when a new causal assignment does not conflict with a previous assignment.
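Reassignment is only needed the first time a mode is entered: as noted in Section 5, causal assignments and submodels from previously visited modes can be stored and reused. A minimal sketch of that reuse follows, assuming a hypothetical `ModeCache` helper and a placeholder `compute` callable standing in for causality reassignment plus submodel generation (neither name comes from the paper).

```python
class ModeCache:
    """Stores one result per system mode and recomputes only on first visit."""

    def __init__(self, compute):
        self._compute = compute   # callable: mode tuple -> assignments/submodels
        self._store = {}
        self.misses = 0           # number of times a mode had to be computed

    def get(self, mode):
        key = tuple(mode)         # a mode vector such as [2, 1] becomes a dict key
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._compute(key)
        return self._store[key]

# Placeholder computation (hypothetical): a real system would reassign
# causality and regenerate the submodels for the given mode here.
cache = ModeCache(lambda mode: {"mode": mode, "submodels": "..."})

# Mode sequence of the kind used in the demonstration: [2 1], [1 2], [2 1].
visited = [cache.get(m) for m in ([2, 1], [1, 2], [2, 1])]
print(cache.misses)
```

The third lookup returns the stored entry for the first mode, so only two computations are performed for three mode visits.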




   The algorithm works similarly to Algorithm 2. It maintains a queue of variables to inspect and a set V of variables that are known to be computed in the causality for the new mode (so it includes only variables from new assignments or confirmed assignments made in the new mode). In this case, we initialize the queue only to variables involved in the constraints of the switching components. If no components are switching, the queue will be empty and no work will be done. The main idea is that the required causal changes from the variables placed in the queue will, on average, be limited to a very small area. The causal assignments for the new mode are initialized to those for the previous mode, for any constraints that still exist in the new mode. Some of these may conflict with the new mode and will be removed and replaced with different assignments.
   As in all the other causality algorithms, we go through the queue and propagate the restrictions we find on causality. We pop a variable off the queue, and look at all involved constraints. If the constraint is not causal, then we need to assign causality. We do the same analysis as before to find if a causality is forced, but checking only with respect to V, which includes only variables with confirmed causal assignments computing them in the new mode. If we find a constraint that is forced into a particular causal assignment for the new mode, we make the assignment. If it conflicts with one already in the set of causal assignments (copied from the old mode), then we remove the old assignment and add the new one, adding the involved variables to the queue so that changes are propagated.
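The incremental procedure can be sketched on a toy example. This is a simplified illustration, not Algorithm 3 itself: it carries over old assignments, seeds the queue with the switching component's variables, replaces conflicting assignments, and confirms untouched ones. It omits the counting-based forcing case, treats inputs as known from the start, and uses a hypothetical series circuit (switch, resistor, current sensor) that is not the paper's circuit. All names are assumptions.

```python
from collections import deque

def reassign_causality(constraints, permissible, previous, switched_vars, inputs, outputs):
    # Carry over assignments for constraints that still exist in the new mode.
    assigned = {c: v for c, v in previous.items() if c in constraints}
    confirmed = set()                     # V: outputs confirmed in the new mode
    queue = deque(sorted(switched_vars))  # start only at the switching component
    changed = []                          # constraints that received a new assignment

    while queue:
        v = queue.popleft()
        for c in [c for c in constraints if v in constraints[c]]:
            known = confirmed | set(inputs)
            if c not in assigned:
                for w in sorted(permissible[c]):
                    if w in known:
                        continue
                    if (len(permissible[c]) == 1 or w in outputs
                            or constraints[c] - {w} <= known):
                        # Remove any carried-over assignment that also computes w.
                        for old in [k for k, x in assigned.items() if x == w]:
                            del assigned[old]
                            queue.extend(sorted(constraints[old]))
                        assigned[c] = w
                        confirmed.add(w)
                        changed.append(c)
                        queue.extend(sorted(constraints[c] - {w}))
                        break
            elif assigned[c] not in confirmed and constraints[c] - {assigned[c]} <= known:
                # The old assignment is still valid: confirm it and move on.
                confirmed.add(assigned[c])
                queue.append(assigned[c])
    return assigned, changed

# New mode: switch off (i = 0). Old mode: switch on (v_sw = 0).
constraints = {"s_off": {"i"}, "kvl": {"u", "v_sw", "v_R"},
               "ohm": {"v_R", "i"}, "sensor": {"y", "i"}}
permissible = {"s_off": {"i"}, "kvl": {"v_sw", "v_R"},
               "ohm": {"v_R", "i"}, "sensor": {"y"}}
previous = {"s_on": "v_sw", "kvl": "v_R", "ohm": "i", "sensor": "y"}
assigned, changed = reassign_causality(
    constraints, permissible, previous, {"i", "v_sw"}, inputs={"u"}, outputs={"y"})
print(sorted(assigned.items()), changed)
```

In this toy case the switch forces the resistor and loop constraints to flip causality, while the sensor constraint keeps its old assignment and is merely confirmed.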


7 Demonstration of Approach
For the circuit example, we consider two modes: one where Sw1 is on and Sw2 is off (i.e., m = [2 1]), and one where Sw1 is off and Sw2 is on (i.e., m = [1 2]). To demonstrate the approach, we consider a scenario in which the system starts in m = [2 1], switches to m = [1 2] at t = 10 s, and switches back to m = [2 1] at t = 20 s. Additionally, at t = 15 s, a fault is injected, specifically, an increase in R1.

Figure 2: Measured and estimated values with an increase in R1 at t = 15 s. Panels: (a) i3* (A), (b) v8* (V), (c) i11* (A); each panel plots the measured and estimated values against time (s).
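The scenario timeline and the comparison of measured against estimated values can be written down compactly. The sketch below is only illustrative: the threshold-based residual check, the function names, and the sample numbers are assumptions and are not taken from the paper, which does not specify a detection rule here. The mode at exactly t = 10 s and t = 20 s is assumed to be the new mode.

```python
def mode_at(t):
    """Mode schedule of the demonstration scenario (times in seconds)."""
    # [2 1] until 10 s, [1 2] from 10 s until 20 s, then [2 1] again.
    return (1, 2) if 10.0 <= t < 20.0 else (2, 1)

FAULT_TIME = 15.0  # the increase in R1 is injected at t = 15 s

def residual_flags(measured, estimated, threshold):
    """Flag samples where |measured - estimated| exceeds the threshold."""
    return [abs(m - e) > threshold for m, e in zip(measured, estimated)]

# Made-up sample values, for illustration only.
measured = [1.0, 1.0, 1.6]
estimated = [1.0, 1.05, 1.0]
flags = residual_flags(measured, estimated, threshold=0.2)
print(mode_at(FAULT_TIME), flags)
```

With this schedule the fault occurs while the system is in m = [1 2], which is the mode in which the discrepancy first shows up in the i3* submodel.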
   Fig. 2 shows the measured and submodel-estimated values for the sensors. Up through the first mode change, the outputs are correctly tracked by the submodels. At the first mode change at 10 s, the submodels reconfigure and track correctly up to 15 s, when the fault is injected, and a discrepancy is observed in the i3* submodel. Specifically, the current increases above what is expected. The other submodels in this mode are independent of the fault, and so continue to track correctly. When the second mode change occurs, i11* can still be tracked correctly, since its estimation remains independent of the fault. However, we now see a discrepancy in v8*, as the measurement increases above what is expected. This transient occurs because we switch from a mode in which the submodel is independent of the fault to one where it is dependent on the fault. Fault isolation can be performed by using the information that in m = [1 2], an increase in R1 would produce an increase in the residual for i3*, and in m = [2 1], it would produce an increase also in the v8* residual.

8 Related Work
Modeling and diagnosis for hybrid systems have been an important focus of study for researchers from both the FDI and DX communities during the last 15 years. In the FDI community, several hybrid system diagnosis approaches have been developed. In [14], parameterized ARRs are used. However, the approach is not suitable for systems with high nonlinearities or a large set of modes. A different approach is taken in [15], but it uses purely discrete models.
   In the DX community, some approaches have used different kinds of automata to model the complete set of modes and transitions between them. In those cases, the main research topic has been hybrid system state estimation, which has been done using probabilistic (e.g., some kind of filter [16] or hybrid automata [4]) or set-theoretic approaches [5].
   Another solution has been to use an automaton to track the system mode, and then use a different technique to diagnose the continuous behavior (for example, using a set of ARRs for each mode [3], or parameterized ARRs for the complete set of modes [17]). Nevertheless, one of the main difficulties regarding state estimation using these techniques is the need to pre-enumerate the set of possible system-level modes and mode transitions, which is difficult for complex systems. We avoid this problem by using a compositional approach.
   Regarding hybrid systems modeling, there are several proposals. For HBGs [8, 18], there are two main approaches: those that use switching elements with fixed causality [18–20], and those that use ideal switching elements that change causality [8]. The advantages of the latter are that the modeling of hybrid systems is done through a special kind of
hybrid component (which avoids the mode pre-enumeration in        [8] P.J. Mosterman and G. Biswas. A comprehensive
the system), and also changes are handled in a very efficient          methodology for building hybrid models of physical
way [11]. Finally, in [10] the HBGs are used to compute                systems. Artificial Intel., 121(1-2):171 – 209, 2000.
minimal submodels (Hybrid Possible Conflicts, HPCs) simi-         [9] S. Narasimhan and G. Biswas. Model-Based Diagnosis
lar to the minimal submodels presented in this paper. HPCs             of Hybrid Systems. IEEE Trans. Syst. Man. Cy. Part A,
can track hybrid systems behavior, efficiently changing on-            37(3):348–361, May 2007.
line for each mode the PC simulation model, by using block
                                                                  [10] A. Bregon, C. Alonso, G. Biswas, B. Pulido, and
diagrams as in [11], and performing diagnosis without pre-
enumerating the set of modes in the system. However, HPCs              N. Moya. Hybrid systems fault diagnosis with possi-
rely on HBG modeling and do not provide a generalized                  ble conflicts. In Proceedings of the 22nd International
framework for hybrid systems.                                          Workshop on Principles of Diagnosis, pages 195–202,
                                                                       Murnau, Germany, October 2011.
                                                                  [11] I. Roychoudhury, M. Daigle, G. Biswas, and X. Kout-
9 Conclusions                                                          soukos. Efficient simulation of hybrid systems: A
In this work, we have developed a compositional modeling               hybrid bond graph approach. SIMULATION: Trans-
framework for hybrid systems. Using computational causal-              actions of the Society for Modeling and Simulation
ity, we developed efficient causality assignment algorithms.           International, 87(6):467–498, June 2011.
Given this causal information, submodels computed using           [12] I. Roychoudhury, M. Daigle, A. Bregon, and B. Pulido.
structural model decomposition can be computed and recon-              A structural model decomposition framework for sys-
figured efficiently. The approach was demonstrated with a              tems health management. In Proceedings of the 2013
circuit system. In future work, we will further develop the            IEEE Aerospace Conference, March 2013.
hybrid systems diagnosis approach for the single and multi-       [13] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg.
ple fault cases, and we will approach the diagnosis task in            Systems Dynamics: Modeling and Simulation of Mecha-
a distributed manner. The assumption of one submodel per               tronic Systems. John Wiley & Sons, Inc., NY, 2000.
sensor can also be dropped, using the extended framework
developed in [21, 22].                                            [14] V. Cocquempot, T. El Mezyani, and M. Staroswiecki.
                                                                       Fault detection and isolation for hybrid systems using
                                                                       structured parity residuals. In 5th Asian Control Con-
Acknowledgments                                                        ference, volume 2, pages 1204–1212, July 2004.
This work has been funded by the Spanish MINECO                   [15] J. Lunze. Diagnosis of quantised systems by means of
DPI2013-45414-R grant and the NASA SMART-NAS                           timed discrete-event representations. In Proceedings of
project in the Airspace Operations and Safety Program of               the Third International Workshop on Hybrid Systems:
the Aeronautics Mission Directorate.                                   Computation and Control, HSCC ’00, pages 258–271,
                                                                       London, UK, 2000. Springer-Verlag.
References                                                        [16] X. Koutsoukos, J. Kurien, and F. Zhao. Estimation
                                                                       of distributed hybrid systems using particle filtering
[1] T. A. Henzinger.      The theory of hybrid automata.               methods. In In Hybrid Systems: Computation and
    Springer, 2000.                                                    Control (HSCC 2003). Springer Verlag Lecture Notes
[2] T. Rienmüller, M. Bayoudh, M.W. Hofbaur, and                       on Computer Science, pages 298–313. Springer, 2003.
    L. Travé-Massuyès. Hybrid Estimation through Syn-             [17] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Di-
    ergic Mode-Set Focusing. In 7th IFAC Symposium on                  agnosis of a Class of Non Linear Hybrid Systems by
    Fault Detection, Supervision and Safety of Technical               On-line Instantiation of Parameterized Analytical Re-
    Processes, pages 1480–1485, Barcelona, Spain, 2009.                dundancy Relations. In 20th International Workshop
[3] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Cou-                  on Principles of Diagnosis, pages 283–289, 2009.
    pling continuous and discrete event system techniques         [18] W. Borutzky. Representing discontinuities by means of
    for hybrid system diagnosability analysis. In 18th Eu-             sinks of fixed causality. In F.E. Cellier and J.J. Granda,
    ropean Conf. on Artificial Intel., pages 219–223, 2008.            editors, Proc. of the Int. Conf. on Bond Graph Modeling,
                                                                       pages 65–72, 1995.
[4] M.W. Hofbaur and B.C. Williams. Hybrid estimation
    of complex systems. IEEE Trans. on Sys., Man, and             [19] M. Delgado and H. Sira-Ramirez. Modeling and simu-
    Cyber, Part B: Cyber., 34(5):2178–2191, 2004.                      lation of switch regulated dc-to-dc power converters of
                                                                       the boost type. In IEEE Int. Conf. on Devices, Circuits
[5] E. Benazera and L. Travé-Massuyès. Set-theoretic es-               and Systems, pages 84–88, December 1995.
    timation of hybrid system configurations. Trans. Sys.         [20] P.J. Gawthrop. Hybrid Bond Graphs Using Switched I
    Man Cyber. Part B, 39:1277–1291, October 2009.                     and C Components. CSC report 97005, Centre for Sys.
[6] S. Narasimhan and L. Brownston. HyDE: A General                    and Control, Faculty of Eng., Glasgow, U.K., 1997.
    Framework for Stochastic and Hybrid Model-based               [21] A. Bregon, M. Daigle, and I. Roychoudhury. An inte-
    Diagnosis. In Proc. of the 18th Int. WS. on Principles             grated framework for distributed diagnosis of process
    of Diagnosis, pages 186–193, May 2007.                             and sensor faults. In 2015 IEEE Aerospace Conf., 2015.
[7] L. Trave-Massuyes and R. Pons. Causal ordering                [22] M. Daigle, I. Roychoudhury, and A. Bregon.
    for multiple mode systems. In Proceedings of the                   Diagnosability-based sensor placement through struc-
    Eleventh International Workshop on Qualitative Rea-                tural model decomposition. In Second Euro. Conf. of
    soning, pages 203–214, 1997.                                       the PHM Society 2014, pages 33–46, 2014.








                            Device Health Estimation by Combining
                        Contextual Control Information with Sensor Data
   Tomonori Honda and Linxia Liao and Hoda Eldardiry and Bhaskar Saha and Rui Abreu
                      Palo Alto Research Center, Palo Alto, California, USA
      e-mail: {tomo.honda, linxia.liao, hoda.eldardiry, bhaskar.saha, rui.maranhao}@parc.com

                                      Radu Pavel and Jonathan C. Iverson
                                      TechSolve, Inc., Cincinnati, Ohio, USA
                                      e-mail: {pavel, iverson}@TechSolve.org
                        Abstract

     The goal of this work is to bridge the gap between business decision making and real-time
     factory data. Beyond real-time data collection, we aim to provide analysis capability to
     obtain insights from the data and convert the learnings into actionable recommendations.
     We focus on analyzing device health conditions and propose a data fusion method that
     combines sensor data containing limited diagnostic signals with the device’s operating
     context. We propose a segmentation algorithm that provides a temporal representation of the
     device’s operating context, which is combined with sensor data to facilitate device health
     estimation. Sensor data is decomposed into features by time-domain and frequency-domain
     analysis. Principal component analysis (PCA) is used to project the high-dimensional
     feature space into a low-dimensional space, followed by a linear discriminant analysis
     (LDA) to search for the optimal separation among different device health conditions. Our
     industrial experimental results show that by combining device operating context with sensor
     data, our proposed segmentation and PCA-LDA approach can accurately identify various device
     imbalance conditions even for limited sensor data which could not be used to diagnose
     imbalance on its own.

1 Introduction
The growing Internet of Things is predicted to connect 30 billion devices by 2020 [1]. This will
bring in tremendous amounts of data and drive the innovations needed to realize the vision of
Industry 4.0: cyber-physical systems monitoring physical processes, and communicating and
cooperating with each other and with humans in real time. One of the key challenges to be
addressed is how to analyze large amounts of data to provide useful and actionable information
for business intelligence and decision making. In particular, such analysis is needed to prevent
unexpected downtime and its significant impact on overall equipment effectiveness (OEE) and
total cost of ownership (TCO) in many industries. Continuous monitoring of equipment and early
detection of incipient faults can support optimal maintenance strategies, prevent downtime,
increase productivity, and reduce costs.
   A significant number of anomaly detection and diagnosis methods have been proposed for
machine fault detection and machine health condition estimation. Chandola et al. [2] discuss
various categories of anomaly detection technologies and their assumptions as well as their
computational complexity. Several approaches, such as statistical methods [3], neural network
methods [4] and reliability methods [5], have been applied to detect anomalies for various types
of equipment. The philosophies and techniques of monitoring and predicting machine health with
the goal of improving reliability and reducing unscheduled downtime of rotary machines are
presented by Lee et al. [6].
   Many of these methods focus on analyzing, combining, and modeling sensor data (e.g.
vibration, current, acoustic signals) to detect machine faults. One issue that remains mostly
unaddressed in these methods is that they rarely consider the varying operating context of the
machine. In many cases, false alarms are generated due to a change in machine operation (e.g.
rotational speed) rather than a change in machine condition. A major challenge in addressing
this issue is that most machine controllers are built with proprietary communication protocols,
which leads to a barrier in obtaining control parameters to understand the context under which
the machine is operating. Recently, the MTConnect open protocol [7] was developed to connect
various legacy machines independent of the controller providers. MTConnect provides an
unprecedented opportunity to monitor machine operating context in real time. In this paper, we
leverage MTConnect to diagnose machine health condition by combining sensor data with operating
context information. Additionally, we investigate whether it is possible to diagnose machine
health condition using less sensor data when it is combined with context information.
   Prior work [8] has demonstrated that vibration data could be used for diagnosing machine
imbalance fault conditions. Our study focuses on extending prior work by exploring various types
of sensor and control data for diagnosing the imbalance of the machine tools.
   Our contribution includes the following extensions:
  • Combining control and sensor signals to improve accuracy.
  • Utilizing a different set of sensor data such as temperature, power, flow, and
    lubricant/coolant pH.
   Our hypothesis is that these advancements to prior work will aid in improving the diagnosis
capability as well as reducing the cost of machine diagnostics by utilizing cheaper sensors.






2 Experimental Data
The data under study has been collected from experiments utilizing a machine tool monitoring
system implemented on a horizontal machining center manufactured by Milltronic with Fanuc 0i-MC
control. We have two main sources of data: (i) data from additional sensors installed on the
machine, and (ii) data from the machine tool controller. This data has been collected using
National Instruments equipment and software (LabVIEW).
   The external sensors used for data collection include:
  • power sensor that measures power using Hall effect,
  • accelerometers that capture machine tool motion in 6 degrees of freedom,
  • thermocouples that measure temperatures at 10 locations on the machine tool,
  • pH sensor for detecting the pH level of the metalworking fluid, and
  • flow rate sensor to measure metalworking fluid pump flow.
   The second category consists of data collected from the controller. This data includes drive
loads, absolute and relative positions, servo delays, and feed rate. The complete list of the
components of the control data is given in Pavel et al. [8].
   Data has been collected in two sessions, one in 2009 and the other in 2010. Although the
basic control signals are similar, they are offset by constant values (see Figure 1). Since the
positional offset could cause a difference in the motion dynamics, we have treated them as
separate data sets for this study.

3 Technical Approach
   For each extension to prior work listed in Section 1, we have performed two main steps for
creating appropriate diagnostics:
  • Feature Extraction & Synthesis
  • Model Selection

3.1 Feature Extraction & Synthesis
There are various approaches for condensing time series information into data mining features.
Prior work has utilized transfer functions to map control signals to vibrational sensor data
[8]. The diagnosis step is then reduced to comparing the features of transfer function-predicted
vibration data and the sensor-derived vibration data. This approach makes sense when the control
signal directly impacts the output variables of the machine. For motion control of machine
tools, the estimated transfer function should be similar to the transfer function of the
implemented control (like PI or PID). Typical vibration data features would include average,
standard deviation, and maximum FFT values [9].
   However, we would like to diagnose the state of the machine using not only accelerometers,
but also other sensors, such as temperature sensors. Since temperatures at various locations are
not part of active control loops, there may not exist well defined transfer functions that can
map control signals to temperature sensor data very accurately. In such cases, where
conventional features extracted from temperature signals are not correlated with the fault
(imbalance) to a sufficient degree, or if the associated sensors are too expensive to install,
data fusion may be applied.
   There are three data fusion approaches typically used in machinery diagnostics [10; 11]:
data-level fusion, feature-level fusion, and decision-level fusion. Data-level fusion involves
combining sensor data before feature extraction, such that features contain information gathered
from multiple sensors. Feature-level fusion involves generating features from each sensor
separately, then fusing this set of features generated from all of the sensors coherently for
diagnostics. Finally, decision-level fusion creates diagnostics from each sensor separately,
then aggregates these diagnostics into a single diagnostic output.
   The choice among the three types of data fusion methods is often application specific. In our
application, we found that temperature sensor data cannot resolve imbalance conditions by itself
and control signal data is too coarse-grained to aid in classifying imbalance conditions using
the standard data-fusion techniques. Note that we did not focus on spindle acceleration data,
which could diagnose imbalance on its own (see Subsection 4.1), since that would require
retrofitting existing machine tools with new expensive sensors and data acquisition hardware.
Ideally we would like to use the readily accessible control signals and data from inexpensive
temperature sensors to diagnose imbalance. To achieve this goal, we proposed a different type of
data fusion approach. We used the control signal to provide the contextual information for
temperature sensor data. The control signal is used for the segmentation of sensor data, but
does not directly map into feature vectors (see Subsection 4.2).

3.2 Model Selection
Since the data sets are statistically small and the dimensionality of the data is increased by
feature synthesis, the models to be used for imbalance classification need to be carefully
chosen to avoid over-fitting. The high-dimensional data needs to be projected to a much smaller
sub-space to prevent over-fitting (see footnote 1). To accomplish this, the main techniques used
in this study are Principal Component Analysis (PCA) [12] and Linear Discriminant Analysis (LDA)
[13]. These techniques are based on linear coordinate transformation, which makes them more
likely to under-fit and less likely to over-fit [14].
   Footnote 1: Note that the complexity of a model is positively correlated with the likelihood
of over-fitting. Thus, a classifier that takes high-dimensional input will have a higher degree
of freedom (i.e. higher complexity) compared to one that takes low-dimensional inputs, which
results in a higher likelihood of over-fitting.

4 Results
We have explored two types of imbalance diagnostics to investigate the hypothesis posed in
Section 1:
  • Sensor based Diagnostics
  • Control based Temporal Segmentation followed by Sensor based Diagnostics

4.1 Sensor based Diagnostics
In this case, each sensor signal was analyzed separately to determine if any of the sensor
signals contains enough diagnostic information to detect imbalance on its own. By plotting the
time series data we find that spindle acceleration sensors (which capture vibration) show higher
oscillation








Figure 1: Primary Control Signals. Panels: (a) Absolute X position, (b) Absolute Y position, (c) Absolute Z position, (d) Spindle Motor Speed.






amplitudes (see Figure 2) with increasing imbalance. Since imbalance actually impacts the moment
of inertia of the spindle, this change in acceleration is expected.
   We also considered measuring imbalance through temperature. From the energy flow perspective,
additional acceleration caused by imbalance should result in higher energy consumption from the
power source and higher energy dissipation to thermal inertias due to friction, which should
result in a temperature increase in parts of the machine tool. However, the time series data
from each of the temperature sensors did not show distinguishing features similar to those of
the acceleration sensors. An example of temperature sensor time series data is shown in
Figure 3.

Figure 3: Sample Temperature Sensor Data (Fluid Temperature): blue and red traces indicate
nominal and faulty conditions respectively

   For this sensor data analysis, the features extracted are (i) average, (ii) standard
deviation, (iii) maximum amplitude of FFT, and (iv) frequency for maximum amplitude of FFT.
These four features are inspected visually to determine if imbalance could be classified by a
simple linear classifier. The spindle acceleration (X, Y, and Z) feature (maximum amplitude of
FFT) showed easily visible characteristics that can distinguish between degrees of imbalance.
See Figure 4 for an example of visual classification based on X-axis acceleration data. Other
sensor signals like power, pH, flow, and temperature did not exhibit such classification
capability.

4.2 Control-based Segmentation followed by Sensor-based Diagnostics
The second diagnostic approach that we explored combines both sensor and control data in a
coherent manner. The first step in this approach is to utilize the control signal to provide
temporal segmentation, i.e., assuming quasi-steady state, the goal is to find the time intervals
in which the following conditions are satisfied: (i) all experiments display the same values for
the primary control signal (actual spindle speed), and (ii) all the control signals are constant
over the same period. Note that, to investigate the dynamic response, rather than the
quasi-steady-state response, the control signals should be consistent across the experiments so
that responses are compared under the same set of control inputs. Figure 5 (a) shows the result
of this temporal segmentation scheme. For each of the control signals, we have computed the
standard deviation at each time step and identified the periods with standard deviation below a
set threshold to find the consistent time intervals (shown as colored segments along the time
axis in Figure 5 (b)). Then we find the intersection of the sets of consistent time intervals
over all the control signals to determine the aggregate time intervals over which the control
signals are statistically consistent (shown as black segments along the time axis in
Figure 5 (c)).
   These temporal segments are then mapped to sensor data to facilitate diagnostics. For each of
16 temporal segments, we computed features including (i) average, (ii) standard deviation, (iii)
maximum FFT value, and (iv) FFT frequency at maximum amplitude. This step produces a
64-dimensional feature space to diagnose machine imbalance. As mentioned before, to avoid
over-fitting we focus on linear transformation based approaches. We implemented Principal
Component Analysis (PCA) to reduce the dimensionality from 64 to 4 (postulating that there
should be 4 unique dimensions given the 4 uncorrelated features that we have selected). The PCA
step is followed by Linear Discriminant Analysis to find the optimal coordinate transformation
that provides maximum separation between classes. The result of this PCA-LDA analysis is shown
in Figure 6 for Fluid Temperature sensor data. Another temperature sensor, located at the
Spindle Motor, also exhibits similar diagnostic capability after application of control based
temporal segmentation. This demonstrates that control data can be used to provide context to
sensor data in a way that helps diagnose machine imbalance. Thus, the temperature sensor, which
had inferior diagnostic performance without context data, could classify imbalance perfectly
when combined with additional context from the control signal.

5 Conclusion and Discussion
This work explores various types of sensor and control data for diagnosing the imbalance of the
machine tools. Our proposed approaches utilize sensor data that has not been used before for
this purpose. This includes temperature, power, flow, and lubricant/coolant pH. In addition, our
proposed techniques combine control and sensor signals to improve accuracy. Namely, by adding
context information gained from the control signal, the temperature sensor was able to classify
machine imbalance conditions with much higher accuracy than when used alone.
   For future work, we will explore diagnostics based on the control signal alone. Given that
relying on sensor data typically requires adding sensors to existing machine tools, it would be
ideal if we could diagnose imbalance of the machine from control signals that are usually
recorded (i.e. no additional hardware required). The expectation is that if a machine tool uses
feedback controls, then the control signal should be impacted by any change in the operational
characteristics (in this case the imbalance of the machine tools).

References
[1] Carrie MacGillivray, Vernon Turner, and Denise Lund. Worldwide internet of things (iot)
    2013–2020 forecast: Billions of things, trillions of dollars. IDC. Doc, 243661(3), 2013.
[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM
    Computing Surveys (CSUR), 41(3):15, 2009.








Figure 2: Spindle Acceleration Data for different imbalance levels. Panels: (a) Spindle X Acceleration: 2009 Data, (b) Spindle X Acceleration: 2010 Data, (c) Spindle Z Acceleration: 2009 Data, (d) Spindle Z Acceleration: 2010 Data.




Figure 4: Visual Classification using Spindle X Acceleration Sensor. Panels: (a) Spindle X Acceleration for 2009 Data, (b) Spindle X Acceleration for 2010 Data.
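
The four per-signal features named in Section 4.1 (average, standard deviation, maximum FFT amplitude, and the frequency at that maximum) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `extract_features`, the sampling rate, the removal of the mean before the FFT, and the synthetic test signal are all hypothetical choices that do not appear in the paper.

```python
import numpy as np

def extract_features(signal, sample_rate_hz):
    """Return (mean, std, max FFT amplitude, frequency at max amplitude)."""
    x = np.asarray(signal, dtype=float)
    mean = x.mean()
    std = x.std()
    # Assumption: subtract the mean so the DC bin does not dominate the spectrum.
    spectrum = np.abs(np.fft.rfft(x - mean))
    freqs = np.fft.rfftfreq(x.size, d=1.0 / sample_rate_hz)
    peak = int(np.argmax(spectrum))
    return mean, std, spectrum[peak], freqs[peak]

# Synthetic example (not experimental data): a 50 Hz sinusoid on a constant
# offset, sampled at an assumed 1 kHz for one second.
t = np.arange(1000) / 1000.0
features = extract_features(2.0 + 0.5 * np.sin(2 * np.pi * 50.0 * t), 1000.0)
```

For the spindle acceleration signals, the third feature is the one the paper reports as visually separating the imbalance levels.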


[3] Markos Markou and Sameer Singh. Novelty detection: a review-part 1: statistical approaches.
    Signal Processing, 83(12):2481–2497, 2003.
[4] Markos Markou and Sameer Singh. Novelty detection: a review-part 2: neural network based
    approaches. Signal Processing, 83(12):2499–2521, 2003.
[5] Haitao Guo, Simon Watson, Peter Tavner, and Jiangping Xiang. Reliability analysis for wind
    turbines with incomplete failure data collected from after the date of initial installation.
    Reliability Engineering & System Safety, 94(6):1057–1063, 2009.
[6] Jay Lee, Fangji Wu, Wenyu Zhao, Masoud Ghaffari, Linxia Liao, and David Siegel. Prognostics
    and health management design for rotary machinery systems-reviews, methodology and
    applications. Mechanical Systems and Signal Processing, 42(1):314–334, 2014.








Figure 5: Time Series Segmentation. Panels: (a) Raw Spindle Speed Control, (b) Spindle Speed Control with Consistent Time Segment, (c) Aggregating Control Signals.
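
The segmentation steps described in Section 4.2 (per-time-step standard deviation of each control signal, thresholding, then intersection over all control signals) can be sketched as below. This is a minimal illustration and not the authors' code: it assumes that the standard deviation is taken across experiments at each time step, that each control signal is stored as an array of shape (experiments, time steps), and that a single threshold applies to every signal. The function names, the threshold value and the toy data are hypothetical. The additional requirement that signals be constant over a period is not modeled.

```python
import numpy as np

def consistent_mask(runs, threshold):
    """runs: (n_experiments, n_steps) array for one control signal.
    True where the across-experiment standard deviation is below threshold."""
    return np.std(np.asarray(runs, dtype=float), axis=0) < threshold

def mask_to_segments(mask):
    """Convert a boolean mask into half-open (start, stop) index pairs."""
    padded = np.concatenate(([False], mask, [False])).astype(int)
    edges = np.flatnonzero(np.diff(padded))  # alternating run starts and stops
    return [(int(a), int(b)) for a, b in zip(edges[0::2], edges[1::2])]

def segment(control_signals, threshold):
    """Intersect the consistent intervals of all control signals."""
    masks = [consistent_mask(runs, threshold) for runs in control_signals.values()]
    return mask_to_segments(np.logical_and.reduce(masks))

# Toy data (not from the paper): two experiments, two control signals.
control = {
    "spindle_speed": [[0, 0, 100, 100, 100, 100, 0, 0],
                      [0, 0,  50, 100, 100, 100, 0, 0]],
    "x_position":    [[0, 0, 0, 1, 1, 1, 1, 1],
                      [0, 0, 0, 1, 1, 5, 1, 1]],
}
segments = segment(control, threshold=1.0)
```

Each resulting index pair plays the role of one of the black aggregate segments in panel (c); sensor features would then be computed separately inside each pair.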




Figure 6: PCA-LDA Result using Fluid Temperature. Panels: (a) Group 1, (b) Group 2.
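
A PCA projection from 64 to 4 dimensions followed by LDA, as described in Section 4.2, might look like the following sketch. It is a plain NumPy illustration under stated assumptions rather than the authors' implementation: PCA is done through a singular value decomposition, LDA through the classical within-class and between-class scatter matrices, and the data are random synthetic samples with two classes. The function names, class offset, sample counts and random seed are hypothetical.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the feature mean and the top principal directions (as columns)."""
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T

def lda_fit(Z, y, n_components):
    """Fisher LDA: directions maximizing between-class over within-class scatter."""
    overall = Z.mean(axis=0)
    d = Z.shape[1]
    s_within = np.zeros((d, d))
    s_between = np.zeros((d, d))
    for label in np.unique(y):
        zc = Z[y == label]
        mc = zc.mean(axis=0)
        s_within += (zc - mc).T @ (zc - mc)
        diff = (mc - overall)[:, None]
        s_between += zc.shape[0] * (diff @ diff.T)
    evals, evecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    order = np.argsort(evals.real)[::-1][:n_components]
    return evecs.real[:, order]

# Synthetic stand-in for the 64-dimensional feature vectors (not real data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (20, 64)), rng.normal(2.0, 1.0, (20, 64))])
y = np.array([0] * 20 + [1] * 20)

mean, w_pca = pca_fit(X, 4)      # 64 -> 4 dimensions
Z = (X - mean) @ w_pca
w_lda = lda_fit(Z, y, 1)         # two classes -> one discriminant axis
scores = Z @ w_lda
```

Running PCA first keeps the within-class scatter matrix small and well conditioned when there are fewer samples than raw features, which matches the paper's stated concern about over-fitting on small data sets.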


[7] MTConnect Standard. Part 1-overview and protocol, version 1.01. MTConnect Institute, 2009.
[8] Radu Pavel, John Snyder, Nick Frankle, Gary Key, and Loran Miller. Machine tool health
     monitoring using prognostic health monitoring software. In MFPT 2010 Conference,
     Huntsville, AL, April 2010.
[9] Houtao Deng, George Runger, Eugene Tuv, and
     Martyanov Vladimir. A time series forest for classi-
     fication and feature extraction. Information Sciences,
     239:142–153, 2013.
[10] Qing Charlie Liu and Hsu-Pin Ben Wang. A case study
     on multisensor data fusion for imbalance diagnosis of
     rotating machinery. AI EDAM, 15(03):203–210, 2001.
[11] Andrew KS Jardine, Daming Lin, and Dragan Banje-
     vic. A review on machinery diagnostics and prognos-
     tics implementing condition-based maintenance. Me-
     chanical systems and signal processing, 20(7):1483–
     1510, 2006.
[12] Svante Wold, Kim Esbensen, and Paul Geladi. Princi-
     pal component analysis. Chemometrics and intelligent
     laboratory systems, 2(1):37–52, 1987.
[13] Gary J Koehler and S Selcuk Erenguc. Minimizing
     misclassifications in linear discriminant analysis*. De-
     cision sciences, 21(1):63–85, 1990.
[14] Bo Yang, Songcan Chen, and Xindong Wu. A struc-
     turally motivated framework for discriminant analy-
     sis. Pattern Analysis and Applications, 14(4):349–367,
     2011.








On the Learning of Timing Behavior for Anomaly Detection in Cyber-Physical Production Systems

Alexander Maier¹ and Oliver Niggemann¹,² and Jens Eickmeyer¹
¹ Fraunhofer Application Center Industrial Automation IOSB-INA
e-mail: {alexander.maier, jens.eickmeyer}@iosb-ina.fraunhofer.de
² inIT - Institute Industrial IT
e-mail: oliver.niggemann@hs-owl.de

Abstract

Model-based anomaly detection approaches have by now established themselves in the engineering sciences. Algorithms from the fields of artificial intelligence and machine learning are used to identify a model automatically from observations. Many algorithms have been developed for different tasks such as monitoring and diagnosis. However, the role of time in modeling formalisms has not yet been duly investigated, although many systems depend on time.

In this paper, we evaluate the requirements that the factor of time places on modeling formalisms and their suitability for automatic identification. Based on these features, we classify the timing modeling formalisms with respect to their suitability for automatic identification and to the use of the identified models for diagnosis in Cyber-Physical Production Systems (CPPS). We give the reasons for choosing timed automata for this task and propose a new timing learning method, which differs from existing approaches, and we prove that it improves the computational runtime. The presentation of a use case in a real plant setup completes this paper.

1 Introduction

Many learning algorithms have been developed for the identification of behavior models of CPPS, e.g. [1], [2], [3]. However, most of these learning algorithms do not include timing information, not least because the underlying modeling formalisms do not consider timing information.

Technical systems, however, mostly depend on time, e.g. the filling of a bottle or the movement of a part on a conveyor belt. Therefore, many applications (such as anomaly detection) require a model with timing information. Some faults can only be detected using timing information (especially degradation faults, e.g. a worn conveyor belt runs slower).

In this paper, we use the term "Cyber-Physical Systems (CPS)" for "systems that associate (real) objects and processes with information processing (virtual) objects and processes through open, partly global, anytime interconnected information networks". Further, a CPPS is a CPS in the context of an industrial production environment.

In this paper, we give a taxonomy of modeling formalisms. These formalisms are evaluated according to specific features. The taxonomy is then used to evaluate whether the models can be identified automatically and used for anomaly detection.

Based on this evaluation, we present a timing learning method, which is used to learn the timing behavior as a timed automaton. In contrast to other approaches, we use the underlying timing distribution function to differentiate between transitions with equal events that belong to different processes.

By analyzing the computational runtime, we prove that our approach runs faster than other existing methods for timed automaton learning.

The presented learning method is used in an exemplary plant setup to demonstrate its suitability for anomaly detection in CPPS.

The paper is organized as follows: In Section 2 we evaluate some timing learning features and give a taxonomy of how these features are met by three categories of timing modeling formalisms, namely (i) Dynamic system models, (ii) Operational formalisms and (iii) Descriptive formalisms. In Section 3, we argue why we use timed automata as the formalism, point out some challenges in timed automaton learning and present our timing learning approach. Further, we formally prove the improvement in computational runtime of our approach. Section 4 completes the contribution with the presentation of a use case in a real plant. Finally, in Section 5, we conclude this paper with a short discussion and give an outlook on future work.

2 Classification of Timing Learning Features and Algorithms

The modeling of time for computation purposes is a widely researched area (e.g. in [4], [5] and [6]). Many formalisms have been created to model different aspects of timing behavior. In this paper, we analyze some aspects that have to be considered when choosing an appropriate timing modeling formalism. Based on this analysis, some modeling formalisms are evaluated according to their capabilities to model the timing behavior. From these, one formalism is chosen that is well suited for anomaly detection in CPPS.

To keep the application domain in mind, a special focus is placed on the modeling and identification of the timing behavior of CPPS. Additionally, we evaluate the suitability of the modeling formalisms for automatic learning from observations only and for anomaly detection.
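To make the degradation example from the introduction concrete (a worn conveyor belt runs slower), the following minimal Python sketch flags a duration that falls outside a band learned from normal observations. It is only an illustration, not the learning method presented in this paper: the sample durations, the function names and the choice of a band of k standard deviations around the mean are all assumptions.

```python
from statistics import mean, stdev


def learn_band(durations, k=3.0):
    """Return (low, high) = mean -/+ k standard deviations of normal durations."""
    m, s = mean(durations), stdev(durations)
    return m - k * s, m + k * s


def is_timing_anomaly(duration, band):
    """True if the observed duration lies outside the learned band."""
    low, high = band
    return not (low <= duration <= high)


# Hypothetical transport times (seconds) of a part on a healthy conveyor belt.
normal = [2.00, 2.05, 1.95, 2.02, 1.98, 2.01, 1.99]
band = learn_band(normal)

print(is_timing_anomaly(2.03, band))  # duration inside the learned band
print(is_timing_anomaly(2.60, band))  # same event, but the belt is too slow
```

In both calls the event itself is the expected one; only the elapsed time distinguishes normal behavior from the degraded one, which is why a model without timing information cannot detect this kind of fault.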




2.1 Evaluation of Timing Modeling Features

Before choosing an appropriate timing modeling formalism, some key issues have to be considered, which are listed below. Some of these and additional features are given in [5], where the authors provide a comprehensive analysis of timing modeling features and the corresponding modeling formalisms.

Discrete or dense time domain
The separation of formalisms according to their use of discrete or dense time domains is a first natural categorization. Discrete time models comprise a set of isolated points, whereas dense time means that in a dense set, ordered by "<", for every two points t1 and t2 with t1 < t2 there is always a third point t3 in between, such that t1 < t3 < t2.

Explicit or implicit modeling of time
Another major distinctive feature is whether time is modeled implicitly or explicitly. Model formalisms with explicit time allow the modeling of concrete time values for some specific event, e.g. "if the sensor is activated, start the conveyor belt within two seconds". Implicit modeling of time only gives information about the time duration as a whole.

One clock or many clocks
Furthermore, time model formalisms can be differentiated according to the number of clocks they use. When dealing with independent modules within a system, the question arises whether to use one or many clocks. The usage of many clocks leads to the need for clock synchronization in the simulation step, whereas the usage of one clock only requires a transformation from an n-clock model to a 1-clock model.

Concurrency and composition
Most real systems are too complex to be captured in one overall model. The behavior has to be divided into several subsystems, so that the overall model is a composition of its sub-models. For finite state machines, the number of states reduces enormously if the system is decomposed into subsystems. This is also referred to as modularization.

The decomposition is a less mature process. Difficulties can arise in the synchronization step. Mostly, the separated models of subsystems have equal or identical properties. Furthermore, the time bases can differ between the modules (discrete or continuous), or the time base is implicit for one module and explicit for another.

Single-mode and multiple-modes
The distinction between models that can only cope with a single mode and models that can additionally deal with multiple modes goes a step deeper than concurrency and decomposition. A system may, at some point in time, abruptly change its behavior. In technical systems, this happens for reasons such as shifting a gear or stopping a conveyor belt. All state-based models (e.g. statecharts, Petri nets or finite state machines) are able to describe multiple-mode systems, whereas equation-based formalisms (e.g. ordinary differential equations) can only describe the behavior of single-mode systems.

Linear- and branching-time models
A distinction can also be made between linear-time and branching-time models [7]. Linear-time formalisms are interpreted over linear sequences of states. Each description refers to (a set of) linear behaviors, where the future behavior from a given state at a given time is always identical. Branching-time formalisms are interpreted over trees of states. That means, in contrast to linear-time models, the future behavior of a given state at a given time can follow different paths according to the tree.

A linear behavior can be regarded as a special case of a tree. Conversely, a tree can be treated as a set of linear behaviors that share common prefixes (i.e., that are prefix-closed); this is captured formally by the notion of fusion closure [8]. Thus, linear and branching models can be put on a common ground and compared.

2.2 Taxonomy of Timing Modeling Formalisms

The timing modeling formalisms can mainly be subdivided into three categories: (i) Dynamic system models, (ii) Operational formalisms and (iii) Descriptive formalisms:

Dynamic system models
In various engineering disciplines (like mechanical or electrical engineering) and especially in control engineering, the so-called state-space representation is a common way to model the timing behavior of technical systems [9].

Three key elements are essential for the state-space representation: the vector x with the state variables, the vector u with the input variables and the vector y with the output variables. All these values depend on the time at which they are evaluated (usually written as x(t), u(t), and y(t)). However, the timing information is not described explicitly in a form such as "the filling of the bottle takes five seconds", i.e. the formalism uses implicit timing.

The main advantage of dynamic system models is that very detailed physical models can be created using established mathematical methods. But this can also turn into a disadvantage. For many purposes, the models are too detailed, i.e. they are unsuitable for high-level description, since some expert knowledge is required to read and understand them. As proposed in [10], dynamic systems can be used for the diagnosis of distributed systems.

Various methods exist to identify dynamic system models. These methods are grouped under the term model identification (sometimes the term "system identification" is also used), although the model is not identified completely: a model structure is presumed and the identification methods only determine the parameters. So, some expert knowledge is still necessary and manual work has to be done. In [6], Isermann describes some methods, e.g. by means of parameter estimation. The states themselves are not identified.

Dynamic system models can also be used for fault detection (e.g. [11]). Model-based fault detection uses the inputs u and the outputs y to generate residuals r, parameter estimates Φ or state estimates x, which are called features. A comparison of these features with the nominal values (normal behavior) detects changes of features, which lead to analytical symptoms s. The symptoms are then used to determine the faults.

Despite their suitability for the modeling of timing behavior, dynamic system models can hardly be learned automatically based on observations only, since the structure of the model has to be given and mostly only the parameters are identified.

Operational Formalisms
Operational formalisms can be further subdivided into (i) synchronous state machines and (ii) asynchronous abstract machines:

Synchronous state machines:
A large variety of synchronous state machines exists: finite state machines, statecharts, timed automata, hybrid automata, Büchi automata, Muller automata, and others (see [12]). Here, we confine ourselves to finite state machines and timed automata, the timing extension of finite state machines.

The main strength of finite state machines, and the reason for their wide usage, is their accessibility for humans and their simplicity. Often, processes or timing behavior are described by a sequence of events. In fact, technical systems are often programmed as state machines, e.g. using the standardized programming languages from IEC 61131. Therefore, modeling the timing behavior of such technical systems as finite state machines or timed automata is a logical step.

Some algorithms already exist to identify timed automata from observations (e.g. in [13], [14], [15], [16], [1]). Most automata identification algorithms are based on the state merging method. The basic procedure is illustrated in Figure 1. It works as follows:

First, in step (1), the data is acquired from the system and stored in a database. In step (2), the observations are used to create a prefix tree acceptor (PTA) in a dense form, in which equal prefixes are stored only once. Then, in step (3), all pairs of states are iteratively checked for compatibility. If a compatible pair of states is found, the states are merged. In [13], a transition splitting operation is additionally introduced, which is executed when the resulting subtrees are different enough. The result is a finite automaton that generalizes the observed behavior in an appropriate way.

[Figure 1 diagram: (1) Data Acquisition yields Data Measurements; (2) Prefix Detection yields a Prefix Tree Acceptor; (3) State Merging yields a Finite Automaton.]
Figure 1: The principle of offline automaton learning algorithms using the state merging approach.

Finite state machines can also be used for fault detection and diagnosis (e.g. in [17], [18], [19]). Depending on the formalism used, different errors can be detected: wrong event sequence, improper event, timing deviation and error in continuous signals.

Asynchronous abstract machines:
Besides the finite state machines, which work synchronously, there exist formalisms that work asynchronously, called asynchronous abstract machines. The most popular formalism in this group is Petri nets.

Petri nets are named after Carl Adam Petri, who initially developed this modeling formalism [20]. A variety of Petri nets exists [21]. The most common type is the place/transition net. It basically consists of places and transitions. Places store tokens and hand them over to the transitions. If all incoming places of a transition hold at least one token, the transition is enabled. An enabled transition can be fired. After firing the transition, tokens are moved from its incoming places to its outgoing places.

Petri nets have also been extended to handle timing information. Merlin and Farber proposed the first Timed Petri net in [22]. Each transition is extended with a minimum and a maximum firing time, where the minimum firing time can be 0 and the maximum can be ∞. A comprehensive survey of several timed extensions to Petri nets can be found in [23] and [24].

Furthermore, several approaches exist to identify Petri nets from sampled data. However, some requirements are put on the language to be identified or some assumptions are made. For example, in [25] Petri nets are identified from knowledge of their language, where it is assumed that the set of transitions and the number of places are known. Only the net structure and the initial marking are identified.

Petri nets in general are suited for fault detection (e.g. in [26] or [2]). The different types of Petri nets (mainly condition/event systems, place/transition nets and high-level Petri nets) have different time and space complexity.

Descriptive Formalisms
As the name suggests, descriptive formalisms describe the model using a natural language, mostly based on mathematical logic [27]. Such formalisms are especially suited if some conditions have to be described.

Example 1. If it is raining or if it was raining in the last two hours, then the street is wet.

Similar rules can also be created for the prediction of output signals (actuators) based on the inputs (sensors) in a CPPS.

As Example 1 shows, the conditions can also contain time information.

There exist different types of descriptive formalisms, e.g. first-order logics, temporal logics, explicit-time logics or algebraic formalisms. Further details can be found in the literature, e.g. [27].

Some algorithms exist to identify descriptive models. For the prediction of the behavior of CPPS, a timed decision tree can be learned, for instance. Examples of such learning algorithms are ID3 [28], the C4.5 algorithm as an extension of the ID3 algorithm [29], or a generic algorithm for building a decision tree by Console [3].

Note that such a rule cannot always be interpreted backwards. In Example 1, a reason for the wet street could also be that somebody has washed a car on the street. Therefore, descriptive formalisms have a limited suitability for anomaly detection. The usage of descriptive formalisms for anomaly detection puts additional requirements on the rules: they have to be more concrete. The given example can be modified as follows:

Example 2. The street is wet if and only if it is raining or it was raining in the last two hours.

This rule allows a backward interpretation if the reason for the wet street is unknown. However, the meaning of the
                                                                 219
                        Proceedings of the 26th International Workshop on Principles of Diagnosis


rule has now changed. Additionally, these kind of rules is             some state of the art timing representation methods and
hardly identifiable from observations only.                            propose our solution.
Comparison of Modeling Formalisms                                   • Relative or absolute time base: The time base is also
   Table 1 shows how the mentioned timing modeling fea-               a very important issue. The base can be either absolute
tures are met by the corresponding modeling formalisms.               e.g. referred to the beginning of a production cycle or
   It can be seen that operational and descriptive formalisms         relative to the last event.
allow a similar level of timing modeling, while dynamic             • Number of clocks: Technical systems may be pro-
system models differ in nearly all features. In contrast to           grammed using a certain number of clocks. These have
the other formalisms, dynamic system models use a dense               to be identified or the behavior has to be expressed us-
time domain, only allow implicit modeling of, time, use one           ing only one clock.
clock only and can model linear time models.
   Please note the different possibilities to handle concur-          Timed automata allow both, one and many clocks.
rent behavior. Petri nets are the first choice for this task.         However, in [13] Verwer showed that 1-clock timed
Using tokens, concurrent behavior can be modeled in one               automata and n-clock timed automata are language-
model. Timed automata and hidden Markov models (HMM)                  equivalent, but in contrast to n-clock timed automata,
are able to decompose the behavior in several subsystems.             1-clock timed automata can be identified efficiently.
                                                                    • Event splitting: When do events with different timing
3 Automaton Learning                                                  belong to the same event, or do they describe different
The decision of which formalism to use is based on several            events? As can be seen in Figure 2, the events can be
factors. These can differ based on the individual use case.           split based on the timing, which is based on the con-
Here, we consider the models to be used for learning and              tainer size: The robot needs more time to move the big
diagnosis of CPPS.                                                    container compared to the small one, this is captured
   Despite there exist several algorithms for the identifica-         in the given probability distribution function over time.
tion of timed behavior, it can be seen in Table 2 that the            More formally: The event’s timing distribution func-
usage of timed automata is a good choice.                             tion can comprise several modes that have to be identi-
                                                                      fied.
  • Understandability: In contrast to many other auto-
    matically identified models, the identified finite state               d(t)
                                                                                  probability              place at
    machines can be better understood by third persons.                                                    position A
                                                                                                                    Small
    They can be verified by experts.                                                              t
                                                                                               time
                                                                                      filling bottleevent a
                                                                                                                  Containers   A

  • Wide usage: Finite state machines are widely used,                  d(t)                               place at
                                                                                                           position A
    e.g. for modeling or programming.                                                   Robot
                                                                                      filling
                                                                                                t
                                                                                              bottle                           A
                                                                           d(t)    fillingStart
                                                                                            bottle
  • Learnability: Finite state machines are suitable for
                                                                                                                               B
    automatic learning. The goal is to use as few expert                          probability     t
                                                                                                          place
                                                                                                          at position B
    knowledge as possible.                                              d(t)
                                                                                   filling bottle event a
                                                                                                                    Large
  • Diagnosability: Finite state machines are suitable for                                    time
                                                                                                 t
                                                                                                         place Containers      B
                                                                                                         at position B
    fault detection. This applies for both, manually created
    and automatically identified finite state machines.
Figure 2: The timing behavior changes based on the container size.

  • Suitability for verification: The identified finite state machines can be used for automatic
    verification.
  • Modification: The identified finite state machine can be manually modified and adapted after
    learning. This can also be done automatically.

3.1 Challenges in Automaton Learning

Some algorithms have already been introduced for the identification of timed automata, see
Section 2.2. However, there are still some challenges in learning timed automata. This applies in
particular to the time factor.
  • Identification of states and events: The timing behavior includes not only the time stamps for
    some observations, but also some states and transitions with timed events in between. Many
    learning algorithms (especially for learning of Markov chains) assume the states and
    transitions as given and only learn the transition probabilities. Here, the structure (states
    and events) is not given but has to be identified from observations.
  • Timing representation method: Additionally, an appropriate timing representation method has to
    be chosen, which is able to correctly describe the technical processes. At the beginning of
    Section 3.2 we review
  • Event splitting or timing preprocessing: Continuing from the previous point, the question
    additionally arises whether the modes are identified during the learning process itself or
    whether a preprocessing can be used to identify multiple modes and use this information in the
    learning process, avoiding the additional splitting operation.

3.2 Timed Automaton Learning Algorithm

Several algorithms have been introduced to learn an automaton based on observations of the normal
behavior only. While most automaton identification algorithms do not consider time (e.g. MDI [30]
and Alergia [31]), only a few algorithms have recently been introduced that identify a timed
automaton. RTI+ [13] and BUTLA/HyBUTLA [16] learn in an offline manner, i.e. first the data is
acquired and stored and then the automaton is learned. However, for the case that observations
cannot be stored, an online learning algorithm is desirable, which includes each observed event
online, without a preprocessing. OTALA [1] is an extension of BUTLA and learns a timed automaton
in an online manner.
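As a minimal illustration of what including each observed event online can mean for the timing information alone, the following sketch updates per-event timing statistics one observation at a time, so that no observation has to be stored. It is not OTALA or BUTLA: the class name `RunningTiming`, the use of Welford's update for mean and variance, and the example values are assumptions made for illustration only.

```python
import math


class RunningTiming:
    """Running timing statistics of one event, updated per observation.

    Hypothetical helper: keeps the observed [low, high] interval and the
    mean/standard deviation (Welford's method) without storing the samples.
    """

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0          # sum of squared deviations from the mean
        self.low = math.inf
        self.high = -math.inf

    def update(self, t):
        """Include one relative time stamp t in the statistics."""
        self.n += 1
        delta = t - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (t - self.mean)
        self.low = min(self.low, t)
        self.high = max(self.high, t)

    @property
    def std(self):
        """Population standard deviation of all time stamps seen so far."""
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


# One statistics object per event symbol; the values are made up.
stats = RunningTiming()
for t in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(t)
```

Each call to `update` costs constant time and memory, which is the property an online setting needs; the interval `[low, high]` corresponds to the minimum/maximum representation of timing, while mean and standard deviation correspond to a single-mode Gaussian description.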







 Table 1: Taxonomy of the timing modeling features and how they are satisfied by the corresponding modeling formalisms.

                                        Operational formalisms                                Descriptive formalisms    Dynamic system models
Feature                                 Timed automata    HMM               Petri nets        e.g. rule-based system    e.g. state space representation
Discrete or dense time domain           discrete          discrete          discrete          discrete                  dense
Explicit or implicit modeling of time   explicit          explicit          explicit          explicit                  implicit
One clock or many clocks                one/many          one/many          one/many          one/many                  one
Concurrency and composition             ++                ++                +++               +                         +
Single-mode and multiple-modes          single/multiple   single/multiple   single/multiple   single                    single
Linear- and branching-time models       linear/branching  linear/branching  linear/branching  linear/branching          linear


 Table 2: Satisfiability of the mentioned properties by different timing modeling formalisms.

Property                        Timed automata    HMM               Petri nets        Rule-based system   State space representation
Understandability               +++               ++                ++                +++                 +
Wide usage                      +++               +++               ++                ++                  ++
Learnability                    +++               ++                +                 +++                 +
Diagnosability                  +++               ++                ++                ++                  ++
Suitability for verification    +++               +++               ++                ++                  +
Modification                    +++               ++                ++                +++                 +


   A crucial issue for the modeling formalism of timed systems is the representation of the timing
information. Usually, timed automata use a single clock only and therefore a relative time base is
required, where a relative time stamp represents the passed time from entering until leaving a
state. The timing information is annotated in the transition next to an event. The usual way is to
use intervals recording the minimum and maximum observed time values for a specific event [13],
[14], [15], [1].

   RTI+, the first algorithm for the identification of timed automata [13], included a transition
splitting operation in addition to the merging operation. The timing in the transitions is
represented with histograms using bins and a uniform distribution [13]. During the state merging
procedure, it is also checked whether a transition can be split. A transition is split when the
resulting subtrees are different enough. However, the splitting operation is associated with a high
calculation time, since depending on the bin size, all possible splits have to be calculated. The
disadvantage of this approach is that the bin size has to be set manually by experts. Further, it
does not take the underlying distribution into account.

   In contrast to other existing algorithms for the identification of timed automata, our proposed
identification algorithm BUTLA [16] uses probability density functions over time (PDFs) to express
the timing behavior. Unlike other approaches, we base our decision on the timing information
itself, not on the subtree resemblance.

   The identification algorithm BUTLA follows the methodology from Figure 1. Additionally, instead
of the splitting operation, a preprocessing step is introduced, which identifies the timing
behavior and captures different behavior patterns as shown in Figure 2.

Timing preprocessing
   The timing of events is analyzed in a preprocessing step. The relative time values of each event
are collected in a histogram. It is decided whether the timing behavior is subdivided into multiple
modes based on this histogram and the resulting probability density distribution over time. In case
of multiple modes, an event is separated according to the number of modes in the PDF such that each
event consists of only one mode. For instance, an event e_i with 2 modes is separated into e_{i,1},
e_{i,2}, as can be seen in Figure 3.

[Figure 3 panels: "Probability density function over time" for event e_i and "Separated events"
e_{i,1}, e_{i,2}; both panels plot p(t) over t.]
Figure 3: An event with a multi-mode timing behavior is separated into its modes.

   For the detection of multiple modes in events, three methods have been evaluated:
  • Kernel density estimation: This version is straightforward: it estimates the density of the
    distribution function and subdivides it at the local minima. It is optimized for efficient
    computation time. Nevertheless it delivers useful results.
  • Expectation-maximization (EM) algorithm: This method is well-known from the state of the art.
    It performs well, but the number of mixed distribution functions has to be known or determined
    subsequently by trying all values and taking the best fit.
                                                                        procedure is not necessary, since the transitions are already
                                                                        split according to the identified timing modes.
  • Variational Bayesian inference: This version has the weakest runtime performance but delivers
    the best results. The number of overlapping distribution functions is calculated in an
    iterative manner.

   Due to the high computation effort of the EM algorithm and Variational Bayesian inference, we
chose to use the kernel density estimation for the timing preprocessing in BUTLA. The determination
of the timing modes using the kernel density estimation works as follows:
   First, for each event e, all timing values t_1, t_2, ..., t_k are collected and stored in a list
{e, {t_1, t_2, ..., t_k}}, where k ∈ N is the number of collected timing values for one event.
Then, the PDFs are calculated using the kernel density estimation method for each event. Density
estimation methods use a set of observations to find the underlying density function. Given a
vector t with the time values of the observations, the underlying density distribution for a time
value t can be estimated as

$$ f(t) = \frac{1}{N} \sum_{i=1}^{N} k(t_i; t) \tag{1} $$

where N ∈ N is the number of time values in the vector of observations and k(t_i; t) is a
non-negative kernel function with

$$ \int_{-\infty}^{\infty} k(t_i; t) \, dt = 1. \tag{2} $$

   As underlying probability distribution, we use the Gaussian distribution, which is defined as:

$$ G(\mu, \sigma^2, t) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(t-\mu)^2}{2\sigma^2}} \tag{3} $$

where σ² is the bandwidth (smoothing factor), µ the mean value and t is the time value for which
the probability is calculated.
   The choice of the bandwidth is important for the correctness of the results and it is the
subject of research in different publications (e.g. [32]). In the case of identifying the normal
behavior of production plants, it is useful not to use a fixed value for the smoothing factor but
to keep it variable. Here, the variable smoothing factor is 5% of the current value. This results
in a greater variance for greater time values and a smaller variance for smaller time values.
Therefore, the density is estimated as:

$$ f(t) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi \cdot 0.05\, t_i}} \, e^{-\frac{(t - t_i)^2}{2 \cdot 0.05\, t_i}}. \tag{4} $$

   In the next step the local minima in the calculated PDF are located. One mode is assumed to lie
between two neighboring local minima.
   Finally, referring to the original data (discrete time values) and based on the assumption of
normally distributed data, the required statistical parameters (mean µ and standard deviation σ)
are calculated. This is done for each mode: between the minimum value, all local minima and the
maximum value.

3.3 Analysis of the Timing Preprocessing

Figure 2 illustrates that a state can be a starting point for different processes: When the robot
is started, it depends on the size of the containers which of the sub-trees is taken for the
further process, based on the time that is needed to move the container. Different possibilities
exist to identify the different timing behavior of the sub-trees.
   The algorithm RTI+ uses a splitting operation, which calculates a p-value for all possible
splits and their sub-trees. If the lowest p-value of one split is less than 0.05, the transition is
split.
   Figure 4 illustrates the problem of the splitting operation. The main drawback of using the
splitting operation is that it requires additional computation time. First, all possible splits
have to be evaluated. Depending on the number of observations, these can be numerous. After finding
the best splitting point based on the smallest p-value, the transition has to be split. Here, for
all postfixes of the corresponding transition, it has to be decided which path to follow. Since all
these paths are mixed in the previous states, the information about which path follows which
states, based on the original data, has to be stored somehow. This leads to a huge memory
consumption. To avoid this high memory consumption, RTI+ renews the prefix tree acceptor beginning
with the corresponding state after each splitting operation. However, this is still time and space
consuming.

[Figure 4 diagram, labels: a transition a and state m before the "Split"; two transitions a, the
states n and n', and a "?" after it.]
Figure 4: The problem of the splitting operation.

Proposition 1. The time complexity of calculating and performing a splitting operation is
O(m² · n²), where m is the number of input samples and n is the number of states in the PTA.

Proof. For each transition (in the worst case there are n − 1 transitions in the PTA, if it is a
linked list of states with only one input sample or all input samples follow the path), the p-value
has to be calculated (which has to be done for each input sample using the respective transition).
Therefore, the complexity for calculating the p-values is O(m · n).
   One splitting operation itself also needs time in O(m · n) for the creation of the PTA with m
input samples, where each can have n states.
   In the worst case, if each transition has to be split, the complexity is in O(m² · n²).

   BUTLA first uses a preprocessing of timing values to avoid this splitting operation. This
version is based on the assumption that events with the same changing signals but different timing
behavior describe different behavior.
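The kernel density step of Section 3.2 (Equation 4, the search for local minima, and the per-mode mean and standard deviation) can be sketched as follows. This is only a minimal sketch under stated assumptions, not the BUTLA implementation: the function names, the evaluation of the density on a fixed grid of 400 points, the use of the population standard deviation, and the sample values are all chosen here for illustration, and the time values are assumed to be strictly positive.

```python
import math
from statistics import mean, pstdev


def density(t, samples, rel_bandwidth=0.05):
    """Equation (4): Gaussian kernels whose variance is rel_bandwidth * t_i."""
    total = 0.0
    for ti in samples:
        var = rel_bandwidth * ti  # assumes strictly positive time values
        total += math.exp(-((t - ti) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)
    return total / len(samples)


def split_points(samples, grid_size=400):
    """Grid positions at which the estimated density has a local minimum."""
    lo, hi = min(samples), max(samples)
    if lo == hi:
        return []
    step = (hi - lo) / (grid_size - 1)
    grid = [lo + i * step for i in range(grid_size)]
    dens = [density(t, samples) for t in grid]
    return [grid[i] for i in range(1, grid_size - 1)
            if dens[i] < dens[i - 1] and dens[i] <= dens[i + 1]]


def timing_modes(samples):
    """One (mean, standard deviation) pair per mode, cut at the local minima."""
    bounds = [-math.inf] + split_points(samples) + [math.inf]
    modes = []
    for left, right in zip(bounds, bounds[1:]):
        values = [t for t in samples if left <= t < right]
        if values:
            modes.append((mean(values), pstdev(values)))
    return modes


def relabel(event, t, cuts):
    """Map an observation of `event` at time t to a mode-specific symbol (event, index)."""
    return (event, 1 + sum(1 for c in cuts if t >= c))


# Hypothetical timings of one event with two clearly separated modes.
observed = [9.8, 9.9, 10.0, 10.1, 10.2, 29.6, 29.8, 30.0, 30.2, 30.4]
cuts = split_points(observed)
modes = timing_modes(observed)
```

With `relabel`, observations of the same symbol that fall into different modes become different events, e.g. `("e", 1)` and `("e", 2)`, which is how the preprocessing replaces the later splitting operation. A grid search is the simplest way to find the minima; it trades accuracy of the cut position for simplicity.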
   Using this preprocessing of the timing information, the time-consuming splitting operation
during the state merging

   In the preprocessing step, events with multiple timing modes are identified. These modes are
used for the creation of the prefix tree. Events with the same symbol but arising from different
timing modes are handled as different events and lead to different states in the prefix tree. In
the identification phase, these events are also handled as different. Using this preprocessing
step, the splitting process can be omitted. This leads to a computation speed increase.

Proposition 2. The time complexity of calculating the timing modes in a preprocessing step is in
O(n), where n is the number of observed events.

                                                                   original observations' timings and σ is the standard deviation.
   In a first experiment, the Lemgo Model Factory (see Figure 5) is used. A frequently occurring
error for example is the wear of a conveyor belt which leads to a decrease in the system's
throughput. 12 production cycles are used to identify a normal behavior model. The PTA comprises
6221 states. BUTLA reduces this to 13 states; this corresponds to a compression rate of 99.79%.
Proof. Since this is during the preprocessing step and the PTA does not exist so far, the worst
case is not dependent on the PTA structure, but only on the number of incoming events and the
number of symbols.
   First, the time stamps for each symbol in the alphabet a ∈ Σ have to be collected. This takes
time O(n).
   Then for each a ∈ Σ, the probability density distribution over time has to be calculated. For
this, Equation 4 is computed. Note that not all events are considered for a single symbol a ∈ Σ,
but only those that belong to this symbol a. All computations together need time O(n).
Additionally the local minima have to be identified, which is also done in O(n).
   All these steps are performed sequentially and therefore the overall time complexity for the
preprocessing step is O(n).

   Using the preprocessing step, the computation time can be reduced compared to the splitting
version. While the splitting version needs time in O(m² · n²), we could reduce this additional
timing computation to linear time O(n) using the preprocessing step.

   To verify the model learning algorithm with a high amount of data, in a second experiment, data
is generated artificially using the modified Reber grammar (extended with timing information). 1000
samples are generated to learn the model, then 2000 test samples are created where 1000 comprise
timing errors. From the initial 5377 states in the PTA, a model with 6 states is learned.
   Table 3 shows the error rates for the anomaly detection applied to both data sets using
different factors k in the timing intervals.

Table 3: Experimental results using real and artificial data.

                                    k = 1    k = 2    k = 3    k = 4
false negative rate (%) - LMF       2        5.3      12.8     30
false positive rate (%) - LMF       12       4.2      2        0
false negative rate (%) - Reber     0        1.3      7.5      21
false positive rate (%) - Reber     9        3.1      1.1      0

   The experimental results in Table 3 show that the false positive rate could be reduced by
enlarging the time bounds. But at the same time, the false negative rate rose. Enlarging the time
bounds therefore requires a trade-off between the false positive and the false negative rate. This
has to be done separately for each application.
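The interval test used for anomaly detection in Section 4, [µ − k · σ, µ + k · σ], and the trade-off governed by the factor k can be illustrated with a short sketch. The function names and all timing values below are hypothetical, and the population standard deviation is an assumption; the numbers are not the data behind Table 3.

```python
from statistics import mean, pstdev


def timing_interval(timings, k):
    """Interval [mu - k*sigma, mu + k*sigma] learned from normal-behavior timings."""
    mu, sigma = mean(timings), pstdev(timings)
    return mu - k * sigma, mu + k * sigma


def is_anomaly(t, interval):
    """A measured timing is anomalous if it lies outside the learned interval."""
    low, high = interval
    return not (low <= t <= high)


def error_rates(normal, faulty, interval):
    """False positive and false negative rate in percent on labeled test timings."""
    false_pos = sum(is_anomaly(t, interval) for t in normal)
    false_neg = sum(not is_anomaly(t, interval) for t in faulty)
    return 100.0 * false_pos / len(normal), 100.0 * false_neg / len(faulty)


# Hypothetical data: timings of one transition during normal operation,
# plus labeled test timings with and without a timing error.
train = [10.0, 10.2, 9.8, 10.1, 9.9]
test_normal = [10.0, 10.2, 9.8]
test_faulty = [10.3, 12.0]

for k in (1, 3):
    fp, fn = error_rates(test_normal, test_faulty, timing_interval(train, k))
    print(f"k={k}: false positive rate {fp:.1f}%, false negative rate {fn:.1f}%")
```

On this toy data the narrow interval (k = 1) flags some normal timings, while the wide interval (k = 3) accepts all of them but misses the small deviation, which is the same qualitative behavior as in Table 3.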
4 Learning Automata Results

As mentioned before, the goal of the identified automata is the usage for anomaly detection. An
exemplary plant at the institute has been used for experimental results. Figure 5 shows a part of
the Lemgo Model Factory and the identified models of two modules.

[Figure 5 diagram: learned automaton with the states 1 and 2 and the transitions "Muscle on
[8…34]" and "Muscle off [7…35]"; the labels "event" and "timing" mark the two parts of a
transition annotation.]
Figure 5: Example plant with identified models for two modules.

5 Conclusion

In this paper we analyzed the possibilities of learning the timing behavior for anomaly detection
in CPPS. First, we gave a taxonomy of timing modeling formalisms. Based on this taxonomy we
analyzed whether the models can be identified automatically and whether they are suitable for
anomaly detection.
   Timed automata are often the first choice for the modeling of timed behavior of CPPS,
especially for the modeling of sequential timed behavior.
   Due to the intuitive interpretation, timed automata are well-suited to model the timing
behavior. In our proposed learning method, we used probability density distribution functions over
time for the timing representation. In a preprocessing step multiple modes in single transitions
are identified; this enables the omission of the time-consuming splitting operation.
   We proved the runtime enhancement formally and gave some experimental results which demonstrate
the practicability of timed automata for automatic identification and for anomaly detection.
   During the anomaly detection phase, the running plant's timing behavior is compared to the
prognosis of the automaton. A timing anomaly is signaled whenever a measured timing is outside the
timing interval in the learned timed automaton. Here, the interval is defined as
[µ − k · σ, µ + k · σ], k ∈ R+ where µ is the mean value of the corresponding

References
[1] A. Maier. Online passive learning of timed automata for cyber-physical production systems. In
     The 12th IEEE International Conference on Industrial Informatics (INDIN 2014). Porto Alegre,
     Brazil, Jul 2014.






[2] M.M. Mansour, M. Wahab, and W.M. Soliman. Petri nets for fault diagnosis of large power
     generation station. Ain Shams Engineering Journal, 4(4):831–842, 2013.
[3] L. Console, C. Picardi, and D.T. Dupré. Temporal decision trees: Model-based diagnosis of
     dynamic systems on-board. CoRR, abs/1106.5268, 2011.
[4] A. Maier. Identification of Timed Behavior Models for Diagnosis in Production Systems. PhD
     thesis, University of Paderborn, 2015.
[5] C. A. Furia, D. Mandrioli, A. Morzenti, and M. Rossi. Modeling time in computing: A taxonomy
     and a comparative survey. ACM Comput. Surv., 42(2):6:1–6:59, March 2010.
[6] R. Isermann and M. Münchhof. Identification of Dynamic Systems: An Introduction with
     Applications. Advanced textbooks in control and signal processing. Springer, 2010.
[7] M. Y. Vardi. Branching vs. linear time: Final showdown. In Proceedings of the 7th
     International Conference on Tools and Algorithms for the Construction and Analysis of
     Systems, TACAS 2001, pages 1–22, London, UK, 2001. Springer-Verlag.
[8] R. Alur and T. A. Henzinger. Back to the future: Towards a theory of timed regular languages.
     In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages
     177–186. IEEE Computer Society Press, 1992.
[9] H. Khalil. Nonlinear Systems. Prentice Hall, January 2002.
[10] S. Indra. Decentralized Diagnosis with Isolation on Request for Spacecraft. In Astorga
     Zaragoza, editor, Proceedings of the 8th IFAC Symposium on Fault Detection, Supervision and
     Safety of Technical Processes, pages 283–288, August 2012.
[11] R. Isermann. Model-based fault detection and diagnosis - status and applications. In 16th
     IFAC Symposium on Automatic Control in Aerospace, St. Petersburg, Russia, 2004.
[12] W. Thomas. Automata on infinite objects. In Jan van Leeuwen, editor, Handbook of Theoretical
     Computer Science (Vol. B), pages 133–191. MIT Press, Cambridge, MA, USA, 1990.
[13] S. Verwer. Efficient Identification of Timed Automata: Theory and Practice. PhD thesis,
     Delft University of Technology, 2010.
[14] M. Roth, J. Lesage, and L. Litz. Black-box identification of discrete event systems with
     optimal partitioning of concurrent subsystems. In American Control Conference (ACC), 2010,
     pages 2601–2606, June 2010.
[15] M. Roth, S. Schneider, J.-J. Lesage, and L. Litz. Fault detection and isolation in
     manufacturing systems with an identified discrete event model. Int. J. Systems Science,
     43(10):1826–1841, 2012.
[16] O. Niggemann, B. Stein, A. Vodenčarević, A. Maier, and H. Kleine Büning. Learning behavior
     models for hybrid timed systems. In Twenty-Sixth Conference on Artificial Intelligence
     (AAAI-12), pages 1083–1090, Toronto, Ontario, Canada, 2012.
[17] S. Tripakis. Fault diagnosis for timed automata. In Werner Damm and Ernst-Rüdiger Olderog,
     editors, FTRTFT, volume 2469 of Lecture Notes in Computer Science, pages 205–224. Springer,
     2002.
[18] P. Supavatanakul, C. Falkenberg, and J. J. Lunze. Identification of timed discrete-event
     models for diagnosis, 2003.
[19] Z. Simeu-Abazi, M. Di Mascolo, and M. Knotek. Diagnosis of discrete event systems using
     timed automata. In International Conference on cost effective automation in Networked
     Product Development and Manufacturing, Monterrey, Mexico, 2007.
[20] C. A. Petri. Fundamentals of a theory of asynchronous information flow. In IFIP Congress,
     pages 386–390, 1962.
[21] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE,
     77(4):541–580, April 1989.
[22] P. M. Merlin and D. J. Farber. Recoverability of communication protocols–implications of a
     theoretical study. Communications, IEEE Transactions on, 24(9):1036–1043, Sep 1976.
[23] A. Cerone. A Net-based Approach for Specifying Real-time Systems. Serie TD. Ed. ETS, 1993.
[24] A. Cerone and A. Maggiolo-Schettini. Time-based expressivity of time petri nets for system
     specification. Theoretical Computer Science, 216(1–2):1–53, 1999.
[25] M.P. Cabasino, A. Giua, and C. Seatzu. Identification of Petri Nets from Knowledge of Their
     Language. Discrete Event Dynamic Systems, 17(4):447–474, 2007.
[26] P. Nazemzadeh, A. Dideban, and M. Zareiee. Fault modeling in discrete event systems using
     petri nets. ACM Trans. Embed. Comput. Syst., 12(1):12:1–12:19, January 2013.
[27] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The
     Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University
     Press, New York, NY, USA, 2003.
[28] J. R. Quinlan. Induction of decision trees. In Jude W. Shavlik and Thomas G. Dietterich,
     editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in
     Machine Learning 1:81–106, 1986.
[29] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San
     Francisco, CA, USA, 1993.
[30] Franck Thollard, Pierre Dupont, and Colin de la Higuera. Probabilistic DFA inference using
     Kullback-Leibler divergence and minimality. In Proc. of the 17th International Conf. on
     Machine Learning, pages 975–982. Morgan Kaufmann, 2000.
[31] Rafael C. Carrasco and Jose Oncina. Learning stochastic regular grammars by means of a state
     merging method. In Grammatical Inference and Applications, pages 139–152. Springer-Verlag,
     1994.
[32] Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel density estimation via diffusion.
     Annals of Statistics, 38(5):2916–2957, 2010.








            The Case for a Hybrid Approach to Diagnosis: A Railway Switch

              Ion Matei and Anurag Ganguli and Tomonori Honda and Johan de Kleer
                         Palo Alto Research Center, Palo Alto, California, USA
                          e-mail: {imatei,aganguli,thonda,dekleer}@parc.com



                         Abstract                                  model ultimately has 56 continuous-time states and more than
                                                                   2000 time-varying variables). We require this model to con-
     Behavioral models are at the core of Fault-                   tain the key mechanisms which comprise a switch mecha-
     Detection and Isolation (FDI) and Model-Based                 nism. Under the limiting conditions, building an accurate
     Diagnosis (MBD) methods. In some practical ap-                model of the system proved to be impractical and therefore
     plications, however, building and validating such             we used simplified models for the system’s components. For
     models may not always be possible, or only par-               example, we model the controller as a PID controller while
     tially validated models can be obtained. In this              the actual mechanism surely has a more complex one. The
     paper we present a diagnosis solution when only               Modelica model is fault augmented [Minhas et al., 2014]
     a partially validated model is available. The solu-           including parameters which represent the fault amounts for
     tion uses a fault-augmented physics-based model               wear, etc. Second, we develop ML classifiers to detect and
     to extract meaningful behavioral features corre-              diagnose faults by running the Modelica model repeatedly
     sponding to the normal and abnormal behavior.                 with various fault amounts. We add noise to the simulation
     These features together with experimental train-              to avoid over-fitting. The ML classifier requires a set of
     ing data are used to build a data-driven statisti-            features computed from the signal. Each time series
     cal model used for classifying the behavior of the            is segmented at defined conditions and a set of features is
     system based on observations. We apply this ap-               designed (e.g., mean in segment, max in segment). Mul-
     proach to a railway switch diagnosis problem.                 tiple ML techniques can produce a classifier; the best we
                                                                   found are based on random forests. Third, we throw away
1 Introduction                                                     the model: it was needed only to develop the features
                                                                   and the classifier. We now use the classifiers developed for
Consider the case of developing diagnostic software for a
                                                                   the synthetic data on the real data. We were able to detect
complex system (for this paper our example is a railway
                                                                   faults with a high level of accuracy, but were only partially
switch). The task is to determine from operational data
                                                                   successful in identifying the correct fault mode (or nomi-
whether the switch is operating correctly or in one of a fixed
                                                                   nal) for the operating system. Independently, we showed
number of fault modes. We are given the following very
                                                                   that given enough data for the various fault modes, using
limiting (but all too common) conditions: (a) very limited
                                                                   the same set of features, an ML classifier can be designed that
resources to complete the project (a few man months); (b)
                                                                   also achieves a high diagnostic accuracy. The latter effort is
limited number of sensors; (c) unavailability of the model
                                                                   not the subject of the paper. Overall, the customer was very
of the system; (d) unavailability of the system itself (would
                                                                   satisfied with the results of the project. The rest of the
require an instrumented private rail system); (e) unavailabil-
                                                                   paper describes this procedure in detail.
ity of the parameters of the system components; (f) lim-
ited nominal data; (g) extremely limited fault data (supplied
as time series); (h) highly non-linear multi-physics system
having multiple operating modes. Broadly speaking there
                                                                   1.1 FDI and MBD
are three approaches to this type of problem: Model-Based          In model-based approaches (FDI and MBD), the diagnosis
Diagnosis (MBD), Fault Detection and Isolation (FDI) and           engine is provided with a model of the system, values of the
Machine Learning (ML). None of these approaches is ade-            parameters of the model and values of some of its inputs
quate for this task. MBD and FDI require models and param-         and outputs. Its main goal is to determine from only this
eters which are unavailable. ML approaches will require a          information whether the system is malfunctioning, which
large amount of training data, and most approaches would           components might be faulty and what additional informa-
require extensive feature engineering. In this paper we will       tion needs to be gathered (if any) to identify the faulty com-
demonstrate a hybrid approach to this task which was ulti-         ponents with relative certainty. The distinguishing features
mately fully satisfactory for the train company. Many real         of the MBD [de Kleer et al., 1992] approach are an empha-
world diagnostic tasks have similar limitations and we be-         sis on general diagnostic reasoning engines that perform a
lieve our approach is one that yields good diagnostic algo-        variety of diagnostic tasks via on-line reasoning, and infer-
rithms for many cases.                                             ence of a system’s global behavior from the automatic com-
   At a high level our approach is as follows. First we build      bination of physical components. Hence, MBD models are
by hand an approximate model in Modelica (our switch               compositional - the model of a combination of two systems






is directly constructed from the models of the constituent           2 we motivate and describe the railway switch diagnosis
systems. FDI methods can work with both physics-based                problem. Sections 3 and 4 present the physics-based model,
and empirical models. The physics-based models are usu-              its fault-augmented version and the partial validation of the
ally flattened, that is, the component and sub-component             system. Section 5 describes the diagnosis solution under a
structure is lost in an overall behavioral model. Often,             partially validated physics-based model while Section 6 puts
the faults are seen as separate inputs that need to be com-          our solution in the context of existing work on railway switch
puted by the diagnosis engine. The disadvantage of this              diagnostics.
approach is that the physical semantics of the faults is ig-
nored. In addition, treating the faults as exogenous inputs          2 Problem Description
ignores the fact that the abnormal behavior may in fact
                                                                     Railway signaling equipment (including switches) generates
depend on the variables of the systems. However, many
                                                                     approximately 60% of the failure statistics related to traffic
FDI techniques were shown to be effective in diagnosing
                                                                     disruptions due to signalling problems. As a consequence
dynamical systems [Gertler, 1998; Isermann, 1997; 2005;
                                                                     more and more attention is paid to railway safety and op-
Patton et al., 2000].
                                                                     timal railway maintenance. As a result of the rapid tech-
   The above discussion emphasizes the need for a model              nological advances in microelectronics and communication
when using either an FDI or MBD approach. As we will see             technologies in the past decades, it has become possible
later in the paper, there are cases when such a model is very        to add sensing and communication capabilities to railway
difficult to obtain and (more importantly) validate, or only         equipment such as switches, to detect equipment failure and
a partial model is available. Naturally, both FDI and MBD            therefore to enhance the quality of the railway service. Al-
approaches would not fare well in such a scenario. When              though these sensing capabilities allow for easy detection of
no model is available, data-driven methods can be used to            faults in the electrical components of the equipment, a sig-
learn the behavior of the system and use this knowledge              nificant number of faults related to the mechanical compo-
to predict the system behavior. Such methods require ex-             nents affect parameters whose monitoring would be difficult
perimental data corresponding to the normal and abnormal             either due to cost or impracticality of sensor placement.
behavior for classification purposes; data that is used to ex-          The rail switch assembly considered in this paper is
tract features representative of the system’s behavior. The          shown in Figure 2. The component responsible for moving the
set of features together with observations of the system (out-       switch blades is the point machine. The point machine has
put measurements) are used to learn a data-driven statistical        two sub-components: a servo-motor (generates rotational
model that is further used to classify the current observed          motion) and a gear-cam mechanism (amplifies the torque
behavior. Namely, when new data is available it is fed into          generated by the motor and transforms the rotational motion
the data-driven model, which in turn will provide a “best            into a translational motion).
guess” of the class of behavior (normal or abnormal) to which the       The adjuster transfers the motion from the point machine
data corresponds. It is well recognized that in data-driven          to the load (switch blades) through a drive rod. In particular,
approaches, the effectiveness of the classification is highly        by adjusting two bolts, the adjuster controls the time when
dependent on the quality of the features used for learning.          the switch blades start moving relative to the time
   In this paper, we begin to bridge the gap between pure            when the drive rod commences moving. The switch blades
model-based and data-driven methods with a more hybrid               are supported by a set of rolling bearings to minimize mo-
approach. We propose the use of a partially validated model          tion friction. The manufacturer of the point machine en-
to help us determine a set of features that are representa-          dowed the equipment with a series of sensors that can mea-
tive for the normal and abnormal behavior. In this approach          sure the motor’s angular velocity and torque, and the cam’s
we build a physics based model of the system, emphasiz-              angle and stroke (linear position). These sensors log data
ing its components and sub-components. Due to the lack               in real time, which is then sent to a central station for anal-
of sufficient technical specifications and measurement data,         ysis. These sensors were installed by design on the point
only partial validation is achieved. By this we mean that            machine to monitor its safety. Although the operator of the
only a sub-set of the variables of interest match their coun-        railway switch is also interested in the diagnosis of the point
terpart in the experimental data. The rest of the variables,         machine, other possible faults are of interest as well. The
although not completely matching the real data, do ex-               faults considered in this paper are as follows: loose lock-pin
hibit similar characteristics compared to the real data, e.g.,       fault (at the connection between the drive rod and the point
same number of maxima, minima, or common regions of                  machine), adjuster bolts misalignment (the bolts move away
increasing/decreasing values, etc. In other words they are           from their nominal position), missing bearings and the pres-
qualitatively equivalent. The physics-based model is further         ence of an obstacle preventing the completion of the switch
extended to include behaviors under different fault operating        blades motion. Adding new sensors measuring forces ap-
modes. In particular, physics-based models for the faults            plied to the switch blades or the position of the switch blades
are included in the nominal model. The fault-augmented               may facilitate immediate detection of such faults. How-
model is then used to generate synthetic simulated normal            ever, due to the sheer number and possible configurations
and abnormal (including multiple faults) behavior and ex-            of switches in the railway transportation network, this is not
tract representative features that are used in a data-driven         a scalable solution. Therefore, the challenge is to diagnose
approach. Note that although ideally we would like to exe-           the aforementioned faults using only the available measure-
cute the feature extraction step automatically, in this paper it     ments.
is performed manually, as automatic feature extraction is
a challenging problem in its own right. The diagnosis procedure      3 System Modeling
described above is depicted in Figure 1.                             This section presents the fault-augmented physics-based
   The rest of the paper is organized as follows: in Section         model of the railway switch assembly, together with some








                               Figure 1: Diagnosis procedure with partially validated model


                                                                 ates a rotational motion. The gear-cam mechanism scales
                                                                 down the angular velocity of the motor and amplifies the
                                                                 torque generated by the motor. In addition, it transforms the
                                                                 rotational motion into a translational motion.
                                                                 Servomotor
                                                                 No technical details were provided on this component, such
                                                                 as type of motor or type of controller. Values for technical
                                                                 parameters (e.g., armature resistance, motor shaft inertia)
                                                                 were not available either. This information was not avail-
                                                                 able to the switch operator either. Therefore, as a result of
                                                                 a literature review on the type of motors used in railway
                                                                 switches, a permanent-magnet DC motor was chosen as the most
                                                                 likely candidate. The dynamical model for this component
                                                                 is given by
                                                                 $$L_a \frac{di(t)}{dt} = -R_a i(t) - K_e \omega(t) + v(t),$$
Figure 2: Diagnosis procedure with partially validated
model                                                            $$J \frac{d\omega(t)}{dt} = K_t i(t) - B \omega(t) - \tau(t),$$

model validation results. Such models provide deeper in-          where v(t) acts as the input signal, ω(t) is the angular veloc-
sight into the behavior of the physical system. Simulated        ity at the motor flange that acts as the output, τ(t) is the torque
behavior helps with learning normal and abnormal be-             load of the motor and i(t) is the current through the arma-
havior patterns. The abnormal patterns are especially useful     ture. Generic motor parameters from the literature were also
when not enough experimental data describing the abnormal        chosen [Zattoni, 2006]. One question that may arise is whether
behavior is available. The modeling process consists of de-      an empirical model can be estimated. Unfortunately, because
composing the system into its main components, building          only the output ω(t) is available and no voltage measurements
physical models and combining them into an overall model of      exist, an empirical model based on system identification
the system. We used the Modelica language to construct the       cannot be estimated. No information on the type of
model, which is a non-proprietary, object-oriented, equation-    controller was available to us either. As a consequence, we
based language for modeling complex physical systems [Tiller,    used a PID controller for the feedback loop. Based on the
2001]. Models for the three main components of the rail-         observed profile of the motor output we determined that the
way switch, the point machine, the adjuster and the switch       controlled variable is the angular velocity ω(t). Indeed, Fig-
blades, are presented in what follows.                           ure 3 shows the motor’s angular velocity¹ that is maintained
                                                                 at a constant value by the controller. To compute the pa-
3.1 Point machine                                                rameters of the PID controller we estimated metrics corre-
                                                                 sponding to the transient component of the output (angular
The point machine is the component of the railway switch         velocity), such as rise time and overshoot; metrics that are
system that is responsible for moving the switch blades and      formulated in .
locking them in the final position until a new motion action
is initiated. It is composed of two sub-components: servo-
                                                                 ¹ The angular velocity profile shown in the graph is similar to,
motor and gear-cam mechanism. The electrical motor trans-        but not exactly, the observed one, due to proprietary information
forms electrical energy into mechanical energy and gener-        restrictions.
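
As a rough illustration of the servomotor model and speed loop described above, the following Python sketch integrates the two motor equations with a forward-Euler step and closes the loop with a PID controller acting on ω(t). All numerical values (La, Ra, Ke, Kt, J, B, the load torque and the controller gains) are hypothetical placeholders and are not the parameters used in the paper; the function names are illustrative only.

```python
def simulate_motor_pid(omega_ref=100.0, t_end=2.0, dt=1e-4,
                       La=0.01, Ra=1.0, Ke=0.05, Kt=0.05,
                       J=1e-3, B=1e-4, tau_load=0.01,
                       Kp=0.5, Ki=5.0, Kd=0.0):
    """Forward-Euler simulation of a DC motor under PID speed control.

    All default parameter values are hypothetical.
    Returns (times, speeds) as lists.
    """
    i = 0.0          # armature current i(t)
    omega = 0.0      # angular velocity omega(t)
    integral = 0.0   # integral of the speed error
    prev_err = omega_ref - omega
    times, speeds = [], []
    n_steps = int(round(t_end / dt))
    for k in range(n_steps):
        err = omega_ref - omega
        integral += err * dt
        # PID law producing the input signal v(t)
        v = Kp * err + Ki * integral + Kd * (err - prev_err) / dt
        prev_err = err
        # La di/dt = -Ra i - Ke omega + v
        di = (-Ra * i - Ke * omega + v) / La
        # J domega/dt = Kt i - B omega - tau
        domega = (Kt * i - B * omega - tau_load) / J
        i += di * dt
        omega += domega * dt
        times.append((k + 1) * dt)
        speeds.append(omega)
    return times, speeds


def step_metrics(times, speeds, omega_ref):
    """Return (10%-90% rise time, relative overshoot) of a step response."""
    t10 = next(t for t, w in zip(times, speeds) if w >= 0.1 * omega_ref)
    t90 = next(t for t, w in zip(times, speeds) if w >= 0.9 * omega_ref)
    overshoot = max(0.0, (max(speeds) - omega_ref) / omega_ref)
    return t90 - t10, overshoot


times, speeds = simulate_motor_pid()
rise_time, overshoot = step_metrics(times, speeds, 100.0)
```

Rise time and overshoot computed in this way are the kind of transient metrics that can be compared against an observed velocity profile when choosing the controller gains.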








                                                                                  Figure 5: Adjuster diagram

                                                                  ing the adjuster was modeling the non-sticking contact be-
                                                                  tween the drive rod and the adjuster extremes. Stiff contact
             Figure 3: Motor angular velocity                     between two bodies is usually modeled using a spring-damper com-
                                                                  ponent with very large values for the elasticity and damping
                                                                  constants. However, under this approach once contact takes
The Gear-Cam mechanism                                            place, it is permanent. To solve this challenge, we built a
As mentioned earlier, the gear-cam mechanism amplifies the        custom component that models the non-sticking contact.
torque generated by the motor and transforms the rotational
motion into a translational motion. The technical details         3.3 Switch blades
provided to us confirmed only the presence of the cam, but        The adjuster is connected to two switch blades that are
not of the gear. We inferred the presence of the latter, by       moved from left to right or right to left, depending on
comparing the angular velocity of the motor with the cam’s        the traffic needs. We treat a switch blade as a flexi-
angular velocity, estimated from the measured cam angle.          ble body and use an approximation method for modeling
This allowed us to estimate the ratio between the two veloci-     beams, namely the lumped parameter approximation. This
ties, and therefore estimate the gear ratio. The cam diagram      method assumes that beam deflection is small and in the lin-
is shown in Figure 4, where a wheel rotates as a result of        ear regime. The lumped parameter approach approximates
the torque transmitted through the gear and acts on a lever       a flexible body as a set of rigid bodies coupled with springs
that pushes the drive rod. Using the geometry of the cam,         and dampers. It can be implemented by a chain of alter-
                                                                  nating bodies and joints. The springs and dampers act on
                                                                  the bodies or the joints. The spring stiffness and damping
                                                                  coefficients are functions of the material properties and the
                                                                  geometry of the flexible elements. Parameters such as rail
                                                                  length, mass and mass moment of inertia were provided to
                                                                  us through technical documentation. To model the effect of
                                                                  the rail moving on rolling bearings, we included a friction
                                                                  component that accounts for energy loss due to friction. Al-
                                                                  though the component can implement different friction models,
                                                                  the default model is Coulomb friction.
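
To make the lumped parameter idea concrete, the sketch below checks a static special case in Python: a cantilever beam split into rigid segments joined by torsional springs, loaded by a force at the tip. It assumes small deflections, an Euler-Bernoulli beam with bending stiffness EI, joints placed at segment midpoints and a joint stiffness of EI divided by the segment length. These choices, the function name and the numbers are assumptions made for illustration; the paper does not state how its stiffness coefficients were derived, and dampers play no role in a static check.

```python
def lumped_tip_deflection(force, length, EI, n_segments):
    """Static tip deflection of a cantilever approximated by rigid
    segments coupled with torsional springs (small-angle assumption)."""
    dl = length / n_segments
    k_joint = EI / dl                      # torsional stiffness of one joint
    deflection = 0.0
    for j in range(n_segments):
        x_joint = (j + 0.5) * dl           # joint at the segment midpoint
        arm = length - x_joint             # lever arm from joint to tip
        theta = force * arm / k_joint      # joint rotation under the bending moment
        deflection += theta * arm          # contribution of this joint to the tip
    return deflection


# Hypothetical values: 100 N tip force, 5 m beam, EI = 2e6 N m^2
force, length, EI = 100.0, 5.0, 2.0e6
exact = force * length ** 3 / (3.0 * EI)   # continuous-beam result F L^3 / (3 EI)
approx = lumped_tip_deflection(force, length, EI, n_segments=10)
relative_error = abs(approx - exact) / exact
```

Under these assumptions the lumped result approaches the continuous-beam value as the number of segments grows, which is why a modest chain of bodies and joints can be adequate in the linear regime.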

                                                                  3.4 Fault augmentation
                 Figure 4: Cam schematics                         In this section we describe the modeling artifacts that were
                                                                  used to include in the behavior of the system the four fault
the relation between the rotation motion and the linear mo-       operating modes: loose lock-pin, misaligned adjuster bolts,
tion (that is, the relation between the angle and the stroke)     obstacle and missing bearings.
is given by
                   stroke = R × sin(angle),                       Loose lock-pin
where R denotes the radius of the cam. In addition, the map       The lock-pin referred to in this fault mode connects the point
between the applied torque and the generated force is             machine with the drive rod that transfers the motion to the
                                                                  switch blades. More precisely, it locks the drive rod to the
                                                                  point machine. When this lock-pin becomes loose due to
              force = (1/R) × torque × cos(angle).
                                                                  wear, it introduces a slackness in the way the motion is
As both the cam angle and the stroke were included in the         transferred to the switch blades. The lock-pin fault affects the
available measurements, we used a least-squares method to         stability of the connection point between the drive rod and
estimate the radius of the cam.                                   the point machine. In time, if not fixed, this can lead to a
                                                                  complete failure of the pin, and therefore the point-machine
3.2 Adjuster                                                      can no longer act upon the blades. A custom-built compo-
The adjuster links the drive rod connected to the point ma-       nent whose main characteristic is that it implements a non-
chine to the switch blades, and hence it is responsible for       sticking pushing and pulling between two rods was built to
transferring the translational motion. There is a delay be-       model the effects of this fault. The impact between the two
tween the time instants the drive rod and the switch blades       rods is assumed to be elastic, that is, we use a spring-damper
start moving. This delay is controlled by setting the po-         assembly with large values for their parameters to model the
sitions of two bolts on the drive rod. A tighter bolt setting     contact. There are two types of contact: contact of the rods
means a smaller delay, while a looser bolt setting produces a     with the boundaries of the locking mechanism and contact
larger delay. The high level diagram of the adjuster is de-       between the rods. Both these types of contact must exhibit
picted in Figure 5. The most challenging part in construct-       non-sticking pushing and pulling properties.
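
One simple way to picture non-sticking contact is a stiff spring-damper that only ever pushes. The Python sketch below clamps the contact force at zero so the bodies can separate, and then uses two such contacts to represent a pin with clearance. It is an illustrative assumption, not the custom Modelica component built for the paper; the function names, the stiffness k, the damping c and the clearance parameter `play` are all hypothetical.

```python
def unilateral_contact(penetration, penetration_rate, k=1.0e7, c=1.0e4):
    """Stiff spring-damper contact force that can push but never pull.

    penetration > 0 means the bodies overlap; the force is clamped at
    zero so the contact releases instead of sticking.
    """
    if penetration <= 0.0:
        return 0.0
    return max(k * penetration + c * penetration_rate, 0.0)


def loose_pin_force(rel_pos, rel_vel, play, k=1.0e7, c=1.0e4):
    """Force transmitted from the driving rod to the driven rod through a
    pin with total clearance `play` (zero play gives a tight connection).

    rel_pos and rel_vel are the position and velocity of the driving rod
    relative to the driven rod.
    """
    half = play / 2.0
    push = unilateral_contact(rel_pos - half, rel_vel, k, c)    # forward side
    pull = unilateral_contact(-rel_pos - half, -rel_vel, k, c)  # backward side
    return push - pull


# Inside the clearance no force is transmitted, which appears as slack.
force_in_slack = loose_pin_force(rel_pos=0.0005, rel_vel=0.1, play=0.002)
force_in_contact = loose_pin_force(rel_pos=0.0015, rel_vel=0.0, play=0.002)
```

Increasing `play` widens the dead zone in which the driving rod moves without acting on the driven rod, which is one way to parameterize a fault amount for wear.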






Misaligned adjuster bolts
In this fault mode the bolts of the adjuster deviate from their
nominal position. As a result, the instant at which the drive
rod meets the adjuster (and therefore the instant at which the
switch rail starts moving) occurs either earlier or later.
For example in a left-to-right motion, if the left bolt moves
to the right, the contact happens earlier. The reason is that
since the distance between the two bolts decreases, the left
bolt reaches the adjuster faster. As a result, when the drive
rod reaches its final position, there may be a gap between
the right switch blade and the right stock rail. In contrast, if
the left bolt moves to the left the contact happens later. The
model of the adjuster includes parameters that can set the
positions of the bolts, and therefore the effects of this fault
mode can be modeled without difficulty.
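
The effect of bolt position on contact timing can be illustrated with the stroke relation stroke = R × sin(angle) from Section 3.1. The Python sketch below assumes that the cam rotates at a constant angular speed and that the drive rod must cover a given free travel before its bolt meets the adjuster. The function name and all numbers are hypothetical and only show the direction of the effect.

```python
import math


def contact_time(free_travel, cam_radius, cam_speed):
    """Time at which the drive rod has covered `free_travel`, assuming
    stroke(t) = cam_radius * sin(cam_speed * t) with constant cam_speed."""
    if not 0.0 <= free_travel <= cam_radius:
        raise ValueError("free travel must lie between 0 and the cam radius")
    return math.asin(free_travel / cam_radius) / cam_speed


# Hypothetical values: 0.1 m cam radius, 1 rad/s cam speed
nominal = contact_time(free_travel=0.03, cam_radius=0.1, cam_speed=1.0)
# The bolt has moved toward the adjuster: less free travel, earlier contact
earlier = contact_time(free_travel=0.02, cam_radius=0.1, cam_speed=1.0)
# The bolt has moved away from the adjuster: more free travel, later contact
later = contact_time(free_travel=0.04, cam_radius=0.1, cam_speed=1.0)
```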
                                                                         Figure 6: Motor torque with its five operating zones
Obstacle
In this fault mode, an obstacle prevents the switch blades
from reaching their final nominal position, and therefore a gap be-  where the drive rod catches up again with the switch blades and
tween the switch blades and the stock rail appears. The ef-          pushes them to their final position. Finally, in Zone 5 the
fect on the motor torque is a sudden increase in value, as the       switch blades are pushed against the stock rails for a short
motor tries to overcome the obstacle. To model this fault            period of time, hence the increase in torque. In support of
we included a component that implements a hard stop for              the validation of these five operating zones, a set of movies
the position of the switch blades. This component has two            depicting the motion of the switch blades was used. With
parameters for setting the left and right limits within which motion respect to the fault operating modes, we managed to gener-
of the switch blades is allowed. By changing the values of           ate effects in the simulated data similar to the ones observed
these parameters, the presence of an obstacle can be simu-           in the measured data. Figure 7 shows the effect of the mis-
lated.                                                               aligned bolts fault, and in particular the case where the left
                                                                     bolt moves to the left. The effect is a delay applied on the
Missing bearings                                                     time instant the drive rod reaches the switch blades. In ad-
To minimize friction, the rails are supported by a set of            dition, Zone 5 is also affected since due to the decreased
rolling bearings. When they become stuck or lost, the en-            distance, the switch blades are no longer pushed against the
ergy losses due to friction increase. As mentioned in the            stock rails. In the case of an obstacle, the switch blades (and
section describing the switch blades modeling, a component
was included to account for friction. This component has a
parameter that sets the value for the friction coefficient. By
increasing the value of this parameter, the effect of the miss-
ing bearings fault can be simulated.
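
A minimal Python sketch of how a single friction coefficient can stand in for this fault is given below, assuming the default Coulomb friction model. The coefficient values, normal force and travel distance are hypothetical, and the function names are illustrative rather than taken from the Modelica model.

```python
def coulomb_friction_force(velocity, mu, normal_force, v_eps=1e-6):
    """Coulomb friction force opposing the direction of motion.

    Returns zero when the body is (numerically) at rest; stiction is not
    modeled in this sketch.
    """
    if abs(velocity) < v_eps:
        return 0.0
    sign = 1.0 if velocity > 0.0 else -1.0
    return -sign * mu * normal_force


def friction_energy_loss(mu, normal_force, travel):
    """Energy dissipated by Coulomb friction over a travel distance."""
    return mu * normal_force * abs(travel)


# Hypothetical values: bearings present versus bearings missing
normal_force = 5000.0   # N
travel = 0.15           # m
loss_nominal = friction_energy_loss(0.02, normal_force, travel)
loss_faulty = friction_energy_loss(0.30, normal_force, travel)
```

Sweeping the coefficient between the two extremes gives a continuous fault amount, and the additional loss has to be supplied by the motor as extra torque.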

4 Model Validation
Motor angular velocity, cam angle and stroke, together with
the motor torque were used in the validation process. To
these measurements, we added the rail position that was
estimated from a set of movies depicting the rail motion,
to which image processing techniques were applied. We
achieved partial validation of the model. The simulated mo-
tor angular velocity, cam angle and stroke closely match
the measured data. The simulated motor torque, however,
matches its measured counterpart only qualitatively. The
main reason is that we had to make assumptions about
the type of motor and controller, with no way to                     Figure 7: Motor torque in the normal and misaligned bolts
validate these assumptions. In addition, the available mea-          fault modes
surements did not allow for estimating the parameters
in the assumed models, as this problem is ill-posed. Figure 6        hence the drive rod) push against an obstacle that does not
depicts the simulated torque, emphasizing the five operating         allow the completion of the motion. Therefore, the electric
zones. In Zone 1, the motor rotates the cam and the drive rod        motor develops the maximum allowable torque as seen in
moves freely. No contact with the switch blades takes place          Figure 8. In the case of the missing bearing fault mode, the
in this zone, and the (small) energy loss is due to friction in      motion friction of the switch blades increases, and hence
the mechanical components. Zone 2 corresponds to the case            the torque generated by the motor must accommodate this
where the drive rod pushes the two switch blades. The elas-          increase. We obtained this effect in simulation as shown in
ticity in the switch blades can be noticed in the torque profile     Figure 9. Finally, Figure 10 shows the effects of the lock-
in this zone. In Zone 3, the switch blades accelerate (as they       pin fault. The slackness introduced by the looseness of the
drop off the rolling bearings) and again the drive rod moves         pin induces a delay in the rail motion which also affects the
freely (note the drop in torque). Zone 4 depicts the case            behavior in Zone 5. In terms of the changes in the five op-






                                                                   effects in simulation. The choice of features described in the
                                                                   next section was supported by this understanding.

                                                                   5 Fault Detection and Diagnosis
                                                                   In the case of a railway switch, our measurements include
                                                                   the motor torque and motor angular velocity. As the switch
                                                                   moves from one extreme position to the other, these quan-
                                                                   tities are measured at a fixed sampling rate. Thus, we
                                                                   obtain a time series for each of the measurements. Let
                                                                   {τ (t1 ), . . . , τ (tN )} denote torque measured at time instants
                                                                   {t1 , . . . , tN }. Likewise, let {ω(t1 ), . . . , ω(tN )} denote the
                                                                   angular velocity. For simplicity’s sake, we denote the two
                                                                   time series of measurements by X. The diagnosis objective
                                                                   is to determine the underlying condition of the system from
                                                                   these time series. In other words, the objective is to deter-
                                                                   mine a classifier f : X → {N, F1 , F2 , F3 , F4 , F5 }, where
                                                                   N refers to the class label corresponding to the normal con-
Figure 8: Motor torque in the normal and obstacle fault            dition and F1 , F2 , F3 , F4 and F5 denote the class labels loose
modes                                                              bolt, tight bolt, loose lock-pin, missing bearings, and obsta-
                                                                   cle respectively.
                                                                       We adopt a machine learning approach to constructing the
                                                                   above mentioned classifier. The two main steps in building
                                                                   a machine learning classifier are feature selection and clas-
                                                                   sifier type selection. These two steps are discussed next.

                                                                   5.1 Feature selection
                                                                   As seen in Figure 6, the motor torque profile shows five dis-
                                                                   tinct operating zones. Moreover, we notice from Figures 7,
                                                                   8, 9 and 10 that a given fault’s impact on the torque pro-
                                                                   file seems limited to only some of the five zones. With this
                                                                   observation, our feature selection strategy is as follows.
                                                                     1. Identify the approximate time instants that define the
                                                                        boundaries of the five zones. For example, Zone 1 is
                                                                        defined to be between times 0.8 seconds and 2 seconds,
                                                                        Zone 2 is defined to be between times 2 seconds and 4.1
                                                                        seconds, and so on.
Figure 9: Motor torque in the normal and missing bearings fault modes
                                                                     2. Within each zone, compute a set of measures. An ex-
                                                                        ample of a measure is the total energy dissipated within
                                                                        the zone. This is computed as instantaneous power in-
                                                                        tegrated over the duration of the zone. The instanta-
                                                                        neous power is the product of instantaneous torque and
                                                                        angular velocity. Other examples of features include
                                                                        maximum and minimum torque values within the zone.
                                                                        The disclosure of the full set of measures used is not
                                                                        possible at this time for proprietary reasons. The fea-
                                                                        tures are normalized to have zero mean and unit stan-
                                                                        dard deviation.
                                                                   Note that it might be possible to combine one or more zones
                                                                   into one for feature selection.
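The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`zone_features`, `extract_features`, `standardize`) are hypothetical, only the Zone 1 and Zone 2 boundaries are taken from the text, the integral is approximated with the trapezoidal rule, and only the three measures named above (energy, maximum torque, minimum torque) are computed, since the full feature set is not disclosed.

```python
from statistics import mean, pstdev

# Zone boundaries in seconds. Only Zones 1 and 2 are given in the text;
# the remaining zones would be added in the same way.
ZONES = {"zone1": (0.8, 2.0), "zone2": (2.0, 4.1)}


def zone_features(t, torque, omega, start, end):
    """Return [energy, max torque, min torque] for samples with start <= t <= end."""
    idx = [i for i, ti in enumerate(t) if start <= ti <= end]
    if not idx:
        raise ValueError("no samples fall inside the requested zone")
    # Instantaneous power is the product of torque and angular velocity.
    power = [torque[i] * omega[i] for i in idx]
    # Energy: trapezoidal integration of power over the zone duration.
    energy = sum(
        0.5 * (power[k] + power[k + 1]) * (t[idx[k + 1]] - t[idx[k]])
        for k in range(len(idx) - 1)
    )
    zone_torque = [torque[i] for i in idx]
    return [energy, max(zone_torque), min(zone_torque)]


def extract_features(t, torque, omega, zones=ZONES):
    """Concatenate the per-zone measures into one feature vector."""
    features = []
    for start, end in zones.values():
        features.extend(zone_features(t, torque, omega, start, end))
    return features


def standardize(rows):
    """Scale every feature column to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    mus = [mean(c) for c in columns]
    sigmas = [pstdev(c) or 1.0 for c in columns]  # constant columns are left at 0
    return [[(x - m) / s for x, m, s in zip(row, mus, sigmas)] for row in rows]
```

Each torque and angular velocity time series yields one feature vector from `extract_features`; `standardize` is then applied to the collection of vectors before training.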

                                                                   5.2 Classifier selection
                                                                   To map the features to the classes, {N, F1 , F2 , F3 , F4 , F5 },
                                                                   we use machine learning. Examples of types of classifiers
                                                                   commonly used include k-nearest neighbors, support vec-
Figure 10: Motor torque in the normal and lock-pin fault           tor machines, neural networks and decision trees. We chose
modes                                                              Random Forest, an ensemble classifier, because of its ro-
                                                                   bustness to overfitting. For a more detailed discussion on
                                                                   the advantages of Random Forest, we refer the reader to
erating zones, the simulated behavior showed similar char-         [Breiman, 2001]. In addition, we also developed a binary
acteristics as in the case of the real data. The understanding     classifier for fault detection based on Alternating Decision
of these behaviors comes as a result of building the model,        Tree (AD Tree). The advantage of AD Tree is that the re-
augmenting the model with fault modes, and analyzing their         sults are human interpretable.






5.3 Results                                                          primarily due to confusion between missing bearings and
For each fault type, we introduce varying magnitudes of              normal. Figure 12 shows part of the fault detection AD
fault and simulate the switch model described earlier. The           Tree. A pink oval represents a feature node. Depending
fault magnitude is parameterized by a factor k which is var-         on the value of the feature, one of two branches is followed
ied over a prespecified range. A value of k equal to zero            until a leaf node is reached. Each edge that is traversed re-
corresponds to the normal case. Higher values of k correspond        sults in a score shown within the blue rectangles. For every
to the faulty cases. In addition, we also add representative         root to leaf traversal, the total score is the sum of the scores
noise to the measurements. Figure 11 shows some example              accumulated on each edge. For a given data sample, mul-
torque profiles generated by the simulation.                         tiple root to leaf paths may be traversed. In that case, the
                                                                     final score is the sum of the scores accumulated over all the
                                                                     paths. If the final score is negative, the decision is normal;
                                                                     otherwise the decision is abnormal.
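The scoring rule described for the AD Tree can be illustrated with a small sketch. It follows the usual alternating decision tree layout, in which scores are stored on prediction nodes (equivalently, on the edges leading to them) and every feature node attached to a visited prediction node is evaluated. The class names, feature names, thresholds and scores below are hypothetical and are not taken from Figure 12.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Prediction:
    """A score contribution, plus the feature tests hanging below it."""
    score: float
    splitters: List["Splitter"] = field(default_factory=list)


@dataclass
class Splitter:
    """A feature node: follow if_true when sample[feature] < threshold."""
    feature: str
    threshold: float
    if_true: Prediction
    if_false: Prediction


def total_score(node: Prediction, sample: Dict[str, float]) -> float:
    """Sum the scores over all root-to-leaf paths that the sample traverses."""
    score = node.score
    for sp in node.splitters:
        child = sp.if_true if sample[sp.feature] < sp.threshold else sp.if_false
        score += total_score(child, sample)
    return score


def decide(root: Prediction, sample: Dict[str, float]) -> str:
    """Negative final score means normal; otherwise abnormal."""
    return "normal" if total_score(root, sample) < 0 else "abnormal"


# Hypothetical two-level tree with two feature nodes under the root.
tree = Prediction(-0.5, [
    Splitter("max_torque_zone2", 1.0,
             Prediction(-0.7),
             Prediction(0.9, [Splitter("energy", 2.0,
                                       Prediction(-0.2), Prediction(0.4))])),
    Splitter("energy", 3.0, Prediction(-0.3), Prediction(0.6)),
])
```

Because each decision is a sum of a handful of signed contributions, the path scores can be read off directly to see which features pushed a sample toward the abnormal class.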

                                                                     Table 2: Fault detection confusion matrix on simulated data
                                                                                               Normal Abnormal
                                                                                   Normal        94.6        5.4
                                                                                 Abnormal        9.6        90.4


                                                                        Next, we test the classifiers on real data. A key prepro-
                                                                     cessing step is to compute a linear transformation that trans-
                                                                     forms the mean and standard deviation of the features of the
                                                                     nominal (normal) real data to make them equal to the mean
                                                                     and standard deviation of the features of the nominal simu-
                                                                     lated data. The same transformation is then applied on the
                                                                     real faulty data before testing with the ML classifier. We
                                                                     emphasize here that to compute the transformation we only
                                                                     require examples of real data showing normal behavior. We
Figure 11: Simulated torque measurements with added                  do not use any real fault data for training the ML classifier.
noise.                                                               Table 3 shows the fault detection results on real data. As
   The data generated is recorded and used to train and test         can be seen, we achieve a high accuracy of greater than 80
the machine learning classifier. We use leave-one-out cross-         percent. We also tested the multi-class random forest classi-
validation for training and testing the classifiers. In this ap-     fier to diagnose the various faults. We were able to diagnose
proach, one data sample is used for testing whereas all the          correctly all missing bearing faults but were unable to cor-
rest of the data is used for training. This is repeated un-          rectly diagnose the other faults.
til each data sample has been tested once. Table 1 shows
the confusion matrix for the simulated data described ear-              Table 3: Fault detection confusion matrix on real data
lier. The (i, j)th entry of the confusion matrix refers to the
percentage of cases where the true class was i but was clas-                                   Normal Abnormal
sified as j by the classifier. A matrix with 100 along all                         Normal        85.5       14.5
the diagonal entries would correspond to a perfect classifier.                    Abnormal        20         80
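The preprocessing step used before testing on real data can be written as a per-feature affine map. The sketch below is one plausible reading of that step, assuming the transformation is applied independently to each feature: it is fitted on nominal real data only, so that the transformed nominal features have the same mean and standard deviation as the nominal simulated features, and is then reused unchanged on real faulty data. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_alignment(real_nominal, sim_nominal):
    """Return one (scale, offset) pair per feature column.

    After applying x -> scale * x + offset, the nominal real features have
    the mean and standard deviation of the nominal simulated features.
    """
    params = []
    for real_col, sim_col in zip(zip(*real_nominal), zip(*sim_nominal)):
        mu_r, sd_r = mean(real_col), pstdev(real_col)
        mu_s, sd_s = mean(sim_col), pstdev(sim_col)
        scale = sd_s / sd_r if sd_r > 0 else 1.0  # guard constant columns
        params.append((scale, mu_s - scale * mu_r))
    return params


def apply_alignment(rows, params):
    """Apply the fitted transformation to any set of real feature vectors."""
    return [[a * x + b for x, (a, b) in zip(row, params)] for row in rows]
```

Only `real_nominal` is needed to fit the map, which matches the statement that no real fault data is used for training.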
In the results shown in Table 1, we observe some misclas-
sification between classes N and F4 . Recall that N is the
normal class and F4 is the missing bearing class. On fur-
ther investigation, we determined that the misclassification         6 Related Work
occurs between the normal data and data corresponding to             A malfunctioning railway switch assembly can have a high
low magnitudes of the missing bearing fault.                         impact on the railway transportation safety, and therefore
                                                                     the problem of diagnosing such systems has been addressed
                                                                     in other works. [Zattoni, 2006] proposes a detection sys-
Table 1: Fault diagnosis confusion matrix on simulated data          tem based on off-line processing of the armature current
                 N     F1      F2    F3    F4     F5                 and voltage. The system implements an algorithm that real-
         N      97.2    0       0     2    0.8     0                 izes a finite impulse response system designed on the basis
         F1       0    100      0     0     0      0                 of an H2 -norm criterion, and allows for detection of incre-
         F2       0     0      99     1     0      0                 mental faults (e.g., loss of lubrication, increasing obstruc-
         F3       9     0       4    87     0      0                 tions, etc.). The approach hinges on the availability of a
         F4      11     0       0     0    89      0                 validated model of the point machine, which was not the
                                                                     case in our setup. [Zhou et al., 2001; 2002] propose a re-
         F5       0     0       0     0     0     100
                                                                     mote monitoring system for railway point machines. The
                                                                     system includes a variety of sensors for acquiring trackside
  The binary classification or fault detection result using          data related to parameters such as, distance, driving force,
AD Tree is shown in Table 2. As in the multi-class classifi-         voltage, electrical noise, or temperature. The monitoring
cation case, the false positives (normal classified as abnor-        system logs data for offline analysis that offers detailed in-
mal), and false negatives (abnormal classified as normal) are        formation on the condition of the system in the form of event






Figure 12: Part of the fault detection AD Tree (feature nodes shown include Feature 4, Feature 5, Feature 6, max torque in zone 2, and total energy dissipated)


analysis and data trends. Hence, unlike in our setup, the fo-                           normal and abnormal behavior. This approach relies on a set
cus is on detection rather than isolation. In addition, due                             of sensor measurements such as motors, voltage, current or
to scalability constraints, our solution is based on the em-                            switch blade positions, not all of them being available in our
bedded sensors, no other sensor being added. In [Asada                                  case. In addition, the computation of the net energy requires
et al., 2013] a classification-based fault detection and diag-                          parameters of the electrical motor (armature resistance and
nosis algorithm is developed using measurements such as                                 motor shaft inertia) that again are not available in our setup.
drive force, electrical current and voltage. In particular, a                           In addition, unlike our diagnosis objective, the focus is on
classifier based on support vector machines is used. Our                                detecting abnormalities within the point machine.
work also uses classification for diagnosis, but considers a
wider variety of classifiers such as Multiclass Random For-                             7 Conclusions
est or Logitboosted Random Forest that were proved to be
more robust [Opitz and Maclin, 1999]. The classification                                The three main general approaches to developing diagnostic
step in [Asada et al., 2013] depends on a set of features ex-                           software (FDI, MBR, and ML) all have severe limitations in
tracted by applying the discrete wavelet transform on the                               many real-world applications. We believe we will see many
active power. This step is oblivious to the operating modes                             more hybrid approaches to diagnosis that include the best of
of the point machine, which we showed to be relevant in our                             these three approaches to build accurate diagnosers. The rail-
case. Hence, the diagnosis approach in [Asada et al., 2013]                             way switch is a critical and complex piece of equipment re-
is purely data driven. Since we had no access to current and                            quiring extremely high diagnostic accuracy (the main reason
voltage measurements this avenue for feature construction                               this project was initiated), and the approach outlined in this
was not available to us. Depending on the type of electri-                              paper was ultimately successful. Ultimate deployment of
cal motors, the current and the voltage could be computed                               this approach will depend on expanding the set of faults de-
from the angular velocity and torque, respectively. How-                                tected and on installation of more sensor-rich switches in
ever, knowledge of motor parameters is needed. [Asada                                   railroad infrastructures.
et al., 2013] consider two types of faults: underdriving and
overdriving of the drive rod. Overdriving refers to the case                            References
where the switch blades are pushed against the stock rails
due to misalignment, and a higher force than normal ap-                                 [Asada et al., 2013] T. Asada, C. Roberts, and T. Koseki.
pears between the stock rails and the switch blades. Over-                                An algorithm for improved performance of railway con-
driving maps to misaligned bolts, missing bearings and ob-                                dition monitoring equipment: Alternating-current point
stacles in our setup. All these fault modes exhibit higher                                machine case study. Transportation Research Part C:
forces than normal. Underdriving maps to a particular in-                                 Emerging Technologies, 30(0):81 – 92, 2013.
stance of the misaligned bolts fault (left bolt moves to the                            [Breiman, 2001] Leo Breiman. Random forests. Machine
left for example). Therefore, our solution differentiates be-                             learning, 45(1):5–32, 2001.
tween more possible causes of higher forces since we take
advantage of the particular signature these forces have in                              [de Kleer et al., 1992] J. de Kleer, A. Mackworth, and
each fault corresponding to overdriving. Another pure data-                                R. Reiter. Characterizing diagnoses and systems. Artificial
driven approach for railway point machine monitoring was                                   Intelligence, 56(2-3):197–222, 1992.
proposed in [Oyebande and Renfrew, 2002], where a net                                   [Gertler, 1998] J. Gertler. Fault-Detection and Diagnosis in
energy analysis technique was used to discriminate between                                Engineering Systems. New York: Marcel Dekker, 1998.






[Isermann, 1997] R. Isermann. Supervision, fault-detection
   and fault-diagnosis methods - An introduction. Control
   Engineering Practice, 5(5):639 – 652, 1997.
[Isermann, 2005] Rolf Isermann.          Model-based fault-
   detection and diagnosis - status and applications. Annual
   Reviews in Control, 29(1):71 – 85, 2005.
[Minhas et al., 2014] R. Minhas, J. de Kleer, I. Matei,
   B. Saha, B. Janssen, D.G. Bobrow, and T. Kurtoglu. Us-
   ing fault augmented Modelica model for diagnostics. In
   Proceedings of the 10th International Modelica Confer-
   ence, Dec 2014.
[Opitz and Maclin, 1999] David Opitz and Richard Maclin.
   Popular ensemble methods: an empirical study. Journal
   of Artificial Intelligence Research, 11:169–198, 1999.
[Oyebande and Renfrew, 2002] B.O. Oyebande and A.C.
   Renfrew. Condition monitoring of railway electric point
   machines. Electric Power Applications, IEE Proceedings
   -, 149(6):465–473, Nov 2002.
[Patton et al., 2000] Ron J. Patton, Paul M. Frank, and
   Robert N. Clark. Issues of Fault Diagnosis for Dynamic
   Systems. Springer-Verlag London, 2000.
[Tiller, 2001] Michael Tiller. Introduction to Physical Mod-
   eling with Modelica. Kluwer Academic Publishers, Nor-
   well, MA, USA, 2001.
[Zattoni, 2006] Elena Zattoni. Detection of incipient fail-
   ures by using an H2-norm criterion: Application to rail-
   way switching points. Control Engineering Practice,
   14(8):885 – 895, 2006.
[Zhou et al., 2001] F. Zhou, M. Duta, M. Henry, S. Baker,
   and C. Burton. Condition monitoring and validation
   of railway point machines. In Intelligent and Self-
   Validating Instruments – Sensors and Actuators (Ref. No.
   2001/179), IEE Seminar on, pages 6/1–6/7, Dec 2001.
[Zhou et al., 2002] F.B. Zhou, M.D. Duta, M.P. Henry,
   S. Baker, and C. Burton. Remote condition monitoring
   for railway point machine. In Railroad Conference, 2002
   ASME/IEEE Joint, pages 103–108, April 2002.








    Design of PD Observer-Based Fault Estimator Using a Descriptor Approach

                  Dušan Krokavec, Anna Filasová, Pavol Liščinský, Vladimír Serbák
                          Department of Cybernetics and Artificial Intelligence,
             Technical University of Košice, Faculty of Electrical Engineering and Informatics,
                                             Košice, Slovakia
             e-mail: {dusan.krokavec, anna.filasova, pavol.liscinsky, vladimir.serbak}@tuke.sk


                          Abstract                                    fault estimation, proportional multi-integral derivative es-
                                                                      timators are proposed in [7], [24].
     A generalized principle of PD faults observer de-                   Although the state observers for linear and nonlinear
     sign for continuous-time linear MIMO systems is                  systems received considerable attention, the descriptor de-
     presented in the paper. The problem addressed                    sign principles have not been studied extensively for non-
     is formulated as a descriptor system approach to                 singular systems. Modifying the descriptor observer design
     PD fault observers design, implying the asymp-                   principle [13], the first result giving sufficient design condi-
     totic convergence of both the state observer error and           tions, but for linear time-delay systems, can be found in [5].
     the fault estimate error. Presented in the sense of the          Reflecting the same problems concerning the observers for
     second Lyapunov method, an associated structure                  descriptor systems, linear matrix inequality (LMI) methods
     of linear matrix inequalities is outlined to possess             were presented e.g. in [9] but a hint of this method can be
     the observer asymptotic dynamic properties. The                  found in [23], [25]. The extension for a class of nonlinear
     proposed design conditions are verified by simu-                 systems which can be described by Takagi-Sugeno models
     lations in the numerical illustrative example.                   is presented in [12].
                                                                         Adapting the approach to the observer-based fault estima-
                                                                      tion for descriptor systems as well as its potential extension,
1   Introduction                                                      the main issue of this paper is to apply the descriptor prin-
As is well known, observer design is a hot research field ow-         ciple in PD fault observer design. Preferring LMI formula-
ing to its particular importance in observer-based control,           tion, the stability condition proofs use standard arguments
residual fault detection and fault estimation [1], where, es-         in the sense of Lyapunov principle for the design condi-
pecially from the stand point of the active fault tolerant con-       tions requiring to solve only LMIs without additional con-
trol (FTC) structures, the problem of simultaneous state and          straints. This presents a method designing the PD observa-
fault estimation is very eligible. In that sense various effec-       tion derivative and proportional gain matrices such that the
tive methods have been developed to take into account the             design is non-singular and ensures that the estimation error
faults effect on control structure reconfiguration and fault          dynamics has asymptotical convergence. From viewpoint
detection [16], [22]. The fault detection filters, usually re-        of application, although the descriptor principle is used, it
lying on the use of particular type of state observers, are           is not necessary to transform the system parameter into a
mostly used to produce fault residuals in FTC. Because it is          descriptor form or to use matrix inversions in design task
generally not possible in residuals to decouple totally fault         formulation. Despite a partly conservative form, the design
effects from the perturbation influence, different approaches         conditions can be transformed to LMIs with minimal num-
are used to tackle in part this conflict and to create residuals      ber of symmetric LMI variables.
that are as a rule zero in the fault free case, maximally sensi-         The paper is organized as follows. Placed after Introduc-
tive to faults, as well as robust to disturbances [2], [8]. Since     tion, Sec. 2 gives a basic description of the PD fault ob-
faults are detected usually by setting a threshold on the gen-        server and Sec. 3 presents design problem formulation in
erated residual signal, determination of an actual thresh-            the descriptor form for a standard Luenberger observer. A
old is often formulated in adaptive frames [3]. Generalized           new LMI structure, describing the PD fault observer design
method to solve the problem of actuator faults detection and          conditions, is theoretically explained in Sec 4. An example
isolation in over-actuated systems is given in [14], [15].            is provided to demonstrate the proposed approach in Sec. 5
   To estimate actuator faults for the linear time invariant          and Sec. 6 draws some conclusions.
systems without external disturbance the principles based                Used notations are conventional so that x^T , X^T de-
on adaptive observers are frequently used, which make es-             note transpose of the vector x and matrix X, respectively,
timation of actuator faults by integrating the system output          X = X^T > 0 means that X is a symmetric positive defi-
errors [25]. In particular, proportional-derivative (PD) ob-          nite matrix, ‖X‖∞ denotes the H∞ norm of the matrix X,
servers introduce a design freedom giving an opportunity              the symbol I_n represents the n-th order unit matrix, ρ(X)
for generating state and fault estimates with good sensitivity        and rank(X) indicate the eigenvalue spectrum and rank of
properties and improving the observer design performance              a square matrix X, IR denotes the set of real numbers and
[6], [18], [19]. Since derivatives of the system outputs can          IRn , IRn×r refer to the set of all n-dimensional real vectors
be exploited in the fault estimator design to achieve faster          and n × r real matrices, respectively.






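2   The Problem Statement

The systems under consideration are linear continuous-time dynamic systems represented in state-space form as
\[ \dot{q}(t) = A q(t) + B u(t) + F f(t), \tag{1} \]
\[ y(t) = C q(t), \tag{2} \]
where $q(t) \in \mathbb{R}^{n}$, $u(t) \in \mathbb{R}^{r}$, $y(t) \in \mathbb{R}^{m}$ are the vectors of the state, input and output variables, $f(t) \in \mathbb{R}^{p}$ is the fault vector, $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times r}$, $C \in \mathbb{R}^{m \times n}$ and $F \in \mathbb{R}^{n \times p}$ are real matrices with finite values, $m, r, p < n$ and
\[ \operatorname{rank} \begin{bmatrix} A & F \\ C & 0 \end{bmatrix} = n + p. \tag{3} \]
It is considered that the fault $f(t)$ may occur at an uncertain time, that the size of the fault is unknown but bounded, and that the pair $(A, C)$ is observable.

Focusing on the fault estimation task for slowly-varying faults, the fault PD observer is considered in the following form [19]
\[ \dot{q}_e(t) = A q_e(t) + B u(t) + F f_e(t) + J\,(y(t) - y_e(t)) + L\,(\dot{y}(t) - \dot{y}_e(t)), \tag{4} \]
\[ y_e(t) = C q_e(t), \tag{5} \]
\[ \dot{f}_e(t) = M\,(y(t) - y_e(t)) + N\,(\dot{y}(t) - \dot{y}_e(t)), \tag{6} \]
where $q_e(t) \in \mathbb{R}^{n}$, $y_e(t) \in \mathbb{R}^{m}$, $f_e(t) \in \mathbb{R}^{p}$ are estimates of the system state vector, the output variables vector and the fault vector, respectively, and $J, L \in \mathbb{R}^{n \times m}$, $M, N \in \mathbb{R}^{p \times m}$ are the observer gain matrices to be determined.

To explain and concretize the obtained results, the following well known lemma on the Schur complement property is suitable.

Lemma 1. [20] Considering the matrices $Q = Q^{T}$, $R = R^{T}$ and $S$ of appropriate dimensions, where $\det R \neq 0$, then the following statements are equivalent
\[ \begin{bmatrix} Q & S \\ S^{T} & -R \end{bmatrix} < 0 \;\Leftrightarrow\; Q + S R^{-1} S^{T} < 0, \; R > 0. \tag{7} \]

This shows that the above block matrix inequality has a solution if the implying set of inequalities has a solution.

3   Descriptor Principle in Luenberger Observer Design

To formulate the proposed PD observer design approach, the descriptor principle in the observer stability analysis is presented.

If the fault-free system (1), (2) is considered, the Luenberger observer is given as
\[ \dot{q}_e(t) = A q_e(t) + B u(t) + J\,(y(t) - y_e(t)), \tag{8} \]
\[ y_e(t) = C q_e(t), \tag{9} \]
and using (1), (2), (8), (9), it yields
\[ \dot{e}(t) = (A - JC)\,e(t), \tag{10} \]
\[ (A - JC)\,e(t) - \dot{e}(t) = 0, \tag{11} \]
respectively, where
\[ e_q(t) = q(t) - q_e(t). \tag{12} \]

Using the descriptor principle, the following lemma presents the Luenberger observer design conditions in terms of LMIs for the fault-free system (1), (2).

Lemma 2. The Luenberger observer (8), (9) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $P_1 \in \mathbb{R}^{n \times n}$, a regular matrix $P_3 \in \mathbb{R}^{n \times n}$ and a matrix $Y \in \mathbb{R}^{n \times m}$ such that
\[ P_1 = P_1^{T} > 0, \tag{13} \]
\[ \begin{bmatrix} A^{T} P_3 + P_3^{T} A - Y C - C^{T} Y^{T} & * \\ P_1 - P_3 + \delta P_3^{T} A - \delta Y C & -\delta (P_3 + P_3^{T}) \end{bmatrix} < 0. \tag{14} \]
When the above conditions hold, the observer gain matrix $J$ is given as
\[ J = (P_3^{T})^{-1} Y. \tag{15} \]
Hereafter, $*$ denotes the symmetric item in a symmetric matrix.

Proof. Denoting the observer system matrix as
\[ A_e = A - JC, \tag{16} \]
then with the equality
\[ \dot{e}(t) = \dot{e}(t) \tag{17} \]
the equivalent form of (11) can be written as
\[ \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \dot{e}(t) \\ \ddot{e}(t) \end{bmatrix} = \begin{bmatrix} \dot{e}(t) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & I_n \\ A_e & -I_n \end{bmatrix} \begin{bmatrix} e(t) \\ \dot{e}(t) \end{bmatrix}, \tag{18} \]
or, more generally,
\[ E^{\diamond} \dot{e}^{\diamond}(t) = A_e^{\diamond} e^{\diamond}(t), \tag{19} \]
where
\[ e^{\diamond T}(t) = \begin{bmatrix} e^{T}(t) & \dot{e}^{T}(t) \end{bmatrix}, \tag{20} \]
\[ E^{\diamond} = E^{\diamond T} = \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}, \qquad A_e^{\diamond} = \begin{bmatrix} 0 & I_n \\ A_e & -I_n \end{bmatrix}. \tag{21} \]
Defining the Lyapunov function of the form
\[ v(e^{\diamond}(t)) = e^{\diamond T}(t) E^{\diamond T} P^{\diamond} e^{\diamond}(t) > 0, \tag{22} \]
where
\[ E^{\diamond T} P^{\diamond} = P^{\diamond T} E^{\diamond} \geq 0, \tag{23} \]
then the derivative of (22) becomes
\[ \dot{v}(e^{\diamond}(t)) = \dot{e}^{\diamond T}(t) E^{\diamond T} P^{\diamond} e^{\diamond}(t) + e^{\diamond T}(t) P^{\diamond T} E^{\diamond} \dot{e}^{\diamond}(t) < 0 \tag{24} \]
and, inserting (19) in (24), it yields
\[ \dot{v}(e^{\diamond}(t)) = e^{\diamond T}(t) \left( P^{\diamond T} A_e^{\diamond} + A_e^{\diamond T} P^{\diamond} \right) e^{\diamond}(t) < 0, \tag{25} \]
\[ P^{\diamond T} A_e^{\diamond} + A_e^{\diamond T} P^{\diamond} < 0, \tag{26} \]
respectively. Introducing the matrix
\[ P^{\diamond} = \begin{bmatrix} P_1 & P_2 \\ P_3 & P_4 \end{bmatrix}, \tag{27} \]
then, with respect to (23), it has to be
\[ \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}^{T} \begin{bmatrix} P_1 & P_2 \\ P_3 & P_4 \end{bmatrix} = \begin{bmatrix} P_1^{T} & P_3^{T} \\ P_2^{T} & P_4^{T} \end{bmatrix} \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix} \geq 0, \tag{28} \]
which gives
\[ \begin{bmatrix} P_1 & P_2 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} P_1^{T} & 0 \\ P_2^{T} & 0 \end{bmatrix} \geq 0. \tag{29} \]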






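It is evident that (29) can be satisfied only if
\[ P_1 = P_1^{T} > 0, \qquad P_2 = P_2^{T} = 0. \tag{30} \]
After simple algebraic operations, (26) can be transformed into the following form
\[ \begin{bmatrix} A_e^{T} P_3 + P_3^{T} A_e & * \\ P_1 - P_3 + P_4^{T} A_e & -P_4 - P_4^{T} \end{bmatrix} < 0 \tag{31} \]
and, owing to the products $P_3^{T} A_e$, $P_4^{T} A_e$ that emerge in (31), the restriction on the structure of $P_4$ can be enunciated as
\[ P_4 = \delta P_3, \tag{32} \]
where $\delta > 0$, $\delta \in \mathbb{R}$. Since now
\[ P_4^{T} A_e = \delta P_3^{T} (A - JC), \tag{33} \]
then, with the notation
\[ Y = P_3^{T} J, \tag{34} \]
(31) implies (14). This concludes the proof.

Remark 1. It is natural to point out that the symmetrical form of Lemma 2, defined for $P_1 = P$, $P_3 = P_3^{T} = Q$, is an equivalent inequality to the enhanced Lyapunov inequality for Luenberger observer design [11].

The above results can be generalized to formulate the descriptor principle in fault PD observer design. The main reason is to eliminate matrix inverse notations from the design conditions.

4   PD Observer Design

If the observer errors between the system state vector and the observer state vector, as well as between the fault vector and the vector of its estimate, are defined as follows
\[ e_q(t) = q(t) - q_e(t), \tag{35} \]
\[ e_f(t) = f(t) - f_e(t), \tag{36} \]
then, for slowly-varying faults, it is reasonable to consider [12]
\[ \dot{e}_f(t) = 0 - \dot{f}_e(t) = -M C e_q(t) - N C \dot{e}_q(t). \tag{37} \]
Note that, since $f_e(t)$ can be obtained as the integral of $\dot{f}_e(t)$, an adapting parameter matrix $G$ can be adjusted interactively to set the amplitude of $f_e(t)$, i.e., as a result it is
\[ f_e(t) = G \int_{0}^{t} \dot{f}_e(\tau)\, d\tau. \tag{38} \]
To express the time derivative of the system state error $e_q(t)$, the equations (1), (4) together with (2), (5) can be integrated as
\[ \dot{e}_q(t) = A_e e_q(t) + F e_f(t) - L C \dot{e}_q(t), \tag{39} \]
where $A_e$ is given in (16) and the PD observer system matrix is
\[ A_{PDe} = (I_n + LC)^{-1} A_e = (I_n + LC)^{-1} (A - JC). \tag{40} \]

Since (37), (39) can be rewritten in the following composed form
\[ \begin{bmatrix} \dot{e}_q(t) \\ \dot{e}_f(t) \end{bmatrix} = \begin{bmatrix} A_e & F \\ -MC & 0 \end{bmatrix} \begin{bmatrix} e_q(t) \\ e_f(t) \end{bmatrix} - \begin{bmatrix} LC & 0 \\ NC & 0 \end{bmatrix} \begin{bmatrix} \dot{e}_q(t) \\ \dot{e}_f(t) \end{bmatrix}, \tag{41} \]
by denoting
\[ e^{\circ T}(t) = \begin{bmatrix} e_q^{T}(t) & e_f^{T}(t) \end{bmatrix}, \tag{42} \]
\[ A^{\circ} = \begin{bmatrix} A & F \\ 0 & 0 \end{bmatrix}, \qquad J^{\circ} = \begin{bmatrix} J \\ M \end{bmatrix}, \qquad L^{\circ} = \begin{bmatrix} L \\ N \end{bmatrix}, \tag{43} \]
\[ I^{\circ} = \begin{bmatrix} I_n & 0 \\ 0 & I_p \end{bmatrix}, \qquad C^{\circ} = \begin{bmatrix} C & 0 \end{bmatrix}, \tag{44} \]
where $A^{\circ}, I^{\circ} \in \mathbb{R}^{(n+p) \times (n+p)}$, $J^{\circ}, L^{\circ} \in \mathbb{R}^{(n+p) \times m}$, $C^{\circ} \in \mathbb{R}^{m \times (n+p)}$, then the equation (41) can be written as
\[ (I^{\circ} + L^{\circ} C^{\circ})\, \dot{e}^{\circ}(t) = (A^{\circ} - J^{\circ} C^{\circ})\, e^{\circ}(t), \tag{45} \]
\[ A_e^{\circ} e^{\circ}(t) - D_e^{\circ} \dot{e}^{\circ}(t) = 0, \tag{46} \]
respectively, where
\[ A_e^{\circ} = A^{\circ} - J^{\circ} C^{\circ}, \qquad D_e^{\circ} = I^{\circ} + L^{\circ} C^{\circ}. \tag{47} \]
Introducing the equality
\[ \dot{e}^{\circ}(t) = \dot{e}^{\circ}(t), \tag{48} \]
in analogy to the equation (18), then (48), (46) can be written as
\[ \begin{bmatrix} I^{\circ} & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \dot{e}^{\circ}(t) \\ \ddot{e}^{\circ}(t) \end{bmatrix} = \begin{bmatrix} \dot{e}^{\circ}(t) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & I^{\circ} \\ A_e^{\circ} & -D_e^{\circ} \end{bmatrix} \begin{bmatrix} e^{\circ}(t) \\ \dot{e}^{\circ}(t) \end{bmatrix}. \tag{49} \]
Thus, by denoting
\[ E^{\bullet} = \begin{bmatrix} I^{\circ} & 0 \\ 0 & 0 \end{bmatrix}, \qquad A_e^{\bullet} = \begin{bmatrix} 0 & I^{\circ} \\ A_e^{\circ} & -D_e^{\circ} \end{bmatrix}, \qquad e^{\bullet}(t) = \begin{bmatrix} e^{\circ}(t) \\ \dot{e}^{\circ}(t) \end{bmatrix}, \tag{50} \]
the obtained descriptor form of the PD fault observer is
\[ E^{\bullet} \dot{e}^{\bullet}(t) = A_e^{\bullet} e^{\bullet}(t), \tag{51} \]
where $A_e^{\bullet}, E^{\bullet} \in \mathbb{R}^{2(n+p) \times 2(n+p)}$.

The following solvability theorem is proposed for the design of the PD fault observer in the structure given in (4)-(6).

Theorem 1. The PD fault observer (4)-(6) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $P_1^{\circ} \in \mathbb{R}^{(n+p) \times (n+p)}$, a regular matrix $P_3^{\circ} \in \mathbb{R}^{(n+p) \times (n+p)}$ and matrices $Y^{\circ} \in \mathbb{R}^{(n+p) \times m}$, $Z^{\circ} \in \mathbb{R}^{(n+p) \times m}$ such that
\[ P_1^{\circ} = P_1^{\circ T} > 0, \tag{52} \]
\[ \begin{bmatrix} A^{\circ T} P_3^{\circ} + P_3^{\circ T} A^{\circ} - Y^{\circ} C^{\circ} - C^{\circ T} Y^{\circ T} & * \\ V_{21}^{\circ} & V_{22}^{\circ} \end{bmatrix} < 0, \tag{53} \]
where
\[ V_{21}^{\circ} = P_1^{\circ} - P_3^{\circ} + \delta P_3^{\circ T} A^{\circ} - \delta Y^{\circ} C^{\circ} - C^{\circ T} Z^{\circ T}, \tag{54} \]
\[ V_{22}^{\circ} = -\delta P_3^{\circ} - \delta P_3^{\circ T} - \delta Z^{\circ} C^{\circ} - \delta C^{\circ T} Z^{\circ T}. \tag{55} \]
If the above conditions hold, the set of observer gain matrices is given by the equations
\[ J^{\circ} = (P_3^{\circ T})^{-1} Y^{\circ}, \qquad L^{\circ} = (P_3^{\circ T})^{-1} Z^{\circ} \tag{56} \]
and the matrices $J$, $L$, $M$, $N$ can be separated with respect to (43).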
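The separation of the extended gains with respect to (43) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the dimensions n = 4, p = 2, m = 2 and the numerical values of J°, L° and C are those reported in the illustrative example of Section 5, and the helper names `split_gain`, `matmul` and `det` are hypothetical, not part of the paper. The sketch also checks numerically that I_n + LC is regular, which is what the existence of the matrix in (40) requires.

```python
# Sketch: split the extended gains J°, L° into (J, M) and (L, N) as in (43)
# and check that I_n + L C is regular, as needed for (40).
# Assumption: n = 4, p = 2, m = 2 and all numbers come from Section 5.

def split_gain(g_ext, n):
    """Return the first n rows and the remaining rows of a stacked gain."""
    return [row[:] for row in g_ext[:n]], [row[:] for row in g_ext[n:]]

def matmul(x, y):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]

def det(m):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    size, d = len(a), 1.0
    for k in range(size):
        piv = max(range(k, size), key=lambda i: abs(a[i][k]))
        if abs(a[piv][k]) < 1e-12:
            return 0.0
        if piv != k:
            a[k], a[piv] = a[piv], a[k]
            d = -d
        d *= a[k][k]
        for i in range(k + 1, size):
            f = a[i][k] / a[k][k]
            for j in range(k, size):
                a[i][j] -= f * a[k][j]
    return d

n = 4
J_ext = [[0.8777, -1.5720], [-0.0801, 0.5621], [-0.0624, 3.7385],
         [0.1229, 2.2486], [-0.0021, 0.3934], [-0.0767, 0.1649]]
L_ext = [[0.7391, -2.0549], [0.0663, -0.6605], [-0.7915, 3.4010],
         [0.1994, 1.3731], [-0.0000, 0.2244], [-0.0488, 0.1187]]
C = [[4.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

J, M = split_gain(J_ext, n)   # J is n x m, M is p x m
L, N = split_gain(L_ext, n)   # L is n x m, N is p x m

LC = matmul(L, C)             # n x n product
D = [[LC[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
     for i in range(n)]       # I_n + L C
det_D = det(D)
is_regular = abs(det_D) > 1e-9
```

A nonzero `det_D` only confirms regularity of I_n + LC for these particular numbers; it does not replace the LMI conditions of the theorem.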






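Proof. Defining the Lyapunov function of the form
\[ v(e^{\bullet}(t)) = e^{\bullet T}(t) E^{\bullet T} P^{\bullet} e^{\bullet}(t) > 0, \tag{57} \]
where
\[ E^{\bullet T} P^{\bullet} = P^{\bullet T} E^{\bullet} \geq 0, \tag{58} \]
then, using the property (58), the time derivative of (57) along the trajectory of (51) becomes
\[ \dot{v}(e^{\bullet}(t)) = \dot{e}^{\bullet T}(t) E^{\bullet T} P^{\bullet} e^{\bullet}(t) + e^{\bullet T}(t) P^{\bullet T} E^{\bullet} \dot{e}^{\bullet}(t) < 0. \tag{59} \]
Thus, substituting (51) into (59), it yields
\[ \dot{v}(e^{\bullet}(t)) = e^{\bullet T}(t) \left( P^{\bullet T} A_e^{\bullet} + A_e^{\bullet T} P^{\bullet} \right) e^{\bullet}(t) < 0, \tag{60} \]
which implies
\[ P^{\bullet T} A_e^{\bullet} + A_e^{\bullet T} P^{\bullet} < 0. \tag{61} \]
Defining the Lyapunov matrix
\[ P^{\bullet} = \begin{bmatrix} P_1^{\circ} & P_2^{\circ} \\ P_3^{\circ} & P_4^{\circ} \end{bmatrix}, \tag{62} \]
in analogy with (29), then (58) implies
\[ P_1^{\circ} = P_1^{\circ T} > 0, \qquad P_2^{\circ} = P_2^{\circ T} = 0 \tag{63} \]
and, using (50) and (62), (63) in (61), it yields
\[ \begin{bmatrix} 0 & A_e^{\circ T} \\ I^{\circ} & -D_e^{\circ T} \end{bmatrix} \begin{bmatrix} P_1^{\circ} & 0 \\ P_3^{\circ} & P_4^{\circ} \end{bmatrix} + \begin{bmatrix} P_1^{\circ} & P_3^{\circ T} \\ 0 & P_4^{\circ T} \end{bmatrix} \begin{bmatrix} 0 & I^{\circ} \\ A_e^{\circ} & -D_e^{\circ} \end{bmatrix} < 0. \tag{64} \]
After some algebraic manipulations, (64) takes the following form
\[ \begin{bmatrix} U_1^{\bullet} & U_2^{\bullet T} \\ U_2^{\bullet} & U_3^{\bullet} \end{bmatrix} < 0, \tag{65} \]
where, with the notation (47),
\[ U_1^{\bullet} = (A^{\circ} - J^{\circ} C^{\circ})^{T} P_3^{\circ} + P_3^{\circ T} (A^{\circ} - J^{\circ} C^{\circ}), \tag{66} \]
\[ U_2^{\bullet} = P_4^{\circ T} (A^{\circ} - J^{\circ} C^{\circ}) + P_1^{\circ} - P_3^{\circ} - C^{\circ T} L^{\circ T} P_3^{\circ}, \tag{67} \]
\[ U_3^{\bullet} = -P_4^{\circ} - P_4^{\circ T} - P_4^{\circ T} L^{\circ} C^{\circ} - C^{\circ T} L^{\circ T} P_4^{\circ}. \tag{68} \]
By setting
\[ P_4^{\circ} = \delta P_3^{\circ}, \qquad Y^{\circ} = P_3^{\circ T} J^{\circ}, \qquad Z^{\circ} = P_3^{\circ T} L^{\circ}, \tag{69} \]
where $\delta > 0$, $\delta \in \mathbb{R}$, then (65)-(68) imply (53)-(55).

Writing (68) as follows
\[ U_3^{\bullet} = -P_4^{\circ T} (I^{\circ} + L^{\circ} C^{\circ}) - (I^{\circ} + L^{\circ} C^{\circ})^{T} P_4^{\circ} = -R^{\bullet} \tag{70} \]
and comparing (7) and (65), then, if the inequalities (52)-(53) are satisfied, the Schur complement property (7) applied to (65) implies that $R^{\bullet}$ is positive definite.

Since $P_4^{\circ}$ is regular, $(I^{\circ} + L^{\circ} C^{\circ})$ is also regular and so $A_{PDe}$ given by (40) exists. This concludes the proof.

Since there is no restriction on the structure of $P_3^{\circ}$ in Theorem 1, it follows that the problem of checking the existence of a stable system matrix of the PD adaptive fault observer in a given matrix space may also be formulated with symmetric matrices $P_1^{\circ}$ and $P_3^{\circ}$. This limit case of the LMI structure design condition, bound to a single symmetric matrix, is given by the following theorem.

Theorem 2. The PD observer (4)-(6) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $Q^{\circ} \in \mathbb{R}^{(n+p) \times (n+p)}$ and matrices $Y^{\circ} \in \mathbb{R}^{(n+p) \times m}$, $Z^{\circ} \in \mathbb{R}^{(n+p) \times m}$ such that
\[ Q^{\circ} = Q^{\circ T} > 0, \tag{71} \]
\[ \begin{bmatrix} A^{\circ T} Q^{\circ} + Q^{\circ} A^{\circ} - Y^{\circ} C^{\circ} - C^{\circ T} Y^{\circ T} & * \\ W_{21}^{\circ} & W_{22}^{\circ} \end{bmatrix} < 0, \tag{72} \]
where
\[ W_{21}^{\circ} = \delta Q^{\circ} A^{\circ} - \delta Y^{\circ} C^{\circ} - C^{\circ T} Z^{\circ T}, \tag{73} \]
\[ W_{22}^{\circ} = -2 \delta Q^{\circ} - \delta Z^{\circ} C^{\circ} - \delta C^{\circ T} Z^{\circ T}. \tag{74} \]
If the above conditions are satisfied, the extended observer gain matrices are given by the equations
\[ J^{\circ} = (Q^{\circ})^{-1} Y^{\circ}, \qquad L^{\circ} = (Q^{\circ})^{-1} Z^{\circ}. \tag{75} \]

Proof. Since there is no restriction on the structure of $P_3^{\circ}$, it can be set
\[ P_1^{\circ} = P_3^{\circ} = P_3^{\circ T} = Q^{\circ} > 0 \tag{76} \]
and the conditioned structure of $P_4^{\circ}$, with respect to $P_3^{\circ}$ and $A_e^{\circ}$, can be taken into account as
\[ P_4^{\circ} = \delta P_3^{\circ} = \delta Q^{\circ}, \tag{77} \]
where $\delta > 0$, $\delta \in \mathbb{R}$. If these conditions are incorporated into (66)-(68), then
\[ P_3^{\circ T} A_e^{\circ} = Q^{\circ} (A^{\circ} - J^{\circ} C^{\circ}) = Q^{\circ} A^{\circ} - Y^{\circ} C^{\circ}, \tag{78} \]
\[ P_4^{\circ T} L^{\circ} C^{\circ} = \delta P_3^{\circ T} L^{\circ} C^{\circ} = \delta Q^{\circ} L^{\circ} C^{\circ} = \delta Z^{\circ} C^{\circ}, \tag{79} \]
where
\[ Y^{\circ} = Q^{\circ} J^{\circ}, \qquad Z^{\circ} = Q^{\circ} L^{\circ}. \tag{80} \]
Thus, with these modifications, (65)-(68) imply (72)-(74). This concludes the proof.

Note that the design conditions formulated in Theorem 2 give potentially more conservative solutions.

5   Illustrative Example

The considered system is represented by the model (1), (2) with the model parameters [10]
\[ A = \begin{bmatrix} 1.380 & -0.208 & 6.715 & -5.676 \\ -0.581 & -4.290 & 0.000 & 0.675 \\ 1.067 & 4.273 & -6.654 & 5.893 \\ 0.048 & 4.273 & 1.343 & -2.104 \end{bmatrix}, \]
\[ B = \begin{bmatrix} 0.000 & 0.000 \\ 5.679 & 0.000 \\ 1.136 & -3.146 \\ 1.136 & 0.000 \end{bmatrix}, \qquad C = \begin{bmatrix} 4 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \]
To consider single actuator faults it was set $F = B$, and the matrix variables $Q^{\circ}$, $Y^{\circ}$, $Z^{\circ}$ satisfying (71)-(74) for $\delta = 0.75$ were as follows
\[ Q^{\circ} = \begin{bmatrix} Q_1^{\circ} & Q_2^{\circ} \end{bmatrix}, \]
\[ Q_1^{\circ} = \begin{bmatrix} 0.1737 & 0.0012 & 0.1409 \\ 0.0012 & 0.1615 & 0.0195 \\ 0.1409 & 0.0195 & 0.1794 \\ -0.1316 & 0.0252 & -0.1439 \\ -0.0118 & -0.1975 & -0.0464 \\ 0.1461 & -0.0026 & 0.1557 \end{bmatrix}, \]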





                                             
                 −0.1316 −0.0118       0.1461                                       2.5
               0.0252 −0.1975 −0.0026 
                                                                                   2
                                                                                                                                                 f
                                                                                                                                                 1
               −0.1439 −0.0464        0.1557 
       Q◦2 =                                 ,                                                                                                 f2
               0.2177 −0.1136 −0.1255                                             1.5
               −0.1136      1.4490 −0.1904                                                                                                     fe1




                                                                     f(t), fe(t)
                                                                                     1
                 −0.1255 −0.1904       1.3479                                                                                                    fe2
                                                                                  0.5
                       0.1162 −0.0220
                     −0.0094     0.1404                                            0
                                         
               ◦     0.0814      0.1439                                          −0.5
             Y =                           ,
                     −0.0719     0.1072                                                0   20   40   60   80          100   120   140   160         180
                     0.0060      0.0171                                                                         t[s]
                       0.0003     0.2159
                                        
                      −0.0164 −0.0445                                                 Figure 1: Adaptive fault estimator responses
                     0.0013 −0.0528 
                                        
               ◦     −0.0728     0.1181 
             Z =                           ,
                     0.0678      0.0229 
                                                                     Although many actuator faults can cause the gain to drift,
                     0.0015      0.1434                          in practice the faults lead to an abrupt change in gain [21].
                      −0.1062     0.1758                           To simulate this phenomena, it was assumed that the fault in
                                                                   actuators for (1) was given by
                           [  ]
where the SeDuMi package 17 was used to solve given set                            
of LMIs.                                                                                        0,          t ≤ tsa ,
                                                                                   
                                                                                   
   The PD observer extended matrix gains are then com-                             
                                                                                    tsb −t
                                                                                           fh
                                                                                                 (t − tsa ), t sa < tsb ,
puted using (56) as                                                        f (t) =
                                                                                              sa
                                                                                                fh ,         tsb ≤ tca ,
                                                                                 
                                                                                   
                       0.8777 −1.5720                                              
                                                                                      − fh (t − tcb ), tca < tcb ,
                                                                                    tcb −tca
                     −0.0801     0.5621                                                        0,          t ≥ tcb ,
                                        
               ◦     −0.0624     3.7385 
             J =                           ,
                     0.1229      2.2486 
                                                                  where, analyzing the single first actuator fault estimation, it
                     −0.0021     0.3934                          was set
                      −0.0767     0.1649
                                                                   fh = 2, tsa = 30s, tsb = 35s, tea = 65s, teb = 70s ,
                       0.7391 −2.0549
                     0.0663 −0.6605                              and for the single second actuator fault these parameters
                                                                 were
                     −0.7915     3.4010 
             L◦ =                          .
                     0.1994      1.3731 
                                         
                     −0.0000     0.2244                          fh = 2, tsa = 100s, tsb = 105s, tea = 135s, teb = 140s .
                      −0.0488     0.1187
Verifying the PD observer system matrix eigenvalue spectrum, the results were
\[ \rho(A_e) = \{ -0.7731,\ -2.8914,\ -4.7816,\ -8.9188 \} , \]
\[ \rho(A_{PDe}) = \{ -1.1194,\ -1.6912,\ -1.9969,\ -2.9765 \} . \]
That means the PD observer is stable, and its "P" part is stable, too. Moreover, the descriptor
form (45) of the PD observer is also stable, where
\[ \rho\left( (I^{\circ} + L^{\circ} C^{\circ})^{-1} (A^{\circ} - J^{\circ} C^{\circ}) \right) =
   \{ -1.7763,\ -2.0966,\ -0.6629 \pm 0.7872\,i,\ -1.3632 \pm 0.4931\,i \} . \]
Comparing with a solution of (52)-(55) for δ = 0.95, it is possible to verify that in this case
\[ \rho(A_e) = \{ -6.8230,\ -10.3876,\ -81.5789,\ -472.0230 \} , \]
\[ \rho(A_{PDe}) = \{ -0.9562,\ -0.9774,\ -7.2561,\ -9.8300 \} , \]
\[ \rho\left( (I^{\circ} + L^{\circ} C^{\circ})^{-1} (A^{\circ} - J^{\circ} C^{\circ}) \right) =
   \{ -1.0240,\ -1.0748,\ -6.4810,\ -9.1501,\ -0.9650 \pm 0.0068\,i \} , \]
which implies in this case faster dynamics of the descriptor form of the PD observer but slower
dynamics of the PD observer. Note that using δ = 0.75 leads in this case to an unstable "P" part
of the PD observer.

It is demonstrated that, for equal fh in the first and the second actuator faults, it is possible
for given B to adjust the common adapting parameter matrix G in (38) as follows
\[ G = \begin{bmatrix} 40.0 & 5.9 \\ 5.9 & 22.0 \end{bmatrix} . \]
The obtained results are illustrated in Fig. 1 where, purely for rendering purposes, all fault
responses and their estimates were combined into a single image, so the demonstration cannot be
seen as a progressive sequence of single faults in the actuator system. The figure presents the
fault signals, as well as their estimates, reflecting the single first actuator fault starting at
the time instant t = 30s and applied for 40s, followed by the fault of the second actuator, which
begins at the time instant t = 100s and lasts for 40s. The presented simulation was carried out
in the system autonomous mode; practically the same results were obtained for the forced regime
of the system.
   The adapting parameter G and the tuning parameter δ were set interactively considering the
maximal value of the fault signal amplitude fh and the fault observer dynamics. It can be seen
that there exist very small differences between the signals reflecting single actuator faults and
the observer approximate ones for slowly varying piecewise constant actuator faults. The principle
can be used directly in control structures with fault compensation [4], but cannot be directly
used to localize actuator faults [14].
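
The stability and speed statements made about the eigenvalue spectra above can be checked with a few lines of code. The following is only a minimal sketch: the numbers are copied from the spectra reported in this example, while the dictionary layout, the labels "first design" and "delta = 0.95", and the helper names `is_stable` and `slowest_mode` are assumptions introduced for illustration.

```python
# Eigenvalue spectra copied from the numerical example; complex pairs are
# written out explicitly. Labels and helper names are illustrative only.
SPECTRA = {
    "first design": {
        "A_e": [-0.7731, -2.8914, -4.7816, -8.9188],
        "A_PDe": [-1.1194, -1.6912, -1.9969, -2.9765],
        "descriptor": [
            -1.7763, -2.0966,
            complex(-0.6629, 0.7872), complex(-0.6629, -0.7872),
            complex(-1.3632, 0.4931), complex(-1.3632, -0.4931),
        ],
    },
    "delta = 0.95": {
        "A_e": [-6.8230, -10.3876, -81.5789, -472.0230],
        "A_PDe": [-0.9562, -0.9774, -7.2561, -9.8300],
        "descriptor": [
            -1.0240, -1.0748, -6.4810, -9.1501,
            complex(-0.9650, 0.0068), complex(-0.9650, -0.0068),
        ],
    },
}


def is_stable(spectrum):
    """A continuous-time spectrum is stable if all real parts are negative."""
    return all(complex(z).real < 0.0 for z in spectrum)


def slowest_mode(spectrum):
    """Largest real part, i.e. the eigenvalue that decays most slowly."""
    return max(complex(z).real for z in spectrum)


for design, spectra in SPECTRA.items():
    for name, spectrum in spectra.items():
        print(design, name, is_stable(spectrum), slowest_mode(spectrum))
```

Under this reading, the slowest mode of the descriptor form moves from -0.6629 to -0.9650 (faster), while that of the PD observer matrix moves from -1.1194 to -0.9562 (slower), which agrees with the comparison stated in the text.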




                         Proceedings of the 26th International Workshop on Principles of Diagnosis


6     Concluding Remarks

Based on the descriptor system approach, a new PD fault observer design method for continuous-time
linear systems and slowly-varying actuator faults is introduced in the paper. The presented version
is derived in terms of optimization over LMI constraints, using standard LMI numerical procedures
to manipulate the fault observer stability and fault estimation dynamics. Presented in the sense of
the second Lyapunov method expressed through LMI formulation, the design conditions guarantee the
asymptotic convergence of the state as well as fault estimation errors. The numerical simulation
results show good estimation performance.

Acknowledgments

The work presented in this paper was supported by VEGA, the Grant Agency of the Ministry of
Education and the Academy of Science of Slovak Republic, under Grant No. 1/0348/14. This support
is very gratefully acknowledged.

References
[1]  M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki. Diagnosis and Fault-Tolerant Control,
     Springer-Verlag, Berlin, 2006.
[2]  W. Chen and M. Saif. Observer-based strategies for actuator fault detection, isolation and
     estimation for certain class of uncertain nonlinear systems. IET Control Theory &
     Applications, 1(6):1672-1680, 2007.
[3]  S. Ding. Model-Based Fault Diagnosis Techniques. Design Schemes, Algorithms, and Tools,
     Springer-Verlag, Berlin, 2008.
[4]  A. Filasová and D. Krokavec. Design principles of active robust fault tolerant control
     systems. Robust Control. Theory and Applications, A. Bartoszewicz (Ed.), InTech, Rijeka,
     309-338, 2011.
[5]  E. Fridman and U. Shaked. A descriptor system approach to H∞ control of linear time-delay
     systems. IEEE Transactions on Automatic Control, 47(2):253-270, 2002.
[6]  Z. Gao. PD observer parametrization design for descriptor systems. Journal of the Franklin
     Institute, 342(5):551-564, 2005.
[7]  Z. Gao and S.X. Ding. Fault estimation and fault-tolerant control for descriptor systems via
     proportional, multiple-integral and derivative observer design. IET Control Theory &
     Applications, 1(5):1208-1218, 2007.
[8]  J. Guo, X. Huang, and Y. Cui. Design and analysis of robust fault detection filter using LMI
     tools. Computers and Mathematics with Applications, 57(11-12):1743-1747, 2009.
[9]  K. Ilhem, J. Dalel, B.H.A. Saloua, and A.M. Naceur. Observer design for Takagi-Sugeno
     descriptor system with Lipschitz constraints. International Journal of Instrumentation and
     Control Systems, 2(2):13-25, 2012.
[10] J. Kautsky, N. K. Nichols, and P. Van Dooren. Robust pole assignment in linear state
     feedback. International Journal of Control, 41(5):1129-1155, 1985.
[11] D. Krokavec and A. Filasová. On observer design methods for a class of Takagi-Sugeno fuzzy
     systems. Proceedings of the 3rd International Conference on Control, Modelling, Computing
     and Applications CMCA 2014, Dubai, UAE, 1-11, 2014.
[12] D. Krokavec and A. Filasová. Design of PD observer-based fault estimator for a class of
     Takagi-Sugeno descriptor systems. The 1st IFAC Conference on Modelling, Identification and
     Control of Nonlinear Systems MICNOM-2015, Saint-Petersburg, Russia, 2015. (in press)
[13] G. Lu, D.W.C. Ho, and Y. Zheng. Observers for a class of descriptor systems with Lipschitz
     constraint. Proceedings of the 2004 American Control Conference, Boston, MA, USA,
     3474-3479, 2004.
[14] N. Meskin and K. Khorasani. Fault detection and isolation of actuator faults in overactuated
     systems. Proceedings of the 2007 American Control Conference, New York City, NY, USA,
     2527-2532, 2007.
[15] N. Meskin and K. Khorasani. Actuator fault detection and isolation for a network of unmanned
     vehicles. IEEE Transactions on Automatic Control, 54(4):835-840, 2009.
[16] R.J. Patton and S. Klinkhieo. Actuator fault estimation and compensation based on an
     augmented state observer approach. Proceedings of the 48th IEEE Conference on Decision and
     Control, Shanghai, P.R. China, 8482-8487, 2009.
[17] D. Peaucelle, D. Henrion, Y. Labit, and K. Taitz. User's Guide for SeDuMi Interface,
     LAAS-CNRS, Toulouse, France, 2002.
[18] J. Ren and Q. Zhang. PD observer design for descriptor systems. An LMI approach.
     International Journal of Control, Automation, and Systems, 8(4):735-740, 2010.
[19] F. Shi and R.J. Patton. Simultaneous state and fault estimation for descriptor systems using
     an augmented PD observer. Preprints of 19th IFAC World Congress, Cape Town, South Africa,
     8006-8011, 2014.
[20] J.G. VanAntwerp and R.D. Braatz. A tutorial on linear and bilinear matrix inequalities.
     Journal of Process Control, 10:363-385, 2000.
[21] H. Wang and S. Daley. Actuator fault diagnosis. An adaptive observer-based technique. IEEE
     Transactions on Automatic Control, 41(7):1073-1078, 1996.
[22] F. Zhang, G. Liu, and L. Fang. Actuator fault estimation based on adaptive H∞ observer
     technique. Proceedings of the 2009 IEEE International Conference on Mechatronics and
     Automation, Changchun, P.R. China, 352-357, 2009.
[23] K. Zhang, B. Jiang, and V. Cocquempot. Adaptive observer-based fast fault estimation.
     International Journal of Control, Automation, and Systems, 6(3):320-326, 2008.
[24] K. Zhang, B. Jiang, and P. Shi. Fast fault estimation and accommodation for dynamical
     systems. IET Control Theory & Applications, 3(2):189-199, 2009.
[25] K. Zhang, B. Jiang, and P. Shi. Observer-Based Fault Estimation and Accommodation for
     Dynamic Systems, Springer-Verlag, Berlin, 2013.








               Chronicle based alarm management in startup and shutdown stages
John W. Vasquez1,3,4, Louise Travé-Massuyès1,2, Audine Subias1,3, Fernando Jimenez4 and Carlos Agudelo5
1 CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
2 Univ de Toulouse, LAAS, F-31400 Toulouse, France
3 Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
4 Universidad de los Andes, Colombia.
5 ECOPETROL ICP, Colombia.
e-mail: jwvasque@laas.fr, louise@laas.fr, subias@laas.fr,
fjimenez@uniandes.edu.co, carlos.agudelo@ecopetrol.com.co
Abstract

The transitions between operational modes (startup/shutdown) in chemical processes generate alarm
floods and cause critical alarm saturation. We propose in this paper an alarm management approach
based on a diagnosis process. This diagnosis step relies on situation recognition to provide the
operators with relevant information on the failures inducing the alarm flows. The situation
recognition is based on chronicle recognition. We propose to use the information issued from the
modeling of the system to generate temporal runs from which the chronicles are extracted. An
illustrative example in the field of petrochemical plants ends the article.

1 Introduction
The losses of the petrochemical industries have been estimated at 20 billion dollars each year in
the U.S. alone, and AEM (Abnormal Events Management) has been classified as a problem that needs to
be solved. Hence alarm management is one of the aspects of great interest in the safety planning
for the different plants. In process state transitions such as startup and shutdown stages, the
alarm flood increases and generates critical conditions in which the operator does not respond
efficiently; a dynamic alarm management is then required [1]. Currently, many fault detection and
diagnosis techniques for multimode processes have been proposed; however, these techniques cannot
indicate fundamental faults in the basic alarm system [2]. On the other hand, the technical report
"Advance Alarm System Requirements" of EPRI (The Electric Power Research Institute) suggests a
cause-consequence and event-based processing. In this perspective, diagnosis approaches based on
complex event processing or situation recognition are of interest. Therefore, in this paper, a
dynamic alarm management strategy is proposed in order to deal with alarm floods happening during
transitions of chemical processes. This approach relies on situation recognition (i.e. chronicle
recognition). As the efficiency of alarm management approaches depends on the operator expertise
and process knowledge, our final objective is to develop a diagnosis approach as a decision tool
for operators. The paper is divided into 6 sections. Section 2 gives an overview of the relevant
literature. Section 3 concerns the modeling of the system. Section 4 is about the chronicle
principle and the temporal runs used for the chronicle design. Section 5 is devoted to the
chronicle generation. Finally, an illustrative application on real data from a petrochemical plant
is given in Section 6.

2 State of the art: Alarm management
Alarm management has recently attracted the attention of many researchers on themes such as:
   Alarm historian visualization and analysis: A combined analysis of plant connectivity and alarm
logs to reduce the number of alerts in an automation system was presented by [3]; the aim of that
work is to reduce the number of alerts presented to the operator. If alarms are related to one
another, those alarms should be grouped and presented as one alarm problem. Graphical tools for
routine assessment of industrial alarm systems were proposed by [4], who presented two new alarm
data visualization tools for the performance evaluation of the alarm systems, known as the high
density alarm plot (HDAP) and the alarm similarity color map (ASCM). Event correlation analysis
and a two-layer cause-effect model were used to reduce the number of alarms in [5]. A Bayesian
method has been introduced for multimode process monitoring in [6]. This type of technique helps
to recognize alarm chattering, to group many alarms or to estimate the alarm limits in transition
stages, but the time and the procedure actions are not included.
   Process data-based alarm system analysis and rationalization: The evaluation of plant alarm
systems by behavior simulation using a virtual subject was proposed by [7]. [8] introduced a
technique for optimal design of alarm limits by analyzing the correlation between process
variables and alarm variables. In 2009 a framework based on the receiver operating characteristic
(ROC) curve was proposed to optimally design alarm limits, filters, dead bands, and delay timers
[9], and a dynamic risk analysis methodology that uses alarm databases to improve process safety
and product quality was presented in [10]. In [11] the Gaussian mixture model was employed to
extract a series of operating modes from the historical process data, and then the local statistic
and its normalized contribution chart were derived for detecting abnormalities early and for
isolating faulty variables. We can see that virtual subjects can be used to probe the alarm system
and that historical information about the alarm behavior can be used for detecting abnormalities.
Problems arise when the simulation requires a lot of time to probe the totality of scenarios and
when new plants have no historical data.





   Plant connectivity and process variable causality analysis (causal methods): In the literature,
transition monitoring of chemical processes has been reported by many researchers. A fault
diagnosis strategy for the startup process based on standard operating procedures was presented in
[12]; this approach proposes a behavior observer combined with dynamic PCA (Principal Component
Analysis) to estimate process faults and operator errors at the same time. A framework for managing
transitions in chemical plants was presented in [13], where a trend analysis-based approach for
locating and characterizing the modes and transitions in historical data is proposed. Finally, in
[14] a hybrid model-based framework was used for alarm anticipation, where the user can prepare for
the possibility of a single alarm occurrence. For transition monitoring, these types of techniques
are the most used in industrial processes, and the hybrid model-based framework could be a good
representation of our system. We can observe that a causal model allows one to identify the root
of the failures and to check the correct evolution in a transitional stage. Our proposal is closer
to the third type of approach and seeks to exploit the causal relationships as presented in the
next sections.

3 Representation of the system
3.1 Hybrid Causal Model
The hybrid system is represented by an extended transition system [15], whose discrete states
represent the different modes of operation for which the continuous dynamics are characterized by
a qualitative domain. Formally, a hybrid causal system is defined as a tuple:
\[ \Gamma = (\vartheta, D, Conf, Tr, E, CSD, Init) \qquad (1) \]
where
 • $\vartheta = \{v_i\}$ is a set of continuous process variables which are functions of time $t$.
 • $D$ is a set of discrete variables, $D = Q \cup K \cup V_Q$. $Q$ is a set of states $q_i$ of
   the transition system which represent the system operation modes. The set of auxiliary discrete
   variables $K = \{K_i, i = 1, \ldots, n_c\}$ represents the system configuration in each mode
   $q_i$ as defined below by $Conf(q_i)$. $V_Q = \{V_i\}$ is a set of qualitative variables whose
   values are obtained from the behavior of each continuous variable $v_i$.
 • $Conf(q_i) : Q \to \otimes_i D(K_i)$, where $\otimes$ is the Cartesian product and $D(K_i)$ is
   the domain of $K_i \in K$, provides the configuration associated to the mode, i.e. the modes of
   the underlying multimode components (typically, a valve has two normal modes, opened and
   closed).
 • $E = \Sigma \cup \Sigma_c$ is a finite set of event types noted $\sigma$, where:
      – $\Sigma$ is the set of event types associated to the procedure actions in a startup or
        shutdown stage.
      – $\Sigma_c$ is the set of event types associated to the behavior of the continuous process
        variables.
 • $Tr : Q \times \Sigma \to Q$ is the transition function. The transition from mode $q_i$ to mode
   $q_j$ with associated event $\sigma$ is noted $(q_i, \sigma, q_j)$ or
   $q_i \xrightarrow{\sigma} q_j$. We assume, without loss of generality, that the model is
   deterministic, i.e. whenever $q_i \xrightarrow{\sigma} q_j$ and $q_i \xrightarrow{\sigma} q_k$
   then $q_j = q_k$ for each $(q_i, q_j, q_k) \in Q^3$ and each $\sigma \in \Sigma$.
 • $CSD \supseteq \bigcup_i CSD_i$ is the Causal System Description or the causal model used to
   represent the constraints underlying the continuous dynamics of the hybrid system. Every
   $CSD_i$ associated to a mode $q_i$ is given by a graph $(G_c = \vartheta \cup K, I)$. $I$ is
   the set of influences, where there is an edge $e(v_i, v_j) \in I$ from $v_i \in \vartheta$ to
   $v_j \in \vartheta$ if the variable $v_i$ influences variable $v_j$. The vertices thus represent
   the variables and the edges represent the influences between variables, and each edge is
   associated with a component in the system. The set of components is noted as $COMP$.
 • $Init$ is the initial condition of the hybrid system.

3.2 Qualitative abstraction of continuous behavior
In each mode of operation, variables evolve according to the corresponding dynamics. This evolution
is represented with qualitative values. The domain $D(V_i)$ of a qualitative variable
$V_i \in V_Q$ is obtained through the function $f_{qual} : D(v_i) \to D(V_i)$ that maps the
continuous values of variable $v_i$ to ranges defined by limit values (High $H_i$ and Low $L_i$).
\[
f_{qual}(v_i) =
\begin{cases}
V_i^H & \text{if } v_i \geq H_i \wedge \frac{dv_i}{dt} > 0 \\
V_i^M & \text{if } \left( v_i < H_i \wedge \frac{dv_i}{dt} < 0 \right) \vee
        \left( v_i \geq L_i \wedge \frac{dv_i}{dt} > 0 \right) \\
V_i^L & \text{if } v_i < L_i \wedge \frac{dv_i}{dt} < 0
\end{cases}
\qquad (2)
\]
$\frac{dv_i}{dt} > 0$ represents that the continuous variable $v_i$ is increasing and
$\frac{dv_i}{dt} < 0$ that it is decreasing. The behavior of these qualitative variables is
represented in Figure 1 by the graph $G_{V_i} = (V_Q, \Sigma_c, \gamma)$ where $V_Q$ is the set of
the possible qualitative states ($V_i^L$: Low, $V_i^M$: Medium, $V_i^H$: High) of the continuous
variable $v_i$, $\Sigma_c$ is the finite set of the events associated to the transitions and
$\gamma : V_Q \times \Sigma_c \to V_Q$ is the transition function.

    Figure 1: Qualitative values of the process variables

The corresponding event generator is defined by the abstraction function $f_{V_Q \to \sigma}$
\[
f_{V_Q \to \sigma} : V_Q \times \gamma(V_Q, \Sigma_c) \to \Sigma_c
\]
\[
\forall V_i \in V_Q,\; (V_i^n, V_i^m) \to
\begin{cases}
l^+(v_i) & \text{if } V_i^L \to V_i^M \\
l^-(v_i) & \text{if } V_i^M \to V_i^L \\
h^+(v_i) & \text{if } V_i^M \to V_i^H \\
h^-(v_i) & \text{if } V_i^H \to V_i^M
\end{cases}
\qquad V_i^n, V_i^m \in \{V_i^L, V_i^M, V_i^H\}
\qquad (3)
\]
\[
\Sigma_c = \bigcup_{v_i \in \vartheta} \{ l^+(v_i),\, l^-(v_i),\, h^+(v_i),\, h^-(v_i) \}
\qquad (4)
\]
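
The deterministic transition function Tr of Section 3.1 can be illustrated with a small lookup table. This is only a sketch under stated assumptions: the mode names `q0`, `q1`, `q2`, the action names `a1`, `a2` and the helpers `build_transition_function` and `run_procedure` are hypothetical and do not come from the paper.

```python
from typing import Dict, Iterable, List, Tuple

Mode = str
Event = str


def build_transition_function(
    triples: Iterable[Tuple[Mode, Event, Mode]]
) -> Dict[Tuple[Mode, Event], Mode]:
    """Build Tr: Q x Sigma -> Q from triples (q_i, sigma, q_j).

    Determinism is enforced: a pair (q_i, sigma) may lead to one mode only.
    """
    tr: Dict[Tuple[Mode, Event], Mode] = {}
    for q_i, sigma, q_j in triples:
        key = (q_i, sigma)
        if key in tr and tr[key] != q_j:
            raise ValueError(f"non-deterministic transition from {q_i} on {sigma}")
        tr[key] = q_j
    return tr


def run_procedure(
    tr: Dict[Tuple[Mode, Event], Mode], init: Mode, actions: Iterable[Event]
) -> List[Mode]:
    """Return the sequence of modes visited when applying procedure actions."""
    modes = [init]
    for sigma in actions:
        modes.append(tr[(modes[-1], sigma)])
    return modes


# Hypothetical three-mode startup procedure.
tr = build_transition_function([("q0", "a1", "q1"), ("q1", "a2", "q2")])
visited = run_procedure(tr, "q0", ["a1", "a2"])
print(visited)
```

Storing Tr as a dictionary keyed by (mode, event) makes the determinism assumption explicit, since a second target for the same key is rejected when the table is built.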




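
The qualitative abstraction of equation (2) and the event generator of equations (3)-(4) can be sketched for a sampled signal as follows. This is a simplified illustration, not the authors' implementation: it assumes that the sign of the derivative is implicit in the direction in which a limit is crossed between two samples, and the names `qualitative_state`, `generate_events` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple

ORDER = ["L", "M", "H"]  # qualitative states ordered from Low to High

# Event type emitted for each elementary transition, following equation (3).
EVENT_OF_TRANSITION = {
    ("L", "M"): "l+",
    ("M", "L"): "l-",
    ("M", "H"): "h+",
    ("H", "M"): "h-",
}


def qualitative_state(value: float, low: float, high: float) -> str:
    """Map a continuous value to L, M or H using the limits L_i and H_i."""
    if value >= high:
        return "H"
    if value < low:
        return "L"
    return "M"


def generate_events(
    name: str, samples: Sequence[Tuple[int, float]], low: float, high: float
) -> List[Tuple[str, int]]:
    """Return the dated events (event type, date) produced by a sampled signal."""
    events: List[Tuple[str, int]] = []
    current = ORDER.index(qualitative_state(samples[0][1], low, high))
    for date, value in samples[1:]:
        target = ORDER.index(qualitative_state(value, low, high))
        step = 1 if target > current else -1
        # A jump over two states (e.g. L -> H) is split into elementary transitions.
        while current != target:
            nxt = current + step
            label = EVENT_OF_TRANSITION[(ORDER[current], ORDER[nxt])]
            events.append((f"{label}({name})", date))
            current = nxt
    return events


# Hypothetical samples (date, value) of a variable v1 with L = 2 and H = 5.
samples = [(0, 1.0), (1, 2.5), (2, 6.0), (3, 4.0), (4, 0.5)]
events = generate_events("v1", samples, low=2.0, high=5.0)
print(events)
```

Each event carries an integer date, so the output can be read as a temporal sequence of event types in the set defined by equation (4).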


3.3 Automatic derivation of the causal model
Obtaining the causal model of a system in a given operating mode requires collecting the equations
that represent the behavior of the system in this mode. The theory of causal ordering issued from
the Qualitative Reasoning community can be applied to obtain automatically the causal structure
associated to a set of equations. Associating activation conditions to the equations extends the
causal ordering to systems with several operating modes [16]. These activation conditions can then
be related to the influences of the resulting causal graph. The proposed algorithm, implemented in
the Causalito software, makes use of conditions that avoid recomputing a totally new perfect
matching for every operating mode, thus reducing the computational cost. In this work, the Causal
System Description is given by $CSD = (\vartheta, I)$, where each influence of $I$ is labeled with:
   • An activation condition indicating the modes in which it is active (or no label if it is
     active in all modes),
   • The corresponding equation,
   • The component whose behavior is expressed by the equation.
In the following section we expose the principle of the chronicle generation, where concepts such
as event, chronicle and temporal run are described.

4 Chronicles
4.1 Events and chronicles
Let us consider time as a linearly ordered discrete set of instants. The occurrence of different
events in time represents the system dynamics, and a model can be determined to diagnose the
correct evolution. An event is defined as a pair $(\sigma_i, t_i)$, where $\sigma_i \in E$ is an
event type and $t_i$ is a variable of integer type called the event date. We define $E$ as the set
of all event types, and a temporal sequence on $E$ is an ordered set of events denoted
$S = \langle (\sigma_i, t_i) \rangle_j$ with $j \in N_l$, where $l$ is the size of the temporal
sequence $S$ and $N_l$ is a finite set of linearly ordered time points of cardinal $l$. A chronicle
is a triplet $C = (\xi, C_T, G)$ such that $\xi \subseteq E$ and $C_T$ is the set of temporal
constraints. $G = (N, I_t)$ is a directed graph where $N$ represents event types of $E$ and the
arcs $I_t$ represent the relationship between events of $E$: if the event $\sigma_1$ occurs $t$
time units after $\sigma_2$, then there exists a directed link from $\sigma_1$ to $\sigma_2$
associated with a time constraint. Considering the two events $(\sigma_i, t_i)$ and
$(\sigma_j, t_j)$, we define the time interval as the pair $\tau_{ij} = [t^-, t^+]$,
$\tau_{ij} \in C_T$, corresponding to the lower and upper bounds on the temporal distance between
the two event dates $t_i$ and $t_j$ [17]. The idea of our proposal is to design the chronicles from
the hybrid causal model of the system. Indeed, the evolution of the system can be captured with
temporal runs from which chronicles can be learned (see Figure 2). More precisely, the system
starts in the state $q_0$ and evolves according to the transitions resulting from the events
defined by the procedure actions for specific scenarios (startup/shutdown). For a given system
mode $q_i \in Q$, the associated $CSD_i$ is used to generate the set of event types corresponding
to the evolution of the continuous process variables. A run is defined by a sequence of event
types $\alpha_1, \alpha_2, \ldots, \alpha_n$, where $\alpha_i \in E$, generated for each scenario
using the startup/shutdown procedures. These runs with time constraints make it possible to
construct the chronicle database of the system. In this preliminary approach, time constraints are
obtained by simulation.

    Figure 2: Principle of chronicle generation

4.2 Temporal runs
We denote a temporal run as $\langle R, T \rangle$ where $R$ is a run and $T$ is the time graph of
the run that includes the time constraints $C_T$ between each pair of time points at which the
event types must occur. Figure 3 gives time graph examples and the possible composition of time
graphs.

    Figure 3: Time graphs example

In our approach the runs are issued from the system evolution from one operation mode to another.
The interleaved sequence of event types $\alpha_1, \alpha_2, \ldots, \alpha_n$ represents the
procedure actions and the behavior evolution of the process variables. The time constraints between
each pair of event types are determined by simulation of the continuous behavior for each process
variable, responding to the procedure actions.

5 Generation of Chronicles
5.1 Chronicle database
An industrial or complex process $Pr$ is composed of different areas
$Pr = \{Ar_1, Ar_2, \ldots, Ar_n\}$ where each area $Ar_k$ has different operational modes such as
startup, shutdown, slow march, fast march, etc. The set $C_{Ar_k}$ of chronicles $C_{ij}^k$ for
each area $Ar_k$ is presented in the matrix below, where the rows represent the operating modes
(i.e. O1 : Startup, O2 : Shutdown, O3 : Startuptype , O4 : Startuptype , etc)






and the columns the different faults.

\[
C_{Ar_k} =
\begin{array}{c|ccccc}
       & N        & f_1      & f_2      & \cdots & f_n      \\ \hline
O_1    & C^k_{01} & C^k_{11} & C^k_{21} & \cdots & C^k_{i1} \\
O_2    & C^k_{02} & C^k_{12} & C^k_{22} & \cdots & C^k_{i2} \\
O_3    & C^k_{03} & C^k_{13} & C^k_{23} & \cdots & C^k_{i3} \\
O_4    & C^k_{04} & C^k_{14} & C^k_{24} & \cdots & C^k_{i4} \\
\vdots & \vdots   & \vdots   & \vdots   &        & \vdots   \\
O_j    & C^k_{0j} & C^k_{1j} & C^k_{2j} & \cdots & C^k_{ij}
\end{array}
\tag{5}
\]

The chronicle database used for diagnosis is composed of the
entries of all the matrices {C_{Ar_k}}. This chronicle database
is submitted to a chronicle recognition system that identifies,
in an observable flow of events, all the possible matches with
the set of chronicles, from which the situation (normal or
faulty) can be assessed.

5.2 Chronicle learning

As explained previously, when the system changes its mode of
operation, a set of event types occurs, forming a run R.
Because this evolution is due to procedure actions, more than
one temporal run can occur. Hence, we need to establish the
maximal number of temporal runs that could occur in each
scenario represented in the matrix (5). To obtain the chronicle
for each scenario, it is necessary to obtain the largest time
graph, with as many event types as possible and with the
minimal values of the constraints. [18] proposes to determine
the chronicles from the temporal runs. They define a partial
order relation between two temporal runs ⟨R, T⟩ and ⟨R′, T′⟩,
which holds when the set of event types in R′ is a subset of
the event types in R and the time graphs T and T′ are related
so that the resulting graph contains a unique equivalent
constraint, which is the minimal one. The relation expresses
that the set of constraints in the time graph T′ is a subset of
the constraints in T, CT(t, t′) ⊆ CT′(t, t′). Therefore, we
apply the composition (see Figure 3) between the time graphs in
order to merge the constraints, obtaining the largest and most
constrained time graph, which represents the chronicle in that
scenario. Figure 4 gives an example of a chronicle generation
from a maximal temporal run. In the next section a case study
is presented in which the chronicle generation from the
temporal runs is illustrated.

Figure 4: Chronicle example

6 Case study

6.1 HTG (Hydrostatic Tank Gauging) system

New units and elements are currently being implemented in the
Cartagena Refinery. In the startup stage, operators will need a
tool to help them recognize dangerous conditions. We analyze
the startup and shutdown stages in the water injection unit.
This process is an HTG (Hydrostatic Tank Gauging) system
composed of the following components: one tank (TK), two
normally closed valves (V1 and V2), one pump (Pu), a level
sensor (LT), a pressure sensor (PT), an inflow sensor (FT1) and
an outflow sensor (FT2); see Figure 5.

Figure 5: Process diagram

Assuming this system as a hybrid causal model, the underlying
discrete event system and the different process operation modes
are described in Figure 6, where we can see a possible correct
evolution for the startup procedure. The events V1_{c,o} and
V2_{c,o} represent the valves V1 and V2 moving from the closed
state to the open state; conversely, the events V1_{o,c} and
V2_{o,c} represent the valves moving from the open state to the
closed state. The event Pu_{fn} indicates that the pump Pu is
turned on and the event Pu_{nf} indicates that the pump Pu is
turned off.

6.2 Identification of causal relationships

The level (L) in the tank is related to the weight (m) of the
liquid inside, its density (ρ) and the tank area (A). The
density (ρ) is derived from the relationship between the
pressures (P_med, P_inf) measured at separated points (h). Based
on the global material balance, we define that the input flow
is equal to the outlet flow. Then, the variation of the weight
(dm(t)/dt) in the tank is proportional to the difference between
the inflow (Qi_TK) and the outflow (Qo_V2). The differential
pressures in the pump and in V2 are denoted ΔP_Pu and ΔP_V2. The
outlet pressure of the pump (Po) is related to the tank outlet
flow (Qo_TK), the revolutions per minute of the pump (RPM_Pu),
its capacity (C) and the radius of the outlet pipe (r). The
control of the outflow (Qo_V2) and inflow (Qi_TK) is related to
the percentage aperture of the valves V1 (LV1) and V2 (LV2) and
to the differential pressures (ΔP_V1, ΔP_V2). In Figure 7 we can
see the CSD of the system in the modes q1, q5 and q7. For
example, the mode q1 activates the influence of Qi_TK on L. The
mode q5 activates the influence of Qi_TK on L and the influence
of L on Po, and finally the mode q7 activates the influence of
Qi_TK on L, of L on Po and of Po on Qo_V2.
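
The material balance described in Section 6.2 can be illustrated with a minimal numerical sketch. The relation L = m / (ρ·A), the unit proportionality constant in dm/dt = Qi_TK − Qo_V2, the function name and all numerical values below are assumptions made for illustration only; they are not taken from the refinery model.

```python
def simulate_level(q_in, q_out, m0=0.0, rho=1000.0, area=2.0, dt=1.0):
    """Integrate dm/dt = q_in - q_out (mass flows in kg/s) with forward Euler
    and return the tank level L = m / (rho * area) after each time step.
    rho (kg/m^3), area (m^2) and dt (s) are hypothetical values."""
    m = m0
    levels = []
    for qi, qo in zip(q_in, q_out):
        m = max(0.0, m + (qi - qo) * dt)   # the stored mass cannot be negative
        levels.append(m / (rho * area))
    return levels

# Hypothetical startup: inflow only for 5 steps (V1 open), then the pump and
# V2 are started and the outflow balances the inflow.
inflow = [20.0] * 10
outflow = [0.0] * 5 + [20.0] * 5
levels = simulate_level(inflow, outflow)
print(levels[4], levels[-1])
```

In this sketch the level rises while only the inflow is active and stays constant once the outflow equals the inflow, which is the qualitative behavior the causal influences of Qi_TK and Qo_V2 on L describe.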






6.3 Event identification
One of the most important steps for fault diagnosis based on
chronicle recognition is to determine the set of events that can
lead the system to a failure. Each situation pattern (normal or
abnormal) is a set of events and temporal constraints between
them; a situation model may also specify events to be generated
and actions to be triggered as a result of the situation
occurrence. For a startup procedure in the example process, the
set of event types Σ that represent the procedure actions is:

\[
\Sigma = \{ V1_{c,o},\ V2_{c,o},\ Pu_{fn},\ V1_{o,c},\ V2_{o,c},\ Pu_{nf} \}
\tag{6}
\]
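
To make the notion of a situation pattern concrete, the following minimal sketch checks whether a dated sequence of events satisfies a chronicle given as a set of event types with interval constraints [t−, t+] on pairs of event dates. The event names mirror the procedure actions above, but the data layout, the function name and all time bounds are hypothetical; this is a simplified check, not the recognition tool used by the authors.

```python
def matches(chronicle, run):
    """chronicle: {"events": set of event types,
                   "constraints": {(a, b): (lo, hi)}}, meaning lo <= t_b - t_a <= hi.
    run: list of (event_type, date) pairs.
    Simplification: only the first occurrence of each event type is used."""
    dates = {}
    for event, date in run:
        dates.setdefault(event, date)
    if not chronicle["events"].issubset(dates):
        return False                      # an expected event never occurred
    return all(lo <= dates[b] - dates[a] <= hi
               for (a, b), (lo, hi) in chronicle["constraints"].items())

# Hypothetical normal-startup chronicle with made-up bounds (seconds).
startup = {
    "events": {"V1_co", "Pu_fn", "V2_co"},
    "constraints": {("V1_co", "Pu_fn"): (5, 20), ("Pu_fn", "V2_co"): (1, 10)},
}
ok_run = [("V1_co", 0), ("Pu_fn", 12), ("V2_co", 15)]
late_run = [("V1_co", 0), ("Pu_fn", 30), ("V2_co", 33)]
print(matches(startup, ok_run), matches(startup, late_run))
```

A full recognition system would also track partial instances and repeated occurrences of the same event type; the sketch only shows how the interval constraints separate a matching run from a non-matching one.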
According to the causal graphs associated with the modes
involved in the sequence of procedure actions (i.e. q1, q5 and
q7, indicated by red arrows in Figure 6), the event types of Σ^c
correspond to the behavior of the variables L, Po and Qo_V2.

\[
\Sigma^c = \{\, l^+_{(L)},\ l^-_{(L)},\ h^+_{(L)},\ h^-_{(L)},\
             l^+_{(Po)},\ l^-_{(Po)},\ h^+_{(Po)},\ h^-_{(Po)},\
             l^+_{(Qo_{V2})},\ l^-_{(Qo_{V2})},\ h^+_{(Qo_{V2})},\ h^-_{(Qo_{V2})} \,\}
\tag{7}
\]
Figure 6: Underlying DES of the HTG system

Figure 7: CSD in the modes q1, q5 and q7

Figure 8: Chronicle C01 for normal behavior startup

From the startup/shutdown procedures, the different temporal
runs are determined, and these temporal runs are related to the
normal and abnormal situations. The chronicle resulting from a
normal startup procedure is presented in Figure 8. The system
model was developed in Matlab, including the water injection
process area. The continuous behavior is related to the
evolution of the level L, the pump outlet pressure Po and the
outlet flow Qo_V2 in the system. The discrete evolution is
related to the event evolution of the procedures in the startup
and shutdown stages. From the different failure modes of the
process, the dynamic behavior of the variables is shown with a
detection for the possible process states, including the normal
procedure without failure. The simulation includes 3 types of
startup procedures (OK, fail1 and fail2) with 4 types of fault
modes (V1, V2, Pump and Drain_open) and 3 types of shutdown
procedures (OK, Non actived and Fail). The evolution of the
continuous variables in the startup procedure without failure
is shown in Figure 9. The events are generated by the program
through the evolution of the differential equations, the






variable conditions and the procedural actions. Recognition of
the chronicles was done using the Stateflow tool.

Figure 9: Normal behavior in the startup procedure without
failure. Blue: level, green: pressure, red: outlet flow.

7 Conclusion
A preliminary method for alarm management based on
automatically learned chronicles has been proposed. The proposal
is based on a hybrid causal model of the system and a
chronicle-based approach for diagnosis. An illustrative example
of a hydrostatic tank gauging system has been considered to
introduce the main concepts of the approach. In this paper the
temporal constraints of the chronicles were designed from
simulation results, but further research aims to generate the
chronicles from the model of the system. Learning approaches are
currently being considered for acquiring the chronicle base
directly from the sequences of events representing the
situations. For this purpose, the algorithm HCDAM (Heuristic
Chronicle Discovery Algorithm Modified [17]) may be used. The
use of HIL (Hardware in the Loop) to simulate and validate the
proposal is also among our prospects.

8 Acknowledgements
The ECOPETROL - ICP engineers Jorge Prada, Francisco Cala and
Gladys Valderrama helped us to develop and validate the
simulations.

References
[1] S. Ferrer D. Beebe and D. Logerot. The connection of
    peak alarm rates to plant incidents and what you can
    do to minimize. Process Safety Progress, 2013.
[2] J. Zhao J. Zhu, Y. Shu and F. Yang. A dynamic alarm
    management strategy for chemical process transitions.
    Journal of Loss Prevention in the Process Industries,
    30, 207–218, 2014.
[3] N. F. Thornhill M. Schleburg, L. Christiansen and
    A. Fay. A combined analysis of plant connectivity and
    alarm logs to reduce the number of alerts in an
    automation system. Journal of Process Control, 23,
    839–851, 2013.
[4] I. Izadi S. L. Shaha T. Black R. Sandeep, R. Kondaveeti
    and T. Chen. Graphical tools for routine assessment of
    industrial alarm systems. Computers and Chemical
    Engineering, 46, 39–47, 2012.
[5] T. Takai M. Noda F. Higuchi, I. Yamamoto and
    H. Nishitani. Use of event correlation analysis to
    reduce number of alarms. Computer Aided Chemical
    Engineering, 27, 1521–1526, 2009.
[6] Z. Ge and Z. Song. Multimode process monitoring
    based on bayesian method. Journal of Chemometrics,
    23, 636–650, 2009.
[7] M. Noda X. Liu and H. Nishitani. Evaluation of plant
    alarm systems by behavior simulation using a virtual
    subject. Computers & Chemical Engineering, 34,
    374–386, 2010.
[8] D. Xiao F. Yang, S. L. Shah and T. Chen. Improved
    correlation analysis and visualization of industrial
    alarm data. 18th IFAC World Congress, Milano (Italy),
    2011.
[9] D.S. Shook S.R. Kondaveeti I. Izadi, S.L. Shah and
    T. Chen. A framework for optimal design of alarm
    systems. 7th IFAC symposium on fault detection,
    supervision and safety of technical processes,
    Barcelona, Spain, 2009.
[10] U.G. Oktem A. Pariyani, W.D. Seider and M. Soroush.
     Dynamic risk analysis using alarm databases to improve
     process safety and product quality: Part ii bayesian
     analysis. AIChE Journal, 58, 826–841, 2012.
[11] J. Liu and D. Chen. Non stationary fault detection and
     diagnosis for multimode processes. AIChE Journal,
     56, 207–219, 2010.
[12] L. Boang Z. Jing and Y. Hao. Fault diagnosis strategy
     for startup process based on standard operating
     procedures. 25th Chinese Control and Decision
     Conference (CCDC), 2013.
[13] H. Vedam R. Srinivasan, P. Viswanathan and
     A. Nochur. A framework for managing transitions in
     chemical plants. Computers & Chemical Engineering,
     29, 305–322, 2005.
[14] A. Adhitya S. Xu and R. Srinivasan. Hybrid model-based
     framework for alarm anticipation. Industrial &
     Engineering Chemistry Research, 2014.
[15] L. Travé-Massuyès R. Pons, A. Subias. Iterative hybrid
     causal model based diagnosis: Application to automotive
     embedded functions. Engineering Applications of
     Artificial Intelligence, 2015.
[16] L. Travé-Massuyès and R. Pons. Causal ordering for
     multiple mode systems. 11th International Workshop on
     Qualitative Reasoning, Cortona, Italy, pp. 203–214,
     1997.
[17] L. Travé-Massuyès A. Subias and E. Le Corronc.
     Learning chronicles signing multiple scenario
     instances. IFAC World Congress, Le Cap, South Africa,
     26-29 August, 2014; also 25th International Workshop
     on Principles of Diagnosis (DX-2014), Graz (Austria),
     9-11 September, 2014.
[18] Bruno Guerraz and Christophe Dousson. Chronicles
     construction starting from the fault model of the
     system to diagnose. DX04 15th International Workshop
     on Principles of Diagnosis, Carcassonne (France),
     2004.








                                    Data-Augmented Software Diagnosis

                           Amir Elmishali1 and Roni Stern1 and Meir Kalech1
                                   1
                                     Ben Gurion University of the Negev
                   e-mail: amir9979@gmail.com, roni.stern@gmail.com, kalech@bgu.ac.il



                          Abstract

     The task of software diagnosis algorithms is to identify
     which software components are faulty, based on the observed
     behavior of the system. Software diagnosis algorithms have
     been studied in the Artificial Intelligence community, using
     model-based and spectrum-based approaches. In this work we
     show how software fault prediction algorithms, which have
     been studied in the software engineering literature, can be
     used to improve software diagnosis. Software fault
     prediction algorithms predict which software components are
     likely to contain faults using machine learning techniques.
     The resulting data-augmented diagnosis algorithm we propose
     is able to overcome key problems in software diagnosis
     algorithms: ranking diagnoses and distinguishing between
     diagnoses with high probability and low probability. This
     allows the outputted list of diagnoses to be significantly
     reduced. We demonstrate the efficiency of the proposed
     approach empirically on both synthetic bugs and bugs
     extracted from the Eclipse open source project. Results
     show that the accuracy of the found diagnoses is
     substantially improved when using the proposed combination
     of software fault prediction and software diagnosis
     algorithms.

1 Introduction
Software is prevalent in practically all fields of life, and its
complexity is growing. Unfortunately, software failures are
common and their impact can be very costly. As a result, there
is a growing need for automated tools to identify software
failures and isolate the faulty software components, such as
classes and functions, that have caused the failure. We focus on
the latter task, of isolating faults in software components, and
refer to this task as software diagnosis.
   Model-based diagnosis (MBD) is an approach to automated
diagnosis that uses a model of the diagnosed system to infer
possible diagnoses, i.e., possible explanations of the observed
system failure. While MBD was successfully applied to a range of
domains [1; 2; 3; 4], it has not yet been applied successfully
to software. The reason for this is that in software
development, there is usually no formal model of the developed
software. To this end, a scalable software diagnosis algorithm
called Barinel has been proposed [5].
   Barinel is a combination of MBD and Spectrum Fault
Localization (SFL). SFL considers traces of executions, and finds
diagnoses by considering the correlation between execution
traces and which executions have failed. While very scalable,
Barinel suffers from one key disadvantage: it can return a very
large set of possible diagnoses for the software developer to
choose from. To handle this disadvantage, Abreu et al. [5]
proposed a Bayesian approach to compute a likelihood score for
each diagnosis. Then, diagnoses are prioritized according to
their likelihood scores.
   Thanks to the open source movement and current software
engineering tools such as version control and issue tracking
systems, there is much more information about a diagnosed system
than revealed by the traces of performed tests. For example,
version control systems store all revisions of every source
file, and it is quite common that a bug occurs in a source file
that was recently revised. Barinel is agnostic to this data. We
propose a data-driven approach to better prioritize the set of
diagnoses returned by Barinel.
   In particular, we use methods from the software engineering
literature to learn from collected data how to predict which
software components are expected to be faulty. These predictions
are then integrated into Barinel to better prioritize the
diagnoses it outputs and provide more accurate estimates of each
diagnosis's likelihood.
   The resulting data-augmented diagnosis algorithm is part of a
broader software troubleshooting paradigm that we call Learn,
Diagnose, and Plan (LDP). In this paradigm, illustrated in
Figure 1(a), the troubleshooting algorithm learns which source
files are likely to fail from past faults, previous source code
revisions, and other sources. When a test fails, a
data-augmented diagnosis algorithm considers the observed failed
and passed tests to suggest likely diagnoses, leveraging the
knowledge learned from past data. If further tests are necessary
to determine which software component caused the failure, such
tests are planned automatically, taking into consideration the
diagnoses found. This process continues until a sufficiently
accurate diagnosis is found.
   In this work we implemented this paradigm and simulated its
execution on a popular open source software project, the Eclipse
CDT. Information from the Git version control and the Bugzilla
issue tracking systems was used, as illustrated in Figure 1(b)
and explained in the experimental results.
   Results show a substantial advantage of using our
data-augmented diagnoser over Barinel with uniform priors, both
for finding more accurate diagnoses and for better selecting
tests for troubleshooting. Moreover, to demonstrate the
potential benefit of our data-augmented approach we also







Figure 1: The learn, diagnose, and plan paradigm and our implementation.
(a) Learn, Diagnose, and Plan paradigm; (b) our current implementation.
Diagram labels: Server Logs, Issue Tracking System, Version Control
System, Source Code, AI Engine, QA Tester, Developer.

experimented with a synthetic fault prediction model that                         does not scale well and modeling the behavior of software
correctly identifies the faulty component. As expected, us-                       component is often infeasible.
ing the synthetic fault prediction model is better than using
the learned fault prediction model, thus suggesting room for                      2.1 SFL for Software Diagnosis
further improvements in future work. To our knowledge,                            An alternative approach to software diagnosis has been
this is the first work to successfully integrate a data-driven                    proposed by Abreu et al. [5; 7], based on spectrum-based fault
approach into software diagnosis.                                                 localization (SFL). In this SFL-based approach, there is no
                                                                                  need for a logical model of the correct functionality of every
2 Model-Based Diagnosis for Software                                              software component in the system. Instead, the traces of the
                                                                                  observed tests are considered.
The input to classical MBD algorithms is a tuple                                  Definition 2 (Trace). A trace of a test t, denoted by trace(t),
⟨SD, COMPS, OBS⟩, where SD is a formal description                                is the sequence of components involved in executing t.
of the diagnosed system’s behavior, COMPS is the set of
components in the system that may be faulty, and OBS                                 Traces of tests can be collected in practice with common
is a set of observations. A diagnosis problem arises when                         software profilers (e.g., Java’s JVMTI). Recent work showed
SD and OBS are inconsistent with the assumption that all                          how test traces can be collected with low overhead [8].
the components in COMPS are healthy. The output of an                             Also, many implemented applications maintain a log
MBD algorithm is a set of diagnoses.                                              with some form of this information.

Definition 1 (Diagnosis). A set of components Δ ⊆ COMPS is a
diagnosis if

\[
\bigwedge_{C \in \Delta} \neg h(C) \;\wedge\; \bigwedge_{C' \notin \Delta} h(C') \;\wedge\; SD \;\wedge\; OBS
\]

is consistent, i.e., if assuming that the components in Δ are
faulty, then SD is consistent with OBS.

   In the SFL-based approach, SD is implicitly defined by the
assumption that a test will pass if all the components in its
trace are not faulty. Let h(C) denote the health predicate for a
component C, i.e., h(C) is true if C is not faulty. Then we can
formally define SD in the SFL-based approach with the following
set of Horn clauses:

\[
\forall\, test:\ \Big( \bigwedge_{C \in trace(test)} h(C) \Big) \rightarrow passed(test)
\]
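
Under the SFL-style system description, the consistency condition of Definition 1 reduces to a simple check: a candidate set Δ is consistent with the observations exactly when every failed test has at least one component of Δ in its trace, since the Horn clauses impose no constraint for passed tests. The sketch below illustrates this with a brute-force enumeration of subset-minimal diagnoses. The function names and the toy traces are hypothetical, and the enumeration is exponential; it is not the hitting set algorithm used by Barinel.

```python
from itertools import combinations

def is_diagnosis(delta, traces, failed):
    """delta: set of components assumed faulty.
    traces: {test: set of components}; failed: set of failed tests.
    True if every failed trace contains at least one component of delta."""
    return all(delta & traces[t] for t in failed)

def minimal_diagnoses(comps, traces, failed):
    """Enumerate subset-minimal diagnoses by increasing cardinality.
    Assumes at least one failed test (the empty diagnosis is not considered)."""
    found = []
    for size in range(1, len(comps) + 1):
        for cand in map(set, combinations(sorted(comps), size)):
            if is_diagnosis(cand, traces, failed) and \
               not any(d <= cand for d in found):
                found.append(cand)
    return found

# Toy example: tests t1 and t2 failed, t3 passed.
traces = {"t1": {"A", "B"}, "t2": {"B", "C"}, "t3": {"A", "C"}}
failed = {"t1", "t2"}
diagnoses = minimal_diagnoses({"A", "B", "C"}, traces, failed)
print(diagnoses)
```

In this toy example both {B} and {A, C} explain the two failed tests, which shows why a ranking of the returned diagnoses is needed even for small systems.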
   The set of components (COMPS) in software diagnosis                            Thus, if a test failed then we can infer that at least one of the
can be, for example, the set of classes, or all functions, or                     components in its trace is faulty. In fact, a trace of a failed
even a component per line of code. Low-level granularity of                       test is a conflict.
components, e.g., setting each line of code as a component,                       Definition 3 (Conflict). A set of components Γ ⊆ COMPS
will result in very focused diagnoses (e.g., pointing to the                      is a conflict if $\bigwedge_{C \in \Gamma} h(C) \wedge SD \wedge OBS$ is inconsistent.
exact line of code that was faulty). Focusing the diagnoses                          Many MBD algorithms use conflicts to direct the search
in such a way comes at the price of increased computational                       towards diagnoses, exploiting the fact that a diagnosis must
effort. Automatically choosing the most suitable level                            be a hitting set of all the conflicts [9; 10; 11]. Intuitively,
of granularity is a topic for future work.
   Observations (OBS) in software diagnosis are observed                          since at least one component in every conflict is faulty, only
executions of tests. Every observed test t is labeled as                          a hitting set of all conflicts can explain the unexpected
“passed” or “failed”, denoted by passed(t) and failed(t),                         observation (failed test).
respectively. This labeling is done manually by the tester or                        Barinel is a recently proposed software MBD algorithm [5]
automatically in case of automated tests (e.g., failed                            based on exactly this concept: considering traces
assertions).                                                                      of tests with failed outcome as conflicts and returning their
   There are two main approaches for applying MBD to                              hitting sets as diagnoses. With a fast hitting set algorithm,
software diagnosis, each defining SD somewhat differently.                        such as the STACCATO hitting set algorithm proposed by
The first approach requires SD to be a logical model of the                       Abreu et al. [12], Barinel can scale well to large systems.
correct functionality of every software component [6]. This                       The main drawback of using Barinel is that it often outputs
approach allows using logical reasoning techniques to infer                       a large set of diagnoses, thus providing weaker guidance to
diagnoses. The main drawback of this approach is that it                          the programmer who is assigned to solve the observed bug.
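
The idea that diagnoses are hitting sets of the traces of failed tests can be illustrated with a minimal brute-force sketch. This is not STACCATO and not Barinel's implementation; the function name `minimal_hitting_sets` and the example traces are hypothetical, and the enumeration is exponential, so it is only suitable for tiny examples.

```python
from itertools import combinations


def minimal_hitting_sets(conflicts):
    """Enumerate all minimal hitting sets of `conflicts` by brute force.

    Each conflict is a set of component names (here: the trace of a failed test).
    """
    conflicts = [frozenset(c) for c in conflicts]
    if not conflicts:
        # With no failed test, the empty diagnosis explains the observations.
        return [frozenset()]
    components = sorted(set().union(*conflicts))
    found = []
    # Candidates are tried by increasing size, so any superset of an
    # already-found hitting set cannot be minimal and is skipped.
    for size in range(1, len(components) + 1):
        for candidate in combinations(components, size):
            cand = frozenset(candidate)
            if any(h <= cand for h in found):
                continue
            # A hitting set shares at least one component with every conflict.
            if all(cand & c for c in conflicts):
                found.append(cand)
    return found


# Hypothetical example: two failed tests with overlapping traces.
failed_traces = [{"A", "B"}, {"B", "C"}]
diagnoses = minimal_hitting_sets(failed_traces)
```

In this toy example both {B} and {A, C} are returned, which also shows why a score is needed to rank competing diagnoses.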






2.2 Prioritizing Diagnoses
To address this problem, Barinel computes a score for every diagnosis it returns, estimating the likelihood that it is true. This serves as a way to prioritize the large set of diagnoses returned by Barinel.
   The exact details of how this score is computed are given by Abreu et al. [5]. For the purpose of this paper, it is important to note that the score computation used by Barinel is Bayesian: it computes for a given diagnosis the posterior probability that it is correct given the observed passed and failed tests. As a Bayesian approach, Barinel also requires some assumption about the prior probability of each component to be faulty. Prior work using Barinel has set these priors uniformly for all components. In this work, we propose a data-driven way to set these priors more intelligently and demonstrate experimentally that this has a substantial impact on the overall performance of the resulting diagnoser.

3 Data-Augmented Software Diagnosis
The prior probabilities used by Barinel represent the a-priori probability of a component to be faulty, without considering any observed system behavior. Fortunately, there is a line of work on software fault prediction in the software engineering literature that deals exactly with this question: which software components are more likely to have a bug. We propose to use these software fault predictions as priors to be used by Barinel. First, we provide some background on software fault prediction.

3.1 Software Fault Prediction
Fault prediction in software is a classification problem. Given a software component, the goal is to determine its class: healthy or faulty. Supervised machine learning algorithms are commonly used these days to solve classification problems. They work as follows. As input, they are given a set of instances, in our case software components, and their correct labeling, i.e., the correct class for each instance. They output a classification model, which maps an instance to a class.
   Learning algorithms extract features from a given instance, and try to learn from the given labeled instances the relation between the features of an instance and its class. Key to the success of machine learning algorithms is the choice of features used. Many possible features were proposed in the literature for software fault prediction.
   Radjenovic et al. [13] surveyed the features used by existing software fault prediction algorithms and categorized them into three families.
Traditional. These features are traditional software complexity metrics, such as number of lines of code, McCabe [14] and Halstead [15] complexity measures.
Object Oriented. These features are software complexity metrics that are specifically designed for object oriented programs. This includes metrics like cohesion and coupling levels and depth of inheritance.
Process. These features are computed from the software change history. They try to capture the dynamics of the software development process, considering metrics such as lines added and deleted in the previous version and the age of the software component.
   It is not clear from the literature which combination of features yields the most accurate fault predictions. In a preliminary set of experiments we found that the combination that performed best consists of 68 features from those listed by Radjenovic et al. [13]. This list of features included the McCabe [14] and Halstead [15] complexity measures, several object oriented measures such as the number of methods overriding a superclass, the number of public methods, the number of other classes referenced, and whether the class is abstract, and several process features such as the age of the source file, the number of revisions made to it in the last release, the number of developers who contributed to its development, and the number of lines changed since the latest version.
   As shown in the experimental results section, the resulting fault prediction model was accurate enough so that the overall data-augmented software diagnoser is more effective than Barinel with uniform priors. However, we are not sure that a better combination of features cannot be found, and this can be a topic for future work. The main novelty of our work is in integrating a software fault prediction model with Barinel.

3.2 Integrating the Fault Prediction Model
The software fault prediction model generated as described above is a classifier, accepting as input a software component and outputting a binary prediction: is the component predicted to be faulty or not. Barinel, however, requires a real number that estimates the prior probability of each component to be faulty.
   To obtain this estimated prior from the fault prediction model, we rely on the fact that most prediction models also output a confidence score, indicating the model’s confidence about the classified class. Let conf(C) denote this confidence for component C. We use conf(C) for Barinel’s prior if C is classified as faulty, and 1 − conf(C) otherwise.

4 Experimental Results
To demonstrate the benefits of the proposed data-augmented approach, we implemented it and evaluated it as follows.

4.1 Experimental Setup
As a benchmark, we used the source files, tests, and bugs reported for the Eclipse CDT open source software project (eclipse.org/cdt). Eclipse CDT is a popular open source Integrated Development Environment (IDE) for C/C++. The first release dates back to December 2003 and the latest release we consider, labeled CDT 8.2.0, was released in June 2013. It consists of 8,502 source code files and has had more than 10,129 bugs reported so far (for all releases). In addition, there are 3,493 automated tests coded using the JUnit unit testing framework.

Determining Faulty Files
Eclipse CDT is developed using the Git version control system and the Bugzilla issue tracking system. Git maintains all versions of each source file in a repository. This enables computing process metrics for every version of every source file. Similarly, Bugzilla is used to maintain all reported bugs. Some source file versions are marked in the Git repository as versions in which a specific bug was fixed. The Git repository for Eclipse CDT contained matching versions of source files for 6,730 out of 10,129 bugs reported as fixed in Bugzilla. We performed our experiments on these 6,730 bugs.
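
The mapping from classifier output to Barinel priors described in Section 3.2 can be sketched as follows. The function names, file names and confidence values are hypothetical. The second function additionally assumes that components fail independently, which is a common modeling choice but is an assumption of this sketch; the exact score computation is the one given by Abreu et al. [5].

```python
from math import prod


def component_prior(predicted_faulty, conf):
    """Prior fault probability of a component from a binary prediction.

    Uses conf(C) if C is classified as faulty, and 1 - conf(C) otherwise.
    """
    return conf if predicted_faulty else 1.0 - conf


def diagnosis_prior(diagnosis, priors):
    """Prior probability of a diagnosis (a set of faulty components).

    ASSUMPTION of this sketch: components fail independently, so the prior is
    the product of p(C) for C in the diagnosis and 1 - p(C) for all others.
    """
    return prod(p if c in diagnosis else 1.0 - p for c, p in priors.items())


# Hypothetical classifier output: file -> (classified as faulty?, confidence).
predictions = {"Parser.java": (True, 0.8), "Lexer.java": (False, 0.9)}
priors = {f: component_prior(label, conf) for f, (label, conf) in predictions.items()}
single_fault_prior = diagnosis_prior({"Parser.java"}, priors)
```

With uniform priors every single-file diagnosis would start with the same prior; here the file flagged by the classifier starts with a much higher one.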






   For both learning and testing a fault prediction model, we require a mapping between each reported bug and the source files that were faulty and caused it. One possible assumption is that every source file revision that is marked as fixing bug X is a faulty file that caused X. We call this the “All files” assumption. The “All files” assumption may overestimate the number of faulty files, as some of these files may have been modified for other reasons, not related to the bug. Even if all changes in a revision are related to fixing a bug, it still does not mean that all these files are faulty. For example, properties files and XML configuration files may be modified as part of a fix without being faulty themselves. As a crude heuristic to overcome this, we also experiment with an alternative assumption that we call the “Most modified” assumption. In the “Most modified” assumption, for a given bug X we only consider a single source file as faulty from all the files associated with bug X. We choose from these source files the one in which the revision made to that source file was the most extensive. The extensiveness of the revision is measured by the number of lines added, updated, and deleted in the source file in this revision. Below we present experiments for both “All files” and “Most modified” assumptions. Śliwerski et al. [16] proposed a more elaborate method to heuristically identify the source files that caused the bug, when analyzing a similar data set.

Training and Testing Set
The source files and reported bugs from 5 releases, 8.0.0–8.1.1, were used to train the model of our data-augmented diagnoser, and the source files and reported bugs from release 8.1.2 were used to evaluate it.

4.2 Comparing Fault Prediction Accuracy
As a preliminary, we evaluated the quality of the fault prediction models used by our data-augmented diagnoser on our Eclipse CDT benchmark.

  All files         Precision    Recall   F-Measure     AUC
  Random Forest     0.56         0.09     0.16          0.84
  J48               0.44         0.17     0.25          0.61
  Naive Bayes       0.27         0.31     0.29          0.80

  Most modified     Precision    Recall   F-Measure     AUC
  Random Forest     0.44         0.04     0.08          0.76
  J48               0.15         0.03     0.05          0.55
  Naive Bayes       0.08         0.31     0.12          0.715

          Table 1: Fault prediction performance.

   We used the Weka software package (www.cs.waikato.ac.nz/ml/weka) to experiment with several learning algorithms and compared the resulting fault prediction models. Specifically, we evaluated the following learning algorithms: Random Forest, J48 (Weka’s implementation of a decision tree learning algorithm), and Naive Bayes. Table 1 shows the precision, recall, F-measure, and AUC of the fault prediction models generated by each of these learning algorithms. These are standard metrics for evaluating classifiers. In brief, precision is the ratio of faulty files among all files identified by the evaluated model as faulty. Recall is the number of faulty files identified as such by the evaluated model divided by the total number of faulty files. F-measure is the harmonic mean of precision and recall. The AUC metric addresses the known tradeoff between recall and precision, where high recall often comes at the price of low precision. This tradeoff can be controlled by setting different sensitivity thresholds to the evaluated model. AUC is the area under the curve plotting the accuracy as a function of the recall (every point is a different threshold value).
   All metrics range between zero and one (where one is optimal) and are standard metrics in machine learning and information retrieval. The unfamiliar reader can find more details in machine learning books, e.g., Mitchell’s classical book [17].
   The results for both “All files” and “Most modified” assumptions show that the Random Forest classifier obtained the overall best results. This is consistent with many recent works. Thus, in the results reported henceforth, we only used the model generated by the Random Forest classifier in our data-augmented diagnoser. The precision and especially recall results are fairly low. This is understandable, as most files are healthy, and thus the training set is very imbalanced. This is a known inhibitor to the performance of standard learning algorithms. We have experimented with several known methods to handle this imbalanced setting, such as SMOTE and random undersampling, but these did not produce substantially better results. However, as we show below, even this imperfect prediction model is able to improve the existing data-agnostic software diagnosis algorithm. Note that we also experimented with other popular learning algorithms such as Support Vector Machine (SVM) and Artificial Neural Network (ANN), but their results were worse than those shown in Table 1.
   Next, we evaluate the performance of our data-augmented diagnoser in two diagnostic tasks: finding diagnoses and guiding test generation.

4.3 Diagnosis Task
First, we compared the data-agnostic diagnoser with the proposed data-augmented diagnoser in the task of finding accurate diagnoses. The input is a set of tests, with their traces and outcomes, and the output is a set of diagnoses, each diagnosis having a score that estimates its correctness. This score was computed by Barinel as described earlier in the paper, where the data-agnostic diagnoser uses uniform priors and the proposed data-augmented diagnoser uses the predicted fault probabilities from the learned model.

                       Most modified            All files
  Diagnoser          Precision Recall      Precision Recall
  Data-agnostic      0.72      0.27        0.55      0.26
  Data-augmented     0.90      0.32        0.73      0.35
  Syn. (0.6,0.01)    0.97      0.39        0.96      0.45
  Syn. (0.6,0.1)     0.84      0.35        0.89      0.42
  Syn. (0.6,0.2)     0.77      0.34        0.83      0.39
  Syn. (0.6,0.3)     0.73      0.33        0.78      0.37
  Syn. (0.6,0.4)     0.69      0.32        0.74      0.36

        Table 2: Comparison of diagnosis accuracy.

   To compare the sets of diagnoses returned by the different diagnosers, we computed the weighted average of their precision and recall. This was computed as follows. First, the precision and recall of every diagnosis were computed. Then, we averaged these values, weighted by the score given to the diagnoses by Barinel. This enables aggregating the precision and recall of a set of diagnoses while incorporating which diagnoses are regarded as more likely according to Barinel. For brevity, we will refer to this weighted average precision and weighted average recall as simply precision and recall.
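
The weighted average precision and recall used to compare diagnosers can be sketched as below. The per-diagnosis definitions are an assumption of this sketch: precision is taken as the fraction of files in a diagnosis that are truly faulty, and recall as the fraction of truly faulty files that the diagnosis contains. The function name and the example scores are hypothetical.

```python
def weighted_precision_recall(scored_diagnoses, faulty):
    """Score-weighted average precision and recall of a set of diagnoses.

    scored_diagnoses: list of (diagnosis, score) pairs, where each diagnosis is
    a non-empty set of components and the scores are non-negative.
    faulty: the non-empty set of components that are actually faulty.
    """
    total = sum(score for _, score in scored_diagnoses)
    if total == 0:
        raise ValueError("scores must not all be zero")
    precision = 0.0
    recall = 0.0
    for diagnosis, score in scored_diagnoses:
        hits = len(diagnosis & faulty)  # truly faulty components in this diagnosis
        weight = score / total          # normalize scores into weights
        precision += weight * hits / len(diagnosis)
        recall += weight * hits / len(faulty)
    return precision, recall


# Hypothetical example: three diagnoses with their scores; A and B are faulty.
scored = [({"A"}, 0.6), ({"A", "B"}, 0.3), ({"C"}, 0.1)]
p, r = weighted_precision_recall(scored, faulty={"A", "B"})
```

Because the weights follow the diagnosis scores, a diagnoser is rewarded for ranking correct diagnoses highly, not only for returning them.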






   Table 2 shows the precision and recall results of the data-agnostic diagnoser and our data-augmented diagnoser, for both “Most modified” and “All files” assumptions. Each result in the table is an average over the precision and recall obtained for 50 problem instances. A problem instance consists of (1) a bug from one of the bugs reported for release 8.1.2 of Eclipse CDT, and (2) a set of 25 tests, chosen randomly, while ensuring that at least one test would pass through the faulty files.
   Both precision and recall of the data-augmented and data-agnostic diagnosers support the main hypothesis of this work: a data-augmented diagnoser can yield substantially better diagnoses than a data-agnostic diagnoser. For example, the precision of the data-augmented diagnoser under the “Most modified” assumption is 0.9 while that of the data-agnostic diagnoser is only 0.72. The superior performance of the data-augmented diagnoser is shown for both “Most modified” and “All files” assumptions. Another observation that can be made from the results in Table 2 is that while the precision of the data-augmented diagnoser is very high and is substantially better than that of the data-agnostic diagnoser, the improvement in recall is relatively more modest. This can be explained by the precision and recall results of the learned model, shown in Table 1 and discussed earlier. There too, the recall results were far worse than the precision results (recall that we are using the model learned by the Random Forest learning algorithm). It is possible that learning a model with higher recall may result in higher recall for the resulting diagnoses. We explore the impact of learning a more accurate fault prediction model next.

Synthetic Priors
The data-augmented diagnoser is based on the priors generated by the learned fault prediction model. Building better fault prediction models is an active field of study [13] and thus future fault prediction models may be more accurate than the ones used by our data-augmented diagnoser. To evaluate the benefit of a more accurate fault prediction model on our data-augmented diagnoser, we created a synthetic fault prediction model, in which faulty source files get probability Pf and healthy source files get Ph, where Pf and Ph are parameters. Setting Ph = Pf would cause the data-augmented diagnoser to behave exactly like the data-agnostic diagnoser, which uses a uniform distribution, setting the same prior probability for all source files to be faulty. By contrast, setting Ph = 0 and Pf = 1 represents an optimal fault prediction model, which exactly predicts which files are faulty and which are healthy.
   The lines marked “Syn. (X,Y)” in Table 2 show the performance of the data-augmented diagnoser when using this synthetic fault prediction model, where X = Pf and Y = Ph. Note that we experimented with many values of Pf and Ph, and presented above a representative subset of these results.
   As expected, lowering the value of Ph results in better diagnoses being found. Setting a very low Ph value improves the precision significantly, up to almost perfect precision (0.97 and 0.96 for the “Most modified” and “All files” assumptions, respectively). The recall results, while also improving as we lower Ph, do not reach a very high value. For Ph = 0.01, the obtained recall is 0.39 and 0.45 for the “Most modified” and “All files” assumptions, respectively.
   A possible explanation for these low recall results lies in the fact that all the evaluated diagnosers use the Barinel diagnosis algorithm with different fault priors. Barinel uses these priors only to prioritize diagnoses, but it considers as diagnoses hitting sets of the traces of failed tests. Thus, if two faulty components are used in the same trace, only one of them will be detected even if both have very high likelihood of being faulty according to the fault prediction model.

Considering More Tests
Next, we investigate the impact of adding more tests on the accuracy of the returned diagnoses.
   Figure 2 shows the precision and recall results (Figures 2 (a) and (b), respectively), as a function of the number of observed tests. We compared the different diagnosers, given 25, 40, 70, 100, and 130 observed tests.
   The results show two interesting trends in both precision and recall. First, as expected, the data-agnostic diagnoser performs worse than the data-augmented diagnoser, which in turn performs worse than the diagnoser using a synthetic fault prediction model with Ph = 0.01. This supports our main hypothesis: that data-augmented diagnosers can be better than a data-agnostic diagnoser. Also, the better performance of Syn. (0.6, 0.01) demonstrates that future research on improving the fault prediction model will result in a better diagnoser.
   The second trend is that adding more tests reduces the precision and recall of the returned diagnoses. This, at first glance, seems counter-intuitive, as we would expect more tests to allow finding more accurate diagnoses and thus higher recall and precision. This non-intuitive result can be explained by how tests were chosen. As explained above, the observed tests were chosen randomly, only verifying that at least one test passes through each faulty source file. Adding randomly selected tests adds noise to the diagnoser. By contrast, intelligent methods to choose which tests to add can improve the accuracy of the diagnoses [18]. This is explored in the next section. Another reason for the degraded performance when adding more tests is that more tests may pass through more faulty source files, in addition to those from the specific reported bug used to generate the problem instance in the first place. Thus, adding more tests increases the number of faulty source files to detect.

4.4 Troubleshooting Task
Efficient diagnosers are key components of troubleshooting algorithms. Troubleshooting algorithms choose which tests to perform to find the most accurate diagnosis. Zamir et al. [18] proposed several troubleshooting algorithms specifically designed to work with Barinel for troubleshooting software bugs. In the preliminary study below, we evaluated the impact of our data-augmented diagnoser on the overall performance of troubleshooting algorithms. Specifically, we implemented the so-called highest probability (HP) troubleshooting algorithm, in which tests are chosen in the following manner. HP chooses a test that is expected to pass through the source file having the highest probability of being faulty, given the diagnoses' probabilities.
   We ran the HP troubleshooting algorithm with each of the diagnosers mentioned above (all rows in Table 2). We compared the HP troubleshooting algorithm using different diagnosers by counting the number of tests that were required to reach a diagnosis with a score higher than 0.7.
   Table 3 shows the average number of tests performed by the HP troubleshooting algorithm until it halts (with a suitable diagnosis). The results show the same over-arching







[Figure 2 plots, panels (a) Precision results and (b) Recall results: precision and recall (y-axis, 0 to 1) versus # Tests (x-axis: 25, 40, 70, 100, 130) for Syn. (0.6,0.01), Syn. (0.6,0.2), Syn. (0.6,0.4), Data-agnostic, and Data-augmented.]

                                  Figure 2: Diagnosis accuracy as a function of # tests given to the diagnoser.
          Algorithm                  Most modified        All files               [5]        Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Si-
          Data-agnostic              20.24                18.06                              multaneous debugging of software faults. Journal of Systems
          Data-augmented             10.80                15.45                              and Software, 84(4):573–586, 2011.
          Syn. (0.6,0.01)            3.94                 14.91
                                                                                  [6]        Franz Wotawa and Mihai Nica. Program debugging using
          Syn. (0.6,0.1)             15.44                17.83
          Syn. (0.6,0.2)             19.78                18.99                              constraints – is it feasible? Quality Software, International
          Syn. (0.6,0.3)             20.90                19.24                              Conference on, 0:236–243, 2011.
          Syn. (0.6,0.4)             20.74                19.18                   [7]        Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund.
                                                                                             Spectrum-based multiple fault localization. In Automated
      Table 3: Avg. additional tests for troubleshooting.                                    Software Engineering (ASE), pages 88–99. IEEE, 2009.
                                                                                  [8]        Alexandre Perez, Rui Abreu, and André Riboira. A dynamic
theme: the data-augmented diagnoser is much better than                                      code coverage approach to maximize fault localization effi-
the data-agnostic diagnoser for this troubleshooting task.                                   ciency. Journal of Systems and Software, 2014.
Also, using the synthetic fault prediction model can result                       [9]        Johan de Kleer and Brian C. Williams. Diagnosing multiple
in even further improvement, thus suggesting future work                                     faults. Artif. Intell., 32(1):97–130, 1987.
for improving the learned fault prediction model.
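
The HP test selection rule and the halting condition of Section 4.4 can be sketched as follows. This is an illustration under stated assumptions, not the implementation of Zamir et al. [18]: the fault probability of a file is assumed here to be the sum of the normalized scores of the diagnoses that contain it, and all function names, the candidate tests and their expected traces are hypothetical.

```python
def component_fault_probabilities(scored_diagnoses):
    """ASSUMPTION: P(component is faulty) is the normalized sum of the scores
    of all diagnoses that contain the component."""
    total = sum(score for _, score in scored_diagnoses)
    probs = {}
    for diagnosis, score in scored_diagnoses:
        for component in diagnosis:
            probs[component] = probs.get(component, 0.0) + score / total
    return probs


def choose_test_hp(scored_diagnoses, candidate_tests):
    """Pick a test expected to pass through the most suspicious component.

    candidate_tests maps a test name to the set of components its trace is
    expected to contain. Falls back to less suspicious components if no
    candidate covers the top one; returns None if nothing is covered.
    """
    probs = component_fault_probabilities(scored_diagnoses)
    for component in sorted(probs, key=probs.get, reverse=True):
        for test, expected_trace in candidate_tests.items():
            if component in expected_trace:
                return test
    return None


def should_halt(scored_diagnoses, threshold=0.7):
    """Stop once some diagnosis has a score above the threshold."""
    return any(score > threshold for _, score in scored_diagnoses)


# Hypothetical state: three scored diagnoses and two candidate tests.
scored = [({"A"}, 0.6), ({"B", "C"}, 0.3), ({"C"}, 0.1)]
tests = {"t1": {"B", "C"}, "t2": {"A", "D"}}
next_test = choose_test_hp(scored, tests)
```

A troubleshooting loop would run the chosen test, recompute the diagnoses and their scores with the new observation, and repeat until `should_halt` returns true.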
                                                                                  [10] Brian C. Williams and Robert J. Ragno. Conflict-directed
                                                                                       A* and its role in model-based embedded systems. Discrete
5 Conclusion and Future Work                                                           Appl. Math., 155(12):1562–1595, 2007.
We presented a method for using information about the di-                         [11] Roni Stern, Meir Kalech, Alexander Feldman, and Gre-
agnosed system to improve Barinel, a scalable, effective,                              gory M. Provan. Exploring the duality in conflict-directed
software diagnosis algorithm [7]. In particular, we incor-                             model-based diagnosis. In AAAI, 2012.
porated a software fault prediction model into Barinel. The                       [12] Rui Abreu and Arjan JC van Gemund. A low-cost approx-
resulting data-augmented diagnoser is shown to outperform                              imate minimal hitting set algorithm and its application to
Barinel without such a fault prediction model. This was                                model-based diagnosis. In SARA, volume 9, pages 2–9, 2009.
verified experimentally using a real source code system                           [13] Danijel Radjenovic, Marjan Hericko, Richard Torkar, and
(Eclipse CDT), real reported bugs and information from                                 Ales Zivkovic. Software fault prediction metrics: A systematic
the software’s source control repository. Results also                                 literature review. Information & Software Technology,
suggest that future work on improving the learned fault                                55(8):1397–1418, 2013.
prediction model will result in improved diagnosis accuracy.                      [14] Thomas J. McCabe. A complexity measure. IEEE Trans.
In addition, it is worthwhile to combine the proposed                                  Software Eng., 2(4):308–320, 1976.
data-augmented diagnosis methods with other proposed                              [15] Maurice H. Halstead. Elements of Software Science
improvements to SFL-based software diagnosis, such as                                  (Operating and Programming Systems Series). Elsevier Science Inc.,
those proposed by Hofer et al. [19; 20].                                               New York, NY, USA, 1977.
                                                                                  [16] Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller.
References                                                                             When do changes induce fixes? ACM sigsoft software engi-
                                                                                       neering notes, 30(4):1–5, 2005.
[1]   Brian C. Williams and P. Pandurang Nayak. A model-based
      approach to reactive self-configuring systems. In Conference                [17] Tom Mitchell. Machine learning. McGraw Hill, 1997.
      on Artificial Intelligence (AAAI), pages 971–978, 1996.                     [18] Tom Zamir, Roni Stern, and Meir Kalech. Using model-
[2]   Alexander Feldman, Helena Vicente de Castro, Arjan van                           based diagnosis to improve software testing. In AAAI Con-
      Gemund, and Gregory Provan. Model-based diagnostic                               ference on Artificial Intelligence, 2014.
      decision-support system for satellites. In IEEE Aerospace                   [19] Birgit Hofer, Franz Wotawa, and Rui Abreu. Ai for the win:
      Conference, pages 1–14. IEEE, 2013.                                              Improving spectrum-based fault localization. ACM SIGSOFT
[3]   Peter Struss and Chris Price. Model-based systems in the                         Software Engineering Notes, 37:1–8, 2012.
      automotive industry. AI magazine, 24(4):17–34, 2003.                        [20] Birgit Hofer and Franz Wotawa. Spectrum enhanced dy-
[4]   Dietmar Jannach and Thomas Schmitz. Model-based diag-                            namic slicing for better fault localization. In ECAI, pages
      nosis of spreadsheet programs: a constraint-based debugging                      420–425, 2012.
      approach. Automated Software Engineering, 1:1–40, 2014.







Faults isolation and identification of Heat-exchanger/Reactor with parameter uncertainties

Mei ZHANG^1,4,5, Boutaïeb DAHHOU^2,3, Michel CABASSUD^4,5, Ze-tao LI^1

^1 Guizhou University
   gzgylzt@163.com
^2 CNRS LAAS, Toulouse, France
   boutaib.dahhou@laas.fr
^3 Université de Toulouse, UPS, LAAS, Toulouse, France
^4 Université de Toulouse, UPS, Laboratoire de Génie Chimique
   michel.cabassud@ensiacet.fr
^5 CNRS, Laboratoire de Génie Chimique

Abstract

This paper deals with sensor and process fault detection and isolation (FDI) and identification for an intensified heat-exchanger/reactor. Extended high gain observers are adopted for identifying sensor faults and guaranteeing accurate dynamics, since they can simultaneously estimate both states and uncertain parameters. In this paper, the uncertain parameter is the overall heat transfer coefficient. In the proposed algorithm, each extended high gain observer is fed by only one measurement. In this way, the observers can act as soft sensors that yield healthy virtual measurements in place of faulty physical sensors. The healthy measurements are then processed together with a bank of parameter interval filters in order to isolate process faults and identify the faulty values. The effectiveness of the proposed approach is demonstrated on an intensified heat-exchanger/reactor developed by the Laboratoire de Génie Chimique, Toulouse, France.

1   Introduction

Nowadays, safety is a priority in the design and development of chemical processes. Large research efforts have contributed to the improvement of safety tools and methodology. Process intensification can be considered an inherently safer design, as with the intensified heat exchanger (HEX) reactors in [1]: the prospects are a drastic reduction of unit size and solvent consumption, while safety is increased thanks to their remarkable heat transfer capabilities. However, the risk assessment presented in [2] shows that a potential risk of thermal runaway exists in such intensified processes. Further, several kinds of failures may compromise safety and productivity: actuator failures (e.g., pump failures, valve failures), process failures (e.g., abrupt variations of some process parameters) and sensor failures. Therefore, supervision such as FDI is required prior to the implementation of an intensified process.

For complex systems (e.g., heat-exchanger/reactors), fault detection and isolation are more complicated because some sensors cannot be placed where they would be desirable, and for some variables (concentrations) no sensor exists. In addition, complete measurements of the states and parameters (e.g., the overall heat transfer coefficient) are usually not available.

Supervision studies in chemical reactors have been reported in the literature concerning process monitoring, fouling detection, and fault detection and isolation. Existing approaches can be roughly divided into data based methods as in [3], neural networks as in [4] and model based methods as in [5,6,7,8,9]. Among the model based approaches, observer based methods are said to be the most capable [10,11,12,13,14] if analytical models are available.

Most previous approaches focus on a particular class of failures. This paper deals with integrated fault diagnosis for both sensor and process failures. Using temperature measurements together with state observers, an integrated diagnosis scheme is proposed to detect, isolate and identify faults. For sensor faults, an FDI framework is proposed based on the extended observer developed in [15]. Extended high gain observers are adopted in this paper because of their capability to estimate states and parameters simultaneously, which results in more accurate system dynamics. The estimates provided by the observers and the sensor measurements are processed so as to recognize the faulty physical sensors, thus achieving sensor FDI. Moreover, the extended high gain observers work as soft sensors that output healthy virtual measurements once sensor faults have occurred. The healthy measurements are then used to feed a bank of parameter interval filters developed in [11] to generate a bank of residuals. These residuals are processed to isolate and identify process faults, which in this work involve jumps in the overall heat transfer coefficient.

It should be pointed out that the contribution of this work does not lie in the soft sensor design or in the parameter interval filter design, as each part has individually been addressed in the existing literature. However, the authors are not aware of any studies where both tasks are combined for integrated FDI. Besides, there is no report in which the parameter estimation capacity of the extended high gain observer is used to adapt the coefficient rather than to perform parameter FDI. These two points, together with the sensor FDI framework, form the contribution of this work.

2   System modelling

2.1 Process description
The key feature of the studied intensified continuous heat-exchanger/reactor is an integrated plate heat-exchanger






technology, which allows the thermal integration of several functions in a single device. Indeed, by combining a reactor and a heat exchanger in a single unit, the heat generated (or absorbed) by the reaction is removed (or supplied) much more rapidly than in a classical batch reactor. As a consequence, heat-exchanger/reactors may offer better safety (through better thermal control of the reaction) and better selectivity (through a more tightly controlled operating temperature).

2.2 Dynamic model
Supervision tasks such as FDI can be much more efficient if a dynamic model of the system under consideration is available to evaluate the consequences of variable deviations and the efficiency of the proposed FDI scheme.

Generally speaking, an intensified continuous heat-exchanger/reactor is treated like a continuous reactor [16,17]. Flow modelling is therefore based on the same hypothesis as the one used for modelling real continuous reactors, which are represented by a series of N perfectly stirred tank reactors (cells). According to [18], the number of cells N should be greater than the number of heat transfer units, and the number of heat transfer units is related to the heat capacity flowrate. The modelling of a cell is based on the expression of balances (mass and energy) which describe the evolution of the characteristic values: temperature, mass, composition, pressure, etc. Given the specific geometry of the heat-exchanger/reactor, two main parts are distinguished. The first part is associated with the reaction and the second part covers the heat transfer aspect. Without reaction, the basic mass balance expression for a cell is written as:

{Rate of mass flow in − Rate of mass flow out = Rate of change of mass within system}

The state and evolution of the homogeneous medium circulating inside cell $k$ are described by the following balances.

2.2.1 Heat balance of the process fluid ($\mathrm{J\,s^{-1}}$)

$$\rho_p^k V_p^k C_{p_p}^k \frac{dT_p^k}{dt} = h_p^k A^k \left(T_p^k - T_u^k\right) + \rho_p^k F_p^k C_{p_p}^k \left(T_p^{k-1} - T_p^k\right) \qquad (1)$$

where $\rho_p^k$ is the density of the process fluid in cell $k$ (in $\mathrm{kg\,m^{-3}}$), $V_p^k$ is the volume of the process fluid in cell $k$ (in $\mathrm{m^3}$), $C_{p_p}^k$ is the specific heat of the process fluid in cell $k$ (in $\mathrm{J\,kg^{-1}\,K^{-1}}$), and $h_p^k$ is the overall heat transfer coefficient (in $\mathrm{J\,m^{-2}\,K^{-1}\,s^{-1}}$).

2.2.2 Heat balance of the utility fluid ($\mathrm{J\,s^{-1}}$)

$$\rho_u^k V_u^k C_{p_u}^k \frac{dT_u^k}{dt} = h_u^k A^k \left(T_u^k - T_p^k\right) + \rho_u^k F_u^k C_{p_u}^k \left(T_u^{k-1} - T_u^k\right) \qquad (2)$$

where $\rho_u^k$ is the density of the utility fluid in cell $k$ (in $\mathrm{kg\,m^{-3}}$), $V_u^k$ is the volume of the utility fluid in cell $k$ (in $\mathrm{m^3}$), $C_{p_u}^k$ is the specific heat of the utility fluid in cell $k$ (in $\mathrm{J\,kg^{-1}\,K^{-1}}$), and $h_u^k$ is the overall heat transfer coefficient (in $\mathrm{J\,m^{-2}\,K^{-1}\,s^{-1}}$).

Eqs. (1) and (2) represent the dynamic behavior of the reactor. They describe the evolution of two states ($T_p$: reactor temperature, and $T_u$: utility fluid temperature). The heat transfer coefficient ($h$) is considered as a variable which undergoes either an abrupt jump (caused by a fault in the process) or a gradual variation (essentially due to degradation). The degradation can be attributed to fouling. Fouling in an intensified process is small because of the micro-channel volume, and it does not normally constitute a failure leading to a fatal accident, but it may influence the dynamics of the process, and the resulting changes are rather difficult to calculate online. In this paper, we treat the parameter uncertainty as an unmeasured state and employ an observer as a soft sensor to estimate it. Unlike in other literature, the estimation here is not used for fouling detection but to obtain more accurate model dynamics and to ensure that the value of the variable stays within acceptable limits (e.g., the upper and lower bounds of the process variable value).

To rewrite the whole model in the form of state equations, and given the assumption that every element behaves like a perfectly stirred tank, we suppose that one cell captures the main features of the qualitative behavior of the reactor. For the sake of simplicity, only one cell is considered, and the superscript $k$ is dropped.

Define the state vector $x_1^T = [x_{11}, x_{12}]^T = [T_p, T_u]^T$ and the unmeasured state $x_2^T = [x_{21}, x_{22}]^T = [h_p, h_u]^T$, with $\frac{dh_p}{dt} = \frac{dh_u}{dt} = \varepsilon(t)$, where $\varepsilon(t)$ is an unknown but bounded function that describes the variation of $h$. The control input is $u = T_{ui}$ and the output vector of measurable variables is $y^T = [y_1, y_2]^T = [T_p, T_u]^T$. Then equations (1) and (2) can be rewritten in the following state-space form:

$$\begin{cases} \dot{x}_1 = F_1(x_1)\,x_2 + g_1(x_1, u) \\ \dot{x}_2 = \varepsilon(t) \\ y = x_1 \end{cases} \qquad (3)$$

where

$$F_1(x_1) = \begin{pmatrix} \dfrac{A}{\rho_p C_{p_p} V_p}\,(T_p - T_u) & 0 \\ 0 & \dfrac{A}{\rho_u C_{p_u} V_u}\,(T_u - T_p) \end{pmatrix}, \qquad g_1(x_1, u) = \begin{pmatrix} \dfrac{(T_{pi} - T_p)\,F_p}{V_p} \\ \dfrac{(T_{ui} - T_u)\,F_u}{V_u} \end{pmatrix},$$

and $T_{pi}$, $T_{ui}$ are the outputs of the previous cell; for the first cell, they are the inlet temperatures of the process fluid and the utility fluid.

In this case, the full state of the studied system is given as:

$$\begin{cases} \dot{x} = F(x_1)\,x + G(x_1, u) + \bar{\varepsilon}(t) \\ y = Cx \end{cases} \qquad (4)$$

where

$$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad F(x_1) = \begin{pmatrix} 0 & F_1(x_1) \\ 0 & 0 \end{pmatrix}, \quad G(x_1, u) = \begin{pmatrix} g_1(x_1, u) \\ 0 \end{pmatrix}, \quad C = \begin{pmatrix} I & 0 \end{pmatrix}, \quad \bar{\varepsilon}(t) = \begin{pmatrix} 0 \\ \varepsilon(t) \end{pmatrix}.$$

3 Fault detection and diagnosis scheme

3.1 Observer design for sensor FDI
The extended high gain observer proposed in [15] can be used like an adaptive observer to estimate states and parameters simultaneously. In this paper, the latter capability is used to estimate the incipient degradation of the overall heat transfer coefficient (due to fouling), thus guaranteeing a more accurate approximation of the temperature. This is quite useful in chemical processes, since parameters are usually uncertain and cannot be measured.
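As a quick consistency check on the state-space form (3), the following Python sketch evaluates the temperature derivatives once from the heat balances (1) and (2) and once as $F_1(x_1)x_2 + g_1(x_1,u)$, and confirms that the two agree. It assumes that the temperature difference in (2) is $T_u - T_p$, as in the second row of $F_1$, and that $x_2$ is ordered as $[h_p, h_u]$ so that each coefficient multiplies its own fluid's row. All numerical values, the dictionary `PARAMS` and the function names are hypothetical and chosen only for illustration; $A$ is taken to be the heat exchange area and $F_p$, $F_u$ volumetric flow rates, which the text does not state explicitly.

```python
# Hypothetical single-cell parameters (illustrative values, not from the paper).
PARAMS = {
    "A": 0.02,        # assumed: heat exchange area, m^2
    "rho_p": 1000.0,  # process fluid density, kg/m^3
    "rho_u": 1000.0,  # utility fluid density, kg/m^3
    "cp_p": 4180.0,   # process fluid specific heat, J/(kg K)
    "cp_u": 4180.0,   # utility fluid specific heat, J/(kg K)
    "V_p": 2.7e-5,    # process fluid volume, m^3
    "V_u": 1.1e-4,    # utility fluid volume, m^3
    "F_p": 4.2e-6,    # assumed: process volumetric flow rate, m^3/s
    "F_u": 4.2e-5,    # assumed: utility volumetric flow rate, m^3/s
}


def balance_rhs(Tp, Tu, hp, hu, Tpi, Tui, p=PARAMS):
    """dTp/dt and dTu/dt from balances (1) and (2), divided by rho * V * Cp."""
    dTp = (hp * p["A"] * (Tp - Tu)
           + p["rho_p"] * p["F_p"] * p["cp_p"] * (Tpi - Tp)) \
        / (p["rho_p"] * p["V_p"] * p["cp_p"])
    dTu = (hu * p["A"] * (Tu - Tp)
           + p["rho_u"] * p["F_u"] * p["cp_u"] * (Tui - Tu)) \
        / (p["rho_u"] * p["V_u"] * p["cp_u"])
    return dTp, dTu


def state_space_rhs(x1, x2, Tpi, Tui, p=PARAMS):
    """x1_dot = F1(x1) x2 + g1(x1, u) as in (3), with x1 = [Tp, Tu], x2 = [hp, hu]."""
    Tp, Tu = x1
    # Diagonal matrix F1(x1).
    F1 = [
        [p["A"] / (p["rho_p"] * p["cp_p"] * p["V_p"]) * (Tp - Tu), 0.0],
        [0.0, p["A"] / (p["rho_u"] * p["cp_u"] * p["V_u"]) * (Tu - Tp)],
    ]
    # Flow (transport) term g1(x1, u).
    g1 = [
        (Tpi - Tp) * p["F_p"] / p["V_p"],
        (Tui - Tu) * p["F_u"] / p["V_u"],
    ]
    return tuple(
        F1[i][0] * x2[0] + F1[i][1] * x2[1] + g1[i] for i in range(2)
    )


def forms_agree(Tp, Tu, hp, hu, Tpi, Tui, rel_tol=1e-9):
    """True if the balance form and the state-space form give the same derivatives."""
    a = balance_rhs(Tp, Tu, hp, hu, Tpi, Tui)
    b = state_space_rhs((Tp, Tu), (hp, hu), Tpi, Tui)
    return all(abs(u - v) <= rel_tol * max(1.0, abs(u)) for u, v in zip(a, b))


if __name__ == "__main__":
    print(forms_agree(350.0, 300.0, 5000.0, 5000.0, 340.0, 290.0))
```

The check only verifies the algebraic rewriting; it makes no claim about the physical plausibility of the chosen numbers.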






Consider a nonlinear system of the form:

$$\begin{cases} \dot{x} = F(x_1)\,x + G(x_1, u) \\ y = Cx \end{cases} \qquad (5)$$

where $x = (x_1, x_2)^T \in \mathbb{R}^{2n}$, $x_1 \in \mathbb{R}^n$ is the state, $x_2 \in \mathbb{R}^n$ is the unmeasured state with $\dot{x}_2 = \varepsilon(t)$, $u \in \mathbb{R}^m$ and $y \in \mathbb{R}^p$ are the input and output, $\varepsilon(t)$ is an unknown bounded function which may depend on $u(t)$, $y(t)$, noise, etc., and

$$F(x_1) = \begin{pmatrix} 0 & F_1(x_1) \\ 0 & 0 \end{pmatrix}, \quad G(x_1, u) = \begin{pmatrix} g_1(x_1, u) \\ 0 \end{pmatrix}, \quad C = \begin{pmatrix} I & 0 \end{pmatrix}.$$

$F_1(x_1)$ is a nonlinear matrix function and $g_1(x_1, u)$ is a vector function whose elements are nonlinear functions.

Supposing that the assumptions in [15] on the boundedness of the states, signals, functions, etc. are satisfied, the extended high gain observer for the system is given by:

$$\begin{cases} \dot{\hat{x}} = F(\hat{x}_1)\,\hat{x} + G(\hat{x}_1, u) - \Lambda^{-1}(\hat{x}_1)\,S_\theta^{-1} C^T (\hat{y} - y) \\ \hat{y} = C\hat{x} \end{cases} \qquad (6)$$

where

$$\Lambda(\hat{x}_1) = \begin{bmatrix} I & 0 \\ 0 & F_1(\hat{x}_1) \end{bmatrix}$$

and $S_\theta$ is the unique symmetric positive definite matrix satisfying the following algebraic Lyapunov equation:

$$\theta S_\theta + A^T S_\theta + S_\theta A - C^T C = 0 \qquad (7)$$

where $A = \begin{bmatrix} 0 & I \\ 0 & 0 \end{bmatrix}$ and $\theta > 0$ is a parameter defined in [15]. The solution of eq. (7) is:

$$S_\theta = \begin{bmatrix} \dfrac{1}{\theta} I & -\dfrac{1}{\theta^2} I \\ -\dfrac{1}{\theta^2} I & \dfrac{2}{\theta^3} I \end{bmatrix} \qquad (8)$$

Then, the gain of the estimator is given by:

$$H = \Lambda^{-1}(\hat{x}_1)\,S_\theta^{-1} C^T = \begin{bmatrix} 2\theta I \\ \theta^2 F_1^{-1}(\hat{x}_1) \end{bmatrix} \qquad (9)$$

Notice that a larger $\theta$ ensures a smaller estimation error. However, very large values of $\theta$ are to be avoided in practice because of noise sensitivity. Thus, the choice of $\theta$ is a compromise between fast convergence and sensitivity to noise.

3.2 Sensor fault detection and isolation scheme

Ideally, the above observer reproduces the heat-exchanger/reactor dynamics. A bank of the proposed observers, together with the sensor measurements, is then used to generate robust residuals for recognizing a faulty sensor. Thus, we propose an FDI scheme to detect, isolate and recover from the sensor fault.

3.2.1 Sensor fault model

A sensor fault can be modeled as an unknown additive term in the output equation. Suppose $\theta_{sj}$ is the actual measured output of the $j$th sensor. If the $j$th sensor is healthy, $\theta_{sj} = y_j$, while if the $j$th sensor is faulty, $\theta_{sj} = y_j^f = y_j + f_{sj}$ ($f_{sj}$ is the fault) for $t \ge t_f$, and $\lim_{t\to\infty} |y_j - \theta_{sj}| \neq 0$. That means $y_j^f$ is the actual output of the $j$th sensor when it is faulty, while $y_j$ is the expected output when it is healthy, that is:

$$\theta_{sj} = \begin{cases} y_j, & \text{if the } j\text{th sensor is healthy} \\ y_j^f = y_j + f_{sj}, & \text{if the } j\text{th sensor is faulty} \end{cases} \qquad (10)$$

With this formulation, the faulty model becomes:

$$\begin{cases} \dot{x} = F(x_1)\,x + G(x_1, u) + \bar{\varepsilon}(t) \\ y = Cx + F_s f_s \end{cases} \qquad (11)$$

$F_s$ is the fault distribution matrix, and the fault vector $f_s \in \mathbb{R}^p$ ($f_{sj}$ is the $j$th element of the vector) is also considered to be a bounded signal. Notice that a faulty sensor may lead to an incorrect estimation of the parameter. That is why healthy measurements were emphasized above for parameter fault isolation.

3.2.2 Fault detection and isolation scheme

The proposed sensor FDI framework is based on a bank of observers, where the number of observers is equal to the number of sensors. Each observer uses only one sensor output to estimate all the states and parameters. First, assume that the sensor used by the $i$th observer is healthy, and let $y_i$ denote the $i$th system output used by the $i$th observer. Then, for $1 \le i \le p$, we form the observer as:

$$\begin{cases} \dot{\hat{x}}^i = F(\hat{x}_1^i)\,\hat{x}^i + G(\hat{x}_1^i, u) + H_i\,(y_i - \hat{y}_i^i) \\ \hat{y}^i = C\hat{x}^i \end{cases} \qquad (12)$$

Define $e_x^i = \hat{x}^i - x$, $e_y^i = C e_x^i$, $e_{yj}^i = \hat{y}_j^i - y_j$, $r_j^i(t) = \|\hat{y}_j^i - y_j\|$, and $\mu_i = \|r_j^i\| := \sup_{t \ge 0} \|r_j^i(t)\|$.

Here $i$ denotes the $i$th observer, $\hat{y}_i^i$ and $\hat{y}_j^i$ denote the $i$th and $j$th estimated system outputs generated by the $i$th observer, and $H_i$ is the gain of the $i$th observer, determined by the following equation:

$$H_i = \Lambda^{-1}(\hat{x}_1)\,S_{\theta_i}^{-1} C^T = \begin{bmatrix} 2\theta_i I \\ \theta_i^2 F_1^{-1}(\hat{x}_1) \end{bmatrix}$$

Then we get:

Theorem 1:
If the $l$th sensor is faulty, then for a system of the form (4), the observer (12) has the following properties:
For $i \neq l$, $\hat{y}^i = y$ asymptotically.
For $i = l$, $\hat{y}^i \neq y$.

Proof: If the $l$th sensor is faulty, then:
For $i \neq l$, we have $f_{si} = 0$ and $y_i = \theta_{si}$, so that:

$$\lim_{t\to\infty} e_x^i = \lim_{t\to\infty} (\hat{x}^i - x) = 0 \qquad (13)$$

Then the vector of estimated outputs $\hat{y}^i$ generated by the $i$th observer converges to $y$.
For $i = l$, we have $\theta_{sl} = y_l^f = y_l + f_{sl}$ with $f_{sl} \neq 0$. The observer is designed under the assumption that no fault occurs. Because the fault $f_{sl}$ exists, the estimation error $e_x^l$ cannot converge to zero asymptotically, so that:

$$\lim_{t\to\infty} (\hat{x}^i - x) = \lim_{t\to\infty} (\hat{x}^l - x) \neq 0 \qquad (14)$$

and we have:

$$\dot{e}_x^l = F(\hat{x}_1^i, u)\,e_x^l - H_i\,G(\hat{x}_1^i, u, f_{sl})\,e_x^l \qquad (15)$$

Then the vector of estimated outputs $\hat{y}^i$ generated by the $i$th observer is different from $y$, that is, $\hat{y}^i \neq y$. ⊡

As mentioned above, the observers are designed under the assumption that no fault occurs; furthermore, each observer is subject to only one sensor output. The residual $r_i^i$ is the difference between the $i$th output estimate $\hat{y}_i^i$ determined by the $i$th observer and the $i$th system output $y_i$. Theorem 2 then formulates the fault detection and isolation scheme.
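The closed-form solution (8) of the Lyapunov equation (7) and the resulting observer gain can be checked numerically. The sketch below does this in plain Python for the scalar-block case $n = 1$ (so every block $I$ is the number 1), which is an assumption made only to keep the example short. It evaluates the left-hand side of (7) at the matrix in (8), and it computes $\Lambda^{-1} S_\theta^{-1} C^T$, which for scalar blocks should equal $[2\theta,\ \theta^2 / F_1]^T$. The helper names and the value used for $F_1(\hat{x}_1)$ are hypothetical.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def transpose(X):
    """Transpose of a matrix given as a list of rows."""
    return [list(row) for row in zip(*X)]


def s_theta(theta):
    """Closed-form solution (8) for scalar blocks (n = 1)."""
    return [
        [1.0 / theta, -1.0 / theta**2],
        [-1.0 / theta**2, 2.0 / theta**3],
    ]


def lyapunov_residual(theta):
    """Left-hand side of (7): theta*S + A^T S + S A - C^T C, evaluated at (8)."""
    A = [[0.0, 1.0], [0.0, 0.0]]
    C = [[1.0, 0.0]]
    S = s_theta(theta)
    AtS = matmul(transpose(A), S)
    SA = matmul(S, A)
    CtC = matmul(transpose(C), C)
    return [
        [theta * S[i][j] + AtS[i][j] + SA[i][j] - CtC[i][j] for j in range(2)]
        for i in range(2)
    ]


def observer_gain(theta, f1_hat):
    """H = Lambda^{-1} S_theta^{-1} C^T with Lambda = diag(1, f1_hat), n = 1.

    f1_hat stands for the scalar value of F1 at the current estimate
    (hypothetical input; it must be nonzero for Lambda to be invertible).
    """
    S = s_theta(theta)
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    S_inv = [
        [S[1][1] / det, -S[0][1] / det],
        [-S[1][0] / det, S[0][0] / det],
    ]
    # C^T = [1, 0]^T picks the first column of S_theta^{-1}.
    s_inv_ct = [S_inv[0][0], S_inv[1][0]]
    # Lambda^{-1} = diag(1, 1/f1_hat).
    return [s_inv_ct[0], s_inv_ct[1] / f1_hat]


if __name__ == "__main__":
    print(lyapunov_residual(2.0))   # all entries should vanish
    print(observer_gain(2.0, 0.5))  # compare with [2*theta, theta**2 / f1_hat]
```

The gain grows like $\theta$ and $\theta^2$, which illustrates the trade-off noted in the text: a larger $\theta$ speeds up convergence but amplifies measurement noise through the output injection term.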






Theorem 2:                                                                   all the intervals whether or not one of them contains the
If the lth sensor is faulty, then:                                           faulty parameter value of the system, the faulty parameter
For i ≠ l, we have:                                                          value is found, the fault is therefore isolated and estimated.
$$f_{si} = 0, \quad y_i = \theta_{si} \tag{16}$$

thus $\hat{y}_{ii}$ converges to $y_i$ asymptotically, and we get:

$$r_{ii} = \lVert \hat{y}_{ii} - y_i \rVert \le \mu_i \tag{17}$$

For $i = l$, we have $f_{sl} \ne 0$ and $\theta_{sl} = y_l^f = y_l + f_{sl} \ne y_l$, so $\hat{y}_{ll}$ cannot track $y_l$ correctly:

$$r_{ll} = \lVert \hat{y}_{ll} - y_l \rVert \ge \mu_l \tag{18}$$

Therefore, in practice, we can check all the residuals $r_{ii}$ for $1 \le i \le p$: $r_{ii} \ge \mu_i$ indicates that the ith sensor is faulty, and sensor fault detection and isolation is thereby achieved.
The residuals are designed to be sensitive to a fault that comes from a specific sensor and as insensitive as possible to faults in all the other sensors. These residuals allow us to handle not only single faults but also multiple and simultaneous faults.
Let $r_{si}$ denote the fault signature of the ith sensor, defined as:

$$r_{si}(t) = \begin{cases} 1 & \text{if } r_{ii} \ge \mu_i \text{ (ith sensor is faulty)} \\ 0 & \text{if } r_{ii} < \mu_i \text{ (ith sensor is healthy)} \end{cases} \tag{19}$$

3.2.3 Fault identification and handling mechanism
1) Fault identification
Suppose there are $m$ healthy sensors and $p - m$ faulty ones. To identify the fault size of the ith sensor, use the $m$ estimated outputs $\hat{y}_i^m$ generated by the $m$ observers that use healthy measurements, $1 \le m \le p - 1$, $m \ne i$. Define $\hat{f}_{si}$ as the estimated fault value of the ith sensor; then:

$$\hat{f}_{si} = \frac{1}{m} \sum_{i=1}^{m} \lvert \hat{y}_i^m - \theta_{si} \rvert \xrightarrow{\Delta} f_{si} \tag{20}$$

2) Fault recovery
As mentioned above, the extended high gain observer also works as a software sensor that provides an adequate estimate of the process output, thus replacing the measurement given by the faulty physical sensor.
$\theta_{si}$ is the actual measured output of the ith sensor:

$$\theta_{si} = \begin{cases} y_i & \text{(healthy sensor)} \\ y_i^f = y_i + f_{si} & \text{(faulty sensor)} \end{cases} \tag{21}$$

Let the $m$ observers that use healthy measurements serve as the soft sensor for the ith sensor, and define:

$$\bar{y}_i = \frac{1}{m} \sum_{i=1}^{m} \hat{y}_i^m \tag{22}$$

If the ith sensor is healthy, its actual output $\theta_{si}$ is used as the output; if it is faulty, $\bar{y}_i$ replaces $\theta_{si}$, that is:

$$y_i = \begin{cases} \theta_{si}, & \text{if the ith sensor is healthy} \\ \bar{y}_i, & \text{if the ith sensor is faulty} \end{cases} \tag{23}$$

3.3 Process fault diagnosis
In order to achieve process FDD, healthy measurements are fed to a bank of parameter interval filters developed in [11] to generate a bank of residuals. These residuals are processed to identify parameter changes, which in this paper means variation of the overall heat transfer coefficient.
The main idea of the method is as follows.
The practical domain of the value of each system parameter is divided into a certain number of intervals. After verifying

The practical domain of each parameter is partitioned into a certain number of intervals. For example, parameter $h_p$ is partitioned into $q$ intervals whose bounds are denoted by $h_p^{(0)}, h_p^{(1)}, \ldots, h_p^{(i)}, \ldots, h_p^{(q)}$. The bounds of the ith interval, $h_p^{(i-1)}$ and $h_p^{(i)}$, are also denoted $h_p^{bi}$ and $h_p^{ai}$, and the nominal value of $h_p$ is denoted by $h_{p0}$.
To verify whether an interval contains the faulty parameter value of the post-fault system, a parameter filter is built for this interval. A parameter filter consists of two isolation observers that correspond to the two interval bounds, and each isolation observer serves two neighboring intervals. An interval that contains the nominal parameter value cannot contain the faulty parameter value, so no parameter filter is built for it.
Rewrite Eq. (3) in a simpler form as:

$$\begin{cases} \dot{x}_1 = F_1(x_1)\,x_2 + g_1(x_1, u) \\ y = x_1 \end{cases} \;=\; \begin{cases} \dot{x}_1 = f(x_1, h_p, u) \\ y = x_1 \end{cases} \tag{24}$$

The parameter filter for the ith interval of $h_p$ is given below.
The isolation observers are:

$$\begin{cases} \dot{\hat{x}}^{ai} = f(\hat{x}_1, h_{p0}^{ai}, u) + H\,(y - \hat{y}^{ai}) \\ \hat{y}^{ai} = c\,\hat{x}^{ai} \\ \varepsilon^{ai} = y - c\,\hat{x}^{ai} \end{cases} \tag{25}$$

$$\begin{cases} \dot{\hat{x}}^{bi} = f(\hat{x}_1, h_{p0}^{bi}, u) + H\,(y - \hat{y}^{bi}) \\ \hat{y}^{bi} = c\,\hat{x}^{bi} \\ \varepsilon^{bi} = y - c\,\hat{x}^{bi} \end{cases} \tag{26}$$

where:

$$h_{p0}^{ai}(t) = \begin{cases} h_{p0}, & t < t_f \\ h_p^{(i)}, & t \ge t_f \end{cases}, \qquad h_{p0}^{bi}(t) = \begin{cases} h_{p0}, & t < t_f \\ h_p^{(i-1)}, & t \ge t_f \end{cases} \tag{27}$$

The isolation index of this parameter filter is calculated by:

$$\nu_i(t) = \operatorname{sgn}(\varepsilon^{ai})\,\operatorname{sgn}(\varepsilon^{bi}) \tag{28}$$

As soon as $\nu_i(t) = 1$, the parameter filter sends the 'non-containing' signal to indicate that this interval does not contain the faulty parameter value. If the fault is in the ith interval, let:

$$\hat{h}A = \frac{1}{2}\left(h^{ai}A + h^{bi}A\right) \tag{29}$$

represent the faulty value; fault isolation and identification are then achieved.

4 Numerical simulation
A case study is developed to test the effectiveness of the proposed scheme. The real data come from a laboratory pilot of a continuous intensified heat-exchanger/reactor. The pilot is made of three process plates sandwiched between five utility plates, as shown in Fig. 1. More information can be found in [2]. As previously said, the simulation model considers only one cell, which may lead to moderate inaccuracy in the dynamic behavior compared with the real reactor. However, this does not affect the application and demonstration of the proposed FDD algorithm, and encouraging results are obtained.
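The isolation logic of Eq. (28) can be illustrated with a minimal Python sketch. It assumes the two residuals of each parameter filter are already available as numbers; the function names and the residual values below are hypothetical and are not taken from the paper or from [11].

```python
def sign(x: float) -> int:
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def isolation_index(eps_a: float, eps_b: float) -> int:
    """Eq. (28): product of the signs of the two isolation-observer residuals."""
    return sign(eps_a) * sign(eps_b)


def candidate_intervals(residual_pairs):
    """Indices of the intervals whose filter has NOT sent the
    'non-containing' signal, i.e. whose index is different from 1."""
    return [
        i
        for i, (eps_a, eps_b) in enumerate(residual_pairs)
        if isolation_index(eps_a, eps_b) != 1
    ]


# Made-up residual pairs (eps_a, eps_b) for a bank of four filters.
pairs = [(0.4, 0.2), (0.3, -0.1), (-0.2, -0.5), (-0.6, -0.9)]
remaining = candidate_intervals(pairs)
print(remaining)
```

With these illustrative values only the second filter has residuals of opposite sign, so it is the only interval that is not excluded.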






Figure 1 (a) Reactive channel design; (b) utility channel design; (c) the heat exchanger/reactor after assembly.

The constants and physical data used in the pilot are given in Table 1.

Table 1. Physical data used in the pilot

    Constant         Value        Units
    hA               214.8        W.K^-1
    A                4e-6         m^3
    V_p              2.685e-5     m^3
    V_u              1.141e-4     m^3
    ρ_p, ρ_u         1000         kg.m^-3
    c_p,p, c_p,u     4180         J.kg^-1.K^-1

4.1 Operating conditions

The inlet flow rates of the utility fluid and the process fluid are $F_u = 4.17 \times 10^{-6}\ \mathrm{m^3\,s^{-1}}$ and $F_p = 4.22 \times 10^{-5}\ \mathrm{m^3\,s^{-1}}$. The inlet temperature of the utility fluid varies in time between 15.6℃ and 12.6℃, which is a classical disturbance in the studied system, as shown in Fig. 2. The inlet temperature of the process fluid is 76℃. The initial conditions for all observers and models are $\hat{T}_{p0} = \hat{T}_{u0} = 30$℃ and $hA = 214.8\ \mathrm{W\,K^{-1}}$.

Fig. 2 Utility inlet temperature $T_{ui}$

4.2 High gain observer performance

To prove the convergence of the observers and show their tracking capabilities, suppose the heat transfer coefficient is subject to a decrease of the form $h = (1 - 0.01t)h$, followed by a sudden jump of 15 at $t = 100$ s. These variations and the observer estimation results are reported in Fig. 3.

Fig. 3 Simulation and estimation of heat transfer coefficient variation.

The black curve simulates the actual changes of the parameter, while the red one shows the estimate generated by the proposed observer. It can be seen from Fig. 3 that the estimate tracks the real value with good accuracy, which indicates good dynamic performance.

4.3 Sensor FDI and recovery demonstration
In order to show the effectiveness of the proposed method for sensor FDI, multiple faults and simultaneous faults in the temperature sensors are considered in case 1 and case 2 respectively. In addition, the pilot is subject to parameter uncertainty caused by a decrease of the heat transfer coefficient, $h = (1 - 0.01t)h$. Two extended high gain observers are designed to generate a set of residuals that achieve fault detection and isolation for the individual sensors. Observer 1 is fed by the output of sensor $T_p$ to estimate all states and the parameter, while observer 2 uses the output of sensor $T_u$. The advantage of the proposed FDI methodology is that if one sensor is faulty, the estimate generated from the healthy one can replace the faulty physical value, thus providing a healthy virtual measurement.
Case 1: abrupt faults occur at the output of sensor $T_p$ at t = 80 s and t = 100 s, with amplitudes of 0.3℃ and 0.5℃ respectively. The results are reported in Fig. 5-8.

Fig. 5 Output temperature of both fluids in case 1 by observer 1; the red curve is the estimated value and the black one is the measured value.

It is obvious that from t = 80 s, $\hat{T}_u$ (red curve) cannot track $T_u$ (black curve) correctly, while $\hat{T}_p$ needs about 0.2 s to track $T_p$ at t = 80 s and t = 100 s. This suggests that faults have occurred; the next task is to identify the size and location of the faulty sensors, which is done in Fig. 6 and Fig. 7. It takes 0.1 s and 0.3 s to isolate the faults at 80 s and 100 s respectively.

Fig. 6 Isolation residual in case 1.
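The threshold test of Eq. (19), the fault-size estimate of Eq. (20) and the soft-sensor replacement of Eqs. (22)-(23) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the threshold of 0.1 and the temperature values are invented for the example, and the function names are hypothetical rather than the authors' implementation.

```python
from statistics import fmean


def fault_signature(residual: float, threshold: float) -> int:
    """Eq. (19): 1 when the residual reaches the threshold, else 0."""
    return 1 if residual >= threshold else 0


def soft_sensor(estimates):
    """Eq. (22): average of the estimates produced by the observers
    that are fed by healthy sensors."""
    return fmean(estimates)


def fault_size(measured: float, estimates) -> float:
    """Eq. (20): mean absolute gap between estimates and the measurement."""
    return fmean(abs(e - measured) for e in estimates)


def recovered_output(measured, residual, threshold, estimates):
    """Eq. (23): keep the measurement if the sensor is healthy,
    otherwise replace it with the soft-sensor value."""
    if fault_signature(residual, threshold):
        return soft_sensor(estimates)
    return measured


# Assumed numbers: a 0.3 degree bias on the measurement, one healthy observer.
threshold = 0.1            # hypothetical value of mu_i
measured_tp = 60.3         # faulty reading (illustrative)
estimates_tp = [60.0]      # estimate from the observer fed by the healthy sensor
residual = abs(estimates_tp[0] - measured_tp)

print(fault_signature(residual, threshold))
print(recovered_output(measured_tp, residual, threshold, estimates_tp))
```

In this toy case the residual exceeds the assumed threshold, so the reading is flagged and replaced by the observer estimate.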








Fig. 7b Fault signature in case 1; faults only occur at the output of sensor $T_p$.

For fault recovery, we can employ observer 2 as a soft sensor to generate a healthy value for the faulty sensor $T_p$. Observer 2 uses only the measured $T_u$ to estimate all states and parameters. Therefore, $\hat{T}_u$ and $\hat{T}_p$ generated by observer 2 are determined only by $T_u$. In case 1, faults occur only on sensor $T_p$ and sensor $T_u$ is healthy, so $\hat{T}_u$ and $\hat{T}_p$ generated by observer 2 match their expected values. As shown in Fig. 8, since $T_u$ is healthy, the estimate $\hat{T}_u$ tracks the measured $T_u$ perfectly, while the estimate $\hat{T}_p$ (red curve) does not track the faulty measured value $T_p$ (black curve). $\hat{T}_p$ (red curve) represents the expected value for sensor $T_p$, so we can use it to replace the faulty measured value $T_p$ (black curve) for fault recovery.

Fig. 8 Fault recovery in case 1; the red curve is the estimated value and the black one is the measured value.

If faults occur only at the output of sensor $T_u$, the same results are easily obtained. For multiple and simultaneous faults on both sensors, we can still isolate the faults correctly. Case 2 verifies this point.

Case 2: simultaneous faults are imposed on the outputs of sensor $T_p$, as in case 1, and of sensor $T_u$ at t = 80 s with an amplitude of 0.6℃. Results are reported in Fig. 9-10. The residuals clearly exceed their thresholds at 80 s and 100 s.

Fig. 9 Isolation residual in case 2

Fig. 10 Fault signature in case 2

It can be seen from Fig. 9 and Fig. 10 that the proposed FDI scheme isolates the faults correctly. It takes 0.25 s and 0.4 s to isolate the faults in sensor $T_p$ at 80 s and 100 s respectively, and 0.2 s to isolate the fault in sensor $T_u$ at t = 80 s. Compared with case 1, more time is needed in case 2.

4.4 Fast process fault isolation and identification

The process fault is related to variation of the overall heat transfer coefficient ($h$). The heat transfer coefficient is considered as a variable that undergoes either an abrupt jump (caused by an expected fault in the flow rate) or a gradual variation (essentially due to fouling). For incipient variation, since fouling in the intensified heat-exchanger/reactor is small and only influences the dynamics, we have employed extended high gain observers to follow the dynamics influenced by this slow variation. Therefore, abrupt changes in the heat transfer coefficient $h$ can only be caused by sudden changes in mass flow rate. This implies that the root cause of a process fault in this system is an actuator fault.
Suppose an abrupt jump in $h$ from 214.8 to 167 at t = 40 s.

Fig. 11 Detection residual in the process fault case

From Fig. 11, at t = 40 s, unlike in the sensor fault cases, the residual leaves zero and never returns, which indicates that a process fault has occurred. For fast fault isolation and identification, we use the methodology of parameter interval filters developed in [11]. In [2], the heat transfer coefficient $h$ varies between 130.96 and 214.8, so $h$ is divided into 4 intervals as shown in Table 2; simulation results are shown in Fig. 12. It can be seen that at t = 40 s only the index for the interval 150-170 goes to zero rapidly, so the fault lies in this interval. The faulty value is estimated by

$$\hat{h}A = \frac{1}{2}\left(h_a A + h_b A\right) = \frac{1}{2}(150 + 170) = 160.$$

This is close to the actual faulty value of 167, and if the domain is divided into more intervals, the estimated value may be closer to the actual faulty value.
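The arithmetic of this example can be checked with a short Python sketch. It uses the interval bounds listed in Table 2 and the fault value 167 from the text; looking up the interval directly from the known fault value is a simplification for illustration only, since in the scheme itself the interval is selected by the parameter filters. The function names are hypothetical.

```python
# Interval bounds (h_a A, h_b A) as listed in Table 2.
INTERVALS = [(130.0, 150.0), (150.0, 170.0), (170.0, 190.0), (190.0, 214.0)]


def containing_interval(value: float):
    """Return the first interval whose bounds enclose the value, or None."""
    for low, high in INTERVALS:
        if low <= value <= high:
            return (low, high)
    return None


def midpoint(interval) -> float:
    """Midpoint estimate of the faulty value, as in Eq. (29)."""
    low, high = interval
    return 0.5 * (low + high)


actual = 167.0                                 # post-fault value used in the text
interval = containing_interval(actual)
estimate = midpoint(interval)
error = abs(estimate - actual)
worst_case = 0.5 * (interval[1] - interval[0])  # half the interval width

print(interval, estimate, error, worst_case)
```

The midpoint estimate can never be further from the true value than half the interval width, which is why a finer partition tends to give a closer estimate, at the cost of more filters.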






Table 2. Parameter filter intervals

    Interval no.    1      2      3      4
    h_a A           130    150    170    190
    h_b A           150    170    190    214

Fig. 12 "Non-containing fault" index sent by the parameter filters

5 Conclusion
An integrated approach for fault diagnosis in an intensified heat-exchanger/reactor has been developed in this paper. The approach is capable of detecting, isolating and identifying failures due to both sensors and parameters. Robustness of the proposed sensor FDI with respect to parameter uncertainties is ensured by adopting a soft sensor. Fast isolation of process faults is obtained by adopting the parameter interval filter. It should be noted that the proposed method is suitable for a large class of nonlinear systems with dynamic models similar to that of the studied system. Application to the pilot heat-exchanger/reactor confirms the effectiveness and robustness of the proposed approach.

References
[1] F. Théron, Z. Anxionnaz-Minvielle, M. Cabassud, C. Gourdon, and P. Tochon, “Characterization of the performances of an innovative heat-exchanger/reactor,” Chem. Eng. Process. Process Intensif., vol. 82, pp. 30–41, 2014.
[2] N. Di Miceli Raimondi, N. Olivier-Maget, N. Gabas, M. Cabassud, and C. Gourdon, “Safety enhancement by transposition of the nitration of toluene from semi-batch reactor to continuous intensified heat exchanger reactor,” Chem. Eng. Res. Des., vol. 94, pp. 182–193, 2015.
[3] P. Kesavan and J. H. Lee, “A set based approach to detection and isolation of faults in multivariable systems,” Chem. Eng., vol. 25, pp. 925–940, 2001.
[4] D. Ruiz, J. M. Nougués, Z. Calderón, A. Espuña, and L. Puigjaner, “Neural network based framework for fault diagnosis in batch chemical plants,” Comput. Chem. Eng., vol. 24, pp. 777–784, 2000.
[5] O. A. Z. Sotomayor and D. Odloak, “Observer-based fault diagnosis in chemical plants,” Chem. Eng. J., vol. 112, pp. 93–108, 2005.
[6] M. Du and P. Mhaskar, “Isolation and handling of sensor faults in nonlinear systems,” Automatica, vol. 50, no. 4, pp. 1066–1074, 2014.
[7] F. Xu, Y. Wang, and X. Luo, “Soft sensor for inputs and parameters using nonlinear singular state observer in chemical processes,” Chinese J. Chem. Eng., vol. 21, no. 9, pp. 1038–1047, 2013.
[8] F. Caccavale and F. Pierri, “An integrated approach to fault diagnosis for a class of chemical batch processes,” J. Process Control, vol. 19, no. 5, pp. 827–841, 2009.
[9] M. Du, J. Scott, and P. Mhaskar, “Actuator and sensor fault isolation of nonlinear process systems,” Chem. Eng. Sci., vol. 104, pp. 294–303, 2013.
[10] D. Fragkoulis, G. Roux, and B. Dahhou, “Detection, isolation and identification of multiple actuator and sensor faults in nonlinear dynamic systems: Application to a waste water treatment process,” Appl. Math. Model., vol. 35, no. 1, pp. 522–543, 2011.
[11] Z. Li and B. Dahhou, “A new fault isolation and identification method for nonlinear dynamic systems: Application to a fermentation process,” Appl. Math. Model., vol. 32, pp. 2806–2830, 2008.
[12] X. Zhang, M. M. Polycarpou, and T. Parisini, “Fault diagnosis of a class of nonlinear uncertain systems with Lipschitz nonlinearities using adaptive estimation,” Automatica, vol. 46, no. 2, pp. 290–299, 2010.
[13] R. F. Escobar, C. M. Astorga-Zaragoza, J. A. Hernández, D. Juárez-Romero, and C. D. García-Beltrán, “Sensor fault compensation via software sensors: Application in a heat pump’s helical evaporator,” Chem. Eng. Res. Des., pp. 2–11, 2014.
[14] F. Bonne, M. Alamir, and P. Bonnay, “Nonlinear observer of the thermal loads applied to the helium bath of a cryogenic Joule–Thompson cycle,” J. Process Control, vol. 24, no. 3, pp. 73–80, 2014.
[15] M. Farza, K. Busawon, and H. Hammouri, “Simple nonlinear observers for on-line estimation of kinetic rates in bioreactors,” Automatica, vol. 34, no. 3, pp. 301–318, 1998.
[16] W. Benaïssa, N. Gabas, M. Cabassud, D. Carson, S. Elgue, and M. Demissy, “Evaluation of an intensified continuous heat-exchanger reactor for inherently safer characteristics,” J. Loss Prev. Process Ind., vol. 21, pp. 528–536, 2008.
[17] S. Li, S. Bahroun, C. Valentin, C. Jallut, and F. De Panthou, “Dynamic model based safety analysis of a three-phase catalytic slurry intensified continuous reactor,” J. Loss Prev. Process Ind., vol. 23, no. 3, pp. 437–445, 2010.
[18] P. S. Varbanov, J. J. Klemeš, and F. Friedler, “Cell-based dynamic heat exchanger models-Direct determination of the cell number and size,” Comput. Chem. Eng., vol. 35, pp. 943–948, 2011.








LPV subspace identification for robust fault detection using a set-membership approach: Application to the wind turbine benchmark

Chouiref. H (1), Boussaid. B (1), Abdelkrim. M.N (1), Puig. V (2) and Aubrun. C (3)

(1) Research Unit of Modeling, Analysis and Control of Systems (MACS), Gabès University
    e-mail: houda.chouiref@gmail.com, dr.boumedyen.boussaid@ieee.org, naceur.abdelkrim@enig.rnu.tn
(2) Advanced Control Systems Group (SAC), Technical University of Catalonia
    e-mail: vicenc.puig@upc.edu
(3) Centre de Recherche en Automatique de Nancy (CRAN), Lorraine University
    e-mail: christophe.aubrun@univ-lorraine.fr
Abstract
This paper focuses on robust fault detection for Linear Parameter Varying (LPV) systems using a set-membership approach. Since most models that represent real systems are subject to modeling errors, standard LPV fault detection (FD) methods should be extended to be robust against model uncertainty. To solve this robust FD problem, a set-membership approach based on an interval predictor is used, considering a bounded description of the modeling uncertainty. Satisfactory results have been obtained with the proposed approach using several fault scenarios in the pitch subsystem of the wind turbine benchmark introduced in IFAC SAFEPROCESS 2009.

1 Introduction
The fault diagnosis of industrial processes has become an important topic because of its great influence on the operational control of processes. Reliable diagnosis and early detection of incipient faults avoid harmful consequences. Typically, faults in sensors, in actuators and in the process itself are considered. In the case of the wind turbine benchmark, a set of pre-defined faults with different locations and types is proposed in [1], where the dynamic change in the pitch system is treated. The procedure of fault detection is based either on knowledge or on a model of the system [2]. Model-based fault detection is often necessary to obtain good performance in the diagnosis of faults. The methods used in model-based diagnosis can be classified according to whether they use state observers, parity equations or parameter estimation [3]. For linear time invariant (LTI) systems, the FD task is largely solved by powerful tools. However, physical systems generally present nonlinear behaviors, and using LTI models in many real applications is not sufficient for high performance design. In order to achieve good performance while using linear-like techniques, Linear Parameter Varying systems have recently received considerable attention [4]. Recently, many model-based applications using such systems and the subspace identification method were published [5].

In model-based FD, a residual vector is used to describe the consistency check between the predicted and the real behavior of the monitored system. Ideally, the residuals should only be affected by the faults. However, the presence of disturbances, noise and modeling errors causes the residual to become non-zero. To take these errors into account, the fault detection algorithm must be robust. When modeling uncertainty in a deterministic way, there are two robust estimation methods. The first is bounded error estimation, which assumes that the parameters are time invariant and that there is only an additive error [6]. The second is the interval predictor, which takes into account the variation of parameters and considers both additive and multiplicative errors [7], [8]. Here, the interval predictor is combined with the existing nominal LPV identification presented in [9], which allows robustness to be included and false alarms to be minimized (see Fig. 1) [10].

Figure 1: Fault diagnosis with set estimator schema

Thus, this paper contributes a new set-membership estimator approach that combines the interval predictor scheme with LPV identification through subspace methods in one step. To illustrate the proposed methodology, the pitch subsystem of the wind turbine system proposed as a benchmark in IFAC SAFEPROCESS 2009 is used. First, this subsystem is modeled as an LPV model using the hydraulic pressure as the scheduling variable. Under the hypothesis that the damping ratio and natural frequency vary affinely with hydraulic pressure, this affine LPV model is estimated by means of the subspace LPV estimation algorithm. Second, the residual is synthesized to take into account robustness against the uncertainties in the parameters.

This work is organized as follows. In Section 2, the LPV subspace estimation method is recalled. In Section 3, the interval predictor approach combined with the LPV subspace method is proposed as a tool for robust fault detection. In Section 4, the modeling of the pitch system as an LPV model is introduced. Section 5 deals with simulation experiments that illustrate the implementation and performance of the proposed approach applied to robust fault detection of the wind turbine pitch system. Finally, Section 6 gives some concluding remarks.
                                                                                              Y = [yp+1 , ..., yN ]                        (5)
2 LPV Subspace Identification method
In the literature, there are two methods for LPV identifica-                                                                  ]
tion: First, the ones based on global LPV estimation. Second, the ones based on the interpolation of local models [11]. However, those approaches could lead to unstable representations of the LPV structure even when the original system is stable [12]. That is why, in this paper, we propose to use the subspace identification algorithm of [9] and [13] to identify LPV systems; it does not require interpolation or identification of local models and avoids instability problems.

2.1 Problem formulation

In the model used for identification in [9], the system matrices depend linearly on the time-varying scheduling vector as follows:

$$x_{k+1} = \sum_{i=1}^{m} \mu_k^{(i)} \left( A^{(i)} x_k + B^{(i)} u_k + K^{(i)} e_k \right) \tag{1}$$

$$y_k = C x_k + D u_k + e_k \tag{2}$$

where $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^r$ and $y_k \in \mathbb{R}^{\ell}$ are the state, input and output vectors, $e_k$ denotes the zero-mean white innovation process and $m$ is the number of local models or scheduling parameters:

$$\mu_k = \left[ 1, \mu_k^{(2)}, \ldots, \mu_k^{(m)} \right]^T$$

Eqs. (1) and (2) can be written in the predictor form:

$$x_{k+1} = \sum_{i=1}^{m} \mu_k^{(i)} \left( \tilde{A}^{(i)} x_k + \tilde{B}^{(i)} u_k + K^{(i)} y_k \right) \tag{3}$$

with

$$\tilde{A}^{(i)} = A^{(i)} - K^{(i)} C$$

$$\tilde{B}^{(i)} = B^{(i)} - K^{(i)} D$$

2.2 Assumptions and notation

Defining $z_k = \left[ u_k^T, y_k^T \right]^T$ and using a data window of length $p$ to define the following vector:

$$\bar{z}_k^p = \begin{bmatrix} z_k \\ z_{k+1} \\ \vdots \\ z_{k+p-1} \end{bmatrix}$$

and introducing the matrix obtained using the Kronecker product $\otimes$:

$$P_{p/k} = \mu_{k+p-1} \otimes \cdots \otimes \mu_k \otimes I_{r+\ell}$$

we can define

$$N_k^p = \begin{bmatrix} P_{p/k} & & & 0 \\ & P_{p-1/k+1} & & \\ & & \ddots & \\ 0 & & & P_{1/k+p-1} \end{bmatrix}$$

Now, by defining the matrices $U$, $Y$ and $Z$:

$$U = [u_{p+1}, \ldots, u_N] \tag{4}$$

$$Z = \left[ N_1^p \bar{z}_1^p, \ldots, N_{N-p+1}^p \bar{z}_{N-p+1}^p \right] \tag{6}$$

the controllability matrix can be expressed as:

$$\kappa^p = [l_p, \ldots, l_1]$$

with

$$l_1 = \left[ \bar{B}^{(1)}, \ldots, \bar{B}^{(m)} \right]$$

and

$$l_j = \left[ \tilde{A}^{(1)} l_{j-1}, \ldots, \tilde{A}^{(m)} l_{j-1} \right]$$

If the matrix $\left[ Z^T, U^T \right]$ has full row rank, the matrices $C\kappa^p$ and $D$ can be estimated by solving the following linear regression problem [14]:

$$\min_{C\kappa^p, D} \left\| Y - C\kappa^p Z - DU \right\|_F^2 \tag{7}$$

where $\|\cdot\|_F$ represents the Frobenius norm. This problem can be solved by using traditional least squares methods, as in the case of LTI identification for time-varying systems. Moreover, the observability matrix for the first model is calculated as follows:

$$\Gamma^p = \begin{bmatrix} C \\ C \tilde{A}^{(1)} \\ \vdots \\ C (\tilde{A}^{(1)})^{p-1} \end{bmatrix}$$

with

$$\bar{\kappa}_k^p = \left[ \varphi_{p-1,k+1} \breve{B}_k, \ldots, \varphi_{1,k+p-1} \breve{B}_{k+p-2}, \breve{B}_{k+p-1} \right]$$

and

$$\breve{B}_k = [\tilde{B}, K_k]$$

Then, Eq. (3) can be transformed into:

$$x_{k+p} = \varphi_{p,k} x_k + \bar{\kappa}_k^p \bar{z}_k^p$$

$$x_{k+p} = \varphi_{p,k} x_k + \kappa^p N_k^p \bar{z}_k^p$$

where

$$\varphi_{p,k} = \tilde{A}_{k+p-1} \cdots \tilde{A}_{k+1} \tilde{A}_k$$

If the system (3) is uniformly exponentially stable, the approximation error can be made arbitrarily small, so that:

$$x_{k+p} \approx \kappa^p N_k^p \bar{z}_k^p$$

To calculate the observability matrix $\Gamma^p$ times the state $X$, we first calculate the matrix $\Gamma^p \kappa^p$:

$$\Gamma^p \kappa^p = \begin{bmatrix} C l_p & C l_{p-1} & \cdots & C l_1 \\ 0 & C \tilde{A}^{(1)} l_{p-1} & \cdots & C \tilde{A}^{(1)} l_1 \\ \vdots & & \ddots & \vdots \\ 0 & & & C (\tilde{A}^{(1)})^{p-1} l_1 \end{bmatrix}$$

Then, using the following Singular Value Decomposition (SVD):

$$\widehat{\Gamma^p \kappa^p Z} = \begin{bmatrix} \upsilon & \upsilon_\perp \end{bmatrix} \begin{bmatrix} \Sigma_n & 0 \\ 0 & \Sigma \end{bmatrix} \begin{bmatrix} V \\ V_\perp \end{bmatrix}$$






the state is estimated by:

$$\hat{X} = \Sigma_n V$$

Finally, the $C$ and $D$ matrices are estimated using the output equation (2), and $A$ and $B$ are estimated using the state equation (1). This algorithm can be summarized as follows [9]:
  • Create the matrices $U$, $Y$ and $Z$ using (4), (5) and (6),
  • Solve the linear problem given in (7),
  • Construct $\Gamma^p$ times the state $X$,
  • Estimate the state sequence,
  • With the estimated state, use the linear relations to obtain the system matrices.

In the case of a very small $p$, the estimate is in general biased, and a large bias is a problem. That is why a large $p$ would be chosen. In the case of a very large $p$, however, this method suffers from the curse of dimensionality [13]: the number of rows of $Z$ grows exponentially with the size of the past window. In fact, the number of rows is given by:

$$\rho_Z = (r + \ell) \sum_{j=1}^{p} m^j$$

To overcome this drawback, the kernel method is introduced in the next subsection [15].

2.3 Kernel method

Equation (7) has a unique solution if the matrix $\left[ Z^T \; U^T \right]$ has full row rank, and this solution is given by:

$$\begin{bmatrix} \widehat{C\kappa^p} & \hat{D} \end{bmatrix} = Y \begin{bmatrix} Z^T & U^T \end{bmatrix} \left( \begin{bmatrix} Z \\ U \end{bmatrix} \begin{bmatrix} Z^T & U^T \end{bmatrix} \right)^{-1}$$

When this is not the case, which occurs when $p$ is large, the solution is computed by using the SVD of the matrix:

$$\begin{bmatrix} Z \\ U \end{bmatrix} = \begin{bmatrix} \upsilon & \upsilon_\perp \end{bmatrix} \begin{bmatrix} \Sigma_m & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}$$

Then, the minimum norm solution is given by:

$$\begin{bmatrix} \widehat{C\kappa^p} & \hat{D} \end{bmatrix} = Y V \Sigma_m^{-1} \upsilon^T$$

To avoid computations in a high-dimensional space, the minimum norm problem is written as:

$$\min_{\alpha} \|\alpha\|_F^2 \tag{8}$$

with

$$Y - \alpha \left[ Z^T Z + U^T U \right] = 0$$

where $\alpha$ are the Lagrange multipliers and $\left[ Z^T Z + U^T U \right]$ is referred to as the kernel matrix.
The matrix $\Gamma^p$ times the state $X$ can be constructed as follows:

$$\Gamma^p \kappa^p Z = \begin{bmatrix} \alpha \sum_{j=1}^{p} (Z^{1,j})^T Z^{1,j} \\ \alpha \sum_{j=2}^{p} (Z^{2,j})^T Z^{1,j} \\ \vdots \\ \alpha \sum_{j=p}^{p} (Z^{p,j})^T Z^{1,j} \end{bmatrix} \tag{9}$$

with

$$(Z^{i,j})^T Z^{1,j} = \left( \prod_{v=0}^{p-j} \mu_{\tilde{N}+v+j-i}^T \, \mu_{\tilde{N}+v+j-1} \right) \left( z_{\tilde{N}+j-i}^T \, z_{\tilde{N}+j-1} \right)$$

$$Z^T Z = \sum_{j=1}^{p} (Z^{1,j})^T Z^{1,j} \tag{10}$$

Finally, the estimated state sequence is obtained by solving the original SVD problem.
The kernel method can be summarized as follows [9]:
  • Create the matrices $U^T U$ using (4), and $Z^T Z$ and $(Z^{i,j})^T Z^{1,j}$ using (10),
  • Solve the linear problem given in (8),
  • Construct $\Gamma^p$ times the state $X$ using (9) and (10),
  • Estimate the state sequence,
  • With the estimated state, use the linear relation to obtain the system matrices.

3 Interval predictor approach

To add robustness to the LPV subspace identification approach presented in the previous section, it is combined with the interval predictor approach [16]. The interval predictor approach is an extension of classical system identification methods that provides the nominal model plus uncertainty bounds for the parameters, guaranteeing that all data collected from the system in non-faulty scenarios are included in the model prediction interval. This approach considers the additive and multiplicative uncertainties separately. Additive uncertainty is taken into account in the additive error term $e(k)$, and modeling uncertainty is considered to be located in the parameters, which are represented by a nominal value plus some uncertainty set around it. In the literature, there are many approximations of the uncertain parameter set $\Theta$. In our case, this set is described by a zonotope [10]:

$$\Theta = \theta^0 \oplus H B^n = \{ \theta^0 + Hz : z \in B^n \} \tag{11}$$

where $\theta^0$ is the nominal model (here obtained with the identification approach), $H$ is the matrix defining the uncertainty shape, $B^n$ is a unitary box composed of $n$ unitary ($B = [-1, 1]$) interval vectors and $\oplus$ denotes the Minkowski sum. A particular case is used here, in which the parameter set $\Theta$ is bounded by an interval box [17]:

$$\Theta = [\underline{\theta}_1, \overline{\theta}_1] \times \cdots [\underline{\theta}_i, \overline{\theta}_i] \times \cdots [\underline{\theta}_{n_\theta}, \overline{\theta}_{n_\theta}] \tag{12}$$

where $\underline{\theta}_i = \theta_i^0 - \lambda_i$ and $\overline{\theta}_i = \theta_i^0 + \lambda_i$ with $\lambda_i \geq 0$ and $i = 1, \ldots, n_\theta$. In particular, the interval box can be viewed as a zonotope with center $\theta^0$ and $H$ equal to an $n_\theta \times n_\theta$ diagonal matrix:

$$\theta^0 = \left( \frac{\underline{\theta}_1 + \overline{\theta}_1}{2}, \frac{\underline{\theta}_2 + \overline{\theta}_2}{2}, \ldots, \frac{\underline{\theta}_{n_\theta} + \overline{\theta}_{n_\theta}}{2} \right) \tag{13}$$

$$H = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{n_\theta}) \tag{14}$$

For every output, a model can be extracted in the following regressor form:

$$y(k) = \varphi(k)\theta(k) + e(k) \tag{15}$$

where






  • $\varphi(k)$ is the regressor vector of dimension $1 \times n_\theta$, which can contain any function of inputs $u(k)$ and outputs $y(k)$.
  • $\theta(k) \in \Theta$ is the parameter vector of dimension $n_\theta \times 1$.
  • $\Theta$ is the set that bounds parameter values.
  • $e(k)$ is the additive error, bounded by a constant: $|e(k)| \leq \sigma$.

In the interval predictor approach, the set of uncertain parameters $\Theta$ should be obtained such that all data measured in fault-free scenarios are covered by the predicted output interval:

$$y(k) \in [\underline{\hat{y}}(k) - \sigma, \; \overline{\hat{y}}(k) + \sigma] \tag{16}$$

where

$$\underline{\hat{y}}(k) = \hat{y}^0(k) - \|\varphi(k) H\|_1 \tag{17}$$

$$\overline{\hat{y}}(k) = \hat{y}^0(k) + \|\varphi(k) H\|_1 \tag{18}$$

and $\hat{y}^0(k)$ is the model output prediction with nominal parameters $\theta^0 = [\theta_1, \theta_2, \ldots, \theta_{n_\theta}]^T$ obtained using the LPV identification algorithm:

$$\hat{y}^0(k) = \varphi(k)\theta^0(k) \tag{19}$$

Then, fault detection is based on checking whether (16) is satisfied. If it is not satisfied, a fault can be indicated. Otherwise, nothing can be said.

4 Case study: wind turbine benchmark system

In this work, a specific variable speed turbine is considered. It is a three-blade horizontal axis turbine with a full converter. The conversion from wind energy to mechanical energy can be controlled by changing the aerodynamics of the turbine, either by pitching the blades or by controlling the rotational speed of the turbine relative to the wind speed. The mechanical energy is converted to electrical energy by a generator fully coupled to a converter. Between the rotor and the generator, a drive train is used to increase the rotational speed from the rotor to the generator [18]. This model can be decomposed into submodels: Aerodynamic, Pitch, Drive train and Generator [19] [20]. In this paper, we focus on faults in the pitch subsystem, as explained in the following subsection.

4.1 Pitch system model

In the wind turbine benchmark model, the hydraulic pitch is a piston servo mechanism which can be modeled by a second order transfer function [21] [1]:

$$\frac{\beta(s)}{\beta_r(s)} = \frac{\omega_n^2}{s^2 + 2\xi\omega_n s + \omega_n^2} \tag{20}$$

Notice that $\beta_r$ refers to reference values of pitch angles. The pitch model can be written in the following state space form:

$$\begin{cases} \dot{x}_1 = x_2 \\ \dot{x}_2 = -2\xi\omega_n x_2 - \omega_n^2 x_1 + \omega_n^2 u \end{cases} \tag{21}$$

with $x_1 = \beta$, $x_2 = \dot{\beta}$, $u = \beta_r$, which can be discretised using an Euler approximation with step $T_e$. Then, the following system is obtained:

$$\begin{cases} x(k+1) = A x(k) + B u(k) \\ y(k) = C x(k) \end{cases} \tag{22}$$

with

$$A = \begin{bmatrix} 1 & T_e \\ -T_e \omega_n^2 & -2 T_e \xi \omega_n + 1 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ T_e \omega_n^2 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}$$

4.2 LPV Pitch system model

The pitch parameters $\omega_n$ and $\xi$ vary with the hydraulic pressure $P$ [1] [22]. Then, the pitch model can be written as the following LPV model according to [23], using $P$ as the scheduling variable $\vartheta$:

$$\begin{cases} x(k+1) = A(\vartheta) x(k) + B(\vartheta) u(k) \\ y(k) = C x(k) \end{cases} \tag{23}$$

with

$$A(\vartheta) = \begin{bmatrix} 1 & T_e \\ -T_e \omega_n^2(P) & -2 T_e \xi(P) \omega_n(P) + 1 \end{bmatrix}, \quad B(\vartheta) = \begin{bmatrix} 0 \\ T_e \omega_n^2(P) \end{bmatrix}$$

$$y(k) = x_1(k) = \beta(k)$$

4.3 Regressor form pitch system model

The pitch model can be transformed to the following regression form [24]:

$$y(k) = \varphi(k)\theta(k) \tag{24}$$

where $\varphi(k)$ is the regressor vector, which can contain any function of inputs $u(k)$ and outputs $y(k)$, $\theta(k) \in \Theta$ is the parameter vector and $\Theta$ is the set that bounds parameter values.

In particular

$$\varphi(k) = [\, y(k-2) \;\; y(k-1) \;\; u(k-2) \,]$$

$$\theta = [\, \theta_1 \;\; \theta_2 \;\; \theta_3 \,]^T$$

$$\theta_1 = -T_e^2 \omega_n^2 + (2\omega_n \xi T_e - 1)$$

$$\theta_2 = -2\omega_n \xi T_e + 2$$

$$\theta_3 = T_e^2 \omega_n^2$$

5 Results

The pitch systems, which in this case are hydraulic, could be affected by faults in any of the three blades. The considered faults in the hydraulic system can result in changed dynamics due to a drop in the main line pressure. This dynamic change induces a change in the system parameters: the damping ratio varies between 0.6 and 0.9 and the frequency between 3.42 rad/s and 11.11 rad/s according to [23]. In this work, a fault detection subspace estimator is designed to determine the presence of a fault. To distinguish between fault and modeling errors, an interval predictor approach is applied and a residual generation is used for







deciding if there is a fault.

Figure 2: Upper (red line) and lower (blue line) bounds (y-axis: minmaxoutput; x-axis: Time (s), ×10^4).

Figure 3: Damping ratio in non-faulty case (y-axis: damping ratio in case 1; x-axis: Time (s), ×10^4).

To illustrate the performance of this robust fault detection approach, $\xi \in [0.6, 0.63]$ and $\omega_n \in [10.34, 11.11]$ are considered. Then, the parameter set $\Theta$ is bounded by an interval box:

$$\Theta = [\underline{\theta}_1, \overline{\theta}_1] \times [\underline{\theta}_2, \overline{\theta}_2] \times [\underline{\theta}_3, \overline{\theta}_3] \tag{25}$$

and, for $i = 1, \ldots, 3$,

$$\lambda_i = \frac{\overline{\theta}_i - \underline{\theta}_i}{2} \tag{26}$$

$$\theta_i^0 = \frac{\underline{\theta}_i + \overline{\theta}_i}{2} \tag{27}$$
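As a minimal sketch of how the interval box (25) and the quantities in (26) and (27) might be computed, the following Python snippet maps the ranges of ξ and ωn to the regressor parameters θ1, θ2, θ3 of Section 4.3 and takes their extreme values over a grid. The sampling period value `Te = 0.01`, the grid-based bounding and all function names are assumptions made for illustration; the paper does not state how the bounds were obtained.

```python
from itertools import product


def pitch_theta(xi, wn, Te):
    """Regressor parameters (theta1, theta2, theta3) of Section 4.3."""
    theta1 = -Te**2 * wn**2 + (2.0 * wn * xi * Te - 1.0)
    theta2 = -2.0 * wn * xi * Te + 2.0
    theta3 = Te**2 * wn**2
    return (theta1, theta2, theta3)


def linspace(a, b, n):
    """n evenly spaced points from a to b (inclusive)."""
    return [a + (b - a) * i / (n - 1) for i in range(n)]


def interval_box(xi_range, wn_range, Te, n_grid=21):
    """Bound theta over the (xi, wn) rectangle by gridding.

    Gridding is an illustrative assumption, not the paper's method.
    """
    grid = product(linspace(*xi_range, n_grid), linspace(*wn_range, n_grid))
    samples = [pitch_theta(xi, wn, Te) for xi, wn in grid]
    lower = [min(s[i] for s in samples) for i in range(3)]
    upper = [max(s[i] for s in samples) for i in range(3)]
    lam = [(u - l) / 2.0 for l, u in zip(lower, upper)]      # Eq. (26)
    theta0 = [(u + l) / 2.0 for l, u in zip(lower, upper)]   # Eq. (27)
    return lower, upper, theta0, lam


Te = 0.01  # hypothetical sampling period in seconds
lower, upper, theta0, lam = interval_box((0.6, 0.63), (10.34, 11.11), Te)
```

The half-widths in `lam` are the diagonal entries of H in Eq. (14), so the box can be passed directly to an interval predictor.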

using equations (17) and (18), the output bounds are calcu-                                                                                         10.3
                                                                                                                                                                0              0.5           1              1.5        2               2.5
lated to be used in fault detection test which are given in                                                                                                                                       Time(s)                           x 10
                                                                                                                                                                                                                                           4



Fig. 2.yˆ0 (k) is obtained by the use of the identification ap-
proach described in Section 2. To validate this algorithm                                                                                                            Figure 4: Frequency in non-faulty case
two cases are used:
- Case 1: In this case, the pressure varies after time 10000s
while parameters vary in the interval of parametric uncer-
tainties, that is, damping ratio varies between 0.6 rad/s
and 0.63 rad/s and the frequency between 10.34 rad/s
and 11.11 rad/s. These parameters are presented respec-                                                                                                   2.1

tively in Figures. 3 and 4. The pitch angle in this case is                                                                                                2
given in Fig. 5 altogether with the prediction intervals.
For fault detection, the residual signal, i.e. the difference between the
measured pitch angle and the estimated one at each sampling instant, is
computed; it is shown in Fig. 6. For fault decision, a fault indicator signal
is used and the decision is taken as a function of this indicator. If the
actual angle is not within the predicted interval given in Eq. (16), the fault
indicator is equal to 1 and the system is declared faulty. Otherwise, it is
equal to 0 and the system is considered fault-free. The fault indicator signal
given in Fig. 7 shows that no fault is detected despite the pressure
variation: the parameter variation is treated as a modeling error.
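The decision rule described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names `residual` and `fault_indicator` and all sample values are hypothetical, and the bounds `y_min` and `y_max` stand in for the predicted interval of Eq. (16), which is taken as given here.

```python
def residual(y_meas, y_est):
    """Residual: measured output minus estimated output, sample by sample."""
    return [m - e for m, e in zip(y_meas, y_est)]


def fault_indicator(y_meas, y_min, y_max):
    """Return 1 where the measurement leaves [y_min, y_max], 0 elsewhere."""
    return [0 if lo <= m <= hi else 1
            for m, lo, hi in zip(y_meas, y_min, y_max)]


if __name__ == "__main__":
    # Hypothetical pitch-angle samples and interval bounds (not from the paper).
    y_meas = [1.50, 1.62, 1.95, 1.70]
    y_est = [1.52, 1.60, 1.70, 1.69]
    y_min = [1.47, 1.55, 1.65, 1.64]
    y_max = [1.57, 1.65, 1.75, 1.74]
    print(residual(y_meas, y_est))
    # Only the third sample lies outside its interval and is flagged.
    print(fault_indicator(y_meas, y_min, y_max))
```

With this rule, a parameter variation that stays inside the modeled uncertainty keeps the measurement within the bounds, so the indicator remains at 0.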
Figure 5: Pitch angle in non-faulty case

Figure 6: Residual in non-faulty case

Figure 7: Fault indicator in non-faulty case

- Case 2: In this case, the pressure P deviates from its nominal value between
t = 10000 s and t = 17000 s. In this time interval, the damping ratio varies
between 0.63 and 0.72 and the natural frequency varies between 8.03 rad/s and
10.34 rad/s. Outside this interval, the damping ratio varies between 0.6 and
0.63 and the natural frequency varies between 10.34 rad/s and 11.11 rad/s, as
shown in Figures 8 and 9. In this case, the pitch angle is given in Fig. 10,
while the residual and fault indicator signals are presented in Fig. 11 and
Fig. 12, respectively.
   Fig. 12 shows that the fault indicator signal changes its value between
t = 10000 s and t = 17000 s, which indicates that the parameters vary beyond
the modeled range because of an actuator fault in the wind turbine benchmark
system during this interval.

Figure 8: Damping ratio in faulty case

Figure 9: Frequency in faulty case

6 Conclusions
The proposed approach relies on LPV estimation to generate a residual as the
difference between the real and the nominal behavior of the monitored system.
When a fault occurs, this residual leaves the interval that represents the
uncertainty bounds of the non-faulty case. These bounds are generated by an
interval predictor, which makes the fault detection method robust by
propagating the parameter uncertainty to the residual or predicted output. The
proposed approach is illustrated by implementing a robust fault detection
scheme for the pitch subsystem of the wind turbine benchmark. Simulations show
satisfactory fault detection performance despite model uncertainties.
Figure 10: Pitch angle in faulty case

Figure 11: Residual signal

Figure 12: Fault indicator

References
[1] P. Odgaard, J. Stoustrup, and M. Kinnaert. Fault tolerant control of wind
     turbines - a benchmark model. In 7th IFAC symposium on fault detection,
     supervision and safety of technical processes, Barcelona, Spain, 2009.
[2] R. Isermann. Fault diagnosis systems: an introduction from fault detection
     to fault tolerance. 2006.
[3] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, and J. Schröder.
     Diagnosis and fault-tolerant control. 2006.
[4] G. Mercere, M. Lovera, and E. Laroche. Identification of a flexible robot
     manipulator using a linear parameter-varying descriptor state-space
     structure. In Proc. of the IEEE conference on decision and control,
     Orlando, Florida, USA, 2011.
[5] J. Dong, B. Kulcsár, and M. Verhaegen. Fault detection and estimation based
     on closed-loop subspace identification for linear parameter varying
     systems. In DX, Stockholm, 2009.
[6] J. Bravo, T. Alamo, and E.F. Camacho. Bounded error identification of
     systems with time-varying parameters. IEEE Transactions on Automatic
     Control, 51:1144–1150, 2006.
[7] D. Efimov, L. Fridman, T. Raissi, A. Zolghadri, and R. Seydou. Interval
     estimation for LPV systems applying high order sliding mode techniques.
     Automatica, 48:2365–2371, 2012.
[8] D. Efimov, T. Raissi, and A. Zolghadri. Control of nonlinear and LPV
     systems: interval observer-based framework. IEEE Transactions on
     Automatic Control, 2013.
[9] J. Van Willem and M. Verhagen. Subspace identification of bilinear and LPV
     systems for open- and closed-loop data. Automatica, 45:371–381, 2009.
[10] J. Blesa, V. Puig, J. Romera, and J. Saludes. Fault diagnosis of wind
     turbines using a set-membership approach. In the 18th IFAC world
     congress, Milano, Italy, 2011.
[11] H. Tanaka, Y. Ohta, and Y. Okimura. A local approach to LPV-identification
     of a twin rotor MIMO system. In Proceedings of the 47th IEEE Conference
     on Decision and Control, Cancun, Mexico, 2008.
[12] R. Toth, F. Felici, P. Heuberger, and P. Van den Hof. Discrete time LPV
     I/O and state-space representations, differences of behavior and pitfalls
     of interpolation. In Proceedings of the European Control Conference
     (ECC), Kos, Greece, 2007.
[13] J. Van Willem and M. Verhagen. Subspace identification of MIMO LPV
     systems: the PBSID approach. In Proceedings of the 47th IEEE Conference
     on Decision and Control, Cancun, Mexico, 2008.
[14] P. Gebraad, J. Van Wingerden, G. Van der Veen, and M. Verhaegen. LPV
     subspace identification using a novel nuclear norm regularization method.
     In American Control Conference on O’Farrell Street, San Francisco, CA,
     USA, 2011.
[15] V. Verdult and M. Verhaegen. Kernel methods for subspace identification
     of multivariable LPV and bilinear systems. Automatica, 41:1557–1565, 2005.
[16] J. Blesa, V. Puig, and J. Saludes. Identification for passive robust fault
     detection using zonotope based set membership approaches. International
     journal of adaptive control and signal processing, 25:788–812, 2011.
[17] P. Puig, V. Quevedo, T. Escobet, F. Nejjari, and S. De las Heras. Passive
     robust fault detection of dynamic processes using interval models. IEEE
     Transactions on Control Systems Technology, 16:1083–1089, 2008.
[18] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Set-point reconfiguration
     approach for the FTC of wind turbines. In the 18th World Congress of the
     International Federation of Automatic Control (IFAC), Milano, Italy, 2011.
[19] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Two-level active fault
     tolerant control approach. In The Eighth International Multi-Conference
     on Systems, Signals & Devices (SSD’11), Sousse, Tunisia, 2011.
[20] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Active fault tolerant approach
     for wind turbines. In The International Conference on Communications,
     Computing and Control Applications (CCCA’11), Hammamet, Tunisia, 2011.
[21] P. Odgaard, J. Stoustrup, and M. Kinnaert. Fault tolerant control of wind
     turbines - a benchmark model. IEEE Transactions on Control Systems
     Technology, 21:1168–1182, 2013.
[22] P. Odgaard and J. Stoustrup. Results of a wind turbine FDI competition.
     In 8th IFAC symposium on fault detection, supervision and safety of
     technical processes, Mexico, 2012.
[23] C. Sloth, T. Esbensen, and J. Stoustrup. Robust and fault tolerant linear
     parameter varying control of wind turbines. Mechatronics, 21:645–659, 2011.
[24] H. Chouiref, B. Boussaid, M.N. Abdelkrim, V. Puig, and C. Aubrun. LPV
     model-based fault detection: Application to wind turbine benchmark. In
     International conference on electrical sciences and technologies
     (CISTEM’14), Tunis, 2014.








                       Processing measure uncertainty into fuzzy classifier

         Thomas Monrousseau¹, Louise Travé-Massuyès¹ and Marie-Véronique Le Lann¹,²
               ¹ CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
               ² Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
              e-mails: thomas.monrousseau@laas.fr, louise@laas.fr, mvlelann@laas.fr



Abstract
Machine learning, such as data-based classification, is a useful diagnosis
solution for monitoring complex systems when designing a model is a long and
expensive process. When it is used for process monitoring, the processed data
are provided by sensors. In many situations, however, it is hard to get an
exact measure from these sensors: measurements are corrupted by noise that can
be caused by the environment, misuse of the sensor or even the
analog-to-digital conversion. In this paper we propose a framework based on a
fuzzy logic classifier to model the uncertainty on the data by the use of crisp
(non fuzzy) or fuzzy intervals. Our objective is to increase the number of good
classification results in the presence of noisy data. The classifier is named
LAMDA (Learning Algorithm for Multivariate Data Analysis) and can perform
machine learning and clustering on different kinds of data such as numerical
values, symbols or interval values.

1 Introduction
Data classification is the process of dividing pattern space using hard, fuzzy
or probabilistic partitions into a number of regions [1]. Classification
algorithms are increasingly used in a world where it is not always simple to
get a model of a complex process. Conversely, it is easier to obtain data on
systems by monitoring them and storing the measurements. Different types of
classifiers can be used depending on the situation. The principal ones
described in the literature are artificial neural networks, k-nearest
neighbors, support vector machines, decision trees, fuzzy classifiers and
statistical methods.
   Most of the time, data are issued from sensor measurements and are corrupted
by noise. This noise can have different origins, for example environmental
disturbances, misuse of the sensor, hysteresis effects or numerical conversion
and representation of the data. Many application domains have to deal with
noise problems, such as medical diagnosis [2], biological identification [3] or
image recognition [4]. Uncertainty can be understood in two ways: the first is
the uncertainty directly present in the data, like noise, and the second can be
seen as the reliability of a feature inside a class. In this paper we consider
only the first case. To avoid noise problems in classification, some solutions
have been proposed previously, for example the transformation of data [5] [6]
[7], the use of type-1 or type-2 fuzzy logic [3] or statistical models.
   Fuzzy logic is a multi-valued logic framework introduced by Zadeh [8] that
is known to be more efficient for representing uncertainty and imprecision than
binary logic. In previous work, a fuzzy classifier named Learning Algorithm for
Multivariate Data Analysis (LAMDA) has been proposed by Aguilar [9]. This
classifier can originally process two different types of data simultaneously:
quantitative data and qualitative data. A real number contains an infinite
amount of precision whereas human knowledge is finite and discrete. LAMDA is
therefore interesting, because no solution is proposed in the literature to
process heterogeneous data in a uniform way, and handling quantitative and
qualitative data in the same problem is often complex. A new type of data, the
interval, has been introduced by Hedjazi [10] to model uncertainties by means
of crisp intervals. In this paper we propose an extension to fuzzy intervals in
order to improve its application to noisy data measurements, while keeping the
capacity to handle other feature types such as “clean” data or qualitative
features. Moreover, the algorithm should stay low-cost in terms of memory and
computation time so that the method can be embedded on small systems.

   In the first part of the paper the LAMDA algorithm is briefly presented;
then a method to use the algorithm to classify noisy data is introduced. This
method is in two parts: the first presents a general solution to model
uncertainty on data with crisp intervals based on confidence intervals, and the
second shows an improvement to model Gaussian noise with fuzzy intervals. In
both cases, application examples are given to show the improvement of the
method compared to the use of the data without transformation.

2 LAMDA algorithm (Learning Algorithm for Multivariate Data Analysis)
This section presents the principle of the LAMDA algorithm.

2.1 General principle
LAMDA is a classification algorithm based on fuzzy logic, created from an
original idea of Aguilar [9]; it can achieve machine learning and clustering on
large data sets.
   The algorithm takes as input a sample x made up of N features. The first
step is to compute, for each feature of x, an






adequacy degree to each class $C_j$, $j = 1..J$, where J is the total number of
classes. This is obtained by the use of a fuzzy adequacy function. J vectors of
N adequacy degrees are thus computed; these vectors are called Marginal
Adequacy Degree (MAD) vectors. At this point, all the features are in a common
space. The second step is to aggregate all the MADs into one global adequacy
degree (GAD) by means of a fuzzy aggregation function. Thus the J MAD vectors
(each composed of N MADs) become J scalar GADs; the higher the GAD, the better
the adequacy to the class. The simplest way to assign the sample x to a class
is to keep as result the class with the biggest GAD.
   The whole process is summarized in Fig. 1.

Figure 1: Summarized scheme of the LAMDA algorithm

    where $x_n$ is the n-th feature of the sample x, $\rho_{j,n}$ is the mean
    of the n-th feature for the class j and $\sigma_{j,n}$ is the standard
    deviation of the n-th feature for the class j.

  • Qualitative data:

    A qualitative feature takes its values in a set of modalities. The
    membership function of qualitative data returns the frequency of the
    modality taken by the feature in the class during the learning phase. We
    introduce a qualitative variable with K modalities $\{Q_1, ..., Q_K\}$ and
    the frequency $\Phi^k_{j,n}$ of the modality $Q_k$ for the class j. The
    membership is described by:

        $f(x_n) = (\Phi^1_{j,n})^{q^1} \cdot \ldots \cdot (\Phi^K_{j,n})^{q^K}$        (4)

        with $q^k = 0$ if $x_n \neq Q_k$ and $q^k = 1$ if $x_n = Q_k$

  • Intervals:

    The membership function for interval data is a function which tests the
    similarity between two fuzzy intervals. In this case similarity is defined
    by two components: the distance between the intervals and the surface that
    these intervals have in common. The class prototype for crisp interval
    data is a mean interval. The similarity function is:
                                                                          ∫
                                                                               1 V µA∩B (ξ)dξ     ∂[A, B]
                                                                      S(A, B) = ( R           +1−         ) (5)
2.2 Fuzzy membership computation                                               2 V µA∪B (ξ)dξ      $[V ]
During the learning step, the algorithm creates prototype
data for each class and for each feature. These data are
called classe descriptors or prototypes; they can be for ex-        where µX (x) is the value of x in the fuzzy set X,
ample means or variances. We define as Cj,n the class pro-          ∂[A, B] is the distance between intervals A = [a− , a+ ]
totype of the n-th feature for the class j.                         and B = [b− , b+ ]and $[X] is the size of a fuzzy set
   As previously mentioned the first step of the algorithm          into a V universe. This is described by:
is a comparison between the sample vector x and all the                                        Z
Cj,n . This operation is performed with membership func-                             $[X] =       µX (ξ)dξ               (6)
tions and gives as result a membership adequacy degree.                                          V
Thus M ADj,n is the MAD for the j-th class and the n-               In the case of crisp intervals and in a universe between
th feature. As the framework is based on fuzzy logic, all           0 and 1:
memberships are numbers in the [0,1] interval. The general
membership function is:                                                                 1 $[A ∩ B]
                                                                           S(A, B) =     (         + 1 − ∂[A, B])          (7)
                                                                                        2 $[A ∪ B]
                  M ADj,n = f (Cj,n , xn )                (1)
   The class prototype Cj,n depends on two things: the type         where $[X] in this case can be replaced by the length
of data and the function used. Some functions may require           of the interval:
only one data into Cj,n whereas others need a list of param-
eters.                                                                      $[X] = upperbound(X)-lowerbound(X)        (8)
   In the following section, some examples of membership
functions are presented.                                            and distance ∂[A, B] is defined as:
   • Quantitative data:                                              ∂[A, B] = max[0, max(a− , b− )−min(a+ , b+ )] (9)

     Many functions are available for this kind of data. For        In the case where an interval feature is used the pro-
     example the Gaussian:                                          totype for a class j is given by [ρn−    n+         n−
                                                                                                        j , ρj ] where ρj ,
                                                                                   n+
                                     (xn − ρj,n )2                  respectively ρj represents the mean value of lower
                                 −        2
                                        2σj,n                       bounds (respectively upper bounds) of all the elements
                   f (xn ) = e                            (2)       belonging to class j for this feature.
                                                                    Once the MAD are computed whatever the feature
     or the binomial function:                                      type, it is possible to perform any type of processing
                 f (xn ) = ρxj,n
                               n
                                 .(1 − ρj,n )1−xn         (3)       as described on Fig. 2
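The membership functions (2), (3), (4) and (7) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names, the dictionary-based data layout and the example prototype values are assumptions made here. The last two functions apply the min-max aggregation given in (11), with the exigency indicator set to 0.8 as in the experiments, and keep the class with the highest GAD.

```python
import math

def gaussian_mad(x, mean, std):
    """Gaussian membership (2): equals 1 at the class mean."""
    return math.exp(-((x - mean) ** 2) / (2.0 * std ** 2))

def binomial_mad(x, rho):
    """Binomial membership (3); x and rho are assumed normalised to [0, 1]."""
    return (rho ** x) * ((1.0 - rho) ** (1.0 - x))

def qualitative_mad(x, freqs):
    """Qualitative membership (4): product of Phi_k ** q_k, where q_k = 1 only
    for the modality taken by x. freqs maps each modality to its frequency."""
    return math.prod(phi ** (1 if x == q else 0) for q, phi in freqs.items())

def crisp_interval_mad(a, b):
    """Similarity (7) between crisp intervals a = (a_lo, a_hi), b = (b_lo, b_hi),
    both assumed to lie in the universe [0, 1]."""
    a_lo, a_hi = a
    b_lo, b_hi = b
    inter = max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))   # length of A ∩ B, see (8)
    union = (a_hi - a_lo) + (b_hi - b_lo) - inter          # length of A ∪ B
    dist = max(0.0, max(a_lo, b_lo) - min(a_hi, b_hi))     # distance (9)
    if union > 0.0:
        ratio = inter / union
    else:
        # Assumption: two identical point intervals count as fully overlapping.
        ratio = 1.0 if dist == 0.0 else 0.0
    return 0.5 * (ratio + 1.0 - dist)

def gad_min_max(mads, alpha):
    """Min-max aggregation (11) of one MAD vector into a scalar GAD."""
    return alpha * min(mads) + (1.0 - alpha) * max(mads)

def classify(mads_by_class, alpha=0.8, limit=None):
    """Return (label, GADs); label is None if no GAD is higher than the limit."""
    gads = {c: gad_min_max(m, alpha) for c, m in mads_by_class.items()}
    best = max(gads, key=gads.get)
    if limit is not None and gads[best] <= limit:
        return None, gads
    return best, gads

# Hypothetical two-class, two-feature example; each prototype is (mean, std).
prototypes = {"C1": [(0.2, 0.1), (0.7, 0.2)], "C2": [(0.8, 0.1), (0.3, 0.2)]}
sample = [0.25, 0.65]
mads = {c: [gaussian_mad(x, m, s) for x, (m, s) in zip(sample, p)]
        for c, p in prototypes.items()}
label, gads = classify(mads)
print(label, gads)
```

Because every membership function returns a value in [0,1], features of different types can be mixed freely in the same MAD vector before aggregation.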






Figure 2: Projection principle for heterogeneous feature types

2.3 Marginal adequacy degree merging
Once all the features are grouped into the membership space, the next step of the algorithm
is to transform the MAD vectors into a set of single values which depict the global
membership of the sample to each class. These values were introduced in section 2.1 and are
called GADs. To perform this transformation a fuzzy aggregation function $\Psi$ is used.
   The aggregation function is the following:

   $$\Psi(MAD) = \alpha \cdot \gamma(MAD) + (1 - \alpha) \cdot \beta(MAD) \qquad (10)$$

   where $\gamma$ is a fuzzy T-norm and $\beta$ is a fuzzy T-conorm. The $\alpha$ parameter
is called the exigency indicator. It gives more or less significance to the union operation
and the intersection operation. Two pairs of fuzzy T-norm and T-conorm are currently
implemented in the algorithm: the min-max and the probabilistic. For example if min-max is
used, (10) becomes:

   $$\Psi(MAD) = \alpha \cdot \min(MAD) + (1 - \alpha) \cdot \max(MAD) \qquad (11)$$

   When all GADs are computed they give the membership of the data $x$ to each class. The
final result depends on the application, but the simplest way to give a result is to assign
the sample to the class which has the highest GAD. A limit membership can also be fixed: if
no GAD is higher than the limit, the sample is defined as unclassifiable.

3 Uncertainty modeled with crisp intervals
3.1 Method presentation
Every data measurement is performed with noise. In some cases the noise is strong enough to
increase the classification error. Thus the point is to model the imprecision of the data
to decrease the number of bad classifications.
   A technique used in several fields of application is the use of intervals to symbolize
data uncertainty [11] [12]. So we suggest a framework where numerical data are transformed
into intervals to model imprecision.
   In a situation where the probability law followed by the noise on a variable is unknown,
it may be possible to obtain a confidence interval. It is an interval in which the real
value of the measure is present with a certain amount of confidence (for example a
confidence interval of 95% is an interval in which the exact value of the measure can be
found with a probability of 95%). Introducing $\hat{x}$ the measured value and $l$ the
length of a confidence interval centered on zero and based on the measurement error, the
interval used by the algorithm is calculated as $X = [\hat{x} - \frac{l}{2} ; \hat{x} + \frac{l}{2}]$.
   The main aim of the transformation is to improve the classification on the transition
zones where data is really sensitive to noise and a small change can modify the output of
the classifier. The use of intervals to model uncertainty is effective only if the "clean"
data is relevant for the classification problem. If it is not the case, a better solution
is to remove the irrelevant feature. It will in most cases provide better output results.
This expresses the fact that if the "clean" data is difficult to classify, it is not
improved by using confidence intervals.

3.2 Experiments
A set of data has been created for an application test which can be interpreted as the time
evolution of the sensors of a continuous process. This set of data is composed of three
quantitative (numerical) features of 101 samples that are shown in Fig. 3. Three classes
are specified and used as targets for the classifier. These classes are chosen arbitrarily
to represent different behaviors of a system that could be healthy or failure modes.
Nevertheless the classes are built to make all the data relevant for the system monitoring,
which means the three features do not have a global negative impact on the classification
results.
   The three features $x$, $y$ and $z$ are defined by the following time functions:

   • $x = e^{-\frac{t}{2}}$
   • $y = \frac{1}{2} \cdot e^{\frac{t}{4}} - 1$
   • $z = \tanh(t - 5)$

     Figure 3: Data used to test the intervals method

   This example is used to measure the improvement in the classification results in the
case where all data are noisy. Artificial noise is added as follows: $x$ is the ideal
variable without noise and $\hat{x}$ the noisy variable, $\hat{x} = x + Y$ with








$Y$ a random variable following a uniform distribution on an interval $I$.

Figure 4: An example of data corrupted with a noise in the interval [-0.5 ; 0.5]

   The experiment has been performed with these conditions: the $\alpha$ parameter of (10)
is set at 0.8 with the [min, max] functions to compute the fuzzy aggregation, and the
membership function used for quantitative data is the binomial. The [min, max] aggregation
is chosen because experiments on the algorithm showed that this kind of aggregation provides
better results on noisy data than the probabilistic one. A first classification without any
noise gives a result of 91% of good classification. Then the experiment is repeated a great
many times to avoid statistical mistakes. In this case, the experiment has been run fifty
thousand times; $\hat{x}$ is recomputed at each new run. Results are given in Table 1.

 Interval for random data                          [-0.3 ; 0.3]   [-0.5 ; 0.5]   [-2 ; 2]
 Mean success percentage with binomial function       89.9%          84.7%        79.6%
 Mean success percentage with interval function       91.9%          89.8%        70.3%

  Table 1: Table of results for the crisp intervals method

   As can be seen, this method provides an improvement on the results in the first two
cases, where noise deteriorates the classification with the quantitative method but the
data is still globally consistent. In these cases, the interval method gives better results
than the binomial method 82% of the time. But when the noise amplitude is much higher than
the data, as in the [−2; +2] error interval, the interval method does worse in general than
the binomial function.

4 Modeling Gaussian noise with fuzzy intervals
4.1 Fuzzy interval method presentation
Most of the time, noise on a physical measure follows a Gaussian distribution centered on
the real value. Thus it is interesting to model this specific kind of uncertainty.
Nevertheless, it is difficult to handle fuzzy intervals with an exact Gaussian shape. That
is why we suggest approximating the Gaussian with a triangular fuzzy interval. This interval
is described with a lower boundary $x^-$ and an upper boundary $x^+$: $X = [x^- ; x^+]$,
which leads to a similar description as crisp intervals. So:

   $$\mu_X(x^-) = 0 \text{ and } \mu_X(x^+) = 0 \text{ and } \mu_X\left(\frac{x^+ + x^-}{2}\right) = 1$$

   with $\mu_X(x)$ the fuzzy value of $x$ in the fuzzy set $X$. As a Gaussian of mean
$\rho$ is centered on the true measure value, the position $\frac{x^+ + x^-}{2}$ of the
maximum fuzzy value of the triangle is equal to $\rho$. To compute $x^-$ and $x^+$ we
propose to use the full width at half maximum (FWHM), which can be calculated this way:

   $$FWHM = 2\sqrt{2\ln(2)} \cdot \sigma \qquad (12)$$

   with $\sigma$ the standard deviation of the measure. Thus for a Gaussian function that
has a mean value $\rho$ and a standard deviation $\sigma$ the approximated interval $X$ is
defined by $X = [\rho - 2\sqrt{2\ln(2)} \cdot \sigma ; \rho + 2\sqrt{2\ln(2)} \cdot \sigma]$.
An example of this approximation is given in Fig. 5.

Figure 5: Example of approximation of a Gaussian fuzzy interval by a triangular fuzzy
interval

   Until now all the implementations of the LAMDA algorithm were using only crisp intervals
despite the fact that the general method was introduced. The class prototype is now a
triangular interval computed with the means of the upper and lower boundaries of the data
used to train the algorithm. Thus the membership function is still a similarity measure
between two fuzzy intervals as in (5), but it is necessary to redefine the distance function
between the intervals. A solution has been proposed to measure a distance with the center of
gravity of triangular fuzzy intervals [13]. In the present situation:

   $$\partial[A, B] = \left|\frac{a^+ + a^-}{2} - \frac{b^+ + b^-}{2}\right| \qquad (13)$$





   with $A = [a^- ; a^+]$ and $B = [b^- ; b^+]$, $A$ and $B$ being triangular fuzzy
intervals as described in this section.
   The intersection $A \cap B$ needed in (5) is calculated with an analytical solution
based on geometry and trigonometry. This avoids numerical integration, which could be less
precise and slower to compute.
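The construction of section 4.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names are hypothetical, the universe V is assumed to be [0, 1], and the integrals of (5) are approximated with a midpoint rule purely for brevity, whereas the paper relies on an analytical solution for the intersection.

```python
import math

def triangular_from_gaussian(rho, sigma):
    """Triangular fuzzy interval [x-, x+] built from the FWHM (12)."""
    fwhm = 2.0 * math.sqrt(2.0 * math.log(2.0)) * sigma
    return (rho - fwhm, rho + fwhm)

def tri_membership(x, interval):
    """Membership of x: 0 at both bounds, 1 at the midpoint of the interval."""
    lo, hi = interval
    peak = 0.5 * (lo + hi)
    half = 0.5 * (hi - lo)
    if half == 0.0:
        return 1.0 if x == peak else 0.0
    return max(0.0, 1.0 - abs(x - peak) / half)

def center_distance(a, b):
    """Distance (13) between the centers of two triangular fuzzy intervals."""
    return abs((a[0] + a[1]) / 2.0 - (b[0] + b[1]) / 2.0)

def similarity(a, b, universe=(0.0, 1.0), steps=10000):
    """Similarity (5), with the integrals approximated numerically (assumption)."""
    v_lo, v_hi = universe
    dx = (v_hi - v_lo) / steps
    inter = union = 0.0
    for i in range(steps):
        xi = v_lo + (i + 0.5) * dx
        mu_a = tri_membership(xi, a)
        mu_b = tri_membership(xi, b)
        inter += min(mu_a, mu_b) * dx   # membership of A ∩ B
        union += max(mu_a, mu_b) * dx   # membership of A ∪ B
    ratio = inter / union if union > 0.0 else 0.0
    # For a crisp universe, the size of V is simply its length.
    return 0.5 * (ratio + 1.0 - center_distance(a, b) / (v_hi - v_lo))

# Hypothetical measurement 0.5 with sigma 0.05, compared with a made-up prototype.
measure = triangular_from_gaussian(0.5, 0.05)
prototype = (0.3, 0.6)
print(measure, similarity(measure, prototype))
```

With this construction the triangle takes the value 0.5 at a distance of FWHM/2 from its center, so its width at half maximum matches that of the Gaussian it approximates.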

4.2 Experiments
As we did previously with the crisp method, a test is performed with a Gaussian noise on the
same data set (Fig. 3). The test is done in the same conditions as in the previous section.
The difference is in the construction of the noisy data $\hat{x} = x + Y$. $Y$ is now a
random variable that follows a normal distribution of standard deviation $\sigma$ and
centered on 0. Results of the simulation are given in Table 2.

 σ                                                          0.2       0.5       0.7        1
 Mean success percentage with binomial function            83.2%     79.8%     79.8%     79.6%
 Mean success percentage with crisp interval function      86.8%     82.5%     77.2%     71.3%
 Mean success percentage with fuzzy interval function      93.1%     84.5%     79.3%     74.8%

  Table 2: Table of results for the fuzzy intervals method

   Similarly to the previous test, the interval method increases the rate of good
classifications until the standard deviation $\sigma$ becomes too high and the binomial
function provides better results. This point is reached here for $\sigma = 0.7$, which
corresponds to a signal to noise ratio (SNR) of 6 dB for the signal with the smallest
amplitude. It is also important to note that in all cases the fuzzy interval method
provides better results than the crisp interval method.

4.3 Experiments on iris dataset
As a second example we use the classical iris dataset [14]. This dataset contains four
features: sepal length in cm, sepal width in cm, petal length in cm and petal width in cm.
All these features are measured for three types of flower: iris Setosa, iris Versicolour
and iris Virginica, which constitute three classes. It is easy to classify the iris dataset
without any error by using only the petal information, which is in general more relevant
than the sepal information. Thus only the sepal sizes are kept in this test to simulate the
noise. Figure 6 shows the distribution of the data in the 2D space of the sepal features.

   We assume that the data follow a normal distribution centered on a mean $\mu_{j,n}$ and
with a standard deviation $\sigma_{j,n}$. This hypothesis can be verified by using a
statistical test. The Kolmogorov-Smirnov test has been used for each class with a 5%
significance level; it shows that the hypothesis holds for the iris Setosa and the iris
Versicolour but not for the iris Virginica. Nevertheless all the data are processed as if
they follow a normal distribution.

     Figure 6: Representation of iris data by class

   The classifications are performed using the cross-validation method. The percentages of
well classified data for the two methods are:

   • using binomial function (scalar): 81.3%
   • using fuzzy triangular intervals: 94.0%

   Once again the classification rate is increased by the use of the fuzzy interval method
instead of the binomial one.

5 Conclusion
We presented in this article two methods to model uncertainty for classification
applications. An example showed that these methods can improve classification results even
when the signal to noise ratio is high. The second method, based on fuzzy intervals,
demonstrated that trying to model more precisely the probability law of the noise can
provide better results than using confidence intervals modelled by crisp intervals. However
this process to model uncertainty reveals limits when the SNR reaches a low level. An
important future work is to limit the classification error of the interval method to the
level of the numerical method.
   These methods will now be tested on data coming from a real industrial process.
   Another way to manage uncertainty on classifiers like LAMDA could be to use type-2 fuzzy
functions [15]. This is an expansion of classical fuzzy logic where the membership functions
output a fuzzy interval which can be used to model the variance of the data.
   To provide a better solution to manage uncertainty in the LAMDA classifier it can be
useful to extend the problem to the qualitative features. It is often difficult to determine
if a qualitative element is close to another; for example the color "orange" is closer to
"red" than to "blue". But on a small training dataset, considering this kind of information
can improve the final classification results. This could be done by using similarity
matrices, which are already used in some artificial intelligence problems.
   The LAMDA algorithm can work with a feature selection algorithm named MEMBAS (Membership
Margin Based Feature Selection) [16]. This algorithm uses the LAMDA class definitions and
its membership functions to provide an analytical solution for the feature selection. A
future work will be to measure the impact of the use of intervals on the MEMBAS algorithm to
perform selection on noisy data.






References
[1] J. C. Bezdek. A review of probabilistic, fuzzy, and neural models for pattern
     recognition. Journal of Intelligent and Fuzzy Systems, Vol. 1, No. 1:pp. 1–25, 1993.
[2] E. Alba, J. Garcia-Nieto, L. Jourdan, and E. Talbi. Gene selection in cancer
     classification using PSO/SVM and GA/SVM hybrid algorithms. In Evolutionary
     Computation, CEC 2007. IEEE Congress on, pages 284–290, Sept. 2007.
[3] Scott Ferson, H. Resit Akçakaya, and Amy Dunham. Using fuzzy intervals to represent
     measurement error and scientific uncertainty in endangered species classification. In
     Fuzzy Information Processing Society, 1999. NAFIPS. 18th International Conference of
     the North American, pages 690–694, Jul 1999.
[4] Zhang Weiyu, S.X. Yu, and Shang-Hua Teng. Power SVM: Generalization with exemplar
     classification uncertainty. In Computer Vision and Pattern Recognition (CVPR), 2012
     IEEE Conference on, pages 2144–2151, June 2012.
[5] Arafat Samer, Dohrmann Mary, and Skubic Marjorie. Classification of coronary artery
     disease stress ECGs using uncertainty modeling. In Computational Intelligence Methods
     and Applications, 2005 ICSC Congress, 2005.
[6] Kynan E. Graves and Romesh Nagarajah. Uncertainty estimation using fuzzy measures for
     multiclass classification. Neural Networks, IEEE Transactions on, Vol. 18:pp. 128–140,
     2007.
[7] Prabha Verma and R.D.S. Yadava. Fuzzy c-means clustering based uncertainty measure for
     sample weighting boosts pattern classification efficiency. In Computational
     Intelligence and Signal Processing (CISP), 2012 2nd National Conference on, pages
     31–35, 2012.
[8] L.A. Zadeh. Fuzzy sets. Information and Control, Vol. 8:pp. 338–353, June 1965.
[9] Carrete N.P. and Aguilar-Martin J. Controlling selectivity in nonstandard pattern
     recognition algorithms. In IEEE Transactions on Systems, Man and Cybernetics, volume
     21, pages 71–82. IEEE, Jan/Feb 1991.
[10] Hedjazi L., Aguilar-Martin J., Le Lann M.V., and Kempowsky T. Towards a unified
     principle for reasoning about heterogeneous data: a fuzzy logic framework.
     International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 20,
     No. 2:pp. 281–302, 2012.
[11] B. Kuipers. Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge.
     The MIT Press, Cambridge, Massachusetts, london edition, 1994.
[12] Lynne Billard. Some analyses of interval data. Journal of Computing and Information
     Technology, CIT 16:pp. 225–233, 2008.
[13] Hsieh C. H. and Chen S. H. Similarity of generalized fuzzy numbers with graded mean
     integration representation. In Proceedings of the Eighth International Fuzzy Systems
     Association World Congress, volume 2, pages 551–555, Taipei, Taiwan, Republic of
     China, 1999.
[14] Fisher R.A. UCI machine learning repository, 1936. http://archive.ics.uci.edu/ml.
[15] J.M. Mendel, R.I. John, and F. Liu. Interval type-2 fuzzy logic systems made simple.
     Fuzzy Systems, IEEE Trans. on, Vol. 14, No. 6:pp. 808–821, Dec. 2006.
[16] L. Hedjazi, J. Aguilar-Martin, and M.V. Le Lann. Similarity-margin based feature
     selection for symbolic interval data. Pattern Recognition Letters, Vol. 32, No. 4:pp.
     578–585, March 2012.








        Tools/Benchmarks












                  Random generator of k-diagnosable discrete event systems

                                             Yannick Pencolé (1, 2)
                    (1) CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
                            (2) Univ de Toulouse, LAAS, F-31400 Toulouse, France
                                       e-mail: yannick.pencole@laas.fr



                         Abstract                                    2. diagnosability algorithms: the fact that a fault f is k-
                                                                        diagnosable is usually the worst case for this type of
     This paper presents a random generator of dis-
                                                                        algorithms (as they all look for the existence of an am-
     crete event systems that are by construction k-
                                                                        biguous scenario to conclude the system is not diag-
     diagnosable. The aim of this generator is to pro-
                                                                        nosable).
     vide an almost infinite set of diagnosable systems
     for creating benchmarks. The goal of such bench-                  The paper is organised as follows. After formally re-
     marks is to provide a solid set of examples to test            calling the problem that motivates the generation of bench-
     and compare algorithms that solve many prob-                   marks, we describe the fundamental property which is being
     lems around diagnosable discrete event systems.                used for the effective generation of systems where a given
                                                                    fault f is k-diagnosable. Then the description of the algo-
1 Introduction                                                      rithm of the generator is provided as well as some details
                                                                    about its effective implementation.
For many years, the problem of fault diagnosis in discrete
event systems has been actively addressed by different sci-
entific communities such as DX (AI-based diagnosis) [1;
                                                                    2 Background
2], FDI (Fault Detection and Isolation), DES (Discrete              This paper addresses the random generation of benchmarks
Event Systems) [3]. Depending on the community, many                for the problem of the fault diagnosis of discrete event
different aspects of the same problem have been addressed           systems. This problem is briefly recalled in this section. We
such as the design of efficient diagnosers, the checking of di-     assume that the reader is familiar with the notations of the
agnosability properties, the effective modelling of real sys-       language theory (notion of Kleene closure, prefixes,...).
tems. When dealing with performance, most of the contri-
butions present experimental results on specific examples of        2.1 Modelling
their own usually inspired or based on real world systems.          We suppose that the system under monitoring behaves as an
The main problem about these contributions is that they are         event generator that can be modelled as an automaton.
not really comparable as they are not applied on the same           Definition 1 (System description). The model (system de-
benchmarks. Moreover, used benchmarks may not be al-                scription) SD of a discrete event system S is a finite state
ways completely defined in a paper due, most of the time, to        automaton SD = (Q, Σ, T, q0 ) where:
confidential data that cannot be published so other academic
contributors cannot use them for comparison purposes. In              • Q is a finite set of states;
order to analyse and boost the effective performance of algo-         • Σ is a finite set of events;
rithms addressing the fault diagnosis problem in DES, com-            • T ⊆ Q × Σ × Q is a finite set of transitions;
mon and fully available benchmarks become a necessity.
   This paper addresses the random generation of k-                   • q0 is the initial state of the system.
diagnosable systems. We here propose the possibility to                Σ is the set of events that the system can produce. Among
generate (and store on a web page) k-diagnosable systems            Σ we distinguish events that are observable Σo ⊆ Σ and
that have been generated without any kind of bias that would        events that are not observable. When the system operates,
come from a specific diagnosis/diagnosability method. By            its effective behaviour is represented by a trace of the au-
doing so, we propose to design a random category for                tomaton (also called a run).
benchmarks as the SAT community proposed for SAT prob-
                                                                    Definition 2 (Trace). A trace τ ∈ Σ∗ of the system is a finite
lems and to get the same advantages by comparing different
                                                                    sequence of events associated with a transition path from the
diagnosis/diagnosability approaches on the same but ran-
                                                                    initial state q0 to a state q in the model of the system.
dom systems. The choice of generating k-diagnosable sys-
tems is motivated by the fact that they can be used as exam-           The set of traces of the system is the language generated
ples for:                                                           by its model and is denoted L(S) (so the automaton SD
                                                                    generates the language L(S)). Let PΣ0 (τ ) be the classical
   1. diagnosis algorithms: given a fault f , we know by con-
                                                                    projection of a sequence τ of Σ∗ on the alphabet Σ0 recur-
      struction that the most precise algorithm will determine
                                                                    sively defined as follows:
      its occurrence with certainty within the next k observa-
      tions after the occurrence of f ;                              1. PΣ0 (ε) = ε;






  2. PΣ0 (τ.e) = PΣ0 (τ ) if e ∉ Σ0 ;                                 Definition 8 (Diagnosability). The fault f is diagnosable in
                                                                      a system S if:
  3. PΣ0 (τ.e) = PΣ0 (τ ).e if e ∈ Σ0 .
                                                                                        ∃n ∈ N+ , Diagnosable(n)
  Based on this notion of projection, we can associate with
any trace of the system its observable part.                          where Diagnosable(n) stands for:
Definition 3 (Observable trace). Let τ be a trace of the sys-                   ∀τ1 .f ∈ L(S), ∀τ2 : τ1 .f.τ2 ∈ L(S)
tem, the observable trace στ is the projection of τ over the                             |PΣo (τ2 )| ≥ n ⇒
set of observable events Σo :
                                                                          (∀τ ∈ L(S), (PΣo (τ ) = PΣo (τ1 .f.τ2 ) ⇒ f ∈ τ )).
                         στ = PΣo (τ ).                               Definition 9 (k-Diagnosability). The fault f is k-
                                                                      diagnosable, k ∈ N+ , in a system S if:
2.2 Diagnosis problem and solution
                                                                               Diagnosable(k) ∧ ¬Diagnosable(k − 1).
Now we are ready to define the classical Fault diagnosis
problem on DES.                                                          Diagnosability is a property that relies on the liveness
                                                                      of the observability of the system which means that, to be
Definition 4 (Fault). A fault is a non-observable event f ∈           (k)-diagnosable, a system must not generate unbounded se-
Σ.                                                                    quences of unobservable events (no cycle of unobservable
   A fault is represented as a special type of non-observable         events in SD). Throughout this paper, we consider that the
event that can occur on the underlying system. Once the               observability of the system is live.
event has occurred, we say that the fault is active in the sys-
tem, otherwise it is inactive. We consider here the problem           3 Random Generator
of permanent faults as initially introduced in [4].                   The aim of this section is to present the algorithm that is
Definition 5 (Diagnosis problem). A diagnosis problem is              being used to randomly generate a discrete event system
a triple (SD, OBS, FAULTS) where SD is the model of                   where a fault f is k-diagnosable and that has been imple-
a system, OBS is the sequence of observations of Σ∗o and              mented inside the Diades software. We focus on the gener-
FAULTS is the set of fault events defined over SD.                    ation of a system with one fault only (see the conclusion for
                                                                      the generation for n > 1 faults).
  Informally speaking, (SD, OBS, FAULTS) represents
the problem of finding the set of active faults from FAULTS           3.1 Signatures and fault ambiguity
that have occurred relying on the model SD and the se-                The algorithm that generates a k-diagnosable system relies
quence of observations OBS.                                           on the notion of signatures. Let f be a faulty event, the sig-
Definition 6 (Diagnosis Candidate). A diagnosis candidate             nature of f is the set of observable traces resulting from the
is a couple (q, F ) where q is a state of SD (q ∈ Q) and F is         projection of system traces that contain at least one occur-
a set of faults.                                                      rence of an event f before the last observation of the trace.
   A diagnosis candidate represents the fact that the under-          Definition 10 (Signature). The signature of an event f in
lying system is in state q and the set F of faults has occurred       a system S is the language Sig(f ) ⊆ Σ∗o such that
before reaching state q.                                                  Sig(f ) = {στ | τ = τ1 .o.τ2 ∈ L(S),
Definition 7 (Solution Diagnosis). The solution ∆ of the                           f ∈ τ1 , o ∈ Σo , τ2 ∈ Σ∗ , στ = PΣo (τ )}.
problem (SD, OBS, FAULTS) is the set of diagnosis can-
                                                                          In the following, we will also denote by Sig(¬f ) the set
didates (q, F ) such that there exists for each of them at least
                                                                      of observable traces associated with the traces of the sys-
one trace τ of SD such that:
                                                                      tem that do not contain any fault f before the last obser-
  1. the observable trace of τ is exactly the sequence                vation. Intuitively speaking, as long as the current observ-
     OBS = o1 . . . om and the last event of τ is om ;                able trace is in Sig(¬f ) ∩ Sig(f ), we know that the system
                                                                      may have produced a faulty trace or a non-faulty trace be-
  2. the set of fault events that has occurred in τ is exactly
                                                                      fore the last observation. k-diagnosability ensures that the
     F;
                                                                      ambiguity can last at most for k observations. The principle
  3. the final state of τ is q.                                       of the generator relies on the following result that formal-
                                                                      izes this intuition. Let LargestPrefixes(τ, n) = {τ′i :
   Informally, candidate (q, F ) is part of the solution if it is
                                                                      τ = τ′i τi , |τi | = i, i ∈ {0, . . . , n − 1}} be the set of the n
possible to find out in SD a behaviour of the system satisfy-
                                                                      largest prefixes of τ (τ being a prefix of itself).
ing OBS which leads to the state q after the last observation
of OBS and in which the faults F have occurred.                       Theorem 1. In the system S, the event f is k-diagnosable
                                                                      if and only if:
2.3 Diagnosability                                                       1. For any observable trace σ in Sig(¬f ) ∩ Sig(f ), there
Diagnosability is a property of the system that asserts                      exists n < k such that LargestPrefixes(σ, n) ⊆
whether a fault f of a system S can always be diagnosed                      Sig(¬f ) ∩ Sig(f ) and LargestPrefixes(σ, n +
with certainty after the observation of a finite set of obser-               1) ⊈ Sig(¬f ) ∩ Sig(f ).
vations [4]. In other words, once the fault f has occurred in            2. There exists at least one observable
S, it is sufficient to wait a certain number of observations to              trace σ in Sig(¬f ) ∩ Sig(f ) such that
ensure that any candidate (q, F ) of the solution contains f                 LargestPrefixes(σ, k − 1) ⊆ Sig(¬f ) ∩ Sig(f )
(f ∈ F ).                                                                    and an observable o such that σo ∈ Sig(f ).
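
The projection of Section 2.1, the signatures of Definition 10 and the ambiguity bound of Theorem 1 can be illustrated on a toy model. The following Python sketch is only an illustration and not the Diades implementation: the automaton, the event names and all helper functions are hypothetical, and traces are enumerated up to a bounded length only, so the value of k it returns is an estimate that is reliable only when the bound is large enough for the model at hand.

```python
def project(trace, alphabet):
    """Projection P_alphabet(trace): keep only the events of `alphabet`."""
    return tuple(e for e in trace if e in alphabet)


def bounded_traces(transitions, q0, max_len):
    """Yield every non-empty trace of length <= max_len starting in q0."""
    frontier = [((), q0)]
    for _ in range(max_len):
        frontier = [(trace + (event,), dst)
                    for trace, q in frontier
                    for src, event, dst in transitions
                    if src == q]
        for trace, _state in frontier:
            yield trace


def signatures(transitions, q0, observable, fault, max_len):
    """Bounded approximation of (Sig(f), Sig(not f)) from Definition 10."""
    sig_f, sig_not_f = set(), set()
    for trace in bounded_traces(transitions, q0, max_len):
        if trace[-1] not in observable:
            continue  # trailing unobservable events do not change the observable trace
        # The last event is observable, so any fault in the trace precedes it.
        sigma = project(trace, observable)
        (sig_f if fault in trace else sig_not_f).add(sigma)
    return sig_f, sig_not_f


def largest_prefixes(sigma, n):
    """The n largest prefixes of sigma (sigma itself included)."""
    return {sigma[:len(sigma) - i] for i in range(min(n, len(sigma) + 1))}


def ambiguity_run(sigma, ambiguous):
    """Largest n such that the n largest prefixes of sigma are all ambiguous."""
    n = 0
    while n <= len(sigma) and sigma[:len(sigma) - n] in ambiguous:
        n += 1
    return n


# Hypothetical toy model: the fault f and the nominal event u are unobservable.
TRANSITIONS = [
    (0, "f", 1), (1, "a", 2), (2, "b", 3), (3, "c", 3),  # faulty branch: a b c c ...
    (0, "u", 4), (4, "a", 5), (5, "b", 6), (6, "d", 6),  # nominal branch: a b d d ...
]
OBSERVABLE = {"a", "b", "c", "d"}

sig_f, sig_not_f = signatures(TRANSITIONS, 0, OBSERVABLE, "f", max_len=6)
ambiguous = sig_f & sig_not_f  # Sig(f) ∩ Sig(not f), restricted to the bound
k_estimate = 1 + max(ambiguity_run(s, ambiguous) for s in ambiguous)
print(sorted(ambiguous), k_estimate)  # [('a',), ('a', 'b')] 3
```

In this toy model the observable traces a and ab are ambiguous, and the third observation (c or d) removes the ambiguity, so f is 3-diagnosable, in agreement with conditions 1 and 2 of Theorem 1 for k = 3.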






    Proof: (⇒)Let τ1 .f be a trace of the system S. As S is              Algorithm 1 General algorithm for the random generation
k-diagnosable, there exists m ≤ k such that ∀τ2 : τ1 .f.τ2 ∈             of k-diagnosable systems.
L(S), |PΣo (τ2 )| ≥ m ⇒ (∀τ ∈ L(S), (PΣo (τ ) =                             Input: k ∈ N, k ≥ 1
PΣo (τ1 .f.τ2 ) ⇒ f ∈ τ )). Consider one of these                           Input: f an event
traces τ1 .f.τ2 such that τ2 contains exactly m observa-                    Input: deg maximal output degree
tions (PΣo (τ2 ) = o1 . . . om ). k-diagnosability implies                  (Σo , Σ) ← GenerateEvents()
that there exists a minimal integer n ∈ {1, . . . , m − 1}                  S ← ∅
such that PΣo (τ1 .f ).o1 . . . on+1 ⇒ f ∈ τ as soon as                     AmbSig(f ) ← GenerateAmbSignature(k, Σo , deg)
τ ∈ L(S) and PΣo (τ ) = PΣo (τ1 .f ).o1 . . . on+1 , there-
fore PΣo (τ1 .f ).o1 . . . on+1 ∈ Sig(f ) \ Sig(¬f ) and ∀i ∈              /* AmbSig(f ) = (Q, Σo , T, q0 , A) a deterministic au-
{1, . . . , n}, PΣo (τ1 .f ).o1 . . . oi ∈ Sig(f ) ∩ Sig(¬f ). So          tomaton */
LargestPrefixes(PΣo (τ1 .f ).o1 . . . on , n) ⊆ Sig(¬f ) ∩                 MF[q0 ] ← GenerateStates()
Sig(f ). So for any τ1 .f there exists n < k                               MNF[q0 ] ← GenerateStates()
such that LargestPrefixes(PΣo (τ1 .f ).o1 . . . on , n) ⊆                  for all q ∈ Q in Breadth-First Order from q0 do
Sig(¬f ) ∩ Sig(f ). Now, remark that for any observ-                          (Σfo , Σ¬fo ) ← RandomSplit(Σo , q)
able sequence σ that belongs to Sig(¬f ) ∩ Sig(f ), there                     S ← S ∪ GenFaultExts∗ (MF[q], Σfo , deg)
must exist a trace τ1 .f.τ2 of the system, with τ2 con-                       S ← S ∪ GenNomExts∗ (MNF[q], Σ¬fo , deg)
taining at least one observable event, such that σ =
                                                                              for all q −o→ q′ ∈ T do
PΣo (τ1 .f.τ2 ), so there must exist n < k such that σ ∈                        if MF[q′] = ∅ then
LargestPrefixes(PΣo (τ1 .f ).o1 . . . on , n) ⊆ Sig(¬f ) ∩                           MF[q′] ← GenerateStates()
Sig(f ) so, for any σ that belongs to Sig(¬f ) ∩ Sig(f ),                            MNF[q′] ← GenerateStates()
there is no set LargestPrefixes(σ, n + 1) that only con-                        end if
tains ambiguous signatures.                                                     S ← S ∪
    Finally, as S is k-diagnosable, we know that there exists                   GenNomExts(MNF[q], MNF[q′], o, deg)
at least one trace τ.f.τ1 .o1 , such that τ is a trace of the sys-              if q′ ∉ A then
tem that does not contain f , τ1 is a finite continuation of τ.f                     S ← S ∪ GenNomExts(MF[q], MF[q′], o, deg)
that is unobservable and o1 is observable and there is a fi-                    else
nite continuation τ2 o2 τ3 o3 . . . τk ok with PΣo (τi ) = ε such                    if q ∈ A then
that for any i ∈ {1, . . . , k − 1}, PΣo (τ.f.τ1 .o1 . . . τi oi ) ∈                    S ← S ∪ GenExts(MF[q], MF[q′], o, deg)
Sig(¬f ) ∩ Sig(f ) and PΣo (τ.f.τ1 .o1 . . . τk ok ) ∈ Sig(f ) \                     else
Sig(¬f ) which implies the condition 2 with σ =                                         S ← S ∪
PΣo (τ.f.τ1 .o1 . . . τk−1 ok−1 ).                                                      GenFaultExts(MF[q], MF[q′], o, deg)
(⇐) Suppose now that conditions 1 and 2 hold. Consider                               end if
an observable trace σ that is ambiguous (σ ∈ Sig(f ) ∩                          end if
Sig(¬f )). Condition 1 states that there exists n < k such                    end for
that LargestPrefixes(σ, n) ⊆ Sig(¬f ) ∩ Sig(f ) and                        end for
LargestPrefixes(σ, n + 1) ⊈ Sig(¬f ) ∩ Sig(f ). Con-                       Output: S where f is k-diagnosable.
sider now any largest observable trace σ′ such that |σ′| −
|σ| = m and σ ∈ LargestPrefixes(σ′, m) ⊆ Sig(¬f ) ∩
                                                                            The generation is composed of two steps. The first one
Sig(f ), it follows that LargestPrefixes(σ′, m + n) ⊆
                                                                         is the generation of the ambiguous signature with
Sig(¬f ) ∩ Sig(f ) and LargestPrefixes(σ′, m + n +
                                                                         GenerateAmbSignature. The result of this function is a de-
1) ⊈ Sig(¬f ) ∩ Sig(f ). As σ′ is one of the largest ob-
                                                                         terministic automaton AmbSig(f ) = (Q, Σo , T, q0 , A) that
servable traces holding this condition, any observable trace
                                                                         actually generates the language Sig(f )∩Sig(¬f ) (any tran-
σ′o, o ∈ Σo , is either in Sig(f ) or in Sig(¬f ) but not in
                                                                         sition path from state q0 to an accepting state of A rep-
both of them. Condition 1 states that m + n < k, so at most k ob-
                                                                         resents a sequence of Sig(f ) ∩ Sig(¬f )). The automa-
servations are required to solve the ambiguity. Con-
                                                                         ton AmbSig(f ) is generated with respect to the conditions
dition 2 states that there exists at least such an observable
                                                                         1 and 2 that are defined in Theorem 1 to ensure the k-
trace σ′ with m + n = k − 1 and an observation o so that
                                                                         diagnosability of the resulting system S. The second step of
σ′o is definitely in Sig(f ) so f can be diagnosed with
                                                                         the generation is the effective generation of S based on the
certainty in this case with exactly k observations. Hence the
                                                                         ambiguous signature AmbSig(f ). The idea is to map every
result.
                                                                         state q of AmbSig(f ) with two sets of states in S denoted
                                                                         MF[q] and MNF[q]. Given any path σ of AmbSig(f ) that
3.2 Algorithm                                                            leads to state q with, as a last observation, the event o, any
                                                                         state of MF[q] (resp. MNF[q]) will be reached by at least
The principle of the random generator is depicted in Al-                 one transition path τ of S starting from a state of MF[q0 ]
gorithm 1. Given a parameter k and a fault event f , the                 (resp. MNF[q0 ]) that ends with a transition labelled with o
algorithm randomly generates a system S where the event                  and the observable projection of τ is exactly σ. The differ-
f is k-diagnosable by construction. We also provide an-                  ence between MF and MNF is that any underlying path
other parameter deg which is the maximal number of output                of S leading to a state of MF[q] (resp. MNF[q]) has an
transitions that is allowed per state during the generation of           observable projection which is a prefix of Sig(f ) (resp. a
the system. Parameter deg is important for the creation of               prefix of Sig(¬f )). To generate S we explore AmbSig(f )
benchmarks as the output degree has a strong influence on                from its initial state in a breadth-first search manner. For a
the diagnosis/diagnosability computations.                               given state q, we have to consider three types of transition






generations going out of any state of MF[q], MNF[q]. The              rameters like the number of (un)observable events, the output
first ones are the transition paths that will lead to an observa-     degree of transitions, the parameter k, the minimal number
tion o that belongs to AmbSig(f ), the second one is the set          of observable events involved in the ambiguous signature,
of transition paths that do not lead to an observation o that         the number of states (still experimental). One particular pa-
belongs to AmbSig(f ) but lead to an observation o′ that              rameter is the seed parameter that allows the generation of
belongs to Sig(f ) only and the third one is the set of transi-       the same system (the seed ensures the same generation of
tion paths that do not lead to an observation o that belongs          random numbers). By construction, the algorithm is linear
to AmbSig(f ) but lead to an observation o″ that belongs to           in the number of states. A set of pre-computed benchmarks
Sig(¬f ) only.                                                        as well as the implemented generator are available at the
   The second and third cases are handled by randomly split-          following url:
ting Σo into two subsets (Σfo , Σ¬fo ) each of them only con-         http://homepages.laas.fr/ypencole/benchmarks
taining observable events that are not output events of q in
AmbSig(f ) (RandomSplit(Σo , q)). Then given Σfo , we                 4 Conclusions
randomly generate faulty extensions for a subset of Σfo (the          To test and compare diagnosis and/or diagnosability algo-
selection of the subset is also random and might even be              rithms, fully detailed and available benchmarks are a ne-
empty, unless q has no output events in AmbSig(f ): if q              cessity. In order to test how generic an algorithm is, we
has no output events, it must be extended to ensure that the          propose here an algorithm that randomly generates systems
observability of the system is live). An extension is a set of        where a fault f is k-diagnosable. We also propose an im-
acyclic and unobservable transition paths that lead to a tran-        plementation within the DiaDes framework. Extension to
sition labeled with an observable event from Σfo . A faulty           generate systems with n k-diagnosable faults is easy: it re-
extension ensures that an event f has occurred at least once on       quires repeating the generation of the ambiguous signatures
any generated transition path before the observable transi-           for the n faults and exploring them in parallel to generate the
tion (GenFaultExts). Given Σ¬fo , we proceed the same                 k-diagnosable system. Our short-term perspective is to im-
way to generate non-faulty extensions (GenNomExts).                   prove the generator to allow a better control of the number
As, in these two cases, the traces generated by these ex-             of generated states. A fixed number of generated states re-
tensions are not associated with observable traces involved           quires adding new constraints in the generation that prop-
in AmbSig(f ) any more, it is sufficient to generate further          agate during the generation process. Without any control
extensions on these traces and guarantee that the observ-             over this propagation, the generation may just fail as it
able language associated with these further extensions is live        could become an over-constrained problem. Our perspec-
(this procedure is denoted by the ∗ in GenNomExts∗ and                tive is to also go one step further by generating diagnosable
GenFaultExts∗ ).                                                      systems that are component-based in order to scale up the
   The last case to handle now is the case where the observ-          size of the generated system. The DiaDes framework al-
able event o is an output event of q in AmbSig(f ), which             ready has a tool to generate component-based systems [5]
means that there exists one and only one transition q −o→ q′          which ensures that any component is globally consistent, but
in AmbSig(f ). If q′ has never been visited, the sets of states       adding the constraint of diagnosability makes the generation
MF[q′] and MNF[q′] are generated first. A nominal ex-                 far more complex to implement.
tension is generated from MNF[q] to MNF[q′]. Depend-
ing on the status of q′, the extension between MF[q] and              References
MF[q′] is different. If q′ ∉ A, it means that any prefix              [1] Gianfranco Lamperti and Marina Zanella. Diagnosis of
generated by AmbSig(f ) with paths from q0 to q′ are pre-                 discrete-event systems from uncertain temporal obser-
fixes of sequences in Sig(f ) ∩ Sig(¬f ) but they are not in              vations. Artificial Intelligence, 137:91–163, 2002.
Sig(f ) ∩ Sig(¬f ), they can therefore be only in Sig(¬f ):           [2] Yannick Pencolé and Marie-Odile Cordier. A formal
extensions between MF[q] and MF[q′] are then nominal
                                                                          framework for the decentralised diagnosis of large scale
extensions. Now, if q′ ∈ A, there are two cases. If q ∉ A,
                                                                          discrete event systems and its application to telecommu-
it means the system must become faulty between the states
                                                                          nication networks. Artificial Intelligence, 164(2):121–
of MF[q] and MF[q′] so that any path of the system that reaches
                                                                          170, 2005.
any state of MF[q′] is associated with an observable trace
that belongs to Sig(f ) (GenFaultExts). If q ∈ A, any                 [3] Janan Zaytoon and Stéphane Lafortune. Overview of
path that reaches a state of MF[q] is already faulty (its ob-             fault diagnosis methods for discrete event systems. An-
servable trace is already in Sig(f )), any type of extension              nual Reviews in Control, 37:308–320, 2013.
from MF[q] to MF[q′] is therefore possible (faulty or not),           [4] Meera Sampath, Raja Sengupta, Stéphane Lafortune,
hence the use of GenExts.                                                 Kasim Sinnamohideen, and Demosthenis Teneketzis.
                                                                          Diagnosability of discrete-event systems. IEEE Transactions
3.3 Implementation                                                        on Automatic Control, 40(9):1555–1575, 1995.
Algorithm 1 is implemented with the help of the Diades                [5] Yannick Pencolé. Fault diagnosis in discrete-event sys-
library package [5]. Diades is a set of C++ libraries                     tems: How to analyse algorithm performance? In Di-
that implement discrete event systems in a component-                     agnostic reasoning: Model Analysis and Performance,
based way, different diagnosis algorithms as defined in                   pages 19–25, Montpellier, France, 2012.
the spectrum of [6] (from component-based algorithms                  [6] Anika Schumann, Yannick Pencolé, and Sylvie
to diagnoser-based algorithms). DiaDes also imple-                        Thiébaux. A spectrum of symbolic on-line diagnosis
ments a diagnosability checker as well as an accuracy                     approaches. In 17th International Workshop on Princi-
checker. The generator is provided as a Linux terminal command            ples of Diagnosis, pages 194–201, Nashville, TN USA,
dd-diagnosable-des-generate with a set of pa-                             2007.








              HyDiag: extended diagnosis and prognosis for hybrid systems

          Elodie Chanthery1,2 , Yannick Pencolé1 , Pauline Ribot1,3 , Louise Travé-Massuyès1
                 1 CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
                                   e-mail: [firstname.name]@laas.fr
                     2 Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
                      3 Univ de Toulouse, UPS, LAAS, F-31400 Toulouse, France


                         Abstract                                    • (ζ0 , q0 ) ∈ ζ × Q, is the initial condition.
     HyDiag is a software tool developed in Matlab by                Each state q ∈ Q represents a behavioural mode that is
     the DISCO team at LAAS-CNRS. It is currently                  characterized by a set of constraints Cq that model the lin-
     a software designed to simulate, diagnose and                 ear continuous dynamics (defined by their representations
     prognose hybrid systems using model-based tech-               in the state space as a set of differential and algebraic equa-
     niques. An extension to active diagnosis is also              tions). A behavioural mode can be nominal or faulty (antic-
     provided. This paper aims at presenting the na-               tive faulty situations. Transitions of the automaton
     prognosis and active diagnosis. Some results on               the hybrid automaton is given by M = (Q, Σ, T, q0 ), which
     an academic example are given.                                is called the underlying discrete event system (DES). Σ is
                                                                   the set of events that correspond to discrete control inputs,
                                                                   autonomous mode changes and fault occurrences. The oc-
1 Introduction                                                     currence of an anticipated fault is modelled by a discrete
HyDiag is a software tool developed in Matlab, with Simulink.      event fi ∈ Σf ⊆ Σuo , where Σuo ⊆ Σ is the set of unob-
The development of this software was initiated in the              servable events. Σo ⊆ Σ is the set of observable events.
DISCO team with contributions about diagnosis on hybrid            Transitions of T model the instantaneous changes of be-
systems [1]. It has undergone many changes and is cur-             havioural modes. The continuous behaviour of the hybrid
rently a software designed to simulate, diagnose and prog-         system is modelled by the so called underlying multimode
nose hybrid systems using model-based techniques [2; 3; 4].        system Ξ = (ζ, Q, C, ζ0 ). The set of directly measured vari-
An extension to active diagnosis has been also realized [5;        ables is denoted by ζOBS ⊆ ζ.
6]. This article aims at presenting the native HyDiag tool            An example of a hybrid system modeled by a hybrid au-
and its different extensions to prognosis and active diagno-       tomaton is shown in Figure 1. Each mode qi is characterized
sis.                                                               by state matrices Ai , Bi , Ci and Di .
   Section 2 recalls the hybrid formalism used by HyDiag.
Section 3 presents the native HyDiag tool that simulates           [Figure: hybrid system with input u, output y and modes
and diagnoses hybrid systems. Section 4 explains how               q1 , q2 , ..., qi linked by transitions σ12 , σ21 , σ1i ; each
HyDiag has been extended in HyDiagPro to prognose and              mode qi has dynamics xi (n+1) = Ai xi (n) + Bi u(n),
diagnose hybrid systems. Section 5 presents the extension          Yi (n) = Ci xi (n) + Di u(n)]
to active diagnosis. Experimental results of HyDiag and its
extension HyDiagPro are finally presented in Section 6.

2 Hybrid Model for Diagnosis

HyDiag deals with hybrid systems defined in a monolithic
way. Such a system must be modeled by a hybrid automaton
[7]. Formally, a hybrid automaton is defined as a tuple S =
(ζ, Q, Σ, T, C, (q0 , ζ0 )) where:
                                                                             Figure 1: Example of a hybrid system
  • ζ is a finite set of continuous variables that comprises
    input variables u(t) ∈ Rnu , state variables x(t) ∈
    Rnx , and output variables y(t) ∈ Rny .
  • Q is a finite set of discrete system states.                   3 Overview of the native H Y D IAG diagnoser
  • Σ is a finite set of events.
                                                                   The method developed in [1] for diagnosing faults on-line
  • T ⊆ Q × Σ → Q is the partial transition function
                                                                   in hybrid systems can be seen as interlinking a standard di-
    between states.
          S                                                        agnosis method for continuous systems, namely the parity
  • C = q∈Q Cq is the set of system constraints linking            space method, and a standard diagnosis method for DES,
    continuous variables.                                          namely the diagnoser method [8].
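As an illustration of the formalism of Section 2, the tuple S can be sketched as a small data structure. This is only a minimal sketch under stated assumptions, not the HyDiag implementation (which is written in Matlab): the class names `Mode` and `HybridAutomaton`, the event names `s12` and `s21`, and the numeric matrices are all hypothetical, and matrices are stored as plain lists of rows.

```python
from dataclasses import dataclass


def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def vadd(a, b):
    """Add two vectors element-wise."""
    return [x + y for x, y in zip(a, b)]


@dataclass
class Mode:
    """Discrete-time linear dynamics of one mode q_i (matrices A_i, B_i, C_i, D_i)."""
    A: list
    B: list
    C: list
    D: list

    def step(self, x, u):
        """Return (x(n+1), y(n)) for state x(n) and input u(n)."""
        y = vadd(matvec(self.C, x), matvec(self.D, u))
        x_next = vadd(matvec(self.A, x), matvec(self.B, u))
        return x_next, y


@dataclass
class HybridAutomaton:
    """Hypothetical container for S = (zeta, Q, Sigma, T, C, (q0, zeta0))."""
    modes: dict        # Q with constraints C_q: mode name -> Mode
    transitions: dict  # T: (mode name, event) -> mode name (partial function)
    q: str             # current discrete state, initialised to q0
    x: list            # current continuous state, initialised from zeta0

    def fire(self, event):
        """Apply the discrete transition T(q, event) if it is defined."""
        key = (self.q, event)
        if key not in self.transitions:
            raise ValueError(f"no transition from {self.q} on {event}")
        self.q = self.transitions[key]

    def step(self, u):
        """Advance the continuous state by one sample in the current mode."""
        self.x, y = self.modes[self.q].step(self.x, u)
        return y


# Two illustrative scalar modes: in q1 the input drives the state, in q2 it does not.
q1 = Mode(A=[[1.0]], B=[[0.5]], C=[[1.0]], D=[[0.0]])
q2 = Mode(A=[[1.0]], B=[[0.0]], C=[[1.0]], D=[[0.0]])
S = HybridAutomaton(
    modes={"q1": q1, "q2": q2},
    transitions={("q1", "s12"): "q2", ("q2", "s21"): "q1"},
    q="q1",
    x=[0.0],
)
y0 = S.step([2.0])  # output in q1, then the state is updated
S.fire("s12")       # discrete switch q1 -> q2
y1 = S.step([2.0])  # output in q2
```

Keeping the transition function as a dictionary keyed by (state, event) makes its partiality explicit: an undefined pair raises an error instead of silently staying in the same mode.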






3.1 How to use HyDiag?
Step 1: hybrid model edition
HyDiag allows the user to edit the modes of a hybrid automaton S as illustrated in Figure 1. To
model the system, the user must first provide in the Graphical User Interface of the HyDiag
software the following information: the number of modes, the number of discrete events that can
be observable or unobservable, and the sampling period used for the underlying multimode system
(defined by the set of state matrices of the state space representation of each mode).
   There are optional parameters that are helpful to initialize the mode matrices automatically
before editing them: the number of entries for the continuous dynamics, the number of outputs for
continuous dynamics, the dimensions of each matrix A. The number of entries (resp. outputs) must
be the same for all the modes.
   The simulator of the edited model has no restrictions on the number of modes or the order of
the continuous dynamics; it is generically designed. Online computations are performed using
Matlab / Simulink. Results provided by Matlab can be reused if a special need arises. Figure 2
shows an overview of the software interface.

Figure 2: HyDiag Graphical User Interface

Step 2: building the diagnoser
HyDiag automatically computes the analytical redundancy relations (ARRs) by using the parity
space approach [9]. Details of this computation can be found in [10].
   The idea of HyDiag is to capture both the continuous dynamics and the discrete dynamics within
the same mathematical object. To do so, the discrete part of the hybrid system M = (Q, Σ, T, q0)
is enriched with specific observable events that are generated from continuous information. The
resulting automaton is called the Behaviour Automaton (BA) of the hybrid system. HyDiag then
builds the diagnoser of the Behaviour Automaton (see [8]) by using the DiaDes software
(http://homepages.laas.fr/ypencole/DiaDes/), also developed within the DISCO team at LAAS-CNRS
(see an example of diagnoser in Figure 7).

Step 3: system simulation and diagnosis
Given the built hybrid diagnoser, HyDiag then loads a set of timed observations produced by the
system and it provides at each observation time an update of the diagnosis of the system by
triggering the current transition of the hybrid diagnoser that matches the current observation.
It is possible to define in HyDiag a simulation scenario for the modeled system with a duration
and a time sample defined by the user.

3.2 Software architecture with extensions
The general architecture of HyDiag and its two extensions (see the next sections for their
description) is presented in Figure 3. Ellipses represent the objects handled by the software,
rectangles with rounded edges depict HyDiag functions and rectangles with straight edges
correspond to external DiaDes packages. The behaviour automaton is at the heart of the
architecture as HyDiag and both its extensions rely on it to perform diagnosis, active diagnosis
and prognosis.

[Figure 3: block diagram in three groups. ActHyDiag: ActDiades, Specialized Active diagnosers,
AND/OR Graph, AO* Algorithm, Conditional plan, Conditional plan display. HyDiag: Enriched hybrid
model, Model display, ARRs computation, Additional Signature event, Behaviour Automaton,
Behaviour Automaton display, Diades, Diagnoser, Diagnoser display, diagnosis, Diagnosis display.
HyDiagPro: Prognoser, prognosis, Prognosis display]

Figure 3: HyDiag architecture with its extensions HyDiagPro and ActHyDiag.

4 HyDiagPro: an extension for Prognosis
HyDiag has been extended in order to provide a prognosis functionality to the software [4]. The
prognosis function computes (1) the fault probability of the system in each behavioural mode,
(2) the future fault sequence that will lead to the system failure, (3) the Remaining Useful
Life (RUL) of the system.
   In HyDiagPro, the initial hybrid model is enriched by adding for each behavioural mode a set
of aging laws: S+ = (ζ, Q, Σ, T, C, F, (q0, ζ0)) where F = {F^q, q ∈ Q} and F^q is a set of
aging laws, one for each anticipated fault f ∈ Σf in mode q. The aging modeling framework that
is adopted in HyDiagPro is based on the Weibull probabilistic model [11] (see more details in
[4]). The Weibull fault probability density function W(t, β_j^q, η_j^q, γ_j^q) gives at any time
the probability that the fault f_j occurs in the system mode q. Weibull parameters β_j^q and
η_j^q are fixed by the system mode q and characterise the degradation in mode q that leads to the
fault f_j. Parameter γ_j^q is set at runtime to memorize the overall degradation evolution of the
system accumulated in the past modes [11].
   The prognoser uses the aging laws in S+ to predict fault occurrences (see Figure 3). The
prognoser uses the current diagnosis result to update on-line these aging laws (the parameters
γ_j^q) according to the operation time in each behavioural mode. For each new result of
diagnosis, the prognosis function computes the most likely sequence of dated






faults that leads to the system failure. From this sequence is estimated the system RUL [4].

5 ActHyDiag: Active Diagnosis

The second extension of HyDiag provides an active diagnosis functionality to the software (see
Figure 3). The inputs are the same as for HyDiag but an additional file indicates the events of S
that are actions, as well as their respective cost. Based on the behaviour automaton, we compute
a set of specialised active diagnosers (one per fault): such a diagnoser is able to predict,
based on the behaviour automaton, whether a fault can be diagnosed with certainty by applying an
action plan from a given ambiguous situation [6]. From these diagnosers, we also extract a
planning domain as an AND/OR graph.
   At runtime, when HyDiag is diagnosing, the diagnosis might be ambiguous. An active diagnosis
session can be launched as soon as a specialised active diagnoser can analyse that the current
faulty situation is discriminable by applying some actions. If the active diagnosis session is
launched, an AO∗ algorithm starts and computes a conditional plan from the AND/OR graph that
optimises an action cost criterion. It is important to note that in the case of a system with
continuous dynamics, only discrete actions are contained in the active diagnosis plan issued by
ActHyDiag. In particular, it is assumed that if it is necessary to guide the system towards a
value on continuous variables, the synthesis of control laws must be performed elsewhere.

6 HyDiag/HyDiagPro Demonstration

Water tank system model

[Figure 4: tank fed by pumps P1 and P2, with water level h and level sensors h1, h2 and hmax]

Figure 4: Water tank system

   HyDiagPro has been tested on a water tank system (Figure 4) composed of one tank with two
hydraulic pumps (P1, P2). Water flows through a valve at the bottom of the tank depending on the
system control. Three sensors (h1, h2, hmax) detect the water level and are used to set the
control of the pumps (on/off). It is assumed that the pumps may fail only if they are on. The
discrete model of the water tank and the controls of the pumps are given in Figure 5. Discrete
events in Σ = {h1, h2s, h2i, hmax, f1, f2} allow the system to switch into different modes.
Observable events are Σo = {h1, h2s, h2i, hmax}. Two faults that correspond to the pump failures
are anticipated, Σf = {f1, f2}, and are not observable. The Weibull parameter values of aging
models F = {F^qi} are reported in Table 1.

      mode    Pump 1    Pump 2
       1       ON        ON
       2       ON        OFF
       3       OFF       OFF
       4       Fail      ON
       5       ON        Fail
       6       Fail      OFF
       7       OFF       Fail
       8       Fail      Fail

Figure 5: Water tank DES model

Table 1: Weibull parameters of aging models

      Aging laws        β      η      |   Aging laws        β      η
      F^q1   f1^q1     1.5    3000    |   F^q2   f1^q2      2     3000
             f2^q1     1.5    4000    |          f2^q2      1     7000
      F^q3   f1^q3      1     8000    |   F^q4   f1^q4     NaN    NaN
             f2^q3      1     7000    |          f2^q4      2     4000
      F^q5   f1^q5      2     3000    |   F^q6   f1^q6     NaN    NaN
             f2^q5     NaN    NaN     |          f2^q6      1     7000
      F^q7   f1^q7      1     8000    |   F^q8   f1^q8     NaN    NaN
             f2^q7     NaN    NaN     |          f2^q8     NaN    NaN

   The underlying continuous behaviour of every discrete mode qi for i ∈ {1..8} is represented
by the same state space:

\[
\begin{cases}
X(k+1) = A\,X(k) + B\,U(k) \\
Y(k) = C\,X(k) + D\,U(k)
\end{cases}
\tag{1}
\]

where the state variable X is the water level in the tank, continuous inputs U are the flows
delivered by the pumps P1, P2 and the flow going through the valve, $A = (1)$,
$B = \begin{pmatrix} e_1 T_e/S \\ e_2 T_e/S \\ e_3 T_e/S \end{pmatrix}$ with $T_e$ the sample
time, $S$ the tank base area and $e_i = 1$ (resp. 0) if the pump is turned on (resp. turned
off), $C = (1)$ and $D = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$.

HyDiag results
Figure 6 presents the set of results obtained by HyDiag and HyDiagPro on the following scenario.
The time horizon is fixed at Tsim = 4000 h, the sampling period is Ts = 36 s and the filter
sensitivity for the diagnosis is set as Tfilter = 3 min. The residual threshold is 10^−12. The
scenario involves a varying use of water (max flow rate = 1200 L/h) depending on user needs
during 4000 h. Pumps are automatically controlled to satisfy the specifications indicated above.
The flow rates of P1 and P2 are respectively 750 L/h and 500 L/h.
   The diagnoser computed by HyDiag is given in Figure 7. Each state of the diagnoser indicates
the belief state in the model enriched by the abstraction of the continuous part of the system,
labelled with faults that have occurred on the system. This label is empty in case of nominal
mode. In the scenario, fault f1 was injected after 3500 h and fault f2 was not injected.
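To make the aging laws of Table 1 more concrete, the following sketch evaluates a Weibull law numerically. It assumes the standard three-parameter Weibull form, with shape β, scale η and γ acting as a location shift; the exact expression used by HyDiagPro and the way γ is updated at runtime are described in [4] and are not reproduced here. The function names are hypothetical.

```python
import math


def weibull_pdf(t, beta, eta, gamma=0.0):
    """Three-parameter Weibull density, taken as zero at and before gamma."""
    if t <= gamma:
        return 0.0
    z = (t - gamma) / eta
    return (beta / eta) * z ** (beta - 1) * math.exp(-(z ** beta))


def weibull_cdf(t, beta, eta, gamma=0.0):
    """Probability that the fault has occurred by time t."""
    if t <= gamma:
        return 0.0
    return 1.0 - math.exp(-(((t - gamma) / eta) ** beta))


# Table 1, mode q1 (both pumps on): f1 has beta = 1.5 and eta = 3000,
# f2 has beta = 1.5 and eta = 4000. Time is assumed to be in hours.
p_f1 = weibull_cdf(3000.0, beta=1.5, eta=3000.0)
p_f2 = weibull_cdf(3000.0, beta=1.5, eta=4000.0)
print(f"P(f1 by 3000 h in q1) = {p_f1:.3f}")
print(f"P(f2 by 3000 h in q1) = {p_f2:.3f}")
```

With equal shape parameters, the smaller scale η of f1 makes it the more likely fault at any given time in mode q1, which is consistent with f1 being predicted before f2 at the start of the scenario.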






[Figure 6: three panels plotted against Time (h). Left: diagnoser belief state, with states
labelled from q1,{} to q8,{f1,f2}. Middle: predicted dates of fault occurrence (h), curves df1
and df2. Right: Remaining Useful Life (h). The occurrence of f1 is marked in each panel]

Figure 6: Scenario: Diagnoser belief state (left), Prognosis results of degradations df1 and df2 (middle), System RUL (right).
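The relation between a prognosis result and the RUL plotted in Figure 6 can be sketched as simple arithmetic. The sketch assumes that the system fails when the last fault of the predicted sequence occurs, so that the RUL is the date of that fault minus the current time; the function name and the list-of-pairs representation are hypothetical. The two sequences are the prognosis results reported for this scenario at t = 0 and t = 3501.

```python
def remaining_useful_life(prognosis, now):
    """RUL = date of the last predicted fault in the sequence minus the current time."""
    failure_date = max(date for _fault, date in prognosis)
    return failure_date - now


# Prognosis results as (fault, predicted date in hours) pairs.
prognosis_0 = [("f1", 4120), ("f2", 5105)]  # result at the beginning of the process
prognosis_3501 = [("f2", 5541)]             # result once f1 has been diagnosed

rul_0 = remaining_useful_life(prognosis_0, now=0)
rul_3501 = remaining_useful_life(prognosis_3501, now=3501)
print(rul_0, rul_3501)
```

The second value reproduces the 5541 − 3501 = 2040 h figure of the scenario; the first value follows only from the stated assumption.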


Figure 7: Diagnoser state tracker

   The left-hand side of Figure 6 shows the diagnoser belief state just before and after the
fault f1 occurrence. Results are consistent with the scenario: before 3500 h, the belief states
of the diagnoser are always tagged with a nominal diagnosis. After 3500 h, all the states are
tagged with f1.
   The middle of Figure 6 illustrates the predicted dates of fault occurrence (df1 and df2). At
the beginning of the process, the prognosis result is Π0 = ({f1, 4120}, {f2, 5105}). It can be
noted that the predicted dates df1 and df2 of f1 and f2 globally increase. Indeed, the system
oscillates between stressful modes and less stressful modes. To make it simple, we can consider
that in some modes, the system does not degrade, so the predicted dates of f1 and f2 are
postponed. Before 3500 h, the predicted date of f1 is lower than the one of f2. After 3500 h, the
predicted date of f2 is updated, knowing that the system is in a degraded mode. Finally, the
prognosis result is Π3501 = ({f2, 5541}). The right-hand side of Figure 6 shows the evolution of
the RUL of the system. At t = 3501, as the fault f2 is estimated to occur at t = 5541, the system
RUL at t = 3501 is 5541 − 3501 = 2040 h.

7 Conclusion
HyDiag is a software tool developed in Matlab, with Simulink, by the DISCO team, at LAAS-CNRS.
This tool has been extended into HyDiagPro to simulate, diagnose and prognose hybrid systems
using model-based techniques. Some results on an academic example are presented in the paper. An
extension to active diagnosis is also presented. The active diagnosis algorithm is currently
being tested on a concrete industrial case. HyDiag and its user manual will soon be available on
the LAAS website.

References
[1]  M. Bayoudh, L. Travé-Massuyès, and X. Olive. Hybrid systems diagnosis by coupling continuous
     and discrete event techniques. In IFAC World Congress, 2008.
[2]  P. Ribot, Y. Pencolé, and M. Combacau. Diagnosis and prognosis for the maintenance of complex
     systems. In IEEE International Conf. on Systems, Man, and Cybernetics, 2009.
[3]  E. Chanthery and P. Ribot. An integrated framework for diagnosis and prognosis of hybrid
     systems. In the 3rd Workshop on Hybrid Autonomous System (HAS), 2013.
[4]  S. Zabi, P. Ribot, and E. Chanthery. Health Monitoring and Prognosis of Hybrid Systems. In
     Annual Conference of the Prognostics and Health Management Society, 2013.
[5]  M. Bayoudh, L. Travé-Massuyès, and X. Olive. Active diagnosis of hybrid systems guided by
     diagnosability properties. In the 7th IFAC Symposium on Fault Detection, Supervision and
     Safety of Technical Processes, 2009.
[6]  E. Chanthery, Y. Pencolé, and N. Bussac. An AO*-like algorithm implementation for active
     diagnosis. In 10th International Symposium on Artificial Intelligence, Robotics and
     Automation in Space, i-SAIRAS, 2010.
[7]  T. Henzinger. The theory of hybrid automata. In Proceedings of the 11th Annual IEEE
     Symposium on Logic in Computer Science, pages 278–292, 1996.
[8]  M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability
     of discrete-event systems. IEEE Trans. on Automatic Control, 40:1555–1575, 1995.
[9]  M. Staroswiecki and G. Comtet-Varga. Analytical redundancy relations for fault detection and
     isolation in algebraic dynamic systems. Automatica, 37(5):687–699, 2001.
[10] M. Maiga, E. Chanthery, and L. Travé-Massuyès. Hybrid system diagnosis: Test of the
     diagnoser hydiag on a benchmark of the international diagnostic competition DXC 2011. In 8th
     IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, 2012.
[11] P. Ribot and E. Bensana. A generic adaptative prognostic function for heterogeneous
     multi-component systems: application to helicopters. In European Safety & Reliability
     Conference, Troyes, France, September 18-22 2011.




Author index

A
Abdelkrim Mohamed Naceur          261
Abreu Rui                         193, 209
Agudelo Carlos                    241
Alonso-Gonzalez Carlos            59
Archer Dave                       193
Aubrun Christophe                 261

B
Berdjag Denis                     159
Biswas Gautam                     27, 75, 185
Blesa Joaquim                     67
Bobrow Daniel                     193
Boussaid Boumedyen                261
Bregon Anibal                     59, 201
Bunte Andreas                     185
Burke David                       193

C
Cabassud Michel                   253
Carrera Rolando                   145
Cayetano Raúl                     145
Chanthery Elodie                  281
Chouiref Houda                    261
Christopher Cody                  119
Corruble Vincent                  35
Cossé Ronan                       159

D
Dague Philippe                    51
Dahhou Boutaïeb                   253
Daigle Matthew                    201
De Kleer Johan                    193, 225
Duvivier David                    159

E
Eickmeyer Jens                    43, 217
El Fallah Seghrouchni Amal        35
Eldardiry Hoda                    193, 209

F
Feldman Alexander                 127, 193
Feng Wenquan                      27
Filasova Anna                     235
Fourlas George                    177

G
Ganguli Anurag                    225
Gaurel Christian                  159
Givehchi Omid                     43
Gomez Pablo                       11
Grastien Alban                    105, 119
Grigoleit Florian                 91
Grill Tanja                       11
Guan Xiumei                       27

H
Hanley John                       193
He Zhangming                      137
Herpson Cédric                    35
Honda Tomonori                    193, 209, 225

I
Ibrahim Hassan                    51
Iverson Jonathan                  209

J
Jannach Dietmar                   3
Jauberthie Carine                 67, 83
Jimenez Fernando                  241
Jung Daniel                       75

K
Kalech Meir                       113, 247
Karras George                     177
Khorasgani Hamed                  75, 185
Kinnebrew John                    185
Koitz Roxane                      167
Krokavec Dusan                    235
Kyriakopoulos Kostas              177

L
Le Gall Françoise                 67
Le Lann Marie-Véronique           19, 269
Li Peng                           43
Li Shuxing                        137
Li Ze-Tao                         253
Liao Linxia                       209
Liscinsky Pavol                   235

M
Maier Alexander                   217
Matei Ion                         225
Mishali Amir                      247
Monrousseau Thomas                269
Mühlbacher Clemens                153

N
Niggemann Oliver                  43, 185, 217

O
Ocampo-Martinez Carlos            99

P
Pavel Radu                        209
Pencolé Yannick                   83, 277, 281
Perez Alexandre                   193
Pethig Florian                    43
Piechowiak Sylvain                159
Pons Renaud                       83
Provan Gregory                    127
Puig Vicenç                       99, 261
Pulido Belarmino                  59

R
Ribot Pauline                     83, 281
Roux Elisa                        19
Roychoudhury Indranil             201

S
Saha Bhaskar                      209
Santos Simón Jorge                153
Schmitz Thomas                    3
Serbak Vladimir                   235
Shchekotykhin Kostyantyn          3
Shinitzky Hilla                   113
Simon Laurent                     51
Steinbauer Gerald                 153
Stern Roni                        113, 247
Struss Peter                      91
Subias Audine                     241

T
Travé-Massuyès Louise             19, 67, 83, 241, 269, 281
Traxler Patrick                   11

V
Vasquez John William              241
Verde Cristina                    145
Volgmann Sören                    185

W
Wang Jiongqi                      137
Wotawa Franz                      167

Z
Zhang Mei                         253
Zhao Hongbo                       27
Zhou Gan                          27
Zhou Haiyin                       137