=Paper=
{{Paper
|id=Vol-1507/DX_2015_proceedings
|storemode=property
|title=None
|pdfUrl=https://ceur-ws.org/Vol-1507/dx15_proceedings.pdf
|volume=Vol-1507
}}
==Proceedings of the 26th International Workshop on Principles of Diagnosis (DX-15)==
26th International Workshop
on Principles of Diagnosis
2015
Paris, France
August 31 - September 3, 2015
Editors: Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague
Laboratoire de Recherche en Informatique
Proceedings of the 26th International Workshop
on Principles of Diagnosis (DX-15)
August 31-September 3, 2015
Paris, France
Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague, Editors
Foreword
The International Workshop on Principles of Diagnosis is an annual event that started in 1989,
originating in the Artificial Intelligence community. Its focus is on theories, principles and compu-
tational techniques for diagnosis, monitoring, testing, reconfiguration and repair of complex systems
and applications of these techniques to real world problems.
This year, DX-15 received 41 submissions (39 full papers and 2 tool papers) from 15 countries
on 5 continents. Each paper was thoroughly peer reviewed by three reviewers. We accepted 17
regular papers (selection rate 43.6%), 18 posters and 2 benchmark/tool papers. We wish to thank all
the authors of submitted papers, the program committee members for the time and effort they spent,
and the invited speakers for their participation.
As the DX-15 workshop is co-located with the IFAC International Symposium SAFEPROCESS
2015, its organization would not have been possible without the full support of the SAFEPROCESS
organization team and especially Vincent Cocquempot who did a tremendous coordination job be-
tween the two events. Also special thanks to our local contact Nazih Mechbal at École Nationale
Supérieure d’Arts et Métiers (ENSAM), ParisTech, where DX-15 and SAFEPROCESS take place.
Thanks also to the local organization team at LAAS-CNRS and at the CNRS administrative depart-
ment of Toulouse (DR14) for their full technical and administrative support.
We also wish to thank our sponsors: Centre National de la Recherche Scientifique (CNRS), Université
de Toulouse, Laboratoire de Recherche en Informatique, Université Paris-Sud, École Nationale
Supérieure d’Arts et Métiers (ENSAM), Institut National des Sciences Appliquées de Toulouse (INSA-
Toulouse), Université Pierre et Marie Curie (UPMC), and ACTIA.
Yannick Pencolé, Louise Travé-Massuyès, Philippe Dague.
August 2015
Word cloud generated from the titles of the DX-15 papers by http://www.wordle.net.
Workshop Organization
Program Co-Chairs
Yannick Pencolé LAAS-CNRS, Univ. Fédérale Toulouse, France
Louise Travé-Massuyès LAAS-CNRS, Univ. Fédérale Toulouse, France
Philippe Dague LRI, Université Paris-Sud, France
International Program Committee
Rui Abreu PARC, USA
Jose Aguilar Universidad de los Andes, Venezuela
Carlos Alonso Universidad de Valladolid, Spain
Gautam Biswas Vanderbilt University, USA
Anibal Bregon Universidad de Valladolid, Spain
Luca Console Università di Torino, Italy
Matthew Daigle NASA Ames Research Center, USA
Johan de Kleer PARC, USA
Michael Hofbaur Joanneum Research, Austria
Alexander Feldman PARC, USA
Gerhard Friedrich Klagenfurt University, Austria
Alban Grastien NICTA, Australia
Claudia Isaza University of Antioquia, Medellín, Colombia
Meir Kalech Ben-Gurion University of the Negev, Israel
Mattias Krysander Linköping University, Sweden
Anastassia Küstenmacher Bonn-Rhein-Sieg University of Applied Sciences, Germany
Ingo Pill TU Graz, Austria
Gregory Provan University College Cork, Ireland
Xavier Pucel ONERA CERT, France
Martin Sachenbacher LION Smart GmbH, Germany
Ramon Sarrate Universitat Politècnica de Catalunya, Spain
Neal Snooke Aberystwyth University, UK
Gerald Steinbauer TU Graz, Austria
Peter Struss TU München, Germany
Anna Sztyber Institute of Automatic Control and Robotics, Warsaw University of Technology, Poland
Gianluca Torta Università di Torino, Italy
Franz Wotawa TU Graz, Austria
Marina Zanella Università degli Studi di Brescia, Italy
Additional Reviewers
Moussa Maiga LAAS-CNRS, Univ. Fédérale Toulouse, France
Nathalie Barbosa Roa LAAS-CNRS, Univ. Fédérale Toulouse, France
Euriell Le Corronc LAAS-CNRS, Université Paul Sabatier, Univ. Fédérale Toulouse, France
Élodie Chanthery LAAS-CNRS, INSA Toulouse, Univ. Fédérale Toulouse, France
Indranil Roychoudhury SGT Inc, NASA Ames Research Center, USA
Workshop Organizing Committee
Yannick Pencolé LAAS-CNRS, Univ. Fédérale Toulouse, France
Louise Travé-Massuyès LAAS-CNRS, Univ. Fédérale Toulouse, France
Vincent Cocquempot CRISTAL, Université Lille 1, France
Audine Subias LAAS-CNRS, INSA Toulouse, Univ. Fédérale Toulouse, France
Nazih Mechbal ENSAM Paris, France
Technical and Administrative Support
Christèle Mouclier LAAS-CNRS, Toulouse, France
Régine Duran LAAS-CNRS, Toulouse, France
Dominique Daurat LAAS-CNRS, Toulouse, France
Fabienne Baduel LAAS-CNRS, Toulouse, France
Bruno Birac LAAS-CNRS, Toulouse, France
Stéphanie Saluden Délégation Régionale CNRS, Toulouse, France
Régine Barthes Délégation Régionale CNRS, Toulouse, France
Table of contents
Regular papers
A Divide-And-Conquer-Method for Computing Multiple Conflicts for Diagnosis
by Shchekotykhin Kostyantyn, Jannach Dietmar, Schmitz Thomas 3
A Robust Alternative to Correlation Networks for Identifying Faulty Systems
by Traxler Patrick, Grill Tanja, Gomez Pablo 11
Applied multi-layer clustering to the diagnosis of complex agro-systems
by Roux Elisa, Travé-Massuyès Louise, Le Lann Marie-Véronique 19
A Bayesian Framework for Fault diagnosis of Hybrid Linear Systems
by Zhou Gan, Biswas Gautam, Feng Wenquan, Zhao Hongbo, Guan Xiumei 27
ADS2: Anytime Distributed Supervision of Distributed Systems that Face Unreliable or Costly Communication
by Herpson Cédric, El Fallah Seghrouchni Amal, Corruble Vincent 35
Data Driven Modeling for System-Level Condition Monitoring on Wind Power Plants
by Eickmeyer Jens, Li Peng, Givehchi Omid, Pethig Florian, Niggemann Oliver 43
Using Incremental SAT for Testing Diagnosability of Distributed DES
by Ibrahim Hassan, Dague Philippe, Simon Laurent 51
Improving Fault Isolation and Identification for Hybrid Systems with Hybrid Possible Conflicts
by Bregon Anibal, Alonso-Gonzalez Carlos, Pulido Belarmino 59
State estimation and fault detection using box particle filtering with stochastic measurements
by Blesa Joaquim, Le Gall Françoise, Jauberthie Carine, Travé-Massuyès Louise 67
Minimal Structurally Overdetermined Sets Selection for Distributed Fault Detection
by Khorasgani Hamed, Biswas Gautam, Jung Daniel 75
Condition-based Monitoring and Prognosis in an Error-Bounded Framework
by Travé-Massuyès Louise, Pons Renaud, Ribot Pauline, Pencolé Yannick, Jauberthie Carine 83
Configuration as Diagnosis: Generating Configurations with Conflict-Directed A* - An Application to Training Plan Generation -
by Grigoleit Florian, Struss Peter 91
Decentralised fault diagnosis of large-scale systems: Application to water transport networks
by Puig Vicenç, Ocampo-Martinez Carlos 99
Self-Healing as a Combination of Consistency Checks and Conformant Planning Problems
by Grastien Alban 105
Implementing Troubleshooting with Batch Repair
by Stern Roni, Kalech Meir, Shinitzky Hilla 113
Formulating Event-Based Critical Observations in Diagnostic Problems
by Christopher Cody, Grastien Alban 119
A Framework For Assessing Diagnostics Model Fidelity
by Provan Gregory, Feldman Alexander 127
Posters
A General Process Model: Application to Unanticipated Fault Diagnosis
by Wang Jiongqi, He Zhangming, Zhou Haiyin, Li Shuxing 137
A SCADA Expansion for Leak Detection in a Pipeline
by Carrera Rolando, Verde Cristina, Cayetano Raúl 145
Automatic Model Generation to Diagnose Autonomous Systems
by Santos Simón Jorge, Mühlbacher Clemens, Steinbauer Gerald 153
Methodology and Application of Meta-Diagnosis on Avionics Test Benches
by Cossé Ronan, Berdjag Denis, Piechowiak Sylvain, Duvivier David, Gaurel Christian 159
SAT-Based Abductive Diagnosis
by Koitz Roxane, Wotawa Franz 167
Fault Tolerant Control for a 4-Wheel Skid Steering Mobile Robot
by Fourlas George, Karras George, Kyriakopoulos Kostas 177
Data-Driven Monitoring of Cyber-Physical Systems Leveraging on Big Data and the Internet-of-Things for Diagnosis and Control
by Niggemann Oliver, Biswas Gautam, Kinnebrew John, Khorasgani Hamed, Volgmann Sören, Bunte Andreas 185
Diagnosing Advanced Persistent Threats: A Position Paper
by Abreu Rui, Bobrow Daniel, Eldardiry Hoda, Feldman Alexander, Hanley John, Honda Tomonori, De Kleer Johan, Perez Alexandre, Archer Dave, Burke David 193
A Structural Model Decomposition Framework for Hybrid Systems Diagnosis
by Daigle Matthew, Bregon Anibal, Roychoudhury Indranil 201
Device Health Estimation by Combining Contextual Control Information with Sensor Data
by Honda Tomonori, Liao Linxia, Eldardiry Hoda, Saha Bhaskar, Abreu Rui, Pavel Radu, Iverson Jonathan 209
On the Learning of Timing Behavior for Anomaly Detection in Cyber-Physical Production Systems
by Maier Alexander, Niggemann Oliver, Eickmeyer Jens 217
The Case for a Hybrid Approach to Diagnosis: A Railway Switch
by Matei Ion, Ganguli Anurag, Honda Tomonori, De Kleer Johan 225
Design of PD observer-based fault estimator using a descriptor approach
by Krokavec Dusan, Filasova Anna, Liscinsky Pavol, Serbak Vladimir 235
Chronicle based alarm management in startup and shutdown stages
by Vasquez John William, Travé-Massuyès Louise, Subias Audine, Jimenez Fernando, Agudelo Carlos 241
Data-Augmented Software Diagnosis
by Mishali Amir, Stern Roni, Kalech Meir 247
Faults isolation and identification of Heat-exchanger/ Reactor with parameter uncertainties
by Zhang Mei, Dahhou Boutaïeb, Cabassud Michel, Li Ze-Tao 253
LPV subspace identification for robust fault detection using a set-membership approach: Application to the wind turbine benchmark
by Chouiref Houda, Boussaid Boumedyen, Abdelkrim Mohamed Naceur, Puig Vicenç, Aubrun Christophe 261
Processing measure uncertainty into fuzzy classifier
by Monrousseau Thomas, Travé-Massuyès Louise, Le Lann Marie-Véronique 269
Tools/Benchmarks
Random generator of k-diagnosable discrete event systems
by Pencolé Yannick 277
HyDiag: extended diagnosis and prognosis for hybrid systems
by Chanthery Elodie, Pencolé Yannick, Ribot Pauline, Travé-Massuyès Louise 281
Proceedings of the 26th International Workshop on Principles of Diagnosis
Regular papers
Proceedings of the 26th International Workshop on Principles of Diagnosis
A Divide-And-Conquer-Method for Computing Multiple Conflicts for Diagnosis
Kostyantyn Shchekotykhin¹ and Dietmar Jannach² and Thomas Schmitz²
¹ Alpen-Adria University Klagenfurt, Austria, e-mail: kostyantyn.shchekotykhin@aau.at
² TU Dortmund, Germany, e-mail: {firstname.lastname}@tu-dortmund.de
Abstract

In classical hitting set algorithms for Model-Based Diagnosis (MBD) that use on-demand conflict generation, a single conflict is computed whenever needed during tree construction. Since such a strategy leads to a full “restart” of the conflict-generation algorithm on each call, we propose a divide-and-conquer algorithm called MERGEXPLAIN which efficiently searches for multiple conflicts during a single call. The design of the algorithm aims at scenarios in which the goal is to find a few leading diagnoses and the algorithm can – due to its non-intrusive design – be used in combination with various underlying reasoners (theorem provers). An empirical evaluation on different sets of benchmark problems shows that our proposed algorithm can lead to significant reductions of the required diagnosis times when compared to a “one-conflict-at-a-time” strategy.

1 Introduction

In Model-Based Diagnosis (MBD), the concept of conflicts describes parts of a system which – given a set of observations – cannot all work correctly. Besides MBD, the calculation of minimal conflicts is a central task in a number of other AI approaches [1]. Reiter [2] showed that the minimal hitting sets of conflicts correspond to diagnoses, where a diagnosis is a possible explanation why a system’s observed behavior differs from its expected behavior. He used this property for the computation of diagnoses in the breadth-first hitting set tree (HS-tree) diagnosis algorithm.

Over time, the principle of this MBD approach was used for a number of different diagnosis problems such as electronic circuits, hardware descriptions in VHDL, program specifications, ontologies, and knowledge-based systems [3; 4; 5; 6; 7]. A reason for the broad utilization of hitting set approaches is that its principle does not depend on the underlying knowledge representation and reasoning technique, because only a general Theorem Prover (TP) – a component that returns conflicts – is needed.

The implementation of a TP can be done in different ways. First, the conflict detection can be implemented as a reasoning task, e.g., by modifying a consistency checking algorithm [8; 9]. Second, “non-intrusive” conflict detection techniques can be used with a variety of reasoning approaches, since they require only a very limited reasoning functionality like consistency or entailment checking without knowing the internals of the reasoning algorithm. Such methods can benefit from the newest improvements in reasoning algorithms, such as incremental solving, heuristics, learning strategies, etc., without any modifications.

A non-intrusive conflict detection algorithm which has shown to be very efficient in different application scenarios is Junker’s QUICKXPLAIN [10] (QXP for short), which was designed to find a single minimal conflict based on a divide-and-conquer strategy. The algorithm was originally developed in the context of constraint problems, but since its method is independent of the underlying reasoner, it was used in several of the hardware and software diagnosis approaches mentioned above.

In many classical hitting set based approaches, conflicts are computed individually with QXP during HS-tree construction when they are required, as in many domains not all conflicts are known in advance [11]. This, however, has the effect that QXP has to be “restarted” with a slightly different configuration whenever a new conflict is needed.

In this paper, we propose MERGEXPLAIN (MXP for short), a divide-and-conquer algorithm which searches for multiple conflicts during a single decomposition run. Our method is built upon QXP and is therefore also non-intrusive. The basic idea behind MXP is that (a) the early identification of multiple conflicts can speed up the overall diagnosis process, e.g., due to better conflict “reuses” [2], and that (b) we can identify additional conflicts faster when we decompose the original components into smaller subsets with the divide-and-conquer strategy of MXP.

The paper is organized as follows. After a problem characterization in Section 2, we present the details of MXP in Section 3 and discuss the properties of the algorithm. Section 4 presents the results of an extensive empirical evaluation using various diagnosis benchmark problems. Previous work is finally discussed in Section 5.

2 Preliminaries

2.1 The Diagnosis Problem

We use the definitions of [2] to characterize a system, diagnoses, and conflicts.

Definition 1 (System). A system is a pair (SD, COMPS) where SD is a system description (a set of logical sentences) and COMPS represents the system’s components (a finite set of constants).
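To make these definitions concrete, the following toy encoding sketches a two-inverter system in the spirit of Definition 1 and the AB predicate used in this section. All names (the component identifiers, the observed values, and the brute-force isConsistent helper) are our own illustrative assumptions, not part of the paper or its implementation:

```python
# Toy system per Definition 1: two inverters in series (names are illustrative).
COMPS = ["inv1", "inv2"]

# SD: a component that is not abnormal (¬AB) behaves like an inverter.
def sd_holds(ab, vals):
    ok = True
    if not ab["inv1"]:
        ok = ok and vals["out1"] == (not vals["in"])
    if not ab["inv2"]:
        ok = ok and vals["out2"] == (not vals["out1"])
    return ok

OBS = {"in": True, "out2": False}  # assumed observations

def is_consistent(ab):
    # Brute-force check: does some value of the unobserved internal signal
    # satisfy SD together with OBS and the given AB assumptions?
    return any(sd_holds(ab, {**OBS, "out1": out1}) for out1 in (False, True))

# Assuming both components healthy contradicts the observations ...
print(is_consistent({"inv1": False, "inv2": False}))  # False
# ... while assuming inv1 abnormal restores consistency, so {inv1} is a diagnosis.
print(is_consistent({"inv1": True, "inv2": False}))   # True
```

Here the inconsistency of SD ∪ OBS under the all-healthy assumption is exactly the situation that triggers a diagnosis problem in the sense discussed next.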
A diagnosis problem arises when a set of logical sentences OBS, called observations, is inconsistent with the normal behavior of the system (SD, COMPS). The correct behavior is represented in SD with an “abnormal” predicate AB/1. That is, for any component ci ∈ COMPS the literal ¬AB(ci) represents the assumption that the component ci behaves correctly.

Definition 2 (Diagnosis). Given a diagnosis problem (SD, COMPS, OBS), a diagnosis is a minimal set ∆ ⊆ COMPS such that SD ∪ OBS ∪ {AB(c) | c ∈ ∆} ∪ {¬AB(c) | c ∈ COMPS \ ∆} is consistent.

A diagnosis therefore corresponds to a minimal subset of the system components which, if assumed to be faulty (and thus behave abnormally) explain the system’s behavior, i.e., are consistent with the observations.

Algorithm 1: QUICKXPLAIN(B, C)
  Input: B: background theory, C: the set of possibly faulty constraints
  Output: A minimal conflict CS ⊆ C
  1 if isConsistent(B ∪ C) then return ‘no conflict’;
  2 else if C = ∅ then return ∅;
  3 return GETCONFLICT(B, B, C)

  function GETCONFLICT(B, D, C)
  4 if D ≠ ∅ ∧ ¬isConsistent(B) then return ∅;
  5 if |C| = 1 then return C;
  6 Split C into disjoint, non-empty sets C1 and C2
  7 D2 ← GETCONFLICT(B ∪ C1, C1, C2)
  8 D1 ← GETCONFLICT(B ∪ D2, D2, C1)
  9 return D1 ∪ D2
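Algorithm 1 can be transcribed almost line by line into executable form. The sketch below is our own (the toy constraints, the two-variable domain, and the brute-force isConsistent helper are assumptions for illustration); constraints are represented as Python predicates and the split of line 6 halves the list:

```python
from itertools import product

VARS, DOMAIN = ("x", "y"), (0, 1)

def is_consistent(constraints):
    # True iff some assignment over DOMAIN satisfies every constraint (brute force).
    return any(all(c(dict(zip(VARS, vals))) for c in constraints)
               for vals in product(DOMAIN, repeat=len(VARS)))

def quickxplain(B, C):                      # lines 1-3 of Algorithm 1
    if is_consistent(B + C):
        return "no conflict"
    if not C:
        return []
    return get_conflict(B, B, C)

def get_conflict(B, D, C):                  # lines 4-9 of Algorithm 1
    if D and not is_consistent(B):
        return []
    if len(C) == 1:
        return C
    k = len(C) // 2                         # split C into C1 and C2
    C1, C2 = C[:k], C[k:]
    D2 = get_conflict(B + C1, C1, C2)
    D1 = get_conflict(B + D2, D2, C1)
    return D1 + D2

def c1(v): return v["x"] == 0
def c2(v): return v["x"] == 1               # contradicts c1
def c3(v): return v["y"] >= 0               # always satisfiable

print([f.__name__ for f in quickxplain([], [c1, c2, c3])])  # ['c1', 'c2']
```

Sets are modeled as lists so the split is deterministic; the minimality of the returned conflict is what Theorem 1 below guarantees for the abstract algorithm.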
Two general classes of MBD algorithms exist. One relies on direct problem encodings and the aim is often to find one diagnosis quickly, see [12; 13; 14]. The other class relies on the computation of conflicts and their hitting sets (see next section). Such diagnosis algorithms are often used when the goal is to find multiple or all minimal diagnoses. In the context of our work, techniques of the second class can immediately profit when the conflict generation process is done more efficiently.

2.2 Diagnoses as Hitting Sets

Finding all minimal diagnoses corresponds to finding all minimal hitting sets (HS) of all existing conflicts [2].

Definition 3 (Conflict). A conflict CS for (SD, COMPS, OBS) is a set {c1, . . . , ck} ⊆ COMPS such that SD ∪ OBS ∪ {¬AB(ci) | ci ∈ CS} is inconsistent.

Assuming that all components of a conflict work correctly therefore contradicts the observations. A conflict CS is minimal, if no proper subset of CS is also a conflict.

To find the set of all minimal diagnoses for a given problem, [2] proposed a breadth-first HS-tree algorithm with tree pruning and conflict reuse. A correction to this algorithm was proposed by Greiner et al. which uses a directed acyclic graph (DAG) instead of the tree to correctly deal with non-minimal conflicts [15]. Our work, however, does not depend on this correction as QXP as well as our proposed MXP method always return minimal conflicts. Apart from this, a number of algorithmic variations were suggested in the literature which, for example, use problem-specific heuristics [16], a greedy search algorithm, or apply parallelization techniques [17], see also [18] for an overview.

2.3 QUICKXPLAIN (QXP)

QXP was developed in the context of inconsistent constraint satisfaction problems (CSPs) and the computation of explanations. E.g., in case of an overconstrained CSP, the problem consists in determining a minimal set of constraints which causes the CSP to become unsolvable for the given inputs. A simplified version of QXP [10] is shown in Algorithm 1. The rough idea of QXP is to apply a recursive procedure which relaxes the input set of faulty constraints C by partitioning it into two sets C1 and C2 (line 6). If C1 is a conflict the algorithm continues partitioning C1 in the next recursive call. Otherwise, i.e., if the last partitioning has split all conflicts in C, the algorithm extracts a conflict from the sets C1 and C2. This way, QXP finally identifies single constraints which are inconsistent with the remaining consistent set of constraints and the background theory.

Theorem 1 ([10]). Let B be a background theory, i.e., a set of constraints considered as correct, and C be a set of possibly faulty constraints. Then, QUICKXPLAIN always terminates. If B ∪ C is consistent it returns ‘no conflict’. Otherwise, it returns a minimal conflict CS.

2.4 Using QXP During HS-Tree Construction

Assume that MBD is applied to find an error in the definition of a CSP. The CSP comprises the set of possibly faulty constraints C. These are the elements of COMPS. The system description SD corresponds to the semantics of the constraints in C. Finally, the observations OBS are encoded as unary constraints and are added to the background theory B. During the HS-tree construction, QXP is called whenever a new node is created and no conflict reuse is possible. As a result, QXP can either return one minimal conflict, which can be used to label the new node, or return ‘no conflict’, which would mean that a diagnosis is found at the tree node. Note that QXP can be used with other algorithms, e.g., preference-based search [19] or boolean search [20], in the same way as with the HS-tree algorithm.

3 MERGEXPLAIN (MXP): Algorithm Details

3.1 General Considerations

The pseudo-code of MXP, which unlike QXP can return multiple conflicts at a time, is given in Algorithm 2. MXP, like QXP, is generally applicable to a variety of problem domains. The mapping to the terminology used in MBD (SD, COMPS, OBS) is straightforward as discussed in the previous section. In the following, we will use the notation and symbols from [10], e.g., C or B, and constraints as a knowledge representation formalism.

Note that there are applications of MBD in which the function isConsistent has to be “overwritten” to take the specifics of the underlying knowledge representation and reasoning system into account. The ontology debugging approach presented in [7] for example extends isConsistent with the verification of entailments of a logical theory. MXP can be used in such scenarios after the corresponding adaptation of the implementation of isConsistent.

Furthermore, MXP can be easily extended for cases in which the MBD approach has to support the specification of (multiple) test cases, i.e., sets of formulas that must be
consistent or inconsistent with the system description, e.g., [21; 22].

3.2 Algorithm Rationale

MXP (Algorithm 2) accepts two sets of constraints as inputs, B as the assumed-to-be-correct set of background constraints and C, the possibly faulty components/constraints. In case C ∪ B is inconsistent, MXP returns a set of minimal conflicts Γ by calling the recursive function FINDCONFLICTS in line 3. This function again accepts B and C as an input and returns a tuple ⟨C′, Γ⟩, where Γ is a set of minimal conflicts and C′ ⊂ C is a set of constraints that does not contain any conflicts, i.e., B ∪ C′ is consistent.

Algorithm 2: MERGEXPLAIN(B, C)
  Input: B: background theory, C: the set of possibly faulty constraints
  Output: Γ, a set of minimal conflicts
  1 if ¬isConsistent(B) then return ‘no solution’;
  2 if isConsistent(B ∪ C) then return ∅;
  3 ⟨_, Γ⟩ ← FINDCONFLICTS(B, C)
  4 return Γ;

  function FINDCONFLICTS(B, C) returns tuple ⟨C′, Γ⟩
  5 if isConsistent(B ∪ C) then return ⟨C, ∅⟩;
  6 if |C| = 1 then return ⟨∅, {C}⟩;
  7 Split C into disjoint, non-empty sets C1 and C2
  8 ⟨C1′, Γ1⟩ ← FINDCONFLICTS(B, C1)
  9 ⟨C2′, Γ2⟩ ← FINDCONFLICTS(B, C2)
  10 Γ ← Γ1 ∪ Γ2;
  11 while ¬isConsistent(C1′ ∪ C2′ ∪ B) do
  12   X ← GETCONFLICT(B ∪ C2′, C2′, C1′)
  13   CS ← X ∪ GETCONFLICT(B ∪ X, X, C2′)
  14   C1′ ← C1′ \ {α} where α ∈ X
  15   Γ ← Γ ∪ {CS}
  16 return ⟨C1′ ∪ C2′, Γ⟩

The logic of FINDCONFLICTS is similar to QXP in that we decompose the problem into two parts in each recursive call (lines 7–9). Differently from QXP, however, we look for conflicts in both splits C1 and C2 independently and then combine the conflicts that are eventually found in the two halves (line 10)¹. If there is, e.g., a conflict in the first part and one in the second, FINDCONFLICTS will find them independently from each other. Of course, there might also be conflicts in C whose elements are spread across both C1 and C2, that is, the set C1′ ∪ C2′ ∪ B is inconsistent. This situation is addressed in lines 11–15. The computation of a minimal conflict is done by two calls to GETCONFLICT (Algorithm 1). In the first call this function returns a minimal set X ⊆ C1′ such that X ∪ C2′ ∪ B is a conflict (line 12). In line 13, we then look for a subset of C2′, say Y, such that Y ∪ X corresponds to a minimal conflict CS. The latter is added to Γ (line 15). In order to restore the consistency of C1′ ∪ C2′ ∪ B we have to remove at least one element α ∈ CS from either C1′ or C2′. Therefore, in line 14 the algorithm removes α ∈ X ⊆ CS from C1′.

¹ The calls in line 8 and 9 can in fact be executed in parallel.

Note that MXP allows us to use different split functions in line 7. In our default implementation we use a function that splits the set of constraints C into two equal parts, i.e., split(n) = n/2, where |C| = n. In the worst case this split function results in a perfect binary tree with n leaves. Consequently, the total number of nodes is 2n − 1, which corresponds to 2(2n − 1) consistency checks (lines 5 and 11). Other split functions might result in a similar number of consistency checks in the worst case as well, since in any case MXP has to traverse a binary tree with n leaves. For instance, the function split(n) = n − 1 results in a tree with one branch of the depth n − 1 and n leaves, that is, 2n − 1 nodes to traverse. However, while the number of nodes to explore might be comparable, the important point is that the computational costs for the individual consistency checks can be different depending on the splitting strategy. Under the reasonable assumption that consistency checking of smaller sets of constraints requires less time, the function split(n) = n/2 allows MXP to split the set of constraints faster, thus improving the overall runtime.

3.3 Example

Consider a CSP consisting of six constraints {c0, ..., c5}. The constraint c0 is considered correct, i.e., B = {c0}. Let {{c0, c1, c3}, {c0, c5}, {c2, c4}} be the set of minimal conflicts. Algorithm 2 proceeds as follows (Figure 1).

Since the input CSP (B ∪ C) is not consistent, the algorithm enters the recursion. In the first step, FINDCONFLICTS partitions the input set (line 7) into the two subsets C1 = {c1, c2, c3} and C2 = {c4, c5} and provides them as input to the recursive calls (lines 8 and 9). In the next level of the recursion – marked with ② in Figure 1 – the input is found to be inconsistent (line 5) and again partitioned into two sets (line 7). In the subsequent calls, ③ and ④, the two input sets are found to be consistent (line 5) and, therefore, the set {c1, c2, c3} has to be analyzed using GETCONFLICT (lines 12 and 13) defined in Algorithm 1. GETCONFLICT returns the conflict {c1, c3}, which is added to Γ. Finally, FINDCONFLICTS removes c1 from the set C1 and returns the tuple ⟨{c2, c3}, {{c1, c3}}⟩ to ①.

Next, the “right-hand” part of the initial input, the set C2 = {c4, c5}, is provided as input to FINDCONFLICTS ⑤. Since C2 is inconsistent, it is partitioned into two sets C1 = {c4} and C2 = {c5}. The first recursive call ⑥ returns ⟨{c4}, ∅⟩ since the input is consistent. The second call ⑦, in contrast, finds that the input comprises only one constraint that is inconsistent with the background theory B. Therefore, it returns ⟨∅, {{c5}}⟩ in line 6. Since C1′ ∪ C2′ = {c4} ∪ ∅ is consistent with B, FINDCONFLICTS ⑤ returns ⟨{c4}, {{c5}}⟩ to ①.

Finally, in ① the set of constraints C1′ ∪ C2′ = {c2, c3} ∪ {c4} is found to be inconsistent with B (line 11) and GETCONFLICT is called. The method returns the conflict {c2, c4} and c2 is removed from C1′. The resulting set {c3, c4} is consistent and MXP returns Γ = {{c1, c3}, {c5}, {c2, c4}}.

3.4 Properties of MERGEXPLAIN

Theorem 2. Given a background theory B and a set of constraints C, Algorithm 2 always terminates and returns
• ‘no solution’, if B is inconsistent,
• ∅, if B ∪ C is consistent, and
• a set of minimal conflicts Γ, otherwise.

Proof. In the first case, given an inconsistent background theory B, the algorithm terminates in line 1 and returns ‘no solution’. In the second case, if the set B ∪ C is consistent,
[Figure 1: MERGEXPLAIN recursion tree. Each node shows values of selected variables in the FINDCONFLICTS function.]
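The recursion illustrated in Figure 1 can be replayed in code. The sketch below is our own construction (not the authors' Java implementation): it transcribes GETCONFLICT from Algorithm 1 and FINDCONFLICTS from Algorithm 2 with the default split(n) = n/2 and elimination strategy, over six toy constraints on three 0/1 variables chosen so that the minimal conflicts are exactly {c0, c1, c3}, {c0, c5}, and {c2, c4} as in Section 3.3:

```python
from itertools import product

# Toy constraints (our own encoding) reproducing the conflicts of Section 3.3.
def c0(v): return v["a"] == 0
def c1(v): return v["b"] == 0
def c2(v): return v["c"] == 0
def c3(v): return v["a"] + v["b"] >= 1   # clashes with {c0, c1}
def c4(v): return v["c"] == 1            # clashes with c2
def c5(v): return v["a"] == 1            # clashes with c0

def is_consistent(cs):
    # Brute-force consistency check over all 0/1 assignments to a, b, c.
    return any(all(c(dict(zip("abc", vals))) for c in cs)
               for vals in product((0, 1), repeat=3))

def get_conflict(B, D, C):               # GETCONFLICT of Algorithm 1
    if D and not is_consistent(B):
        return []
    if len(C) == 1:
        return C
    k = len(C) // 2
    D2 = get_conflict(B + C[:k], C[:k], C[k:])
    D1 = get_conflict(B + D2, D2, C[:k])
    return D1 + D2

def find_conflicts(B, C):                # FINDCONFLICTS of Algorithm 2
    if is_consistent(B + C):
        return C, []
    if len(C) == 1:
        return [], [C]
    k = len(C) // 2
    C1, G1 = find_conflicts(B, C[:k])
    C2, G2 = find_conflicts(B, C[k:])
    G = G1 + G2
    while not is_consistent(C1 + C2 + B):
        X = get_conflict(B + C2, C2, C1)
        CS = X + get_conflict(B + X, X, C2)
        C1 = [c for c in C1 if c is not X[0]]   # drop one α ∈ X (line 14)
        G.append(CS)
    return C1 + C2, G

rest, Gamma = find_conflicts([c0], [c1, c2, c3, c4, c5])
print(sorted(sorted(f.__name__ for f in cs) for cs in Gamma))
```

With B = {c0}, the run yields the same Γ = {{c1, c3}, {c5}, {c2, c4}} as the worked example (printed in sorted order), even though the n/2 split visits the constraints in a different recursion shape than Figure 1.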
then no subset of C is a conflict. MXP terminates and re- strategies MXP might not return enough minimal conflicts
turns ∅. for the HS-tree algorithm to compute at least one diagnosis.
Finally, if the set B ∪ C is inconsistent, the algorithm en- For instance, let {{c1 , c2 } , {c1 , c3 } , {c2 , c4 }} be the set of
ters the recursion in line 3. The function FIND C ONFLICTS all minimal conflicts. If MXP returns Γ = {{c1 , c2 }}, which
in each call partitions the input set C into two sets C1 and is one of the possible valid outputs, then the HS-tree algo-
C2 . The partitioning continues until either the found set rithm fails to find a diagnosis as {c1 , c2 } must be hit twice.
of constraints C is consistent or a singleton conflict is de- In this case, the HS-tree algorithm must call MXP multiple
tected. Therefore, every recursion branch ends after at most times or another algorithm for diagnosis computation must
log |C|−1 calls. Consequently, FIND C ONFLICTS terminates if be used, e.g., [23].
the conflict detection loop in lines 11–15 always terminates. Corollary 2. Algorithm 2 is sound, i.e., every set CS ∈ Γ
We consider two situations. If the set C10 ∪ C20 is consistent is a minimal conflict, and complete, i.e., given a diagnosis
with B, the loop terminates. Otherwise, in each iteration at problem for which at least one minimal conflict exists, Algo-
least one conflict in the set C1′ ∪ C2′ is resolved. This fact follows from Theorem 1, according to which the function GETCONFLICT in Algorithm 1 always returns a minimal conflict if the input parameter C is inconsistent with B. Since the number of conflicts is finite and in each iteration one of the conflicts in C1′ ∪ C2′ is resolved in line 14, the loop will terminate after a finite number of iterations. Consequently, Algorithm 2 terminates and returns a set of minimal conflicts Γ.

Corollary 1. Given a consistent background theory B and a set of inconsistent constraints C, Algorithm 2 always returns a set of minimal conflicts Γ such that there exists a diagnosis ∆i ⊆ ∪_{CSi∈Γ} CSi.

The proof follows from the fact that, similar to the HS-tree algorithm, a conflict is resolved by removing one of its elements from the set of constraints C1 in line 14. The loop in line 11 guarantees that every conflict CSi ∈ C1′ ∪ C2′ is hit. Consequently, FINDCONFLICTS hits every conflict in the input set C, and the set of constraints {α1, ..., αn} removed over all executions of line 14 is a superset of or equal to a diagnosis of the problem. The construction of at least one diagnosis from the found conflicts Γ can be done by the HS-tree algorithm.

MXP can in principle use several strategies for the resolution of conflicts in line 14. The default strategy of MXP is conservative and allows us to find several conflicts at once. Two additional elimination strategies can be used in line 14: (1) C1′ ← C1′ \ X, or (2) C1′ ← C1′ \ CS and C2′ ← C2′ \ CS. These more aggressive strategies result in a smaller number of conflicts returned by MXP in each call, but each call returns its results faster. However, for the latter strategies it is only guaranteed that Algorithm 2 returns Γ ≠ ∅.

The soundness of the algorithm follows from Theorem 1, since the conflict computation of MXP uses the GETCONFLICT function of QXP. The completeness is shown as follows: Let B be a background theory and C a set of faulty constraints, i.e., B ∪ C is inconsistent. Assume MXP returns Γ = ∅, i.e., no minimal conflicts are found. This is impossible, however, since the loop in line 11 would then never end. Consequently, Algorithm 2 would not terminate, which contradicts our assumption. Hence, MXP is complete.

4 Evaluation
We have evaluated the efficiency of computing multiple conflicts at once with MXP using a number of different diagnosis benchmark problems. As a baseline for the comparison, we use QXP as a theorem prover, which returns exactly one minimal conflict at a time. Furthermore, we made measurements with a variant of MXP called PMXP in which lines 8 and 9 are executed in parallel in two threads on a multi-core computer.

4.1 Benchmark Problems
We made experiments with different benchmark problems. First, we used the first five systems of the DX Competition (DXC) 2011 Synthetic Track. For each system, 20 scenarios are specified in which artificial faults were injected. In addition, we made experiments with a number of CSP problems from the CSP solver competition 2008 and several CSP encodings of real-world spreadsheets. The injection of faults was done in the same way as in [17].
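The GETCONFLICT routine (Algorithm 1, i.e., QXP) is referenced throughout but not reproduced in this excerpt. The following is a simplified rendering of Junker's QUICKXPLAIN divide-and-conquer scheme, not the paper's implementation: it assumes only an abstract boolean `is_consistent` oracle, and all function and variable names are ours.

```python
def quickxplain(background, constraints, is_consistent):
    """Return one minimal conflict (a minimal subset of `constraints`
    inconsistent with `background`), or None if no conflict exists."""
    if is_consistent(background + constraints):
        return None
    if not constraints:
        return []  # the background theory itself is inconsistent
    return _qx(background, [], constraints, is_consistent)

def _qx(b, delta, c, is_consistent):
    # If the constraints added last (`delta`) already made b inconsistent,
    # no element of c is needed in the conflict.
    if delta and not is_consistent(b):
        return []
    if len(c) == 1:
        return list(c)
    k = len(c) // 2
    c1, c2 = c[:k], c[k:]
    # Find the conflict elements in c2 given all of c1, then vice versa.
    d2 = _qx(b + c1, c1, c2, is_consistent)
    d1 = _qx(b + d2, d2, c1, is_consistent)
    return d1 + d2
```

With a toy oracle that declares any set containing both constraints 2 and 5 inconsistent, `quickxplain([], [1, 2, 3, 4, 5], oracle)` isolates the conflict {2, 5} with a logarithmic number of oracle calls per element, which is the behavior MXP inherits through GETCONFLICT.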
Proceedings of the 26th International Workshop on Principles of Diagnosis
In addition to these benchmark problems, we developed a diagnosis problem generator, which can be configured to generate (randomized) diagnosis problems with varying characteristics, e.g., with respect to the number of conflicts, their size, or their position in the system description SD.

4.2 Measurement Method
We implemented all algorithms in a Java-based MBD framework, which uses Choco as an underlying constraint solver, see [17]. The experiments were conducted on a laptop computer (Intel i7, 8GB RAM). As a performance indicator we use the time needed ("wall clock") for computing one or more diagnoses. The reported running times are averages of 100 runs of each problem setting, which were done to avoid random effects. We furthermore randomly shuffled the ordering of the constraints in each run to avoid effects that might be caused by a certain positioning of the conflicts in SD. For the evaluation of MXP we used the most aggressive elimination strategy (2) as described in Section 3.4.

Since MXP can return more than one conflict at a time, it is expected to be particularly useful when the problem is to find a set of the n first (leading) diagnoses, e.g., in the context of applying MBD to software debugging [5; 7]. We therefore report the results for the tasks "find-one-diagnosis" (as an extreme case) and "find-n-diagnoses".

System   #C    #V    #F    #D (range)   #D (avg)   |D|    #Cf    |Cf|
74182    21    28    4-5   30-300       139        4.66   4.9    3.3
74L85    35    44    1-3   1-215        66.4       3.13   5.9    8.3
74283    38    45    2-4   180-4,991    1,232.7    4.42   78.8   16.1
74181    67    79    3-6   10-3,828     877.8      4.53   7.8    10.6
c432     162   196   2-5   1-6,944      1,069.3    3.38   15.0   19.8

Table 1: Characteristics of selected DXC benchmarks. #C: number of constraints, #V: number of variables, #F: number of injected faults, #D (range): range of the number of diagnoses, #D (avg): average number of diagnoses, |D|: average diagnosis size, #Cf: average number of conflicts, |Cf|: average conflict size.

System   QXP-5 [ms]   MXP-5 Improv.   QXP-1 [ms]   MXP-1 Improv.
74182    17.0         19%             17.0         19%
74L85    20.9         15%             16.1         19%
74283    61.2         29%             53.8         32%
74181    691.8        45%             637.0        47%
c432     707.5        25%             503.9        37%

Table 2: Performance gains for DXC benchmarks when searching for the first n diagnoses of minimal cardinality.
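The "find-n-diagnoses" task relies on Reiter's HS-tree to turn conflicts into diagnoses. As a rough illustration of that relationship only (not the paper's implementation), the following sketch enumerates minimal hitting sets of a fixed conflict collection by increasing cardinality, which is what a breadth-first HS-tree computes:

```python
from itertools import combinations

def minimal_diagnoses(conflicts, max_card=None):
    """Enumerate minimal hitting sets (diagnoses) of a collection of
    conflicts (sets of constraint ids), in order of increasing size."""
    universe = sorted(set().union(*conflicts))
    limit = len(universe) if max_card is None else max_card
    found = []
    for k in range(1, limit + 1):
        for cand in combinations(universe, k):
            s = set(cand)
            if any(d <= s for d in found):
                continue  # contains a smaller diagnosis: not minimal
            if all(s & c for c in conflicts):  # hits every conflict
                found.append(s)
    return found
```

For the conflicts {1, 2} and {2, 3}, the minimal diagnoses are {2} and {1, 3}. The exhaustive candidate enumeration here is exponential; the HS-tree explores the same space more selectively, which is why the cost of computing the conflict labels (QXP vs. MXP) dominates in practice.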
The task of finding a single diagnosis is comparably simple, and "direct encodings" or algorithms like INVERSE-QUICKXPLAIN [23] are typically more efficient for this task than the HS-tree algorithm. For instance, INVERSE-QUICKXPLAIN requires only O(|∆| log(|C|/|∆|)) calls to TP. If TP can check consistency in polynomial time, then one diagnosis can also be computed efficiently. The problem of finding more than one diagnosis is very different and computationally challenging, because deciding whether an additional diagnosis exists is NP-complete [24]. In such settings the application of methods that are highly efficient for finding one diagnosis is not always advantageous. For instance, the evaluation presented in [14] demonstrates this fact for direct encodings. Therefore a comparison of our algorithm with approaches for the "find-one-diagnosis" problem is beyond the scope of our work, as we are interested in problem settings in which the HS-tree algorithm is favorable and no assumptions about the underlying reasoner should be made. When the task is to find all diagnoses, the performance of MXP is similar to that of QXP, as all existing conflicts have to be determined.

4.3 Results
DXC Benchmark Problems Table 1 shows the characteristics of the analyzed and CSP-encoded DXC benchmark problems. Since we consider multiple scenarios per system, the number of faults and the corresponding diagnoses can vary strongly across the experiment runs.

Table 2 shows the observed performance gains when using MXP instead of QXP in terms of absolute numbers (ms) and the relative improvement. For the problem of finding the first 5 diagnoses (QXP-5/MXP-5), the observed improvements range from 15% up to 45%. For the extreme case of finding one single diagnosis, even slightly stronger improvements can be observed. The improvements when searching for, e.g., the first 10 diagnoses are similar for cases in which significantly more than 10 diagnoses actually exist.

Constraint Problems / Spreadsheets The characteristics for the next set of benchmark problems (six CSP competition instances and five CSP-encoded real-world spreadsheets with injected faults [17]) are shown in Table 3.

Scenario              #C    #V    #F   #D     |D|    #Cf    |Cf|
c8                    523   239   8    4      6.25   7      1.6
costasArray-13        87    88    2    >5     3.6    >565   45.6
domino-100-100        100   100   3    81     2      2      15
graceful–K3-P2        60    15    4    >117   2.94   >12    29.2
mknap-1-5             7     39    1    2      1      1      2
queens-8              28    8     15   9      10.9   15     2.8
hospital payment      38    75    4    40     4      4      3
profit calculation    28    140   5    42     4.25   11     9
course planning       457   583   2    3024   2      2      55.5
preservation model    701   803   1    22     1      1      22
revenue calculation   93    154   4    1452   3      3      15.7

Table 3: Characteristics of selected CSP settings.

The results for determining the first five minimal diagnoses are shown in Table 4². Again, performance improvements of up to 54% can be observed. The obtained improvements vary quite strongly across the different problem instances: the higher the complexity of the underlying problem, the stronger the improvements achieved with our new method. Only in the two cases in which a single conflict exists (see Table 3) can the performance slightly degrade, as MXP performs an additional check whether further conflicts exist among the remaining constraints.

² The results for finding one diagnosis follow the same trend.

Systematically Generated MBD Problems To be able to systematically analyze which factors potentially influence the obtained performance improvements, we developed an MBD problem generator in which we could vary (i) the
overall number of COMPS, (ii) the number of conflicts and their average size (and as a consequence the number of diagnoses), and (iii) the position of the conflicts in the database.

Scenario              QXP [ms]    MXP [ms]   Impr.
c8                    615         376        39%
costasArray-13        1,379,842   629,366    54%
domino-100-100        417         389        7%
graceful–K3-P2        1,611       1,123      30%
mknap-1-5             32          36         -11%
queens-8              281         245        13%
hospital payment      1,717       1,360      21%
profit calculation    86          76         12%
course planning       2,045       1,544      25%
preservation model    371         391        -5%
revenue calculation   109         87         21%

Table 4: Results for CSP benchmarks and spreadsheets when searching for 5 diagnoses.

#Cp   #Cf   |Cf|   Cf Pos.   QXP [ms]   MXP Impr.   PMXP Impr.
50    5     2      Random    351        27%         30%
50    5     2      Left      161        6%          10%
50    5     2      Right     481        69%         70%
50    5     2      LaR       293        51%         57%
50    5     2      Neighb.   261        54%         58%
100   5     2      Random    417        33%         35%
100   5     2      Left      181        14%         17%
100   5     2      Right     622        75%         76%
100   5     2      LaR       351        58%         63%
100   5     2      Neighb.   314        62%         65%
50    15    4      Random    2,300      22%         20%
50    15    4      Left      452        -8%         -4%
50    15    4      Right     1,850      72%         73%
50    15    4      LaR       3,596      22%         18%
50    15    4      Neighb.   166,335    43%         43%

Table 5: Results when varying the problem characteristics. #Cp: number of components, #Cf: number of conflicts, |Cf|: conflict size, Cf Pos.: position of the conflicts in SD.
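The position configurations appearing in Table 5 (Random, Left, Right, LaR, Neighb.) can be mimicked with a small generator sketch. This is our illustration only, not the paper's generator; the function name, the seeding scheme, and the representation of conflicts as index sets are all assumptions.

```python
import random

def plant_conflicts(n_comps, n_conflicts, conflict_size,
                    position="Random", seed=0):
    """Plant conflicts (sets of component indices) into a system
    description of n_comps components, at a configurable position."""
    rng = random.Random(seed)
    half = n_comps // 2
    pools = {
        "Random": range(n_comps),
        "Left": range(half),
        "Right": range(half, n_comps),
    }
    conflicts = []
    for i in range(n_conflicts):
        if position == "LaR":
            # Alternate between the two halves; never span both.
            pool = range(half) if i % 2 == 0 else range(half, n_comps)
            conflicts.append(set(rng.sample(pool, conflict_size)))
        elif position == "Neighb.":
            # A contiguous block of "neighboring" elements.
            start = rng.randrange(n_comps - conflict_size)
            conflicts.append(set(range(start, start + conflict_size)))
        else:
            conflicts.append(set(rng.sample(pools[position], conflict_size)))
    return conflicts
```

For example, `plant_conflicts(50, 5, 2, "Left")` places all conflict elements in the first half of SD, the constellation on which QXP's divide-and-conquer splitting is "lucky".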
We considered the last aspect because the performance of QXP and MXP can largely depend on this aspect³. If, e.g., there is only one conflict and the conflict is represented by the two "left-most" elements in SD, QXP's divide-and-conquer strategy will be able to rule out most other elements very fast.

We evaluated the following configurations regarding the position of the conflicts (see Table 5): (a) Random: The elements of each conflict are randomly distributed across SD; (b) Left/Right: All elements of the conflict appear in exactly one half of SD; (c) LaR (Left and Right): Conflicts are both in the left and the right half, but do not span both halves; (d) Neighb.: Conflicts appear randomly across SD, but only involve "neighboring" elements.

One specific rationale for evaluating these constellations individually is that conflicts in some application domains (e.g., when debugging knowledge bases) might represent "local" inconsistencies in SD.

Since the conflicts are known in advance in this experiment, no CSP solver is needed to determine the consistency of a given set of constraints. Because zero computation times are unrealistic, we added simulated consistency checking times in each call to the TP. The simulated time increases quadratically with the number of constraints to be checked and is capped in the experiments at 10 milliseconds. We made additional tests with different consistency checking times to evaluate to which extent the improvements obtained with MXP depend on the complexity of an individual consistency check for the underlying problem. However, these tests did not lead to any significant differences.

³ We assume a splitting strategy in which the elements are simply split in half in the middle, with no particular ordering of the elements.

Table 5 shows some of the results of this simulation. In this evaluation, we also include the results of the parallelized PMXP variant. The following observations can be made.

(1) The performance of QXP strongly depends on the position of the conflicts. In the probably most realistic Random case, MXP helps to reduce the computation times by around 20-30%. In the constellations that are "unfortunate" for QXP, the speedups achieved with MXP can be as high as 75%. When QXP is "lucky" and all conflicts are clustered in the left part of SD, some improvements or slight deteriorations can be observed for MXP. The latter two situations (all conflicts clustered in one half) are actually quite improbable, but they help us better understand which factors influence the performance.

(2) When comparing the results of the first two blocks in the table, it can be seen that the improvements achieved with MXP are stronger when there are more components in SD and more time is needed for performing the individual consistency checks. This is in line with the results of the other experiments.

(3) Parallelization can help to obtain modest additional improvements. The strongest improvements are observed for the LaR configuration, which is intuitive as PMXP by design explores the left and right halves independently in parallel. Note that in the experiments with the DXC and CSP benchmark problems, in most cases we could not observe runtime improvements through parallelization. This is caused by two facts. First, the consistency checking times are often on average below 1 ms, which means that the relative overhead of starting a new thread can be comparably high. Second, the CSP solver used causes additional overhead and thread synchronization when run in multiple threads in parallel.

5 Related Work
In [10], Junker informally sketches a possible extension of QXP to be able to compute multiple "preferred explanations" in the context of Preference-Based Search (PBS). The general goal of Junker's approach is partially similar to our work, and the proposed extended version of QXP could in theory be used during the HS-tree construction as well.

Technically, Junker proposes to set a choice point whenever a constraint ci is found to be consistent with a partial relaxation during search and thereby look for (a) branches that lead to conflicts not containing ci and (b) branches leading to conflicts in which the removal of ci leads to a solution.

Unfortunately, it is not fully clear from the informal sketch in [10] where the mentioned choice point should be set. If applied in line 5 of Algorithm 1, conflicts are only found in the left-most inconsistent partition. The method would then return only a small subset of all conflicts
MERGEXPLAIN would return. If the split is done for every ci consistent with a partial relaxation during PBS, the resulting diagnosis algorithm corresponds to the binary HS-tree method [25], which according to the experiments in [11] is not generally favorable over HS-tree algorithms, in particular when we are searching for a limited set of diagnoses.

From the algorithm design, note that QXP applies a constructive conflict computation procedure prior to partitioning, whereas MXP does the partitioning first, thereby removing multiple constraints at a time, and then uses a divide-and-conquer conflict detection approach. Finally, our method can, depending on the configuration, make a guarantee about the existence of a diagnosis given the returned conflicts without the need of computing all existing conflicts.

In general, our work is related to a variety of (complete) approaches from the MBD literature which aim to find diagnoses more efficiently than with Reiter's original method. Existing works for example try to speed up the process by exploiting existing hierarchical, tree-like, or distributed structural properties of the underlying problem [16; 26], through parallelization [17], or by solving the dual problem [27; 28; 29]. A main difference to these works is that we make no assumption about the underlying problem structure and leave the general HS-tree procedure unchanged. Instead, our aim is to avoid a full restart of the conflict search process when constructing a new node by looking for potentially existing additional conflicts in each call, and to thereby speed up the overall process.

Besides complete methods, a number of approximate diagnosis approaches have been proposed in recent years, which for example use stochastic and heuristic search [30; 31]. The relation of our work to these approaches is limited, as we are focusing on application scenarios where the goal is to find a few first diagnoses more quickly but at the same time maintain the completeness property. Finally, for some domains, "direct" SAT-based, e.g., [32], or CSP-based, e.g., [33], encodings have been shown to be very efficient for finding one or a few diagnoses. For instance, [33] suggests an encoding scheme that first translates a given diagnosis problem (SD, COMPS, OBS) into a CSP. Then a specific diagnosis algorithm is applied that searches for conflict sets of increasing cardinality, i.e., 1, 2, ..., |COMPS|. The same method is then used to search for diagnoses in the set of all found conflict sets. In order to speed up the computations the author suggests a kind of hierarchical approach that helps the user spot the relevant components. Generally, most of the "direct" methods require the use of additional techniques like hierarchical diagnosis or iterative deepening that constrain the cardinality of computed diagnoses while computing minimal diagnoses.

The concept of conflicts plays a central role in reasoning contexts other than Model-Based Diagnosis, e.g., explanations or dynamic backtracking. Specifically, in recent years a number of approaches were proposed in the context of the maximum satisfiability problem (MaxSAT), see [34] for a recent survey. In these domains, conflicts are referred to as unsatisfiable cores or Minimally Unsatisfiable Subsets (MUSes); Minimal Correction Subsets (MCSes), on the other hand, correspond to the concept of diagnoses in this paper. In [35] or [36], for example, different algorithms were recently proposed to find one solution to the MaxSAT problem, which corresponds to the problem of finding one minimal/preferred diagnosis. Other techniques such as MARCO [29] aim at the enumeration of conflicts. In general, many of these algorithms use a similar divide-and-conquer principle as we do with MXP. However, such algorithms, including the ones listed above, often modify the underlying knowledge base by adding relaxation variables to clauses of a given unsatisfiable formula and then use a SAT solver to find the relaxations. This strategy roughly corresponds to the direct diagnosis approaches discussed above. MXP, in contrast, acts completely independently of the underlying knowledge representation language. Moreover, the problem-independent decomposition approach used by MXP is a novel feature which, to the best of our knowledge, is not present in the existing conflict detection techniques from the MaxSAT field. Specifically, it allows our algorithm to find multiple conflicts more efficiently because it searches for them within independent small subsets of the original knowledge base. In addition, MXP can find conflicts in knowledge bases formulated in very expressive knowledge representation languages, such as description logics, which cannot be efficiently translated to SAT, see also [23].

6 Conclusions
We have proposed and evaluated a novel, general-purpose and non-intrusive conflict detection strategy called MERGEXPLAIN, which is capable of detecting multiple conflicts in a single call. An evaluation on various benchmark problems revealed that MERGEXPLAIN can help to significantly reduce the required computation times when applied in a Model-Based Diagnosis setting in which the goal is to find a defined number of diagnoses and in which no assumption about the underlying reasoning engine should be made.

One additional property of MERGEXPLAIN is that the union of the elements of the returned conflict sets is guaranteed to be a superset of one diagnosis of the original problem. Recent methods like the one proposed in [23] can therefore be applied to find one minimal diagnosis quickly.

Acknowledgements
This work was supported by the Carinthian Science Fund (KWF) contract KWF-3520/26767/38701, the Austrian Science Fund (FWF), and the German Research Foundation (DFG) under contract numbers I 2144 N-15 and JA 2095/4-1 (Project "Debugging of Spreadsheet Programs").

References
[1] Ulrich Junker. QUICKXPLAIN: Conflict Detection for Arbitrary Constraint Propagation Algorithms. In IJCAI ’01 Workshop on Modelling and Solving Problems with Constraints (CONS-1), 2001.
[2] Raymond Reiter. A Theory of Diagnosis from First Principles. Artificial Intelligence, 32(1):57–95, 1987.
[3] Gerhard Friedrich, Markus Stumptner, and Franz Wotawa. Model-Based Diagnosis of Hardware Designs. Artificial Intelligence, 111(1-2):3–39, 1999.
[4] Cristinel Mateis, Markus Stumptner, Dominik Wieland, and Franz Wotawa. Model-Based Debugging of Java Programs. In Proceedings AADEBUG ’00 Workshop, 2000.
[5] Dietmar Jannach and Thomas Schmitz. Model-based diagnosis of spreadsheet programs: a constraint-based debugging approach. Automated Software Engineering, 2014.
[6] Jules White, David Benavides, Douglas C. Schmidt, Pablo Trinidad, Brian Dougherty, and Antonio Ruiz Cortés. Automated Diagnosis of Feature Model Configurations. Journal of Systems and Software, 83(7):1094–1107, 2010.
[7] Kostyantyn Shchekotykhin, Gerhard Friedrich, Philipp Fleiss, and Patrick Rodler. Interactive ontology debugging: Two query strategies for efficient fault localization. Journal of Web Semantics, 12-13:88–103, 2012.
[8] Franz Baader and Rafael Penaloza. Axiom Pinpointing in General Tableaux. Journal of Logic and Computation, 20(1):5–34, 2008.
[9] Johan de Kleer. A Comparison of ATMS and CSP Techniques. In Proceedings IJCAI ’89, pages 290–296, 1989.
[10] Ulrich Junker. QUICKXPLAIN: Preferred Explanations and Relaxations for Over-Constrained Problems. In Proceedings AAAI ’04, pages 167–172, 2004.
[11] Ingo Pill, Thomas Quaritsch, and Franz Wotawa. From Conflicts to Diagnoses: An Empirical Evaluation of Minimal Hitting Set Algorithms. In Proceedings DX ’11 Workshop, pages 203–211, 2011.
[12] Alexander Feldman, Gregory Provan, Johan de Kleer, Stephan Robert, and Arjan van Gemund. Solving Model-Based Diagnosis Problems with Max-SAT Solvers and Vice Versa. In Proceedings DX ’10 Workshop, pages 185–192, 2010.
[13] Amit Metodi, Roni Stern, Meir Kalech, and Michael Codish. A Novel SAT-Based Approach to Model Based Diagnosis. Journal of Artificial Intelligence Research, 51:377–411, 2014.
[14] Iulia Nica, Ingo Pill, Thomas Quaritsch, and Franz Wotawa. The Route to Success – A Performance Comparison of Diagnosis Algorithms. In Proceedings IJCAI ’13, pages 1039–1045, 2013.
[15] Russell Greiner, Barbara A. Smith, and Ralph W. Wilkerson. A Correction to the Algorithm in Reiter’s Theory of Diagnosis. Artificial Intelligence, 41(1):79–88, 1989.
[16] Markus Stumptner and Franz Wotawa. Diagnosing tree-structured systems. Artificial Intelligence, 127(1):1–29, 2001.
[17] Dietmar Jannach, Thomas Schmitz, and Kostyantyn Shchekotykhin. Parallelized Hitting Set Computation for Model-Based Diagnosis. In Proceedings AAAI ’15, pages 1503–1510, 2015.
[18] Johan de Kleer. Hitting set algorithms for model-based diagnosis. In Proceedings DX ’11 Workshop, pages 100–105, 2011.
[19] Ulrich Junker. Preference-Based Search and Multi-Criteria Optimization. Annals of Operations Research, 130:75–115, 2004.
[20] Ingo Pill and Thomas Quaritsch. Optimizations for the Boolean Approach to Computing Minimal Hitting Sets. In Proceedings ECAI ’12, pages 648–653, 2012.
[21] Alexander Felfernig, Gerhard Friedrich, Dietmar Jannach, and Markus Stumptner. Consistency-based diagnosis of configuration knowledge bases. Artificial Intelligence, 152(2):213–234, 2004.
[22] Gerhard Friedrich and Kostyantyn Shchekotykhin. A General Diagnosis Method for Ontologies. In Proceedings ISWC ’05, pages 232–246, 2005.
[23] Kostyantyn Shchekotykhin, Gerhard Friedrich, Patrick Rodler, and Philipp Fleiss. Sequential diagnosis of high cardinality faults in knowledge-bases by direct diagnosis generation. In Proceedings ECAI ’14, pages 813–818, 2014.
[24] Thomas Eiter and Georg Gottlob. The Complexity of Logic-Based Abduction. Journal of the ACM, 42(1):1–49, 1995.
[25] Li Lin and Yunfei Jiang. The computation of hitting sets: Review and new algorithms. Information Processing Letters, 86(4):177–184, May 2003.
[26] Franz Wotawa and Ingo Pill. On classification and modeling issues in distributed model-based diagnosis. AI Communications, 26(1):133–143, 2013.
[27] Ken Satoh and Takeaki Uno. Enumerating Minimally Revised Specifications Using Dualization. In JSAI ’05 Workshop, pages 182–189, 2005.
[28] Roni Stern, Meir Kalech, Alexander Feldman, and Gregory Provan. Exploring the Duality in Conflict-Directed Model-Based Diagnosis. In Proceedings AAAI ’12, pages 828–834, 2012.
[29] Mark H. Liffiton, Alessandro Previti, Ammar Malik, and Joao Marques-Silva. Fast, Flexible MUS Enumeration. Constraints, pages 1–28, 2015.
[30] Lin Li and Jiang Yunfei. Computing Minimal Hitting Sets with Genetic Algorithm. In Proceedings DX ’02 Workshop, pages 1–4, 2002.
[31] Alexander Feldman, Gregory Provan, and Arjan van Gemund. Approximate Model-Based Diagnosis Using Greedy Stochastic Search. Journal of Artificial Intelligence Research, 38:371–413, 2010.
[32] Amit Metodi, Roni Stern, Meir Kalech, and Michael Codish. Compiling Model-Based Diagnosis to Boolean Satisfaction. In Proceedings AAAI ’12, pages 793–799, 2012.
[33] Yannick Pencolé. DITO: a CSP-based diagnostic engine. In Proceedings ECAI ’14, pages 699–704, 2014.
[34] Antonio Morgado, Federico Heras, Mark Liffiton, Jordi Planes, and Joao Marques-Silva. Iterative and core-guided MaxSAT solving: A survey and assessment. Constraints, 18(4):478–534, 2013.
[35] Jessica Davies and Fahiem Bacchus. Postponing optimization to speed up MAXSAT solving. In Proceedings CP ’13, pages 247–262, 2013.
[36] Alexey Ignatiev, Antonio Morgado, Vasco Manquinho, Ines Lynce, and Joao Marques-Silva. Progression in Maximum Satisfiability. In Proceedings ECAI ’14, pages 453–458, 2014.
A Robust Alternative to Correlation Networks for Identifying Faulty Systems

Patrick Traxler¹, Pablo Gómez², and Tanja Grill¹
¹ Software Competence Center Hagenberg, Austria
e-mail: patrick.traxler@scch.at, tanja.grill@scch.at
² Institute of Applied Knowledge Processing, Johannes Kepler University, Linz, Austria
e-mail: pablo.gomez@faw.jku.at

Abstract
We study the situation in which many systems relate to each other. We show how to robustly learn relations between systems to conduct fault detection and identification (FDI), i.e., the goal is to identify the faulty systems. Towards this, we present a robust alternative to the sample correlation matrix and show how to randomly search in it for a structure appropriate for FDI. Our method applies to situations in which many systems can be faulty simultaneously, and thus our method requires an appropriate degree of redundancy. We present experimental results with data arising in photovoltaics and supporting theoretical results.

[Figure 1: (a) Fitness matrix, (b) Digraph]
Figure 1: Learning relations between 6 systems. We draw an edge between two systems if there is a strong linear relation between them. First, we compute the fitness matrix, 1(a), our robust alternative to the sample correlation matrix. Darker colors mean a stronger linear relation. Going from Fig. 1(a) to 1(b) is a discretization step via thresholding. The digraph is the input for conducting FDI.
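The discretization step described in the caption (fitness matrix to digraph via thresholding) can be sketched as follows. This is our illustration; the threshold value and the nested-list representation of the fitness matrix are assumptions, not details given by the paper.

```python
def threshold_digraph(fitness, tau=0.5):
    """Turn a pairwise fitness matrix (fitness[i][j] measures the degree
    of linearity between systems i and j) into adjacency lists by
    keeping, for each system i, the systems j with fitness above tau."""
    n = len(fitness)
    return {i: [j for j in range(n) if j != i and fitness[i][j] > tau]
            for i in range(n)}
```

The resulting adjacency structure is exactly what the FDI step consumes: for each system, the list of neighbors from which its current value can be estimated.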
1 Introduction
The increasing number of technical systems connected to the Internet raises new challenges and possibilities in diagnosis. Large amounts of data need to be processed and analyzed. Faults need to be detected and identified. Systems exist in different configurations, e.g., two systems of the same type that have different sets of sensors. Knowledge about the system design is often incomplete. Data is often unavailable due to unreliable data connections. Besides these and other difficulties, the large amount of data also opens new possibilities for diagnosis based on machine learning.

The idea of our approach is to conduct fault detection and identification (FDI) by comparing data of similar systems. We assume to have data of machines, devices, or systems of a similar type and want to know whether some system is faulty and, if so, to identify the faulty systems. This situation may deviate from classic diagnosis problems in that we only have limited information (e.g. sensor or control information) about system internals. Moreover, we may have incomplete knowledge about the system design. This makes manual system modeling hard or even impossible. The problem is then to compare the limited information of the working systems (perhaps only input-output information) to identify faulty systems.

In this work we tackle one concrete problem of this kind. It is motivated by photovoltaics. We describe it in more detail below. The problem that arises in our and other applications is that not every two systems can be compared. We thus need to learn relations between systems.

There are different approaches to learning structure, e.g. learning Bayesian networks, Markov random fields, or similar concepts. The concept that fits our needs is correlation networks. A correlation network is some structure in the correlation matrix, e.g. a minimum spanning tree or a clustering. In our application we have n variables which represent the produced energy per photovoltaic system. Given that a single system correlates strongly with enough other systems, we use this information for FDI via applying a median.

We can also think of correlation networks as a method for knowledge discovery. They have been applied in areas such as biology [18; 10] and finance [12] to analyze gene co-expression and financial markets. In our situation, the first step is to learn linear relations between systems. For learning we need historical data. A sample result of this step is depicted in Fig. 1. In Fig. 1(a) the fitness matrix, our robust alternative to the correlation matrix, is shown. It represents the degree of linearity between any two systems. For FDI, the second step of our method, we work with the result as depicted in 1(b) and current data. In the example, we derive for each of the six systems an estimate m̂i of its current value yi from its neighbors' current values, e.g. for system 1 we get an estimate from the current values of systems 2, 3, and 4, and for system 5 from system 6. Finally, we test for a fault by checking whether |m̂i − yi| is large.

The major difficulty we try to tackle with this approach is the presence of many faults. Faults influence both the learning problem and the FDI problem. Robustness is an essential property of our algorithms. Our result can be seen as a robust structure learning algorithm for the purpose of FDI. Robustness is a preferable property of many learning
and estimation algorithms. However, the underlying optimization problems, unlike their non-robust variants, are often NP-hard. This is for example the case for computing robust and non-robust estimators for linear regression, e.g. Least Median of Squares versus Ordinary Least Squares [16]. We avoid NP-hardness by a careful modeling of our problem. In particular, our algorithms are computationally efficient. Under some conditions, FDI can be done in (almost) linear time in the number of systems n.

To summarize our contributions, we introduce a novel alternative to the sample correlation matrix and present a first use of it to discover structure appropriate for FDI in general and for identifying faulty photovoltaic (PV) systems in particular. Our method works in the presence of many faults. Our algorithms are computationally efficient. Our method incorporates a couple of techniques from machine learning and statistics: (repeated) Theil-Sen estimation for robust simple linear regression, trimming to obtain a robust fitness measure, randomized subset selection for improved running time, and a median mechanism to conduct FDI.

In Sec. 2 we discuss our method. In Sec. 3 we present experimental and theoretical results.

1.1 Motivating Application: Identifying Faulty Photovoltaic Systems
Faults influence the performance of photovoltaic (PV) systems. PV systems produce less energy than possible if faults occur. We can distinguish between two kinds of faults: faults caused by an exogenous event, such as shading, (melting) snow, and tree leaves covering solar modules; and faults caused by endogenous events, such as module defects and degradation, defects at the power inverter, and string disconnections.

We are going to detect faults by estimating the drop in produced energy. Most of the common faults result in such a drop. The particular problem is given by the sensor setup. We only assume to know the produced energy and possibly, but not necessarily, the area (e.g. the zip code) where the PV system is located.

We apply our method to PV system data. Difficulties in

1.2 Related Work
Correlation networks have applications in biology and finance; see e.g. [12; 18; 10] and the references therein. In biology [18; 10], they are applied to study gene interactions. The correlation matrix is the basis for clustering genes and the identification of biologically significant clusters. In [18; 10], a scale-free network is derived via the concept of topological overlap. Scale-free networks tend to have few nodes (genes) with many neighbors, so-called hubs.

Correlation networks are primarily used for knowledge discovery. In particular, concepts such as clusters, hubs, and spanning trees are interpreted in the context of biology and finance. In our work, we introduce a robust alternative to correlation networks.

Other structural approaches, i.e. approaches based on graphical models, build on Bayesian networks, Markov random fields, and similar concepts. Gaussian Markov random fields are loosely related to correlation networks. Their structure is described by the precision matrix, the inverse covariance matrix (ch. 17.3, [9]).

Another structural approach is FDI in sensor networks [7; 4; 19; 20]. The current approaches [7; 4; 19] mainly deal with wireless sensor networks. The algorithms usually use the median for FDI, as we do. The difference is that FDI in wireless sensor networks uses a geometric model similar to interpolation methods. It requires the geographic location of the sensors. It is assumed that two sensors close to each other have similar values. This cannot be assumed in general. To overcome these problems of manual modeling, we apply machine learning techniques.

Models for PV systems are compared in [14]. All these models require the plane-of-array irradiance. Fault detection of PV systems is the topic of e.g. [3; 8; 5; 2; 17]. Firth et al. [8] consider faults in which the PV system generates no energy. Another type of fault occurs if the panels are covered by snow, tree leaves, or something else. In this case, we can observe a drop in energy. It is considered e.g. in [5]. The fraction of panel area covered is a crucial parameter. All these approaches [3; 8; 5; 2; 17] require at least the knowledge of the plane-of-array irra-
the application are different system types and deployments diance, i.e. it requires an irradiance sensor installed. We do
of systems. For example, different number of strings and make this assumption.
modules per string and differing orientation (north, west, The median is common in fault detection and identifica-
south, east) of the modules; see Fig. 2. Moreover, the lack of tion. One reason for this circumstance is its optimal break-
information due to the lack of sensors and incomplete data down point [16]. We also make use of (repeated) Theil-
due to unreliable data connections. Faults occur frequently, Sen algorithms [6; 15] for learning. An ingredient of our
in particular exogenous faults during winter. fault identification algorithm is the algorithm for median
The novelty of our work in the context of photovoltaics selection [1] and an algorithm for generating uniform sub-
is that it works in an extremely restrictive (only power mea- sets efficiently (see e.g. the Fisher-Yates or Knuth shuffle
surement) sensor setting. To the best of our knowledge, we in [13].) In our algorithm analysis we derive bounds for a
are the first to consider this restrictive sensor setting. We partial Cauchy-Vandermonde identity (pg. 20 in [11]).
only need to know the produced energy of a PV system.
There is also the implicit assumption, which is tested by 2 Method
the learning algorithm, that the systems are not too far from
each other so that we can observe them in similar working 2.1 Data Model for Incomplete Data
(environmental) conditions. Distances of a couple of kilo- We have data from n systems and one data stream per sys-
meters are possible. Systems which are very close to each tem. A data stream for system i ∈ {1, . . . , n} is given by a
other and have the same orientation such as systems in a set Ni ⊆ {1, . . . , N } of available data and values xi,t ∈ R
solar power plant yield the best results. Other approaches with t ∈ Ni . We can think of the parameter t as discrete
assume the presence of a plane-of-array-irradiance sensor time. With Ni , we explicitly model data availability. Incom-
which are mostly deployed for solar power plants. Irradi- plete data is a common problem in our situation. Causes in
ance estimations via satellite imaging are usually not accu- practice are unreliable data connections or unreliable sen-
rate enough. sors. We call D := {(xi,t )t∈Ni : i ∈ {1, . . . , n}} a data
12
Proceedings of the 26th International Workshop on Principles of Diagnosis
set. Sets of historical and current data are the inputs to our algorithms.

2.2 Fitness Matrix – Definition and Robustness

The fitness matrix is intended as a robust replacement for the sample correlation matrix. The sample correlation coefficient, like the sample covariance, is well known to be sensitive to faults (outliers) [16]. As an example, we generated the data for Fig. 1 with faults. The non-robust sample correlation matrix would have yielded a digraph without edges instead of the digraph in Fig. 1(b).

A fault can be an arbitrary corruption of a single data item x_{i,t}. That is, x_{i,t} = x̃_{i,t} + ∆, ∆ ≠ 0, where ∆ is the fault. We think of x̃_{i,t} as the actual or true but unobserved value.

We do not make any assumptions on the faults themselves, only on their number. This is at the core of the definition of the breakdown point. This statistical concept is defined for a particular estimation or learning problem; in our case, for simple linear regression.

Linear regression is closely related to the correlation coefficient. For simple linear regression (a regression model with one independent and one dependent variable), the correlation coefficient can be seen as a fitness measure of the line which fits the data best w.r.t. vertical squared distances; see e.g. [16]. However, the corresponding estimator, namely ℓ2-regression, a.k.a. ordinary least squares, is known to be sensitive to outliers [16]. On the other hand, there are estimators for simple linear regression which are robust to a large number of faults, i.e. they have a large breakdown point.

The idea underlying the fitness matrix is thus to replace the correlation coefficient (and ℓ2-regression) by a robust notion of fitness based on robust linear regression. In the remainder of this section we recall the definition of the breakdown point following [16], pg. 9, and we formalize the notion of fitness matrices.

We define the breakdown point only for simple linear regression. We fix two systems i, j ∈ {1, ..., n} and define Z := Z_{i,j} := {(x_{i,t}, x_{j,t}) : t ∈ N_i ∩ N_j}. Let T be a regression estimator, i.e. T(Z_{i,j}) = θ̂ ∈ R² is the intercept and slope for the data set Z_{i,j}. For Z, we define Z′ as Z with m data points arbitrarily corrupted. Define

    bias(m; T, Z) := sup_{Z′} ‖T(Z) − T(Z′)‖.

If bias(m; T, Z) is infinite, then m faults (outliers) have an arbitrarily large effect on the estimate T(Z′). The (finite sample) breakdown point of T and Z is defined as

    ε*(T, Z) := min { m/|Z| : bias(m; T, Z) = ∞ }.

To explain this notion, we consider four typical examples. The breakdown point ε*(T_{ℓ2}, Z) is 1/n for ℓ2-regression; this holds for any Z. The situation is different for ℓ1-regression in that ε*(T_{ℓ1}, Z) = 1/n for some Z.

In this work we use the Theil-Sen estimator¹ T^{TS}, a.k.a. median slope selection. The reasons are its breakdown point of at least 1 − 1/√2 ≥ 0.292 (see e.g. [6]) and the wide availability of efficient implementations of near-linear time algorithms. There is also a variant of T^{TS}, called the repeated Theil-Sen estimator, which has a breakdown point of 0.5 but less efficient implementations. The concrete definition of T^{TS} can be found e.g. in [6]. It is, however, not important for our application; only its robustness property and the existence of efficient implementations are.

¹ There is a subtle issue here we have to deal with. Regression problems are optimization problems, and the solution to the concrete optimization problem need not be unique. In our situation, intercept and slope are unique for ℓ2-regression but not for ℓ1-regression. The estimator T_{ℓ1} is, however, unique for a (deterministic) algorithm solving the optimization problem. We thus think of T_{ℓ1} as the output of a particular (deterministic) algorithm.

To define the breakdown point of a fitness matrix, let f be a real-valued function defined on any finite data set. We define the fitness matrix as

    F_{ji} := f(Z_{i,j})

and its breakdown point as

    ε*(F) := min_{i,j} ε*(f, Z_{i,j}).

Next, we provide the fitness matrix we are going to use. It has the property that F_{ji} is close to zero if x_i and x_j are strongly linearly related, and it has a high breakdown point. Let y_t := x_{i,t} and ŷ_t := x_{j,t}·θ̂₂ + θ̂₁, t ∈ N_i ∩ N_j, for the Theil-Sen estimate θ̂ of Z_{i,j}. Let r_t := ŷ_t − y_t be the residuals, and let i_1, ..., i_k ∈ N_i ∩ N_j with k := |N_i ∩ N_j| be such that |r_{i_1}| ≤ ··· ≤ |r_{i_k}|. We define

    f^{TS}(Z_{i,j}) := ( Σ_{t=1}^{⌊k/√2⌋} |r_{i_t}| ) / ( Σ_{t=1}^{⌊k/√2⌋} |y_{i_t}| ).    (1)

We define F^{TS} w.r.t. f^{TS}, i.e.

    F^{TS}_{ji} := f^{TS}(Z_{i,j}).

Note that the sum goes from 1 up to ⌊k/√2⌋. This trimming, together with the high breakdown point of Theil-Sen, directly implies the following result.

Theorem 1. It holds that ε*(F^{TS}) ≥ 1 − 1/√2.

Finally, we compare the sample correlation matrix and the fitness matrix. Let C denote the sample correlation matrix and define C′_{ji} := 1 − |C_{ji}|. Both matrices have the property that if some entry is close to 0 then x_i and x_j have a strong linear relation. It is guaranteed that C′_{ji} is at most 1; a value close to 1 means a weak linear relation. For F^{TS}_{ji}, it is not guaranteed that F^{TS}_{ji} ≤ 1, but experimental results suggest that this is usually the case. We also note that both matrices obey a weak form of the triangle inequality, since if x_i and x_j correlate strongly and x_j and x_k correlate strongly, then x_i and x_k also correlate.

There are two important benefits of fitness matrices over correlation matrices: they are robust, and they are also defined for incomplete data. On the negative side, the fitness matrix is not positive semi-definite, in particular not symmetric.

2.3 Structure in Fitness Matrices – Algorithms LEARN and IDENTIFY

We want to identify faulty systems. In a first step, we learn a structure appropriate for FDI; see algorithm LEARN. We obtain it via thresholding the fitness matrix. Most correlation networks, i.e. structures arising from the sample correlation matrix, are obtained in this way [12; 18; 10]. We denote the threshold by θ ≥ 0 and define the threshold fitness matrix as

    F_{ji;θ} := F_{ji} if F_{ji} ≤ θ, and F_{ji;θ} := 0 if F_{ji} > θ.
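The fitness computation of Eq. 1 can be sketched in a few lines. This is an illustrative simplification, not the paper's implementation: it assumes complete data and uses the plain O(k²) pairwise-slope form of Theil-Sen instead of the near-linear-time algorithm of [6]; the names `theil_sen` and `fitness_ts` are introduced here for illustration.

```python
import math
import statistics

def theil_sen(pts):
    # Theil-Sen estimate: median of all pairwise slopes, then median intercept.
    slopes = [(q2 - q1) / (p2 - p1)
              for i, (p1, q1) in enumerate(pts)
              for (p2, q2) in pts[i + 1:] if p2 != p1]
    b = statistics.median(slopes)                     # slope (theta_2)
    a = statistics.median(y - b * x for x, y in pts)  # intercept (theta_1)
    return a, b

def fitness_ts(z):
    # f^TS (Eq. 1): regress x_i on x_j via Theil-Sen, sort residuals by
    # magnitude, keep the floor(k / sqrt(2)) smallest, and normalize the
    # trimmed residual sum by the |y| values at the kept indices.
    a, b = theil_sen([(xj, xi) for xi, xj in z])
    ranked = sorted((abs(b * xj + a - xi), abs(xi)) for xi, xj in z)
    kept = ranked[: math.floor(len(z) / math.sqrt(2))]
    return sum(r for r, _ in kept) / sum(y for _, y in kept)

# x_i = 2 x_j + 1 with two grossly corrupted values of x_i: Theil-Sen still
# recovers the line, and the trimming discards the two large residuals.
z = [(2.0 * t + 1.0, float(t)) for t in range(10)]
z[3], z[7] = (100.0, 3.0), (-50.0, 7.0)
print(fitness_ts(z))  # 0.0
```

A sample-correlation-based fitness would be dominated by the two corrupted items; the trimmed Theil-Sen fitness is not, which is exactly the robustness that Theorem 1 quantifies.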
Algorithm 1: Algorithm LEARN with input D (data set) and parameter θ (fitness threshold). Output is a digraph G with edge labels (intercept, slope) representing the threshold fitness matrix.

    Let G = (V, E) be a digraph with V = {1, ..., n} and E = {}.
    for all i ∈ V and j ∈ V \ {i} do
        Learn (Theil-Sen) the intercept a_{j,i} and slope b_{j,i} between x_i (dependent variable) and x_j (independent variable).
    end for
    for all i ∈ V and j ∈ V \ {i} do
        Compute the trimmed fitness f = f^{TS} (Eq. 1) of Z_{i,j}.
        if f ≤ θ then
            Add to G the directed edge from j to i with edge labels (a_{j,i}, b_{j,i}).
        end if
    end for

The input to algorithm LEARN is a data set D as described in Sec. 2.1. It outputs a digraph G = (V, E), i.e. the (possibly sparse) threshold fitness matrix F^{TS}_θ. Additionally, the intercept and slope of the simple linear regressions are added as edge labels.

Algorithm 2: Algorithm IDENTIFY with input G (digraph with edge labels), current data y_i for the i-th system, and parameters k and s (deviation). It outputs the set of all faulty systems H.

    Set H = {}.
    for all i ∈ V = {1, ..., n} do
        Let N_i^- := {j ∈ V : (j, i) ∈ E}.
        if |N_i^-| = 0 then
            Continue with the next (system) i.
        end if
        if |N_i^-| ≥ k then
            Select uniformly at random a k-element subset S from N_i^-.
        else
            Set S := N_i^-.
        end if
        Compute M_i := {ŷ_j = b_{j,i}·y_j + a_{j,i} : j ∈ S}.
        Compute the median m̂_i of M_i.
        Add i to H if |m̂_i − y_i| > s.
    end for
    Output H.

In the second step, we identify the faulty systems; see algorithm IDENTIFY. Its input is the result of algorithm LEARN. Algorithm IDENTIFY constructs a random digraph of in-degree at most k for FDI. It works as follows. Independently for every system, we choose uniformly at random at most k of its neighbors in the digraph G and compute the median m̂_i of the estimated values derived from the selected neighbors' values. We compare the median m̂_i to the current system value and decide via the deviation parameter s whether the system has a fault or not.

We discuss the threshold parameter θ and the deviation parameter s in Sec. 3.1. They essentially depend on the variance in the data set D. Parameter k in algorithm IDENTIFY has the purpose of improving running time efficiency. In particular, we have the following result.

Theorem 2. Let D be a data set with n systems and let m := max_i |N_i|. The running time of LEARN is O(n²·m·log(m)). The running time of IDENTIFY is O(k·n).

Proof. LEARN: There are O(n²) pairs of systems. The Theil-Sen estimator can be computed in time O(m·log(m)) [6]. The computation of f^{TS}, Eq. 1, is done via sorting and thus also takes time O(m·log(m)).
IDENTIFY: Assume |N_i^-| ≥ k. We choose uniformly at random a k-element subset of N_i^- and compute the median. For random selection we can use for example the Fisher-Yates (or Knuth) shuffle [13], which runs in time O(k), and for median selection the algorithm in [1], which also runs in time O(k). The second case, 1 ≤ |N_i^-| ≤ k−1, is analogous. This shows that the overall running time of IDENTIFY is O(k·n).

In Sec. 3.2, we provide sufficient conditions under which IDENTIFY works correctly even if k = O(log(n)). This is a strong running time improvement, from O(n²) to O(n·log(n)).

3 Results

3.1 Experimental

In this section we discuss how to apply our method (Sec. 2) to photovoltaic data. In particular, it remains to discuss how the use case fits the model, more precisely why there is a strong correlation between PV systems. Finally, we present experimental results to verify the estimation and fault identification quality of our algorithms.

Use-Case: Photovoltaics

A simple system model for PV systems is as follows:

    P_i = c_i · I_i.

Here, P_i is the power, I_i the plane-of-array (POA) irradiance of the i-th system, and c_i a constant of the system which can be interpreted as the efficiency of converting solar energy into electrical energy. More complex physical models include system variables such as the module temperature [14; 17]. Our considerations translate to the more complex models as long as they are time-independent. We also note that these models are more accurate, but only slightly, since the POA irradiance has the most critical influence on the produced energy.

We get from the above considerations that P_i = c′_{ij}·P_j given that I_i = I_j. In our situation we cannot test the condition I_i = I_j since we do not know the POA irradiance, but I_i ≈ I_j holds if the systems operate under similar weather conditions and have a similar orientation. The former holds if the systems are close to each other. To reduce the effect of different orientations (see Fig. 2), we consider the following model: P_i^∆ = u_{ij}·P_j^∆ + v_{ij}. The variable P_i^∆ is the power within a time interval ∆, usually one hour. The variables u_{ij} and v_{ij} are the unknowns.

In more general words, let Y_i be the output of the i-th system and let X_i describe the system input and system internals. Our model assumption is that for a reasonable number of system pairs (i, j), the system outputs Y_i and Y_j are linearly related given that X_i ≈ X_j. By the above considerations, it is plausible that PV systems fulfill these requirements.
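Algorithm IDENTIFY above admits a minimal runnable sketch. This is an illustration, not the paper's implementation: the toy data, the complete digraph with identity edge labels, and the fault magnitude are assumptions, and Python's `random.sample` and `statistics.median` stand in for the O(k) Fisher-Yates shuffle [13] and median selection [1].

```python
import random
import statistics

def identify(edges, y, k, s):
    # edges[(j, i)] = (a_ji, b_ji): intercept and slope learned by LEARN.
    # y[i]: current value of system i. Returns the set H of faulty systems.
    faulty = set()
    for i in y:
        n_in = [j for (j, i2) in edges if i2 == i]  # in-neighborhood N_i^-
        if not n_in:
            continue  # no neighbors, no decision for system i
        sample = random.sample(n_in, k) if len(n_in) >= k else n_in
        est = [edges[(j, i)][1] * y[j] + edges[(j, i)][0] for j in sample]
        if abs(statistics.median(est) - y[i]) > s:
            faulty.add(i)
    return faulty

# Six systems with identical behavior (slope 1, intercept 0 on every edge);
# system 0 under-reports its power, e.g. because its panels are covered.
edges = {(j, i): (0.0, 1.0) for i in range(6) for j in range(6) if i != j}
y = {i: 100.0 for i in range(6)}
y[0] = 40.0
print(identify(edges, y, k=3, s=10.0))  # {0}
```

In this toy instance at most one of the three sampled estimates can come from the faulty system, so the median is always an estimate from a healthy neighbor; this is the median mechanism that the analysis in Sec. 3.2 makes precise.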
Figure 2: Four power curves of a sunny day in August, data set DK (power in Watt over time of day). Two PV systems have their maximum power peak before 13:00 and the other two after. They have different orientations, i.e. they produce more energy in the morning or in the evening.

Figure 3: A faulty system (power in Watt over time of day, together with the POA-irradiance curve). The real power curve of observed values shows a fault from roughly 11:00 to 14:30; the estimated values are considerably higher during this period. The PV system has a plane-of-array irradiance sensor installed, and a cross check with its power curve reveals that the fault was detected correctly.

We next describe our experimental setup to verify it by real data.

Experimental Setup

To demonstrate our method, we use two data sets DA and DK. DA arises from 13 systems of a solar park located in Arizona². The PV systems there are geographically close. We use data for one year. DK arises from 40 systems spread across a typical municipality located in Austria, i.e. the systems can be up to some kilometers apart. Their orientation can differ significantly: some systems may be oriented to the west, others to the east. We have data for almost a year.

² http://uapv.physics.arizona.edu/

A system is faulty if it produces considerably less energy than estimated; see Fig. 3. This definition is motivated by the fact that most faults imply a drop in energy. The difficulty in setting up an experiment is that we do not know in advance whether a PV system is faulty, i.e. we do not have labeled data. We thus design our experiment as follows: we verify the accuracy of the energy estimation, namely the relative deviation |m̂_i − y_i|/|m̂_i| for every system i and over the period of a week, with m̂_i and y_i as in algorithm IDENTIFY. This relative deviation is noted in column Hour of Table 1 for the time period 12:00 to 13:00. In column Day of Table 1 we note the same but for a whole day, i.e. m̂_i is the estimated energy for the whole day, calculated from the hourly estimates, and y_i the actual energy for the whole day. For the whole day we consider the time period from 9:00 to 16:00.

The number |m̂_i − y_i|/|m̂_i| can be read as a relative deviation, i.e. the estimation is 100·x% away from the true value, where x is some entry in the columns Hour and Day. The entry x is an average over all systems and 7 days. The first day is noted in column Start.

Algorithm LEARN is executed once for every week with θ = 0.8 and roughly three months of historical data, e.g. for the months January, February, and March to get estimates for the days April 1 to April 7. Algorithm IDENTIFY is executed with s = 0.25·|m̂_i| and k = 11 for both data sets. The choices of the parameters θ and s depend on the variance of the input data and were made manually, so as to get a reasonable number of good estimates; similarly for k. The difficulty in choosing the parameters is that decreasing θ will usually reduce the number of neighbors. For a reasonable number of good estimates we need both a strong linear relation of a system to its neighbors and enough neighbors. The parameters were chosen accordingly. For parameter k, we derive a theoretical result in Sec. 3.2 which says that k = O(log(n)) is a good choice, for n the number of systems.

Experimental Results

The false positive rate (FPR), the false negative rate (FNR), and the estimation accuracy are the most interesting numbers for us. As remarked above, we do not have labeled data; the faults recorded in Table 1 are faults as detected by our algorithm.

We make a worst-case assumption, namely that all detected faults are false positives. This yields an FPR of at most 0% to 5% per 7-day period (rows in the table). To get an understanding of the FNR, we simulated faults by subtracting 33% of the energy. The FNR in this case is at most 10% per 7-day period. In the rows Sum and Sum−33% in Table 1 we summed up the faults to get the FPR and FNR for the whole
(a) Results for DA.

    Start     Hour   Faults   Day    Faults
    April 1   0.058  2/77     0.037  1/77
    May 1     0.040  2/77     0.014  0/77
    June 1    0.019  0/91     0.021  0/91
    July 1    0.068  7/88     0.071  7/88
    Aug. 1    0.362  12/91    0.250  7/91
    Sept. 1   0.096  4/65     0.034  1/78
    Oct. 1    0.019  0/84     0.016  0/84
    Nov. 1    0.039  0/72     0.025  0/72
    Dec. 1    0.135  10/84    0.130  7/84
    Sum              37/729          23/742
    Sum−33%          673/729         682/742

(b) Results for DK.

    Start     Hour   Faults   Day    Faults
    June 1    0.056  7/269    0.068  11/273
    June 15   0.055  7/238    0.097  17/258
    July 1    0.077  7/267    0.068  9/280
    July 15   0.025  0/267    0.044  6/280
    Aug. 1    0.037  2/279    0.030  3/280
    Aug. 15   0.031  1/280    0.032  0/280
    Sept. 1   0.040  0/280    0.033  0/280
    Sept. 15  0.092  20/280   0.056  0/280
    Sum              42/2160         46/2211
    Sum−33%          1960/2154       2033/2207

Table 1: The values in the columns Hour and Day contain the relative deviation |m̂_i − y_i|/|m̂_i|, with m̂_i and y_i as in algorithm IDENTIFY. They are averages over all systems and the period of a week. Column Start contains the start date of the 7-day period. The two columns labeled Faults contain the number of (possibly falsely detected) faults relative to the total number of analyzed hours and days, respectively. The rows Sum contain the summed-up number of faults, once for the actual data sets and once with a simulated fault of −33% less energy.

data sets.

The interpretation of these results is as follows. Setting the parameter s to 0.25·|m̂_i| means that we define a fault as a 25% relative deviation of the observed produced energy from its true value. Setting s to this value yields the above-mentioned FPR. Simulating a 33% drop in energy, which corresponds naturally to the faults we want to detect, yields the above FNR.

For the data set DA we have knowledge of the POA irradiance. We can thus cross-check with the irradiance whether faulty systems were identified correctly; see Fig. 3. This manual inspection suggests that the FPR is much smaller than 5%, close to 1%. Furthermore, increasing the drop implies a decreasing FNR, i.e. stronger energy drops are easier to identify.

Depending on the application, these rates may be considered appropriate or not. In some applications, we may want to detect faults which yield a drop in energy of less than 25%; this worsens the FPR and FNR. On the other side, if we want to improve the FPR and FNR, we may have to specify a fault as a drop in energy of 50%. In other words, our parameter setting is one out of many reasonable parameter settings.

3.2 Theoretical

We argued in Sec. 3.1 that algorithm LEARN yields good estimates for the systems' current values. For an estimate to be good, the neighboring system j in G of system i needs to work correctly. Moreover, the regression estimates, the intercept and slope, need to be accurate enough. In this section, we provide a supporting theoretical result which says that, if enough estimates are good, algorithm IDENTIFY correctly identifies all faulty systems.

The input to IDENTIFY is a digraph G = (V, E) with edge labels. Let y_i be the current value of system i and let y_i = ỹ_i + ∆_i. We think of ỹ_i as the true value. We say that system i is correct if ∆_i = 0 and faulty otherwise.

The input to IDENTIFY has to satisfy two conditions, Eq. 2 and 3, for IDENTIFY to work correctly. These conditions state that there are more good than bad estimates. We formulate them below.

Theorem 3. Let 0 < p < 1 and s > 0. Let H := {i ∈ {1, ..., n} : |∆_i| > 2s}. Assume that the input digraph G satisfies Eq. 2 and 3. Then algorithm IDENTIFY outputs H with probability at least 1 − p.

Let ŷ_j be the estimates as computed in IDENTIFY. Fix a system i and let j ∈ N_i^-. We say that ŷ_j is s-good for system i if |ỹ_i − ŷ_j| ≤ s. Let A_i := {j ∈ N_i^- : |ỹ_i − ŷ_j| ≤ s} be the s-good estimates for system i. Condition (2) is as follows: for every system i with 1 ≤ |N_i^-| ≤ k−1 it holds that

    |A_i| > |N_i^-| / 2,    (2)

i.e. there are more good than bad estimates. For the case that |N_i^-| ≥ k we assume

    |A_i| > (1 − 1/c_{n,p,k}) · |N_i^-|,    (3)

with c_{n,p,k} := ((n/p) · 18^k)^{2/(k−1)}. Setting k = Ω(log(n/p)) makes c_{n,p,k} larger than some constant independent of n and p. This is the most reasonable setting, as it implies that a constant fraction of the estimates can be bad and IDENTIFY still identifies the faulty systems correctly. We remark that the asymptotic analysis which yields c_{n,p,k} is not optimal. In particular, it seems that the factor 18^k is not optimal and may be improved to a factor as small as 2^{k/2}. For practical applications, the following heuristic seems reasonable: for n systems and a failure probability p of IDENTIFY, set k to 10·log(n/p).

3.3 Proof of Theorem 3

We apply the following lemma with A = A_i and M = N_i^-. It directly gives us the probability that IDENTIFY correctly identifies the faulty systems, since the median works correctly if |S ∩ A_i| > |S ∩ (N_i^- \ A_i)|, where S is the (random) set chosen in IDENTIFY.

Lemma 1. Let M be a finite set and A ⊆ M. Let k ≥ 2 be an integer. Let S ⊆ M be a k-element subset selected uniformly at random. Then

    Pr_S( |S ∩ A| > |S ∩ (M \ A)| ) ≥ 1 − 18^k · ( |M \ A| / |M| )^{⌊k/2⌋}.
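Before turning to the proofs, the constant c_{n,p,k} and the heuristic choice of k are easy to evaluate numerically. The sketch below is illustrative only: the values of n and p are assumptions, and the natural logarithm is assumed for the heuristic k = 10·log(n/p).

```python
import math

def c_npk(n, p, k):
    # c_{n,p,k} = ((n/p) * 18**k)**(2/(k-1)) from Condition (3); evaluated in
    # log-space so that large k cannot overflow.
    return math.exp((2.0 / (k - 1)) * (math.log(n / p) + k * math.log(18.0)))

n, p = 1000, 0.05                    # number of systems / failure probability
k = math.ceil(10 * math.log(n / p))  # heuristic k = 10 * log(n/p)
c = c_npk(n, p, k)
# Condition (3) then tolerates a fraction of up to 1/c bad estimates per
# system; as k grows, c approaches 18**2 = 324 from above, reflecting the
# pessimistic 18^k factor discussed in the text.
print(k, round(c, 1), round(1.0 / c, 5))
```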
Proof. Let M := {1, ..., m}, F := M \ A, and r := |F|. First, we bound the number of k-element subsets S ⊆ M for which |S ∩ A| ≤ k′ with k′ := ⌊k/2⌋. Writing C(·, ·) for the binomial coefficient, the exact number of these sets is

    Σ_{i=0}^{k′} C(m−r, i) · C(r, k−i),    (4)

since there are C(|A|, i) ways to choose an i-element subset from A and C(|F|, k−i) ways to choose the remaining k−i elements from F.

Note that |S ∩ A| > |S ∩ F| iff |S ∩ A| > ⌊|S|/2⌋ = k′. Moreover, we can assume that r = |F| ≥ 1, since the claim holds for r = 0. To provide a lower bound for the probability of this event, we show an upper bound for the complementary event, i.e. |S ∩ A| ≤ k′. First, we derive an upper bound for Eq. 4 using

    (m/k)^k ≤ C(m, k) ≤ (m·e/k)^k    (5)

for e = 2.718... and 1 ≤ k ≤ m (see e.g. pg. 12 in [11]). Since this inequality holds only for k ≥ 1, we rewrite Eq. 4 as

    C(r, k) + Σ_{i=1}^{k′} C(m−r, i) · C(r, k−i).    (6)

It holds that C(r, k) ≤ (r·e/k)^k, and for the second term in Eq. 6

    Σ_{i=1}^{k′} C(m−r, i)·C(r, k−i) ≤ Σ_{i=1}^{k′} ((m−r)·e/i)^i · (r·e/(k−i))^{k−i}
                                     = (r·e)^k · Σ_{i=1}^{k′} ((m−r)/r)^i · ((k−i)/i)^i · (1/(k−i))^k.

Next, we prove the upper bound on the probability p that |S ∩ A| ≤ k′. We select uniformly at random a k-element subset of M; each has probability C(m, k)^{-1}. We multiply Eq. 6 by C(m, k)^{-1} and get two parts p1 + p2 ≥ p. For the first part, p1 ≤ (r·e/m)^k since C(m, k)^{-1} ≤ (k/m)^k. For the second part p2, we use (m−r)/r ≤ m/r, ((k−i)/i)^i ≤ 2^k, and (k/(k−i))^k ≤ 2^k, the latter since i ≤ k′. We get an upper bound for the second part:

    p2 ≤ (r·e/m)^k · Σ_{i=1}^{k′} ((m−r)/r)^i · ((k−i)/i)^i · (k/(k−i))^k ≤ (12r/m)^k · Σ_{i=1}^{k′} (m/r)^i.

An upper bound for the geometric sum is k′·(m/r)^{k′}. In total,

    p ≤ p1 + p2 ≤ (r·e/m)^k + k′·(12r/m)^k·(m/r)^{k′}.

Substituting (k−1)/2 for k′ and further simplification yields

    p ≤ ((k+2)/2)·12^k·(r/m)^{(k−1)/2} ≤ 18^k·(r/m)^{(k−1)/2}.

The latter holds since ((k+2)/2)^{1/k} ≤ 1.5 for k ≥ 3. We thus have a lower bound of 1 − p for the probability, and the claim follows.

Proof of Theorem 3. We show that the success probability of IDENTIFY is at least 1 − p. Let p′ := p/n. We show that for every i ∈ V, G = (V, E), the success probability of a single iteration in the loop of IDENTIFY is at least 1 − p′. This implies the above claim since (1 − p′)^n ≥ 1 − p′·n by e.g. the Binomial Theorem.

Fix some i ∈ V, i.e. we consider one iteration in the loop of IDENTIFY. We apply Lemma 1. Let us assume that |S ∩ A_i| > |S ∩ (N_i^- \ A_i)|, with S the random k-element subset as in IDENTIFY and A_i the good estimates as defined above. Since |S ∩ A_i| > |S ∩ (N_i^- \ A_i)|, it follows that |m̂_i − ỹ_i| ≤ s for the median m̂_i as computed in IDENTIFY, and y_i = ỹ_i + ∆_i.

Assume ∆_i = 0, i.e. system i works correctly. Then |m̂_i − y_i| = |m̂_i − ỹ_i| ≤ s. Thus, i is not output.

Assume ∆_i ≠ 0, i.e. system i is faulty. Here, |m̂_i − y_i| = |m̂_i − ỹ_i − ∆_i|. It follows from |m̂_i − ỹ_i| ≤ s and |∆_i| > 2s that |m̂_i − y_i| > s. Thus, i is output.

Finally, we want the probability of failure for a single step to be at most p/n. By Lemma 1, 18^k·α^{⌊k/2⌋} ≤ 18^k·α^{(k−1)/2} ≤ p/n with α := |N_i^- \ A_i| / |N_i^-|. With c = c_{n,p,k} := ((n/p)·18^k)^{2/(k−1)}, c·|N_i^- \ A_i| ≤ |N_i^-| and thus (1 − 1/c)·|N_i^-| ≤ |A_i|.

4 Conclusions and Open Problems

We presented a method for learning structure to identify faulty systems. The basic method of correlation networks has found many applications in biology and finance. In our application, the presence of many faults required the design and analysis of robust algorithms. We provided an experimental analysis of our algorithms to verify their estimation and fault identification quality. We also provided a supporting theoretical result which allowed us to considerably improve the running time of algorithm IDENTIFY.

Improving the running time of LEARN remains an open problem. It is not directly clear that it is necessary to compare every two systems: if the system pairs (i, j) and (j, k) correlate strongly, then (i, k) also correlate, but not necessarily strongly. Thus, it may not be necessary to solve a simple linear regression problem for every system pair.

In other applications it may be useful to solve a general linear regression problem instead of a simple linear regression, e.g. if our model depends on more than one variable per system. The corresponding correlation networks are based on the partial correlation coefficient [12]. Since robust estimators for general linear regression are based on regression problems which are NP-hard, it remains an open problem to find a robust alternative to partial correlation networks that can be computed efficiently.

Finally, to put our method and results into a broader context, we approached the problem of FDI via learning graphical models. It seems to be a challenge to learn classical component models of technical systems to conduct diagnosis. In this work we were able to close the gap between (structure) learning on the one side and FDI on the other side for a concrete problem setting.

References

[1] Manuel Blum, Robert W. Floyd, Vaughan Pratt, Ronald L. Rivest, and Robert E. Tarjan. Linear time
bounds for median computations. In Proc. of the 4th Annual ACM Symposium on Theory of Computing, pages 119–124, 1972.

[2] H. Braun, S. T. Buddha, V. Krishnan, A. Spanias, C. Tepedelenlioglu, T. Yeider, and T. Takehara. Signal processing for fault detection in photovoltaic arrays. In 37th IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1681–1684, 2012.

[3] K. H. Chao, S. H. Ho, and M. H. Wang. Modeling and fault diagnosis of a photovoltaic system. Electric Power Systems Research, 78(1):97–105, 2008.

[4] Jinran Chen, Shubha Kher, and Arun Somani. Distributed fault detection of wireless sensor networks. In Proc. of the 2006 Workshop on Dependability Issues in Wireless Ad Hoc Networks and Sensor Networks, pages 65–72, 2006.

[5] A. Chouder and S. Silvestre. Fault detection and automatic supervision methodology for PV systems. Energy Conversion and Management, 51:1929–1937, 2010.

[6] R. Cole, J. S. Salowe, W. L. Steiger, and E. Szemeredi. An optimal-time algorithm for slope selection. SIAM Journal on Computing, 18(4):792–810, 1989.

[7] M. Ding, Dechang Chen, Kai Xing, and Xiuzhen Cheng. Localized fault-tolerant event boundary detection in sensor networks. In Proc. of the 24th Annual Joint Conference of the IEEE Computer and Communications Societies, volume 2, pages 902–913, 2005.

[8] S. K. Firth, K. J. Lomas, and S. J. Rees. A simple model of PV system performance and its use in fault detection. Solar Energy, 84:624–635, 2010.

[9] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2008.

[10] Steve Horvath. Weighted Network Analysis: Applications in Genomics and Systems Biology. Springer Science & Business Media, 2011.

[11] S. Jukna. Extremal Combinatorics: With Applications in Computer Science. Springer, 2nd edition, 2011.

[12] Dror Y. Kenett, Michele Tumminello, Asaf Madi, Gitit Gur-Gershgoren, Rosario N. Mantegna, and Eshel Ben-Jacob. Dominating clasp of the financial sector revealed by partial correlation analysis of the stock market. PLoS ONE, 5(12):e15032, 2010.

[13] Donald E. Knuth. The Art of Computer Programming: Seminumerical Algorithms, volume 2. Addison-Wesley Longman Publishing Co., Inc., 3rd edition, 1997.

[14] B. Marion. Comparison of predictive models for PV module performance. In 33rd IEEE Photovoltaic Specialist Conference, pages 1–6, 2008.

[15] J. Matousek, D. M. Mount, and N. S. Netanyahu. Efficient randomized algorithms for the repeated median line estimator. Algorithmica, 20(2):136–150, 1998.

[16] Peter J. Rousseeuw and Annick M. Leroy. Robust Regression and Outlier Detection, volume 589. John Wiley & Sons, 2005.

[17] P. Traxler. Fault detection of large amounts of photovoltaic systems. In Online Proc. of the ECML PKDD 2013 Workshop on Data Analytics for Renewable Energy Integration (DARE'13), 2013.

[18] Bin Zhang and Steve Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(17), 2005.

[19] Chongming Zhang, Jiuchun Ren, Chuanshan Gao, Zhonglin Yan, and Li Li. Sensor fault detection in wireless sensor networks. In Proc. of the IET International Communication Conference on Wireless Mobile and Computing, pages 66–69, 2009.

[20] Yang Zhang, N. Meratnia, and P. Havinga. Outlier detection techniques for wireless sensor networks: a survey. IEEE Communications Surveys and Tutorials, 12(2):159–170, 2010.
18
Proceedings of the 26th International Workshop on Principles of Diagnosis
Applied multi-layer clustering to the diagnosis of complex agro-systems

Elisa Roux¹, Louise Travé-Massuyès¹ and Marie-Véronique Le Lann¹,²
¹ CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
emails: lisa.roux@laas.fr, louise@laas.fr, mvlelann@laas.fr
² Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
Abstract

In many fields, such as the medical and environmental ones, a lot of data are produced every day. In many cases, the task of machine learning is to analyze these data, composed of very heterogeneous types of features. We developed in previous work a classification method based on fuzzy logic, capable of processing three types of features (data): qualitative, quantitative and, more recently, intervals. We propose to add a new one: the object type, which is a meaningful combination of other features yielding the possibility of developing hierarchical classifications. This is illustrated by a real-life case study taken from the agriculture area¹.

¹ This work was supported by the FUI/FEDER project MAISEO involving the companies VIVADOUR, CACG, GEOSYS, METEO FRANCE, PIONEER and the laboratories CESBIO and LAAS-CNRS.

1 Introduction

Nowadays, large scale datasets are produced in various fields such as social networks, medicine, process operation, agriculture/environment... Many studies relate to data mining with the intention of analyzing and, if possible, extracting knowledge from these data. The data classification has to provide a relevant and well-fitted representation of reality. In this context, the issue of representing data is crucial, since the formalisms must be generic yet well suited to every new problem. For machine learning, the concern is to be able to detect adequate patterns from heterogeneous, large, and sometimes uncertain datasets. In diagnosis, the necessity to quickly recognize a problem in order to provide a sure solution to solve it is essential. One of the main challenges is the necessity to process heterogeneous data (qualitative, quantitative...) and sometimes to merge data obtained in different contexts.

We developed a classification method based on fuzzy logic [1] capable of processing heterogeneous data types and noisy data. The LAMDA (Learning Algorithm for Multivariate Data Analysis) method is a classification method capable of processing three types of data: qualitative, quantitative, and intervals [2]. We addressed one of the main difficulties encountered in data analysis tasks: the diversity of information types. Such information types are given by qualitative valued data, which can be nominal or ordinal, mixed with quantitative and interval data. Many situations leading to well-conditioned algorithms for quantitative valued information become very complex whenever some of the data are given in qualitative form. In a non-exhaustive list, we can mention rule based deduction, classification, clustering, dimensionality reduction...

During the last decades, few research works have been directed at the issue of representing multiplicity for data analysis purposes [3, 11]. However, no standard principle has been proposed in the literature to handle heterogeneous data in a unified way. Indeed, a lot of the proposed techniques process quantitative and qualitative data separately. In data reduction tasks, for example, they are either based on distance measures for the former type [12] or on information or consistency measures for the latter one, whereas in classification and clustering tasks often only a Hamming distance is used to handle qualitative data [4, 11, 14]. Other approaches are originally designed to process only quantitative data, and therefore arbitrary transformations of qualitative data into a quantitative space are proposed without taking their nature in the original space into account [12, 15, 16]. For example, the variable shape can take values in a discrete unordered set {round, square, triangle}. These values are transformed respectively to the quantitative values 1, 2, and 3. However, we can also choose to transform them to 3, 2 and 1. Another, inverse, practice is to enhance the qualitative aspect and discretize the quantitative value domain into several intervals; objects in the same interval are then labeled by the same qualitative value [17, 18]. Obviously, both approaches introduce distortion and end up with information loss with respect to the original data. Moreover, none of the previously proposed approaches combines in a fully adequate way the processing of symbolic intervals simultaneously with quantitative and qualitative data. Although extensive studies were performed to process this type of data in the Symbolic Data Analysis framework [19], they generally focused on clustering tasks [8, 10] and no unified principle was given to handle the three types of data simultaneously for different analysis purposes. In [2], a new general principle was introduced as "Simultaneous Mapping for Single Processing (SMSP)", enabling reasoning in a unified way about heterogeneous data for several data analysis purposes. The fact that SMSP together with
LAMDA can process these three types of data simultaneously, without pre-processing, is one of its principal advantages compared to other classical machine learning methods such as SVM (Support Vector Machines [20]) or K-NN [21]. Decision trees are very powerful tools for classification and diagnosis [22], but their sequential approach is not advisable for processing multidimensional data since, by their very nature, such data cannot be processed as efficiently as totally independent information [23]. A complete description of the LAMDA method and a comparison with other classification techniques on various well known data sets can be found in [24, 25, 26]. Its other main characteristic is the fuzzy formalism, which enables an element to belong to several classes simultaneously. It is also possible to perform clustering (i.e. with no a priori knowledge of the number of classes and the class prototypes).

Besides the three existing types, we propose to add another type: the class type, which can be processed simultaneously with the three former ones (quantitative, qualitative, intervals) thanks to SMSP. In this configuration the class feature represents a meaningful aggregation of other features. This aggregation can be defined by a class determined by a previous classification, or by the result of an abstraction. This new type gives the possibility to develop hierarchical classifications or to fuse different classifications. It allows an easier representation of many various and complex types of data, like multi-dimensional data, while being realistic and conserving their constraints. In a first part, the LAMDA method is briefly explained. The second part is devoted to the new type of data introduced: the object type. Finally, this new method is exemplified through an agronomical project.

2 The LAMDA method

The LAMDA method is an example of fuzzy logic based classification methods [9]. It takes as input a sample x made up of N features. The first step is to compute, for each feature of x, an adequacy degree to each class Ck, k = 1..K, where K is the total number of classes. This is obtained by the use of a fuzzy adequacy function providing K Marginal Adequacy Degree (MAD) vectors. This degree estimates the closeness of every single sample feature to the corresponding component of the class prototype. At this point, all the features are in a common space. The second step is then to aggregate all these marginal adequacy degrees into one Global Adequacy Degree (GAD) by means of a fuzzy aggregation function. Thus the K MAD vectors become K GADs. Fuzzy logic [1] is used here to express MADs and GADs, since the membership degree of a sample to a given class is not binary but takes a value in [0, 1]. Classes can be known a priori, commonly determined by an expert, in which case the learning process is supervised, or classes can be created during the learning itself (unsupervised mode or clustering). Three types of features can be processed by the LAMDA method for the MAD calculation: quantitative, qualitative and intervals [2]. The membership functions µ(x) used by LAMDA are based on the generalization of a probabilistic rule defined on {0, 1} to the [0, 1]-space.

2.1 Calculation of MAD for quantitative features

The quantitative type allows the representation of numerical values, assuming that the including space is known as a defined interval. For this type of descriptor, membership functions can be used, such as the Gaussian membership function, so that the membership function of the ith feature x_i of a sample to the kth class is:

µ_k^i(x_i) = exp( −(x_i − ρ_k^i)² / (2σ_i²) )   (1)

or the binomial membership function:

µ_k^i(x_i) = (ρ_k^i)^(x_i) · (1 − ρ_k^i)^(1 − x_i)   (2)

where ρ_k^i ∈ [0, 1] is the mean of the ith feature based on the samples belonging to the class Ck, x_i ∈ [0, 1] is the normalized ith feature, and σ_i is the standard deviation of the ith feature value based on the samples belonging to the class Ck.

2.2 Calculation of MAD for qualitative features

In the case of a qualitative feature, the possible values of the ith feature form a set of modalities D_i = {Q_1^i, Q_2^i, ..., Q_m^i}, with m the total number of modalities. The qualitative type permits expressing by words the different modalities of a criterion. The frequency of a modality Q_l^i of the ith feature for the class Ck is the quantity of samples belonging to Ck whose modality for their ith feature is Q_l^i [1]. So each modality Q_l^i ∈ D_i has an associated frequency. Let θ_kl^i be the frequency of the modality Q_l^i for the class Ck. The membership function concerning the ith feature is:

µ_k^i(x_i) = (θ_k1^i)^(q_1^i) · (θ_k2^i)^(q_2^i) · … · (θ_km^i)^(q_m^i)   (3)

where q_l^i = 1 if x_i = Q_l^i and q_l^i = 0 otherwise, for l = 1, ..., m.

2.3 Calculation of MAD for interval features

Finally, to take into account the potential uncertainties or noise in the data, we can use the interval representation [2]. The membership function for interval type descriptors is regarded as the similarity S(x_i, ρ_k^i) between the symbolic interval value x_i of the ith feature and the interval ρ_k^i = [ρ_k^i−, ρ_k^i+] which represents the value of the ith feature for the class Ck, so that:

µ_k^i(x_i) = S(x_i, ρ_k^i)   (4)
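To make the marginal adequacy functions above concrete, the following sketch evaluates Eqs. (1)–(3) on toy values; the feature values, class parameters and function names are illustrative only, not taken from the LAMDA implementation:

```python
import math

def mad_gaussian(x, rho, sigma):
    """Gaussian membership, Eq. (1): closeness of the normalized
    feature value x to the class mean rho with spread sigma."""
    return math.exp(-((x - rho) ** 2) / (2 * sigma ** 2))

def mad_binomial(x, rho):
    """Binomial membership, Eq. (2): rho**x * (1 - rho)**(1 - x)
    for a normalized feature value x in [0, 1]."""
    return rho ** x * (1 - rho) ** (1 - x)

def mad_qualitative(x, theta):
    """Qualitative membership, Eq. (3): since exactly one exponent
    q_l equals 1, the product reduces to the frequency theta of the
    observed modality in the class."""
    return theta[x]

# Toy class prototype (illustrative values): mean 0.7 and spread 0.2 for
# a normalized quantitative feature; modality frequencies for 'shape'.
print(round(mad_gaussian(0.6, 0.7, 0.2), 3))
print(round(mad_binomial(0.6, 0.7), 3))
print(round(mad_qualitative("round", {"round": 0.5, "square": 0.3, "triangle": 0.2}), 3))
```

Each MAD is highest when the sample value coincides with the class prototype (e.g. mad_gaussian(0.7, 0.7, 0.2) = 1), which is exactly the "closeness to the prototype" role described in Section 2.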
Let ϖ[X] be defined as the scalar cardinal of a fuzzy set X in a discrete universe V: ϖ[X] = Σ_{x_i ∈ V} µ_X(x_i). In the case of a crisp interval, it becomes ϖ[X] = upperBound(X) − lowerBound(X). Given two intervals A = [a−, a+] and B = [b−, b+], the distance is defined as:

δ[A, B] = max( 0, max{a−, b−} − min{a+, b+} )   (5)

and the similarity measure between two crisp intervals is defined as:

S(I_1, I_2) = (1/2) · ( ϖ[I_1 ∩ I_2] / ϖ[I_1 ∪ I_2] + 1 − δ[I_1, I_2] / ϖ[V] )   (6)

The similarity combines Jaccard's similarity measure, which computes the similarity when the intervals overlap, and a second term which takes into account the case where the intervals are disjoint.

2.4 Calculation of feature weights

It is possible to determine the relevance of a feature to optimize the separation between classes. The MEMBAS method [8, 9] is a feature weighting method based on a membership margin. A distinguishing property of this method is its capability to process problems characterized by mixed-type data (quantitative, qualitative and interval). It relies on the maximization of the margin between the two closest classes for each sample. It can be expressed as:

Max_{w_f} Σ_{j=1..J} β_j(w_f) = (1/N) Σ_{j=1..J} [ Σ_{i=1..N} w_fi · µ_c^i(x_i^(j)) − Σ_{i=1..N} w_fi · µ_c̃^i(x_i^(j)) ]   (7)

subject to the constraints ||w_f||₂² = 1 and w_f ≥ 0. The first constraint is the normalized bound for the modulus of w_f, so that the maximization does not end up with infinite values, whereas the second guarantees the nonnegativity of the obtained weight vector. Then (7) can be simplified as:

Max_{w_f} (w_f)^T s   (8)

subject to ||w_f||₂² = 1, w_f ≥ 0, where s = (1/N) Σ_{j=1..J} { U_jc − U_jc̃ } and U_jc = [ µ_c^1(x_1^(j)), …, µ_c^N(x_N^(j)) ] is the membership vector of class c (c corresponds to the "right" class for sample x^(j), and c̃ is the closest class evaluated at the given value x_i^(j) of the ith feature of pattern x^(j)). s is computed with respect to all samples contained in the data base excluding x^(j) ("leave-one-out margin").

This optimization problem has an analytical solution determined by the classical Lagrangian method. Details of the method can be found in [9].

3 The new object type

In order to allow the combination of various data types into one single global object, and therefore to support multi-dimensional features, we develop a novel data type. Each feature of an object descriptor can be described by a measured value and an extrinsic object-related weight. A sample GAD calculus formula is then the weighted mean of all MADs:

GAD_k^j = Σ_i MAD_k^(ji) · w̃_fi,  for j = 1…J   (9)

where MAD_k^(ji) is the MAD of the jth sample for the ith feature to class k, w̃_fi ∈ [0, 1] is the normalized value of the weight w_fi of the ith feature determined by the MEMBAS method, and J is the total number of samples which have been classified.

Figure 1: LAMDA architecture

The main advantage of using this new object-oriented data type is to capture the distinct features of a same object as a whole. An object of layer i−1 is regarded as one single feature for the layer i and can then be processed as all other descriptors. The weights of the descriptors composing the objects are determined using MEMBAS once the clustering is finished for the layer i−1. An object is regarded as being a combination of features, each of which is associated to its weight. In other words, an object regarded as a single entity in reality can be processed as a complex unit. For instance, the weather can be considered as a global concept but also as detailed data (rain, temperature, etc.). All of its features are parts of a same object and are strongly connected together. That realistic consideration implies several distinct clustering layers. The layer i concerns the classification of a sample set called A and the layer i−1 involves some of their constituent units. Obviously, a second layer of classification is consistent only if at least one of the sample features is a complex entity. Therefore, for each sample of the set, an object feature becomes itself a whole sample in the layer i−1 and is compared to the others
to constitute a new sample set called B. Then a classification of the B samples is processed. Once the classification of the B samples has been done, its results are used to compute the classification of A. If the samples of the A set have C complex features, the second classification level implies C distinct sample sets B1, B2, …, BC and thus C distinct classifications. The MEMBAS algorithm [8, 9] can then calculate the weights of every feature for the class definitions. It is applied on the B samples so that the features involved become the weighted components of a meaningful object. A complex feature of an A sample is then a balanced combination of attributes.

Figure 2: Principle of hierarchical classification

As explained in Figure 2, the sample Sample1 is described by X features, including the object-type feature Desc1,1. Desc1,1 is described by Desc1,α, Desc1,β, etc. To get their respective importances Wα, Wβ, etc. in the description of Desc1,1, a previous classification is performed regarding Desc1,1 as a sample (Sample1,p), so that each weight can be calculated using the MEMBAS algorithm [8, 9]. Once the respective weights of each feature are known, objects are automatically instantiated to be involved in the main classification. Desc1,1 is then described in line with the obtained weights Wα, Wβ and the known values V1,α, V1,β.

2.5 Evaluation of a classification quality

The comparison of two classifications can be performed by measuring their respective compactness and separation: the more compact and separated the classes are, the easier the recognition process will be. A method to measure the quality of a partition has been proposed by [10]. This index measures the partition quality in terms of class compactness and separation. This partition index is the Clusters Validity Index (CV, Eq. (10)), which depends only on the GADs (membership degrees of an individual to a class) and not explicitly on the data values:

CV = (Dis / N) · D*_min · K   (10)

where N is the total number of individuals in the data base and K the total number of classes. Dis represents the dispersion, given by:

Dis = Σ_{k=1..K} [ 1 − ( Σ_{j=1..J} δ_k^j · exp(δ_k^j) ) / ( N · GADM_k · exp(GADM_k) ) ]   (11)

with:

δ_k^j = GADM_k − GAD_k^j, ∀j ∈ [1, J]   (12)

and

GADM_k = max_j GAD_k^j   (13)

D*_min is the minimum distance between two classes. This distance is computed by using the distance d*(A, B) between two fuzzy sets A and B [8], defined by:

d*(A, B) = 1 − M[A ∩ B] / M[A ∪ B] = 1 − ( Σ_{j=1..J} min(GAD_A^j, GAD_B^j) ) / ( Σ_{j=1..J} max(GAD_A^j, GAD_B^j) )   (14)

The highest value of CV corresponds to the best partition.

4 Application to an agronomical project

The agronomical project aims at developing a diagnosis system for an optimized water management system and an efficient distinctive guidance for corn farmers, in order to decrease the use of phytosanitary products and the water consumption for irrigation. The project involves two aspects. The first one aims at complementing the benefits of adopting and implementing the cultural profile techniques [28, 29]. In this context, we perform a classification of plots based on various agronomic and SAFRAN meteorological data [30], so that each plot should mostly belong to one particular class whose features are known. Thanks to the information stemming from the classification results, advice can be offered to the corn farmers concerning the corn variety they should sow and the schedule they should follow for an optimized yield. This study includes two steps, which are described in Figure 3. The first one concerns the clustering of a training set of 50 plots, using the unsupervised LAMDA classification.
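The partition-quality index of Section 2.5 can be made concrete with a small sketch that computes Eqs. (10)–(14) from a GAD matrix; the helper names and toy values are ours, not part of the LAMDA toolbox. A crisper, better separated partition should obtain the higher CV:

```python
import math

def dispersion(gad):
    """Dis of Eq. (11): per-class spread of the GADs around the class
    maximum GADM_k (Eqs. (12)-(13)), summed over the K classes."""
    n = len(gad)                      # individuals (here N = J)
    dis = 0.0
    for k in range(len(gad[0])):      # classes
        gadm = max(row[k] for row in gad)                                # Eq. (13)
        num = sum((gadm - row[k]) * math.exp(gadm - row[k]) for row in gad)  # deltas, Eq. (12)
        dis += 1 - num / (n * gadm * math.exp(gadm))
    return dis

def fuzzy_distance(gad, a, b):
    """d*(A, B) of Eq. (14) between two classes, from their GAD columns."""
    inter = sum(min(row[a], row[b]) for row in gad)
    union = sum(max(row[a], row[b]) for row in gad)
    return 1 - inter / union

def cluster_validity(gad):
    """CV of Eq. (10): (Dis / N) * D*_min * K."""
    n, k = len(gad), len(gad[0])
    dmin = min(fuzzy_distance(gad, a, b) for a in range(k) for b in range(a + 1, k))
    return dispersion(gad) / n * dmin * k

# Toy GAD matrices: 4 individuals x 2 classes. The crisp partition
# (rows close to 0/1) should score higher than the fuzzy one.
crisp = [[0.9, 0.1], [0.85, 0.2], [0.1, 0.9], [0.15, 0.8]]
fuzzy = [[0.6, 0.4], [0.55, 0.5], [0.4, 0.6], [0.5, 0.45]]
print(cluster_validity(crisp) > cluster_validity(fuzzy))
```

In this toy comparison the separation term d* dominates: the crisp partition's classes overlap little, so D*_min, and hence CV, is much larger.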
Figure 3: Learning System functioning

The data used for this classification are six distinctive agronomical descriptors, describing the plots' features that are highly involved in their capacity for yield and water retention, and twenty-one weather features, defining the meteorological class in which the plot is situated. The second part of the project will be repeated annually to update and improve the clustering performed previously, by adding new information returned by the farmers after harvest. In the following, only the first part is presented.

Firstly, a previous meteorological clustering (B) is required to realize a realistic plot classification, since the yield of seedlings is highly related to the meteorological conditions. The weather is then regarded as a complex entity, so that it is only one of a plot's features. It is based on the historical meteorological data of the geographical position corresponding to the studied plot. Those descriptors refer to the temperature, the quantity of rainfall, and the evapotranspiration which occurred during three crucial periods of the year. Each feature is described in several distinctive ways. For instance, the temperature over one period is evaluated according to three types of information. This meteorological clustering is an unsupervised classification based on weather data covering every single day of the determined periods during the last fifty years, for all the geolocalized points belonging to the area studied in this project (South-West of France). In the event that the plot is part of the training set (studied area), the weather type of its area is known and the plot classification can be done directly. Otherwise, the weather type is obtained thanks to a supervised classification mode (B') delivering the most appropriate context. In any case, the weather type is an object-feature. This hierarchical treatment permits regarding each meteorological type as a whole and lets the weather contexts follow their natural evolution independently of agronomical variations. Moreover, considering the meteorological features as a single global object permits taking into account the environmental constraints and getting a realistic model. As we can observe in Figure 4, the meteorological clustering (B) has permitted dividing the area into three sub-areas. The clustering (B) and the meteorological supervised classification (B') have first been performed with every sample of the set, and the distribution of the weights between the meteorological features has been determined.

The result of this classification is consistent, so we can use the obtained classes and the weights of the meteorological features (obtained with MEMBAS) as object-features in classification (A). To analyze the benefit of using hierarchical classification, a clustering (A') has been performed by using the twenty-one meteorological features separately together with the agronomical features (twenty-seven features taken indistinctly). We can notice that the prototypes of the classes are highly dependent on the meteorological classes for clustering (A), while clustering (A') is mainly influenced by the ground type.

Figure 4: Meteorological sub-areas obtained with classification (B)

To enlighten this, we arbitrarily chose two very close classes containing similar plots in both clusterings. Each class prototype is described by the mean value of its marginal membership degrees (MAD). We represent in Figure 5 these prototype parameters, for the meteorological features only, for both cases (A with diamonds and A' with squares), with the marginal membership degree for class 1 on the abscissa and the marginal membership degree for class 2 on the ordinate. For a better quantification of the benefits brought by the use of the object representation, the CV is systematically calculated in order to determine the better partition quality. The results are very encouraging, since CV = 0.69 when the meteorological data are regarded as a whole object and 0.2 when they are treated separately. The object type representation thus multiplies this index, and therefore the compactness of the obtained partition, by more than 3.
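The two-layer aggregation behind this object representation, i.e. the weighted GAD of Eq. (9), can be sketched as follows; the feature names and MEMBAS weights are illustrative, not values from the study:

```python
def weighted_gad(mads, weights):
    """Eq. (9): adequacy degree as the weighted mean of marginal
    adequacy degrees, with the weights normalized to sum to 1."""
    total = sum(weights)
    return sum(m * w / total for m, w in zip(mads, weights))

# Layer i-1: the 'weather' object aggregates its meteorological
# sub-features with MEMBAS weights (illustrative values).
weather_mads = [0.8, 0.6, 0.9]        # rain, temperature, evapotranspiration
weather_weights = [0.5, 0.2, 0.3]
weather_mad = weighted_gad(weather_mads, weather_weights)

# Layer i: the plot sample mixes ordinary features with the weather
# object, which is now treated as a single descriptor.
plot_mads = [0.7, weather_mad, 0.4]   # soil descriptor, weather object, slope
plot_weights = [0.4, 0.4, 0.2]
print(round(weather_mad, 3), round(weighted_gad(plot_mads, plot_weights), 3))
```

The point of the design is visible in the two calls: the layer i−1 result is collapsed to one number before entering layer i, so the meteorological sub-features never compete individually with the agronomical descriptors.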
Figure 5: Meteorological prototypes for two close classes in cases (A) and (A')

The second aspect of our implication in the project deals with the water utilization of various clusters of farmers, with the aim of forecasting the needs of each cluster and adjusting the repartition. From this perspective, we realize an unsupervised classification of a training data-set of 2900 samples described by seven features: distance to the closest waterway, orientation, altitude… Orientation concerns cardinal points, and we assume that it is not expressible with different modalities since continuity cannot be represented by qualitative descriptors. It cannot be a number nor an interval either, because of the cyclic form to be kept. Thus we choose to regard a cluster orientation as an object composed of two descriptors that correspond to the coordinates of its cardinal point on a trigonometric circle. The orientation of each cluster can take eight different values: N, NE, E, SE, S, SW, W, and NW, which brings us to consider eight different combinations. In accordance with the trigonometric circle, these eight combinations are respectively: (0, 1), (√2/2, √2/2), (1, 0), (√2/2, −√2/2), (0, −1), (−√2/2, −√2/2), (−1, 0), (−√2/2, √2/2).

Once our results are validated by an expert, the classification is experimented twice: firstly treating each descriptor separately, and secondly involving the object type. As for the meteorological data in the first example, the CV is calculated in order to determine the better partition quality. In this case, which involves 2900 samples, CV = 0.08 when abscissa and ordinate are separated, and CV = 0.13 when using an orientation object. As in the first example, these results show a qualitative gain for the partition when the object type is used to express the semantically connected data.

5 Conclusion

This modular architecture allows more flexibility and a more precise treatment of data. As we can notice with the previous agronomical classification, the object approach makes each module manageable independently of the others, so that they can evolve autonomously, depending on their own specific features and contexts. The object representation permits preserving multi-dimensionality and makes the fusion of datasets easier. A better overview is offered, since we can perceive the variations of each module distinctively and the evolution of their influences.

As a perspective, an agent-oriented architecture based on multi-agent theory [31] will be developed, so that each sample can be considered independently of the others. Samples would thus be able to create classes while acting simultaneously and comparing themselves to the others, so that the class definitions won't depend on the sample order in the file anymore but will directly result from the definition of the sample set. This orientation will ensure that the classification result of our method is unique and stable for a given sample set. We also aim at developing methods to allow semantic data processing.

References

[1] D. Dubois and H. Prade. The three semantics of fuzzy sets. Fuzzy Sets and Systems, vol. 90, no. 2, pp. 141-150, Elsevier, 1997.
[2] L. Hedjazi, J. Aguilar-Martin, M.V. Le Lann and T. Kempowsky. Towards a unified principle for reasoning about heterogeneous data: a fuzzy logic framework. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, vol. 20, no. 2, pp. 281-302, World Scientific, 2012.
[3] R.S. Michalski and R.E. Stepp. Automated construction of classifications: Conceptual clustering versus numerical taxonomy. IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-5, no. 4, pp. 396-410, 1980.
[4] D.W. Aha. Tolerating noisy, irrelevant and novel attributes in instance-based learning algorithms. Int. J. Man-Machine Studies, 36, pp. 267-287, 1992.
[5] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10, pp. 57-78, 1993.
[6] T. Mohri and H. Tanaka. An optimal weighting criterion of case indexing for both numeric and symbolic attributes. In D.W. Aha (Ed.), Case-Based Reasoning: Papers from the 1994 Workshop, Menlo Park, CA: AAAI Press, pp. 123-127.
[7] C. Giraud-Carrier and T. Martinez. An efficient metric for heterogeneous inductive learning applications in the attribute-value language. Intelligent Systems, pp. 341-350, 1995.
[8] K.C. Gowda and E. Diday. Symbolic clustering using a new similarity measure. IEEE Trans. SMC, 22(2), pp. 368-378, 1992.
[9] Q.H. Hu, Z.X. Xie and D.R. Yu. Hybrid attribute reduction based on a novel fuzzy-rough model and information granulation. Pattern Recognition, 40, pp. 3509-3521, 2007.
[10] F.A.T. De Carvalho and R.M.C.R. De Souza. Unsupervised pattern recognition models for mixed feature-type symbolic data. Pattern Recognition Letters, 31, pp. 430-443, 2010.
[11] I. Kononenko. Estimating attributes: Analysis and extensions of RELIEF. Proc. European Conf. Mach. Learning ECML, pp. 171-182, 1994.
[12] K. Kira and L. Rendell. A practical approach to feature selection. In Proc. 9th Int'l Workshop on Machine Learning, pp. 249-256, 1992.
[13] M. Dash and H. Liu. Consistency-based search in feature selection. Artif. Intell., 151, pp. 155-176, 2003.
[14] D.W. Aha. Incremental, instance-based learning of independent and graded concept descriptions. In Proc. of the 6th Int'l Mach. Learning Workshop, pp. 387-391, 1989.
[15] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil and V. Vapnik. Feature selection for SVMs. Advances in Neural Information Processing Systems, pp. 668-674, 2001.
[16] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory, 13, pp. 21-27, 1967.
[17] H. Liu, F. Hussain, C.L. Tan and M. Dash. Discretization: an enabling technique. J. Data Mining and Knowledge Discovery, 6(4), pp. 393-423, 2002.
[18] M.A. Hall. Correlation-based feature selection for discrete and numeric class machine learning. Int. Conf. Mach. Learning ICML, pp. 359-366, 2000.
[19] H.H. Bock and E. Diday. Analysis of Symbolic Data: Exploratory methods for extracting statistical information from complex data. Springer, Berlin Heidelberg, 2000.
[20] V. Vapnik. The Nature of Statistical Learning Theory. Springer Verlag, 1995.
[21] T. Cover and P. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1), pp. 21-27, 1967.
[22] D. Michie, D.J. Spiegelhalter and C.C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood Series in Artificial Intelligence, February 1994.
[23] R. Rakotomalala. Decision Trees. Revue MODULAD, 33, 2005.
[24] J.C. Aguado and J. Aguilar-Martin. A mixed qualitative-quantitative self-learning classification technique applied to diagnosis. The Thirteenth International Workshop on Qualitative Reasoning (QR'99), pp. 124-128, 1999.
[25] L. Hedjazi, J. Aguilar-Martin and M.V. Le Lann. Similarity-margin based feature selection for symbolic interval data. Pattern Recognition Letters, vol. 32, no. 4, pp. 578-585, 2012.
[26] L. Hedjazi, J. Aguilar-Martin, M.V. Le Lann and T. Kempowsky-Hamon. Membership-margin based feature selection for mixed type and high-dimensional data: Theory and applications. Information Sciences, accepted for publication, 2015.
[27] C.V. Isaza, H.O. Sarmiento, T. Kempowsky-Hamon and M.V. Le Lann. Situation prediction based on fuzzy clustering for industrial complex processes. Information Sciences, vol. 279, pp. 785-804, 2014.
[28] S. Henin, R. Gras and G. Monnier. Le profil cultural (2e édition). Masson, Paris, 1969.
[29] Y. Gautronneau and C. Gigleux. Towards holistic approaches for soil diagnosis in organic orchards. Proceedings of the 14th IFOAM Organic World Congress, Victoria, p. 34, 2002.
[30] P. Quintana-Seguí, P. Le Moigne, Y. Durand, E. Martin, F. Habets, M. Baillon, ... & S. Morel. Analysis of near-surface atmospheric variables: Validation of the SAFRAN analysis over France. Journal of Applied Meteorology and Climatology, 47(1), pp. 92-107, 2008.
[31] J. Ferber. Multi-agent Systems: An Introduction to Distributed Artificial Intelligence (Vol. 1). Addison-Wesley, Reading, 1999.
Proceedings of the 26th International Workshop on Principles of Diagnosis
A Bayesian Framework for Fault Diagnosis of Hybrid Linear Systems

Gan Zhou 1, Gautam Biswas 2, Wenquan Feng 1, Hongbo Zhao 1 and Xiumei Guan 1
1 School of Electronic and Information Engineering, Beihang University, Beijing, China
email: zhouganterry@hotmail.com; buaafwq@buaa.edu.cn; bhzhb@buaa.edu.cn; guanxm@buaa.edu.cn
2 Institute for Software Integrated Systems, Vanderbilt University, Nashville, USA
email: gautam.biswas@vanderbilt.edu
Abstract

Fault diagnosis is crucial for guaranteeing safe, reliable and efficient operation of modern engineering systems. These systems are typically hybrid: they combine continuous plant dynamics, described by continuous-state variables, with discrete switching behavior between several operating modes. This paper presents an integrated approach for online tracking and diagnosis of hybrid linear systems. The diagnosis framework combines multiple modules that realize the hybrid observer, fault detection, isolation and identification functionalities. More specifically, a Dynamic Bayesian Network (DBN)-based particle filtering (PF) method is employed in the hybrid observer to track nominal system behavior. The diagnostic module combines a qualitative fault isolation method using hybrid TRANSCEND with a quantitative estimation method that again employs a DBN-based PF approach to isolate and identify abrupt and incipient parametric faults, discrete faults and sensor faults in a computationally efficient manner. Finally, simulation and experimental studies performed on a hybrid two-tank system demonstrate the effectiveness of this approach.

1 Introduction

The increasing complexity of modern industrial systems motivates the need for online health monitoring and diagnosis to ensure their safe, reliable, and efficient operation. These systems are typically hybrid, involving the interplay between discrete switching behavior and continuous plant dynamics. More specifically, the system configuration changes consist of known controlled mode transitions generated by an external supervisory controller and autonomous mode transitions triggered by internal variables crossing boundary values. The continuous dynamic behavior is modeled by continuous-state variables that are a function of the particular discrete mode of operation. As a result, tasks like online monitoring and diagnosis have to seamlessly integrate continuous behaviors interspersed with discrete transitions that often require model switching to accommodate the discrete transitions [1].

For complex hybrid systems, faults will typically affect both the continuous behavior and the discrete dynamics of the system. Some faults may be parametric, and they directly affect the continuous behavior; others are discrete, and they directly affect the mode of system operation. Both types of faults also have indirect effects on the other type of behavior. Moreover, faults can have different time-varying profiles, such as abrupt, intermittent and incipient faults [2]. In addition, faults may occur in the plant, the actuators and the sensors. Diagnosing multiple fault types in the same framework is challenging, because some faults may produce similar effects in particular measurements. Therefore, the diagnosis approach should provide more discriminatory power.

Previous model-based diagnosis approaches for hybrid systems were developed separately for parametric faults or discrete faults. For example, [1], [3] combined system monitoring with an integrated qualitative and quantitative fault isolation approach to generate, refine, and identify parametric faults. [4], [5] are typical discrete fault diagnosis approaches, which modeled the discrete faults as fault modes and relied on estimating the system behavior for diagnosis. In recent years, some integrated approaches have been proposed for diagnosing parametric and discrete faults together. [6] introduced a global ARR (GARR)-based mode diagnoser to track discrete system modes, and combined it with a quantitative approach to diagnose discrete and abrupt or incipient parametric faults within a common framework. The approach presented in [7] monitored system behavior using a timed Petri-net model and mode estimation techniques, and isolated faults by means of a decision tree approach. Unfortunately, this method was application-specific and was not generalized.

Our goal in this paper is to propose an integrated model-based approach to diagnose single, persistent, incipient or abrupt parametric faults, discrete faults and sensor faults in hybrid linear systems. This extends our earlier work [8] from continuous systems to hybrid systems. A PF technique using a switched DBN is adopted for tracking nominal hybrid system behavior. When a non-zero residual value is detected using a statistical hypothesis testing method, the fault detection scheme triggers the fault isolation and identification modules. We combine a fast qualitative fault isolation (Qual-FI) scheme using the hybrid TRANSCEND approach [1] with a quantitative fault isolation and identification (Quant-FII) scheme based on a PF-based parameter estimation technique to support the diagnosis of multiple fault types in hybrid linear systems.
The Quant-FII scheme derives a switched faulty DBN model for each fault hypothesis that remains when the switch from Qual-FI to Quant-FII is initiated. In addition, Quant-FII is also designed to estimate possible parameter values [8].

The rest of this paper is organized as follows. Section 2 briefly presents the different models employed in our diagnosis approach and some basic definitions of the different types of faults. A hybrid two-tank system is used as a running example to explain the hybrid bond graph modeling method and the derivation of the temporal causal graph and DBN from hybrid bond graph models. Section 3 gives a brief overview of our diagnosis architecture, and then presents our online tracking and fault detection, qualitative fault isolation, and quantitative fault isolation and identification schemes in some detail. Section 4 discusses the results of applying our algorithm to the hybrid two-tank system. Finally, the discussion and conclusions of this paper are presented in the last section.

2 Theoretical Background

In this section, we formalize the basic definitions, concepts and notation of the modeling approach that goes in conjunction with our diagnosis architecture.

2.1 Hybrid Bond Graphs

Bond graphs (BGs) are a domain-independent topological modeling language that captures energy-based interactions among the processes that make up a physical system [9]. The nodes in bond graphs represent components of dynamic systems, including energy storage elements (capacities, C, and inertias, I), energy dissipation elements (resistors, R), energy sources (effort source, Se, and flow source, Sf) and energy transformation elements (gyrators, GY, and transformers, TF). Bonds, drawn as half arrows, represent the energy exchange paths between the bond graph elements. Two junctions (1 and 0), also modeled as nodes, represent the equivalent of series and parallel topologies, respectively.

[Figure 1: schematic of the hybrid two-tank system, with an inflow into Tank 1 through Valve1, drain pipes R1 (Valve2) and R2 (Valve3) at the bottom of each tank, and the autonomous connecting pipe R12 between the tanks]
Figure 1 Schematics of hybrid two-tank system

Hybrid bond graphs (HBGs) extend BGs by introducing switched junctions to enable discrete changes in the system configuration [10]. The switched junctions may be dynamically switched on and off as system behavior evolves. When a switched junction is on, it behaves as a normal junction. When off, the 1 and 0 junctions behave as sources of zero flow and zero effort, respectively. The dynamic behavior of switched junctions is implemented by a finite state machine control specification (CSPEC). A CSPEC defines a finite number of states, and captures controlled and autonomous changes.

The hybrid two-tank system, shown in Figure 1, is the running example we employ in this paper. This system consists of two tanks connected by a pipe, a source of flow into the first tank, and drain pipes at the bottom of each tank. Three valves, valve1, valve2 and valve3, can be turned on and off by commands generated by the supervisory controller. When the liquid level in tank 1 (h1) and/or tank 2 (h2) reaches the height at which pipe R12 is placed (h̄), a flow is initiated through pipe R12. The autonomous mode changes associated with this pipe are triggered when the liquid level in tank 1 and/or tank 2 goes above or below the height of the pipe R12. We assume five sensors: M1 and M2 measure the outflow from tank 1 and tank 2, respectively; M3 measures the flow through the autonomous pipe R12; and M4 and M5 measure the liquid pressure in tank 1 and tank 2, respectively.

[Figure 2: hybrid bond graph of the plant, with flow source Sf, capacitances C: C1 and C: C2, resistances R: R1, R: R2 and R: R12, switched junctions governed by CSPEC1-CSPEC5, effort sensors De: M4 and De: M5, and flow sensors Df: M1, Df: M2 and Df: M3; the HBG of the autonomous pipe R12 is drawn separately]
Figure 2 Hybrid bond graph of the plant

Figure 2 illustrates the HBG model for the plant in Figure 1 (the HBG model for the autonomous pipe R12 is shown separately at the bottom of Figure 2). The tanks and pipes are modeled as fluid capacitances C and resistances R, respectively. Measurement points occur at junctions. They are denoted by elements with symbols De for effort variable measurements and Df for flow variable measurements. Moreover, the two-tank system has five switched junctions: CSPEC1, CSPEC2 and CSPEC3 describe the control logic for the three valves; CSPEC4 and CSPEC5 together capture the autonomous mode transitions of the connecting pipe between the two tanks.
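To make the CSPEC-driven autonomous switching concrete, the following sketch simulates a simplified version of this two-tank plant, in which each tank "sees" the connecting pipe only when its level rises above the pipe height. All numerical values (inflow, capacities, resistances, pipe height) are illustrative assumptions, not parameters taken from the paper.

```python
# Simplified two-tank simulation with CSPEC-style autonomous mode changes.
# All numeric parameters below are illustrative assumptions.

def simulate(t_end=400.0, dt=0.1, f_in=2.0,
             C1=10.0, C2=10.0, R1=8.0, R2=8.0, R12=4.0, h_bar=1.0):
    h1 = h2 = 0.0                      # liquid levels; tanks start empty
    trace = []
    for k in range(int(t_end / dt)):
        # CSPEC4 / CSPEC5: the connecting pipe is active for a tank only
        # when that tank's level is above the pipe height h_bar.
        on1 = h1 >= h_bar
        on2 = h2 >= h_bar
        p1 = (h1 - h_bar) if on1 else 0.0
        p2 = (h2 - h_bar) if on2 else 0.0
        f9 = (p1 - p2) / R12           # flow through the autonomous pipe
        f6 = h1 / R1                   # tank 1 outflow (valve2 open)
        f14 = h2 / R2                  # tank 2 outflow (valve3 open)
        # forward-Euler integration of the tank balance equations
        h1 += dt * (f_in - f6 - f9) / C1
        h2 += dt * (f9 - f14) / C2
        trace.append((k * dt, h1, h2, (on1, on2)))
    return trace

trace = simulate()
```

With these assumed parameters, tank 1 fills first and activates the pipe, which then gradually fills tank 2 until both CSPECs are ON, mirroring the q1113 → q1111 → q1114 mode sequence described in Section 4.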
Figure 3 (a) shows the CSPEC for a valve controlled by the switching signal sw. Figure 3 (b) shows CSPEC4, which describes the state of the left tank. When the liquid height in tank 1 is below that of the autonomous pipe R12, the state is OFF. If the liquid level exceeds the height of the pipe, this CSPEC transitions to the ON state. Similarly, CSPEC5 denotes the state of the right tank, and the mode of the autonomous pipe depends on the combination of these two CSPECs. Table 1 shows the discrete mode for pipe R12 and the corresponding states of CSPEC4 and CSPEC5 in detail. The corresponding bond graph configurations are described in [15].

[Figure 3: two finite state machines, each with states S1: ON and S2: OFF; in (a) the transitions are guarded by the switching signal sw, in (b) by the comparison of h1 with h̄]
Figure 3 (a) Controlled transition; (b) Autonomous transition for CSPEC4

Table 1 Four different possible configurations for autonomous pipe R12

Mode | Constraint function      | CSPEC4 | CSPEC5
1    | h1 ≥ h̄, h2 < h̄          | ON     | OFF
2    | h1 < h̄, h2 ≥ h̄          | OFF    | ON
3    | h1 < h̄, h2 < h̄          | OFF    | OFF
4    | h1 ≥ h̄, h2 ≥ h̄          | ON     | ON

The temporal causal graph (TCG) is a signal flow diagram that captures the causal and temporal relations between system variables, and can also be systematically derived from a BG [11]. In our work, we can efficiently reason about the qualitative behavior of each continuous mode of hybrid system behavior using the TCG when a fault is detected. Formally, a TCG is defined as follows [2]:

Definition 1 (Temporal Causal Graph): A TCG is a directed graph denoted by a tuple <V, L, D>. V = E ∪ F ∪ S ∪ M is a set of vertices involving the effort variables E, flow variables F, discrete fault events S and measurements M in the hybrid bond graph model. L is a label set {1, -1, =, p, p^-1, N, Z, p dt, p^-1 dt}. The propagation type of the first seven labels is instantaneous, and the last two are temporal. D ⊆ V × L × V is a set of edges.

For lack of space, the TCG for the hybrid two-tank system is not shown in this paper, but the algorithms for deriving TCGs directly from bond graph models can be found in [2]. It should be noted that for each mode of operation, the TCG may need to be re-derived to capture the changes in the BG model configuration when mode transitions occur.

2.2 Dynamic Bayesian Networks

Assuming that the system is Markovian and time-invariant, we can model the system as a two-slice temporal Bayes net that illustrates not only the relations between system variables at any time slice t, but also the across-time relations between the variables [12]. The system variables consist of four different sets of variables Xt, Zt, Ut, Yt, which denote the continuous state variables, other hidden variables, input variables and measured variables of the dynamic system, respectively. The relations between these variables can be generated as equations in the state space formalism. The across-time links between successive time slices t and t+1 are derived as transition equations between the state variables in the system. Since the TCG describes the causal constraints between system variables, the DBN can be easily constructed from the TCG. More details of this process are presented in Lerner, et al. [13].

[Figure 4: two-slice DBN over the variables f1, f6, e4, f9, e12 and f14 at slices t and t+1]
Figure 4 Nominal DBN

When all the valves are ON and the liquid levels in tank 1 and tank 2 are above the height of the autonomous pipe R12, the nominal DBN model for the hybrid two-tank system is the one shown in Figure 4. This DBN model is derived from the TCG with the following random variables: the continuous state variables X = {e4, e12} represent the pressures at the bottom of each tank, the input variable U = {f1} denotes the input flow into tank 1, and the measured variables Y = {f6, f9, f14} indicate the outflow from tank 1, the flow through the autonomous pipe R12 and the outflow from tank 2.

[Figure 5: the DBN of Figure 4 augmented with a fault parameter node R1 in each time slice]
Figure 5 Single DBN model for both abrupt and incipient parametric fault

Since the discrete faults only influence the system mode, but not the parameter variables, the DBN fault model corresponding to a discrete fault is constructed from the TCG in the particular discrete mode. For parametric faults, the DBN fault model is generated on the basis of the nominal DBN model by augmenting it with a new random variable for each fault candidate. Figure 5 shows the DBN model with parametric faults represented explicitly for the hybrid two-tank system.
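The idea behind the parametric-fault DBN of Figure 5, promoting the fault parameter to an extra tracked state, can be sketched in state-space form. The dynamics and rate constants below are illustrative assumptions for a mode in which both tanks drain and the connecting pipe is active; they are not the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)

def nominal_step(x, u, dt=0.1, C1=10.0, C2=10.0, R1=8.0, R2=8.0, R12=4.0):
    """Two-slice transition of the nominal DBN: x = [e4, e12] (tank
    pressures), u = f1 (inflow). Illustrative linear dynamics."""
    e4, e12 = x
    f9 = (e4 - e12) / R12              # flow through the connecting pipe
    de4 = (u - e4 / R1 - f9) / C1
    de12 = (f9 - e12 / R2) / C2
    return np.array([e4 + dt * de4, e12 + dt * de12])

def fault_step(xa, u, dt=0.1, C1=10.0, C2=10.0, R2=8.0, R12=4.0, q=1e-3):
    """Augmented fault DBN: x = [e4, e12, R1]. The fault parameter R1
    becomes an extra state, perturbed by a small random walk so a
    particle filter can move it toward the faulty value."""
    e4, e12, R1 = xa
    x_next = nominal_step(xa[:2], u, dt, C1, C2, R1, R2, R12)
    R1_next = R1 + rng.normal(0.0, q)
    return np.append(x_next, R1_next)
```

The observation model would then compute Y = {f6, f9, f14} from the predicted state, exactly as in the nominal case; only the transition model changes between the nominal and fault DBNs.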
The abrupt fault R1_a and the incipient fault R1_i are represented in the same model. When the fault occurs, the fault parameter R1 becomes an additional state variable that needs to be tracked.

2.3 Modeling Faults

In this paper, we focus on the diagnosis of persistent single faults. We consider incipient or abrupt parametric faults and discrete faults occurring in hybrid linear systems, as well as sensor faults. The precise definitions of these faults are given as follows.

Definition 2 (Incipient parametric fault): An incipient fault profile is defined by a gradual drift in the corresponding component parameter value p(t) from the fault occurrence time t_f. The incipient fault parameter p_i(t) can be described by:

\[
p_i(t) = \begin{cases} p(t), & t < t_f \\ p(t) + d(t), & t \ge t_f \end{cases} \tag{1}
\]

where d(t) = \Delta_p^i (t - t_f) is a linear function with a constant slope \Delta_p^i that is added to the nominal parameter value from the time point of fault occurrence. Our approach to the isolation and identification of incipient fault parameters is to estimate this constant slope \Delta_p^i [8].

Definition 3 (Abrupt parametric fault): An abrupt parametric fault is characterized by a step change in the nominal component parameter value p(t) from the fault occurrence time t_f. The abrupt fault parameter p_a(t) is given by:

\[
p_a(t) = \begin{cases} p(t), & t < t_f \\ p(t) + b(t) = p(t) + \Delta_p^a\, p(t), & t \ge t_f \end{cases} \tag{2}
\]

where b(t) = \Delta_p^a\, p(t) is a step function that is added to the parameter value from the time point of fault occurrence. \Delta_p^a is the percentage change in the parameter expressed as a fraction, and our goal is to estimate this value [8].

Definition 4 (Discrete fault): A discrete fault manifests as a discrepancy between the actual and expected mode of a switching element in the model [2].

Discrete faults occur in discrete actuators, like valves and switches, that operate in discrete modes (e.g., on and off). Consider the example of a valve: it may be commanded to close, but remain stuck open. It may also unexpectedly open or close without a command. This type of fault manifests as an unexpected system mode change, unlike parametric faults, which cause deviations in continuous behavior.

Definition 5 (Sensor fault): A sensor fault is a discrepancy between the measurement and the actual value in the model.

In this paper, we only consider sensor bias faults, which can be represented as:

\[
m_b(t) = \begin{cases} m(t), & t < t_f \\ m(t) + b_m, & t \ge t_f \end{cases} \tag{3}
\]

where m(t) is the true value and b_m is the sensor bias term.

3 Diagnosis Approach for Hybrid Linear Systems

Our integrated diagnosis approach for hybrid linear systems (see Figure 6) combines the hybrid TRANSCEND approach [2] with a switched DBN-based PF scheme [14], and diagnoses abrupt or incipient parametric faults, discrete faults and sensor faults in a common framework. It includes three main parts: system monitoring, qualitative fault isolation (QFI) and quantitative fault isolation and identification (QFII). These three steps are summarized below.

Initially, a nominal DBN is constructed from the current TCG model. A hybrid observer uses a PF-based nominal DBN model to track the system behavior in individual modes of operation. At the same time, a finite automata method in the hybrid bond graph scheme implements the CSPECs, executes controlled and autonomous mode changes, and determines the system model for the hybrid observer.

The fault detection module continually monitors statistically significant deviations between the observation y(t) and the estimate ŷ(t) generated by the hybrid observer. Once a fault is determined, QFI is triggered to generate the initial fault hypotheses, and to refine them as additional deviations are observed. When the remaining fault hypothesis set satisfies a particular condition, the QFII scheme is invoked to run in parallel with QFI. The goal of this scheme is to refine the fault hypotheses further and estimate the value of the fault parameter. The following subsections describe these steps in more detail.

3.1 Online Tracking and Fault Detection

Since the hybrid system is piecewise continuous, discrete mode changes of the hybrid system have to be detected accurately as the continuous behavior of the system evolves. In our work, we have designed hybrid observers that are based on the nominal DBN-based PF scheme to track the continuous behavior in individual modes of operation. PF is a general-purpose Markov chain Monte Carlo method that approximates the belief state using a set of samples or particles, and keeps the distribution updated as new observations are made over time. Moreover, the PF approach for DBNs exploits the sparseness and compactness of the DBN representation to provide computationally efficient solutions, because each measured variable in a DBN typically depends on some but not all continuous state variables.

For discrete mode changes, the finite state machine (FSM) for each switched junction determines mode transitions. Since the continuous behavior and discrete mode changes interact with each other as the system evolves, the FSM needs to execute controlled or autonomous mode changes. Explicit controlled changes are relatively simple, but the autonomous mode changes depend on the internal continuous variables. If a mode change occurs, the hybrid observer regenerates the nominal DBN model from the TCG in the new mode, and uses the PF to continuously track the system's dynamic behavior. The online tracking algorithm for hybrid systems is shown in Algorithm 1.
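The three fault profiles of Definitions 2-5 reduce to simple injection functions. The sketch below is a direct transcription of Eqs. (1)-(3); following the paper's 10% example in Section 4.1, the incipient slope is interpreted here as a fractional rate, which is an assumption.

```python
def incipient(p_nom, t, t_f, slope):
    """Eq. (1): gradual drift added after t_f. The slope is treated as a
    fractional rate per second (assumption based on the 10% example)."""
    return p_nom if t < t_f else p_nom * (1.0 + slope * (t - t_f))

def abrupt(p_nom, t, t_f, frac):
    """Eq. (2): step change b(t) = frac * p(t) added after t_f."""
    return p_nom if t < t_f else p_nom * (1.0 + frac)

def sensor_bias(m_true, t, t_f, bias):
    """Eq. (3): constant bias b_m added to the measurement after t_f."""
    return m_true if t < t_f else m_true + bias
```

For instance, a 25% abrupt increase of a pipe resistance of 8.0 injected at t_f = 60 s leaves the parameter at 8.0 before 60 s and raises it to 10.0 afterwards.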
[Figure 6: the diagnosis architecture — system monitoring (hybrid observer and fault detection on the residual r(t) = y(t) − ŷ(t)), qualitative fault isolation (symbol generation, hypothesis generation, progressive monitoring) and quantitative fault isolation and identification (fault isolation, fault identification), all built on the hybrid bond graph, temporal causal graph and DBN models]
Figure 6 The diagnosis architecture

Algorithm 1: Online tracking algorithm
Input: number of particles, N; an initial DBN model D = {X, Z, U, Y}
For each particle i, from 1 to N do
    Sample X_0^i from the prior probability distribution
    Assign Y_0^i as the measurement at time step 0
End For
For each time step t > 0 do
    If a controlled or autonomous mode change occurs
        Regenerate a DBN model D' from the TCG in the new system configuration
    End If
    Prediction: sample each particle in DBN model D'
    Weighting: compute the weight considering the observation
    Resampling: normalize the weighted samples, and resample N new samples
    Calculate the estimated continuous state variables X_t and Y_t at time step t
End For

The fault detection module compares the measured variable y(t) from the sensors with its estimate ŷ(t) computed by the hybrid observer at each time step t. Ideally, any inconsistency r(t) = y(t) − ŷ(t) implies a fault, and invokes the qualitative fault isolation module. However, to account for noise in the measurements and modeling errors, statistical techniques are employed to determine significant deviations of the residual from zero. In this paper, a Z-test, which uses a sliding window to compute the residual mean and variance, is adopted for reliable fault detection with low false-alarm rates [3].

3.2 Qualitative Fault Isolation

The QFI scheme is based on the qualitative fault signature (QFS) method, which was proposed by Mosterman and Biswas [11] and then extended by Narasimhan and Biswas [1] to hybrid systems. Daigle, et al. [2] extended this method to model discrete and sensor faults in continuous and hybrid systems. All of these methods are based on a formal definition of the fault signature, as follows:

Definition 6 (Qualitative Fault Signature): Given a fault f and measurement m, the qualitative fault signature can be denoted by QFS(f, m) = {(s1 s2, s3), s1, s2 ∈ (+, −, 0, *), s3 ∈ (N, Z, X, *)}, where +, − and 0 indicate an increase, a decrease, and no change in the residual magnitude or slope. N, Z and X imply zero-to-nonzero, nonzero-to-zero, and no discrete change behavior in the measurement relative to the estimate. * denotes ambiguity in the signature.

Table 2 Selected fault signatures for the hybrid two-tank system in the mode where all the valves are open and the liquid levels in both tanks are above the height of the autonomous pipe

Fault   | f6      | f9     | f14
C1_a    | (+, X)  | (+, X) | (0, X)
C1_i    | (0, X)  | (0, X) | (0, X)
R1_a    | (−, X)  | (0, X) | (0, X)
R1_i    | (0, X)  | (0, X) | (0, X)
v1.off  | (0, X)  | (0, X) | (0, X)
v2.off  | (−, X)  | (0, X) | (0, X)
f6+     | (+0, *) | (00, X) | (00, X)
f6−     | (−0, *) | (00, X) | (00, X)

When measurement deviations are detected, the symbol generator module in the QFI scheme is triggered to calculate the QFS for the current mode of operation. However, since the fault may have occurred but not been detected in an earlier mode, the fault hypothesis generation module rolls back to find the previous modes in which the fault may have occurred, and generates the fault hypothesis set F = {(f_i, Δ_i, q_i)}, where Δ_i denotes the deviation of the fault parameter value, and q_i indicates the possible modes. The progressive monitoring module applies the forward propagation algorithm to continually refine the fault candidates in the fault hypothesis set. For hybrid systems, the progressive monitoring also has to include forward propagation through mode changes, which makes the tracking algorithm much more complex.
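One cycle of Algorithm 1, together with the sliding-window Z-test used for fault detection, can be sketched as follows. The transition and observation models are passed in as functions; the noise levels, window size and threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def particle_filter_step(particles, u, y, step_fn, obs_fn, sigma=0.1):
    """One Prediction / Weighting / Resampling cycle of Algorithm 1.
    step_fn: two-slice DBN transition; obs_fn: observation model."""
    # Prediction: propagate each particle and add small process noise
    particles = np.array([step_fn(p, u) + rng.normal(0.0, 0.01, p.shape)
                          for p in particles])
    # Weighting: Gaussian likelihood of the observation y
    w = np.array([np.exp(-np.sum((obs_fn(p) - y) ** 2) / (2 * sigma ** 2))
                  for p in particles])
    n = len(particles)
    w = w / w.sum() if w.sum() > 0 else np.full(n, 1.0 / n)
    # Resampling: draw N new particles proportionally to their weights
    particles = particles[rng.choice(n, size=n, p=w)]
    x_hat = particles.mean(axis=0)        # estimated continuous state
    return particles, x_hat

def z_test(residuals, window=50, threshold=3.0):
    """Sliding-window Z-test on the residual r(t) = y(t) - y_hat(t):
    flag a fault when the window mean deviates significantly from zero."""
    r = np.asarray(residuals[-window:])
    if len(r) < window:
        return False
    z = abs(r.mean()) / (r.std(ddof=1) / np.sqrt(len(r)) + 1e-12)
    return z > threshold
```

In a full observer, a mode-change check would precede the prediction step and swap in the regenerated DBN transition model, as Algorithm 1 prescribes.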
Narasimhan and Biswas [1] discuss the details of the roll-back and roll-forward algorithms used to support the progressive monitoring task. When a fault signature is no longer consistent with the observed measurements, and the changes cannot be resolved by autonomous mode transitions, the fault candidate is dropped.

The selected qualitative fault signatures for the hybrid two-tank system in a particular mode are shown in Table 2. For incipient parametric faults, the QFS is shown as (0γ, s3), where γ is the first nonzero symbol in the QFS of the abrupt fault with the same system parameter. Sensor faults only affect the measurement provided by the sensor, so the other measurements, which are not affected, are denoted by 00.

3.3 Quantitative Fault Isolation and Identification

The Quant-FII scheme is activated when any of the following conditions is fulfilled: 1) all the measurements have deviated from nominal, so the remaining fault candidates cannot be refined further by the Qual-FI scheme alone; 2) the number of fault candidates has been reduced to a predefined value k; 3) a predefined time l has elapsed. We restrict the length of the Quant-FII scheme to a pre-specified value, and assume that no autonomous change occurs during this period.

The steps of this scheme are as follows. First, a separate faulty DBN model is constructed for each remaining fault candidate in the hypothesis set. Second, we combine each switched faulty DBN model with the PF method to estimate the system behavior. Similar to the fault detection scheme, a Z-test method is employed to detect inconsistencies between the values estimated by the PF and the measurements. Ideally, only the correct, true fault model will converge to the observed values of the measurements. Once a deviation is determined, the corresponding fault candidate is dropped. This scheme runs in parallel with the qualitative fault isolation scheme, and if a controlled mode change occurs, both schemes need to reload the DBN model for the new system mode. This is the big difference between continuous systems and hybrid systems.

If the fault hypotheses cannot be refined further, or only a single parametric or sensor fault candidate is left, the fault identification scheme is activated to identify the abrupt or incipient parametric fault in the same model and estimate the fault parameter value. We can use the PF estimate of the fault parameter to calculate the abrupt parametric fault magnitude Δ_p^a, the incipient parametric fault slope Δ_p^i or the sensor fault bias term b_m.

4 Experimental Results

To demonstrate the effectiveness of our approach, we apply it to the hybrid two-tank system in Figure 1. In this plant, the incipient parametric faults are modeled as gradual decreases in tank capacity and gradual increases in pipe resistances, denoted as C1_i, C2_i, R1_i, R2_i and R12_i, respectively. The abrupt parametric faults are modeled as step decreases in tank capacity and step increases in pipe resistances, represented as C1_a, C2_a, R1_a, R2_a and R12_a, respectively. We consider discrete faults in each controlled valve, including the valve getting stuck and the valve changing mode without a command. For sensor faults, bias faults causing abrupt changes in the measurement are considered.

We assume that the tanks are initially empty, and start to fill at a constant rate. In the initial configuration of the system, all the valves are set to open. We denote the system mode as q_ijkm, where i, j and k are the modes of valve1, valve2 and valve3, respectively, and m is the mode of the autonomous pipe R12. More specifically, the mode of a valve is one of S1: on, S2: off, S3: stuck_on and S4: stuck_off. Therefore, the initial mode of the system is q1113. At time step t = 6.7 s, the liquid level in tank 1 reaches the height of the autonomous pipe R12. The system mode transitions from q1113 into q1111. Now the autonomous pipe R12 acts as an outflow pipe for tank 1 but as a flow source for tank 2. As the system evolves, the liquid level in tank 2 also reaches the autonomous pipe, at time step t = 53 s. After that, the system mode changes into q1114. The experiments have been run for a total of 400 s using a sampling period of 0.1 s. Gaussian white noise with zero mean and variance 0.018 is added to the measurements.

4.1 Incipient Parametric Fault in R1

In this first experiment, we present our diagnosis approach on a fault scenario: a 10% rate of increase in pipe R1 is injected as an incipient fault at time step t = 60 s.

[Figure 7: observed and estimated outputs for the nominal DBN model]
Figure 7 Observed and estimated result for nominal DBN model

We only consider the measurements M3 and M2, for the flow f9 through the autonomous pipe R12 and the output flow f14 from tank 2. At time step t = 82 s, the fault detection scheme detects an increase in the flow f9, resulting in the initial fault hypothesis set F = {(C1_a, q1114), (C1_i, q1114), (R1_a, q1114), (R1_i, q1114), (v2.off, q1414), (f9, q1114)}. At 88.4 s, the flow f14 shows an increase above nominal (+). A possible autonomous transition is executed for the currently inconsistent candidate (f9, q1114). After that, the first-order change of the flow f9 is determined to decrease and increase in modes
q1414 and q1114 at time steps t = 94.8 s and 97.7 s, respectively, and finally the possible fault hypotheses are F = {(C1_i, q1114), (R1_a, q1114), (R1_i, q1114)}. According to the fault signatures in mode q1114, these three candidates cannot be refined further using the observed deviations. Figure 7 shows the observed and estimated results generated by the nominal DBN model.

The QFII scheme is initiated at time step t = 72 s, and two separate DBN fault models, using C1_i and R1_a/i, are constructed. As more measurements are obtained, the Z-tests indicate a deviation in the measurement estimates obtained by the fault model C1_i, while the estimates generated by the possible true fault model R1_a/i are consistent with the measurements. The quantitative fault identification part estimates the value of R1, and determines that R1 indeed has an incipient fault. While the actual fault slope is 0.1, the estimated slope is 0.1009. The estimates obtained using the two faulty models are shown in Figures 8 and 9, respectively, and the plot of the estimated value of R1 is presented in Figure 10.

[Figure 8: estimated observations using fault model C1_i]
Figure 8 Estimated observation using fault model C1_i

[Figure 9: estimated observations using fault model R1_a/i]
Figure 9 Estimated observation using fault model R1_a/i

[Figure 10: estimated value of the true fault parameter]
Figure 10 Estimated value of true fault parameter R1_i

4.2 Discrete Fault in Valve 2

In this subsection, we investigate an unexpected switch fault: valve 2 closes without a command at time step t = 80 s. We only consider the flows f6 and f9 in this experiment.

Figure 11 shows the observed and estimated outputs using the nominal DBN model. The fault is detected at time step t = 80.1 s, and the symbol generator reports a decrease in the flow f6. The QFI scheme generates the fault hypothesis set F = {(R1_a, q1114), (R1_i, q1114), (v1.off, q4114), (v2.off, q1414), (f6, q1114)}. At time step t = 80.6 s, the symbol generator determines the flow f6 to be Z in modes q1114 and q4114, because the estimated flow f̂6 ≠ 0 while the observation f6 = 0. This symbol eliminates all the parametric faults and the discrete fault v1.off from the current trajectory. At 83.6 s, the flow f10 shows a positive deviation (+), so the fault candidate (v2.off, q1414) is correctly isolated. In this experiment, the real fault candidate is isolated by the QFI scheme, so the QFII scheme is not invoked.

[Figure 11: observed and estimated outputs for the nominal DBN model]
Figure 11 Observed and estimated result for nominal DBN model

We also performed several additional experiments with different fault types, fault magnitudes, noise levels and fault occurrence times, and obtained satisfactory results. For lack of space, we do not discuss these results in detail.

5 Conclusion

In this paper, we presented an integrated approach for online monitoring and diagnosis of incipient or abrupt parametric faults, discrete faults and sensor faults in hybrid linear systems. First, we adopt HBGs to model the system, and construct the diagnosis models, i.e., the TCGs and the DBN models, from the HBG model in the different modes. A PF method based on the switched DBN model is employed for online monitoring of the system's dynamic behavior. Once the discrete finite automaton in the HBGs detects a controlled or autonomous mode change, the HBGs regenerate the TCGs and the DBN model in the new mode. These modeling approaches guarantee that hybrid systems can be tracked correctly.
Proceedings of the 26th International Workshop on Principles of Diagnosis
Then, we demonstrated that we can accommodate discrete fault and sensor fault models into the TCG and DBN models that represent the dynamic system behavior. As a result, our model-based approach can diagnose parametric, discrete and sensor faults within the same modeling and tracking framework. Finally, the QFI scheme using the Hybrid TRANSCEND approach and the QFII scheme based on the switched DBN-based PF approach are combined into a common framework, which provides more discriminatory power at lower computational complexity.

This work builds on the approaches presented in [1][2][11][14]. [1] extends our previous work [11] from continuous systems to hybrid systems, but that diagnosis framework could only handle abrupt parametric faults. Soon after, Daigle [2] further extended the work in [1] to capture discrete faults and sensor faults. Roychoudhury [8][14] combined a qualitative fault isolation scheme with an efficient DBN approach to diagnose both abrupt and incipient parametric faults in continuous systems. This paper proposes a comprehensive diagnosis methodology, which extends the DBN-based PF observer [8][14] to track the behavior of linear hybrid systems within and across mode changes, and combines the qualitative fault isolation scheme in [2] with the PF-based quantitative fault isolation and identification scheme in [8][14] to diagnose multiple fault types. This method has been successfully applied to a hybrid two-tank system, and the experimental results demonstrate the effectiveness of the approach. However, since the application in this paper is a relatively simple hybrid linear system, our future work will scale this methodology up to more realistic linear and nonlinear hybrid systems. Moreover, distributed diagnosis techniques can efficiently decrease the computational complexity for complex real systems, so this is also a direction for future research [16].

Acknowledgments

This research was supported by the China Scholarship Council under contract number 201306020068. The work was performed in Prof. Biswas' lab at the Institute for Software Integrated Systems (ISIS), Vanderbilt University, USA.

References

[1] Narasimhan, S. and Biswas, G. Model-based diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A: Systems and Humans, 37(3): 348-361, 2007.
[2] Daigle, M. J. A qualitative event-based approach to fault diagnosis of hybrid systems. PhD thesis, Vanderbilt University, 2008.
[3] Biswas, G., Simon, G., Mahadevan, N., Narasimhan, S., Ramirez, J. and Karsai, G. A robust method for hybrid diagnosis of complex systems. Proceedings of the 5th Symposium on Fault Detection, Supervision and Safety for Technical Processes, 1125-1131, 2003.
[4] Dearden, R. and Clancy, D. Particle filters for real-time fault detection in planetary rovers. In Proceedings of the Thirteenth International Workshop on Principles of Diagnosis, 2002.
[5] Hofbaur, M. W. and Williams, B. C. Hybrid estimation of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5): 2178-2191, 2004.
[6] Levy, R., Arogeti, S. and Wang, D. An integrated approach to mode tracking and diagnosis of hybrid systems. IEEE Transactions on Industrial Electronics, 61(4): 2024-2040, 2014.
[7] Zhao, F., Koutsoukos, X., Haussecker, H., Reich, J. and Cheung, P. Monitoring and fault diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 35(6): 1225-1240, 2005.
[8] Roychoudhury, I., Biswas, G. and Koutsoukos, X. Comprehensive diagnosis of continuous systems using dynamic Bayes nets. Proceedings of the 19th International Workshop on Principles of Diagnosis, 151-158, 2008.
[9] Karnopp, D. C., Margolis, D. L. and Rosenberg, R. C. System Dynamics: Modeling, Simulation, and Control of Mechatronic Systems. Wiley, 2012.
[10] Roychoudhury, I., Daigle, M. J., Biswas, G. and Koutsoukos, X. Efficient simulation of hybrid systems: A hybrid bond graph approach. Simulation, 87(6): 467-498, 2011.
[11] Mosterman, P. J. and Biswas, G. Diagnosis of continuous valued systems in transient operating regions. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 29(6): 554-565, 1999.
[12] Murphy, K. P. Dynamic Bayesian networks: representation, inference and learning. PhD thesis, University of California, Berkeley, 2002.
[13] Lerner, U., Parr, R., Koller, D. and Biswas, G. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, 531-537, 2000.
[14] Roychoudhury, I. Distributed diagnosis of continuous systems: Global diagnosis through local analysis. PhD thesis, Vanderbilt University, 2009.
[15] Narasimhan, S. Model-based diagnosis of hybrid systems. PhD thesis, Department of Electrical Engineering and Computer Science, Vanderbilt University, August 2002.
[16] Roychoudhury, I., Biswas, G. and Koutsoukos, X. Designing distributed diagnosers for complex continuous systems. IEEE Transactions on Automation Science and Engineering, 6(2): 277-290, 2009.
ADS2: Anytime Distributed Supervision of Distributed Systems that Face
Unreliable or Costly Communication
Cédric Herpson∗, Vincent Corruble and Amal El Fallah Seghrouchni
Sorbonne Universités, UPMC Univ Paris 06, UMR 7606, LIP6, F-75005, Paris, France
CNRS, UMR 7606, LIP6, F-75005, Paris, France
e-mail: firstname.lastname@lip6.fr
Abstract

The purpose of a supervision system is to detect, identify and repair any fault that may occur in the system it supervises. Nowadays industrial processes are mainly distributed, yet their supervision systems are still centralized. Consequently, when communications are disrupted, the supervision process slows down or stops. Increasing production rates make this dependence on the state of the communications no longer acceptable. To allow the anytime supervision of such systems, we propose a distributed approach based on a multi-agent system where each supervision agent autonomously handles both diagnosis and repair on a given location. This degree of delegation, never considered in the literature or in industry outside of theoretical frameworks, requires overcoming several difficulties: How can one agent autonomously make a diagnosis with dynamically arriving information? How can several agents coordinate and reach a consensus on a given diagnosis or repair with asynchronous communication? Finally, how can a human be allowed to trust the decisions of such a system? This paper develops our proposal along these three axes and evaluates ADS2 using an industrial case study. Experiments demonstrate the relevance of our approach with an overall reduction of the supervised system down-time of 34%.

1 Introduction

Supervision systems were initially monitoring tools whose role was limited to collecting and displaying information for its interpretation and use by a human expert. Today, the advent of complex and physically distributed systems leads to a semantic shift from supervision tools to supervision systems. Indeed, as the complexity of systems increases, humans can no longer process the flow of information arriving at each instant. The need to minimize down-time and to improve system effectiveness requires delegating some of the decision-making power of the human supervisor to the supervision system. This requirement has led to the (re)birth of a research community around the notions of autonomic computing [1] and self-* systems [2]. Our work lies within this context.

Within the Dem@tFactory1 project, our objective is thus to improve the supervision of an existing digitizing chain distributed over several sites (see Fig. 1). Different faults – single or multiple – can occur and alter or prevent the processing of the documents (e.g. a scanner quits working, a disruption of the connection between different sites halts or corrupts a data transfer, an OCR software is poorly set up and generates unexploitable results, etc.).

1 Project of the French R&D initiative Cap Digital federating 4 industrial partners and 3 laboratories.

Figure 1: In red, the communication links between the main sites of the digitization chain of the Dem@tFactory project. In yellow, the links with the current (centralised) supervision system.

Centralized supervision systems are currently the most common in industry. However, they do not perform well in asynchronous contexts. Indeed, communication malfunctions between the supervision system and the geographically distributed regions of the supervised system delay the repair and prevent a quick return to normalcy, even though a number of malfunctions may have local predefined repair procedures available. The unbounded communication time between the supervision and the supervised system is the main reason for this problem.

To overcome this lack of robustness in the face of unreliable communications and to reduce the supervised system down-time, we present in this article ADS2: a multi-agent architecture where each supervision agent autonomously handles both diagnosis and repair on a given location. The proposed architecture is composed of three mechanisms: a decision mechanism, a coordination and consistency recov-
ery mechanism, and an intertwining mechanism. The decision mechanism tackles the dynamicity of the information available to an agent in order to make a diagnosis. The coordination and consistency mechanism deals with the problem of reaching a consensus between several agents on a global diagnosis (or repair) in a context of asynchronous communications. Finally, the intertwining mechanism addresses the problem of the size of the search space in a multiple-fault context.

In this article, we first present our fault and repair model and the various assumptions made in section 2. We then describe the three mechanisms of our multi-agent architecture in sections 3 to 5. We then demonstrate the viability of our proposal with experiments in section 6. Finally, we discuss related work in section 7 before concluding.

2 A Multi-Agent Architecture for the Supervision of Distributed Systems

Our architecture comes within the scope of fault-based model2 approaches with spatially distributed knowledge. The supervision process is distributed among several autonomous agents, each having a local view of the system to be supervised and endowed with diagnosis and repair capabilities. The supervised system is partitioned into regions, each one supervised by one agent. As illustrated in Fig. 2, the supervision agents (Ai) exchange information in order to establish a diagnosis and a repair consistent with the observations (Oj) they get from the various units of the supervised system (Uk). The links between the square units represent the standard workflow of the supervised system. The dashed arrows represent the fact that some elements may be reprocessed if the quality is not sufficient. The arrows between the units and the agents represent the communication links used to transmit alarm logs. The remaining links represent the communications between the supervision agents.

Figure 2: Example of our supervision system deployed on a workflow.

2.1 Assumptions

We consider that communications are asynchronous and that there is no upper bound on transmission delay. We assume that the messages exchanged between supervised units may be lost or corrupted, and that some units are not supervised (e.g. unit U2 on Fig. 2). This assumption is based on the fact that a complex industrial process commonly involves different actors that do not share their supervision information3. Moreover, we assume that the observations and the messages between agents can be lost but not corrupted. The agents are supposed to be reliable (no Byzantine behaviour). Finally, we consider that the simultaneous occurrence of different faults does not result in masking phenomena among the observables.

2 No model of the system's correct behaviour is available. The system can only use fault models, a priori known or dynamically learned from the system observations.
3 Subcontractors in the case of the Dem@tFactory project.

2.2 Fault model and repair plan

Let F be the set of known faults of a system S and R be the set of existing repair plans. The signature of a fault f is a sequence of observable events generated by the occurrence of f. The set of signatures of a given fault f is Sig(f). To be able to represent any temporal dependencies, each fault is modeled as a t-temporised Petri net (Fig. 3). Each fault is supposed to be repairable, that is to say that there exists at least one partially ordered sequence of atomic repairs rk that repairs it (a repair plan).

Figure 3: Let f be a fault that possesses 2 signatures: Sig(f) = {o1 o5 ; o1 [o2, o3][to1, to1+5'']}. The oi are the events observed on the supervised system. The toi indicate the temporal constraints. Thus, [to1, to1 + 5''] constrains the sequence of observations [o2, o3] to appear within the 5 seconds that follow the occurrence of o1 for f to be recognized.

The supervised system is partitioned into regions rgj. Each supervision agent is associated with one unique region and knows the models of the faults that may occur in the region it oversees. However, a fault can cover several regions. In that case, an agent only knows the part of the model that concerns its region. Its model is completed with the names of the agents responsible for the other regions. This hypothesis allows modeling workflows involving different actors that do not share their data.

Sig(f) = {o1^rgb o2^rgb o3^rgc o4^rgc} =⇒ SigArgb(f) = o1 o2 Argc and SigArgc(f) = Argb o3 o4

Besides getting the models of faults, the issue of defining a global precedence relation between the events that occur within the supervised system remains. Indeed, there is no common clock across the different regions. It is therefore necessary to add to each agent a stamping mechanism allowing this order relation to be recreated. We will not detail here the concept of distributed clocks. We consider in the following that the agents are able to recreate this partial-order relation.

2.3 Diagnosis and multiple faults

During the period of time [t − ∆t, t], agent Ai collects a sequence of observations seqObsAi(t, ∆t) generated by the occurrence of faults on the system. However, agent Ai does not know which faults have occurred. It thus analyses seqObs in order to determine the set of all faults fpAi(t, ∆t) whose signatures partially or totally match elements of seqObs. A diagnosis dg is a set of faults that can explain seqObs. Dg is the set of all possible diagnoses of seqObs.

2.4 Fault cost and repair cost

Finally, each fault f (respectively each repair plan rp(f)) is associated with a cost of malfunction which depends on
the fault duration, Ctdysf(f, t) (resp. a cost of execution, CtEx(rp(f))). The cost of a diagnosis dg for the supervision system is the result of the aggregation of the respective costs of the faults that compose it. In the general case:

Ctdysf(dgi, t) = Aggreg_{fj ∈ dgi}(Ctdysf(fj, t))   (1)

Similarly, the execution cost of a repair plan rp associated to a given diagnosis depends on the aggregation of the respective costs of the repairs that compose it. Thus, in the case where the repair plan depends directly on the faults:

CtEx(rp(dgi)) = Aggreg_{fj ∈ dgi}(CtEx(rp(fj)))   (2)

3 Agent Decision Model

We consider highly dynamic systems. Consequently, the information available to an agent at a given time can be insufficient to determine with certainty which action to select. A supervision agent thus has to determine the optimal decision (Dopt) between the immediate triggering of the plan made under uncertainty (Dimmopt), and a delayed action (Ddelayopt) which allows it to wait and communicate with other supervision agents during k time steps. This waiting time can yield information that reduces uncertainty and thereby improves decision-making. The counterpart is that the elapsed time may have a significant negative impact on the system. The expected potential gain in terms of accuracy must be balanced against the risks taken.

Let Ct(x) be the cost of an action x and Ctwait(k) the cost related to the extra time k before selecting a repair plan. The decision-making process of each supervision agent works as follows:

1. Observation gathering.
2. Computation of the different sets of faults that can explain the current observations: Dg (set of diagnoses).
3. Determination of the immediate repair Dimmopt based on the available information and on the constraints we chose to focus on (most probable explanation, law of parsimony, worst case, ...) and computation of its estimated cost Ct(Dimmopt).
4. At time t, an agent knows the set of the faults that may be occurring in the region it supervises, fpAi(t, ∆t). Knowing their signatures, the agent is able to predict, for each fault of fpAi(t, ∆t), the set of observables that can be expected to appear during the time interval [t, t + k], with k an a priori fixed parameter. The agent uses this information to compute the waiting cost Ctwait(k), the expected potential gain of a delayed repair Ddelayopt and its associated cost Ct(Ddelayopt).
5. Choice between the immediate repair Dimmopt and the delayed repair Ddelayopt.

This algorithm is executed at each time step and by each agent when faults occur. The value k represents an upper bound on the delay, as an agent's decision is updated each time an observation is received. We detail in the following subsections steps 3 and 4, relative to the determination of the immediate and delayed repairs and of their respective costs.

3.1 Immediate repair Dimmopt

The knowledge of the different signatures of faults allows us to establish a list of potential diagnoses Dg. We sort these explanations according to the available information and to the constraints we chose to focus on (e.g. the most probable explanation). After this step, the first element of Dg is the diagnosis considered as the most relevant at the current time. It is then necessary to estimate its cost.

The cost of the immediate repair Ct(Dimmopt) must take into account the execution cost of the repair plan associated to the retained diagnosis (CtEx, equation 2), as well as a cost representative of the potential error relative to this decision, CtErr. Indeed, if the only cost considered were that of the execution of the repair plan, the final decision (step 5) would always favour an immediate action over a delayed one, due to the additional waiting cost of the delayed action.

Ct(rp(dgi)) = CtEx(rp(dgi)) + CtErr(dgi | Dg\{dgi})   (3)

The computation of the error cost CtErr relies on the assumption that the correct diagnosis – and so the correct repair – belongs to the sorted list Dg of the potential diagnoses. Thus, in case of misdiagnosis when selecting the first diagnosis dg1 of Dg, the system will lose a time equal to the execution time of the first repair plan (CtExecTime(rp(dg1))), which will be supplemented by the execution cost of the newly chosen repair plan (CtEx(rp(dg2))) associated to the second diagnosis of Dg. As this second choice may also turn out to be an error, we define CtErr recursively on Dg. Thus:

CtErr(dg1 | []) = 0   (Dg is empty, the diagnosis is correct)
CtErr(dg1 | Dg\{dg1}) = P(dg1 | Dg\{dg1}) × [CtExecTime(rp(dg1)) + CtEx(rp(dg2)) + CtErr(dg2 | Dg\{dg1, dg2})]

with P(dg1 | Dg\{dg1}) the probability that choosing dg1 as the final diagnosis is an error.

3.2 Delayed repair Ddelayopt

At time t, an agent knows the set of the faults that may be occurring in the region it supervises, fpAi(t, ∆t). The different fault models are represented using t-temporised Petri nets (Fig. 3). The agent is thus able to predict, for each fault of fpAi(t, ∆t), the set of observables that can be expected to appear during the time interval [t, t + k], with k an a priori fixed parameter. Note that the agent uses the current transmission duration (computed over the interval [t − ∆t, t]) to determine the set of potential observations.

From this information, the agent builds the tree representing the set of all possible futures up to the current time plus k units of time, Arbpossibles_Ai(k). Each node of the tree is associated with a set of observations and represents one possible future (Fig. 4). The agent then computes, for each node of the tree, the set of diagnoses that explain this future (Dg').

The agent can then compute, for each possible future, the immediate decision considered as optimal. At time t, the determination of the delayed decision with horizon k (Ddelayopt) involves choosing between the various possible situations. This choice is realised by sorting the first elements of each Dg' of the tree of possible futures with each other, using the same criterion as the
one used to identify the immediate repair in subsection 3.1.

Figure 4: Illustrative example of a tree of the possible futures.

Once the delayed decision is identified, its cost Ct(Ddelayopt) is established using Equation (3). We then have to add to this cost the waiting cost Ctwait. This waiting cost represents the consequences of the faults on the supervised system during the time where no action was triggered. The computation of the waiting cost depends on the respective costs of the malfunctions associated to the remaining diagnoses and on the elapsed time.

Ctwait(k) = Aggreg_{dgi ∈ Dg}(Ctdysf(dgi, k))   (4)

4 Distributed Supervision and System Consistency

In the previous section, we addressed the problem of one agent making a decision. However, as each agent has a local view of the system, a decision about a diagnosis and/or a repair frequently requires information and knowledge from other supervision agents. It therefore becomes necessary to reach a consensus on the decision to make.

However, distributed supervision works in a context of asynchronous and unbounded communication. Under these constraints, the Fischer-Lynch-Paterson theorem [3] states the impossibility of guaranteeing that a consensus between different components will be reached.

To circumvent this difficulty, the literature on supervision frequently introduces hypotheses on the quality of the communications. As our work aims to operate under real-life hypotheses, we make none regarding the (un)reliability of the communication. We discuss in this section the use of the multi-Paxos algorithm [4] to reach a consensus when the state of the communication allows it, and propose a consistency mechanism to restore a common view of the system by the agents after a unilateral decision taken by some of them.

4.1 Consensus algorithm

In the general case, establishing a consensus must meet the following properties: (Agreement) All correct processes decide the same value. (Integrity) Every process decides at most once. (Validity) Each value decided belongs to the set of proposed values. (Termination) Every correct process eventually decides in finite time. However, in the context of the Fischer-Lynch-Paterson theorem, the supervision system can only offer a "best-effort" guarantee, i.e., assure that the consensus can be reached, but only if the system is stable for a sufficiently long period of time [5].

The multi-Paxos algorithm, initially developed for reaching an agreement in a network of unreliable processors, falls into this category. The interesting aspect of this algorithm is that it was designed to resist halt failures - with recovery possibility - of a number of processes, including the coordinator. Its very small number of assumptions makes it operational in an environment with unreliable communications. These properties make it particularly suited to multi-agent systems. Using multi-Paxos, each agent is able to initiate, join or leave a coalition.

The fact that there is no upper bound on the time needed to reach a consensus will inexorably lead to some unilateral decision-making by agents or agent groups in case of communication disruption. This feature of our system guarantees the avoidance of deadlock situations when communications are too unstable to let the agents reach a consensus. However, this ability requires the introduction of an algorithm to restore a consistent view of the system state by all agents.

4.2 System consistency

Algorithm 1 works in a producer-consumer manner with the decision-making process introduced in section 3. The two algorithms share, within an agent, a common inconsistency queue Finc. When a coalition is left by at least one agent before reaching a consensus (due to a communication breakdown or to an agent's decision), the members of the coalition store their respective decision-making contexts (the current sequence of observables, the set of considered explanations and the list of agents which belong to the coalition) into their own potential inconsistency queue Finc.

The consistency maintenance algorithm is available within each agent as a behaviour; it continuously observes the state of the queue Finc. When an entry is added to Finc, the algorithm is automatically triggered.

Algorithm 1 Check consistency
Require: Pattern observer on Finc
1: if Finc ≠ ∅ then
2:   Try to contact Finc.getFirst().getCoalition()
3:   if contact successful then
4:     Send Finc.getFirst()
5:     Receive the other agents' decision contexts
6:     Make a pairing between the local decision context and the others
7:     if pairing is ok then
8:       Finc.removeFirst()
9:     else
10:      Start a new Paxos instance
11:    end if
12:  end if
13: end if

This algorithm lets each agent find a match between its actions and those selected by the other members of the coalition. Thus, in case of faults due to past inconsistent decisions taken by the agents, they are able to trigger a sequential diagnosis and to discriminate the initial disturbances from the consequences of their decisions.

When the potential inconsistency queue of an agent contains an element, the agent tries to resolve it. The agent tries to contact each of the agents of the coalition concerned with this potential inconsistency Pinc. If these agents are able to communicate (the communications are restored),
they will exchange their respective decision-making con- faults which are not real and which prevent the repair of
texts. By comparing them, they will be able to determine the system. We call these situations virtual deadlocks.
whether the decisions made locally by the different groups To disambiguate these situations, we added a relationship
of agents are consistent with one another. If this is the case of innocuousness I to these definitions. Thus, for a fault
(the faults are repaired, the system is in a stable state and f , a repair plan r(f ), its repair conflict set C R (f ) and
correct), each agent removes Pinc of the queue. Otherwise, considering the current state of the system, I returns the set
the subset of agents involved initiates a new coalition in or- of sets of faults belonging to CtR on which the execution of
der to resynchronize their respective views of the system and repair plan r(f ) leaves the system unchanged. The result
make a decision consistent at the system’s scale. If commu- of this innocuousness relation is the set of disambiguation
nications are too unstable (or too costly), this consensus will under repair, denoted by DR (f ). Taking into account all
not be reached, which results in adding a new entry in Finc . this information, we are then able to propose an algorithm
Restoring the consistency of the system state as it is per- to plan the order of the repairs and to resolve some conflicts.
ceived by the supervision agents is again relying on the sta- We illustrate how it operates below :
bility of the communication links for a sufficient amount of
time. Example : Let F = {f1 , f2 , f3 } with Sig(f1 ) =
Sig(f2 ) = {a} and Sig(f3 ) = {b}. Rp(f1 ) = {r1 },
5 Intertwining Diagnosis and Repair Stages Rp(f2 ) = {r2 } and Rp(f3 ) = {r3 }. Moreover, we know
In the previous sections, we endowed the supervision agents that C R (f1 ) = {{f3 }} and that DR (f2 ) = {{f2 }, {f1 }}.
with decision-making and coordination mechanisms. These As the signatures of f1 and f2 are identical, it follows that
abilities allow the agents to dynamically adapt their behaviours to the current state of the communications and of the supervised system. In case of uncertainty regarding the decision to make, the agents are thus able to explore the solution space, collectively as well as individually. However, the large size of this set remains a problem. Indeed, it is both a source of misdiagnosis in case of local decision-making and the cause of a large number of supervision messages when a consensus must be reached. In order to reduce the complexity of the decision process, we address in this section the question of obtaining the minimal set of diagnoses and of associated repair plans. To this aim, we discuss the idea of intertwining the diagnosis and repair stages.

This idea has been introduced by Cordier et al. [6] on the formalization of self-healing systems [7]. Several failures may indeed have the same signature without calling into question the repairability of the system; all that is needed is that a repair be common to all of the faults involved (notion of macrofault).

However, restricted to the single-fault context, this formal model defines the diagnosability and repairability of a system as static properties that can be computed offline. This is not the case in the multiple-faults context. Indeed, the appearance of faults can prevent the triggering of a repair associated to another fault currently occurring in the system, and the possible situations are endless. Being able to represent this kind of interference is essential to our work. This led us to introduce context-dependent notions of diagnosability and repairability.

Definition 1 (Conditional Diagnosability).
Diagnosable(fi, t) ⟺ ∀x ∈ C^D(fi), x ∉ ss(t) ⟺ C_t^D(fi) = ∅

A fault f is diagnosable at time t if none of the faults that may prevent its diagnosability (e.g. if they share the same signature) is appearing in the system at this instant. This set of faults is the conflict set in diagnosis of the fault f (denoted by C_t^D(f)). Following the same reasoning, we can define C_t^R(f) as the conflict set in repair of the fault f.

Finally, the uncertainty regarding the faults that are currently occurring on the supervised system may lead the supervision agents to "believe" the occurrence of C^D(f1) = {{f2}} and C^D(f2) = {{f1}}. We assume that an agent detects the observables a ∧ b.

Figure 5: Diagnosis state for the agent: dg1 = {f2, f3}, dg2 = {f1, f3}

Figure 6: Operation scheme of the active repair algorithm

In Fig. 6, the initialization of the algorithm determines for each potential fault fi all repair conflicts existing at the current time, C_t^R(fi). The planning phase recursively builds the repairing order from C_t^D and C_t^R, adding the faults whose conflict sets are empty, and then updates the remaining ones. At the end of this phase, if some faults remain, they are potentially in a deadlock. In our example, the agent has to choose between {f2, f3} and {f1, f3}. As highlighted in Fig. 5, the agent can repair f3 but is unable to make a distinction between f1 and f2 (we assume that this conflict is virtual and that only one of these faults is occurring).

The disambiguation phase then attempts, from the proven innocuousness of some repairs in the current context, to solve these conflicts. If one of them is solved, the planning phase is retried after updating the conflict sets. If the disambiguation does not work, it means that the agent does not have, at the current time, enough information to solve the problem. The decision-making process previously introduced in section 3 is then triggered. In our example, repair r2 is selected. If the system returns to normalcy, then both diagnosis and repair phases end. If not, the previous action guarantees the occurrence of f1, and the associated repair plan is executed.
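The planning phase described above can be sketched as follows; the function and variable names are ours, and the single disambiguation step is omitted (this is an illustration of the described mechanism, not the authors' implementation):

```python
# Sketch of the planning phase of the active repair algorithm: faults
# whose diagnosis and repair conflict sets are both empty are appended
# to the repair order and removed from the remaining conflict sets;
# faults left over are potential deadlocks for the disambiguation phase.

def plan_repairs(conflicts_d, conflicts_r):
    """conflicts_d / conflicts_r: fault -> conflict set in diagnosis /
    in repair (C_t^D and C_t^R), given as sets of fault names."""
    pending = set(conflicts_d)
    order = []
    progress = True
    while pending and progress:
        progress = False
        for f in sorted(pending):
            if not conflicts_d[f] and not conflicts_r[f]:
                order.append(f)
                pending.discard(f)
                # A planned repair can no longer block the other faults.
                for g in pending:
                    conflicts_d[g].discard(f)
                    conflicts_r[g].discard(f)
                progress = True
    return order, pending  # non-empty `pending` = potential deadlock

# Situation of Fig. 5: f3 is repairable at once,
# while f1 and f2 block each other's diagnosis.
order, deadlock = plan_repairs(
    {"f1": {"f2"}, "f2": {"f1"}, "f3": set()},
    {"f1": set(), "f2": set(), "f3": set()},
)
# order == ["f3"], deadlock == {"f1", "f2"}
```

In the paper's example, the leftover pair {f1, f2} is exactly what the disambiguation phase, and failing that the decision-making process of section 3, has to resolve.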
Proceedings of the 26th International Workshop on Principles of Diagnosis
6 Experimental Evaluation
To evaluate our approach we developed a simulator for distributed systems. Based on the JADE multi-agent platform [8], our environment allows us to model both physical units and communication links, and to simulate the occurrence of failures in them. For a given simulation, a list of faults is associated with each site and each communication link (a communication link can increase its transmission time, a unit may stop working properly, etc.). Each fault is associated with one or more trigger conditions: a date and/or the occurrence of another failure. This allows us to simulate cascading faults. Our supervision system is deployed on this simulator. When running the simulation, some faults trigger the sending of an alarm message to the agent responsible for the site where they appear. The agents then try to determine the appropriate behaviour from these messages.

Our agent decisional model is generic. It can be instantiated with various criteria (the most probable explanation, the worst case hypothesis, etc.) depending on available information and on the constraints we choose to focus on. In the Dem@tFactory project, it appeared essential to consider the utility of a decision based on the cost of the occurrence of a given set of faults rather than on its occurrence probability. This reasoning led us to favor a robust decision criterion. The decisions taken by the agents therefore rely on the worst case hypothesis.

The upper bound k, which is the horizon considered by the agents of ADS2 for the computation of the delayed decision, is set to 15 units of time. Moreover, we assume in these experiments that the respective costs of the faults that compose a diagnosis are additive. Finally, in order to have benchmarks for the evaluation of the principles underlying our architecture (ADS2), we also implemented a centralized supervision system (SC) where all observables are transmitted to a single supervision agent.

6.1 Experimental setup
Our goal is to study the behaviour of the supervision system facing an industrial case study.

Figure 7: Workflow of the digitizing chain of the Dem@tFactory project.

Fig. 7 represents the digitizing chain of Dem@tFactory used for the experiments. Each dotted rectangle corresponds to a factory situated on a given geographical location (2 in France, 1 in Madagascar and 1 in Mauritius). The circles correspond to the various processes required for the digitization. The inter-rectangle links correspond to communications between the different factories, and the intra-rectangle ones to local communications. All these components are modeled in the simulator and can engender the occurrence of faults.

We fixed a priori the locations and responsibilities (the regions) of the supervision agents according to the geographical location of the units that compose the supervised process. We used a dataset provided by our industrial partners (8 GB of data corresponding to 48 hours of logs) to extract nineteen different fault models. From these data, we determined that the probability of occurrence of n faults per unit of time follows a Poisson distribution with parameter λ = 0.043. Finally, using information gathered from our partners, we were able to estimate the costs of the faults over time (constant, logarithmic, etc.) and the associated repair costs.

6.2 Experiments
Our experiments study the behaviour of the supervision systems when varying the (heterogeneous) transmission cost. We used a random generator to assign a transmission time (between 0 and 30 units of time) to each transmission link for each time unit of the simulation. We arbitrarily set at 10% the probability of a link getting a transmission time greater than 1. In order to obtain a baseline, we first performed different simulations with homogeneous transmission costs.

The performance evaluation is based on three criteria: (1) the average response time to a malfunction; (2) the average number of supervision messages exchanged; (3) the average total cost of repairs made during the experiment.

Figs. 8(a) and 8(b) present the evolution of the behaviour of the supervision systems ADS2 and SC for the first two criteria in the case of homogeneous (Ho) and heterogeneous (He) communication links. The vertical bar at t = 15 ut is the horizon considered by the agents of ADS2 for the computation of the delayed decision.

Fig. 8(a) shows that our architecture is very robust, allowing the supervised system to rapidly recover from failures. The response time of ADS2 (Ho and He) progressively stabilizes around 15 ut, whereas the response time of SC increases over time and becomes higher than that of ADS2. This is due to the fact that the agents of ADS2 can decide to act without waiting for the reception of all the messages that come from the units of the supervised system.

We can observe an increase of the average response time of ADS2(Ho) when the transmission delay is close to 15 ut. This is due to the parameter k of our algorithm, a priori fixed to 15 ut. This parameter defines the agent's horizon for the computation of the delayed decision. When the transmission time becomes greater than or equal to k, an agent no longer sees any interest in waiting or trying to exchange information with other agents of ADS2, so it decides to act despite the risk of making a mistake. The impact of parameter k is less important on ADS2(He). Indeed, as the communication links are in this case heterogeneous, a supervision agent is still able to exchange information with some of the other agents. This leads to a better response time for ADS2(He) than for ADS2(Ho). This behaviour is clearly highlighted in figure 8(b). The number of messages exchanged by the agents of ADS2 drops with the increase of the transmission delay. We see a sudden drop of this number when the transmission delay becomes greater than 15 ut for ADS2(Ho), confirming the local decision-making of the agents.

Fig. 8(c) shows that the decisions of the ADS2(He) agents generate a limited repair extra-cost in comparison to the cen-
(a) Response time to a malfunction (b) Communication cost (c) Total cost of repairs
Figure 8: Experimental results. The curves corresponding to homogeneous/heterogeneous communication links are respectively marked with (Ho) and (He). The x-axis is the transmission delay. The y-axis corresponds for each figure to one of the evaluation criteria.
tralised approach (9%). With the Dem@tFactory fault models, the overall gain regarding the supervised system downtime reaches 34%.

Considering the reactivity of ADS2 and the limited repair extra-cost it generates, the communication extra-cost for low transmission delays can be considered an acceptable consequence compared with a total absence of supervision (SC).

Our next set of experiments evaluates the impact of the intertwining of the diagnosis and repair phases on the performances of the supervision system. In order to evaluate the impact of this behaviour for a supervision agent, we initially activated this capability in only one agent of ADS2. We realised 100 simulations. The first 10 simulations are performed with the number of simultaneous faults restricted to 1. Then the number is gradually incremented every 10 simulations to reach 10 simultaneous faults. For each simulation, the number of potential diagnoses considered by the agent is saved at 5 specific time steps. In order to obtain a baseline, we performed the same 100 simulations with the intertwining behaviour deactivated. Fig. 9(a) shows that the interleaving of the diagnosis and repair process does lead to a reduction of the diagnosis search space of an agent of between 10 and 20% for the set of faults of the Dem@tFactory project.

Fig. 9(b) shows that the reduction of the number of potential explanations of each agent is of an extent sufficient to allow the agent to reduce the number of supervision messages. The average response time to a malfunction is not significantly improved (Fig. 9(c)) but the repair extra-cost falls from +9% (ADS2) to +7.2% (ADS2+) (with p < 0.05).

7 Related Work
The supervision of a system consists of four steps: Detection, Isolation, Identification and Repair. The literature aggregates the first three steps under the name FDI (Fault Detection and Isolation) [9]. Although several approaches for the distributed supervision of distributed systems have been proposed in the literature, whether from the diagnosis and control communities [10; 11] or from the multi-agent domain [12; 13], they do not cover the repair phase.

In work from areas related to distributed systems, emphasis is placed on the distribution of available knowledge on the status and behaviour of the supervised system. Fröhlich et al. [14] and Roos et al. [13] have addressed the question of the ability of a set of agents to determine an overall diagnosis according to the shape of this distribution. They have shown that obtaining a minimum overall diagnosis is NP-hard in the case of spatially distributed information, and that the complexity of obtaining the diagnosis is independent of the communication costs engendered during its establishment [13].

Given these theoretical results, reducing the space of potential solutions is generally based on a hierarchical structure of the diagnosis agents [12] and on the choice of not returning to previously excluded explanations. Though the absence of back-tracking over past decisions guarantees convergence and termination of the algorithm, it is a source of diagnosis errors in an asynchronous environment. The best-effort approach we chose allows us to reduce these diagnosis errors, and the termination of the diagnosis algorithm is guaranteed through the anytime decision-making process of our agents.

To our knowledge, the work of Nejdl et al. [15] is the only one that addresses the distribution of both the diagnosis and repair phases. However, placed at a relatively abstract level of analysis, this work makes the assumptions that communication links are reliable and that messages can be exchanged between agents at no cost. In a real situation these hypotheses are too restrictive. Indeed, not considering the communication state may render the supervision system ineffective or inoperable. Our proposal does not make such assumptions.

The problem of online decision-making under uncertainty is the central point of the work by Horvitz [16] and by Hansen and Zilberstein [17] on the control of anytime algorithms. Indeed, they propose a formal framework to dynamically determine the time to stop a calculation, taking into account the quality of the current solution and the cost of the algorithm computation.

The first distinction between these works and ours is that in their work the authors determine when to stop the computation based on the distance between the current solution and the optimal one. This requires knowing the optimal solution (or an estimation) and being able to dynamically determine this distance. In our work, talking about the quality of a solution (i.e. a diagnosis) is meaningless insofar as a diagnosis is right or wrong, and its "value" is only known a posteriori. The second point of divergence is that we try to select a candidate (a diagnosis or a repair) among a set of potential solutions. The complexity of the task is therefore increased.
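The stop-computation idea behind the cited anytime frameworks can be illustrated with a toy sketch. The quality profile and the linear time cost below are our own assumptions, not the models of [16; 17]: computation continues while the expected marginal quality gain of one more step exceeds its time cost.

```python
# Toy illustration of an anytime stopping rule: keep computing while
# one more step is expected to improve the solution by more than the
# step costs. The quality profile here is an assumed example.
import math

def expected_quality(t):
    # Assumed diminishing-returns performance profile of an anytime algorithm.
    return 1.0 - math.exp(-0.5 * t)

def stop_time(time_cost_per_step, horizon=100):
    for t in range(horizon):
        marginal_gain = expected_quality(t + 1) - expected_quality(t)
        if marginal_gain < time_cost_per_step:
            return t  # further computation costs more than it improves
    return horizon

# The cheaper a time step, the longer the algorithm is allowed to run.
assert stop_time(0.01) > stop_time(0.2)
```

As the paragraph above notes, this scheme presupposes a meaningful quality measure for partial solutions, which is exactly what a right-or-wrong diagnosis lacks.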
(a) Size of the potential diagnosis set (b) Response time to a malfunction (c) Communication cost
Figure 9: Simulation results when integrating the interleaving of the diagnosis and repair steps. The symbol "+" is associated with the components which integrate this additional mechanism.
8 Conclusions
We presented the first anytime multi-agent architecture for the supervision of distributed systems that is able to dynamically adapt its behaviour to the current state of the supervised system. In particular, the decision model allows each supervision agent to find a balance between a quick local diagnosis and repair under uncertainty, and a delayed, systemic one, based on the respective costs of misdiagnosis and communication. The distributed consistency algorithm allows each agent to form a coalition to reduce its uncertainty or to restore a consistent view of the system state in case some agents had to act locally with incomplete information at an earlier stage. Moreover, the intertwining of the diagnosis and the repair phases allows an efficient reduction of the diagnosis search space. The overall reduction of 34% of the Dem@tFactory system downtime, associated with a repair extra-cost of 7.2%, demonstrates that ADS2 is able to efficiently supervise complex systems under real-life assumptions.

A fully autonomous supervision system is presently not realistic in an industrial context, as humans want to keep control over what they perceive as critical decisions. ADS2 represents what we see as an acceptable trade-off, as the definition of its autonomy degree can be easily accomplished. Thus ADS2 organizes the set of known faults and repairs in several subclasses: the ones whose repair plan can be triggered automatically, and those whose final repair decision rests with a human supervisor. The risk aversion of the users defines the size of these two respective sets. Even if confidence in the efficiency of the autonomous supervision of complex and distributed systems is not common today, we believe that the work presented herein provides a step towards this goal.

References
[1] J. O. Kephart and D. M. Chess. The vision of autonomic computing. Computer, 36(1):41–50, 2003.
[2] M. Salehie and L. Tahvildari. Self-adaptive software: Landscape and research challenges. ACM Trans. Auton. Adapt. Syst., 4(2):1–42, 2009.
[3] M. J. Fischer, N. A. Lynch, and M. S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, 1985.
[4] L. Lamport. Paxos made simple. ACM SIGACT News, 32:18–25, 2001.
[5] R. De Prisco, B. Lampson, and N. Lynch. Revisiting the Paxos algorithm. Distributed Algorithms, pages 111–125, 2000.
[6] M.-O. Cordier, Y. Pencolé, L. Travé-Massuyès, and T. Vidal. Self-healability = diagnosability + repairability. In The 18th International Workshop on Principles of Diagnosis, volume 7, pages 251–258, 2007.
[7] P. Koopman. Elements of the self-healing system problem space. In ICSE WADS'03, 2003.
[8] K. Chmiel, M. Gawinecki, P. Kaczmarek, M. Szymczak, and M. Paprzycki. Efficiency of JADE agent platform. Volume 13, pages 159–172. IOS Press, 2005.
[9] G. Betta and A. Pietrosanto. Instrument fault detection and isolation: state of the art and new research trends. Volume 49, pages 100–107, 1998.
[10] S. Lafortune, D. Teneketzis, M. Sampath, R. Sengupta, and K. Sinnamohideen. Failure diagnosis of dynamic systems: an approach based on discrete event systems. In Proc. American Control Conference, volume 3, pages 2058–2071, June 25–27, 2001.
[11] C. G. Cassandras and S. Lafortune. Introduction to Discrete Event Systems. Springer, 2008.
[12] H. Wörn, T. Längle, M. Albert, A. Kazi, A. Brighenti, S. Revuelta Seijo, C. Senior, M. A. S. Bobi, and J. V. Collado. Diamond: distributed multi-agent architecture for monitoring and diagnosis. Production Planning & Control, 15:189–200, 2004.
[13] N. Roos, A. ten Teije, A. Bos, and C. Witteveen. A protocol for multi-agent diagnosis with spatially distributed knowledge. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems, page 7. ACM, 2003.
[14] P. Fröhlich and W. Nejdl. Resolving conflicts in distributed diagnosis. In ECAI Workshop on Modelling Conflicts in AI, 1996.
[15] W. Nejdl and M. Werner. Distributed intelligent agents for control, diagnosis and repair. RWTH Aachen, Informatik, Tech. Rep., 1994.
[16] E. Horvitz and G. Rutledge. Time-dependent utility and action under uncertainty. Pages 151–158, 1991.
[17] E. A. Hansen and S. Zilberstein. Monitoring and control of anytime algorithms: A dynamic programming approach. Artificial Intelligence, 126:139–157, 2001.
Data Driven Modeling for System-Level Condition Monitoring on Wind Power
Plants
Jens Eickmeyer¹, Peng Li², Omid Givehchi², Florian Pethig¹ and Oliver Niggemann¹,²
¹ Fraunhofer Application Center Industrial Automation IOSB-INA
e-mail: {jens.eickmeyer, florian.pethig, oliver.niggemann}@iosb-ina.fraunhofer.de
² inIT - Institute Industrial IT
e-mail: {peng.li, omid.givehchi, oliver.niggemann}@hs-owl.de
Abstract
The wind energy sector grew continuously in the last 17 years, which illustrates the potential of wind energy as an alternative to fossil fuel. In parallel to physical architecture evolution, the scheduling of maintenance optimizes the yield of wind power plants. This paper presents an innovative approach to condition monitoring of wind power plants that provides system-level anomaly detection for preventive maintenance. First, a data-driven modeling algorithm is presented which utilizes generic machine learning methods. This approach allows to automatically model a system in order to monitor the behavior of a wind power plant. Additionally, this automatically learned model is used as a basis for the second algorithm presented in this work, which detects anomalous system behavior and can alarm its operator. Both presented algorithms are used in an overall solution that neither relies on specialized wind power plant architectures nor requires specific types of sensors. To evaluate the developed algorithms, two well-known clustering methods are used as a reference.

1 Introduction
According to a wind market statistic by the GWEC (Global Wind Energy Council) [1], the global wind power capacity grew continuously for the last 17 years. In 2014, the global wind industry had a 44% rise of annual installations, and the worldwide total installed capacity accumulated to 369,553 megawatts at the end of 2014. In Europe, renewable energy from wind power plants (WPP) covers up to 11% of the energy demand [2]. With this rapid continuous growth, wind power is considered as one of the most competitive alternatives to fossil fuels.

In a case study, Nilsson [3] denotes an unscheduled downtime with 1000 € per man-hour, with costs of up to 300,000 € for replacements. This does not take into account the reduced yield through production loss. Therefore, the objective of maintenance is to reduce WPP downtimes and provide high availability and reliability.

High availability is currently achieved by two different strategies. On the one hand, maintenance is planned at regular time intervals based on the manufacturer's data for specific WPP parts. This is performed in order to prevent wearout failure. On the other hand, there is the strategy of corrective maintenance, which reacts to occurred failures. Both strategies need time for actual maintenance, which leads to non-productive downtimes. Especially when considering offshore WPP, these downtimes produce high costs.

To reduce these downtimes, a precise proactive scheduling of maintenance tasks is needed. This is achieved through condition monitoring (CM) systems [4]. Those systems try to reason about inherent system states such as wear. Although these conditions cannot be measured directly, the growing number of sensors in modern WPP enables an adequate description of the machine's state. To make use of this, CM systems need a model of the WPP which describes the system behavior based on observed data.

Existing CM solutions for WPP rely on specific sensors and are specialized to monitor single parts of the system. The gearbox [5], the bearings [6], the generator [7] or the blades [8] have been monitored in order to perform proactive maintenance. Here, specific sensors are needed as a requirement for these specialized methods.

This article presents a system-level solution which handles heterogeneous WPP architectures regardless of installed sensor types. Also, an algorithm for modeling a WPP on system level and another algorithm for anomaly detection are stated. To achieve this, three challenges are tackled and their solutions are presented:

I. Logging data from available sensors of a WPP, using existing infrastructure independent of the architecture. Additionally, the opportunity must be given to add new sensors and sensor types on demand.

II. Automatic modeling of a WPP, by combining existing and generic data-driven methods. Such a model must be able to learn the complex sensor interdependencies without extra manual effort.

III. Anomaly detection for a WPP regardless of its kind of architecture, especially with no assumptions on available types of sensors.

The article is structured as follows. Section 2 deals with state-of-the-art technology in WPP CM. Hardware and data acquisition for the presented solution are specified in section 3, where point I is the central issue. Data-driven models realizing point II and the analyzed machine learning approaches are the purpose of section 4. Anomaly detection and its general approach, according to point III, is stated in section 4.2. The results of an evaluation of the presented methods are the content of section 5. Finally, this paper concludes in section 6 and describes future aims of the presented work.
2 Related Work
The core task of a CM system is anomaly detection. As stated in [9], the models used for anomaly detection of complex systems should be learned automatically, and data-driven approaches to learning such models should be moved into the research focus.

A wide range of data-driven algorithms that deal with modeling the system behavior for anomaly detection are available in the literature.

Because of their simplicity in processing huge amounts of data, Principal Component Analysis (PCA) based algorithms are widely applied in the condition monitoring of WPP [10][11].

As one of the classic density-based clustering methods, DBSCAN shows its advantages over statistical methods on anomaly detection in temperature data [12].

Piero and Enrico proposed a spectral clustering based method for fault diagnosis where fuzzy logic is used to measure the similarity and fuzzy C-Means is used for clustering the data [13].

Due to the high complexity of a WPP and its harsh working environment, the modeling of WPPs on system level is very challenging. Most data-driven solutions to WPP condition monitoring concentrate on the errors of one particular component (on component level) [4]. These methods are designed to detect specific faults (e.g. a fault in the gearbox or the generator).

The application of such methods is available in different studies. In [6], a shock pulse method is adapted for bearing monitoring. A multi-agent system is developed in [5] for condition monitoring of the wind turbine gearbox and oil temperature. In [8], ultrasonic and radiographic techniques are used for non-destructive testing of the WPP blades. Using these methods can prevent WPP breakdowns caused by the particular faults. For enhancing the availability and the reliability of the whole WPP, a method for monitoring the WPP on system level is desired.

In this work, a PCA-based algorithm for condition monitoring of WPP is presented. This approach aims to model a WPP on system level in order to perform automatic anomaly detection. As a comparison, DBSCAN and spectral clustering are utilized for the same purpose. To the best of our knowledge, no application of either DBSCAN or spectral clustering in condition monitoring of WPP exists.

3 Data Acquisition Solution
A WPP includes different types of sensors, actuators and controllers installed to monitor and control the different devices and components, as shown in Figure 1. To monitor the condition of a WPP, it is necessary to collect process data from its sensors and components accurately and continuously feed this data to the diagnosis algorithms. To maximize accuracy, data should be acquired directly from the sensors and components or via the existing communication systems. Despite the fact that IEC 61400-25 [14] addresses a variety of standards and protocols in WPP, lots of proprietary solutions exist today. A general approach to accurate data acquisition in a uniform way implies protocol adapters or data loggers (DL) to connect the diagnosis framework. This is done not only for IEC 61400-25 conformant WPP, but also for proprietary ones using e.g. the MODBUS protocol or a direct connection via general-purpose input/output (GPIO) [15]. Also, the data logger should model data based on generic industrial standards (IEC 61400-25) and transfer them to a database for storage and processing. Such a data logger meets point I (see section 1). In addition, the timestamps of the data should be synchronized accurately between data loggers, database and application.

Figure 1: Diagram showing the inside of a nacelle and main components [4]

In this work we followed a three-layer architecture for data acquisition, as shown in Figure 2, which covers all of the CM system components. In layer 1, the physical machine components are connected to the data logger hardware using different industrial connections and protocols, e.g. digital GPIO, RS485, MODBUS, etc. The data loggers are time-synchronized using global positioning system (GPS) or network time protocol (NTP) time references via an embedded time client running in the data logger. Collected sensor data is attached to its accurate timestamps by an embedded OPC UA server inside the data logger. The sensor data is categorized based on an OPC Unified Architecture (OPC UA) data model (e.g. conformant to IEC 61400-25) for a standalone WPP.

The communication between data logger, OPC UA server and layer 2 is realized with a secure general packet radio service (GPRS) network or a virtual private network (VPN), so that it can be accessed for widely distributed WPPs in different geographical locations. Layer 2 comprises a middleware to collect and host the sensor data coming from the distributed data loggers. It mainly covers a database with support for historical data, and also an OPC UA server aggregating the data incoming from the distributed WPP data loggers and pushing it to the database using an OPC UA database wrapper. As shown in Figure 2, the main component of layer 3 is an analysis engine. This engine applies algorithms on the database. Based on the learned machine models, an output about the machine's condition is presented to the operator by a human machine interface (HMI).

4 Modeling Solution
The main idea of the presented solution is to automatically learn a model of normal system behavior from the observed data using data-driven methods. Classical manual modeling utilizes expert process knowledge to build a simulation model as a reference for anomaly detection. But a process such as a WPP contains numerous continuous sensor values, which makes it difficult to model the system manually. Therefore, as a first step of the solution, a model is learned from a set of data. The second step utilizes this model as a reference to perform anomaly detection. This section considers these two steps.
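The interplay of layers 1 and 2 described in section 3 can be sketched as follows. The class and tag names are ours; the real system uses embedded OPC UA servers and an IEC 61400-25 data model, which this toy in-memory store does not implement.

```python
# Minimal sketch: a layer-1 data logger stamps each sensor sample with
# a synchronized clock, and a layer-2 middleware aggregates the samples
# from distributed loggers per plant into a historical store.
import time
from collections import defaultdict

class DataLogger:
    """Layer 1: stamps each sensor sample with a synchronized clock."""
    def __init__(self, plant_id, clock=time.time):
        self.plant_id = plant_id
        self.clock = clock  # would be GPS- or NTP-disciplined in practice

    def sample(self, tag, value):
        return {"plant": self.plant_id, "tag": tag,
                "t": self.clock(), "value": value}

class Middleware:
    """Layer 2: collects and hosts samples from distributed loggers."""
    def __init__(self):
        self.history = defaultdict(list)  # plant id -> ordered samples

    def push(self, record):
        self.history[record["plant"]].append(record)

mw = Middleware()
dl = DataLogger("WPP-01", clock=lambda: 1000.0)  # fixed clock for the example
mw.push(dl.sample("RotorSpeed", 12.4))           # illustrative tag name
```

The layer-3 analysis engine of Figure 2 would then read such per-plant time series from the database in place of this in-memory dictionary.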
Figure 2: Architecture overview of the presented system-level condition monitoring solution for a WPP

4.1 Step 1: Data-Driven Modeling
In order to automatically compute a system model, the presented solution uses generic methods to analyze training data and aims for process knowledge. These methods from the field of machine learning reduce the time effort for generating a system model caused by the complex sensor interdependencies. Additionally, a WPP is influenced by seasonal components, and a normal state of work cannot be declared as precisely as for a machine that works in the homogeneous environment of a factory. This meets the requirement in point II (see section 1). In this solution, step 2 detects anomalies as deviations between an observation and the learned reference model of the system; this is described in section 4.2.

Common strategies for data-driven modeling are supervised and unsupervised learning methods. Supervised methods such as the Multilayer Perceptron, Support Vector Machines or the Naive Bayes Classifier (see [16] for more information) can be used to directly classify data according to learned hyperplanes in the data space. To be reliable, those methods need a-priori knowledge from labeled data of possible faults and the normal state. Gathering those precise data for a continuous production system like a WPP is hard to realize, as faults are rare and environmental conditions

Clustering based modeling
The goal of cluster analysis is to partition data points into different groups. Similarity of points is defined by a minimal intra-cluster distance, whereas different clusters aim for a maximum inter-cluster distance. Thus, cluster analysis can be utilized to find the patterns of a system directly using the multi-dimensional data, without explicit descriptions of the system features. This is the main advantage in using cluster analysis for modeling complex systems with seasonal components, e.g. WPP.

In the presented solution, a system model for anomaly detection should characterize the normal system behavior and can be used to identify unusual behavior. For most complex systems, the normal behavior might consist of multiple modes that depend on different factors, e.g. work environments and operations of the systems. When cluster analysis is performed on a data set representing the normal behavior of a system, multiple clusters can be recognized. Each cluster (group) represents a particular status of the system. Such multiple clusters can then be used as the normal behavior model of a system for anomaly detection.

In this paper, two well-known clustering algorithms, DBSCAN and spectral clustering, are utilized to model the normal behavior of a WPP on system level. Each of them has advantages in clustering data with complex correlations.

DBSCAN is resistant to noise and can recognize patterns of arbitrary shapes. In DBSCAN, the density for a particular point is defined as the number of neighbor points within a specified radius of that point [17]. Two user-defined parameters are required: Eps, the radius, and MinPts, the minimal number of neighbors within Eps. DBSCAN uses such center-based density to classify the data points as core points (Eps-neighbors ≥ MinPts), border points (not a core point but the neighbor of at least one core point) or noise points (neither a core nor a border point). Two core points that are within Eps of each other are defined as density-reachable core points. DBSCAN partitions the data into clusters by iteratively labeling the data points and collecting density-reachable core points into the same cluster. As a result, DBSCAN delivers several clusters, in which the noise points are also collected in a cluster. DBSCAN is not suitable for clustering high-dimensional data because density is more difficult to define in high-dimensional space. Therefore, a method to reduce dimensionality should be applied to the data before using DBSCAN. This leads to a density-based description of the normal behavior.

This method assumes that the training data perfectly describe the distribution of the system's normal states. For WPP, some special states of the plant occur so rarely that the recorded data can not represent such special states very well. In addition, environmental influences lead to noise points within the data set. Therefore, a complete coverage of the normal states of a WPP in the learning data set is unrealistic to
increase the number of possible faults dramatically. achieve.
In comparison, unsupervised learning methods (e.g. Compared to the traditional approaches to clustering (e.g.
Clustering, Self Organizing Maps) seek to model data with- k-means, DBSCAN), spectral clustering can generally de-
out any a-priori knowledge. Therefore, they are able to liver better results and can be solved efficiently by standard
extract knowledge from unlabeled data sets and generate linear algebra methods[18]. Another advantage of spectral
a model out of this knowledge. In this article, two types clustering is the ability to handle the high dimensional data
of unsupervised learning methods are investigated to model using spectral analysis. Thus, extra dimensionality reduc-
a WPP using unlabeled data. The PCA based modeling is tion method is not required. The idea of spectral cluster-
compared against cluster based modeling methods, which ing is to represent the data in form of a similarity graph
are used as reference. G(V, E) where each vertex vi ∈ V presents a data point
Proceedings of the 26th International Workshop on Principles of Diagnosis
Each edge e_ij ∈ E between two vertices v_i and v_j carries a non-negative weight w_ij (the similarity between the two points). The clustering problem can then be handled as a graph partition problem [19]: G is divided into smaller components such that the vertices within each component are highly connected while there are few connections between components. These components correspond to the clusters in the result of spectral clustering and can be used as the normal status model for anomaly detection.

PCA based modeling
Algorithm 1 presents the stated modeling solution for a system-level approach to a WPP. The algorithm utilizes Principal Component Analysis (see line 5 of algorithm 1) as the very first step to achieve a dimensionally reduced description of the training data set. Although a part of the information is lost due to the reduction, the sensor correlations in the low dimensional space are reduced drastically, which minimizes the computational effort.

Algorithm 1 PCA based modeling
1: Input: X                ▷ learning data set
2: Output: Model_X         ▷ model of the input data
3: procedure PCA_BASED_MODELING(X)
4:   l: reduced dimensionality
5:   PCA_Matrix = performPCA(X)
6:   X_PCA = mapToLowDimension(X)
7:   Model_X = generate_N-Tree(X_PCA)
8: end procedure
9: function GENERATE_N-TREE(X_PCA)
10:   Tree: list of length 2^l
11:   for (x_pca in X_PCA) do
12:     i = determine_orthant(x_pca)
13:     Tree_i = append(Tree_i, x_pca)
14:   end for
15:   for (leaf in Tree) do
16:     if (sizeOf(leaf) > 1) then
17:       leaf = generate_N-Tree(leaf)
18:     end if
19:   end for
20:   return (Tree, PCA_Matrix)
21: end function

The PCA is based on the assumption that most of the information is located in the directions of largest variance. The method therefore projects a data set to a subspace of lower dimension by minimizing the sum of squared distances between the data points y_i and their projections θ_i, i.e. the cost function

    Σ_{i=1}^{m} ||y_i − θ_i||².

Let x_1, ..., x_m be the data point of m sensor values and X a historical dataset of N scaled data points,

    X = [ x_{1,1} ... x_{1,m} ; ... ; x_{N,1} ... x_{N,m} ] ∈ R^{N×m}.

Then, as the first step of computing the PCA, the covariance matrix is formed as

    Σ_0 ≈ (1/(N−1)) · X^T X.

By means of EVD (eigenvalue decomposition) or the equivalent SVD (singular value decomposition), the covariance matrix is decomposed as

    Σ_0 = P Λ P^T,   Λ = [ Λ_pc 0 ; 0 Λ_res ],

with Λ = diag(σ_1², σ_2², ..., σ_m²), σ_1² < σ_2² < ... < σ_m², where σ_i², i = 1, ..., m, is the i-th eigenvalue and P is the matrix of eigenvectors, sorted according to the eigenvalues in Λ. Λ_pc holds the chosen principal components according to a threshold l, and Λ_res denotes the less informative rest; l is a parameter which depends on the eigenvalues' proportion of the total variance and determines the dimension of the reduced normal space. The transformation

    Y = P^T X

maps the m-dimensional dataset X into a dataset Y of lower dimension l with a minimum of information loss. The axes of the dimensionally reduced data space are orthonormal and aligned with the directions of maximum variance of the data.

The prerequisite for modeling a WPP with this kind of transformation is input data from which the eigenvalues and the rotation matrix are calculated. Therefore, the presented data set of a WPP needs to describe a period of fault-free operation, which is denoted by the term 'normal state'. Using this data set as a learning base, the PCA described above spans a reduced normal state space in which the signal covariances are taken into account, since the eigenvectors of the covariance matrix form the basis of the transformation. The input variables are transformed in line 6 of algorithm 1.

In comparison to clustering methods, only the covariance matrix stores explicit shape information. This makes it necessary to take all data points into account when classifying a new observation, which is why the computational effort for this model increases with the number of data points in the data set and their dimension. To overcome this issue, the model is extended with an N-Tree as a geometrical data structure (see function generate_N-Tree in algorithm 1): the axes of the PCA transformed normal state space divide the data into 2^l subspaces, and centering these subspaces in each iteration divides them recursively until each leaf of the tree contains one data point or is empty. Note that the mean of each subspace needs to be stored.

4.2 Step 2: Anomaly Detection
To comply with point III (see section 1), the prerequisite for system-level anomaly detection is a data-driven model as stated above. Given such a model, a distance measure is needed to calculate the deviation between a new system observation and the model in order to identify anomalies. Therefore, an observation vector is first transformed into the dimensionally reduced space of the model. Then the deviation between the actual observation and the learned model can be calculated using a distance metric such as the Euclidean, Mahalanobis or Manhattan distance.

The clusters generated by DBSCAN provide a discrimination between core and border data points. Distance computation in DBSCAN uses the Euclidean metric, and only core points are used to measure the distance between an observation and the core points; this leads to the decision whether the observation is part of the model's cluster or not.
Spectral clustering computes clusters in a dimensionally reduced space but gives no further information about core or border points. Measuring the distance to such clusters can be achieved via a prototype, for example the cluster center. Then, for computing the distance, a metric like the Mahalanobis distance is used, which is sensitive to the multidimensionality of such a cluster. Representing a cluster by a prototype is a generalization.

Algorithm 2 Anomaly detection
1: Input: Tree             ▷ learned model, see algorithm 1
2: Input: O                ▷ input observation
3: Output: Boolean         ▷ anomaly
4: procedure ANOMALY_DETECTION(Tree, O)
5:   O_PCA = mapToLowDimension(O)
6:   subset = get_subset(Tree, O_PCA)
7:   dist = calculate_distance(O_PCA, subset)
8:   if (dist > 0) then
9:     anomaly: TRUE
10:  else
11:    anomaly: FALSE
12:  end if
13:  return (anomaly)
14: end procedure
15: function GET_SUBSET(Tree, O_PCA)
16:   i = determine_orthant(O_PCA)
17:   if (size(leaf_i) > 1) then
18:     get_subset(leaf_i)
19:   else
20:     subset = neighbors(leaf_i)
21:   end if
22:   return (subset)
23: end function

The PCA based modeling approach uses the dimensionally reduced input data as the description of the multidimensional normal state space. Algorithm 2 shows how the model computed with algorithm 1 is used for anomaly detection. First, a new observation is mapped to the low dimensional space of the model using the rotation matrix from the PCA (see line 5). Then the mapped observation is compared with the normal state space: the N-Tree is searched for the corresponding subset (see function get_subset), and if an empty leaf is found, all neighbor leaves are aggregated into a most relevant subset of data points. As the data is not generalized by border points or cluster means as prototypical points, it is necessary to measure the distance of the observation to each point of this subset. Then the distance is computed (see line 7).

Absolute distance measuring lacks a threshold for deciding when an observation matches the model. Even when utilizing a Gaussian density function as an indicator for classification, a threshold still needs to be estimated. In this project, a Marr wavelet function is used to decide whether a new observation is part of the learned normal space: instead of a Gaussian distribution, the characteristic form of the Marr wavelet [20] allows a classification where the threshold can be set to zero, see figure 3.

Figure 3: Characteristics of the Gaussian distribution in comparison to the Marr wavelet (dashed). The spots where the Marr wavelet reaches zero are marked.

Taking into account the Marr wavelet and the Euclidean distance function, the distance measure is computed as follows. Let Xpca = [x_1, ..., x_l] be a vector of the model's principal normal space and Opca = [o_1, ..., o_l] a transformed observation, where l denotes the number of principal components. Then the distribution function measuring whether a new observation is part of the normal state space is formed as

    ψ(Xpca, Opca) = 2 / (√(3σ) · π^(1/4)) · (1 − k²/σ²) · exp(−k² / (2σ²)),

where l denotes the dimension of the reduced normal space and

    k = √( Σ_{i=1}^{l} (Opca_i − Xpca_i)² )

is the Euclidean distance in the l-dimensional space. For ψ > 0, an observation Opca in principal space is considered part of the normal state space (see line 17).

5 Results
The data used in the evaluation was collected over a duration of 4 years from 11 real WPPs in Germany at 10 minute resolution. The dataset consists of 12 variables which describe the work environment (e.g. wind speed, air temperature) and the status of the WPP (e.g. power capacity, rotation speed of the generator, voltage of the transformer).

For the evaluation, a training data set of 232749 observations at 10 minute resolution was used to model the normal behavior of a WPP. The evaluation data set of 11544 observations contains 4531 reported failures and 7013 observations of normal behavior. Table 1 shows the confusion matrix [21] resulting from the evaluation; true negative denotes a correctly predicted normal state and true positive a correctly classified failure. For this use case, the F1-score is used to analyze the system's performance in anomaly detection. The runtime of the evaluation is also given in Table 1 to compare the speed of the different analyzed methods. As can be seen, the presented PCA based algorithm outperforms the standard spectral clustering; in particular, a significant boost in computation time is achieved due to the extended N-Tree data structure.

Both DBSCAN and spectral clustering rely on complete sensor information for clustering the data set. A defect sensor leads to a maintenance action.
                      True Pos.  True Neg.  False Pos.  False Neg.  Bal. Acc.  F-Measure  Elapsed time
DBSCAN                   1812      6827        186         2719      68.66%     55.50%         3 s
Spectral clustering      3832      6328        685          699      87.40%     84.71%      6637 s
PCA based                3970      6517        496          561      90.27%     88.25%        68 s

Table 1: Evaluation results of wind power station data.
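The Bal. Acc. and F-Measure columns of Table 1 follow directly from the four counts in each row; as a quick sketch (using the PCA based row):

```python
def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (recall on failures) and specificity."""
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f_measure(tp, fp, fn):
    """F1-score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# PCA based row of Table 1
tp, tn, fp, fn = 3970, 6517, 496, 561
print(round(balanced_accuracy(tp, tn, fp, fn), 4))  # 0.9027
print(round(f_measure(tp, fp, fn), 4))              # 0.8825
```

Balanced accuracy is used because the evaluation set is imbalanced (4531 failures vs. 7013 normal observations), so plain accuracy would favor the majority class.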
The delay of this maintenance depends on the location of the WPP and causes missing sensor values for a certain time. To be operable in the use case of a WPP, such a model needs a fallback strategy in case of missing sensor values. Here, the redundancy and correlation of different sensors come in handy. By extending the PCA to a Probabilistic Principal Component Analysis (PPCA), missing values can be estimated according to what was learned from the data set. Tipping and Bishop [22] extend the classic PCA by a probability model, which assumes Gaussian distributed latent variables that can be inferred from the existing variables and the matrix of eigenvectors from the PCA. With the use of PPCA, the system-level solution is robust enough to stay reliable even when sensor values are missing. This was tested by training the model with a defective data set containing 10% missing sensor values. During the evaluation of this model, 10% of the data was likewise damaged to simulate missing sensor values. The result of this evaluation is presented in table 1.

6 Conclusion
In this work a solution for system-level anomaly detection was presented. Three main requirements were identified and satisfied. First, a hardware concept for sensor data acquisition in the heterogeneous environment of WPPs was developed; this hardware logs existing sensor values and offers an adaptive way to integrate new sensors on demand. Second, generic data-driven algorithms that automatically compute a system-level model out of minimally labeled historical sensor data were presented. Finally, an anomaly detection method was shown which reaches an F-Measure of 89.02% and a balanced accuracy of 91.46%. This solution is not specialized for specific parts of a WPP and can be trained in a short period. With the extension of the standard PCA to a probabilistic PCA, the robustness of the algorithmic solution against sensor failures is ensured.

In the future, this solution will be evaluated using data from more WPPs with different working environments. Beyond the task of anomaly detection, diagnosing the root cause of an anomaly is also a sensible functionality of a CM system, so the presented solution will be extended by a root cause analysis; such an extension can support maintenance personnel in tracing a detected anomaly. Another focus will be the prognosis of anomalies in a WPP. To achieve this, an appropriate algorithm will be developed to predict the future system status using the learned model of the system behavior.

Acknowledgments
Funded by the German Federal Ministry for Economic Affairs and Energy, KF2074717KM3 & KF2074719KM3.

References
[1] Global Wind Energy Council. Global wind statistics 2014. Available online at http://www.gwec.net/wp-content/uploads/2015/02/GWEC_GlobalWindStats2014_FINAL_10.2.2015.pdf, March 2015.
[2] European Wind Energy Association et al. Wind in power, 2012 European statistics, 2013. Available online at http://www.ewea.org/fileadmin/files/library/publications/statistics/Wind_in_power_annual_statistics_2012.pdf, March 2015.
[3] Julia Nilsson and Lina Bertling. Maintenance management of wind power systems using condition monitoring systems - life cycle cost analysis for two case studies. IEEE Transactions on Energy Conversion, 22(1):223-229, 2007.
[4] Wenxian Yang, Peter J Tavner, Christopher J Crabtree, Y Feng, and Y Qiu. Wind turbine condition monitoring: technical and commercial challenges. Wind Energy, 17(5):673-693, 2014.
[5] AS Zaher and SDJ McArthur. A multi-agent fault detection system for wind turbine defect recognition and diagnosis. In Power Tech, 2007 IEEE Lausanne, pages 22-27. IEEE, 2007.
[6] Li Zhen, He Zhengjia, Zi Yanyang, and Chen Xuefeng. Bearing condition monitoring based on shock pulse method and improved redundant lifting scheme. Mathematics and Computers in Simulation, 79(3):318-338, 2008.
[7] Saad Chakkor, Mostafa Baghouri, and Abderrahmane Hajraoui. Performance analysis of faults detection in wind turbine generator based on high-resolution frequency estimation methods. arXiv preprint arXiv:1409.6883, 2014.
[8] E. Jasiūnienė, R. Raišutis, A. Voleišis, A. Vladišauskas, D. Mitchard, M. Amos, et al. NDT of wind turbine blades using adapted ultrasonic and radiographic techniques. Insight - Non-Destructive Testing and Condition Monitoring, 51(9):477-483, 2009.
[9] Oliver Niggemann and Volker Lohweg. On the diagnosis of cyber-physical production systems: state-of-the-art and research agenda. In Association for the Advancement of Artificial Intelligence (AAAI), 2015.
[10] O. Bennouna, N. Heraud, and Z. Leonowicz. Condition monitoring and fault diagnosis system for offshore wind turbines. In Environment and Electrical Engineering (EEEIC), 2012 11th International Conference on, pages 13-17, May 2012.
[11] Ning Fang and Peng Guo. Wind generator tower vibration fault diagnosis and monitoring based on PCA. In Control and Decision Conference (CCDC), 2013 25th Chinese, pages 1924-1929, May 2013.
[12] M. Celik, F. Dadaser-Celik, and A.S. Dokuz. Anomaly detection in temperature data using DBSCAN algorithm.
In Innovations in Intelligent Systems and Applications
(INISTA), 2011 International Symposium on, pages
91–95, June 2011.
[13] P. Baraldi, F. Di Maio, and E. Zio. Unsupervised clus-
tering for fault diagnosis. In Prognostics and System
Health Management (PHM), 2012 IEEE Conference
on, pages 1–9, May 2012.
[14] Karlheinz Schwarz. IEC 61850, IEC 61400-25 and IEC 61970: Information models and information exchange for electric power systems. Proceedings of the DistribuTECH, pages 1-5, 2004.
[15] Richard A Zatorski. System and method for control-
ling multiple devices via general purpose input/output
(gpio) hardware, June 27 2006. US Patent 7,069,365.
[16] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
Introduction to Data Mining. Addison-Wesley, 2006.
[17] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A
density-based algorithm for discovering clusters in
large spatial databases with noise. In Proceedings of
the 2nd International Conference on Knowledge Dis-
covery and Data Mining (KDD96), 1996.
[18] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4), 2007.
[19] Shifei Ding, Liwen Zhang, and Yu Zhang. Research on spectral clustering algorithms and prospects. In Computer Engineering and Technology (ICCET), 2010 2nd International Conference on, volume 6, 2010.
[20] Lei Nie, Shouguo Wu, Jianwei Wang, Longzhen
Zheng, Xiangqin Lin, and Lei Rui. Continuous
wavelet transform and its application to resolving and
quantifying the overlapped voltammetric peaks. Ana-
lytica chimica acta, 450(1):185–192, 2001.
[21] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar.
Introduction to Data Mining. Addison-Wesley, 2006.
[22] Michael E Tipping and Christopher M Bishop. Prob-
abilistic principal component analysis. Journal of the
Royal Statistical Society: Series B (Statistical Method-
ology), 61(3):611–622, 1999.
Using Incremental SAT for Testing Diagnosability of Distributed DES
Hassan Ibrahim^1, Philippe Dague^1 and Laurent Simon^2
1 LRI, Univ. Paris-Sud and CNRS, Orsay, France
  hassan.ibrahim@lri.fr, philippe.dague@lri.fr
2 LaBRI, Univ. Bordeaux and CNRS, Bordeaux, France
  lsimon@labri.fr

Abstract
We extend in this work the existing approach to analysing diagnosability in discrete event systems (DES) using satisfiability (SAT) algorithms, in order to analyse diagnosability in distributed DES (DDES), and we test this extension. For this, we handle observable and non observable communication events at the same time. We also propose an adaptation using incremental SAT over the existing and the extended approaches to overcome some of their limitations, especially concerning the length and the distance of the cycles that witness the non diagnosability of a fault, and to improve the process of dealing with the reachability limit when scaling up to large systems.

1 Introduction
The diagnosis task mainly uses the available observations to explain the difference between the expected behavior of a system and its real behavior, which may contain some faults. Many works have studied automatic approaches to system fault diagnosis. They all try to deal with the main problem, i.e. the compromise between the number of possible diagnoses for the considered faults and the number of observations which must be given to make the decision. The diagnosis problem is NP-hard and one always needs to cope with an explosion in the number of system model states. Moreover, the diagnosis decision is not always certain, and thus running a diagnosis algorithm may not be accurate. For example, two sets of observations provided by different sets of sensors or at different times may lead to different diagnoses. This uncertainty raises the problem of diagnosability, which is essential while designing the system model. After that, model based diagnosis will be used in applications to explain any anomaly, with a guarantee of correctness and precision at least for anticipated faults.

Diagnosability of the considered systems is a property defined to answer the question whether any possible faulty behavior in the system can be distinguished from any other behavior without this fault (i.e., correct or with a different fault) within a finite time after the occurrence of the fault. A fault is diagnosable if it can be surely identified from the partial observation available within a finite delay after its occurrence. A system is diagnosable if every possible fault in it is diagnosable. This property provides information before getting into finding the explanations of a fault. It also helps in designing a system robust against faults and in positioning the sensors to manage the observation requirements. The main difficulty in diagnosability algorithms is related to the explosion of the number of states. Another difficulty appears when checking the diagnosability of a system which is actually diagnosable, i.e. when no counterexample witnessing non diagnosability exists: all possibilities then need to be tested, as for proving the non existence of a plan in a planning problem, and usually in this case some approximations are used to avoid exploring the whole search space.

The paper is structured as follows. Section 2 introduces the system transition models for centralized DES and recalls the traditional definition of diagnosability in those models and the state of the art of encoding this definition as a satisfiability problem in propositional logic. Section 3 presents our first contribution, an extension of this state of the art to DDES with observable and non observable communication events in the same model, and gives experimental results for this extension. Section 4 is devoted to our second contribution, using incremental SAT calls to overcome the limitation when the number of steps required to check diagnosability, i.e., the length of possible paths with cycles witnessing non diagnosability, is large, and presents experimental results showing how the method scales up. Section 5 presents related works and section 6 concludes and gives our perspectives for future work.

2 Using SAT in Diagnosability Analysis of Centralized Systems
We first recall the definitions of the DES models we use and of diagnosability for these models.

2.1 Preliminaries
We will use finite state machines (FSM) to model systems. We define labeled transition systems following [1].

Definition 1. A Labeled Transition System (LTS) is a tuple T = ⟨X, Σo, Σu, Σf, δ, s0⟩ where:
• X is a finite set of states,
• Σo is a finite set of observable correct events,
• Σu is a finite set of unobservable correct events,
• Σf is a finite set of unobservable faulty events,
• δ ⊆ X × (Σo ∪ Σu ∪ Σf) × X is the transition relation,
• s0 is the initial state.
In [2] the authors used an equivalent but more compact representation than LTS for modeling systems in order to analyze their diagnosability: succinct transition systems, which exploit the regularity in the systems' structures and are expressed in terms of propositional variables, and which allowed them to translate more easily into a SAT problem the twin plant method proposed by [3] for checking diagnosability.

As we aim at studying the diagnosability of DDES using SAT solvers, we follow the model of [2], who studied the same problem in centralized DES. It represents the system states by the valuations of a finite set A of Boolean state variables, where valuation changes reflect the transitions between states according to the events. The set of all literals issued from A is L = A ∪ {¬a | a ∈ A}, and the language over A consists of all formulas that can be formed from A and the connectives ∨ and ¬. We use the standard definitions of the further connectives Φ ∧ Ψ ≡ ¬(¬Φ ∨ ¬Ψ), Φ → Ψ ≡ ¬Φ ∨ Ψ and Φ ↔ Ψ ≡ (Φ → Ψ) ∧ (Ψ → Φ). The transition relation is defined to allow two or more events to take place simultaneously. Thus each event is described by a set of pairs ⟨φ, c⟩ which represent its possible ways of occurrence, by indicating that the event can be associated with changes c ∈ 2^L in states that satisfy the condition φ (a formula of this language).

Definition 2. A Succinct Transition System (SLTS) is described by a tuple T = ⟨A, Σo, Σu, Σf, δ, s0⟩ where:
• A is a finite set of state variables,
• Σo is a finite set of observable correct events,
• Σu is a finite set of unobservable correct events,
• Σf is a finite set of unobservable faulty events,
• δ : Σ = Σo ∪ Σu ∪ Σf → 2^(L×2^L) assigns to each event a set of pairs ⟨φ, c⟩,
• s0 is the initial state (a valuation of A).

It is straightforward to show that any LTS can be represented as an SLTS (one takes ⌈log(|X|)⌉ Boolean variables and represents states by different valuations of these variables; one assigns to each occurrence of an event e labeling a transition (x, e, y) a pair ⟨φ, c⟩, with φ expressing the valuation of x and c the valuation changes between x and y). Reciprocally, any SLTS can be mapped to an LTS (see Definition 2.4 in [2]).

The formal definition of diagnosability of a fault f in a centralized system modeled by an LTS or SLTS T was proposed by [1] as follows:

Definition 3. Diagnosability. A fault f is diagnosable in a system T iff

    ∃k ∈ N, ∀s_f ∈ L(T), ∀t ∈ L(T)/s_f, |t| ≥ k ⇒ ∀p ∈ L(T), (P(p) = P(s_f.t) ⇒ f ∈ p).

In this formula, L(T) denotes the prefix-closed language of T, whose words are called trajectories, s_f is any trajectory ending with the fault f, L(T)/s is the post-language of L(T) after s, i.e., {t ∈ Σ* | s.t ∈ L(T)}, and P is the projection of trajectories on observable events. The definition states that for each trajectory s_f ending with fault f in T, and for each t that is an extension of s_f in T with enough events, every trajectory p in T that is equivalent to s_f.t in terms of observation should contain f. As usual, it will be assumed that L(T) is live (i.e., for any state, there is at least one transition issued from this state) and convergent (i.e., there is no cycle made up only of unobservable events).

A system T is said to be diagnosable iff any fault f ∈ Σf is diagnosable in T. In order to avoid exponential complexity in the number of faults during diagnosability analysis, only one fault at a time is checked for diagnosability. It will thus be assumed in the following that there exists only one fault event f (Σf = {f}), without restriction on the number of its occurrences. Diagnosability checking has been proved in [3] to be polynomial in the number |X| of states for LTS, and is thus exponential in the number |A| of state variables for SLTS (actually, the problem is NLOGSPACE-complete for LTS and PSPACE-complete for SLTS [4]).

2.2 SLTS Diagnosability as Satisfiability
An immediate rephrasing of definition 3 shows that T is non diagnosable iff there exists a pair of trajectories corresponding to cycles (and thus to infinite paths), a faulty one and a correct one, sharing the same observable events. This is equivalent to the existence of an ambiguous cycle (i.e., one made up of pairs of states respectively reachable by a faulty path and a correct path) in the product of T by itself synchronized on observable events, which is at the origin of the so-called twin plant structure introduced in [3]. This non diagnosability test was formulated in [2] as a satisfiability problem in propositional logic. We recall below this encoding with the variables and formulas used, where superscripts t refer to time points and (e_o^t) and (ê_o^t) refer respectively to the faulty and correct event occurrence sequences (the corresponding states being described by the valuations of (a^t) and (â^t)) of a pair of trajectories witnessing non diagnosability (so sharing the same observable events, represented by (e^t), and forming a cycle). An increase of the time step corresponds to the triggering of at least one transition and the extension by an event of at least one of the two trajectories. T = ⟨A, Σu, Σo, Σf, δ, s0⟩ being an SLTS, the propositional variables are thus:
• a^t and â^t for all a ∈ A and t ∈ {0, ..., n},
• e_o^t for all e ∈ Σo ∪ Σu ∪ Σf, o ∈ δ(e) and t ∈ {0, ..., n−1},
• ê_o^t for all e ∈ Σo ∪ Σu, o ∈ δ(e) and t ∈ {0, ..., n−1},
• e^t for all e ∈ Σo and t ∈ {0, ..., n−1}.

The following formulas express the constraints that must hold at each time step t or between t and t+1.

1. The event occurrence e_o^t must be possible in the current state:

    e_o^t → φ^t   for o = ⟨φ, c⟩ ∈ δ(e)   (2.1)

and its effects must hold at the next time step:

    e_o^t → ∧_{l ∈ c} l^{t+1}   for o = ⟨φ, c⟩ ∈ δ(e)   (2.2)

We have the same formulas with ê_o^t.

2. The present value (True or False) of a state variable changes to a new value (False or True, respectively) only if there is a reason for this change, i.e., because of an event that has the new value in its effects (change without reason is prohibited).
True to False (the change from False to True is defined similarly by interchanging a and ¬a):

      (a^t ∧ ¬a^{t+1}) → (e_{i1,o_{j1}}^t ∨ · · · ∨ e_{ik,o_{jk}}^t)   (2.3)

   where the o_{jl} = ⟨φ_{jl}, c_{jl}⟩ ∈ δ(e_{il}) are all the occurrences of events e_{il} with ¬a ∈ c_{jl}. We have the same formulas with â^t and ê_{il,o_{jl}}^t.

3. At most one occurrence of a given event can occur at a time, and the occurrences of two different events cannot be simultaneous if they interfere (i.e., if they have two contradicting effects or if the precondition of one contradicts the effect of the other):

      ¬(e_o^t ∧ e_{o′}^t)   ∀e ∈ Σ, ∀{o, o′} ⊆ δ(e), o ≠ o′   (2.4)

      ¬(e_o^t ∧ e′_{o′}^t)   ∀{e, e′} ⊆ Σ, e ≠ e′, ∀o ∈ δ(e), ∀o′ ∈ δ(e′) such that o and o′ interfere   (2.5)

   We have the same formulas with ê_o^t.

4. The formulas that connect the two event sequences require that observable events take place in both sequences whenever they take place (hence the use of e^t):

      ⋁_{o ∈ δ(e)} e_o^t ↔ e^t   and   ⋁_{o ∈ δ(e)} ê_o^t ↔ e^t   ∀e ∈ Σ_o   (2.6)

The conjunction of all the above formulas for a given t is denoted by T(t, t+1). A formula for the initial state s_0 is:

      I_0 = ⋀_{a ∈ A, s_0(a)=1} (a^0 ∧ â^0) ∧ ⋀_{a ∈ A, s_0(a)=0} (¬a^0 ∧ ¬â^0)   (2.7)

At last, the following formula can be defined to encode the fact that a pair of executions is found with the same observable events and no fault in one execution (first line), but one fault in the other (second line), both being infinite (in the form of a non-trivial cycle, so containing at least one observable event¹, at step n; third line), witnessing non-diagnosability:

      Φ_n^T = I_0 ∧ T(0, 1) ∧ · · · ∧ T(n−1, n) ∧
              ⋁_{t=0}^{n−1} ⋁_{e ∈ Σ_f} ⋁_{o ∈ δ(e)} e_o^t ∧
              ⋁_{m=0}^{n−1} ( ⋀_{a ∈ A} ((a^n ↔ a^m) ∧ (â^n ↔ â^m)) ∧ ⋁_{t=m}^{n−1} ⋁_{e ∈ Σ_o} e^t )

From this encoding in propositional logic follows the result (Theorem 3.2 of [2]) that an SLTS T is not diagnosable if and only if there exists n ≥ 1 such that Φ_n^T is satisfiable. This is also equivalent to Φ_{2^{2|A|}}^T being satisfiable, as the number of twin plant states is an obvious upper bound for n, but that bound is often impractically high (see in [2] some ways to deal with this problem).

3 Using SAT in Diagnosability Analysis of Distributed Systems

We extend the satisfiability framework of subsection 2.2 from centralized systems to distributed systems for testing diagnosability, and we provide some experimental results.

¹ This verification that the found cycle is not trivial was not done in [2]; that is why the authors had to add, for each time point, a formula (not needed here) guaranteeing that at least one event took place, in order to avoid silent loops with no state change.

3.1 DDES Modeling

In order to model DDES with SLTS, we need to extend the latter by adding communication events to each component. So we use the following definition for a distributed SLTS with k different components (sites):

Definition 4. A Distributed Succinct Labeled Transition System (DSLTS) with k components is described by a tuple T = ⟨A, Σ_o, Σ_u, Σ_f, Σ_c, δ, s_0⟩ where (subscripts i refer to component i):

• A is a union of disjoint finite sets (A_i)_{1≤i≤k} of component own state variables, A = ∪_{i=1}^k A_i,
• Σ_o is a union of disjoint finite sets of component own observable correct events, Σ_o = ∪_{i=1}^k Σ_{oi},
• Σ_u is a union of disjoint finite sets of component own unobservable correct events, Σ_u = ∪_{i=1}^k Σ_{ui},
• Σ_f is a union of disjoint finite sets of component own unobservable faulty events, Σ_f = ∪_{i=1}^k Σ_{fi},
• Σ_c is a union of finite sets of (observable or unobservable) correct communication events, Σ_c = ∪_{i=1}^k Σ_{ci}, which are the only events shared by at least two different components (i.e., ∀i, ∀c ∈ Σ_{ci}, ∃j ≠ i, c ∈ Σ_{cj}),
• δ = (δ_i), where δ_i : Σ_i = Σ_{oi} ∪ Σ_{ui} ∪ Σ_{fi} ∪ Σ_{ci} → 2^{L_i × 2^{L_i}} assigns to each event a set of pairs ⟨φ, c⟩ in the propositional language L_i of the component where it occurs (so, for communication events, in each component separately where they occur),
• s_0 = (s_{0i}) is the initial state (a valuation of each A_i).

In this distributed framework, synchronous communication is assumed, i.e., communication events are synchronized so that they occur simultaneously in all components where they appear. More precisely, a transition by a communication event c may occur in a component iff a simultaneous transition by c occurs in all the other components where c appears (i.e., has at least one occurrence). In particular, all events before c in the trajectories of all these components necessarily occur before all events after c in these trajectories. The global model of the system is thus nothing else than the product of the models of the components, synchronized on communication events. Notice that we allow, in full generality, communication events to be partially or totally unobservable, so one has in general to wait for further observations to know that some communication event occurred between two or more components. On the other hand, assuming these communications to be faultless is not actually a limitation. If a communication process or protocol may be faulty, it just has to be modeled as a proper component with its own correct and faulty behaviors (the same as, e.g., for a wire in an electrical circuit). In this sense, communications between components are just a modeling concept, not subject to diagnosis. It will also be assumed that the observable information is global, i.e., centralized (when observable information is only local to each component, distributed diagnosability checking becomes undecidable [5]), which allows keeping Definition 3 for diagnosability.

3.2 DSLTS Diagnosability as Satisfiability

Let T be a DSLTS made up of k components denoted by indexes i, 1 ≤ i ≤ k. In order to express the diagnosability analysis of T as a satisfiability problem, we have to extend
the formulas of subsection 2.2 to deal with communication events between components. Let Σ_c = Σ_co ∪ Σ_cu be the communication events, with Σ_co = ∪_{i=1}^k Σ_{co,i} the observable ones and Σ_cu = ∪_{i=1}^k Σ_{cu,i} the unobservable ones.

The idea is to treat each communication event as any other event in each of its owner components and, as was done with the variables e^t for e ∈ Σ_o for synchronizing observable event occurrences in the two executions, to introduce in the same way a global reference variable for each communication event at each time step, in charge of synchronizing any occurrence of this communication event in any of its owners with its occurrences in all its other owners. We use one such reference variable per trajectory, e^t and ê^t, for unobservable events e ∈ Σ_cu, and only one for both trajectories, e^t, for observable events e ∈ Σ_co, as the latter will in addition play the role of synchronizing observable events between trajectories, exactly as e^t does for e ∈ Σ_o. So we add to the previous propositional variables the following new ones:

• e_o^t, ê_o^t for all e ∈ Σ_c, o ∈ δ(e) = ∪_i δ_i(e) and t ∈ {0, ..., n−1},
• e^t for all e ∈ Σ_c, ê^t for all e ∈ Σ_cu, and t ∈ {0, ..., n−1}.

Formulas in T(t, t+1) are extended as follows.

1. Formulas (2.1), (2.2), (2.3) and (2.5) extend unchanged to e_o^t and ê_o^t for all e ∈ Σ_c, expressing that a communication event must be possible, and has effects, in each of its owner components, and that two such different events cannot be simultaneous if they interfere.

2. Formulas (2.4) extend to prevent two simultaneous occurrences of a given communication event in the same owner component, i.e., they apply ∀e ∈ Σ_c, ∀i, ∀{o_i, o_i′} ⊆ δ_i(e), o_i ≠ o_i′, and the same with ê (obviously they do not apply across different owner components, by the very definition of communication events).

3. Finally, the following new formulas express the communication process itself, i.e., the synchronization of the occurrences of any communication event e in all its owner components (S(e) being the set of indexes of the owner components of e), and also extend formulas (2.6) to observable communication events:

      ⋁_{o_i ∈ δ_i(e)} e_{o_i}^t ↔ e^t   and   ⋁_{o_i ∈ δ_i(e)} ê_{o_i}^t ↔ ê^t   ∀e ∈ Σ_cu, ∀i ∈ S(e)

      ⋁_{o_i ∈ δ_i(e)} e_{o_i}^t ↔ e^t   and   ⋁_{o_i ∈ δ_i(e)} ê_{o_i}^t ↔ e^t   ∀e ∈ Σ_co, ∀i ∈ S(e)

The formula Φ_n^T is unchanged, except that, in the verification that the found cycle (third line) is not trivial, any observable event can be used, so the final disjunct over events e^t is extended to all e ∈ Σ_o ∪ Σ_co. We thus have the result that a DSLTS T is not diagnosable if and only if there exists n ≥ 1 such that Φ_n^T is satisfiable.

3.3 Implementation and Experimental Testing

We have implemented the above extension in Java, using the well designed API of the SAT solver Sat4j [6]. Even if more efficient solvers could have been chosen, Sat4j fitted well our clause generator written in Java, and only a limited speed-up can be expected from C++ solvers (a speed-up of 4, i.e., a reduction of 75% of the runtime, is often observed).

We have tested our tool on small examples with several communication events with multiple occurrences (three communicating components), with global communication (all components share the same event) or partial communication (only some components share the same event), as in Figure 1, which was the running example in [7].

Figure 1: A DDES made up of 3 components C1, C2 and C3, from left to right. The c_i, 1 ≤ i ≤ 2, are unobservable communication events, the o_i, 1 ≤ i ≤ 5, are observable events and the f_i, 1 ≤ i ≤ 2, are faulty events.

The total number of propositional variables VarsNum in the generated formula Φ_n^T after n steps is:

      VarsNum = n × (2|A| + 3 Σ_{i=1}^{Obs} ObOcc_i + Σ_{i=1}^{Faults} FaultOcc_i + 2 Σ_{i=1}^{Unobs} UnobOcc_i), where:

|A| is the total number of state variables,
Obs is the total number of observable events,
ObOcc_i is the total number of occurrences of the observable event e_i,
Faults is the total number of faults,
FaultOcc_i is the total number of occurrences of the faulty event e_i,
Unobs is the total number of unobservable correct events,
UnobOcc_i is the total number of occurrences of the unobservable correct event e_i.

The results are in Table 1, where the columns show the system and the fault considered (3 cases), the step number n, the satisfiability answer, the numbers of variables and clauses, and the runtime.

   System      Fault  |Steps|  SAT?  |Variables|  |Clauses|  runtime (ms)
   C2          f2        4     No        106         628          27
   C2          f2        5     Yes       131         783          15
   C2, C3      f2        5     No        225        1157          28
   C2, C3      f2       32     No       1386        7340         641
   C2, C3      f2       64     No       2762       14668        1422
   C2, C3      f2      128     No       5514       29324        5061
   C2, C3      f2      256     No      11018       58636       18970
   C2, C3      f2      512     No      22026      117260      130164
   C2, C3      f2     1024     No      44042      234508      548644
   C1, C2, C3  f1        8     No        576        3546          91
   C1, C2, C3  f1        9     Yes       646        3987         110

   Table 1: Results on the example of Figure 1.

This means that f2 is not diagnosable in C2 alone, while it becomes diagnosable when synchronizing C2 and C3. For this last result, we have increased the step number until reaching 2^{2|A|}, which is the theoretical upper bound of the number of twin plant states represented in the logical formula. As in general it is not always possible to reach this bound in practice, we propose in section 4 using incremental SAT to improve the management of an increasing step number. While
f1 is not diagnosable even after synchronizing all three components together. The numbers of variables and clauses are small in comparison to what SAT solvers can handle (up to hundreds of thousands of propositional variables and millions of clauses). These tests are mentioned as a proof of concept. However, to test the tool on larger systems, and because of the absence of benchmarks in the literature, we have created in subsection 4.2 an example that can be scaled up.

4 Adaptation to Incremental SAT Diagnosability Checking

We adapt the satisfiability algorithms for checking diagnosability of both centralized (subsection 2.2) and distributed (subsection 3.2) DES in order to incrementally process the maximum length of the paths with cycles searched for witnessing non-diagnosability, and we provide experimental results.

4.1 Diagnosability as Incremental Satisfiability

Two cases have to be distinguished when testing diagnosability using SAT solvers to verify the satisfiability of the logical formula Φ_n^T for a given n [2]. The first case is when we find a model of Φ_n^T, which definitely indicates the non-diagnosability of the studied fault. The second case is when we do not find such a model: this result indicates only that the studied fault has not been found non-diagnosable up to the value of n. In other words, after testing all the possible first n steps, we did not find a pair of executions of length at most n containing cycles such that one of them contains the fault and not the other, and such that the two executions are equivalent in terms of observation. However, as the theoretical upper bound n = 2^{2|A|}, which would guarantee that the fault is actually diagnosable, is often unreachable in practice, such a pair may exist for a greater value of n. Testing it means increasing n, rebuilding the logical formula Φ_n^T, and then calling the SAT solver again.

Instead, we propose to adapt the formula Φ_n^T so that it can be tested in an incremental SAT mode by multiple calls to a Conflict-Driven Clause Learning (CDCL) solver. Using CDCL solvers in a specialized, incremental mode is relatively new but already widely used [8] in many applications. In this operation mode, the solver can be called many times with different formulas. However, solvers are designed to work with similar formulas, where clauses are removed and added from call to call. Learnt clauses can be kept as long as the solver can ensure that the clauses used to derive them are not removed. This is generally done by adding specialized variables, called assumptions, to each clause that may be removed. By assuming the variable to be False, the clause is activated, and by assuming the variable to be True, the clause is trivially satisfied and no longer used by the solver. What is interesting for our purpose is that the CDCL solver can save the clauses learnt during the previous calls and test multiple assumptions in each new call. This means that after n steps we may hope that the solver will have learnt some constraints about the behavior of the system. Although we are interested in testing the diagnosability property on a given system, this property is independent from the system behavior, which can be learnt by the solver from the previous calls.

In order to extend the clause representation given in subsections 2.2 and 3.2 to this mode of operation, we propose to divide the formula Φ_n^T into two parts. The first part T_n describes the first n steps, synchronized on the observations, of the behavior of both trajectories (represented by the conjunction of the formulas T(t, t+1), 0 ≤ t ≤ n−1, representing the (t+1)th step). The second part D_n describes the diagnosability property at step n, i.e., the occurrence of a fault in the n previous steps of the faulty trajectory (given by the formula F_n) and the detection of a cycle at step n (given by the formula C_n). So we obtain, for n ≥ 1:

      Φ_n^T = T_n ∧ D_n

      T_n = I_0 ∧ ⋀_{t=0}^{n−1} T(t, t+1)      D_n = F_n ∧ C_n

      F_n = ⋁_{t=0}^{n−1} ⋁_{e ∈ Σ_f} ⋁_{o ∈ δ(e)} e_o^t

      C_n = ⋁_{m=0}^{n−1} ( ⋀_{a ∈ A} ((a^n ↔ a^m) ∧ (â^n ↔ â^m)) ∧ ⋁_{t=m}^{n−1} ⋁_{e ∈ Σ_o} e^t )

Now add at each step j a control variable h_j allowing one to disable (when its truth value is False) or activate (when its truth value is True) the formulas F_j and C_j, and keep at step n all these controlled formulas for 1 ≤ j ≤ n. We obtain the following formula Ψ_n^T, for n ≥ 1:

      Ψ_n^T = T_n ∧ ⋀_{j=1}^n D_j′      D_j′ = F_j′ ∧ C_j′      1 ≤ j ≤ n

      F_j′ = ¬h_j ∨ F_j      C_j′ = ¬h_j ∨ C_j      1 ≤ j ≤ n

We thus have the equivalence, for all n ≥ 1:

      Φ_n^T ≡ Ψ_n^T ∧ h_n ∧ ⋀_{j=1}^{n−1} ¬h_j

This allows one, for all n ≥ 1, to replace the SAT call on Φ_n^T by a SAT call on Ψ_n^T under the control variables setting given by H_n = {¬h_1, ..., ¬h_{n−1}, h_n} (indicated as a second argument of the call):

      SAT(Φ_n^T) = SAT(Ψ_n^T, H_n)

The idea is now to consider the control variables h_j as assumptions and to use incremental SAT calls IncSAT_j under varying assumptions, for 1 ≤ j ≤ n. For this, we use the following recurrence relationships for both the formulas Ψ_j^T and the assumptions H_j:

      Ψ_0^T = I_0      Ψ_{j+1}^T = Ψ_j^T ∧ T(j, j+1) ∧ D_{j+1}′      j ≥ 0

      H_1 = {h_1}      H_{j+1} = H_j[{¬h_j, h_{j+1}}]      j ≥ 1

where the notation H_j[{ass_i}] means updating in H_j the assumptions h_i by their new settings ass_i, i.e., in the formula above, replacing the truth value of h_j, which was True, by False, and adding the new assumption h_{j+1} with truth value True. From these relationships, the unique call to SAT under given assumptions SAT(Ψ_n^T, H_n) can be replaced, starting with the set of clauses I_0, by multiple calls, 0 ≤ j ≤ n−1, to an incremental SAT under varying assumptions:

      IncSAT_{j+1}(NewClauses_{j+1}, NewAssumptions_{j+1}) = IncSAT_{j+1}(T(j, j+1) ∧ D_{j+1}′, {¬h_j, h_{j+1}})   (4.1)

If IncSAT_j answers SAT, the search is stopped, as non-diagnosability is proved; if it answers UNSAT, then IncSAT_{j+1} is called.
Notice that we used a unique assumption h_j for controlling both F_j and C_j, as non-diagnosability checking requires the presence of both a fault occurrence in the faulty trajectory and a cycle. But the same framework allows the independent control of formulas by separate assumptions. For the sake of simplicity, we also assumed that IncSAT is called at each step, but this is not mandatory, and the indexes j of the successive calls can be decoupled from the indexes t of the steps. We should also say that, even if IncSAT allows us to reactivate an already disabled clause, we are sure in our case never to use this functionality (once h_k has been set to False, it always remains so) and we can thus force the solver to perform a hard simplification process that removes the forgotten clauses permanently. As a result of our adaptation, we are able to scale up the size of the tested system and the distance and length of a cycle witnessing non-diagnosability.

4.2 Experimental Results

We show in this subsection a comparison between our adapted version of subsection 4.1, which uses incremental SAT, and the previous versions, for the centralized model (subsection 2.2, following [2]) and for the distributed model (subsection 3.2). We have created the example in Figure 2, which contains 2k+1 components: one faulty component and two sets of k neighboring components. The faulty component has two separate paths, each one containing k different successive unobservable events c_i and ending with the same observable cycle of length 1, but only one of them contains the fault. The centralized model is limited to this faulty component alone, and thus in this case the events c_i, 1 ≤ i ≤ 2k, are just unobservable events, as is u. In the distributed model, these events c_i are communication events and the faulty component is considered together with the other two sets of components, where each component in both sets shares one event c_i with the faulty component, to ensure a number 2k of communications before arriving at the cycles that will witness the non-diagnosability of the fault. Each set of components is synchronized with only one path, either the faulty path or the correct one. This allows us to study the effect of the cycle distance in both models.

Figure 2: One faulty component that communicates with two sets of k components. Each set communicates with one path (resp. faulty and correct) in the faulty component.

The results are in Table 2 for the centralized model (for k = 18, 28, 38, 48, 58 and 98) and in Table 3 for the distributed model (for k = 3, 13, 23, 33, 43 and 63). The length of a pair of executions with cycles witnessing the non-diagnosability of f in each example is k+2, and we consider the satisfiability of the formula Φ_{k+2}^T, so the number of steps required for SAT to provide the answer Yes is |Steps| = k+2. In order to obtain a fair comparison between IncSAT, which manages internally, by handling assumptions, the successive satisfiability checks of the increasing formulas for j = 1, ..., k+2, and SAT, for which k+2 successive calls are made to the solver with the respective formulas Φ_n^T for n = 1, ..., k+2, the sum of the k+2 runtimes of the SAT solver calls is considered in this case (last column of the tables).

   |Steps|   |Clauses|   Inc. SAT (s)   SAT (s)
      20        42,614        1.5          1.3
      30       131,714       10.3         13.1
      40       303,736       49.3         77.8
      50       576,466      106          223
      60       970,156      320          699
     100     4,334,018     9410        13040

   Table 2: Results on the faulty component of Figure 2.

   |Steps|  |Comps|   |Clauses|   Inc. SAT (s)   SAT (s)
       5        7         1,962       0.04         0.06
      15       27        30,313       0.8          0.5
      25       47       113,906       6.5          4.8
      35       67       277,873      33.8         33.7
      45       87       542,033     111          132
      65      127     1,490,590     967         1090

   Table 3: Results on the whole system of Figure 2.

Although these examples remain relatively simple and do not exhibit any potential constraint that could be captured by some learnt clauses (e.g., there are no interfering events), we can already notice the difference in runtime in favor of our incremental version in the centralized case and for the two largest values of k in the distributed case. This difference could be explained by the fact that generating all variables from the beginning, for all time steps and for all events, implies many meaningless clauses that add a load on the solver in the version of [2], this load being avoided in our incremental version thanks to the clauses learnt by the CDCL SAT solver. On the other hand, we should say that generating all variables from the beginning, in both versions, has two main advantages: firstly, it allows describing the system without unfolding it (even if this description is verbose); secondly, it allows ordering these variables by their time step, so as to generate the constraints for only one time step and then obtain the constraints of the next steps by just shifting the numbers (as we represent the clauses in DIMACS format). One last point could help toward a more efficient description of the system: in succinct systems we represent all the occurrences of an event together, but in the SAT encoding we "unfold" this succinctness by generating for each occurrence n variables (for n time steps), even though logically only one of them will be assigned to True. We could thus mark this relation among these n copies by introducing a global cardinality constraint expressing that these copies belong to only one occurrence of an event.
5 Selection of Related Works

The notion of diagnosability was first introduced by [1]. The authors studied diagnosability of FSM, as defined in Definition 1. Their formal definition of diagnosability is the one we recalled in Definition 3. They introduced an approach to test this property by constructing a deterministic diagnoser. However, in the general case, this approach is exponential in the number of states of the system, which makes it impractical.

In order to overcome this limitation, [3] introduced the twin plant approach: a special structure is built by synchronizing, on their observable events, two identical instances of a nondeterministic fault diagnoser, and one then searches in this structure for a path with an observed cycle made up of ambiguous states, i.e., states that are pairs of original states, one reached by going through a fault and the other not. Fault diagnosability is thus equivalent to the absence of such a path, called a critical path. This approach turns the diagnosability problem into a search for a path with a cycle in a finite automaton, which reduces its complexity to polynomial of degree 4 in the number of states (and exponential in the number of faults, but processing each fault separately makes it linear in the number of faults).

Let us mention here that the two previous works were interested in centralized systems with simple faults modeled as distinguished events. The first studies about fault patterns were introduced in [9] and [10], which generalize the simple fault event in a centralized DES to handle a sequence of events considered together as a fault, or to handle multiple occurrences of the same fault or of different faults. More generally, a fault pattern is given as a suffix-closed rational events language (so by a complete deterministic automaton with a stable subset of final states).

The first work that addressed diagnosability analysis in DDES was [7]. A DDES is modeled as a set of communicating FSM. Each FSM has its own events set, communication events being the only ones shared by at least two different FSM. In [7], an incremental diagnosability test was introduced which avoids building the twin plant of the whole distributed system if not needed. Thus, one starts by building a local twin plant for the faulty component, to test the existence of a local critical path. If such a path exists, one builds the local twin checkers of the neighboring components. A local twin checker is a structure similar to a local twin plant, i.e., each path in it represents a pair of behaviors with the same observations, except that it carries no fault information, since it is constructed from a non-faulty component. After constructing the local twin checkers, one tries to solve the ambiguity resulting from the existence of a critical path in the local twin plant. This is done by synchronizing, on their communication events, this local twin plant with the local twin checker of one neighboring component. In other words, one tries to distinguish the faulty path from the correct one by exploiting the observable events of the neighboring components, because those of their occurrences that are consistent with the occurrences of the communication events could solve the ambiguity. The process is repeated until the diagnosability question is answered, so only in the worst case does the whole system have to be visited. Another important contribution of this work was to delete the unambiguous parts after each synchronization on the communication events, thus reducing the amount of information transferred to the next check (if needed). The approach assumed simple faults.

The work of [11] optimized the construction of the local twin plants, by exploiting the fact that one distinguishes two behaviors (faulty and correct) and one synchronizes at two levels (observations first and communications later). It improved the construction of the twin plants proposed by [7] by exploiting the different identifiers given to the communication events at the observation synchronization level (depending on which instance, left or right, they belong to) to assign them directly to the two behaviors studied (left copy assigned to the faulty behavior, right copy to the correct one). This helped in deleting the redundant information, then in abstracting the amount of information to be transferred later to the next steps if the diagnosability question was not yet answered. The generalization to fault patterns in DDES was introduced by [12].

After the reduction of the diagnosability problem to a path finding problem by [3], it became transferable to a satisfiability problem, as is the case for planning problems [13]. This was done by [2], which formulated the diagnosability problem (in its twin plant version) as a SAT problem, assuming a centralized DES with simple fault events. The authors represented the studied transition system by a succinct representation (cf. Definition 2). This allows both a compact representation of the system states and a maximum amount of non-interfering events to be fired simultaneously. Thus, they represented the system states by the valuation of a set of Boolean state variables (⌈log(q)⌉ state variables for q states) and the interference relation between two events according to the consistency among their effects and preconditions, one versus the other. They distinguished between an occurrence of an event in the faulty sequence and one in the correct sequence by introducing two versions of it, and they constructed the logical formula expressing the state transitions for each possible step in the system. Each step may contain simultaneous events that belong to the faulty and correct sequences, but must synchronize the occurrences of observable events whenever they take place. For a given bound n on the path length, they took the conjunction of these formulas for n steps and added the logical formula that represents the occurrence of the fault in the faulty sequence and the occurrence of a cycle in both sequences. The satisfiability of the obtained formula is equivalent to finding a critical path, i.e., to the non-diagnosability of the fault (see subsection 2.2 for a summary of this approach). Although this approach allows one to test diagnosability in large systems, it has a limitation, which is that one cannot dynamically increase n to ensure reaching more states while scaling up the size of the system, where the cycles that witness non-diagnosability can be very long. However, the authors notice that one is not always forced to test all reachable states, in the many cases where an approximation of the reachable states can be applied, but without explaining explicitly how such an approximation can be found.

6 Conclusion and Future Works

By extending the state of the art works for centralized DES, we have expressed diagnosability analysis of DDES as a satisfiability problem, by building a propositional formula whose satisfiability, witnessing non-diagnosability, can be checked by SAT solvers. We allow both observable and non observable communication events in our model. Our expression of these communication events, which avoids
merging all their owner components, helps in reducing the number of clauses used to represent them, and this reduction is proportional to the number of their occurrences. We have also proposed an adaptation of the logical formula in order to use incremental SAT calls, which helps in managing the scaling up of the distance and the length of the cycles witnessing non-diagnosability, and thus of the size of the tested system. We thus exploited the clauses learnt about the system behavior in the previous calls. This approach is more practical and more efficient for complex systems than the existing ones, as it avoids starting from scratch at each call.

We are now considering the extension of this work to fault patterns diagnosability [12]. We will use the same approach to express predictability analysis [14] as a satisfiability problem, for DES and DDES [15], both for simple fault events and for fault patterns [16]. Although our representation can be easily extended to deal with local observations (i.e., the observable events of one component are observed only by this component), we know that in general diagnosability checking then becomes undecidable, e.g., when communication events are unobservable (obviously it remains decidable when these events are observable in all their owners) [5]. A future work will be to study decidable cases of diagnosability checking in DDES with local observations, e.g., by assuming some well chosen communication events to be observable. Another natural question is to study whether the methods used in [7] and refined in [11] to check diagnosability in DDES in a way that is incremental in terms of the system components could be transposed as guiding strategies for some component-incremental SAT based approach for testing diagnosability in DDES. Transposing these methods in SAT, based on building a local twin plant and local twin checkers for gaining efficiency with regard to a global checking, seems difficult. Basically, at any step k, corresponding to considering a subsystem made up of k components, these methods build all the critical paths witnessing non-diagnosability at the level of this subsystem, and the incremental step, when adding a (k+1)th neighboring component, consists in checking the consistency of these pairs with the observations in the new component: only those pairs which can be consistently extended are kept, if any. In addition, in [11], only useful and abstracted information is kept from one step

[2] J. Rintanen and A. Grastien. Diagnosability testing with satisfiability algorithms. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), pages 532–537, 2007.
[3] S. Jiang, Z. Huang, V. Chandra, and R. Kumar. A polynomial algorithm for testing diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 46(8):1318–1321, 2001.
[4] J. Rintanen. Diagnosers and diagnosability of succinct transition systems. In Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI'07), pages 538–544, 2007.
[5] L. Ye and P. Dague. Undecidable case and decidable case of joint diagnosability in distributed discrete event systems. International Journal On Advances in Systems and Measurements, 6(3 and 4):287–299, 2013.
[6] D. Le Berre and A. Parrain. The Sat4j library, release 2.2. Journal on Satisfiability, Boolean Modeling and Computation, 7:59–64, 2010.
[7] Y. Pencolé. Diagnosability analysis of distributed discrete event systems. In Proceedings of the 16th European Conference on Artificial Intelligence (ECAI'04), 2004.
[8] A. Nadel and V. Ryvchin. Efficient SAT solving under assumptions. In Proceedings of the 15th International Conference on Theory and Applications of Satisfiability Testing (SAT'12), 2012.
[9] T. Jéron, H. Marchand, S. Pinchinat, and M.-O. Cordier. Supervision patterns in discrete event systems diagnosis. In Proceedings of the 8th International Workshop on Discrete Event Systems, 2006.
[10] S. Genc and S. Lafortune. Diagnosis of patterns in partially-observed discrete-event systems. In Proceedings of the 45th IEEE Conference on Decision and Control, pages 422–427. IEEE, 2006.
[11] L. Ye and P. Dague. An optimized algorithm for diagnosability of component-based systems. In Proceedings of the 10th International Workshop on Discrete Event Systems (WODES'10), 2010.
[12] L. Ye, Y. Yan, and P. Dague. Diagnosability for pat-
to the next one. With SAT, only one critical pair witness-
terns in distributed discrete event systems. In Proceed-
ing non diagnosability of the subsystem (i.e., a model for
ings of the 21st International Workshop on Principles
the formula) will be built. If it is not consistent, and thus
of Diagnosis (DX’10), 2010.
disappears, when adding the (k + 1)th component, diagnos-
ability is not proven for all that: other critical pairs in the [13] H. Kautz and B. Selman. Planning as satisfiability. In
subsystem, not completely computed at step k, may exist Proceedings of the 10th European Conference on Ar-
and be extendible to step (k + 1). So, they have to be com- tificial Intelligence (ECAI’92), pages 359–363, 1992.
puted now, which limits the incremental characteristic of the [14] S. Genc and S. Lafortune. Predictability of Event
approach. In the same way, abstracting some information Occurrences in Partially-observed Discrete-event Sys-
is difficult to achieve with SAT. So, there is no evidence tems. Automatica, 45(2):301–311, 2009.
a priori that efficiency gain could be obtained by trying to
[15] L. Ye, P. Dague, and F. Nouioua. Predictability Analy-
develop a component incremental SAT based approach for
testing DDES diagnosability. sis of Distributed Discrete Event Systems. In Proceed-
ings of the 52nd IEEE Conference on Decision and
Control (CDC-13), pages 5009–5015. IEEE., 2013.
References [16] T. Jéron, H. Marchand, S. Genc, and S. Lafortune.
Predictability of Sequence Patterns in Discrete Event
[1] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamo- Systems. In Proceedings of the 17th World Congress,
hideen, and D. Teneketzis. Diagnosability of discrete- pages 537–453. IFAC., 2008.
event systems. IEEE Transactions on Automatic Con-
trol, 40(9):1555–1575, 1995.
Proceedings of the 26th International Workshop on Principles of Diagnosis
Improving Fault Isolation and Identification for Hybrid Systems with Hybrid
Possible Conflicts
Anibal Bregon and Carlos J. Alonso-González and Belarmino Pulido
Departamento de Informática, Universidad de Valladolid, Valladolid, 47011, Spain
e-mail: {anibal,calonso,belar}@infor.uva.es
Abstract

Model-based fault isolation and identification in hybrid systems is computationally expensive, or even unfeasible, for complex systems due to the uncertainty concerning the actual state, and to the presence of both discrete and parametric faults coupled with changing modes in the system. In this work we improve fault isolation and identification performance for hybrid systems diagnosis using Hybrid Possible Conflicts. The Hybrid Bond Graph modeling approach makes it feasible to track system behavior without enumerating the complete set of system modes. Hybrid Possible Conflicts focus the analysis of potential mode changes on those subsystems whose behavior deviates from the expected one. Moreover, using information derived from the Hybrid Bond Graph model, we can cope with both discrete and parametric faults in a unique framework. Fault detection with Hybrid Possible Conflicts relied upon a statistical test to decide when a significant deviation in the residual occurs. The fault detection time was later used to start the fault isolation and identification stages. In this work we propose to analyze the evolution of the residual signal using CUSUM to find a more accurate estimation of the time of fault occurrence, which allows us to improve both the tracking of potential new modes and the parametric fault identification. Moreover, we extend our previous proposal for fault identification in continuous systems to cope with fault identification along a set of mode changes while performing parameter identification. We have tested these ideas on a four-tank hybrid system, with satisfactory results.

1 Introduction

Complex hybrid systems are present in a broad range of engineering applications, such as mechanical systems, electrical circuits, or embedded computation systems. The behavior of these systems is made up of continuous and discrete event dynamics. The main sources of hybrid behavior are discrete actuators, like discrete valves or switches in fluid or electrical systems, respectively. These changes in the continuous behavior increase the difficulty of accurate and timely online fault diagnosis. Our focus in this paper is on developing efficient model-based methodologies for online fault isolation and identification in complex hybrid systems.

Both the DX and the FDI communities have approached hybrid systems modeling and diagnosis during the last 20 years. They have used different modeling proposals [1; 2; 3], and have approached diagnosis either as hybrid state estimation [2], as online state tracking [4; 5; 6], or as a combination of both methods [7]. The main difficulty in any approach is to estimate the current state or set of states, and to diagnose that set of feasible states. Both tasks are computationally expensive or even unfeasible for complex systems. Several approaches have been proposed in the DX field to tackle these problems [4; 6].

In this work we have selected hybrid system modeling based on Hybrid Bond Graphs (HBGs) [1; 6], together with consistency-based diagnosis using Possible Conflicts (PCs) [8]. HBGs are an extension of Bond Graphs (BGs) [9] which model discrete changes as ideal switching junctions that can be set to ON or OFF according to an automaton. In [10] we presented Hybrid Possible Conflicts (HPCs) as an extension of Possible Conflicts using HBGs to track hybrid system behavior. Later, the HPC approach was extended to integrate fault diagnosis of both parametric and discrete faults in a unique framework [11].

In order to achieve efficient fault identification, it is very important to determine the time of fault occurrence as accurately and quickly as possible, but there is a required trade-off between fast and reliable fault detection. In our approach we relied upon a statistical test to decide when a residual deviates from the current mode, and used this time to start the fault isolation and identification stages. However, the fault detection instant can be delayed with respect to the fault occurrence time, and this causes some problems (e.g., the fault identification process is delayed, or we have to assume that we know the values of the state variables at the beginning of the identification process). In this work we propose to analyze the evolution of the residual signal using the CUSUM algorithm [12; 13] to find a more accurate estimation of the time of fault occurrence, both for tracking potential new modes and for parametric fault identification. Moreover, we extend our previous proposals for fault identification [14; 15] to cope with fault identification along a set of mode changes while performing the parameter identification.
The rest of the paper is organized as follows. Section 2 presents the case study used along the paper and introduces the Hybrid Bond Graph (HBG) modeling technique. Section 3 summarizes the Hybrid Possible Conflicts (HPCs) background, while Section 4 explains the unified framework for both discrete and parametric faults. Section 5 introduces some concepts related to the CUSUM algorithm required in our approach. Section 6 explains our approach for fault identification. Section 7 presents some results obtained by applying our proposal to the case study. Finally, Section 8 draws some conclusions.

2 Case Study

The hybrid four-tank system in Figure 1 will be used to illustrate concepts and to present results in this work. The system has an input flow which can be sent to tank 1, to tank 3, or to both tanks. Next to tank 1 there is tank 2; once the liquid in tank 1 reaches a level h, it starts to fill tank 2 as well. The lower part of the system has the same configuration: tank 4 is next to tank 3, connected by a pipe at a distance h above the base of the tanks.

Figure 1: Schematics of the four-tank system.

The methodology chosen to model the system in this work is the Hybrid Bond Graph (HBG), an extension of Bond Graphs (BGs). BGs are a domain-independent, energy-based topological modeling language for physical systems [9]. Several types of primitive elements are used to build BGs: storage elements (capacitances, C, and inductances, I), dissipative elements (resistors, R), and elements that transform energy (transformers, TF, and gyrators, GY). There are also effort and flow sources (Se and Sf), which are used to define interactions between the system and the environment. Elements in a BG are connected by 0 or 1 junctions (representing ideal parallel or series connections between components). Each bond has two associated variables (effort and flow), and the power carried by a bond is defined as effort × flow. The SCAP algorithm [16] is used to assign causality automatically in the BG.

To model hybrid systems using BGs we need connections that allow changes in their state. Hybrid Bond Graphs (HBGs) [1] extend BGs by including such connections: idealized switching junctions that allow mode changes in the system. If a switching junction is set to ON, it behaves as a regular junction. When it changes to OFF, all bonds incident on the junction are deactivated, forcing 0 flow (or effort) for 1 (or 0) junctions. A finite state machine control specification (CSPEC) implements those junctions. Transitions between the CSPEC states can be triggered by endogenous or exogenous variables, called guards. CSPECs capture controlled and autonomous changes as described in [17]. Figure 2 shows the HBG model of the four-tank system in Figure 1.

Figure 2: Bond graph model of the plant.

The system has four switching junctions: SW1, SW2, SW3 and SW4. SW1 and SW3 are controlled ON/OFF transitions, while SW2 and SW4 are autonomous transitions. Both kinds of transitions are represented using a finite state machine. Figure 3 shows: a) the automaton associated with switching junction SW1, and b) the automaton representing the autonomous transition in SW2. Since the system is symmetric, the automata for SW3 and SW4 are equivalent to the ones shown in Figure 3.

Figure 3: a) Automaton associated with the ON/OFF switching junction SW1; b) automaton representing the autonomous transition in SW2.

3 Hybrid Possible Conflicts background

Consistency-based diagnosis of continuous systems using Possible Conflicts (PCs) [8] is based upon a dependency-compilation technique from the DX community. PCs are computed offline by finding minimal structurally overdetermined subsets of equations with sufficient analytical redundancy to generate fault hypotheses from observed measurement deviations. Only structural and causal information about the system description is required. This information can be obtained from a set of algebraic and/or differential equations, or can be automatically derived from bond graph models [18; 19]. Once the set of PCs is found, they can be implemented as simulation, state-observer or gray-box
models for tracking the actual system behavior online [20], or for online fault identification [14].

The PC approach has recently been extended to cope with hybrid system dynamics, and the set of PCs for hybrid systems is called Hybrid Possible Conflicts (HPCs) [10]. HPCs rely upon the Hybrid Bond Graph modeling formalism [1], whose main advantage is that the set of possible modes in the system does not need to be enumerated. Moreover, HBGs are capable of tracking hybrid system behavior online, performing online causality reassignment in the system model by means of the HSCAP algorithm [17]. Using HPCs we make the HSCAP algorithm even more efficient, because causality only needs to be revised within the subsystem defined by each HPC, and these changes are local to the switching junction affected by the mode change.

For the four-tank system we have found four HPCs. Each one of them estimates one of the measured variables (p1, p2, p3, or p4). Figure 4 shows the BG fragments of these four HPCs. In this example, the four HPCs were computed assuming that all switching junctions are set to ON.

As mentioned before, when any of these junctions is switched to OFF, causality in the system needs to be reassigned, but the HPC generation process does not need to be restarted [10]. The decomposition of a hybrid system model obtained from HPCs is unique, and after a mode change some portions of some HPCs can disappear (or even an entire HPC), but no additional HPC appears. It is proved in [10] that once the PCs of the system have been generated considering all switching junctions set to ON, turning a switch from ON to OFF or vice versa will never make a genuinely new HPC appear.

Regarding fault profiles, our current proposal works under single-fault and abrupt-fault assumptions. Abrupt faults are modeled as an instantaneous change in a parameter whose magnitude does not change afterwards (i.e., they can be modeled as a step function).

Regarding parametric faults, fault isolation is performed by means of the Reduced Qualitative Fault Signature Matrix (RQFSM). Table 1 shows the RQFSM for the mode where each switch is set to ON. For a given mode, the RQFSM can be computed online from the TCG associated with an HPC [1]. In this table there is a row for each fault considered and a column for each HPC. Each entry represents the Qualitative Fault Signature of the fault in the HPC residual, as computed in TRANSCEND [1]. The "reduced" tag means that the Qualitative Fault Signature is computed within the subsystem delimited by an HPC, and not for the whole set of measurements [18]. Once fault detection is performed, we can use this information to reject those faults whose residual evolution does not match the qualitative signatures in this table.

Table 1: Reduced Qualitative Fault Signature Matrix.

         HPC1   HPC2   HPC3   HPC4
C1+      -+
C2+             -+
C3+                    -+
C4+                           -+
R01+     0-            0+
R03+     0+            0-
R1+      0+
R2+             0+
R3+                    0+
R4+                           0+
R12+     0-     0-
R34+                   0+     0-

We also consider discrete faults, i.e., faults in discrete actuators, as commanded mode switches which do not perform the correct action. In our case study there are four faulty situations to be considered, where SWi denotes switching junction i of the system:

1. SWi = 11: SWi stuck ON (1).
2. SWi = 00: SWi stuck OFF (0).
3. SWi = 01: autonomous switch ON (SWi is OFF (0) and it switches to ON (1) by itself).
4. SWi = 10: autonomous switch OFF (SWi is ON (1) and it switches to OFF (0) by itself).

The relation between the HPCs and their related switching junctions can be seen in Table 2, called the Hybrid Fault Signature Matrix (HFSM). This information can be used in the unified framework for discrete and parametric fault isolation and identification [11].

Table 2: Hybrid Fault Signature Matrix (HFSM) showing the relations between switching junctions and each HPC.

         HPC1   HPC2   HPC3   HPC4
1SW1     1             1
1SW2     1      1
1SW3     1             1
1SW4                   1      1

Discrete faults usually introduce strong non-linearities in the system outputs, which should be easily detected if magnitudes related to the failing switch were measured, yielding almost instantaneous detection of discrete faults. In this case, exoneration could be applied. But even if those measurements are not available, we can still use the qualitative signature of the effects of the discrete faults on the HPC residuals. With this information we can build the so-called Hybrid Qualitative Fault Signature Matrix (HQFSM), which can also be used for exoneration purposes in the fault isolation stage. In our system we can build the following HQFSM for HPC1 and HPC3, which are linked to the commanded switches SW1 and SW3, the potential sources of discrete faults in our system. We do not show SW2 and SW4 in the table: they introduce hybrid dynamics in the system, but they cannot be the source of a discrete fault.

Table 3: Hybrid Qualitative Fault Signature Matrix.

             HPC1   HPC3
1SW1 (11)     +      -
1SW1 (00)     -      +
1SW1 (01)     +      -
1SW1 (10)     -      +
1SW3 (11)     -      +
1SW3 (00)     +      -
1SW3 (01)     -      +
1SW3 (10)     +      -

The next section presents our diagnosis framework for hybrid systems using HPCs.

4 Hybrid Systems Diagnosis using HPCs

As mentioned before, tracking of hybrid systems can be performed using Hybrid PCs [10]. Initially, the set of HPCs is built assuming all switching junctions are set to ON.
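The signature-based rejection step described in Section 3 can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the `RQFSM` dictionary encodes a few rows of Table 1 (fault → qualitative signature per HPC, as a (first change, slope) pair of signs), and `isolate` keeps only the faults whose predicted signs agree with every observed residual deviation.

```python
# Sketch of consistency-based candidate rejection with a qualitative
# fault signature matrix. Entries mirror some rows of Table 1; the
# encoding {HPC index: (initial change, slope)} is an assumption made
# for this illustration only.

RQFSM = {
    "C1+":  {1: ("-", "+")},
    "R01+": {1: ("0", "-"), 3: ("0", "+")},
    "R03+": {1: ("0", "+"), 3: ("0", "-")},
    "R1+":  {1: ("0", "+")},
}

def isolate(observed):
    """Keep faults whose predicted signature matches every observed
    residual deviation (observed: {HPC index: (change, slope)})."""
    candidates = []
    for fault, signature in RQFSM.items():
        consistent = all(
            hpc in signature and signature[hpc] == obs
            for hpc, obs in observed.items()
        )
        if consistent:
            candidates.append(fault)
    return candidates
```

For instance, observing a (0, −) deviation on the residual of HPC1 together with a (0, +) deviation on HPC3 leaves only R01+ as a candidate, matching the corresponding row of Table 1.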
Figure 4: Bond graphs of the four PCs found for the four-tank system.
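To make the switching-junction semantics concrete, the following toy simulation (illustrative only: function name, parameter values, and the crude lumped model of the tank-1/tank-2 pair are all invented for this sketch, not taken from the paper's HBG) treats the autonomous junction between tanks 1 and 2 as a flag that starts conducting once the level in tank 1 reaches the pipe height h; while the junction is OFF, the corresponding bond carries zero flow:

```python
# Toy forward-Euler simulation of a tank pair with one autonomous
# switching junction (assumed parameter values, for illustration only).

def simulate(t_end=200.0, dt=0.1, h=1.0, C1=2.0, C2=2.0,
             R12=5.0, R1=4.0, R2=4.0, f_in=0.5):
    p1 = p2 = 0.0      # tank pressures (state variables of the two C elements)
    sw_on = False      # autonomous switching junction, initially OFF
    t = 0.0
    while t < t_end:
        # guard of the autonomous transition: the connecting pipe at
        # height h conducts once the level in tank 1 exceeds h
        if not sw_on and p1 >= h:
            sw_on = True
        # when the junction is OFF, its bond is deactivated (zero flow)
        f12 = (p1 - p2) / R12 if sw_on else 0.0
        # capacitor (tank) equations: C * dp/dt = net inflow
        p1 += dt * (f_in - p1 / R1 - f12) / C1
        p2 += dt * (f12 - p2 / R2) / C2
        t += dt
    return p1, p2, sw_on
```

With these (made-up) values the level in tank 1 crosses h, the junction switches ON autonomously, and tank 2 starts to fill, mimicking the behavior of SW2 in the case study.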
Afterwards, the set of models of the HPCs for the actual mode is efficiently built, and they start tracking the system. Whenever a mode change, commanded or autonomous, is detected, a new set of models for the HPCs is computed online.

In case a fault occurs, one or more HPC residuals will trigger. Significant deviations in the residuals are found using the statistical Z-test. Based on the activated residuals for the set of HPCs in the current mode, the structural information in the HQFSM (Table 3), and the RQFSM (Table 1), we build the current set of fault candidates. This set can contain both discrete and parametric faults. Since discrete faults generally have a bigger and potentially more dangerous influence on the system behavior, our framework considers discrete faults as preferred candidates before considering the parametric ones. If there is no discrete fault candidate, then we go directly to the fault identification described in Section 6.

At this point we run the CUSUM algorithm (described in Section 5) to approximately determine the time of fault occurrence. Once this is done, we create a new simulation model using the HPCs and, starting at the fault time determined by the CUSUM, we begin tracking the system behavior in each one of the hypothesized mode changes (the HQFSM and the qualitative values of the HPC residuals are used to reject those modes that are inconsistent with the expected deviations in the HQFSM). If the hypothesized mode is the correct one, the residual for that mode will go to zero after a relatively small period of time (this is possible, as we will show later, thanks to the accurate estimation of the fault time provided by the CUSUM). If the hypothesized mode is not correct, the residual will keep deviating from zero, and after an empirically determined time window without converging, the discrete fault candidate will be discarded. If only one mode has a residual close to zero, this is the new system mode.

If the residual for each hypothesized new mode does not converge to zero, discrete faults (as mode changes) are discarded and we focus on parametric faults, starting the identification stage. As mentioned before, the qualitative fault signatures in the RQFSM can be used to reject those parametric faults that are not consistent with the current observations, thus focusing the fault identification stage even further.

Finally, once the set of parametric fault candidates has been refined through the RQFSM, we perform fault identification for the remaining parametric fault candidates. Fault identification is done with hybrid parameter estimators, which are presented in Section 6.

5 Time of Fault Estimation using CUSUM

In the previous section we presented our fault isolation approach for discrete faults: hypothesizing the faults compatible with the Hybrid Qualitative Fault Signature Matrix and filtering out those faults whose models do not converge. Divergence of non-current models is usually easy to check when dealing with discrete faults. However, the convergence of the current model may be slow if the initial values of the state variables of the model are not known, or if our initial guess is far from the actual values. We assume that we are able to track the system dynamics before the occurrence of a fault; in other words, we assume that we know, or are able to estimate, the state variables before the time of fault occurrence. Hence, in order to speed up the convergence of the current model, it is important to have a good estimation of that time.

The cumulative sum algorithm, CUSUM, introduced by [12] and discussed in detail in [13] and elsewhere, is an optimal fault detection algorithm that can also provide an estimation of the time of fault occurrence, t0, as we detail later. Nevertheless, it makes the strong assumption that the signal we are tracking changes its mean value from a constant initial mean µ0 to a final constant mean µ1.

On the other hand, the Z-test [21] is a sub-optimal fault detection algorithm compared to CUSUM, but it makes no assumptions concerning the properties of the new mean value; in particular, it does not require it to be constant.

In order to have a robust fault detection mechanism and a good approximation of the fault time, we have opted for combining both tests: we use the Z-test to perform fault detection and, afterwards, we estimate the fault time using CUSUM.

CUSUM was designed to detect abrupt changes in the mean of stochastic signals. In the simple case of a Gaussian residual, res(i), of constant variance σ², constant and known initial mean µ0, and constant and known final mean µ1, the decision signal, Sk, is

    Sk = Σ_{i=1..k} si = Σ_{i=1..k} ((µ1 − µ0)/σ²) (res(i) − (µ0 + µ1)/2).

Hence, for a window of N samples with a change in mean at 1 ≤ t0 ≤ N, Sk decreases at the constant rate µ = (µ1 − µ0)/2 for k < t0 and increases by µ for t0 ≤ k. It can be shown [13] that the change time t0 can be estimated as t̂0 = arg min_k Sk.

When µ1 is unknown, it can be set to the residual corresponding to the smallest fault to be detected, typically a few units of the residual noise deviation, σ. This can be done without increasing the false alarm rate because we use the Z-test to perform fault detection, and we only use this CUSUM variant to estimate the time of fault occurrence, t0. We have also tried estimating µ as the empirical mean of the residual, with similar results. In all the cases we have tested, the estimated time of fault occurrence, t̂0, computed by CUSUM is smaller than the detection time provided by the Z-test.

6 Fault Identification with HPCs

Once all the discrete fault candidates have been discarded, we have to perform fault identification for the set of isolated parametric faults. In previous work [14] we proposed to use minimal parameter estimators computed from PCs to generate parameterized estimators. However, that approach is not applicable to hybrid systems fault identification, since mode changes can occur during the identification process. As a solution, we propose an extension of our minimal parameterized estimators which is computed directly from HPCs, and is thus able to handle mode changes during the identification process.

The fault identification process consists of the following steps: (i) model decomposition by offline computation of the set of HPCs from the hybrid bond graph model; (ii) offline computation and selection of the best hybrid estimator for each fault candidate; (iii) after the fault isolation process, an online quantitative parameter estimation procedure over the hybrid estimators related to the set of isolated fault candidates; and (iv) a decision procedure to select the faulty candidate.

Using HPCs we can derive the structure of a hybrid parameterized estimator, ehpck, for a hybrid system. The parameterized estimator ehpck can be used as a hybrid estimator as stated in the following proposition:

Proposition 1. An HPC, HPCk, along with its set of input variables, uhpck, the commanded signals of the switching junctions, swhpck, and the initial value of the parameter to identify, θf, can be used as a parameter estimator via ŷhpck = ehpck(uhpck, θf, swhpck(t)), where the measured variable estimated by the HPC, ŷhpck, is solved in terms of the remaining measured variables.

Each estimator is uniquely related to one HPC, hence it contains the minimal redundancy required for parameter estimation. In this case, each HPC has an executable model that can be used for simulation purposes. For the four-tank system we have obtained the four hybrid parameter estimators shown in Table 4, one for each HPC.

Table 4: Hybrid parameter estimators found for the four-tank system, and their related HPCs.

Estimator   Related PC   Parameters               Inputs       Output
e1          HPC1         R01, R03, R12, R1, C1    Sf, p2, p3   p1
e2          HPC2         R12, R2, C2              p1           p2
e3          HPC3         R01, R03, R34, R3, C3    Sf, p1, p4   p3
e4          HPC4         R34, R4, C4              p3           p4

The basic idea is to use the estimator ehpck to compute estimations ŷhpck for different values of the parameter θf, so that we can find a value of the parameter that minimizes the least squares (LS) error between the estimation ŷhpck and the measured value yhpck.

Fig. 5 shows the parameter estimation process using the hybrid estimators. A parametrized estimator, ehpck, uses the inputs of the system, uhpck, and a parameter value, θf, to generate an estimation of the output, ŷhpck. This estimated output is compared against the observed output, yhpck, by the quadratic error calculator block, which computes the quadratic error between ŷhpck and yhpck for the fault candidate f, Ef². Then the iteration engine block, which contains a nonlinear optimization algorithm, finds the minimum of the error surface Ef²(θf) by iteratively invoking the estimator with different parameter values. The value of the parameter and its minimum LS error will be the output of the parameter estimation block (and the input for the decision procedure block).

Figure 5: Parameter estimation using the hybrid estimators from HPCs.

7 Results

To test the validity of the approach, we implemented the four HPCs for the four-tank system, with their corresponding estimators, and ran different simulation experiments.
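The CUSUM change-time estimation of Section 5, used throughout these experiments, can be sketched as follows (a minimal illustration, not the authors' code; the synthetic residual and noise levels are made up): build the decision signal Sk and return arg min_k Sk as the estimated change time.

```python
import random

def cusum_change_time(res, mu0, mu1, sigma2):
    """Estimate the sample at which the residual mean jumps from mu0 to
    mu1: S_k = sum_{i<=k} ((mu1-mu0)/sigma2)*(res[i] - (mu0+mu1)/2),
    and the change-time estimate is argmin_k S_k (Section 5)."""
    gain = (mu1 - mu0) / sigma2
    mid = (mu0 + mu1) / 2.0
    s, S = 0.0, []
    for r in res:
        s += gain * (r - mid)   # incremental term of the decision signal
        S.append(s)
    return S.index(min(S))      # estimated index of the mean change

# synthetic residual: zero-mean noise, then a jump to mean 2 at sample 60
random.seed(0)
res = [random.gauss(0.0, 0.3) for _ in range(60)] + \
      [random.gauss(2.0, 0.3) for _ in range(40)]
t0_hat = cusum_change_time(res, 0.0, 2.0, 0.3 ** 2)
```

On this synthetic residual the estimate lands at (or within a couple of samples of) the last pre-change sample, in line with t̂0 = arg min_k Sk.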
Pr e s s ur e ( Pas c als )
600 300
p1 p 1 - me as u r e me nt
p2 p 1 - H P C 1 e s t i mat i on
p3
500 p4 200
Pr e s s ur e ( Pas c als )
400 100
300 0
0 50 100 150 200 250 300 350 400 450 500
Tim e ( s )
200
100
Re s i d u al f or H P C 1
50
Re s idual
100
0
0
−50
−100 −100
0 50 100 150 200 250 300 350 400 450 500 0 50 100 150 200 250 300 350 400 450 500
Time ( s ) Tim e ( s )
Figure 6: Measured pressures in the four tanks when a fault Figure 8: Estimation and residual for HP C1 (using
in SW1 is introduced at t = 190 s. CUSUM) when a fault in SW1 occurs and the hypothesized
fault is SW1 .
35
Pr e s s ur e ( Pas c als )
300
p 1 - me as u r e me nt
30 p 1 - H P C 1 e s t i mat i on
200
25
Cus um value
100
20
0
0 50 100 150 200 250 300 350 400 450 500
15 Tim e ( s )
100
10
Re s i d u al f or H P C 1
50
Re s idual
5
0
−50
0
80 100 120 140 160 180 200
Tim e ( s ) −100
0 50 100 150 200 250 300 350 400 450 500
Tim e ( s )
Figure 7: CUSUM output for a fault in SW1 .
Figure 9: Estimation and residual for HP C1 (using
CUSUM) when a fault in SW1 occurs and the hypothesized
In the first experiment, we assume that the water tanks are initially empty and start to fill in at a constant rate. Hence, the initial configuration of the system is SW1 and SW3 set to ON, and SW2 and SW4 set to OFF. Tanks 1 and 3 start to fill in, and at approximately time 20 s the level in both tanks reaches the height of the connecting pipes, and tanks 2 and 4 start to fill in. At time 190 s, a fault occurs in the controlled junction SW1, which switches off (see Fig. 6 for the measured pressures in the four tanks for this experiment).
Four seconds after the fault is introduced, at t = 194 s, both HPC1 and HPC3 trigger, and consequently both SW1 and SW3 are initially considered as discrete fault candidates. At this point, the CUSUM algorithm is run, determining that the fault occurred at t = 191 s. In this case study we use a CUSUM window of size 100. Figure 7 shows the output of the CUSUM algorithm, where the absolute maximum represents the approximate time (approximate due to noise in the system) of fault occurrence.
Once the point of fault occurrence has been determined at t = 191 s, the diagnosis framework takes the values of the simulation at that time instant and launches two parallel diagnosis experiments, one for each hypothesized fault candidate, i.e., SW1(10) and SW3(10). Figs. 8 and 9 show the estimation and the residual for HPC1 when the hypothesized faults are SW1(10) and SW3(10), respectively (we do not show the results for HPC3 since they are similar to the results obtained for HPC1). Looking at the results, it is obvious that the residual converges to zero when a fault in SW1(10) is hypothesized, while the residual when SW3(10) is hypothesized does not converge. Hence, SW1(10) is confirmed as the fault. This confirmation is done by continuously analyzing the residual signals with the Z-test. Please note that, since the CUSUM algorithm gives a good approximation of the point of failure, the residual is able to converge very quickly when the true fault is hypothesized. For comparison purposes, Fig. 10 shows the estimation and residual for HPC1 when CUSUM is not used to re-initialize the simulation (for the hypothesized fault SW1). Comparing this figure with Fig. 8, it is clear that using CUSUM allows the HPC to converge faster.
As a second diagnosis experiment, we start off from the same situation as the previous experiment, but in this case we introduce a small parametric fault and, after a short while,
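The CUSUM stage described above can be sketched as follows. This is a generic two-sided CUSUM over a residual signal, not the authors' exact implementation; the drift value, the fault-free prefix used to estimate the nominal level, and the synthetic residual are all illustrative assumptions:

```python
import numpy as np

def cusum_change_point(signal, drift=0.5):
    """Two-sided CUSUM: accumulate deviations of the signal from its
    nominal level and return the first index where either cumulative
    sum becomes positive, i.e. the approximate change (fault) time."""
    mean0 = float(np.mean(signal[:20]))   # nominal level from a fault-free prefix
    s_pos = np.zeros(len(signal))
    s_neg = np.zeros(len(signal))
    for t in range(1, len(signal)):
        dev = signal[t] - mean0
        s_pos[t] = max(0.0, s_pos[t - 1] + dev - drift)
        s_neg[t] = max(0.0, s_neg[t - 1] - dev - drift)
    s = np.maximum(s_pos, s_neg)
    return int(np.argmax(s > 0)) if np.any(s > 0) else None

# synthetic residual: zero-mean noise with a step change at sample 191
rng = np.random.default_rng(0)
r = rng.normal(0.0, 0.1, 300)
r[191:] += 3.0
print(cusum_change_point(r))  # an index at (or just after) the injected change
```

The drift term suppresses noise-driven accumulation, so small fault-free residuals leave both sums at zero and the step change triggers immediately.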
Proceedings of the 26th International Workshop on Principles of Diagnosis
Figure 10: Estimation and residual for HPC1 (without using CUSUM to re-initialize the simulation) when a fault in SW1 occurs and the hypothesized fault is SW1.

Figure 12: Estimation and residual for HPC1 when a fault in R01 occurs and then SW1 is set to OFF mode.
a discrete change. Specifically, a 20% blockage in the input pipe of tank 1, R01, is introduced at t = 190 s, and then SW1 is commanded to switch OFF at t = 210 s (Fig. 11 shows the measured pressures in the four tanks for this experiment).

Figure 11: Measured pressures in the four tanks when a fault in R01 is introduced at t = 190 s and the switching junction SW1 is turned off at t = 210 s.

Figure 13: CUSUM output for a fault in R01.

For this experiment, both HPC1 and HPC3 trigger at t = 198 s (as an example, see Fig. 12 with the estimation and residual for HPC1), and consequently both SW1(10) and SW3(10) are initially considered as discrete fault candidates. However, in this scenario, after running the CUSUM (see Fig. 13 for the CUSUM output), which estimated the fault time at t = 191 s, and the diagnosis experiments for both fault candidates, none of the residuals was able to converge within a reasonable, empirically determined, amount of time, thus concluding that a parametric fault had occurred. At this point, the fault identification process is triggered for R01, which is the only parametric fault candidate (R03 is discarded due to the qualitative sign of the residuals). The estimated value for parameter R01 was 0.1937, i.e., a 19.37% blockage in the pipe. Please note that the estimator used a total of 60 seconds of data starting from t = 191 s; hence, the estimator was capable of correctly estimating the value of the faulty parameter even though the system transitions from one mode to another during the estimation process.
We ran several experiments with different mode configurations and different faults, varying the fault size and the time of fault occurrence (in some of them introducing the fault immediately after the mode change). The results for all these situations were equivalent to the examples shown in this section.

8 Conclusions

In this work we have presented an approach for hybrid systems fault identification using Hybrid Possible Conflicts. Using HBGs we can generate minimal estimators that can be used for fault identification by considering only the possible mode changes within the estimators. Additionally, we have proposed the integration of the CUSUM algorithm to accurately determine the time of fault occurrence. A more accurate estimation of the fault instant allows us to quickly isolate discrete faults and to obtain a better approximation of the values of the state variables, which are needed as initial values for fault identification.
Diagnosis results using a four-tank system showed that the proposed approach can be successfully used for fault identification of hybrid systems.
In future work, we will test the approach on more complex systems with real data, and will propose a distributed approach for hybrid systems fault diagnosis.

Acknowledgments

This work has been funded by the Spanish MINECO DPI2013-45414-R grant.
State estimation and fault detection using box particle filtering with stochastic
measurements
Joaquim Blesa 1, Françoise Le Gall 2, Carine Jauberthie 2,3 and Louise Travé-Massuyès 2
1 Institut de Robòtica i Informàtica Industrial (CSIC-UPC), Llorens i Artigas, 4-6, 08028 Barcelona, Spain
e-mail: joaquim.blesa@upc.edu
2 CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France; Univ de Toulouse, LAAS, F-31400 Toulouse, France
e-mail: legall,cjaubert,louise@laas.fr
3 Univ de Toulouse, UPS, LAAS, F-31400 Toulouse
Abstract

In this paper, we propose a box particle filtering algorithm for state estimation in nonlinear systems whose model assumes two types of uncertainties: stochastic noise in the measurements and bounded errors affecting the system dynamics. These assumptions respond to situations frequently encountered in practice. The proposed method includes a new way to weight the box particles as well as a new resampling procedure based on repartitioning the box enclosing the updated state. The proposed box particle filtering algorithm is applied in a fault detection schema illustrated by a sensor network target tracking example.

1 Introduction

For various engineering applications, system state estimation plays a crucial role. Kalman filtering (KF) has been widely used in the case of stochastic linear systems. The Extended Kalman Filter (EKF) and the Unscented Kalman Filter (UKF) are KF extensions for nonlinear systems. These methods assume unimodal, Gaussian distributions. On the other hand, Particle Filtering (PF) is a sequential Monte Carlo Bayesian estimator which can be used in the case of non-Gaussian noise distributions. Particles are punctual states associated with weights whose likelihoods are defined by a statistical model of the observation error. The efficiency and accuracy of PF depend on the number of particles used in the estimation and propagation at each iteration. If the number of required particles is too large, a real implementation is unsuitable, and this is the main drawback of PF. Several methods have been proposed to overcome these shortcomings, mainly based on variants of the resampling stage or on different ways to weight the particles ([1]).
Recently, a new approach based on box particles was proposed by [2; 3]. The Box Particle Filter handles box states and bounded errors. It uses interval analysis in the state update stage and constraint satisfaction techniques to perform the measurement update. The set of box particles is interpreted as a mixture of uniform pdfs [4]. Using box particles has been shown to control quite efficiently the number of required particles, hence reducing the computational cost and providing good results in several experiments.
In this paper, we take into account the box particle filtering ideas but consider that measurements are tainted by stochastic noise instead of bounded noise. The errors affecting the system dynamics are kept bounded because this type of uncertainty really corresponds to many practical situations, for example tolerances on parameter values. Combining these two types of uncertainties following the seminal ideas of [5] and [6] within a particle filter schema is the main issue driving the paper. This issue is different from the one addressed in [7], in which the focus is put on Bernoulli filters able to deal with data association uncertainty. The proposed method includes a new way to weight the box particles as well as a new resampling procedure based on repartitioning the box enclosing the updated state.
The paper is organized as follows. Section 2 describes the problem formulation: a summary of Bayesian filtering is presented and the box-particle approach is introduced. The main steps of this approach are developed in Section 3. Sections 4 and 5 are devoted to the repartitioning of the boxes and the computation of the weights of the box particles in order to control the number of boxes. In Section 6 the box particle filter is used for state estimation and fault detection; the results obtained with the proposed method for target tracking in a sensor network are presented in Section 7. Conclusions and future work are overviewed in the last section.

2 Problem formulation

We consider nonlinear dynamic systems represented by discrete time state-space models relating the state x(k) to the measured variables y(k)

x(k + 1) = f(x(k), u(k), v(k))   (1)
y(k) = h(x(k)) + e(k),  k = 0, 1, . . .   (2)

where f : R^nx × R^nu × R^nv → R^nx and h : R^nx → R^ny are nonlinear functions, u(k) ∈ R^nu is the system input, y(k) ∈ R^ny is the system output, x(k) ∈ R^nx is the state-space vector, and e(k) ∈ R^ny is a stochastic additive error that includes the measurement noise and the discretization error and is specified by its known pdf pe. v(k) ∈ R^nx is the process noise.
In this work the process noise is assumed bounded, |vi(k)| ≤ σi with i = 1, . . . , nx, i.e. pv ∼ U([V]), where [V] = [−σ1, σ1] × · · · × [−σnx, σnx].

2.1 Bayesian filtering

Given the vector of available measurements at instant k, Y(k) = {y(i), i = 1, ..., k}, Y(0) = y(0), the Bayesian
solution to compute the posterior distribution of the state vector, given past observations Y(k), is given by (Gustafsson 2002):

p(x(k + 1)|Y(k)) = ∫_{R^nx} p(x(k + 1)|x(k)) p(x(k)|Y(k)) dx(k)   (3)

where the posterior distribution p(x(k)|Y(k)) can be computed by

p(x(k)|Y(k)) = (1/α(k)) p(y(k)|x(k)) p(x(k)|Y(k − 1))   (4)

where α(k) is a normalization constant, and p(y(k)|x(k)) is the likelihood function that can be computed from (2) as:

p(y(k)|x(k)) = pe(y(k) − h(x(k)))   (5)

and p(x(k)|Y(k − 1)) is the prior distribution.
Equations (5), (4) and (3) can be computed recursively given the initial value of p(x(k)|Y(k − 1)) for k = 0, denoted p(x(0)), which represents the prior knowledge about the initial state.

2.2 Objective

Considering the assumptions of our problem, we adopt a particle filtering schema, which is well known for solving numerically complex dynamic estimation problems involving nonlinearities. However, we propose to use box particles and to base our method on the interval framework. Box particle filters have been demonstrated efficient, in particular to reduce the number of particles that must be considered to reach a reasonable level of approximation [2].
Let us consider the current state estimate X(k) as a set, denoted by {X(k)}, that is approximated by Nk disjoint boxes

[x(k)]^i,  i = 1, · · · , Nk   (6)

where [x(k)]^i = [x(k)^i, x̄(k)^i], with x(k)^i, x̄(k)^i ∈ R^nx. The width of every box is smaller or equal to a given accuracy for every component, i.e.

x̄j(k)^i − xj(k)^i ≤ δj,  i = 1, · · · , Nk,  j = 1, . . . , nx   (7)

where δj is the predetermined minimum accuracy for every component j.
Moreover, every box [x(k)]^i is given a prior probability denoted

P([x(k)]^i |Y(k − 1)),  i = 1, · · · , Nk   (8)

with

Σ_{i=1}^{Nk} P([x(k)]^i |Y(k − 1)) ≥ γ   (9)

where γ ∈ [0, 1] is a confidence threshold.
Then, given a new output measurement y(k), the problem that we consider in this paper is:
• to compute the state estimate X(k + 1),
• to decide about the number N_{k+1} of disjoint boxes of the approximation of X(k + 1), each with accuracy smaller or equal to δj,
• to provide the prior probabilities associated to the particles of the new state estimation set

P([x(k + 1)]^i |Y(k)),  i = 1, · · · , N_{k+1}   (10)

3 Interval Bayesian formulation

This section deals with the evaluation of the Bayesian solution of the state estimation problem considering bounded state boxes (6).

3.1 Measurement update

Whereas each particle is defined as a box by (6), the measurement is tainted with stochastic uncertainty defined by the pdf pe. The weight w(k)^i associated to a box particle is updated by the posterior probability P([x(k)]^i |Y(k)):

w(k)^i = (1/Λ(k)) P([x(k)]^i |Y(k − 1)) ∫_{x(k)∈[x(k)]^i} pe(y(k) − h(x(k))) dx(k),  i = 1, . . . , Nk   (11)

where the normalization constant Λ(k) is given by

Λ(k) = Σ_{i=1}^{Nk} P([x(k)]^i |Y(k − 1)) ∫_{x(k)∈[x(k)]^i} pe(y(k) − h(x(k))) dx(k)   (12)

so that

Σ_{i=1}^{Nk} w(k)^i = 1   (13)

The deduction of the measurement update equation (11) from the particle filtering update equation (4) is detailed in the Appendix for nx = 1, without loss of generality. The principle of the proof is that the point particles are grouped into particle groups inside boxes; the posterior probability of a box can then be approximated by the sum of the posterior probabilities of the point particles when the number of these particles tends to infinity.

3.2 State update

This step is similar to the state update in [2] and [3]. Hence, we have:

p(x(k + 1)|Y(k)) ≈ Σ_{i=1}^{Nk} w(k)^i U_{[f]([x(k)]^i, u(k), [v(k)])}   (14)

where U_[·] denotes the uniform pdf supported on the corresponding box. The interval boxes [x(k + 1)|x(k)]^i are computed from (1) using interval analysis as follows:

[x(k + 1)|x(k)]^i ≈ [f]([x(k)]^i, u(k), [v(k)])   (15)

The updated interval boxes inherit the weights w(k)^i of their mother boxes [x(k)]^i, i = 1, . . . , Nk.

4 Resampling

Once the updated boxes [x(k + 1)|x(k)]^i and their associated weights w(k)^i have been computed, the objective is to compute a new set of disjoint boxes. This corresponds to the resampling step of the conventional particle filter.
4.1 Repartitioning

We assume that the new boxes are of the same size, that they cover the whole space defined by the union of the updated boxes [x(k + 1)|x(k)]^i, i = 1, . . . , Nk, and that their weight is proportional to the weight of the former boxes.
For this purpose, a support box set Z is computed as the minimum box such that

Z ⊇ ∪_{i=1}^{Nk} [x(k + 1)|x(k)]^i.   (16)

Z is partitioned into M disjoint boxes of the same size

[z]^i,  i = 1, · · · , M   (17)

where [z]^i = [z^i, z̄^i], z^i, z̄^i ∈ R^nx, and

z̄j^i − zj^i = εj,  i = 1, · · · , M,  j = 1, . . . , nx.   (18)

The box component widths are computed as

εj = (Z̄j − Zj)/mj,  j = 1, . . . , nx   (19)

where mj is the number of intervals along dimension j, computed as

mj = ⌈(Z̄j − Zj)/δj⌉,  j = 1, . . . , nx   (20)

where ⌈.⌉ indicates the ceiling function and δj the minimum accuracy for every state component j defined in Section 2.2. In this way, we guarantee that

εj ≤ δj,  j = 1, . . . , nx   (21)

Finally, the number M of boxes of the uniform grid partition is given by

M = Π_{j=1}^{nx} mj   (22)

Once the new boxes [z]^i have been computed, the weight wz^i of each new box can be computed as

wz^i = Σ_{j=1}^{Nk} [ (Π_{l=1}^{nx} |[xl(k + 1)|x(k)]^j ∩ [zl]^i|) / (Π_{l=1}^{nx} |[xl(k + 1)|x(k)]^j|) ] w(k)^j,  i = 1, . . . , M   (23)

where [zl]^i refers to the l-th component of the vector [z]^i and the interval width x̄l − xl is denoted by |[xl]| for compactness. The new weights fulfill

Σ_{i=1}^{M} wz^i = Σ_{i=1}^{Nk} w(k)^i = 1   (24)

The new weights wz^i in (23) can be computed efficiently using Algorithm 1. This algorithm searches the number Ninter of boxes of Z that intersect every [x(k + 1)|x(k)]^j. Then, the weight w(k)^j is distributed proportionally to the volume of the intersection between the updated box [x(k + 1)|x(k)]^j and each of the Ninter boxes of Z that have a non-empty intersection.

Algorithm 1 Weights of the new boxes.
Algorithm Weights-new-boxes (Z, [x(k + 1)|x(k)]^1, . . . , [x(k + 1)|x(k)]^Nk, w(k)^1, . . . , w(k)^Nk)
  wz^i ← 0,  i = 1, . . . , M
  for j = 1, . . . , Nk do
    [Ninter, Vinter] = intersec([x(k + 1)|x(k)]^j, Z)
    for h = 1, . . . , Ninter do
      i = Vinter(h)
      wz^i = wz^i + (Π_{l=1}^{nx} |[xl(k + 1)|x(k)]^j ∩ [zl]^i| / Π_{l=1}^{nx} |[xl(k + 1)|x(k)]^j|) w(k)^j
    end for
  end for
  Return (wz^1, . . . , wz^M)
endAlgorithm

4.2 Controlling the number of boxes

Once the new disjoint boxes and their associated weights have been computed, the weights can be used to select the set of boxes that are worth pushing forward through the next iteration. This is performed by selecting the boxes with highest weights and discarding the others. In order to fulfill the confidence threshold criterium (9) proposed in Section 2.2, Algorithm 2 is proposed. The set Wz of weights wz^i associated to the boxes [z]^i is defined as

Wz = {wz^1, . . . , wz^M}.   (25)

Given a desired confidence threshold γ, the M disjoint boxes [z]^i that compose the uniform grid partition of Z, and the vector Wz with the associated weights, Algorithm 2 determines the minimum number N_{k+1} of boxes [z]^i with highest weights wz^i that fulfill

Σ_{i=1}^{N_{k+1}} wz^i ≥ γ   (26)

The new state estimate X(k + 1) is approximated by this set of N_{k+1} boxes, and their prior probability by

P([x(k + 1)]^i |Y(k)) ≈ W^i_{k+1},  i = 1, . . . , N_{k+1}.   (27)

where W^i_{k+1} are the N_{k+1} highest weights of Wz associated with the disjoint boxes [x(k + 1)]^i, i = 1, · · · , N_{k+1}, that approximate X(k + 1). W^i_{k+1} can be referred to as the a priori weights.

Algorithm 2 State update at step k + 1 with confidence threshold γ.
Algorithm State-update([z]^1, . . . , [z]^M, Wz, γ)
  γc ← 0, {X(k + 1)} ← {∅}, W_{k+1} ← {∅}, N_{k+1} ← 0
  while γc < γ do
    [value, pos] = max(Wz)
    addbox(X(k + 1), [z]^pos)
    addelement(W_{k+1}, value)
    γc = γc + value
    Wz(pos) ← 0
    N_{k+1} ← N_{k+1} + 1
  endwhile
  Return (X(k + 1), W_{k+1}, N_{k+1})
endAlgorithm
This algorithm generates a set of state boxes {X(k + 1)}, a list of weights W^i_{k+1}, a cumulative weight variable γc, and a cardinality variable N_{k+1}. At the beginning of the algorithm, the state boxes and the weight list are initialized as empty sets, and the cumulative weight and cardinality variables are initialized to zero. The "while" loop operates as a sorting, eliminating the boxes with smallest weights so that the cumulative sum of the boxes with largest weights is greater or equal to the threshold γ. If the state space is not bounded, the threshold 0 < γ < 1 does not guarantee a bounded number of boxes in a worst-case scenario in which the measurements do not emphasize some particles against others. In this case, a maximum number of particles Nmax should be imposed.

5 State estimation and fault detection

5.1 State estimation

Once the set of N_{k+1} disjoint boxes [x(k + 1)]^i, i = 1, · · · , N_{k+1}, that approximate X(k + 1) and their associated a priori weights W^i_{k+1} have been computed, their measurement-updated weights w(k + 1)^i are obtained using (11). Then, according to [2], the state at instant k + 1 is approximated by

x̂(k + 1) = Σ_{i=1}^{N_{k+1}} w(k + 1)^i x0^i(k + 1)   (28)

where x0^i(k + 1) is the center of the particle box [x(k + 1)]^i.
Algorithm 3 summarizes the whole state estimation procedure.

Algorithm 3 State estimation
Algorithm State estimation
  Initialize X(0), N0 and P([x(k)]^i |Y(k − 1)), k = 0, i = 1 . . . N0
  for k = 1, . . . , end do
    Obtain Input/Output data {u(k), y(k)}
    Measurement update
      compute Λ(k) using Eq. (12)
      compute w(k)^i using Eq. (11), i = 1 . . . N0
    State estimation
      compute x̂(k) using (28)
    State update
      compute [x(k + 1)|x(k)]^i, i = 1 . . . N0 using (15)
      compute Z that fulfils (16)
      compute disjoint boxes [z]^i, i = 1, · · · , M of (17)
      compute weights wz^i using Algorithm 1
      compute the new state estimation using Algorithm 2:
        N_{k+1} disjoint boxes that approximate X(k + 1)
        Prior probabilities given by weights W_{k+1}
  end for
endAlgorithm

5.2 Fault detection

In our framework, fault detection can be formulated as detecting inconsistencies based on the state estimation. To do so, we propose the two following indicators:
• Abrupt changes in the state estimation provided by (28) from instant k − 1 to instant k, i.e. abnormally high values of √((x̂(k) − x̂(k − 1))(x̂(k) − x̂(k − 1))^T).
• Abnormally low sum of the unnormalized posterior probability of all the particles at instant k, which means that all the particles have been penalized by the current measurements. This abnormality can be checked by thresholding Λ(k) defined in (12).
If enough representative fault-free data are available, the indicators defined above can be determined by means of thresholds computed with these data. For example, the threshold that defines the abnormal abrupt change in state estimation can be computed as

∆x̂max = β1 max_{i=2,··· ,L} √((x̂(i) − x̂(i − 1))(x̂(i) − x̂(i − 1))^T)   (29)

where L is the length of the fault-free scenario and β1 > 1 is a tuning parameter. Then the fault detection test consists in checking at each instant k whether

√((x̂(k) − x̂(k − 1))(x̂(k) − x̂(k − 1))^T) > ∆x̂max   (30)

In a similar way, the threshold Λmin that defines the minimum expected unnormalized posterior probability can be computed as

Λmin = β2 min_{i=2,··· ,L} Λ(i)   (31)

where Λ(i) is determined using (12) and 0 < β2 < 1 is a tuning parameter. Then the fault detection test consists in checking at each instant k whether

Λ(k) < Λmin   (32)

6 Application example

In this section, the target tracking in a sensor network example presented in [8] is used to illustrate the state estimation method presented above. The problem consists of three sensors and one target moving in the horizontal plane. Each sensor can measure the distance to the target, and by combining these a position fix can be computed. Fig. 1 depicts a scenario with a trajectory and a certain combination of sensor locations (S1, S2 and S3).

Figure 1: Target true trajectory and sensor positions in the bounded horizontal plane

The behaviour of the system can be described by the following discrete time state-space model:
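The calibration (29), (31) and the two tests (30), (32) can be sketched as follows. The synthetic fault-free estimates and unnormalized posteriors are illustrative; β1 = 1.1 and β2 = 0.9 are the values used in the application example:

```python
import numpy as np

def calibrate(x_hat_ff, lambda_ff, beta1=1.1, beta2=0.9):
    """Thresholds (29) and (31) computed from a fault-free scenario."""
    jumps = np.linalg.norm(np.diff(x_hat_ff, axis=0), axis=1)
    return beta1 * float(jumps.max()), beta2 * float(lambda_ff.min())

def fault_test(x_prev, x_curr, lam, dx_max, lam_min):
    """Tests (30) and (32): flag a fault if either threshold is violated."""
    return bool(np.linalg.norm(x_curr - x_prev) > dx_max or lam < lam_min)

rng = np.random.default_rng(1)
x_ff = np.cumsum(rng.normal(0, 0.05, (200, 2)), axis=0)  # smooth fault-free estimates
lam_ff = rng.uniform(50, 120, 200)                       # unnormalized posteriors
dx_max, lam_min = calibrate(x_ff, lam_ff)

print(fault_test(x_ff[-2], x_ff[-1], lam_ff[-1], dx_max, lam_min))   # False
print(fault_test(x_ff[-1], x_ff[-1] + 1.0, 10.0, dx_max, lam_min))   # True
```

By construction, no fault-free sample can violate its own calibration thresholds, while a large jump or a collapsed posterior trips the test.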
x1(k + 1) = x1(k) + Ts v1(k),  x2(k + 1) = x2(k) + Ts v2(k)   (33)

yi(k) = √((x1(k) − Si,1)^2 + (x2(k) − Si,2)^2) + ei(k),  i = 1, 2, 3   (34)

where x1(k) and x2(k) are the object coordinates, bounded by −1 ≤ x1(k) ≤ 3 and −1 ≤ x2(k) ≤ 4 ∀k ≥ 0; Ts = 0.5 s is the sampling time; v1(k) and v2(k) are the speed components of the target, which are unknown but considered bounded by the maximum speed σv = 0.4 m/s (|v1(k)| ≤ σv and |v2(k)| ≤ σv); y1(k), y2(k) and y3(k) are the distances measured by the sensors; Si,j denotes component j of the location of sensor i; and e1(k), e2(k) and e3(k) are the stochastic measurement additive errors, pei ∼ N(0, σi) with σ1 = σ2 = σ3 = √0.05 m.
Fig. 2 shows the evolution of the real sensor distances and measurements in the target trajectory scenario depicted in Fig. 1.

Figure 2: Real and measured distances from the target to the sensors

In order to apply the state estimation methodology presented above, a minimum accuracy δ1 = δ2 = δ = 0.2 m has been selected for both components. No a priori information has been used about the initial state. Then, a uniform grid of disjoint boxes with the same weights and component widths ε1 = ε2 = δ that covers all the bounded coordinates −1 ≤ x1 ≤ 3 and −1 ≤ x2 ≤ 4 has been chosen as initial state X(0). Posterior probabilities of the boxes have been approximated by weights w(k)^i computed from the new sensor distance measurements using (11). The state update has been computed considering the speed bounds in (33). The new boxes have been rearranged considering the minimum accuracy δ, and their associated weights have been computed as described in Section 4.1. Finally, Algorithm 2 with threshold γ = 1 has been applied to reduce the number of boxes.
Figs. 3 and 4 depict the box weights and their contours using measurement y1(1) (up) and all the measurements at instant k = 1 (y1(1), y2(1) and y3(1)) (down). Fig. 5 depicts the box weights and their contours using the measurements at hand at instant k = 2.

Figure 3: Box weights using measurement y1(k) (up) and measurements (y1(k), y2(k), y3(k))T (down)

Figure 4: Box weight contours using measurement y1(k) (up) and measurements (y1(k), y2(k), y3(k))T (down)

The real trajectory and the one estimated using (28) are shown in Fig. 6.
Finally, different additive sensor faults have been simulated, and satisfactory results of the fault detection tests (30) and (32) have been obtained for faults bigger than 0.5 m, using thresholds ∆x̂max and Λmin computed with (29) and (31) with L = 3200, β1 = 1.1 and β2 = 0.9.
Fig. 7 shows the real trajectory and the one estimated using (28) when an additive fault of +0.5 m affects sensor S1 at time k = 22. The behaviour of the fault detection tests (30) and (32) is depicted in Fig. 8. As seen in this figure, both thresholds are violated at time instant k = 22, and therefore the fault is detected at this time instant.
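The model (33)–(34) is straightforward to simulate. The sketch below uses illustrative sensor positions, since the paper gives S1, S2 and S3 only graphically in Fig. 1:

```python
import numpy as np

# illustrative sensor positions S_i = (S_i1, S_i2); the paper's exact values
# are only shown graphically in Fig. 1
S = np.array([[0.5, 0.2], [1.8, 2.0], [2.2, 1.0]])
SIGMA = np.sqrt(0.05)      # measurement noise std, sigma_1 = sigma_2 = sigma_3
TS, SIGMA_V = 0.5, 0.4     # sampling time and speed bound

def step(x, v):
    """Eq. (33): constant-velocity state update (|v_i| <= SIGMA_V assumed)."""
    return x + TS * v

def measure(x, rng):
    """Eq. (34): noisy distances from the target at x to the three sensors."""
    return np.linalg.norm(S - x, axis=1) + rng.normal(0.0, SIGMA, size=3)

rng = np.random.default_rng(0)
x = np.array([1.0, 1.5])
x = step(x, np.array([0.3, -0.2]))   # one motion step within the speed bound
y = measure(x, rng)
print(y.shape)  # (3,)
```

Note that the process noise enters through the bounded speed, which is exactly why the state update can be propagated as an interval box, while the distance measurements carry the stochastic Gaussian error.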
Box particle filtering weight of boxes using available measurements at instant k=2
4
0.2 3.5
real
3 Box Particle Filter
0.1
2.5
0 S2
4 2
2 3
2 k=21 k=22
0 1
x2 (m)
−2 0
−1 1.5
Fault Detection
Box particle filtering weight contour of boxes using available measurements at instant k=2 1
4 S
0.5 3
Real point
3 Estimated BPF S1
0
2
1 −0.5
0 −1
−1 −0.5 0 0.5 1 1.5 2 2.5 3
−1 x (m)
1
−1 −0.5 0 0.5 1 1.5 2 2.5 3
Figure 7: Trajectories in fault scenario
Figure 5: Box weights (up) and Box weights contours
(down) at instant k = 2
Figure 6: Trajectories

Figure 8: Fault indicators and thresholds in the fault scenario

7 Conclusion and perspectives
A box particle algorithm has been proposed for estimation and fault detection in the case of nonlinear systems with stochastic and bounded uncertainties. Using this method in the case of a target tracking sensor network illustrates its feasibility. It has been shown how the measurement update step for the box particle is derived from the particle case. However, convergence and stability of this filter have yet to be proved. Resampling unfortunately drops information and waives the guaranteed results that characterize interval analysis based solutions. However, without resampling the particle filter suffers from sample depletion. This is the reason why resampling is a critical issue in particle filtering (Gustafsson 2002). This approach has to be compared to other PF variants which reduce the number of particles [2], and further investigations concerning resampling are required, in particular if we want to take better benefit of the interval based approach.

Acknowledgments
This work has been partially funded by the Spanish Ministry of Science and Technology through the Project ECOCIS (Ref. DPI2013-48243-C2-1-R) and Project HARCRICS (Ref. DPI2014-58104-R).

A Demonstration of Measurement Update: "From Particles to Boxes"

A.1 Particle filtering
Consider the particles {x(k)^j}_{j=1}^N uniformly distributed in x(k)^j ∈ [\underline{x}(k), \overline{x}(k)] ∀j = 1, ..., N, where \underline{x}(k), \overline{x}(k) ∈ R. Then, according to [1], the relative posterior probability for each particle is approximated by

    P(x(k)^j | Y(k)) ≈ (1/c(k)) P(x(k)^j | Y(k−1)) p_e(y(k) − h(x(k)^j))    (35)

with

    c(k) = \sum_{j=1}^{N} P(x(k)^j | Y(k)).    (36)
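The update (35)-(36) and the grouping argument of Appendix A.2 can be illustrated numerically. The following Python sketch (all names hypothetical; scalar state, identity observation function h, unnormalized Gaussian error density p_e) shows that summing the updated particle weights inside each box approximates the box weights of the box particle filter:

```python
import numpy as np

def particle_update(particles, weights, y, h, pe):
    """Measurement update (35)-(36): reweight each particle by the
    measurement-error density and normalize by c(k)."""
    w = weights * pe(y - h(particles))
    return w / w.sum()

def box_weights(particles, weights, edges):
    """Group particles into the boxes [edges[i], edges[i+1]] and sum
    their weights, as in the limit argument leading to eq. (41)."""
    idx = np.digitize(particles, edges) - 1
    return np.array([weights[idx == i].sum() for i in range(len(edges) - 1)])

# Toy example: N particles uniform on [0, 4], h(x) = x, Gaussian p_e.
rng = np.random.default_rng(0)
N = 100_000
particles = rng.uniform(0.0, 4.0, N)
weights = np.full(N, 1.0 / N)
pe = lambda e: np.exp(-0.5 * (e / 0.5) ** 2)   # sigma = 0.5, unnormalized

w = particle_update(particles, weights, y=2.0, h=lambda x: x, pe=pe)
bw = box_weights(particles, w, edges=np.array([0.0, 1.0, 2.0, 3.0, 4.0]))
print(bw.round(3))   # the two boxes around the measurement y = 2 dominate
```

With many particles per box, bw approaches the box posterior of eq. (48), whose numerator integrates p_e over each box.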
A.2 Grouping particles
If we group the N particles in N_g groups of ∆N elements

    {x(k)^j}_{j=1}^{N} = \bigcup_{i=1}^{N_g} {x(k)^l}_{l=1+(i−1)∆N}^{i∆N}    (37)

with N_g = N/∆N. If we select the groups of points in such a way that

    {x(k)^l}_{l=1+(i−1)∆N}^{i∆N} ∈ [x(k)]^i   ∀i = 1, ..., N_g    (38)

where

    [x(k)]^i = [\underline{x}(k) + (i−1)∆L, \underline{x}(k) + i∆L]    (39)

with

    ∆L = (\overline{x}(k) − \underline{x}(k))/N_g.    (40)

If the number of particles N → ∞ and therefore ∆N → ∞,

    P([x(k)]^i | Y(k)) ≈ \sum_{j=1+(i−1)∆N}^{i∆N} P(x(k)^j | Y(k))    (41)

according to (35)

    P([x(k)]^i | Y(k)) ≈ \frac{\sum_{j=1+(i−1)∆N}^{i∆N} P(x(k)^j | Y(k−1)) p_e(y(k) − h(x(k)^j))}{\sum_{l=1}^{N_g} \sum_{j=1+(l−1)∆N}^{l∆N} P(x(k)^j | Y(k−1)) p_e(y(k) − h(x(k)^j))}.    (42)

If we consider that the particles in the same group i have the same prior probabilities, then:

    p(x(k)^j | Y(k−1)) = \frac{P([x(k)]^i | Y(k−1))}{∆N}   ∀j = 1+(i−1)∆N, ..., i∆N    (43)

and (42) leads to

    P([x(k)]^i | Y(k)) ≈ \frac{P([x(k)]^i | Y(k−1)) \sum_{j=1+(i−1)∆N}^{i∆N} p_e(y(k) − h(x(k)^j))}{\sum_{l=1}^{N_g} \left( P([x(k)]^l | Y(k−1)) \sum_{j=1+(l−1)∆N}^{l∆N} p_e(y(k) − h(x(k)^j)) \right)}.    (44)

If the N particles are uniformly distributed in the interval [\underline{x}(k), \overline{x}(k)], i.e.

    x(k)^j − x(k)^{j−1} = ∆x(k)   ∀j = 2, ..., N    (45)

where

    ∆x(k) = \frac{\overline{x}(k) − \underline{x}(k)}{N} = \frac{∆L}{∆N},    (46)

then

    \sum_{j=1+(i−1)∆N}^{i∆N} p_e(y(k) − h(x(k)^j)) ∆x(k) ≈ \int_{(1+(i−1)∆N)∆x(k)}^{(i∆N)∆x(k)} p_e(y(k) − h(x(k))) dx(k) ≈ \int_{x(k)∈[x(k)]^i} p_e(y(k) − h(x(k))) dx(k).    (47)

Finally, multiplying the numerator and denominator of equation (44) by ∆x, we obtain the particle box measurement update equation

    P([x(k)]^i | Y(k)) ≈ \frac{P([x(k)]^i | Y(k−1)) \int_{x(k)∈[x(k)]^i} p_e(y(k) − h(x(k))) dx(k)}{\sum_{l=1}^{N_g} \left( P([x(k)]^l | Y(k−1)) \int_{x(k)∈[x(k)]^l} p_e(y(k) − h(x(k))) dx(k) \right)}    (48)

that corresponds to equation (11) with

    Λ(k) = \sum_{l=1}^{N_g} \left( P([x(k)]^l | Y(k−1)) \int_{x(k)∈[x(k)]^l} p_e(y(k) − h(x(k))) dx(k) \right).    (49)

References
[1] F. Gustafsson, F. Gunnarsson, N. Bergman, U. Forssell, J. Jansson, R. Karlsson, and P.J. Nordlund. Particle filters for positioning, navigation, and tracking. IEEE Transactions on Signal Processing, 50(2):425–437, 2002.
[2] F. Abdallah, A. Gning, and P. Bonnifait. Box particle filtering for nonlinear state estimation using interval analysis. Automatica, 44(3):807–815, 2008.
[3] A. Doucet, N. De Freitas, and N. Gordon. An introduction to sequential Monte Carlo methods. Springer, 2001.
[4] A. Gning, L. Mihaylova, and F. Abdallah. Mixture of uniform probability density functions for non linear state estimation using interval analysis. In Information Fusion (FUSION), 2010 13th Conference on, pages 1–8. IEEE, 2010.
[5] R.M. Fernández-Cantí, S. Tornil-Sin, J. Blesa, and V. Puig. Nonlinear set-membership identification and fault detection using a Bayesian framework: Application to the wind turbine benchmark. In Proceedings of the IEEE Conference on Decision and Control, pages 496–501, 2013.
[6] J. Xiong, C. Jauberthie, L. Travé-Massuyès, and F. Le Gall. Fault detection using interval Kalman filtering enhanced by constraint propagation. In Proceedings of the IEEE Conference on Decision and Control, pages 490–495, 2013.
[7] A. Gning, B. Ristic, and L. Mihaylova. Bernoulli particle/box-particle filters for detection and tracking in the presence of triple measurement uncertainty. IEEE Transactions on Signal Processing, 60(5):2138–2151, 2012.
[8] F. Gustafsson. Statistical sensor fusion. Studentlitteratur, Lund, 2010.
Minimal Structurally Overdetermined Sets Selection for Distributed Fault Detection

Hamed Khorasgani 1, Gautam Biswas 1, and Daniel Jung 2
1 Institute of Software Integrated Systems, Vanderbilt University, USA
e-mail: {hamed.g.khorasgani,gautam.biswas}@vanderbilt.edu
2 Dept. of Electrical Engineering, Linköping University, Sweden
e-mail: daner@isy.liu.se
Abstract

This paper discusses a distributed diagnosis approach, where each subsystem diagnoser operates independently without a coordinator that combines local results and generates the correct global diagnosis. In addition, the distributed diagnosis algorithm is designed to minimize communication between the subsystems. A Minimal Structurally Overdetermined (MSO) set selection approach is developed as a Binary Integer Linear Programming (BILP) optimization problem for subsystem diagnoser design. For cases where a complete global model of the system may not be available, we develop a heuristic approach, where individual subsystem diagnosers are designed incrementally, starting with the local system MSOs and progressively extending the local set to include MSOs from the immediate neighbors of the subsystem. The inclusion of additional neighbors continues till the MSO set ensures correct global diagnosis results. A multi-tank system is used to demonstrate and validate the proposed methods.

1 Introduction
The Minimal Structurally Overdetermined (MSO) sets approach has been used extensively for designing model based fault detection and isolation (FDI) schemes for complex systems [Krysander et al., 2008a; Krysander et al., 2008b; Svard et al., 2012]. However, for large complex systems such as aircraft and other transportation systems, manufacturing processes, supply chain and distribution networks, and power generation and the power grid, it is becoming imperative to develop distributed approaches to monitoring and diagnosis to overcome the need for complete global models, while also addressing computational complexity and reliability problems for the diagnosers [Leger et al., 1999; Shum et al., 1988; Deb et al., 1998; Lanigan et al., 2011].

Unlike centralized approaches, distributed approaches are more reliable because they avoid single points of failure. In addition, they can reduce the problems of noise, corruption, and losses that can occur when transmitting signals from individual subsystems to a centralized fault diagnosis unit. Measurement noise and signal corruption can significantly affect diagnoser robustness and accuracy [Ferrari et al., 2012]. Transmission delays not only increase detection time, but can also affect the order of detection, which can further affect diagnostic accuracy. Detection time is important for the safe and reliable operation of safety-critical systems. Faster fault detection and isolation enables accompanying fault tolerant control units to react in a timely manner, thus reducing damage and down time of systems [Roychoudhury et al., 2009; Daigle et al., 2007; Duarte Jr and Nanya, 1998; Rish et al., 2005; Bregon et al., 2014]. The computational intractability of building centralized diagnosers for large systems is another important reason to develop distributed solutions for FDI problems.

In this paper, we formulate the distributed minimal structurally overdetermined set selection as a binary integer linear programming (BILP) problem [Wolsey, 1998]. The approach efficiently picks a minimal number of measurements from a subsystem and its neighboring subsystems to develop a local diagnoser for each subsystem of the larger, complex dynamic system. We start with an efficient algorithm designed by [Krysander et al., 2008a] for finding minimally overdetermined sets of constraints to generate the minimal structurally overdetermined (MSO) sets for designing the diagnoser. Other researchers have employed binary integer programming and binary linear integer programming for optimal sensor placement for fault detection and isolation [Sarrate et al., 2007; Rosich et al., 2009]. In this paper, we utilize BILP for distributed MSO selection to facilitate an efficient distributed diagnosis approach.

Our method is designed in a way that the subsystem diagnosers, once designed, can operate independently with no communication with the other subsystem diagnosers (other than a minimal number of shared measurements), but still provide globally correct diagnosis results. Unlike [Lafortune, 2007; Debouk et al., 2000; Indra et al., 2012], this method does not require the use of a centralized coordinator during on-line operations. Therefore, we avoid the single point-of-failure problem of centralized diagnosers. Our method assumes the availability of a global system model from which the set of MSOs for the system can be derived. The independent subsystem diagnosers are designed to minimize the sharing of measurements across subsystems, thus decreasing the cost, and increasing the reliability of the overall system diagnosis.

However, global models of a complex system are hard to construct and may not be readily available. Subsystems are often provided by different manufacturers, who are not willing to pass along all of the intellectual property associated with the subsystem to the system integrator. Therefore, to avoid the unrealistic assumption that the complete model of the complex system is available for subsystem diagnoser design, we propose a second algorithm that constructs the individual subsystem diagnosers without assuming the availability of a global model. The modified algorithm is computationally more efficient, but we cannot guarantee that the shared measurements between the subsystems are minimal globally (i.e., across the entire system).

The rest of this paper is organized as follows. The background material, definitions, and the running example, a four-tank system, are presented in Section 2. The distributed diagnosis problem formulation is presented in Section 3. Algorithm 1 for distributed MSO set selection is described in Section 4. The heuristic modification to Algorithm 1 when the global model is not available is presented in Section 5 as the incremental algorithm. Section 6 discusses the contributions of the paper in relation to previous work, and presents the conclusion of the paper.

2 Background
This section introduces the basic concepts associated with MSO set selection for structural diagnosis of dynamic systems. The system model S is defined as follows.

Definition 1 (System model). A system model S is a four-tuple (V, M, E, F), where V is the set of variables, M is the set of measurements, E is the set of equations, and F is the set of system faults.

We use a configured four tank system, shown in Figure 1, as a running example throughout this paper to describe the problem, and to illustrate the algorithms for distributed MSO set selection. We assume each tank, and the outlet pipe to its right, constitute a subsystem. Therefore, this system has four subsystems. Two of the subsystems, 1 and 3, also have inflows into their tanks. We assume the subsystems are disjoint, i.e., they have no overlapping components. Associated with each subsystem are a set of measurements that are shown as encircled variables in the figure.

Figure 1: Running example: Four Tank System.

More generally, we assume the system S has n predefined subsystems, S1, S2, ..., Sn. Each subsystem model is defined as:

Definition 2 (Subsystem model). A subsystem model of system model S, Si (1 ≤ i ≤ k), is also a four-tuple (Vi, Mi, Ei, Fi), where Vi ⊆ V, Mi ⊆ M, Ei ⊆ E and Fi ⊆ F. Also, S1 ∪ S2 ∪ ... ∪ Sk = S.

For illustration, the first subsystem in our running example is described by the following set of equations:

    e1: ṗ1 = (1/(C_{T1} + f1))(q_{in1} − q1)
    e2: q1 = (p1 − p2)/(R_{P1} + f2)
    e3: p1 = \int ṗ1 dt
    e4: q_{in1} = u1
    e5: p1 = y1
    e6: q1 = y2.    (1)

Therefore, E1 = {e1, e2, e3, e4, e5, e6} defines the set of equations, V1 = {ṗ1, p1, p2, q_{in1}, q1} defines the set of variables, M1 = {u1, y1, y2} defines the set of subsystem measurements, and F1 = {f1, f2} defines the set of faults associated with this subsystem model.

Similarly, the second subsystem model is defined by the following equations:

    e7: ṗ2 = (1/(C_{T2} + f3))(q1 − q2)
    e8: q2 = (p2 − p3)/(R_{P2} + f4)
    e9: p2 = \int ṗ2 dt
    e10: p2 = y3
    e11: q2 = y4.    (2)

For this subsystem the set of equations is E2 = {e7, e8, e9, e10, e11}, the set of variables is V2 = {ṗ2, p2, p3, q1, q2}, the set of measurements is M2 = {y3, y4}, and F2 = {f3, f4} is the set of faults.

In this paper, we assume there are no overlapping components among the subsystems. However, the subsystems may share variables at their interface. For example, the liquid flowrate at the outlet pipe of subsystem i, qi, equals the liquid flowrate at the input to the connected tank i + 1.

Definition 3 (First Order Connected Subsystems). Two subsystems, Si and Sj, are defined to be first order connected if and only if they have at least one shared variable.

In the running example, subsystems S1 and S2 are first order connected and their shared variables are V1 ∩ V2 = {p2, q1}. The two other subsystems in the running example are:

    e12: ṗ3 = (1/C_{T3})(q_{in2} + q2 − q3)
    e13: q3 = (p3 − p4)/(R_{P3} + f5)
    e14: p3 = \int ṗ3 dt
    e15: q_{in2} = u2
    e16: q3 = y5.    (3)

    e17: ṗ4 = (1/(C_{T4} + f6))(q3 − q4)
    e18: q4 = p4/R_{P4}
    e19: p4 = \int ṗ4 dt
    e20: p4 = y6.    (4)

In more general terms, ith order connected subsystem models are defined as follows.

Definition 4 (ith Order Connected Subsystems). Two subsystems, Sk and Sj, are defined to be ith order connected if and only if there exists a subsystem model Sm that is (i−1)th order connected to Sk, and is first-order connected to Sj, or Sm is (i−1)th order connected to Sj, and is first-order connected to Sk.
For example, in the four tank system, S1 and S3 are second order connected because both of them are first order connected to S2.

In this paper, we use MSO sets [Krysander et al., 2008b] as the primary conceptual approach for fault detection and isolation. The formal definitions of Structurally Overdetermined (SO) and MSO sets are:

Definition 5 (Structurally Overdetermined Set). Consider a set of equations and its associated variables, measurements, and faults: (E, V, M, F). This set of equations is structurally overdetermined (SO) if the cardinality of the set {E} is greater than the cardinality of set {V}, i.e. |E| > |V|.

Definition 6 (Minimal Structurally Overdetermined Set). A set of overdetermined equations is minimal structurally overdetermined (MSO) if it has no subset of structurally overdetermined equations.

Consider subsystem S1 of the four tank system in equation (1). Using the software developed by [Krysander et al., 2008a], we can compute the only minimal structurally overdetermined set in this subsystem as MSO11 = (E11, V11, M11, F11), where E11 = {e1, e3, e4, e5, e6}, V11 = {ṗ1, p1, q_{in1}, q1}, M11 = {u1, y1, y2} and F11 = {f1}. For the sake of brevity and simplification, in the rest of the paper we simply say a specific equation, variable, measurement, or fault is a member of an MSO. For example, we say f1 ∈ MSO11.

MSOs represent the redundancies in the system and can form the basis for fault detection and isolation. Global and local fault detectability are defined as:

Definition 7 (Globally detectable fault). A fault f ∈ F is globally detectable in system S if there is a minimal structurally overdetermined set MSOi in the system, such that f ∈ MSOi.

Definition 8 (Locally detectable fault). A fault f ∈ Fi is locally detectable in subsystem Si if there is a minimal structurally overdetermined set MSOi in the subsystem such that f ∈ MSOi.

Consider Definition 8 and equation (1). Fault f1 is locally detectable because f1 ∈ MSO11, but f2 is not locally detectable since there is no MSO in this subsystem that includes f2. To detect f2 locally, the diagnosis subsystem needs to include additional measurements. Global and local fault isolability are defined as:

Definition 9 (Globally isolable fault). A fault fi ∈ F is globally isolable from fault fj ∈ F if there exists a minimal structurally overdetermined set MSOi in the system S, such that fi ∈ MSOi and fj ∉ MSOi.

Definition 10 (Locally isolable fault). A fault fi ∈ Fi is locally isolable from fault fj ∈ F if there exists a minimal structurally overdetermined set MSOi in subsystem Si, such that fi ∈ MSOi and fj ∉ MSOi.

Note that if a fault fi is locally detectable in a subsystem Si, it is globally detectable too, and if a fault fi is locally isolable from a fault fj, it is globally isolable from fj as well. The problem of MSO selection is presented as a binary integer linear programming (BILP) problem in this paper. BILP is a special case of the integer linear programming problem (ILP), where the unknowns to be solved for are binary variables.¹

Definition 11 (Binary integer linear programming problem (BILP)). A binary integer linear programming problem is a special case of an integer linear programming (ILP) optimization problem in which some or all the unknown variables to be solved for are required to be binary, and the constraints in the problem and the objective function, like ILP, are linear.

The mathematical formulation of BILP is as follows:

    min c^T x
    s.t. A x ≤ b,
    ∃ x_b ⊂ x, ∀ x_k ∈ x_b ⇒ x_k ∈ {0, 1},    (5)

where vector c is the cost weights, matrix A and vector b define linear constraints, x represents the variables, and x_b represents the binary variables [Wolsey and Nemhauser, 2014].

3 Problem Formulation
Designing a set of distributed diagnosers that together have the same diagnosability as a centralized diagnoser is the focus of our work in this paper. In the ideal case, each subsystem includes sufficient redundancies, such that its set of MSOs is sufficient to detect and isolate all of its faults, Fi, uniquely and unambiguously. In that case, we can associate an independent diagnoser Di with each subsystem Si; 1 ≤ i ≤ k, and each diagnoser operates with no centralized control, and no exchange of information with other diagnosers. If the independence among diagnosers does not hold, then the subsystems need to communicate some of their measurements to other subsystems to detect and isolate the faults. To address this problem in an efficient way, we derive an integrated approach to select a set of MSOs for each subsystem that guarantees full diagnosability and minimum exchange of measurements among subsystems.

Consider subsystems Si; 1 ≤ i ≤ k, with sets of local fault candidates, Fi, such that \bigcup_{i=1}^{k} Fi = F. We may need to augment each subsystem with additional measurements that are typically acquired from the (nearest) neighbors of the subsystem, such that all of the faults associated with the extended model of this subsystem are detectable and isolable. In the worst case, all of the measurements from another subsystem may have to be included to make the current subsystem diagnosable. When such a situation occurs, we say the two subsystems are merged and represented by a common diagnoser; therefore, the total number of independent distributed diagnosers may be less than k.

Each MSO is sensitive to a set of faults and, therefore, can be used to detect them and isolate them from the other faults in the system. For each subsystem Si, our goal is to find a minimal set of MSOs that provides maximum detectability and isolability to that subsystem. A set of MSOs is minimal if there is no subset of MSOs that provides the same detectability and isolability. To achieve distributed fault diagnosis, we also want each subsystem to use the minimum number of measurements from the other subsystems. In other words, we want to minimize communication or the amount of data (measurements) to be transmitted between the subsystems. More formally, the problem for designing a diagnoser for a particular subsystem Si can be described as follows:

¹ See definition in Wikipedia: https://en.wikipedia.org/wiki/Integer_programming.
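A minimal, hedged sketch of problem (5) in Python (exhaustive search over {0,1}^n, workable only for tiny n; realistic instances would use the branch-and-bound or branch-and-cut solvers cited in the sequel):

```python
from itertools import product
import numpy as np

def bilp_bruteforce(c, A, b):
    """min c^T x  s.t.  A x <= b,  x binary -- the form of eq. (5).
    Exhaustive search; the general problem is NP-hard."""
    best_cost, best_x = None, None
    for bits in product((0, 1), repeat=len(c)):
        x = np.array(bits)
        if np.all(A @ x <= b):
            cost = float(c @ x)
            if best_cost is None or cost < best_cost:
                best_cost, best_x = cost, x
    return best_cost, best_x

# Toy instance: choose the cheapest x with x1 + x2 + x3 >= 2,
# written as -(x1 + x2 + x3) <= -2 to match the A x <= b form.
c = np.array([3.0, 1.0, 2.0])
A = np.array([[-1.0, -1.0, -1.0]])
b = np.array([-2.0])
cost, x = bilp_bruteforce(c, A, b)
print(cost, x)   # 3.0 [0 1 1]
```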
Consider MSO = {MSO1, MSO2, ..., MSOr} as the set of possible MSOs for the subsystem Si. We need to develop an algorithm to select a minimal subset of MSO that guarantees maximal structural detectability and isolability for the faults Fi associated with the subsystem, and includes a minimum number of measurements from the other subsystems in the system to assure the equivalence of local and global diagnosability, i.e.,

    ∀Si; 1 ≤ i ≤ k:
    Select MSO_{Si} ⊂ MSO
    s.t. \min_{M_o ⊆ M} |M_o|
    D_i(M_i ∪ M_o) = D_i(M),
    I_i(M_i ∪ M_o) = I_i(M),    (6)

where M_o represents the set of measurements we need to communicate to the subsystem Si along with the set of measurements, M_i, associated with the subsystem Si. M represents the set of all measurements in the system. For a given set of measurements X, D_i(X) represents the set of detectable faults in Fi, and I_i(X) represents the set of isolable faults in Fi from the system faults, F.

In the next section we formulate the problem as a BILP problem. Formulating the problem as a BILP enables us to use a number of well-developed tools like branch and bound algorithms [Land and Doig, 1960] and branch and cut algorithms [Mitchell, 2002] to solve the problem. However, much like integer linear programming, the general BILP solution is exponential.

To formulate the problem (6) as a BILP problem, we define a binary variable x(k), 1 ≤ k ≤ l, for measurement m_k in the system as follows:

    x(k) = \begin{cases} 1 & \text{if } m_k ∈ M_i ∪ M_o \\ 0 & \text{if } m_k ∉ M_i ∪ M_o, \end{cases}    (7)

where M_o is the answer to problem (6). We also define x(k + l), 1 ≤ k ≤ r, for MSO MSOk in the system as follows:

    x(k + l) = \begin{cases} 1 & \text{if } MSO_k ∈ MSO_i \\ 0 & \text{if } MSO_k ∉ MSO_i. \end{cases}    (8)

To minimize the number of measurements from the other subsystems, we develop the following cost function c:

    c(k) = \begin{cases} 0 & \text{if } m_k ∈ M_i \\ 1 & \text{if } m_k ∈ M \setminus M_i \\ 0 & \text{if } l < k ≤ l + r, \end{cases}    (9)

where l is the number of system measurements and r is the number of MSOs in the system. Using the algorithm proposed in [Krysander et al., 2008a], 165 MSOs are generated for the running example, the four tank system. Since there are 8 measurements in the system, c is a vector with 173 elements for this example.

Consider subsystem Si with local faults Fi and the set of system faults, F. Each local fault fj ∈ Fi has to be locally detectable. Given Definition 8, we can guarantee local detectability of all the faults fj ∈ Fi with the following constraints in the optimization problem (5).
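For concreteness, here is a hedged sketch of the cost vector of eq. (9) for the four-tank example: l = 8 measurements and r = 165 MSOs give the 173-element c mentioned above. The measurement ordering and the choice of local set M1 = {u1, y1, y2} (subsystem S1) are illustrative assumptions, not the paper's exact encoding:

```python
import numpy as np

l, r = 8, 165                                        # measurements, MSOs
M = [f"y{j}" for j in range(1, 7)] + ["u1", "u2"]    # assumed ordering of M
M1 = {"u1", "y1", "y2"}                              # local measurements of S1

c = np.zeros(l + r)
for k, mk in enumerate(M):
    c[k] = 0.0 if mk in M1 else 1.0   # only non-local measurements cost 1
# entries l .. l+r-1 (the MSO selection variables x(k+l)) stay 0, as in (9)

print(len(c), int(c.sum()))   # 173 variables, 5 of them carrying unit cost
```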
4 MSOs Selection for Distributed Fault Detection

• if α_k > 1, θ_{k,bol} is replaced by \inf(\underline{\hat{θ}}_k(T_0), \overline{\hat{θ}}_k(T_0)),
• if α_k < 1, θ_{k,bol} is replaced by \sup(\underline{\hat{θ}}_k(T_0), \overline{\hat{θ}}_k(T_0)),

and one of the bounds of the components of S(θ)_{T_1} remains equal to θ_{k,eol}.

In the general case, when considering the inspection time T_i, S(θ)_{T_i} is hence obtained with \hat{θ}(T_{i−1}) as follows:
• if α_k > 1, \inf(\underline{\hat{θ}}_k(T_{i−1}), \overline{\hat{θ}}_k(T_{i−1})) is replaced by \inf(\underline{\hat{θ}}_k(T_i), \overline{\hat{θ}}_k(T_i)),
• if α_k < 1, \sup(\underline{\hat{θ}}_k(T_{i−1}), \overline{\hat{θ}}_k(T_{i−1})) is replaced by \sup(\underline{\hat{θ}}_k(T_i), \overline{\hat{θ}}_k(T_i)).

Figure 3 – Partition P2 and test results for this partition.

We iterate the process until the precision gain G(P_{i+1}/P_i) is greater than a given threshold, as shown in Fig. 4.

Figure 4 – Test results for partition P3.

4.3 FRP Parameter Estimation for a Single Parameter
In this section, we consider one single parameter θ whose evolution is monotonically increasing. As an example, let us state that θ is a bearing friction coefficient that grows with the bearing wear and the clogging of the environment. In the general case, this kind of knowledge must be brought by an expert of the system and/or the manufacturer.

Let us consider the first inspection time and the initial search space S(θ)_{T_0} given by the domain value of the parameter Ω(θ) = [θ_{bol}, θ_{eol}]. The search space is partitioned into boxes, in our case intervals (cf. Fig. 2).

The dynamic equation of Σ is integrated on the time window t_i = t_0, ..., t_H, where t_H = T_0, as many times as the number of intervals in the partition P1. The number of intervals is defined by the partition factor ε(P1), which equals 1/15 in our example (cf. Fig. 2). We start with [θ]^1 = [θ_{bol}, θ_{bol} + pw], where pw = ε(P1) w(S(θ)_{T_0}) is the width of the partition intervals, then proceed with the subsequent intervals [θ]^j. For each interval, we get an estimation of the state vector at times t_i = t_0, ..., t_H, denoted as x̂(t_0 ... t_N)^j, and obtain ŷ(t_0 ... t_N)^j thanks to the observation equation (1). The latter is tested for consistency against the measurements y_m(t_0 ... t_N).

Depending on the output of the tests (4), (5), and (7), the parameter interval [θ] is rejected or added to the solution as feasible or undetermined (red-colored, green-colored, and yellow-colored parts, respectively, in Fig. 2).

Figure 2 – Partition P1 and test results for this partition.

The convex union of feasible and undetermined intervals provides a guaranteed estimation \hat{θ} = [\underline{\hat{θ}}, \overline{\hat{θ}}] of the admissible values for θ. We iterate the process by creating a new partition P2 of [\underline{\hat{θ}}, \overline{\hat{θ}}] with a precision ε(P2) = 1/10 (cf. Fig. 3).

Remarks
The method can be easily generalized to a system whose parameter vector has dimension n_θ > 1. The computing cost is proportional to the number of boxes that are tested, i.e. \sum_{i=1}^{n_P} 1/ε(P_i), where n_P is the number of partitions. Let us notice that the partition may be non-regular. For example, for a slowly ageing parameter, one may choose small boxes for the values of θ that are close to θ_{bol} and larger ones for the values close to θ_{eol}. The result is guaranteed even if the partition has not been properly chosen or if the parameter has evolved in a non-expected way, although the computation cost may be higher.

The convex union provides a poor result if the set of admissible values is made of several mutually disjoint connected sets, as shown in Fig. 5. The algorithm may test some boxes that have already been rejected by the tests of the previous partition. This drawback could be addressed by defining the solution as a list of boxes whose labels (unfeasible, feasible, or undetermined) are inherited by the next partition boxes.

Figure 5 – The returned solution is the convex hull of mutually disjoint connected intervals.

5 SM prognosis
The prognosis phase consists in calculating the number of cycles remaining before anomaly, which is also called the Remaining Useful Life or RUL. To optimally adapt this calculation to the system's life requires the knowledge of the health status of the system at the current time, which was the topic of Section 4.
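The partition-refine-iterate loop of Section 4.3 can be sketched as follows. This Python toy (all names hypothetical) replaces the guaranteed, simulation-based consistency tests (4), (5), (7) by a synthetic overlap test against a known admissible set, keeps the convex union (hull) of the non-rejected boxes, and stops when the precision gain falls below a threshold:

```python
def consistent(lo, hi):
    """Stand-in for the simulation-based tests: here the 'true'
    admissible set is theta in [1.8, 2.3]."""
    return not (hi < 1.8 or lo > 2.3)

def refine(lo, hi, n_boxes):
    """One partition pass: hull of the boxes that are not rejected."""
    w = (hi - lo) / n_boxes
    kept = [(lo + i * w, lo + (i + 1) * w)
            for i in range(n_boxes)
            if consistent(lo + i * w, lo + (i + 1) * w)]
    return kept[0][0], kept[-1][1]   # convex union of kept boxes

lo, hi = 0.0, 15.0                   # Omega(theta) = [theta_bol, theta_eol]
for _ in range(4):                   # partitions P1, P2, ... (factor 1/10)
    new_lo, new_hi = refine(lo, hi, 10)
    if (hi - lo) - (new_hi - new_lo) < 1e-3:   # precision-gain threshold
        break
    lo, hi = new_lo, new_hi
print(lo, hi)   # a guaranteed, shrinking enclosure of [1.8, 2.3]
```

Because rejected boxes are provably inconsistent, the returned hull always encloses the admissible set, mirroring the guaranteed nature of the interval estimate.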
5.1 Component degradation
The global model (Σ + ∆) assumes that the parameters of the behavior model Σ given by (1) evolve in time, and that their evolution is represented by the degradation model ∆ given by the dynamic equation (2) that is recalled below:

    ∆ : θ̇(t) = g(t, θ(t), w, x(t)).

∆ provides the dynamics of the parameter vector as a function of the state of the system x(t) and of a degradation parameter vector w that allows one to tune the degradation for each of the considered parameters.

The global model (Σ + ∆), in the form of a dynamic model with varying parameters, cannot be directly integrated by VNODE-LP. An original method, coupling the two models Σ and ∆ iteratively, is proposed in the following. The method is illustrated by Fig. 6 and used to determine the degradation suffered by each parameter during one unit cycle as defined in Section 3.2.

Let us denote u_C(t), t ∈ [τ, τ + d_C], the system input stress during one unit cycle C. As shown in Fig. 6, the following steps are iteratively executed, every iteration corresponding to a computation step given by the sampling period δ:
1. The normal behavior model Σ is used first with input u(t) = u_C(τ) to compute the state x(τ) and the output y(τ);
2. The parameters are updated with the degradation model ∆ using the value of the state determined previously, i.e. θ(τ) is computed;
3. The parameters of the behavior model Σ are updated with θ(τ);
4. The next stress input value u_C(τ + δ) is considered, and so on until the end of the cycle, i.e. until the last value of the cycle u_C(τ + d_C) is reached.

Figure 6 – Computation of the degradation parameters during one unit cycle.

The above algorithm defines the function:

    D : IR^{n_θ} → IR^{n_θ}    (11)

where n_θ is the number of parameters of the system. Let us consider the cycle i; then D maps θ^i into D(θ^i) = θ^{i+1}, which is the value of θ after one unit cycle.

D is nonlinear. Thus the value of the parameter vector after one cycle, θ^{i+1}, depends on the initial value θ^i. Indeed, we know that a system generally degrades in a nonlinear fashion. We must hence compute θ^{i+1} for all possible values of the parameter vector θ^i.

For this purpose, the domain value Ω(θ_k) of each parameter θ_k is partitioned into N_k intervals. N_k is chosen sufficiently large to reduce the conservatism of the interval function D. The domain value of the parameter vector θ is hence partitioned into N_Π = \prod_{k=1}^{n_θ} N_k possible boxes that must be fed as input to D. Let us for instance consider a two-parameter vector and its beginning-of-life and end-of-life values as follows:

    θ = (θ_1, θ_2)^T, θ_{bol} = (1, 1)^T, θ_{eol} = (4, 9)^T, and N = (3, 2)^T;    (12)

then, if we select the partition landmarks as {2, 3} for θ_1 and {5} for θ_2,² D must be run for the following 6 box values:

    [θ]^1 = ([1,2], [1,5])^T, [θ]^2 = ([2,3], [1,5])^T, [θ]^3 = ([3,4], [1,5])^T,
    [θ]^4 = ([1,2], [5,9])^T, [θ]^5 = ([2,3], [5,9])^T, [θ]^6 = ([3,4], [5,9])^T.    (13)

For each of these box values taken as input for cycle i, i.e. θ^i = [θ]^l, l = 1, ..., 6, D returns the (box) value θ^{i+1} after one unit cycle. This computation is then projected on each dimension to obtain a set of n_θ tables, D_{θ_k}, k = 1, ..., n_θ, that provide the degradation of each individual parameter θ_k after one unit cycle.

5.2 RUL determination
The RUL, understood as a RUL for the whole system, can now be determined by computing the number of cycles that are necessary for the parameters to reach the threshold defining the end-of-life (cf. Fig. 7).

Figure 7 – RUL computation

For the cycle i = 0, θ^0 is initialized with \hat{θ}, which is the result of the parameter estimation computed by the diagnosis engine. \hat{θ} is given as input to D, which returns D(\hat{θ}) = θ^1. i is incremented by 1 and θ^1 is given as input to D, and so on until the set-membership test θ^i ⪰ θ_{eol} is achieved, which provides the stopping condition. This test may take several forms, as explained in Section 5.3. If the test is true, then the index i is the number of cycles required to reach the degradation threshold, so RUL = i.

For a given cycle i, the box value θ^i that must be given as input to D is not necessarily among the values [θ]^l, l = 1, ..., N_Π, of the partition. We propose to compute θ^{i+1} by assuming that the mapping between θ^i and θ^{i+1} is linear in every domain l of the partition. Considering p ∈ R^{n_θ}, D(p) is approximated as follows:

    ∀θ ∈ [θ]^l, D(θ) ≈ a ⊙ θ + b, l = 1, ..., N_Π    (14)

where a = w(D([θ]^l)) ./ w([θ]^l), b = D([θ]^l) − a ⊙ [θ]^l, and ⊙ is the product of two vectors term by term.

Equation (14) is applied to \underline{θ}^i and \overline{θ}^i to obtain an approximation of D(θ^i).

² Notice that the intervals issued from the partitioning are not required to be of equal length.
Proceedings of the 26th International Workshop on Principles of Diagnosis
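The cycle iteration and the piecewise-linear approximation (14) can be sketched for a single scalar parameter. This is a minimal illustration, not the authors' code: the function and variable names are ours, and the partition cells are assumed precomputed together with their images D([θ]^l).

```python
# Sketch of the RUL cycle iteration for one scalar parameter (n_theta = 1).
# An interval is a (lo, hi) pair; `partition` lists ((cell_lo, cell_hi),
# (img_lo, img_hi)) pairs, i.e. each cell [theta]_l with its image D([theta]_l).

def approx_D(box, partition):
    """Piecewise-linear approximation of D (Eq. 14), applied bound-wise."""
    def f(x):
        for (cl, ch), (dl, dh) in partition:
            if cl <= x <= ch:
                a = (dh - dl) / (ch - cl)   # a = w(D([theta]_l)) ./ w([theta]_l)
                b = dl - a * cl             # b = D([theta]_l) - a .* [theta]_l
                return a * x + b
        raise ValueError("value outside the partitioned domain")
    lo, hi = box
    return (f(lo), f(hi))

def worst_case_rul(theta_hat, partition, theta_eol):
    """Iterate D from the estimated box until the lower bound crosses the
    end-of-life threshold (worst-case test, decreasing parameter)."""
    box, i = theta_hat, 0
    while box[0] > theta_eol:
        box = approx_D(box, partition)
        i += 1
    return i, box
```

With the degradation table of the case study (Eq. (25) below) as `partition`, a single step from the estimated box ĉ = [4.548, 5.526] reproduces the reported c^1 ≈ [4.4787, 5.4481].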
5.3 Set-membership test for the RUL

The set-membership test implemented with the order relation ≽ may take several forms. For instance, if the test θ^i ≽ θ_eol is interpreted as:

    ∃k ∈ {1, ..., nθ} such that the upper bound of θ^i_k is ≥ θ_k,eol if α_k > 1,
    or the lower bound of θ^i_k is ≤ θ_k,eol if α_k < 1,   (15)

then it means that one bound of the interval value of at least one parameter θ_k is above or below its end-of-life threshold value θ_k,eol. The RUL is then qualified as the "worst-case RUL", which means that the RUL indicates the earliest cycle at which the system may fail.

One can also test whether the lower bound of one of the parameters is higher than its end-of-life threshold, that is to say:

    ∃k ∈ {1, ..., nθ} such that the lower bound of θ^i_k is ≥ θ_k,eol if α_k > 1,
    or the upper bound of θ^i_k is ≤ θ_k,eol if α_k < 1.   (16)

The RUL then represents the cycle at which it is certain that the system will fail.

It is obviously possible to combine these different tests applied to the different individual parameters depending on their criticality.

6 Case study

6.1 Presentation

The case study is a shock absorber that consists of a moving mass connected to a fixed point via a spring and a damper, as illustrated by Fig. 8. The movement of the mass takes place in the horizontal plane in order to eliminate the forces due to gravity. Aerodynamic friction forces are neglected.

Figure 8 – Spring and damper system

Newton's second law is written as:

    m a⃗ = Σ F⃗ = F⃗_k + F⃗_c + u⃗   (17)

where m is the mass, a⃗ is the acceleration, F⃗_k is the spring biasing force, F⃗_c is the friction force exerted by the damper, and u⃗ is the force applied on the mass. Expressing the forces and the acceleration as a function of the position of the mass x(t), we get:

    ẍ(t) + (c/m) ẋ(t) + (k/m) x(t) = u(t)/m   (18)

where k is the spring stiffness constant (N/m), m is the mobile mass (kg), and c is the damping coefficient (Ns/m). (18) is a second-order ODE. Let us rewrite

    c/m = 2 ζ ω0  and  k/m = ω0²

and we get

    ω0 = √(k/m)  and  ζ = c / (2 √(km)).

The impulse response of such a system depends on the value of ζ:
• if ζ = 0, then the response is a sinusoid;
• if 0 < ζ < 1, then the response is a damped sinusoid;
• if ζ ≥ 1, then the response is a decreasing exponential.

The state model is given by the equation:

    Ẋ(t) = [   0       1   ] X(t) + [  0  ] U(t)
           [ −k/m   −c/m ]          [ 1/m ]
                                                 (19)
    Y(t) = [ 1  0 ] X(t)
           [ 0  1 ]

with X(t) = [x(t), ẋ(t)]^T, and the transfer function is:

    X(p)/U(p) = (1/m) / (p² + (c/m) p + (k/m)).   (20)

An example of bounded-error step response obtained with VNODE-LP with a sampling parameter δ = 0.1 s, c = 1, m = 2 and k = [3.9, 4.1] is shown in Fig. 9a. There, ζ ≈ 0.177 and the step response is a damped sinusoid. Because k is assumed to have an uncertain value bounded by an interval, the outputs are in the form of envelopes.

Figure 9 – Case study simulation and data plots: (a) step response for ζ = 0.177; (b) unit cycle for the case study; (c) measured input and output.

6.2 Unit cycle

In the case study, a unit cycle is defined by the application of a unit of force for a determined time. The force is applied at time t0 + 5 s, where t0 is the cycle starting time. The force lasts 20 s and cancels at t0 + 25 s, as shown by the red curve of Fig. 9b. The cycle ends at t0 + 50 s.

Fig. 9b presents the system's response for a spring constant k = [3.9, 4.1] N/m, a mass m = 2 kg, a damping coefficient c = 10 Ns/m, and initial speed and position equal to zero. The response is a decreasing exponential.

6.3 Degradation model

The degradation model chosen is the ageing of the damper cylinder. It is represented by a reduction of the damping coefficient proportional to the velocity of the mass [15]:

    ċ = β ẋ,  β < 0.   (21)

The more the damper is used, the weaker it becomes, which is characterized by the change in the damping coefficient.

6.4 Diagnosis

The FRP parameter estimation method presented in Section 4 has been used with the measures shown in Fig. 9c. These measures were obtained for

    θ = [c; k; m] = [5; 4; 2].   (22)

The goal is to estimate the damping coefficient c and the stiffness constant k. The search space is defined by the interval [4, 9] for c and [3.5, 9] for k. The value of m is assumed to be known, m = 2. Using the notation introduced above, we have:

    θ_bol = [4; 3.5; 2],  θ_eol = [9; 9; 2].   (23)

The partition P1 is achieved with a precision ε(P1) = 1/10 for the two parameters to be estimated, c and k. Fig. 10 presents two examples of prediction results with two parameter boxes of P1: [θ]^i = ([4.1, 4.2], [4.7, 4.8], 2)^T on the left and [θ]^j = ([5, 5.1], [4, 4.1], 2)^T on the right. On the left figure, one can see that there is no intersection between the estimate and the measurement for the position, hence the box used for the simulation is rejected. On the right, there is an intersection between the measurement and the estimation for all time points, but the estimate is not included in the measurement envelope, hence the parameter box is considered undetermined.

Figure 10 – Estimation results with a rejected parameter box (left) and an undetermined box (right)

The results for partition P1 are presented in Fig. 11a and we obtain a first estimation for θ:

    θ̂ = ([4.1, 5.8], [3.7, 4.2], 2)^T.

The estimation precision for partition P1 is given by:

    ω(P1) = mid(θ̂) ./ (mid(θ̂) + w(θ̂)/2) = [0.85; 0.94; 1].   (24)

The first estimation for θ is used as the search space for partition P2, whose precision is increased by a factor of 10, i.e. ε(P2) = 0.1. The obtained estimation results are shown in Fig. 11b. The estimation is refined as:

    θ̂ = ([4.51, 5.57], [3.85, 4.14], 2)^T.

The precision is now ω(P2) = [0.9, 0.96, 1]^T, and the precision gain is G(P2/P1) = [0.056, 0.025, 0]^T. The values for the gain indicate that partitioning a third time might be quite inefficient. To confirm this fact, let us perform a third partition P3, whose precision is increased by a factor of 5, i.e. ε(P3) = 0.02 (cf. Fig. 11c). The new estimation for θ is θ̂ = ([4.548, 5.526], [3.872, 4.132], 2)^T, and the precision gain is G(P3/P2) = [0.0073, 0.0036, 0]^T. As expected, the gain is quite negligible with respect to the increase in computation time.

6.5 RUL computation

In this section we apply the set-membership method described in Section 5.2 to compute the RUL for the damping coefficient c.

The damper is assumed to fail when c ≤ c_eol = 2. The degradation model (21) with β = −0.13³ allows us to determine the degradation table Dc for the parameter c for a unit cycle:

    c^i        D(c^i) = c^(i+1)
    [9, 10]    [8.917, 9.977]
    [8, 9]     [7.911, 8.978]
    [7, 8]     [6.898, 7.979]
    [6, 7]     [5.814, 6.982]
    [5, 6]     [4.859, 5.979]
    [4, 5]     [3.874, 4.977]
    [3, 4]     [2.863, 3.973]
    [2, 3]     [1.721, 2.97]
    [1, 2]     [0, 1.98]
    [0, 1]     [0, 0.9755]
                                (25)

After proceeding to the linear interpolation given by (14), the graphical representation of c^(i+1) as a function of c^i is given by Fig. 12.

Figure 12 – Approximated degradation of the damping coefficient c

The number of elements of the partition has been chosen relatively small to better illustrate the method. In a real situation, this number should be high in order to obtain less conservative predictions.

The value of c has been previously estimated and is ĉ = [4.548, 5.526]. The graph of Fig. 12 allows us to approximate the predicted value after one unit cycle: D(ĉ) = c^1 = [4.4787, 5.4481]. The next iteration of the algorithm allows us to compute c^2, etc. After 30 iterations, we obtain c^30 = [1.7665, 3.4235].

³ The coefficient β has been chosen arbitrarily to illustrate the approach; it does not represent the real ageing of a damper.
Figure 11 – Partitions and estimation results for (a) partition P1, (b) partition P2, and (c) partition P3 (red, yellow and green boxes are, respectively, rejected, undetermined, and accepted parameter values).

Since the lower bound of c^30 is below c_eol, we get RUL = 30 cycles. After the 44th iteration, we get c^44 = [0.037591, 1.985928]. We then have c^44 < c_eol (the whole interval lies below the threshold) and hence RUL = 44 cycles. The RUL of the damper is hence given by:

    RUL = [30, 44] cycles.

7 Conclusion

This paper addresses the condition-based monitoring and prognostic problems with a new focus that trades the traditional statistical approach for an error-bounded approach. It proposes a two-stage method whose principle is to first determine the health status of the system and then use this result to compute the RUL of the system. This study uses advanced interval analysis tools to obtain guaranteed results in the form of interval bounds for the RUL.

The results for the case study demonstrate the feasibility of the approach. The next step is to adapt the FRP-based SM parameter estimation algorithm in order to output a list of boxes instead of the single box given by the convex hull of the boxes. The convex hull is indeed a very conservative approximation when the solution set is not convex.

The second stream of work is to consider contextual conditions and their associated uncertainties. Environmental conditions, like weather, different usage, etc., may indeed significantly affect the stress input and the prognostics results.

References

[1] Indranil Roychoudhury and Matthew Daigle. An integrated model-based diagnostic and prognostic framework. In Proceedings of the 22nd International Workshop on Principles of Diagnosis (DX'11), Murnau, Germany, 2011.
[2] Matthew J. Daigle and Kai Goebel. A model-based prognostics approach applied to pneumatic valves. International Journal of Prognostics and Health Management, 2:84, 2011.
[3] J. Luo, K. R. Pattipati, L. Qiao, and S. Chigusa. Model-based prognostic techniques applied to a suspension system. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 38(5):1156–1168, 2008.
[4] Q. Gaudel, E. Chanthery, and P. Ribot. Hybrid particle Petri nets for systems health monitoring under uncertainty. International Journal of Prognostics and Health Management, 6(022), 2015.
[5] D. Gucik-Derigny, R. Outbib, and M. Ouladsine. Estimation of damage behaviour for model-based prognostic. In Fault Detection, Supervision and Safety of Technical Processes, pages 1444–1449, 2009.
[6] X. Guan, Y. Liu, R. Jha, A. Saxena, J. Celaya, and K. Goebel. Comparison of two probabilistic fatigue damage assessment approaches using prognostic performance metrics. International Journal of Prognostics and Health Management, 1(005), 2011.
[7] R. E. Moore. Automatic error analysis in digital computation. Technical Report LMSD-48421, Lockheed Missiles and Space Co., Palo Alto, CA, 1959.
[8] R. E. Moore. Interval Analysis. Prentice-Hall, Englewood Cliffs, 1966.
[9] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control and Robotics. Springer, London, 2001.
[10] L. Jaulin and E. Walter. Set inversion via interval analysis for nonlinear bounded-error estimation. Automatica, 29:1053–1064, 1993.
[11] L. Jaulin, M. Kieffer, O. Didrit, and E. Walter. Applied Interval Analysis, with Examples in Parameter and State Estimation, Robust Control and Robotics. Springer, London, 2001.
[12] C. Jauberthie, N. Verdière, and L. Travé-Massuyès. Fault detection and identification relying on set-membership identifiability. Annual Reviews in Control, 37:129–136, 2013.
[13] N. Nedialkov. VNODE-LP, a validated solver for initial value problems for ordinary differential equations.
[14] R. J. Lohner. Enclosing the solutions of ordinary initial and boundary value problems. In E. W. Kaucher, U. W. Kulisch, and C. Ullrich, editors, Computer Arithmetic: Scientific Computation and Programming Languages, pages 255–286. Wiley-Teubner, Stuttgart, 1987.
[15] Ian M. Hutchings. Tribology: Friction and Wear of Engineering Materials. Butterworth-Heinemann, 1992.
Configuration as Diagnosis: Generating Configurations with Conflict-Directed A*
– An Application to Training Plan Generation –

Florian Grigoleit, Peter Struss
Technische Universität München, Germany
email: {struss, grigolei}@in.tum.de

Abstract

Although many approaches to knowledge-based configuration have been developed, the generation of optimal configurations is still an open issue. This paper describes work that addresses this problem in a general way by exploiting an analogy between configuration and diagnosis. Based on a problem representation consisting of a set of ranked goals and a catalog of components, which can contribute in combination to their satisfaction, configuration is formulated as a finite constraint-satisfaction problem. Configuration is then solved by state-search, in which a problem solver selects components to be included in an appropriate configuration. A variant of Conflict-Directed A* has been implemented to generate optimal configurations. To demonstrate its feasibility, the concept was applied, among other domains, to personalized automatic training plan generation for fitness studios.

1 Introduction

Besides diagnosis, the task of configuration has been one of the earliest application areas of work on knowledge-based systems, initially in the form of rule-based "expert systems", for instance in [1]. Today, systems for automated configuration have reached maturity for practical applications, as shown in [2], [3], and [4]. Despite this success, developing algorithms for computing optimal or optimized configurations with general applicability still deserves more research effort.

Driven by a number of different configuration tasks, we developed GECKO (Generic constraint-based Konfiguration), a generic solution to the configuration problem that can be specialized to different application domains and that, among other objectives, aims at supporting the generation of optimal configurations.

In a nutshell, the solution exploits an analogy:
• The configuration task can be seen as searching for an assignment of active or non-active to the components in a given repository, representing whether or not a component is included in the configuration, such that it achieves some goals in an optimal way.
• Diagnosis has been formalized as a search for an assignment of behavior modes (normal or fault_x) to a set of system components such that it is optimally compliant with a set of observations.

Based on this analogy, we exploit a search technique that has been developed for consistency-based diagnosis, see [5], and, as a generalization for optimal constraint satisfaction, called conflict-directed A*, see [6].

In the following section, we discuss related work on configuration systems. In section 3, we present some examples of configuration problems that we tackled using GECKO and that will serve for illustration purposes. Next, we introduce our formalization of the configuration task and the key concepts of GECKO. In section 5, we discuss the analogy between diagnosis and configuration, the application of CDA*, variants of utility functions, and how they relate to different types of configuration applications. The results are shown in section 6. Finally, our current work and some of the open issues are discussed.

2 Knowledge-based Configuration

Applications of configuration are immensely diverse, but they all share a number of common problems, such as compliance with domain knowledge, the size of the solution space, and the resulting complexity of the problem-solving task. This requires knowledge-based approaches to support the problem-solving activities, such as product configuration or variability management, see [3] and [4].

Current research on configuration, especially for large applications, tends to neglect global optimization, focusing on local optimization, user interaction, or aiming at producing "good" solutions, see [3] and [7].

The focus of this paper is a generic, constraint-based configurator (GECKO) for solving optimal configuration problems. The core of GECKO is a variant of Brian Williams' Conflict-Directed A* (CDA*, [6]). The solution works on a generic representation of configuration knowledge and tasks. We consider the task of generating configurations as similar to consistency-based diagnosis. Instead of assigning modes for fault identification as in [5], GECKO assigns an activity to components contributing to goals. A configuration is consistent if all task-relevant goals are satisfied. The quality of a configuration is given by the level of goal satisfaction and the amount of resource consumption. Our approach allows the arbitrary selection of optimization criteria, like minimal resource consumption or maximal goal contribution. In the presented case study, our aim was to maximize the number of satisfied goals under consideration of the available resources.
3 Application Examples

Configuration problems are almost ubiquitous in modern life, with applications as different as creating a customized computer, as done by R1 in [1], and adapting the system functions of a car, see [8]. To illustrate the versatility of GECKO, we present three applications.

3.1 Car Configuration

Today, car manufacturers offer a vast number of models, model variants, and equipment options to their customers. The resulting complexity does not only prohibit a comprehensive exploration of the solution space, but is also likely to provide customers with sub-optimal car variants. A domain model for car configuration was created and mapped to the GECKO concepts, which are presented in 4.2 ([9]).

3.2 User Interface Configuration

The Beam Instrumentation group at CERN is responsible for the design and implementation of particle beam measurement systems. These systems are specifically built for each case, resulting in extensive work on constructing them. While the generation of the GUIs, that is the implementation, is automated, the configuration is not. This task currently requires an expert to select libraries, graphical elements, and data sources and to parameterize them. Such tasks are typical configuration tasks and thus enable the automation of the configuration of the GUIs by GECKO ([10]).

3.3 Training Planning in Sport Science

At first glance, training planning may appear to be a typical scheduling task instead of a configuration problem. A closer look shows that it mainly involves activities we consider the core of configuration: selecting, parameterizing, and arranging components to satisfy goals, whereas assigning time slots to the selected exercises is, in general, fairly straightforward.

A trainer has to analyze the biometric state of his trainee, such as fitness or age, to consider constraints on the created training plan, for example duration or available equipment, and to select and order appropriate exercises.

The sheer number of existing exercises and the size of the solution space show that training planning includes optimization. In general, a trainer tries to maximize the training effect within the available time and under consideration of the trainee's goals and abilities. The specialization of GECKO to training planning is described in section 5.

4 GECKO – Foundations

4.1 Intuition

With GECKO, we aim at developing a generic solution to configuration problems, which can be tailored towards a particular domain by specializing some basic classes and creating a knowledge base in terms of domain-specific constraints. Its design is driven by the following objectives:
• supporting both automatic and interactive configuration;
• enabling the use of the system without deep domain knowledge, esp. about how high-level goals of the user break down to more detailed and technical ones;
• handling also soft domain constraints and user preferences, and
• offering support to the user by providing explanations for generated parts of the configuration and for unavailable options and by suggesting revisions to resolve inconsistencies.

However, this paper focuses on the basis, a generic problem solver for (optimal) configuration. Determining the solution – the configuration – means selecting a set of instances of given types of elements – components – perhaps with certain attribute values and organized in a particular structure. The configuration has to
1. satisfy a set of high-level user goals,
2. be compliant with particular attributes and restrictions supplied by the user,
3. be realizable in principle (i.e. not violate domain-specific restrictions on valid configurations),
4. be realizable under consideration of the available resources, and
5. be optimal (or near optimal) according to a criterion that reflects the degree of fulfilling the goals and the amount of resources consumed.

Configurations can be physical devices, such as turbines, communication systems, and computers, abstract ones like a curriculum or a company structure, or a software system. In contrast to a design task involving the creation of new types of components, configuration assumes that all required components are instances of component types from a repository ([11]). This leads to different kinds of reasoning: innovative design has to verify that its result satisfies the goals by inferring that they are achieved by the system behavior, based on behavior models of the components, whereas for a configuration task, it is assumed that the behavioral implications of aggregated components have been compiled into explicit interdependencies of Goals and Components. As a result, software systems for configuration are typically based on knowledge encoded as constraints or rules, as in [1] and [2], and do not require the exploitation of behavior models.

4.2 Core Concepts

The core concepts of GECKO are derived from the description above, as depicted in Fig. 1:
• Goals express the achievements expected from a specific configuration. They may have an associated priority dependent on the task and different criteria for goal satisfaction.
• Components are the building blocks of the configuration. They may be organized in a type hierarchy (for example, a Lithium battery is a voltage source). In addition, there may be Components that are aggregations of lower-level components.
• A Task specifies the requirements on a configuration from the user's perspective. It is split into three kinds of restrictions:

    Task = TaskGoals ∪ TaskParameters ∪ TaskRestrictions.
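The core concepts above can be captured in a minimal data model; the following sketch uses field names that are our illustrative assumptions, not GECKO's actual schema:

```python
# Minimal data-model sketch of the GECKO core concepts described above
# (Goals, Components, and the three-part Task). All field names are
# illustrative assumptions, not GECKO's actual schema.
from dataclasses import dataclass

@dataclass
class Goal:
    name: str
    priority: int = 0        # goals may carry a task-dependent priority
    satisfied: bool = False

@dataclass
class Component:
    name: str
    type_name: str           # components may sit in a type hierarchy
    active: bool = False     # active = included in the configuration
    cost: float = 0.0        # resource attribute (money, time, memory, ...)

@dataclass
class Task:
    task_goals: list         # TaskGoals the user (de)activates or prioritizes
    task_parameters: dict    # TaskParameter_k = value_kj (e.g. target country)
    task_restrictions: list  # explicit restrictions on components/attributes

# Hypothetical car-configuration task:
task = Task(
    task_goals=[Goal("daytime_running_lights")],
    task_parameters={"target_country": "SE"},
    task_restrictions=["engine = Diesel"],
)
```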
TaskGoals are a collection of Goals the user is aware of ComponentConstraints establish interdependen-
and which can be (de)activated or prioritized by the user. cies among components (and their attributes): a
Each TaskGoaln is stated as a restriction Task- component may be dependent on or incompatible
Goaln.Satisfied=T in the Task description. with the presence of another component in the
While TaskGoals represent objectives a user requires, configuration
TaskParameters associate values to properties of the TaskParameterComponentConstraints may in-
Task, hence have the form TaskParameter k=valuekj. For clude or exclude certain components based on
instance, in vehicle configuration, the target country may TaskParameter values
have an influence on daytime running lights being manda-
tory. However, these implications are not drawn by the A fundamental constraint type is
user (who only provides the country information), but by Requires (x, y)
the domain knowledge represented in the system. which is defined by
In contrast, TaskRestrictions refer explicitly to the x.active=T y.active=T
choice of Components and their attributes, e.g. that for the and used to express dependencies among goals (e.g. refin-
user, a convertible is not an option or that the engine ing a goal to a set of mandatory sub-goals) and compo-
should be a Diesel engine. nents (e.g. cruise control requires automatic transmission)
A specific, and often essential, TaskRestriction can be in- and as the fundamental coupling between goals and com-
cluded: ponents (to achieve high-speed driving, an engine of a cer-
A ResourceConstraint limits the cost of the con- tain power is needed). Furthermore, in order to express that
figuration, which may be indeed money (car con- several goals or components provide some partial contribu-
figuration) or time (in training plan configura- tions that jointly result in the satisfaction of a goal (or es-
tion), but also computer memory etc. Components tablish the preconditions of a component), we introduce
have to have an attribute that allows calculating the concept of a choice, which can also fill the role of y in
the resources needed for the entire configuration a Requires-constraint. A choice is given by a relation
(often as the sum). GoalChoice Goals ContributionDom
or
4.3 Constraints on Configurations ComponentChoice Components ContributionDom,
The configuration knowledge of a particular application where ContributionDom specifies a set of values for quan-
comprises the domain-specific specialization and instantia- tifying how much a goal or component contributes to the
tion of Goals, Components (possibly including component satisfaction of the choice and needs a zero element and an
attributes and their domains), and relevant TaskParameters operator to add up contributions (e.g. addition of inte-
and their domains as well as constraints that capture inter- gers). The idea behind choices is implemented by three
dependencies among these instances. Dependent on which kinds of constraints. The degree of the satisfaction of a
kinds of objects are related, we distinguish between the (component) choice is given by the combined contribu-
following (illustrated in Fig. 1): tions of the active components of the choice:
Choice.satLevel = Choice.goal.actContribution
and
Choice.goal.actContribution =
Choice.goal.contribution IF goal.active=T
zero IF goal.sctive=F .
The choice is satisfied, if the satLevel lies in a specified
range, satThreshold:
Choice.active = T
Choice.satLevel Choice.satThreshold .
This allows implementing not only a minimum level as
a precondition for the satisfaction of a choice, but also a
maximum. Preventing “over-satisfaction” may not be a
common requirement, but in the fitness domain, one may
want to restrict the set of exercises that impose a load on a
Fig. 1 Task constraints in GECKO
particular muscle group.
Another predefined general type of constraint is
TaskParameterGoalConstraints express that
Excludes (x, y)
certain TaskParameter values may exclude or re-
defined by
quire certain goals x.active=T y.active=F
GoalConstraints relate goals to each other, in
to express conflicting goals, incompatible components, and
particular for refinement of higher-level (esp.
TaskParameterGoal/ComponentConstraints (e.g. high
TaskGoals) to lower-level ones, such as goals re- body weight may rule out certain exercises).
lated to various muscle groups that should be ex-
The application-specific configuration knowledge is,
ercized, although the user is not aware of this
thus, basically encoded as a set of the constraints explained
GoalComponentConstraints capture essential above. This, together with the domain-specific ontology
configuration knowledge, namely whether and
(as a specialization of the basic GECKO concepts, includ-
how the available components contribute to the
ing choices, and associated attributes) and, perhaps, specif-
achievements of goals ic contribution domains and operators, establishes the con-
93
Proceedings of the 26th International Workshop on Principles of Diagnosis
figuration knowledge base, called ConfigKB in the fol- Criteria 5, optimality, will be discussed in the following
lowing. section.
We make some reasonable fundamental assumptions about
ConfigKB: 5 Generating (Near) Optimal Configura-
Each potential TaskGoal is supported: it is the
starting node of a connected hyper graph of Re- tions
quires constraints that includes components, i.e. it
actually needs a (partial) configuration in order to
5.1 Configuration as Diagnosis
be satisfied (which does not mean it can actually The current version of GECKO is based on the assumption
be satisfied). that there exists a finite set of components, COMPS, as a
Closure assumption: the encoded interdepend- repository for all configurations. This means, no new in-
encies, esp. the Requires constraints, are com- stances of components types are created during configura-
plete. In other words, if all constraints Requires tion and, more specifically, a component will not be dupli-
(x, y) associated with x are satisfied by a configu- cated if it is included in the configuration due to several
ration, then x is satisfied. constraints. In this case, determining ACTCOMPS of a
It is consistent. complete configuration can be seen as an activity assign-
ment
4.4 Definition of the Configuration Task AA: COMPS {active, inactive} ,
The goal is to select an appropriate subset of the available indicating the inclusion in or exclusion from the configura-
components, which we call the active ones, and possibly tion, and the consistency test of Definition 2 becomes
AA ConfigKB Task ⊭ .
determine or restrict their attributes.
This representation shows the analogy to the consistency-
Definition 1 (Complete Configuration) based formalization of component-oriented diagnosis: an
A configuration assignment MA of modes (i.e. nominal or faulty behavior)
PARCONFIG = (ACTCOMPS, COMPATTR) to a set of components,
is complete if includes exactly the active components: MA {OK, fault1, fault2, …}
characterizes a diagnosis, if it is consistent with the do-
comp ACTCOMPS comp.Active = T. main knowledge (a library of behavior models and a struc-
GECKO has to generate a configuration PARCONFIG that tural description), called system description, SD, and a set
satisfies the criteria stated in section 4.1. of observations, OBS:
MA SD OBS ⊭ .
Definition 2 (Solution to a Configuration Task)
A configuration task is a pair (ConfigKB, Task), consisting of the configuration knowledge base ConfigKB and a set of constraints Task representing a specific problem instance. A complete configuration PARCONFIG is a solution to it if it is consistent with the ConfigKB and the Task:
PARCONFIG ∪ ConfigKB ∪ Task ⊭ ⊥.
This may seem too weak, because criterion 1 in section 4.1 requires the entailment of the satisfaction of the TaskGoals in Task.
Proposition 1
If PARCONFIG is a solution to a configuration task (ConfigKB, Task), then
PARCONFIG ∪ ConfigKB ⊨ ∀ goal ∈ TaskGoals: goal.Satisfied = T.
This follows from the closure assumption: since for the chosen TaskGoals, Satisfied=T is explicitly introduced in Task, it follows that all Requires constraints related to them are satisfied and, hence, they are not only consistent, but entailed. As for the other criteria of section 4.1:
2. Compliance with specific application requirements is guaranteed by consistency with the TaskParameters under the TaskParameterGoal/ComponentConstraints in ConfigKB and with the TaskRestrictions in Task.
3. Realizability is established by consistency with the ComponentConstraints.
4. The ResourceConstraint is also consistent.
In both cases, the assignments to the components (the activity assignment AA and the mode assignment MA, respectively) are checked for consistency with a fixed set of constraints representing the domain knowledge (as specified in sections 4.3 and 4.2, respectively): ConfigKB plays the role of the system description SD, and Task that of the observations OBS. In consistency-based diagnosis, theories and algorithms have been developed to determine diagnostic solutions, which can be exploited for the configuration task based on the analogy outlined above.

5.2 Conflict-directed A*
Based on the above formalization, many implementations of consistency-based diagnosis exploit a best-first search for consistent mode assignments, using probabilities of the individual behavior modes as a utility function (and usually making the assumption that faults occur independently), as SHERLOCK does [12]. Classical A* search has been extended and improved by pruning the search space based on inconsistent partial mode assignments that have been detected previously during the search (called conflicts), exploiting a truth-maintenance system (TMS, such as the assumption-based TMS [13]) as a dependency-recording mechanism that delivers conflicts. Starting from the diagnostic solutions, this approach was later generalized as conflict-directed A* search, see [6].
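The consistency test of Definition 2 is easy to state concretely for a ConfigKB reduced to threshold choices. The following sketch (our own minimal encoding, not the paper's implementation) checks whether a complete activity assignment is a solution:

```python
# Sketch: checking whether a complete activity assignment is a solution
# in the sense of Definition 2, for a ConfigKB reduced to threshold
# "choice" constraints. All names (choice_consistent, is_solution, ...)
# are ours, not from the paper's system.

def choice_consistent(choice, assignment):
    """A choice (components, lo, hi) holds if the number of active
    components lies within [lo, hi]."""
    components, lo, hi = choice
    n_active = sum(1 for c in components if assignment[c])
    return lo <= n_active <= hi

def is_solution(config_kb, assignment):
    """A complete configuration is a solution if it is consistent with
    every constraint of ConfigKB (and, analogously, of Task)."""
    return all(choice_consistent(ch, assignment) for ch in config_kb)

# Goal G1: at least 2 of {C1, C2, C3} active (satisfactionThreshold (2, 3)).
kb = [(("C1", "C2", "C3"), 2, 3)]
print(is_solution(kb, {"C1": True, "C2": True, "C3": False}))   # True
print(is_solution(kb, {"C1": True, "C2": False, "C3": False}))  # False
```

Proposition 1 is visible here as well: any complete assignment accepted by is_solution necessarily activates enough components to meet each choice's lower threshold.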
Proceedings of the 26th International Workshop on Principles of Diagnosis
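The conflict-directed search scheme can be illustrated with a short runnable sketch. It simplifies the paper's setting in two ways: candidates are complete T/F assignments enumerated best-first by a cost function, and conflicts are plain sets of variable/value pairs rather than ATMS environments; all identifiers are ours:

```python
# Runnable sketch of a conflict-directed search in the spirit of CDA*.
# Simplifications: candidates are complete T/F assignments over the
# components' activity variables, and conflicts are recorded as sets of
# variable/value pairs that cannot all hold. The paper's implementation
# relies on an ATMS to deliver conflicts instead.
from itertools import product

def violated(assignment, choices):
    """Return a conflict for the first violated choice, or None.
    A choice (comps, lo, hi) requires between lo and hi active comps."""
    for comps, lo, hi in choices:
        active = [c for c in comps if assignment[c]]
        if len(active) < lo:   # too few active: the inactive ones clash
            return frozenset((c, False) for c in comps if not assignment[c])
        if len(active) > hi:   # too many active: the active ones clash
            return frozenset((c, True) for c in active)
    return None

def resolves(assignment, conflict):
    # a candidate resolves a conflict if it flips at least one assignment in it
    return any(assignment[c] != v for c, v in conflict)

def cda_star(variables, choices, cost):
    """Best-first enumeration of complete assignments; candidates that do
    not resolve every recorded conflict are pruned without a consistency
    test (the conflict-directed part)."""
    conflicts = []
    candidates = sorted(
        (dict(zip(variables, bits))
         for bits in product([False, True], repeat=len(variables))),
        key=cost)
    for va in candidates:
        if not all(resolves(va, k) for k in conflicts):
            continue                      # pruned by recorded conflicts
        k = violated(va, choices)
        if k is None:
            return va                     # best consistent configuration
        conflicts.append(k)
    return None

# Goal G1 from the example below: at least 2 of {C1, C2, C3} active.
choices = [(("C1", "C2", "C3"), 2, 3)]
best = cda_star(("C1", "C2", "C3"), choices, cost=lambda va: sum(va.values()))
print(best)  # {'C1': False, 'C2': True, 'C3': True} -- a cheapest pair
```

The enumeration is exponential here; the point of the sketch is only the pruning step, in which recorded conflicts reject candidates before any consistency test is run.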
Procedure CDASTAR
1) Terminate = F
2) Solutions = ∅
3) Conflicts = ∅
4) VA = VAinitial
5) DO WHILE Terminate = F
6)   Apply Constraints(VA)
7)   Check consistency of VA
8)   IF consistent
9)   THEN append VA to Solutions
10)       Terminate = Solutions.Terminate
11)  ELSE
12)       Conflicts = APPEND(Conflicts, newConflicts)
13)  END IF
14)  VA = Conflicts.BestCandidateResolvingConflicts
15) END DO WHILE
16) RETURN Solutions

The effectiveness of the pruning of the search space based on previously detected inconsistencies (lines 12 and 14 of the above pseudo code) grows with the number of (non-redundant) conflicts that are extracted. Achieving this, however, can be computationally expensive and may have to be traded off against the computational cost of the consistency test and/or the optimality of the solution. We will get back to this issue below.
The straightforward mapping of the configuration problem to CDA* is obtained by representing configurations as variable assignments:
VARS = {Compi.active | Compi ∈ COMPS},
DOM(Compi.active) = {T, F}.
To illustrate how the algorithm works using a simple example, assume that goal G1 depends on a component choice that involves 3 components, Ci, each with a contribution of 1 in this choice, which has a satisfactionThreshold of (2, 3), i.e. it is satisfied if at least two of the components are active. Search starts with an empty configuration (active=F for all components), which leads to an inconsistency with the constraints related to the choice. Each pair of inactive components establishes a (minimal) conflict:
{C1.active=F, C2.active=F},
{C1.active=F, C3.active=F},
{C2.active=F, C3.active=F}.
Configurations resolving these conflicts are the ones with active components
{C1, C2}, {C1, C3}, or {C2, C3},
and the best one would be checked further. If this is done against another choice for a goal G2, which is based on components C3, C4, C5 (again all with contribution 1) and a threshold of (1, 3), then a new conflict
{C3.active=F, C4.active=F, C5.active=F}
is detected, and the configurations resolving all conflicts include the active components
{C1, C3}, {C2, C3}, {C1, C2, C4}, or {C1, C2, C5}.

5.3 Diagnosis vs. Configuration
Despite the mentioned basic commonality, there are some important distinctions at a conceptual level, but with a potentially strong impact on the computational complexity.

Partial vs. complete assignments
In diagnosis, it is possible to check partial mode assignments to detect useful conflicts. In configuration, we have to consider complete variable assignments, which, in assigning T or F to the activity variables of all components, correspond to complete configurations. The reason is that, as illustrated by the above trivial example, the constraints related to a choice deliver important conflicts based on components being not active. A partial configuration, e.g. assigning active=T to, say, C1 only, is consistent with the respective choice; that this configuration does not satisfy G1 is detected only if all other components are assumed to be inactive. (Of course, if the satThreshold has an upper limit, we obtain conflicts involving too large sets of active components, as well.) This observation is related to another difference:

(Non-)Locality of the domain theory
In diagnosis, the domain theory is as modular as the device: it consists of constraints that represent the local interaction of components and constraints that capture the local behavior of components under certain modes. Checking the consistency of a partial mode assignment requires applying the directly related constraints only. In contrast, constraints representing configuration knowledge are almost by definition non-local: they are meant to relate many components across the entire configuration, e.g. as choices. If choices play a major role and are large, this can be a source of severe problems.
The training plan generation application forms an extreme example: choices may involve in the order of 100 components, because many exercises may be related to a particular muscle group, while only a handful of them together satisfy the goal. In addition, exercises challenge several muscle groups. If the lower boundary of the satisfactionThreshold of a choice is k and the size of the choice is n, then (assuming a contribution of 1 for each component) the number of resulting minimal conflicts will be the binomial coefficient (n choose k−1) – prohibitively large in the training application. This has an impact on the algorithm, as discussed in section 5.5. First, we have to introduce appropriate utility functions to measure the quality of a configuration.

5.4 Utility Functions
The utility of a configuration should essentially reflect
• the degree of fulfillment of the relevant goals and
• the amount of resources required.
A measure of the former may also consider priorities of goals. The same holds for individual components. Since inactive components neither make contributions nor consume resources, it is plausible to assume that the utility of a configuration depends on its active components only.
In the following, it is assumed that
• the contribution of a configuration is obtained solely as a combination of the contributions of the active components included in the configuration and is otherwise independent of the type of properties of the components,
• we can define a subtraction "−" of contributions,
• the cost of the contribution is given as the sum of the costs of the involved active components and will usually be numerical,
• we can define a ratio "/" of contributions and resources,
• there is a function that maps the priority of goals to a weight of the contributions, and a kind of multiplication
*: DOM(weight) × DOM(contribution) → DOM(contribution)
is defined.
Then the following specifies a family of utility functions (where we simplify the notation by writing Goalj.SatThreshold instead of Choicej.SatThreshold etc.): For an active Goal Goalj, the TotalContribution of a Configuration is
Configuration.TotalContribution(Goalj) := ⊕_{Compi ∈ Configuration.ACTCOMPS} Compi.Contribution(Goalj),
where ⊕ denotes the Combine operation; the ActualContribution is given as
Configuration.ActualContribution(Goalj) := max(Configuration.TotalContribution(Goalj), Goalj.Combine(Goalj.SatThreshold, posTolerance)),
and a penalty (for over-satisfaction) as
Configuration.Penalty(Goalj) := max(0, Configuration.TotalContribution(Goalj) − Goalj.Combine(Goalj.SatThreshold, posTolerance)).
Based on this, we define the utility function as
Configuration.Utility(ACTGOALS) :=
Σ_{Goalj ∈ Configuration.ACTGOALS} (weight(Goalj.Priority) * Configuration.ActualUtility(Goalj) + f * Configuration.Penalty(Goalj))
/ Σ_{Compi ∈ Configuration.ACTCOMPS} Compi.Resource.
The factor f determines whether or not excessive contributions are penalized (by the excessive amount); the weight can emphasize contributions to Goals with high priority, and the tolerance interval can express how exactly the intended SatThreshold has to be hit.

5.5 GECKO Algorithm
For the GECKO variant of CDA*, we modified CDA* by activating only the constraints needed at a specific stage, thereby reducing the number of occurring conflicts significantly.
GECKO characterizes a stage in the problem-solving process, and hence the criteria for constraint activation, as a pair
S = (GOALS, configuration),
that is, a set of goals that are considered and a configuration to be checked for consistency. This allows for search strategies that do not consider all active goals from the beginning. Therefore, the constraints to be applied are determined not only by the variable assignment, but also by the goals. In our first application, goals are activated in descending order, according to their priority.
To determine the hitting sets of the conflicts, we use different algorithms from [14], depending on the domain. In BestCandidateResolvingConflicts, the next-best solution is generated.

Procedure GECKO Configuration Algorithm
1) ApplyConstraints(Constraints(Initial))
2) ActComps = ACTCOMPS0
3) Priority = max(actGoals.Priority)
4) DO WHILE Priority >= 1
5)   ApplyConstraints(Constraints(GoalPriorityClass(Priority)))
6)   VA = VA(ActComps, COMPS \ ActComps)
7)   NewActComps = GECKO.CDASTAR(VA).ActComps
8)   ActComps = NewActComps.Commit
9) END DO WHILE
10) RETURN ActComps

In line 8, the algorithm fixes the components added to satisfy the recently considered goals. This means that, when trying to satisfy further goals (with lower priority), they will not be de-activated. This heuristic aims at satisfying as many goals as possible with the given resources, in the order of their priority, but may obviously miss a globally optimal solution.

6 Case Study: Training Plan Generation
We are working on the realization of the three applications presented in section 3. To demonstrate the specialization of GECKO concepts and the capabilities of the GECKO algorithm, we selected the fitness training example. Of the three examples, fitness is best suited to illustrate the advantages of CDA* in configuration.

6.1 Domain Theory
In fitness, trainees perform exercises, like push-ups or running, to train body parts under certain aspects (endurance, muscle gain). To train means to improve physical abilities, like endurance, and to influence biometric parameters, such as weight. In configuration terms: exercises contribute to a set of fitness goals. Hence, we created the domain theory for training planning using the concepts specified in section 4.2. Table 1 contains an overview of the most important specializations. The result may appear straightforward to outsiders, but it is actually the outcome of several months of analyses carried out jointly with experts from sports sciences, which led us through several versions and revisions of the model.

Table 1: Specialization of GECKO Concepts
GECKO Concept    Fitness Concept    Example
Goal             TraineeGoal        Muscle Gain
                 TrainingGoal       Strength
                 TargetGoal         Biceps
Component        Exercise           Push-up
Task             Trainee            -
TaskRestriction  TrainingDuration
TaskParameter    TrainingProperty   Equipment
                 TraineeProperty    Fitnesslevel

Task
A GECKO Task in fitness is a trainee, or more precisely the request of a training plan by a trainee. A trainee has expectations regarding the result of the training, represented by TraineeGoals. The Trainee also has a set of TraineeProperties, like Fitnesslevel, and sets the TrainingProperties. Furthermore, a trainee has to specify the desired TrainingDuration.
Special among the TraineeProperties are the FitnessTargets and FitnessCategories. A FitnessTarget has to be
trained by an Exercise, such as legs. FitnessCategories are the main abilities of a Trainee, such as strength.

Goals
The domain theory contains three types of goals:
• TraineeGoal: the only TaskGoal in fitness, describing the expected effect of the fitness training
• TrainingGoal: abstract goals, specifying the type of physical ability to be improved, e.g. strength
• TargetGoal: the body part the training has to stimulate.
To capture the structure of the human body and the differences in fitness categories, we decompose the TargetGoals into three levels:
o RegionGoal
o MuscleGroupGoal
o MuscleGoal
Reflecting the FitnessTargets, TargetGoals are structured in a goal tree. Because FitnessTargets are trained at different levels, the tree is unbalanced. For example, Endurance is generally trained for the whole body, while Strength is trained at a muscular level.

Components
All components in fitness are exercises. Each exercise is related to a FitnessCategory, e.g., pushup is a StrengthExercise. Exercises can contribute to multiple TargetGoals, but only to TargetGoals of their own FitnessCategory. For example, a StrengthExercise can only contribute to TargetGoals related to strength.
Exercises comprise a set of fixed attributes, such as requiredEquipment or requiredFitnesslevel, as well as a set of unspecified attributes, like TrainingWeight or Durations. The values of such volatile attributes depend on the selected TraineeGoal, because they define how an exercise affects a FitnessTarget – an increase in strength is achieved by a small number of slow repetitions with very high weight, while fat is burnt best with many fast repetitions with little weight.

Utility
The utility of a configuration in SmartFit depends on the contributions of the active components to the required Choices:
DOM(compi.contributioni) = {20, 40, 60, 80, 100}.
The satThreshold of the Choices depends on the priority of the associated goal:
satThreshold = combine(Goali.Priority, normThreshold),
with DOM(Priority) = {1, 2, 3, 4, 5}. For the example in 6.2, we simply multiplied the priorities with the normThreshold = 80.
The domain of the combined contribution is from 0 to 500 in steps of 20. In case of contributions larger than 500, the overshoot is cut, and the value is set to 500.
The utility for fitness training is given by the following equation:
Config.Utility(ACTGOALS) :=
Σ_{Goalj ∈ Config.ACTGOALS} weight(Goalj.Priority) * Config.ActualUtility(Goalj)
/ Σ_{Compi ∈ Config.ACTCOMPS} Compi.Resource.

6.2 Simplified Example
To make the capabilities of GECKO more tangible, we present a small experiment. For brevity and clarity, we use a reduced knowledge base, with three MuscleGoals (Table 2), 12 exercises (Table 3), and 2 TaskParameters, namely Equipment and a general Fitnesslevel – thus omitting the consideration of different Fitnesslevels related to the specific FitnessTarget, as done in the application system. Furthermore, we set the duration of all exercises to require 5 minutes.
Using this reduced knowledge base, we applied both the basic GECKO algorithm and the goal-focused variant. The results are described in the following subsection.

Table 2: Exemplary muscle goals with priorities
ID  MuscleGoal  Priority: MuscleGain  Priority: GeneralFitness
G1  Biceps      1                     2
G2  Triceps     1                     2
G3  Latissimus  2                     3

Table 3: Exercises and parameters
ID   Exercise             Contributions                 Required Equipment  Required Fitnesslevel
C1   Biceps Curl          Biceps: 100                   None                1
C2   Dips                 Triceps: 100, Latissimus: 20  None                1
C3   Lat-Pull             Biceps: 20, Latissimus: 100   Machines            1
C4   Rev. Butterfly       Triceps: 40                   Machines            1
C5   Pushup               Triceps: 80                   None                2
C6   Pushup on knees      Triceps: 60                   None                1
C7   Shoulder press       Triceps: 80                   Machines            1
C8   Rowing               Biceps: 40, Latissimus: 80    Machines            1
C9   Pull up              Biceps: 100, Latissimus: 80   None                2
C10  Triceps Pulldown     Triceps: 100                  Machines            2
C11  Pull up (supported)  Biceps: 20, Latissimus: 80    None                1
C12  Rowing one-armed     Biceps: 40, Latissimus: 100   Machines            2

To compare the results of different tasks, we conducted two experiments with different TraineeGoals and TaskParameter values. For the basic algorithm, we used the Tasks shown in Table 4.

Table 4: Tasks for experiments A and B
Variable                           Values A         Values B
TaskGoal                           General Fitness  Muscle Gain
TaskParameter: FitnessLevel        Untrained (1)    Trained (2)
TaskParameter: Equipment           Machines         none
TaskRestriction: TrainingDuration  15 minutes       30 minutes

The results of the configuration with the basic GECKO algorithm are shown in Tables 5 and 6.
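The threshold and contribution arithmetic of section 6.1 can be replayed for the reduced knowledge base with a few lines of Python (a sketch; the exercise data are transcribed from Table 3, and the helper names are ours, not from the SmartFit system):

```python
# Sketch of the utility ingredients of section 6.1, using Table 3 data.
# Helper names are ours; the real SmartFit model also varies volatile
# exercise attributes, which we omit here.

CONTRIBUTIONS = {  # exercise -> {target goal: contribution}, from Table 3
    "Lat-Pull": {"Biceps": 20, "Latissimus": 100},
    "Rowing": {"Biceps": 40, "Latissimus": 80},
    "Shoulder press": {"Triceps": 80},
}

NORM_THRESHOLD = 80

def sat_threshold(priority):
    # section 6.1: satThreshold = priority * normThreshold, priority in 1..5
    return priority * NORM_THRESHOLD

def combined_contribution(goal, active_exercises):
    # contributions combine additively in steps of 20 and are cut off at 500
    total = sum(CONTRIBUTIONS[e].get(goal, 0) for e in active_exercises)
    return min(total, 500)

active = ["Lat-Pull", "Rowing", "Shoulder press"]
print(combined_contribution("Latissimus", active))  # 180
print(sat_threshold(2))                             # 160
```

For instance, the Latissimus contributions of the machine exercises Lat-Pull and Rowing combine to 180, while a priority-2 goal yields a satThreshold of 160.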
Table 5: Configuration results of the basic GECKO algorithm
Experiment A    Experiment B
Lat-Pull        Pull up
Rowing          Dips
Shoulder press  Pull up (supported)
                Pushup on knees
                Biceps Curl

Table 6: State of the goals after running the basic algorithm
Muscle Goal  Value A              Value B
Biceps       Partially Satisfied  Satisfied
Triceps      Satisfied            Satisfied
Latissimus   Satisfied            Partially Satisfied

Application Evaluation
The results indicate that the GECKO algorithms are capable of generating optimal solutions to configuration problems. In experiment B, it can be seen that GECKO was not able to satisfy G3 completely, since there were not enough consistent exercises available. Thus, the less important goals were satisfied, but not the important one. In experiment A, on the other hand, the algorithm was able to fully satisfy G3 but not G1, since the duration resource was only sufficient for three exercises.

7 Discussion and Outlook
The results shown above indicate that treating configuration as a diagnostic problem, and solving it with techniques from consistency-based diagnosis, is a promising approach to user-oriented configurators for optimal configuration problems.
The analysis of different application domains, including the ones mentioned in section 3, triggers the insight that variations of the search algorithm may be required in order to reflect the specific requirements and structure of the problems. This is particularly true for applications that involve a high level of interaction, such as leaving choices to the user, providing explanations for system decisions, and allowing him/her to modify decisions in an informed way. Retracting decisions and also generating explanations can be supported by the ATMS, which also produces conflicts.
The conceptual and algorithmic solution to configuration generation presented in this paper could certainly be implemented using other techniques that have been proposed and used for configuration. However, our choice of an ATMS-based solution (and CDA*) was strongly motivated by the overall objectives stated in section 4.1: we intend to base explanation facilities ("which user inputs and domain restrictions prevent option x from being viable?"), preferences and soft constraints, and the possibility to retract input and explore several alternative solutions on the capabilities of the ATMS.
A goal of our work is to extract features from the case studies that can support a classification of configuration applications as a basis for selecting from a set of predefined algorithm variants and strategies for man-machine interaction.
Other options, such as compiling (parts of) the constraint network and moving search heuristics to a lower technical level (the constraint system), will also be explored.
Furthermore, we are currently preparing an application to the configuration of automation systems for collaborative, flexible manufacturing and of modular multi-purpose vehicles. This application of GECKO is likely to require stronger spatial and also temporal constraints for structuring a configuration.

Acknowledgments
We would like to thank our project partners for providing their domain knowledge and their assistance, especially Florian Kreuzpointner and Florian Eibl. Special thanks to Oskar Dressler (OCC'M Software) for providing the constraint system (CS3 or Raz'r). The project was funded by the German Federal Ministry of Economics and Technology under the ZIM program (KF2080209DB3).

References
[1] J.P. McDermott, "R1: An Expert in the Computer Systems Domain", Artificial Intelligence, 1980.
[2] U. Junker, D. Mailharro, "The Logic of ILOG (J)Configurator: Combining Constraint Programming with a Description Logic", IJCAI, 2003.
[3] A. Felfernig, L. Hotz, C. Bagley, and J. Tiihonen, "Knowledge-based Configuration: From Research to Business Cases", Morgan Kaufmann, 2014.
[4] D. Sabin, R. Weigel, "Product Configuration Frameworks – A Survey", IEEE Intelligent Systems, 1998.
[5] J. de Kleer, B.C. Williams, "Diagnosing Multiple Faults", Artificial Intelligence, 1987.
[6] B.C. Williams, R.J. Ragno, "Conflict-directed A* and Its Role in Model-based Embedded Systems", Discrete Applied Mathematics, 2007.
[7] M. Stumptner, G. Friedrich, A. Haselböck, "Generative Constraint-based Configuration of Large Technical Systems", Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 1998.
[8] G. Weiß, F. Grigoleit, P. Struss, "Context Modeling for Dynamic Configuration of Automotive Functions", ITSC, 2013.
[9] C. Richter, "Development of an Interactive Car Configuration System", Master's Thesis, Tech. Univ. of Munich.
[10] A. Verikios, "A Tool for the Configuration of CERN Particle Beam Measurement Systems", Master's Thesis, Tech. Univ. of Munich.
[11] U. Junker, "Configuration", Handbook of Constraint Programming, pp. 837-868, 2006.
[12] J. de Kleer, B.C. Williams, "Diagnosis with Behavioral Modes", IJCAI, 1989.
[13] J. de Kleer, "An Assumption-based TMS", Artificial Intelligence, 1986.
[14] J. de Kleer, "Hitting Set Algorithms for Model-based Diagnosis", Int. Workshop on Principles of Diagnosis (DX), 2011.
Decentralised Fault Diagnosis of Large-scale Systems: Application to Water
Transport Networks
Vicenç Puig and Carlos Ocampo-Martinez
Universitat Politècnica de Catalunya - BarcelonaTech
Institut de Robòtica i Informàtica Industrial, CSIC-UPC
Llorens i Artigas, 4-6, 08028 Barcelona, Spain
e-mail: {vpuig,cocampo}@iri.upc.edu
Abstract
In this paper, a decentralised fault diagnosis approach for large-scale systems is proposed. The approach is based on obtaining a set of local diagnosers using the analytical redundancy relation (ARR) approach. It starts with obtaining the set of ARRs of the system, yielding an equivalent graph. From that graph, the graph partitioning problem is solved, obtaining a set of ARRs for each local diagnoser. Finally, a decentralised fault diagnosis strategy is proposed and applied over the resultant set of partitions and ARRs. In order to illustrate the application of the proposed approach, a case study based on the Barcelona drinking water network (DWN) is used.

1 Introduction
Large-scale systems (LSS) present new challenges due to the large size of the plant and its resultant model [1, 2]. Traditional supervision methods for LSS (including diagnosis and fault-tolerant control) have mostly been developed assuming a centralised scheme in which full information is available. In the same way, a global dynamical model of the system is considered to be available for supervision design (off-line). Moreover, all measurements must be collected in one location in a centralised way. When considering LSS, the centrality assumption usually fails to hold, either because gathering all measurements in one location is not feasible, or because a centralised high-performance computing unit is not available. These difficulties have recently led to research in fault diagnosis (and fault-tolerant control) algorithms that operate in either a decentralised or a distributed way. Depending on the degree of interaction of the diagnosers associated with the subsystems and their diagnosis process, they can be classified into decentralised and distributed diagnosis categories.
In decentralised diagnosis, both a central coordination module and a local diagnoser for each subsystem that forms the whole supervision system are running in parallel. Some examples were presented in [3, 4, 5], where local diagnosers communicate with a coordination process (supervisor), obtaining a global diagnosis. On the other hand, in the distributed approach, a set of local diagnosers share information by means of some communication protocol, instead of requiring a global coordination process as in a decentralised approach. In the related literature, there are several proposals where there is no centralised control structure or coordination process among the diagnosers [6, 7, 8]. Every diagnoser shares information with the neighbouring diagnosers. In these systems the model is distributed, the diagnosis is locally generated, and the consistency among the subsystems should be satisfied.
In this paper, the main contribution relies on the development of a decentralised fault diagnosis approach for LSS based on analytical redundancy relations (ARRs) and graph theory. The algorithm starts by considering a set of ARRs and then stating an equivalent graph. From that graph, the problem of graph partitioning is then solved. The resultant partitioning consists of a set of non-overlapping subgraphs whose numbers of vertices are as similar as possible and whose number of interconnecting edges is minimal. To achieve this goal, the partitioning algorithm applies a set of procedures based on identifying the highly connected subgraphs with a balanced number of internal and external connections, in order to minimise the degree of coupling among the resulting partitions (diagnosers). This algorithm is especially useful in systems where there is no clear functional decomposition. Finally, a decentralised fault diagnosis strategy is introduced and applied over the resultant set of partitions, in a similar way to the one introduced in [5]. In order to illustrate the application of the proposed approach, a case study based on the Barcelona drinking water network (DWN) is used.
The remainder of this paper is organised as follows. Section 2 presents and discusses the overall problem statement. Section 3 presents the ARR graph partitioning methodology. Section 4 describes the proposed decentralised fault diagnosis approach. Section 5 presents both the considered case study and the way of implementing the proposed decentralised fault diagnosis approach. Finally, Section 6 draws the main conclusions.

2 Problem Statement
2.1 Fault Diagnosis using ARRs
Consider a dynamical system represented in general form by the state-space model
x+ = g(x, u, d),  (1a)
y = h(x, u, d),  (1b)
where x ∈ Rn and x+ ∈ Rn are, respectively, the vectors of the current and successor system states (that is, at time instants k and k+1, respectively, if the model is expressed in discrete time), u ∈ Rm is the system input vector, d ∈
Rp is the vector containing a bounded process disturbance, and y ∈ Rq is the system output vector. Moreover, g : Rn × Rm × Rp → Rn is the state mapping function, and h : Rn × Rm × Rp → Rq corresponds to the output mapping function.
The design of a model-based diagnosis system is based on utilizing the system model (1) in the construction of the diagnosis tests. According to [9], by means of the structural analysis tool and the perfect matching algorithm, a set of ARRs, namely R, can be derived from (1). ARRs are constraints that only involve measured variables (y, u) and known parameters θ. The set of ARRs can be represented as
R = {ri | ri = Ψi(yk, uk, θk), i = 1, ..., nr},  (2)
where Ψi is the ARR mathematical expression and nr is the number of obtained ARRs. Then, fault diagnosis is based on identifying the set of consistent ARRs,
R0 = {ri | ri = Ψi(yk, uk, θk) = 0, i = 1, ..., nr},  (3)
and inconsistent ARRs,
R1 = {ri | ri = Ψi(yk, uk, θk) ≠ 0, i = 1, ..., nr},  (4)
at time instant k, when some inconsistency in (2) is detected [10]. The fault isolation task starts by obtaining the observed fault signature, where each single fault signal indicator φi(k) is defined as follows:
φi(k) = 0 if ri(k) ∈ R0,
φi(k) = 1 if ri(k) ∈ R1.  (5)
Fault isolation is based on the knowledge about the binary relation between the considered fault hypothesis set f1(k), f2(k), ..., fnf(k) and the fault signal indicators φi, which is stored in the fault signature matrix M. An element of this matrix, namely mij, is equal to 1 if the fault hypothesis fj is expected to affect the residual ri such that the related fault signal φi is equal to 1 when this fault is affecting the monitored system. Otherwise, the element mij is zero-valued. A column of this matrix is known as a theoretical fault signature. Then, the fault isolation task involves finding a matching between the observed fault signature and some of the theoretical fault signatures.

2.2 Partitioning the Set of ARRs
In order to design a decentralised fault diagnosis system following the ARR approach recalled above, the set of ARRs in (2) should be decomposed into subsets with a minimal degree of coupling. Each subset of ARRs will allow the implementation of a local diagnoser. With this aim, a graph representation of R in (2) is determined. The graph G(V, E) representing the set of ARRs is obtained considering that
• the ARRs are the graph vertices, collected in a set V, and
• the measured input/output variables are the graph edges, collected in a set E.
The graph incidence matrix IM is obtained considering that, without loss of generality, the directionality of the edges is derived from the relation between ARRs (rows of IM) and input/output variables (columns of IM), in an analogous way as proposed by [11] (and references therein) for the partitioning of LSS.¹
¹ There are alternative matrix representations for a graph, such as the adjacency matrix and the Laplacian matrix (see [12]), which are related to the matrix representation used in this paper.
Once IM has been obtained from the ARR graph, the problem consists in partitioning the graph R into subgraphs. Since such partitioning is oriented to the application of decentralised fault diagnosis, it is convenient that the resultant subgraphs have the following features:
• nearly the same number of vertices;
• few connections between the subgraphs.
These features guarantee that the obtained subgraphs have a similar size, a fact that balances the computations between local diagnosers and allows minimising the communications with a supervisory diagnoser. Hence, the partitioning of the ARR graph can be more formally established following the dual problem proposed in [13], as stated here in Problem 1.
Problem 1 (ARR Graph Partitioning Problem). Given a graph G(V, E) obtained from a set of ARRs, where V denotes the set of vertices, E is the set of edges, and p ∈ Z≥1, find p subsets V1, V2, ..., Vp of V such that
1. V1 ∪ V2 ∪ · · · ∪ Vp = V,
2. Vi ∩ Vj = ∅, for i ∈ {1, 2, ..., p}, j ∈ {1, 2, ..., p}, i ≠ j,
3. #V1 ≈ #V2 ≈ · · · ≈ #Vp,
4. the cut size, i.e., the number of edges with endpoints in different subsets Vi, is minimised.
Remark 2.1. Conditions 3 and 4 of Problem 1 are of high interest from the point of view of a decentralised scheme, since they are related to the degree of interconnection between the resultant subsystems and their size balance.
Remark 2.2. The inclusion of additional specifications directly related to the FDI performance of each subsystem diagnoser will be addressed as a future extension of the proposed partitioning approach.
Remark 2.3. The partitioning approach starts from a given set of ARRs obtained using the perfect matching algorithm. The selection of the best ARRs from the set of all possible ARRs (that could be obtained using the available sensors and the system structure), such that applying the partitioning algorithm produces a set of diagnosers with good FDI performance, could be considered as an additional future improvement.
In general, graph partitioning problems are considered NP-complete [2]. However, they can be solved in polynomial time for p = 2 (Kernighan-Lin algorithm); see, e.g., [14]. Since the latter condition is quite restrictive for large-scale graphs, alternatives for graph partitioning based on fundamental heuristics are properly accepted and broadly discussed.

3 Proposed Partitioning Approach
Starting from the system ARR graph obtained as described in Section 2, this section proposes a partitioning algorithm through which a decomposition of the set of system ARRs can be performed. This decomposition allows the splitting of a centralised diagnoser into local diagnosers. The philosophy of the proposed approach comes from the partitioning methodology reported in [13], where a dynamic system is decomposed into several subsystems following certain criteria towards fulfilling a set of design conditions. For completeness and full understanding of the proposed diagnosis
100
Proceedings of the 26th International Workshop on Principles of Diagnosis
For completeness and full understanding of the proposed
diagnosis methodology, that approach is explained below and
suitably adapted where needed.
The algorithm is divided into a main kernel and auxiliary
routines that refine the final result according to the nature of
the system and the given criteria, depending on the case. Here,
the ARR graph is decomposed into subgraphs in the same way
as a system would be divided into subsystems.

3.1 Main Kernel
This part performs the central task of defining how the
equivalent ARR graph of the LSS is split into subgraphs. The
steps of the algorithm are followed in the form of subroutines
towards reaching the main goals outlined in Problem 1. Notice
that the whole algorithm is used off-line, i.e., the partitioning
of the ARR graph is not carried out dynamically on-line.
Ongoing research focuses on adapting the proposed algorithm
so that the partitioning can be performed on-line when some
structural change of the network occurs. The different
subroutines are briefly described next.
• The start-up routine, which requires the matrix-based
definition of the graph, e.g., via the incidence matrix, in order
to state the connections between the graph vertices.
• The preliminary partitioning routine, which performs a
clustering-like procedure where all graph vertices are assigned
to a particular subset according to predefined indices related
to the resultant subgraph and its internal weight (defined as
the number of vertices of a subgraph), its external weight
(defined as the number of shared edges between subgraphs)
and other statistical measures. The resultant number of
partitions at this stage is obtained automatically.
• The uncoarsening routine, which is applied to reduce the
number of resultant subgraphs if their internal weight is
unbalanced, which would produce partitions with large
differences in their numbers of vertices. This routine defines a
design parameter ϕmax that bounds the variance of the
internal weight over all the resultant subgraphs.
• The refining routine, which aims at reducing the cut size of
the resultant subgraphs, i.e., the number of edges they share.
This routine is based on the connectivity of the vertices of a
subgraph with other vertices in the same subgraph and in
neighbouring subgraphs².
Applying the aforementioned routines to the entire ARR
graph, the expected result consists of a set of subgraphs that
determines a particular decomposition. This set P is finally
defined as

    P = { Gi , i = 1, 2, . . . , p : ∪_{i=1}^{p} Gi = G } .   (6)

²Two subgraphs are called neighbours if they are contiguous
and share edges (see, e.g., [15] among many others).

3.2 Auxiliary Routines
Although the decomposition algorithm yields an automatic
partitioning of a given graph, this does not imply that the
resultant set P follows the pre-established requirements stated
in Problem 1. Therefore, complementary routines enhance the
partitioning depending on their tuning for the particular case
study. Additional auxiliary routines might be designed in such
a way that the diagnosis performance that would be achieved
when used in decentralised or distributed fault diagnosis is
taken into account. These auxiliary routines are:
• The pre-filtering routine, which lightens the start-up routine
by merging all vertices with a single connection into those to
which they are connected. It yields a smaller initial graph and
hence a faster clustering of vertices.
• The post-filtering routine, which adds a tolerance parameter
δ so that the uncoarsening routine yields fewer subgraphs
when two of them could conveniently be merged but the
numerical constraints do not allow it. This routine might
increase the complexity, since the internal weight of some
subgraphs would also increase, unbalancing the resultant set
of partitions.
• The anti-oscillation routine, which solves a possible issue
when the refining (external balance) routine is run, since it
defines a maximum number of iterations ρ for which the
refining routine is executed.

4 Decentralised Fault Diagnosis
Once a partitioned set of ARRs has been obtained by means
of the algorithm presented in Section 3, the decentralised fault
diagnosis approach is introduced. In order to explain how the
proposed fault diagnosis approach works, the focus is on
faults affecting the sensors measuring the input/output
variables involved in the ARRs. The approach could easily be
extended to other types of faults but, to keep the explanation
simpler, it is restricted to this set of considered faults. In this
way, a fault can be associated with each measured
input/output variable.
Each subset of ARRs allows the implementation of a local
diagnoser Di in the way described in Section 2.1. The ARRs
associated with a local diagnoser can be split into two groups.
The first group, named in the following local ARRs, is
composed of ARRs that do not involve variables shared with
ARRs in a different local diagnoser. On the other hand, the
second group, named shared ARRs, is composed of ARRs
that involve shared variables. Figure 1 shows two sets of
ARRs associated with two local diagnosers, named D2 and
D4. These two diagnosers share some variables (in this case
only outputs, but they can be both inputs and outputs). This
set of shared variables allows the definition of the set of
shared ARRs, named DC in the figure. The remaining ARRs,
which do not share variables, are local ARRs.
Similarly, faults in the fault signature matrix M of the local
diagnoser that only involve local ARRs can be locally
diagnosed. Thus, the local diagnoser works in a decentralised
manner regarding those faults. On the other hand, faults that
involve ARRs with shared variables in different subgraphs
cannot be locally diagnosed. Instead, a global diagnoser that
evaluates the involved ARRs is used. This diagnoser has a
fault signature matrix M collecting the involved ARRs with
shared variables between local diagnosers and the faults that
should be globally diagnosed. When local diagnosers evaluate
an ARR composed of shared variables, they send the result of
the consistency check to the global diagnoser, which proceeds
with the global diagnosis using a fault signature matrix that
contains the involved ARRs.
As a result of the global diagnosis based on the involved
ARRs with shared variables, a fault in these variables can be
diagnosed or, alternatively, excluded. In case of exclusion,
local diagnosers sharing a given ARR whose shared variable
has been considered non-faulty continue reasoning with all
ARRs, i.e., all the involved ones, proposing a fault candidate
using the local fault signature.

[Figure 1: Subsets of ARRs of two local diagnosers (D2 and
D4) sharing some variables; the shared ARRs form the set DC.]

5 Application to a Case Study
This section briefly describes a case study in order to
exemplify the application of the proposed decentralised
diagnosis approach to a real LSS. In particular, the transport
infrastructure of the Barcelona Drinking Water Network
(DWN) is used.

5.1 Case Study Description
The Barcelona DWN, managed by Aguas de Barcelona, S.A.
(AGBAR), supplies drinking water to Barcelona city and its
metropolitan area through four drinking water treatment
plants: the Abrera and Sant Joan Despí plants, which extract
water from the Llobregat river, the Cardedeu plant, which
extracts water from the Ter river, and the Besòs plant, which
treats the underground flows from the aquifer of the Besòs
river. All sources together provide a total flow of around
7 m3/s. The water flow from each source is limited, which
implies different water prices depending on water treatments
and legal extraction canons. See [16] for further information
about this system and [17] for further details about its
modelling and management criteria.

5.2 Monitoring-oriented Model
In order to obtain a monitoring-oriented model of the DWN,
the constitutive network elements (i.e., tanks, actuators, water
demand sectors, nodes and sources) as well as their basic
relationships should be stated [16].
By considering the mass balance at tanks and the static
relations at the α network nodes, the monitoring-oriented
discrete-time state-space model of the DWN can be written as

    x_{k+1} = A x_k + Γ ν_k ,   (7a)
    E1 ν_k = E2 ,               (7b)
    y_k = C x_k ,               (7c)

with Γ = [B Bp] and ν_k = [u_k^T d_k^T]^T, where x ∈ R^n is
the state vector corresponding to the water volumes of the n
tanks, u ∈ R^m represents the vector of flows manipulated
through the m actuators (pumps and valves), d ∈ R^q
corresponds to the vector of the q water demands (consumption
sectors) and y ∈ R^n is the vector of measured water volumes
of the n tanks. In this case, the difference equations in (7a)
describe the dynamics of the storage tanks, the algebraic
equations in (7b) describe the static relations (i.e., mass
balance at junction nodes) in the network, and (7c) describes
the relation between the physical and measured tank volumes.
Moreover, A, B, Bp, C, E1 and E2 are system matrices of
suitable dimensions dictated by the network topology.

5.3 Implementation of the Proposed Approach
This section discusses the way the proposed decentralised
fault diagnosis approach is implemented in the considered real
case study. Figure 2 corresponds to the aggregate model of the
Barcelona DWN, which is a simplification of the complete
model, where groups of elements have been aggregated (not
discarded) into single nodes to reduce the size of the whole
network model. Using this aggregate model, the ARR graph of
the Barcelona DWN has been derived after generating the set
of ARRs from the mathematical model (7) by using the perfect
matching algorithm [9], which aims to find a causal
assignment that associates unknown system variables with the
system constraints from which they can be calculated.
Applying the partitioning algorithm to this graph, five groups
of ARRs are obtained, which correspond to five diagnosers
that monitor different parts of the Barcelona DWN,
represented with different colors in Figure 2. Table 2 collects
the descriptions of the resultant subgraphs, their number of
ARRs and shared variables (manipulated flows through
actuators), represented using circles in Figure 2. At this point
it should be recalled that one of the goals of the partitioning
algorithm is to reduce as much as possible the number of
shared edges between subgraphs, obtaining a graph
decomposition that is as weakly interconnected as possible
and has a similar number of vertices for each subsystem
(internal weight). This allows an easier global diagnosis
configuration, not only with respect to the number of
distributed diagnosers but also with respect to the complexity
of each local diagnoser Di. Thus, the application of the
approach to the Barcelona DWN implies the design of five
decentralised diagnosers together with a
centralised/supervisory one, which is in charge of the coupled
relations within the corresponding fault signature matrix of
the whole system.

Table 1: Barcelona DWN subsystems and number of both
shared elements and ARRs
Number  Color   # ARRs  # Shared variables
1       green   4       1
2       red     5       5
3       yellow  8       6
4       blue    8       16
5       purple  5       5

For this example, it is important to highlight that the ARRs
have been obtained by considering the following assumption.
Assumption 5.1. Only faults in actuators are taken into
account. Sensors are supposed to operate properly.
[Figure 2: ARR Partitioning of the Barcelona DWN — aggregate
network model with tanks, actuators (pumps and valves),
demands and nodes; the five resultant subsystems are drawn in
different colors and the shared variables are marked with
circles.]
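The partitioning step applied to this graph can be illustrated with a minimal greedy clustering sketch. This is only in the spirit of the preliminary partitioning and refining routines of Section 3 (assign each vertex to the subgraph to which it is most connected, subject to a cap on the internal weight, and measure the resulting cut size); it is not the authors' exact algorithm.

```python
# Greedy clustering sketch in the spirit of the preliminary
# partitioning routine: vertices join the subgraph they are most
# connected to, capped by a maximum internal weight (vertex count).

def greedy_partition(adjacency, max_internal_weight):
    parts = []                         # list of sets of vertex indices
    for v in range(len(adjacency)):
        best, best_links = None, 0
        for p in parts:
            if len(p) >= max_internal_weight:
                continue               # internal weight cap reached
            links = sum(adjacency[v][u] for u in p)
            if links > best_links:
                best, best_links = p, links
        if best is None:
            parts.append({v})          # open a new subgraph
        else:
            best.add(v)
    return parts

def cut_size(adjacency, parts):
    """Number of edges shared between different subgraphs."""
    label = {v: i for i, p in enumerate(parts) for v in p}
    n = len(adjacency)
    return sum(adjacency[u][v]
               for u in range(n) for v in range(u + 1, n)
               if adjacency[u][v] and label[u] != label[v])

# Toy graph: two triangles joined by a single edge.
adj = [[0, 1, 1, 0, 0, 0],
       [1, 0, 1, 0, 0, 0],
       [1, 1, 0, 1, 0, 0],
       [0, 0, 1, 0, 1, 1],
       [0, 0, 0, 1, 0, 1],
       [0, 0, 0, 1, 1, 0]]
parts = greedy_partition(adj, max_internal_weight=3)
```

On this toy graph the sketch recovers the two triangles as balanced subgraphs with a cut size of one shared edge, which is the kind of decomposition (similar internal weights, minimal interconnection) the paper's algorithm targets.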
Table 2: Barcelona DWN subsystems and number of both
shared elements and ARRs
Number  Color   # ARRs  # Shared variables
1       green   4       1
2       red     5       5
3       yellow  8       6
4       blue    8       16
5       purple  5       5

[Figure 3: Scheme of the decentralised diagnoser for the
Barcelona DWN: the resultant subsystems S1–S5 and their
numbers of shared variables.]

Table 3: Fault signature matrix of S1
ARR            fy1  fu1  fu2  fy2  fu5  fu3  fu4  fu6
r^{S1}_{1,k}    ✗    ✗    ✗
r^{S1}_{2,k}         ✗    ✗
r^{S1}_{3,k}                    ✗    ✗
r^{S1}_{4,k}                         ✗    ✗    ✗    ✗

In order to understand easily how the proposed decentralised
fault diagnosis approach works, it is explained focusing on
subsystems S1 and S4, presented in red lines in Figure 3,
which correspond to the subsystems in green (S1) and in blue
(S4) in Figure 2. In particular, considering the set of ARRs
corresponding to S1 as

    r^{S1}_{1,k} = y_{1,k} − y_{1,k−1} − ∆t [u_{1,k−1} + u_{2,k−1} − d_{1,k−1}],
    r^{S1}_{2,k} = u_{1,k} − u_{2,k} − d_{2,k},
    r^{S1}_{3,k} = y_{2,k} − y_{2,k−1} − ∆t [u_{5,k−1} − d_{3,k−1}],
    r^{S1}_{4,k} = u_{3,k} − u_{4,k} − u_{5,k} − u_{6,k},

the fault signature matrix presented in Table 3 can be
obtained. From this table, it is possible to identify the
shadowed part, which corresponds to the faults that the local
diagnoser D1 is able to isolate when a fault activates any of
the ARRs r^{S1}_{i,k}, i = 1, 2, 3, since those ARRs only
involve local variables. However, if the residual r^{S1}_{4,k} is
activated, it is necessary that a global diagnoser interacts with
D1, discriminating whether the corresponding ARR in S4,
defined here as r^{S4}_{1,k}, was also activated. If this is the
case, the element u6 is then in fault and hence isolated.
Otherwise, D1 can decide locally (then isolating u3, u4 or u5).
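The local isolation step just described can be sketched with a column-matching test on the fault signature matrix of Table 3: a fault is a candidate if its signature column equals the observed residual activation pattern. This is an illustration of the reasoning, not the paper's implementation.

```python
# Column matching on the fault signature matrix of S1 (Table 3):
# rows are the residuals r1..r4 of S1, columns are the faults.
import numpy as np

faults = ["fy1", "fu1", "fu2", "fy2", "fu5", "fu3", "fu4", "fu6"]
M = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],   # r1 involves y1, u1, u2
    [0, 1, 1, 0, 0, 0, 0, 0],   # r2 involves u1, u2
    [0, 0, 0, 1, 1, 0, 0, 0],   # r3 involves y2, u5
    [0, 0, 0, 0, 1, 1, 1, 1],   # r4 involves u3, u4, u5, u6
])

def candidates(activated):
    """Faults whose signature column equals the activation pattern."""
    pattern = np.array(activated)
    return [f for j, f in enumerate(faults)
            if np.array_equal(M[:, j], pattern)]
```

For instance, activating r3 and r4 together singles out fu5, whereas activating r4 alone leaves {fu3, fu4, fu6} ambiguous: this is exactly the case where D1 needs the global diagnoser to check the shared ARR of S4 before u6 can be isolated.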
In Table 4, the fault signature matrix for the ARRs that
contain shared variables between both S1 and S4 is presented.

Table 4: Part of the fault signature matrix accounting for
shared variables between S1 and S4
ARR            ...  fu5  fu6  fu7  ...
r^{S1}_{4,k}         ✗    ✗
r^{S4}_{1,k}              ✗    ✗

There, r^{S1}_{4,k} corresponds to the fourth ARR of S1 (last
row of Table 3), while

    r^{S4}_{1,k} = x_{3,k} − x_{3,k−1} − ∆t [u_{7,k−1} + u_{8,k−1} + u_{6,k−1} − u_{9,k−1}]

corresponds to the first defined ARR for S4. Notice that the
global diagnoser should decide by looking at the ARR
activations that occurred in this fault signature matrix and then
interact with the different local diagnosers if needed.

6 Conclusions
In this paper, a decentralised fault diagnosis approach for
large-scale systems based on graph theory has been presented.
The algorithm starts with the translation of the system model
into a graph representation. Then, applying the perfect
matching algorithm, a set of analytical redundancy relations is
obtained. From the analytical redundancy relation graph, the
problem of graph partitioning is then solved. The resultant
partition consists of a set of non-overlapping subgraphs whose
numbers of vertices are as similar as possible and whose
number of interconnecting edges is minimal. To achieve this
goal, the partitioning algorithm applies a set of procedures
based on identifying highly connected subgraphs with a
balanced number of internal and external connections. Finally,
a decentralised fault diagnosis strategy is introduced and
applied over the resultant set of partitions. In order to illustrate
and discuss the use and application of the proposed approach,
a case study based on the Barcelona DWN has been used. As
further research, the partitioning algorithm will be improved
by acting directly on the system model rather than on the set
of ARRs, in order to generate a set of ARRs for each local
diagnoser with enhanced fault diagnosis properties.

Acknowledgements
This work has been partially supported by the EFFINET grant
FP7-ICT-2012-318556 of the European Commission and the
Spanish project ECOCIS (Ref. DPI2013-48243-C2-1-R).

References
[1] J. Lunze. Feedback Control of Large-Scale Systems.
Prentice Hall, Great Britain, 1992.
[2] D.D. Šiljak. Decentralized Control of Complex Systems.
Academic Press, 1991.
[3] L. Console, C. Picardi, and D. Theseider Duprè. A
framework for decentralized qualitative model-based
diagnosis. In International Joint Conference on Artificial
Intelligence (IJCAI), pages 286–291, Hyderabad, India, 2007.
[4] Y. Pencolé and M.-O. Cordier. A formal framework for
the decentralised diagnosis of large scale discrete event
systems and its application to telecommunication networks.
Artificial Intelligence, 164(1-2):121–170, 2005.
[5] S. Indra, L. Travé-Massuyès, and E. Chanthery.
Decentralized diagnosis with isolation on request for
spacecraft. In Fault Detection, Supervision and Safety of
Technical Processes, pages 283–288, México, 2012.
[6] F. Boem, R.M.G. Ferrari, T. Parisini, and M.M.
Polycarpou. Distributed fault diagnosis for continuous-time
nonlinear systems: The input-output case. Annual Reviews in
Control, 37(1):163–169, 2013.
[7] J. Biteus, E. Frisk, and M. Nyberg. Distributed diagnosis
using a condensed representation of diagnoses with
application to an automotive vehicle. IEEE Transactions on
Systems, Man, and Cybernetics – Part A: Systems and
Humans, 41(6):1262–1267, November 2011.
[8] I. Roychoudhury, G. Biswas, and X. Koutsoukos.
Designing distributed diagnosers for complex continuous
systems. IEEE Transactions on Automation Science and
Engineering, 6(2):277–290, April 2009.
[9] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki.
Diagnosis and Fault-Tolerant Control. Springer-Verlag,
Berlin, Heidelberg, second edition, 2006.
[10] S. Tornil-Sin, C. Ocampo-Martinez, V. Puig, and
T. Escobet. Robust fault diagnosis of nonlinear systems using
interval constraint satisfaction and analytical redundancy
relations. IEEE Transactions on Systems, Man, and
Cybernetics: Systems, 44(1):18–29, January 2014.
[11] A.I. Zečević and D.D. Šiljak. Control of Complex
Systems: Structural Constraints and Uncertainty.
Communications and Control Engineering. Springer, 2010.
[12] J.A. Bondy and U.S.R. Murty. Graph Theory, volume
244 of Graduate Texts in Mathematics. Springer, 2008.
[13] C. Ocampo-Martinez, S. Bovo, and V. Puig. Partitioning
approach oriented to the decentralised predictive control of
large-scale systems. Journal of Process Control,
21(5):775–786, 2011.
[14] T.N. Bui and B.R. Moon. Genetic algorithm and graph
partitioning. IEEE Transactions on Computers,
45(7):841–855, 1996.
[15] L. Addario-Berry, K. Dalal, and B. Reed.
Degree-constrained subgraphs. Discrete Applied
Mathematics, 156(7):1168–1174, 2008.
[16] C. Ocampo-Martinez, V. Puig, G. Cembrano, R. Creus,
and M. Minoves. Improving water management efficiency by
using optimization-based control strategies: the Barcelona
case study. Water Science & Technology: Water Supply,
9(5):565–575, 2009.
[17] C. Ocampo-Martinez, V. Puig, G. Cembrano, and
J. Quevedo. Application of predictive control strategies to the
management of complex networks in the urban water cycle
[applications of control]. IEEE Control Systems Magazine,
33(1):15–41, 2013.
Self-Healing as a Combination of Consistency Checks
and Conformant Planning Problems
Alban Grastien
Optimisation Research Group, NICTA
Artificial Intelligence Group, The Australian National University
Canberra Research Laboratory, Australia
Abstract
We introduce the problem of self-healing, in which a system
is asked to self-diagnose and self-repair. The two problems of
computing the diagnosis and the repair are often solved
separately. We show in this paper how to tie these two tasks
together: a planner searches for a prospective plan on a
sample of the belief state; a diagnoser verifies the
applicability of the plan and returns a state of the belief state
(added to the sample) in which the plan is not applicable. This
decomposition of the self-healing process avoids the explicit
computation of the belief state. Our experiments demonstrate
that it scales much better than the traditional approach.

1 Introduction
Autonomous systems are subject to faults and require regular
repair actions; systems capable of performing such tasks are
called self-healing. Finding the optimal repair involves
solving a diagnosis problem (what may the current system
state be?) together with a planning problem (what
optimal/near-optimal course of actions, applicable in all of the
possible states, leads to an acceptable state?). In large,
partially observable systems, computing an explicit "belief
state" can be intractable; finding a plan applicable in all
elements of this belief state can also be intractable.
In this paper we propose a method that avoids these two
intractable problems. This method relies on the intuition that
the full belief state is not necessary to find the appropriate
repair. For instance, if a self-healing problem requires making
sure that n given machines are turned off and if the status (on
or off) of these machines is unknown, then the belief state
comprises 2^n states. However, the optimal plan (press the
stop button on every machine) happens to be the optimal plan
of the state where none of the machines has been shut down:
this single state is "representative" of all the states in the
belief state.
Our approach uses a planner to compute an optimal plan for
a small sample of the belief state (at most dozens of
elements); the plan is applicable in all these states and leads to
the goal state. In order to validate the plan for the full belief
state we search for an element of the belief state in which the
plan is not applicable. To this end we define a new type of
diagnoser that solves the following problem: find a possible
behaviour of the system (that agrees with the model and the
observations) that ends up in a state q in which the plan is not
correct; this state q is added to the sample of the belief state
so that the planner finds a more suitable repair plan at the next
iteration. Failure on the part of the diagnoser to find such a
behaviour proves that the plan is indeed correct. In practice,
the problem of verifying the correctness of a plan is reduced
to a propositional satisfiability (SAT) problem that is
unsatisfiable iff the plan is applicable in all states and that
returns a counterexample if not.
The contributions of this paper are i) a formal definition of
the self-healing problem, ii) the solving of self-healing as a
combination of diagnosis and planning steps, and iii) the
reduction of each step to SAT.
This work is performed in the context of discrete event
systems [Cassandras and Lafortune, 1999]. As opposed to
supervisory control, where actions (either active or passive,
such as forbidding some events) are performed while the
system is running, we follow the work of Cordier et al. [2007]
and assume that the repair is performed whilst the system is
inactive.
The paper is divided as follows. The next section defines
the self-healing problem formally. Section 3 presents the
proposed algorithm from a set-based perspective. The SAT
implementation is presented in Section 4. Experimental
validation is given in Section 5. A comparison with other
problems and approaches is given in Section 6.

2 Problem Definition
The problem we are addressing is illustrated in Figure 1. We
are concerned with finding the most appropriate repair for a
partially observed system that has been running freely.
We assume that the system can run in two different modes:
the "active" (and useful) mode, in which the system is free to
operate (left half of the figure), and the "repair" mode, in
which the system state is being re-adjusted (right half). The
system behaves quite differently in the two modes. In the
active mode, the system is partially observable but
uncontrolled. In the repair mode, the system is not observed
albeit controlled; the state changes only through explicit
application of actions, and special attention must be paid to
their applicability and effects.
[Figure 1: Schematic description of the self-healing problem:
the system runs freely (partially observed, uncontrolled) from
an initial state q0 to the unknown current state qn = q0′; the
repair plan a1, ..., ak must return the state to the goal set.]

One reason for assuming that the system does not run freely
in the repair mode is that we do not want to consider scenarios
where faults can occur during the repair, which would increase
the overall complexity of the problem. We believe that this
limitation, essentially the fact that the repair actions have
deterministic effects, can be lifted.

2.1 Explicit Model
We are considering discrete event systems (DES, [Cassandras
and Lafortune, 1999]). The system is modeled as a finite state
machine, i.e., a finite set Q of states together with a set T of
transitions labeled with finitely many events/actions.

Definition 1 An explicit self-healing system model is a tuple
M = ⟨Q, I, Σ, Σo, Σa, T, G, U⟩ where
• Q is a finite set of states, I ⊆ Q is a set of initial states,
G ⊆ Q is a set of goal states, U ⊆ Q is a set of unstable
states,
• Σ is a finite set of events, Σo ⊆ Σ is the set of observable
events, Σa ⊆ Σ is the set of actions, and
• T ⊆ (Q × Σ × Q) is the set of transitions ⟨q, e, q′⟩, also
denoted q −e→ q′.

In the active mode the system takes a path
ρ = q0 −e1→ ... −en→ qn such that {e1, ..., en} ⊆ Σ \ Σa,
q0 ∈ I and qn ∉ U. This last condition is used to prevent
situations where a fault happened right before the repair is
applied, i.e., before any observation of this fault was made.
This assumption is similar to the one made, e.g., by Lamperti
and Zanella, that the system is quiescent (no more events are
about to happen) when diagnosis is performed [Lamperti and
Zanella, 2003]. This assumption can be removed by assuming
U = Q. Finally, the observation O = obs(ρ) of this path is the
projection of e1, ..., en on the observable events Σo (i.e., all
non-observable events are eliminated from the sequence).
In the repair mode a sequence of actions, called a plan
π = a1, ..., ak, is applied ({a1, ..., ak} ⊆ Σa). From state
q0′ ∈ Q, the application of π leads to the (single) state
qk′ = π(q0′) such that q0′ −a1→ ... −ak→ qk′. We assume
that every action is applicable in every state (if this is not the
case, a non-goal sink state can be created to which all
inapplicable actions lead) and that actions have deterministic
effects. If π leads q0′ to a goal state, we say that π is correct
for q0′.
Notice that a plan is a simple sequence: we do not assume
that additional observations are available at run-time. There is
no probing action available. Besides non-deterministic action
effects, the use of conditional plans is a second natural
extension of this work.

Definition 2 The self-healing problem is a pair P = ⟨M, O⟩
where M is a model and O is an observation. A repair plan
for P is a plan that is guaranteed to be correct in the current
state. Formally, a repair plan is a plan π such that

    ∀ρ = q0 −e1→ ... −en→ qn .
    (q0 ∈ I ∧ obs(ρ) = O ∧ qn ∉ U) ⇒ π(qn) ∈ G.   (1)

The set of repair plans is denoted Π(M, O), or simply Π.
Given a cost function on sequences of actions, the objective
of the self-healing problem is to find a cost-minimal repair
plan (for simplicity we assume that such a plan exists):

    π⋆ = arg min_{π ∈ Π} cost(π).

This definition assumes a cost function that provides a total
order on the plans. In practice we will try to minimise the
number of actions (all actions have the same cost, the cost is
cumulative) and break ties at random.
We see two main categories of self-healing problems,
namely i) a recurring situation where the system is stopped
regularly, which provides a good opportunity to perform
corrective actions on the system; and ii) a situation where a
diagnoser/monitor detects an anomaly in the system and
triggers a self-healing procedure. The present work is
independent of how the problem was prompted.

2.2 Solving the Problem Explicitly
This paper works under the assumption that the system model
is very large and that it is impractical to manipulate sets of
states. We discuss this issue here and present some notations.
The simplest way to solve the self-healing problem is to
compute the belief state and then compute the optimal plan
for this set of states.
Given a model M and the observation O, the belief state B^O
is defined as the set of states that the system could be in:

    B^O = {q ∈ Q | ∃ρ = q0 −e1→ ... −en→ qn .
           q0 ∈ I ∧ obs(ρ) = O ∧ qn ∉ U ∧ q = qn}.

Notice that the definition of the belief state matches the first
part of Equation (1).
A conformant plan for the set of states B^O is a plan π that is
correct for all states of B^O: ∀q ∈ B^O . π(q) ∈ G (cf.
Figure 2). Compared to the general definition of a conformant
plan (a more detailed comparison is given in Section 6), we
only deal with uncertainty on the initial state and we assume
that actions have deterministic effects. Conformant planning is
provably PSPACE-hard for explicit models.
We consider the conformant planning problem from the
initial set of states B^O and use b = |B^O| to denote the size
of B^O. The problem can be solved by considering the finite
state machine M′ where each state of M′ is a set of states of
the original model and each transition from state S labeled by
action a leads to S′ = {q′ ∈ Q | ∃q ∈ S . ⟨q, a, q′⟩ ∈ T}. The
initial state of M′ is B^O; a state S of M′ is a goal state if it
satisfies S ⊆ G. A plan π is a sequence of actions such that
π(B^O) (in M′) is a goal state. Because the original model is
deterministic, the transition ⟨S, a, S′⟩ is such that the size of
S′ is no larger than that of S. The number of states in M′ is
bounded by the sum of binomial coefficients
C(|Q|, 1) + ··· + C(|Q|, b).

[Figure 2: Solving conformant problems; the vertical lines
mean that the transitions are labeled by the same action.]

The model M′ presented before cannot easily be expressed in
planning modeling languages such as STRIPS or PDDL, or
implemented in SAT. Another reduction, to M′′, can be
introduced whose states are tuples (with b elements) of states
from the original model: Q′′ = Q^b. A tuple state is a goal
state if all its elements are in the

Finally, we look at a formulation of the planning problem
that is complementary to the computation of the belief state.
Assume that a plan π is given and we want to compute the set
of states B^π in which the plan π is correct:
B^π = {q ∈ Q | π(q) ∈ G}.

Lemma 1 Plan π is a correct plan iff B^O ⊆ B^π.

Writing B̄^π =def Q \ B^π for the set of states for which π is
not correct, plan π is a correct plan iff B^O ∩ B̄^π = ∅.

3 Set Formulation of Self-Healing
We first present a formulation of our solution that is based on
sets and that does not consider implementation issues
(presented in the next section).
We propose a lazy approach to self-healing. In this approach
we search for a correct plan for a sample of the belief state (a
"belief sample") and then search for a state of the belief state
in which the plan is not applicable; this state is added to the
sample and the procedure is iterated until a robust plan has
been found.
We first give the theoretical results that justify the algorithm
presented at the end of the section.
In the following we use the notations B^O and B to
represent sets of states such that B ⊆ B^O. B^O will represent
the belief state and B a small subset (a few elements) of B^O.
S, S′ will represent arbitrary sets of states.
Let Π(q) be the set of repair plans that are correct for state
q. Let Π(S) be the set of repair plans that are correct
whichever is the current state from S. Then
Π(S) = ∩_{q∈S} Π(q). Notice that Π = Π(B^O).
A trivial result is:

    S ⊆ S′ ⇒ Π(S) ⊇ Π(S′).

A consequence of this proposition is that the optimal repair
for B^O is a correct plan for B. Computing the optimal repair
plan for the latter may therefore yield the optimal plan for the
former. Let π∗(S) be the optimal plan for a set of states. The
next proposition determines how to characterize that an
optimal plan was found:

    S ⊆ S′ ∧ (π∗(S) ∈ Π(S′)) ⇒ π∗(S) = π∗(S′).

This result can be derived from the previous proposition.
π∗(S′) belongs to Π(S) since S ⊆ S′; therefore π∗(S) is better
than (or equal to) π∗(S′). However, if π∗(S) ∈ Π(S′) and yet
π∗(S′) ≠ π∗(S), then π∗(S′) must be strictly better than π∗(S),
which contradicts what was just said.
Applied to S = B and S′ = B^O ⊇ B, this means that
π∗(B) ∈ Π(B^O) implies π∗(B) = π∗(B^O).
We reuse the notation B̄^π for the set of states in
goal: G′′ = Gb . The transitions in M ′′ correspond to which the plan π is correct, and B π = Q \ B π for the
the parallel execution of the same action in each state set of states in which it is not. With this notation,
of the tuple (represented by the vertical lines on Fig-
π ∗ (B) ∈ Π(B O ) is equivalent to B O ∩ B π∗ (B) = ∅.
ure 2).
Assume that there exists a procedure
In general M ′′ is larger than M ′ . The model also con-
verify applicability (S, π) that extracts a state
tains symmetries that efficient implementations might
need to address explicitely: for instance in model M ′′ q ∈ S ∩ B π if such a state exists, and returns ⊥
states hq1 , q2 i and hq2 , q1 i are different while they would otherwise. Then, for S ⊆ S ′ , the following results are
be the same in M ′ : {q1 , q2 } = {q2 , q1 }. trivial:
Clearly this type of approach is only applicable if B O • verify applicability (S ′ , π ∗ (S)) = ⊥ ⇒ π ∗ (S) =
comprises no more than a few dozen elements. π ∗ (S ′ );
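To make the contract of verify_applicability concrete, here is a minimal explicit-state sketch in Python. The dictionary-based transition encoding and all names are our own illustration; the paper's actual implementation is symbolic, via sat, as described in Section 4.

```python
def simulate(q, plan, delta):
    """Apply a deterministic plan from state q.
    delta maps (state, action) pairs to successor states; an action with
    no transition leaves the state unchanged (as in Figure 3)."""
    for a in plan:
        q = delta.get((q, a), q)
    return q

def verify_applicability(S, plan, delta, goals):
    """Return a state of S from which the plan does not end in a goal state,
    or None (standing for "bottom") if the plan is correct for every state of S."""
    for q in S:
        if simulate(q, plan, delta) not in goals:
            return q
    return None
```

On a toy model with delta = {("F", "a1"): "G", ("H", "a1"): "I"} and goals = {"G"}, the plan [a1] is accepted for the sample {F} but rejected for {F, H}, with H returned as the new sample element.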
The first proposition shows that verify_applicability can be used to check whether the plan π*(B) is correct for BO. The second proposition indicates how a better prospective plan can be computed if π*(B) is not correct: the addition of q to S guarantees that a different plan will be generated.

These results lead to the procedure presented in Algorithm 1. In this procedure, find_plan(B) is a method that computes a conformant plan from B as defined at the end of the previous section (and described in the next section). The procedure computes the optimal plan for a belief sample B. If verify_applicability finds a state q ∈ BO in which this plan is not correct, then this state is added to the belief sample and a new optimal plan is generated and tested.

Algorithm 1: Diagnosis algorithm for the self-healing problem without enumerating the belief state BO
    B := ∅
    loop
        π := find_plan(B)
        q := verify_applicability(BO, π)
        if q = ⊥ then
            return π
        else
            B := B ∪ {q}
        end if
    end loop

Because i) each loop iteration adds an element to B and ii) BO is finite, this procedure is guaranteed to terminate. The number of iterations is, in the worst case, the size of BO; we expect however that a handful of calls to find_plan(·) will be sufficient to find the optimal plan.

Example

We illustrate Algorithm 1 with the example of Figure 3. Assume that the observations are O = [o1, o2]. According to the model, the belief state is BO = {A, D, F, H} (state B is unstable, so the system cannot be in this state). The state needs to be returned to a subset of {A, G}.

[Figure 3 omitted: a transition diagram over states A to I, with observable events o1 and o2, unobservable event u, and actions a1 and a2.]
Figure 3: System example with two initial states (A and B), two goal states (A and G), one unstable state (B), two observable events (o1 and o2), and two actions (a1 and a2; an action affects the system state only if there is a transition).

Since the belief sample B0 is initially empty, Algorithm 1 first generates the empty plan π0 = ε. The procedure verify_applicability exhibits state F such that B →(u) D →(o1) E →(o2) F could explain O and such that plan π0 does not lead to a goal state when applied from F. The optimal plan for B1 = {F} is π1 = a1. This time verify_applicability extracts state H, which also belongs to the belief state and for which the application of a1 leads to sink state I. The belief sample B2 now equals {F, H} and the optimal conformant plan for B2 is π2 = a2, a1 (remember that the unobservable transition F →(u) H cannot trigger after the execution of a2). This plan is correct for all elements in the belief state. Notice that neither A nor D from BO were explicitly generated during the procedure.

4 SAT Formulation of Self-Healing

In this section we show how Algorithm 1 can be implemented using sat. This implementation assumes a symbolic representation of the model, i.e., a representation where states and transitions are not enumerated but are, instead, implicitly defined by a set V of Boolean state variables (aka fluents) as can be found, e.g., in a strips model.

4.1 Computing a Conformant Plan for B

The procedure we use to compute the optimal plan for a belief sample relies on a sat solver and follows the schematic representation of Figure 2. In planning by sat [Kautz and Selman, 1996], given a horizon k and a planning problem, a propositional formula Φ is defined that is satisfiable iff there exists a sequence of actions of length k that solves the planning problem (the value of k is initialized to 0 and incremented until Φ becomes satisfiable). Furthermore Φ is defined over k + 1 copies of the state variables (the state sat variables p_0 to p_k where p is a state variable) and k copies of the actions (the action sat variables a_0 to a_{k−1} where a is an action). Φ is defined such that a solution to the planning problem can be trivially extracted from the satisfying assignment (for instance, if a_i evaluates to true, then the ith action of the plan is a). If, for instance, action a sets state variable p to false, Φ will be defined such that for all i ∈ {1, . . . , k}

    Φ ≡ (a_{i−1} → ¬p_i) ∧ · · ·

We refer the reader to the literature on planning by sat for more details on this reduction.

Given a sample B of b states we create b copies of the state sat variables: p_i^1, . . . , p_i^b; the variables p_i^ℓ model the effects of applying the plan on the state q_ℓ ∈ B. We stick to a single set of action sat variables and each copy of the state sat variables is linked to this set.
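This duplication of the state variables over the sample can be illustrated with a toy clause generator. The string-based literal naming is ours, purely for illustration; a real implementation would emit integer literals for a sat solver such as glucose.

```python
from itertools import product

def effect_clauses(action, var, horizon, num_copies):
    """CNF clauses stating that the action at step i-1 sets the state
    variable var to false at step i, in every copy 1..num_copies of the
    state variables; a leading "-" marks a negated literal."""
    clauses = []
    for i, copy in product(range(1, horizon + 1), range(1, num_copies + 1)):
        # (a_{i-1} -> not p^copy_i)  is the clause  (-a_{i-1} or -p^copy_i)
        clauses.append([f"-{action}@{i - 1}", f"-{var}^{copy}@{i}"])
    return clauses
```

Note that the action literal is shared by all copies: a satisfying assignment therefore encodes a single action sequence executed in parallel on every sampled state, exactly as pictured by the vertical lines of Figure 2.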
The formula Φ presented in the example above will therefore now translate as

    Φ ≡ (a_{i−1} → ¬p_i^1) ∧ · · · ∧ (a_{i−1} → ¬p_i^b) ∧ · · ·

4.2 Verifying Correctness of a Plan

Like the plan generation, plan correctness is implemented in sat. This time it matches the representation of Figure 1.

A plan is proved incorrect if an explanation of the observations can be found in which the application of the plan leads to a non-goal final state (remember that all plans are applicable).

Once again a propositional formula is defined that is satisfiable iff such an explanation exists. This formula contains two parts: sat variables p_{i∈{0,...,n}} represent the state of the system in the active mode while variables p′_{i∈{0,...,k}} represent the state in the repair mode (it is assumed that the length of the explanation can be bounded by a known value n; k is the length of the plan being tested). The formula is the conjunction of the formulas:

• Φactive, a propositional formula that is satisfiable iff there exists an explanation of the observations (whose final state is represented by the variables p_n); this type of reduction is quite standard [Grastien and Anbulagan, 2013];

• Φ′repair, a propositional formula that is satisfiable iff there exists a state in which the proposed plan is not correct (this state is represented by the variables p′_0);

• ⋀_{p∈V} (p_n ↔ p′_0), where p ranges over the state variables: the formula that links the final state of the active phase and the initial state of the repair phase.

Intuitively, the assignments of the variables p_n that are consistent with Φactive are a symbolic representation of BO. Formally, let W be the set of variables that appear in Φactive; then ∃(W \ {p_n | p ∈ V}). Φactive is logically equivalent to the symbolic representation of BO. Similarly the variables p′_0 of Φ′repair represent B̄π.

As a consequence any other representation of BO or B̄π could be used if such representations are more convenient (e.g., if they are more compact or if they help the sat solver).

Difference Between the Two Reductions

The first reduction aims at finding a plan of length k that is applicable in b states. Therefore it includes b × k copies of the state variables and k copies of the action variables.

The second reduction aims at finding a plan composed of two parts: a trajectory in the active space and a trajectory in the repair space. Therefore it includes n + k copies of the state variables and n copies of the events (there could be k copies of the actions but the value of these variables is known in advance since the plan is an input of this reduction).

An interesting difference between the two reductions is that the trajectories of the former should lead to goal states while the trajectory of the latter should lead to a non-goal state. As a consequence, when the repair plan is finally computed, the conformant planning reduction to sat is satisfiable while the reduction of the applicability function is not.

5 Experiments

We ran an experimental evaluation of the approach presented in this paper.

Since the problem presented here is new, we had to build new benchmarks. We propose a variant of the benchmark presented by Grastien et al. [2007] which will be made available to the community. The system comprises 20 components interconnected in a torus shape. Each component contains eight states, including two unstable states and one goal state. The behaviour of each component can affect its neighbours, and the local observations alone do not allow anything to be determined about the local behaviour: the full system needs to be monitored in order to understand the system state. Repair actions can also be local or affect several components.

We built 100 problem instances on this system. We restricted ourselves to totally ordered observations, but notice that one of the benefits of using diagnostic techniques is to be able to handle partially-ordered observations (observations where the order of the observed events is only partially known because the delay between their reception is small compared to the transmission/processing delay).

We compare our approach to a symbolic approach that uses BDDs (specifically the buddy package) to track the belief state and then uses A* to find the optimal repair plan. The heuristic used by A* is implemented as follows: a state of the system is extracted from the BDD and the optimal repair is computed for this state using sat; the length of this optimal repair is used as a lower bound for the optimal repair from the current search node.

Our belief sample method uses glucose_static 4.0 [Audemard and Simon, 2009]. glucose is heavily based on the minisat solver [Eén and Sörensson, 2003].

The experiments were run on a 4-core 2.5GHz cpu with 4GB RAM, with GNU/Linux Mint 16 "petra". A ten minute (600s) timeout was provided.

[Figure 4 omitted: log-scale runtime plot, Time (s) from 1 to 1000, over the 100 problem instances for the BuDDy and Belief Sample implementations.]
Figure 4: Runtime in seconds required to solve self-healing problem instances; sorted.

The results are summarized in Figure 4. The instances are sorted in increasing runtime, meaning that the instance at position x for one implementation may be different from the instance at the same position for the other. The approach based on the generation of the belief state only saw 64 instances solved before timeout, against 83 for our approach.
In general our approach is two orders of magnitude faster than A*, although we would need more benchmarks and comparisons to better understand the strength of this approach.

Out of the 87 instances solved by the Belief Sample method, 82 could be solved by exhibiting only one element of the belief state. Another three instances could be solved with a sample of two elements, and two required a sample of three elements to generate a conformant plan.

6 Discussion

The objective of connecting the diagnostic and planning tasks is quite ambitious. From the diagnostic perspective, and since the seminal work of Sampath et al. [1995], the problem has generally been the detection of specific events or patterns of events [Jéron et al., 2006]. The main inspiration of the present work is the self-healability question asked by Cordier et al. [2007]; the aforementioned work is one of the first attempts to frame diagnosis as the problem of finding the optimal repair plan, although the complexity of computing the plan is not addressed. In static contexts similar questions have been asked where the problem was framed as finding the optimal balance between increasing the cost of gathering information (observations) and improving the precision of diagnosis (and, consequently, reducing the cost of planning) [Torta et al., 2008].

Supervisory control [Ramadge and Wonham, 1989] is a problem very similar to self-healing. The goal is to control some actions (forbid their occurrence) in order to meet some specification. The main difference with our work is the fact that control applies continuously, while we assume that self-healing is performed when the system is not active (either because the repair process is expensive, as it might require stopping the system, or because it can only be performed at some time, every night for instance). Furthermore control tries to be as unobtrusive as possible: it merely forbids some transitions and generally does not choose actions to perform.

Conformant planning [Smith and Weld, 1998] is the problem of finding a sequence of actions that is guaranteed to lead to the specified goal, despite uncertainty on the initial state and nondeterministic action effects. Solutions to conformant planning have been proposed that compute the belief state and run heuristic search [Bonet and Geffner, 2000] or that represent the belief state symbolically [Cimatti and Roveri, 2000]. Closer to our work, Hoffmann and Brafman [2006] proposed Conformant-FF, in which the belief state is represented implicitly by the set of initial states and the sequence of actions leading to the current state; at every time step, a sat solver is used to determine the state variable values that can be inferred with certainty. This approach is similar to ours in the way it avoids computing belief states. More generally, we would like to adapt our method to solve conformant planning problems.

The combination of planning and diagnosis has also been studied in the context of plan repair. There, a (possibly conformant) plan is computed that assumes that contingencies are unlikely to happen. The plan execution is then monitored and, if the outcome of execution does not match the predictions, a new plan is generated [Micalizio, 2014].

7 Conclusion and Extensions

In this paper we presented a method to solve the self-healing problem. The problem consists in finding a repair plan that can lead a system whose execution has been partially observed back to a goal state. We avoid computing the belief state. Instead we propose a method whereby plans are computed on a sample of the belief state whilst a diagnoser verifies their correctness and generates an element of the belief state (added to the sample) if the plan is not correct. Both the planning and the diagnosis problems are reduced to sat problems. We show that non-trivial problems can be easily solved by this approach.

There are many possible extensions to this work. One issue is that enforcing a conformant plan may be too restrictive. We want to avoid prohibitive repairs in situations where the system is healthy. This is a common problem in diagnosis of dynamic systems: the state of the system can never be precisely determined at the current time; it is often not inconceivable that a fault just happened on the system and has not had time to develop into a visible faulty trace. The issue here is that conformant plans must provide for such contingencies even when there is no evidence for them.

An implicit assumption of our work is that unhealthy system behaviours can be detected to a large extent. The set of unstable states serves this purpose: they are useful to model the fact that any "failure" in the system will lead to abnormal observations before a repair action is performed.

We see two avenues to handle situations where the instability feature cannot address the problem presented before. First, probabilities can be incorporated into the model, which allows for chance-constrained planning [Santana and Williams, 2014]. Issues with this approach include the problem of building large models with meaningful probabilities and the problem of extending the sat reduction to deal with probabilities (as well as scaling up to large models). A second, qualitative, possibility is to ignore contingencies that are supported by no strong evidence. For instance failures that are not part of a minimal diagnosis might be ignored.

Another restriction of the current approach is that the goal G is assumed to be known explicitly. Specification of goal states may however be more complex: Ciré and Botea [2008] have proposed to define goals as properties of states defined in linear temporal logic (LTL). Another relevant goal property is diagnosability [Sampath et al., 1995], i.e., the property that the observations on the system will allow to detect/identify the important system failures. A related issue is the incremental aspect: how to handle a repair after an active period following a first repair. A simple solution is to assume that the initial state after the repair is the goal state.

Acknowledgments

NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
A Problem Benchmark

We now present the system we used in the experiments (the benchmark is available at http://www.grastien.net/ban/data/bench-dx15.tar.gz).

The system includes 20 components c_{i,j} where i ranges between 0 and 3 and j between 0 and 4. The component c_{i,j} is connected to c_{i′,j′} iff the total difference |i − i′| + |j − j′| is at most one (where the differences in i and j are taken modulo 4 and 5, respectively). For instance, c_{0,1} is connected to four components: c_{0,0}, c_{0,2}, c_{3,1}, and c_{1,1}.

The model of one component for the active mode is given in Figure 5 and the model for the repair mode is given in Figure 6. The connections between components imply forced transitions when some events occur; these are summarised in Table 1. For instance, when event f occurs on component c_{0,1}, event nf occurs on every one of its four neighbours.

[Figure 5 omitted: transition diagram over states N0, N1, N2, F1, F2, R0, R1, R2 with events f, nf, reb, back, t, and z.]
Figure 5: Active model for one component (observable events are reb and back).

[Figure 6 omitted: transition diagram over the same eight states with actions t, s, and z.]
Figure 6: Repair model for one component (no transition means that the state is not affected by the action).

    event/action | neighbour event/action
    f            | nf
    t            | z

Table 1: Synchronised events

A component state contains two types of information: whether a failure occurred on the component and whether it is running. The first part of the state is initially N (no fault); it moves to F when a fault occurs and R when it recovers. The second part of the state is generally 0 (the component is running) and moves to 1 when it needs to reboot and to 2 when it is rebooting. A fault on a component forces its neighbours to reboot. One difficulty of diagnosis for this type of system is that the observations (reb and back) do not point precisely to the faulty component.

The repair consists in returning to state N0. Most states require action t to return to state N0, but this action can move the neighbours of the component to state N2. Therefore finding the optimal repair requires ordering the actions carefully.

References

[Audemard and Simon, 2009] G. Audemard and L. Simon. Predicting learnt clauses quality in modern SAT solvers. In 21st International Joint Conference on Artificial Intelligence (IJCAI-09), 2009.

[Bonet and Geffner, 2000] B. Bonet and H. Geffner. Planning with incomplete information as heuristic search in belief space. In Fifth International Conference on AI Planning and Scheduling (AIPS-00), pages 52–61, 2000.

[Cassandras and Lafortune, 1999] C. Cassandras and S. Lafortune. Introduction to discrete event systems. Kluwer Academic Publishers, 1999.

[Cimatti and Roveri, 2000] A. Cimatti and M. Roveri. Conformant planning via symbolic model checking. Journal of Artificial Intelligence Research (JAIR), 13:305–338, 2000.

[Ciré and Botea, 2008] A. Ciré and A. Botea. Learning in planning with temporally extended goals and uncontrollable events. In Eighteenth European Conference on Artificial Intelligence (ECAI-08), pages 578–582, 2008.

[Cordier et al., 2007] M.-O. Cordier, Y. Pencolé, L. Travé-Massuyès, and T. Vidal. Self-healability = diagnosability + repairability. In Eighteenth International Workshop on Principles of Diagnosis (DX-07), pages 251–258, 2007.

[Eén and Sörensson, 2003] N. Eén and N. Sörensson. An extensible SAT-solver. In Sixth Conference on Theory and Applications of Satisfiability Testing (SAT-03), pages 333–336, 2003.

[Grastien and Anbulagan, 2013] A. Grastien and A. Anbulagan. Diagnosis of discrete event systems using satisfiability algorithms: a theoretical and empirical study. IEEE Transactions on Automatic Control (TAC), 58(12):3070–3083, 2013.

[Grastien et al., 2007] A. Grastien, A. Anbulagan, J. Rintanen, and E. Kelareva. Diagnosis of discrete-event systems using satisfiability algorithms. In
22nd Conference on Artificial Intelligence (AAAI-07), pages 305–310, 2007.

[Hoffmann and Brafman, 2006] J. Hoffmann and R. Brafman. Conformant planning via heuristic forward search: a new approach. Artificial Intelligence (AIJ), 170:507–541, 2006.

[Jéron et al., 2006] T. Jéron, H. Marchand, S. Pinchinat, and M.-O. Cordier. Supervision patterns in discrete-event systems diagnosis. In Seventeenth International Workshop on Principles of Diagnosis (DX-06), pages 117–124, 2006.

[Kautz and Selman, 1996] H. Kautz and B. Selman. Pushing the envelope: planning, propositional logic, and stochastic search. In Thirteenth Conference on Artificial Intelligence (AAAI-96), pages 1194–1201, 1996.

[Lamperti and Zanella, 2003] G. Lamperti and M. Zanella. Diagnosis of active systems. Kluwer Academic Publishers, 2003.

[Micalizio, 2014] R. Micalizio. Plan repair driven by model-based agent diagnosis. Intelligenza Artificiale, 8(1):71–85, 2014.

[Ramadge and Wonham, 1989] P. Ramadge and W. Wonham. The control of discrete event systems. Proceedings of the IEEE: special issue on Dynamics of Discrete Event Systems, 77(1):81–98, 1989.

[Sampath et al., 1995] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control (TAC), 40(9):1555–1575, 1995.

[Santana and Williams, 2014] P. Santana and B. Williams. Chance-constrained consistency for probabilistic temporal plan networks. In 24th International Conference on Automated Planning and Scheduling (ICAPS-14), 2014.

[Smith and Weld, 1998] D. Smith and D. Weld. Conformant graphplan. In Fifteenth Conference on Artificial Intelligence (AAAI-98), pages 889–896, 1998.

[Torta et al., 2008] G. Torta, D. Theseider Dupré, and L. Anselma. Hypothesis discrimination with abstractions based on observation and action costs. In Nineteenth International Workshop on Principles of Diagnosis (DX-08), pages 189–196, 2008.
Implementing Troubleshooting with Batch Repair
Roni Stern¹, Meir Kalech¹, and Hilla Shinitzky¹
¹Ben Gurion University of the Negev
e-mail: roni.stern@gmail.com, kalech@bgu.ac.il, hillash@post.bgu.ac.il
Abstract

Recent work has raised the challenge of efficient automated troubleshooting in domains where repairing a set of components in a single repair action is cheaper than repairing each of them separately. This corresponds to cases where there is a non-negligible overhead to initiating a repair action and to testing the system after a repair action. In this work we propose several algorithms for choosing which batch of components to repair, so as to minimize the overall repair costs. Experimentally, we show the benefit of these algorithms over repairing components one at a time (and not as a batch).

1 Introduction

Troubleshooting algorithms, in general, plan a sequence of actions that are intended to fix an abnormally behaving system. Fixing a system includes repairing faulty components. Such repair actions incur a cost. These costs can be partitioned into two types of repair cost. The first, referred to as the component repair cost, is the cost of repairing a component. The second, referred to as the repair overhead, is the cost of preparing the system to perform repair actions (e.g., halting the system may be required), and the cost of testing the system after performing a repair action.

This paper considers the case where the repair overhead is not negligible and is potentially more expensive than the component repair cost of a single component. Therefore, it may be more efficient to repair a batch of components in a single repair action. We call the problem of choosing which batch of components to repair the Batch Repair Problem (BRP). BRP is an optimization problem, where the task is to minimize the total repair costs, which is the sum of the repair overheads and component repair costs incurred by all the repair actions performed until the system is fixed.

Note that in this paper we use the term "repair" for a single component or a set of components and the term "fix" to refer to the entire system. Thus, repairing components eventually causes the system to be fixed, and a system is only fixed if it has returned to its nominal behavior.

Most previous work assumed that components are repaired one at a time [1; 2; 3; 4]. This approach can be wasteful for BRP. For example, if a diagnosis engine infers that multiple faulty components need to be repaired to fix the system, then it would be wasteful to repair these components one at a time, since each repair action incurs its repair overhead. Instead, an efficient BRP algorithm would repair all the faulty components in a single repair action. More generally, we expect an intelligent BRP algorithm to weigh the cost of repairing batches of components as well as the repair overhead. Some discussion on repairing multiple components together was done in prior work on self-healability [5].

Due to the repair overhead, repairing a single component, even if it is the component most likely to be faulty, can be wasteful. This is especially wasteful in cases where all the found diagnoses consist of multiple faulty components, thus suggesting that repairing a single component would not fix the problem. Alternatively, one may choose to repair the components in the most likely diagnoses. This may also be wasteful, especially if there are several diagnoses which have similar likelihood. It might be worthwhile to repair by a single repair action a set of components that "covers" more than a single diagnosis. This may reduce the number of repair actions until the system is fixed, thus saving repair overhead costs. The downside in this approach is that the component repair costs can be high, as more healthy components may be repaired.

For example, consider the small system described in Figure 1. It is a logical circuit whose output is faulty. Assume that the "OR" gate is known to be healthy and there are only two possible diagnoses: either A is faulty or B is faulty, where the probabilities that A and B are faulty are 0.6 and 0.4, respectively. There are three possible repair actions: to repair A, to repair B, and to repair A and B. Assume the repair overhead costs 10, and repairing a component costs 1. If A is repaired, there is a 0.4 chance that the system would not be fixed and another repair action would be needed (repairing B). Thus, the expected total repair cost of repairing A first is 15.4. Similarly, the total repair cost for repairing B first is 17.6. The best option is thus to repair A and B together in a single repair action, incurring a total repair cost of 12.

[Figure 1 omitted: a two-gate circuit with inputs in1=1 and in2=1 feeding components A (p({A})=0.6) and B (p({B})=0.4), and faulty output out1=1.]
Figure 1: An example where repairing components one at a time is wasteful.

Recent work [6] proposed two high-level approaches to solve BRP: as a planning under uncertainty problem, or as a combinatorial optimization problem. When modeling BRP as a planning under uncertainty problem, the task is to find a repair policy, mapping a state of the system to the repair action that minimizes the expected total repair costs. This approach, while attractive theoretically, quickly becomes infeasible in non-trivial scenarios.
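The expected-cost arithmetic of this example is easy to check in Python (the function and variable names are ours; overhead 10 and component cost 1 as in the example):

```python
def expected_cost_repair_first(p_other, overhead=10.0, comp_cost=1.0):
    """Expected total repair cost when one suspect component is repaired
    first: one repair action now and, with probability p_other (the other
    diagnosis being the true one), a second repair action afterwards."""
    action_cost = overhead + comp_cost
    return action_cost + p_other * action_cost

cost_a_first = expected_cost_repair_first(0.4)  # 11 + 0.4 * 11 = 15.4
cost_b_first = expected_cost_repair_first(0.6)  # 11 + 0.6 * 11 = 17.6
cost_batch = 10.0 + 2 * 1.0                     # one overhead, two components: 12
```

Batching both suspects trades one extra component repair for a guaranteed saving of the second overhead, which is why the batch action wins here.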
repair policy, mapping a state of the system to the repair action that minimizes the expected total repair costs. This approach, while attractive theoretically, quickly becomes infeasible in non-trivial scenarios.

In this work we focus on the second high-level approach proposed for BRP, in which BRP is modeled as a combinatorial optimization problem, searching the combinatorial space of possible repair actions for the best repair action. There are two challenges in implementing this approach: first, how to measure the quality of a repair action, and second, how to efficiently search for the repair action that maximizes this measure. There are many efficient heuristic search algorithms in the literature, and thus the main challenge addressed in this work is in proposing several heuristics for estimating the merit of a repair action.

The contributions of this work are practical. A range of heuristic objective functions are proposed and analyzed, and we evaluate their effectiveness experimentally on a standard benchmark. A clear observation from the results is that considering batch repair actions can indeed save repair cost significantly. Moreover, the most effective heuristics provide a tunable tradeoff between computation time and resulting repair costs.

2 Problem Definition

A classical MBD input ⟨SD, COMPS, OBS⟩ is assumed, where SD is a model of the system, COMPS represents the components in the system, and OBS is the observed behavior of the system. Every component can be either normal or abnormal. The assumption that a component c ∈ COMPS is abnormal is represented by the abnormal predicate AB(c).

A batch repair problem (BRP) arises when the assumption that all components are normal is not consistent with the system description and observations. Formally,

    SD ∧ OBS ∧ ⋀_{c ∈ COMPS} ¬AB(c)  is not consistent

In such a case, at least one component must be repaired.

Definition 1 (Repair Action). A repair action can be applied to any subset of components and results in these components becoming normal. Applying a repair action to a set of components γ is denoted by Repair(γ).

Definition 1 assumes that repair actions always succeed, i.e., a component is normal after it is repaired.

After a repair action, the system is tested to check if it has been fixed. We assume that the system inputs in this test are the same as in the original observations (OBS). The observed system outputs are then compared to the expected system outputs of a healthy system. Thus, the result of a repair action is either that the system is fixed, or a new observation that may help in choosing future repair actions.

Repairing a set of components incurs a cost, composed of a repair overhead and component repair costs. The repair overhead is denoted by cost_repair, and the component repair cost of a component c ∈ COMPS is denoted by cost_c.

Definition 2 (Repair Costs). Given a set of components γ ⊆ COMPS, applying a repair action Repair(γ) incurs a cost:

    cost(Repair(γ)) = cost_repair + Σ_{c ∈ γ} cost_c

We assume that all repair costs are positive and non-zero, i.e., cost_repair > 0 and cost_c > 0 for every component c ∈ COMPS. As defined earlier, the task in BRP is to fix a system with minimum total repair cost.

As shown in Figure 1, an efficient BRP solver should consider the possibility of repairing a set of components in a single repair action. Thus, the potential number of repair actions is 2^|COMPS|. Therefore, from a complexity point of view, BRP is an extremely hard problem.

3 Preliminaries

Next, we provide background and definitions required for describing the BRP algorithms we propose.

SD describes the behavior of the diagnosed system, and in particular the behavior of each component. The term behavior mode of a component refers to a state of the component that affects its behavior. SD describes for every component one or more behavior modes. For every component, at least one of the behavior modes must represent the nominal behavior of the component.

A mode assignment ω is an assignment of behavior modes to components. Let ω⁽⁺⁾ be the set of components assigned a nominal (i.e., normal) behavior mode and ω⁽⁻⁾ be the set of components assigned one of the other modes.

Definition 3 (Diagnosis). A mode assignment ω is called a diagnosis if ω ∧ OBS ∧ SD is satisfiable.

A model-based diagnosis engine (MBDE) accepts as input SD, OBS, and COMPS and outputs a set of diagnoses Ω. Although a diagnosis is consistent with SD and OBS, it may be incorrect. A diagnosis ω is correct if repairing the set of components in ω⁽⁻⁾ fixes the system. Some diagnosis algorithms return, in addition to Ω, a measure of the likelihood that each diagnosis is correct [7; 8]. Let p : Ω → [0, 1] denote this likelihood measure. We assume that p(ω) is normalized so that Σ_{ω ∈ Ω} p(ω) = 1, and use it to approximate the probability that ω is correct.

A common way to estimate the likelihood of diagnoses assumes that each component has a prior on the likelihood that it would fail and that component failures are independent. Therefore, if p(c) represents the likelihood that a component c would fail, then the diagnosis likelihood can be computed as

    p(ω) = ( Π_{c ∈ ω⁽⁻⁾} p(c) ) / ( Σ_{ω′ ∈ Ω} Π_{c ∈ ω′⁽⁻⁾} p(c) )    (1)

where the denominator is a normalizing factor. We assume in the rest of this paper that diagnosis likelihoods are computed according to Equation 1. Other methods for computing the likelihood of diagnoses also exist [9].

3.1 System Repair Likelihood

If the MBDE returns a single diagnosis ω that is guaranteed to be correct, then the optimal solution to BRP would be to perform a single repair action: Repair(ω⁽⁻⁾). This, however, is rarely the case, and more often a possibly very large set of diagnoses is returned by diagnosis algorithms. This introduces uncertainty as to whether a repair action would actually fix the system. We define this uncertainty as follows:

Definition 4 (System Repair Likelihood). The System Repair Likelihood of a set of components γ ⊆ COMPS, denoted SystemRepair(γ), is the probability that Repair(γ) would fix the system.
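The normalized likelihood computation of Equation 1 can be sketched in Python as follows. All names are illustrative, and the component priors below are hypothetical values chosen so that the result matches the 0.6/0.4 likelihoods of the Figure 1 example.

```python
# Sketch of Equation 1: diagnosis likelihoods from independent
# component fault priors, normalized over the diagnosis set Omega.
# Function names and prior values are illustrative, not from the paper.

def diagnosis_likelihoods(diagnoses, prior):
    """diagnoses: iterable of frozensets of faulty components (the
    faulty set of each omega); prior: dict component -> p(c).
    Returns {omega: p(omega)} normalized so the values sum to 1."""
    weight = {}
    for omega in diagnoses:
        w = 1.0
        for c in omega:
            w *= prior[c]  # independence assumption: multiply priors
        weight[omega] = w
    z = sum(weight.values())  # normalizing denominator of Equation 1
    return {omega: w / z for omega, w in weight.items()}

# Two single-fault diagnoses, as in the Figure 1 example:
omega_a, omega_b = frozenset({"A"}), frozenset({"B"})
p = diagnosis_likelihoods([omega_a, omega_b], {"A": 0.3, "B": 0.2})
# p[omega_a] = 0.3 / (0.3 + 0.2) = 0.6, and p[omega_b] = 0.4
```

With these hypothetical priors the normalization yields exactly the p({A}) = 0.6 and p({B}) = 0.4 used in the running example.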
Consider the relation between p(ω) and SystemRepair(ω). If ω is correct, then repairing all components that are faulty, meaning ω⁽⁻⁾, would fix the system. Therefore, the likelihood that repairing ω⁽⁻⁾ causes the system to be fixed is at least p(ω), i.e.,

    SystemRepair(ω⁽⁻⁾) ≥ p(ω)

Moreover, if ω is correct then repairing any superset of ω⁽⁻⁾ would also fix the system. Thus, SystemRepair(ω⁽⁻⁾) may be larger than p(ω). On the other hand, repairing any set of components that is not a superset of ω⁽⁻⁾ would not fix the system, as there would still be faulty components in the system. Therefore, a repair action Repair(COMPS′) would fix the system if and only if ω*⁽⁻⁾ ⊆ COMPS′, where ω* is the correct diagnosis. While we do not know ω*, we can compute SystemRepair(γ) from Ω and p(·):

    SystemRepair(γ) = Σ_{ω ∈ Ω ∧ ω⁽⁻⁾ ⊆ γ} p(ω)

For example, in the logical circuit depicted in Figure 1, there are two diagnoses, {A} and {B}, such that p({A}) = 0.6 and p({B}) = 0.4. Thus, SystemRepair({A}) = 0.6, SystemRepair({B}) = 0.4, and SystemRepair({A, B}) = p({A}) + p({B}) = 1.

4 BRP as a Combinatorial Search Problem

As mentioned in the introduction, the approach for solving BRP that we pursue in this paper formulates BRP as a combinatorial search problem. The search space is the space of possible repair actions, i.e., every subset of the set of components that were not repaired yet. The search problem is to find the repair action that maximizes a utility evaluation function u(·) that maps a repair action to a real value that estimates its merit.

The effectiveness of this search-based approach for BRP depends on the search algorithm used and on how the u(·) utility function is defined. There are many existing heuristic search algorithms for searching large combinatorial search spaces [10; 11]. Thus, in this work we propose and evaluate a set of possible utility functions. Note that for some of the utility functions described next it is possible to find the best repair action without searching the entire search space of possible actions, while others are more computationally intensive.

4.1 k Highest Probability

A key source of information for all the utility functions described below is the set of diagnoses Ω and their likelihoods (p(·)). We assume that this information is obtained by using a diagnosis engine over the observations of the current state of the system. The set of returned diagnoses may be very large. The first utility function we propose is based on the system's health state, which has been recently proposed as a method for aggregating information from a set of diagnoses [12].

Definition 5 (Health State). A health state is a mapping F : COMPS → [0, 1] where

    F(c) = Σ_{ω ∈ Ω s.t. c ∈ ω} p(ω)

F(c) is an estimate of the likelihood that component c is faulty given a set of diagnoses Ω and their likelihoods. Based on the system's health state, we propose the following utility function, denoted u_HP:

    u_HP(γ) = Σ_{c ∈ γ} F(c)

where γ is any subset of COMPS that has not been repaired yet.

The repair action that maximizes u_HP is trivial: repair all components. This would result in the system being repaired, but, of course, may repair many components that are likely to be healthy. To mitigate this effect, we propose the k highest probability repair algorithm (k-HP), which limits the number of components that can be repaired in a single repair action to k, where k is a user-defined parameter. Note that computing k-HP does not need any exhaustive search: simply sort the health state in descending order of F(·) values and repair the first k components.

The k-HP repair algorithm has two clear disadvantages. First, the user needs to define k. Second, k-HP does not consider repair costs (neither component repair costs nor overhead costs). The next set of utility functions and corresponding repair algorithms address these disadvantages.

4.2 Wasted Costs Utilities

Before describing the next set of proposed utility functions, we explain the over-arching reasoning behind them. Repairing a system requires performing repair actions. Some repair costs are inevitable. These are the repair overhead of a single repair action, and the component repair costs that repair the faulty components. We propose a family of utility functions that try to estimate the expected total repair costs beyond these inevitable costs. We refer to these costs as wasted costs and to utility functions of this family as wasted cost functions.

We model these wasted costs as being composed of two parts.

• False positive costs (cost_FP). These are the costs incurred by repairing components that are not really faulty.

• False negative costs (cost_FN). These are the overhead costs incurred by future repair actions.

It is clear why the false positive costs are wasted costs: these are repair costs incurred on repairing healthy components. The false negative costs are wasted costs because if one knew upfront which components are faulty, then the optimal repair algorithm would repair all these components in a single batch repair action, incurring no further overhead costs. Thus, future overhead costs represent wasted costs.

We borrow the terminology of false positive and false negative from the machine learning literature, but use it in a somewhat different manner. To explain this choice of terminology, assume that positive and negative mean faulty and healthy components, respectively. Choosing to repair a faulty component is regarded as a true positive, and not repairing a healthy component is regarded as a true negative. Thus, the wasted costs incurred by repairing healthy components are costs incurred due to false positives, and the wasted costs incurred by not repairing a faulty component are costs incurred due to false negatives. While this is not a perfect match in terminology, we believe that it helps clarify the underlying intention of cost_FP and cost_FN.
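To make the health state (Definition 5), SystemRepair, and the k-HP selection rule concrete, here is a minimal Python sketch. The diagnoses, likelihoods, and component names are illustrative, taken from the running two-diagnosis example.

```python
# Sketch of the health state (Definition 5), SystemRepair, and k-HP.
# Diagnoses are frozensets of faulty components with normalized
# likelihoods p; all concrete values below are illustrative.

def health_state(diagnoses, p, components):
    """F(c): summed likelihood of the diagnoses that contain c."""
    return {c: sum(p[w] for w in diagnoses if c in w) for c in components}

def system_repair(diagnoses, p, gamma):
    """Probability that Repair(gamma) fixes the system: the total
    likelihood of diagnoses whose faulty components all lie in gamma."""
    return sum(p[w] for w in diagnoses if w <= set(gamma))

def k_hp(diagnoses, p, components, k):
    """k-HP rule: no search, just take the k components with highest F."""
    f = health_state(diagnoses, p, components)
    return set(sorted(components, key=lambda c: f[c], reverse=True)[:k])

# The running example: diagnoses {A} and {B} with p = 0.6 and 0.4.
diags = [frozenset({"A"}), frozenset({"B"})]
p = {diags[0]: 0.6, diags[1]: 0.4}
f = health_state(diags, p, ["A", "B"])  # F(A) = 0.6, F(B) = 0.4
```

On this example, system_repair(diags, p, {"A", "B"}) gives 1.0, matching SystemRepair({A, B}) = 1 in the text, and k_hp with k = 1 selects component A.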
The Wasted Cost Utility Function

For a given set of components γ, we denote by cost_FP(γ) and cost_FN(γ) the false positive costs and false negative costs, respectively, incurred by performing a batch repair action of repairing all the components in γ. Given cost_FP(γ) and cost_FN(γ), we propose the following general formula for computing the expected wasted costs, denoted by C_WC:

    C_WC(γ) = cost_FP(γ) + (1 − SystemRepair(γ)) · cost_FN(γ)

The first term of the formula is the false positive costs. The second term is the false negative costs, multiplied by the probability that the system will not be fixed by repairing the components in γ. Thus, the formula gives the total expected wasted costs. We define u_WC = −C_WC as the wasted cost utility function.

The wasted cost utility function is a theoretical utility function, since one does not know upfront the values of cost_FP and cost_FN. Next, we propose several ways to estimate u_WC by proposing ways to estimate cost_FP and cost_FN.

Estimating the False Positives Cost

We propose to estimate the false positive costs by considering the system's health state (Definition 5), as follows:

    ĉost_FP(γ) = Σ_{c ∈ γ} (1 − F(c)) · cost_c

This estimate of the false positive costs can be understood as an expectation over the false positive costs. The cost of a repaired component c ∈ γ is part of the false positive costs only if c is in fact healthy. The probability of this occurring is (1 − F(c)). Thus, (1 − F(c)) · cost_c is the expected false positive cost due to repairing component c.

False Negatives Cost

Correctly estimating cost_FN is more problematic than cost_FP, as it requires considering the future actions of the repair algorithm. In the best case, only one additional repair action would be needed. This would incur a single additional overhead cost. We call this the optimistic cost_FN, or simply cost^o_FN, which is equal to cost_repair. The other extreme assumes that every component not repaired so far would be repaired by a single repair action, with a correspondingly incurred overhead cost. We experimented with a slightly less extreme estimate, in which we assume that only faulty components will be repaired in the future, but each will be repaired in a single repair action, incurring one cost_repair per faulty component. Since we do not know the number of faulty components, we use the expected number of faulty components according to the health state: Σ_{c ∉ γ} F(c). The resulting estimate, referred to as the pessimistic estimate of cost_FN and denoted by cost^p_FN, is thus computed as:

    cost^p_FN(γ) = cost_repair · Σ_{c ∉ γ} F(c)

Summarizing all the above, we propose two utility functions from the wasted cost utility function family: a pessimistic wasted cost function, that uses ĉost_FP and cost^p_FN to estimate cost_FP and cost_FN, and an optimistic wasted cost function that uses ĉost_FP and cost^o_FN. The corresponding repair algorithms search in the combinatorial space of all possible sets of components to find the set of components that maximizes u_WC.

4.3 Handling the Computational Complexity

The search space is very large: the size of the power set of all components that were not repaired so far. We explored two simple ways to handle this. The first approach is to only consider subsets of components with up to k components, where k is a parameter. This approach is referred to as Powerset-based search.

The second approach we considered is to consider only supersets of the diagnoses in Ω. The intuitive reasoning is that at least one of these diagnoses is supposed to be true (according to the known observation), and thus a repair algorithm should aim to fix the problem in the next repair action. Thus, in this approach, we considered in the search for the best repair action every set of components that is a union of at most k diagnoses, where k is a parameter. This approach is referred to as the Union-based search.

For both powerset-based search and union-based search, increasing k results in a larger search space. This means higher computational complexity, but also increases the range of repair actions considered, and thus using higher k can potentially find better repair actions than using lower k values. This provides an often desired tradeoff of computation vs. solution quality. Experimentally, we observed that the union-based search approach yields much better results, and thus we only show results for it in the experimental results below.

5 Experimental Results

We evaluated the proposed batch selection algorithms on two standard Boolean circuits: 74283 and 74182. We experimented on 21 observations for system 74283 and 23 observations for system 74182. These observations were selected randomly from Feldman et al.'s [13] set of observations. For each observation, all subset-minimal diagnoses were found using exhaustive search.

5.1 Baseline Repair Algorithms

The main hypothesis of this line of work is that performing a batch repair action can save repair costs. To evaluate if the proposed batch repair algorithms are able to do so, we compare them with two repair algorithms that do not consider batch repair actions. These baseline repair algorithms, named "Best Diagnosis" (BD) and "Highest Probability" (HP), are inspired by previous work on test planning [14] and work as follows. BD chooses to repair a single component from the most preferred diagnosis in Ω (that with the highest p(·) value). From the set of components in the most probable diagnosis, BD chooses to repair the one with the lowest repair cost. The HP repair algorithm chooses to repair the component that is most likely to be faulty, as computed by the system's health state (F(·)).

Another repair algorithm we evaluated experimentally, which also serves as a baseline, is to repair all components of the most likely diagnosis in a single batch repair action. Note that this algorithm, denoted Batch Best Diagnosis, ignores repair costs, and serves as an extreme alternative to the BD algorithm, which repairs a single component from the most likely diagnosis.

Table 1 shows the average repair costs incurred until the system was fixed for the proposed repair algorithms. The average was over all the observations we used for system 74182. The rows labeled BD, HP, 2-HP, and 3-HP show the
                 Overhead cost
    Algorithm    10     15     20     25
    BD & HP      83.5   111.3  139.1  167.0
    2-HP         61.5   77.8   94.1   110.4
    3-HP         53.0   65.0   77.0   88.9
    Opt.(1)      55.2   68.9   82.6   96.3
    Opt.(2)      53.0   65.0   75.2   86.7
    Opt.(3)      55.2   66.5   72.6   83.7
    Pes.(1)      55.0   68.9   81.3   96.1
    Pes.(2)      52.8   59.8   63.7   70.0
    Pes.(3)      49.6   50.4   55.9   64.6

Table 1: Average repair costs for the 74182 system.

                 Overhead cost
    Algorithm    10     15     20     25
    BD           116.4  155.2  194.0  232.9
    HP           109.3  145.7  182.1  218.6
    2-HP         81.2   102.1  123.1  144.0
    3-HP         70.5   85.7   101.0  116.2
    Opt.(1)      76.0   95.7   115.2  134.8
    Opt.(2)      72.9   89.8   102.4  111.7
    Pes.(1)      75.2   95.7   114.0  134.8
    Pes.(2)      72.4   84.8   93.6   96.0

Table 2: Average repair costs for the 74283 system.

results for the BD, HP, and k-HP repair algorithms (for k = 2 and 3). The rows Opt.(1), Opt.(2), and Opt.(3) show the results for the union-based search repair algorithm using the wasted cost utility function with ĉost_FP to estimate cost_FP and cost^o_FN to estimate cost_FN. The rows Pes.(1), Pes.(2), and Pes.(3) show results for the same configuration, except for using cost^p_FN to estimate cost_FN instead of cost^o_FN.

The repair cost of a single component was arbitrarily set to 5, and the repair overhead (cost_repair) was varied (10, 15, 20, 25). Each column represents results for a different value of cost_repair. In this domain, the results of HP and BD were virtually the same, and thus we grouped them into a single row.

The results clearly show the benefit of considering batch repair actions. The best performing repair algorithm is Pes.(3), which, on average, required less than half of the repair costs needed by BD and HP, which do not consider batch repair. This supports the main hypothesis of this paper: batch repair actions can save a significant amount of repair costs. As expected, the gain of batch repair actions increases as the repair overhead (cost_repair) increases. Also note that for Pes.(k) we observe the desired trend of increasing k resulting in lower repair costs. This is also observed for the k-HP repair algorithm (note that the HP algorithm is in fact 1-HP), but is not always the case for Opt.(k), where for lower overhead costs k = 2 yielded lower repair costs than k = 3. This suggests that the optimistic estimate of cost_FN is not robust. Computationally, increasing k required much more runtime, and we could not run experiments with k = 4 on our current machines in reasonable time. Table 2 shows the results for the 74283 system. The trends observed are the same as those discussed above for the results of the 74182 system.

6 Related Work

BRP is a troubleshooting problem, where the goal is to perform repair actions so as to fix a system. Algorithms for automated troubleshooting were proposed in previous works. Heckerman et al. [1] proposed the decision-theoretic troubleshooting (DTT) algorithm, which uses a decision-theoretic approach for deciding which components to observe in order to identify the faulty component. Later work also applied a decision-theoretic approach that integrated planning and diagnosis to a real-world troubleshooting application [3; 15]. Torta et al. [4] proposed using model abstractions for troubleshooting while taking into account the cost of repair actions. None of these works considered the possibility of repairing a set of components together, allowing only repair actions that repair a single component at a time.

Our current work on BRP does not consider applying further diagnostic actions such as probing and testing, which are considered by previous troubleshooting algorithms. Thus, our work on BRP could be integrated in previous troubleshooting frameworks so as to consider both batch repair actions and diagnostic actions. This is left to future work.

Friedrich and Nejdl [2] discussed the relation between diagnoses and repair, in an effort to minimize the breakdown costs. Breakdown costs roughly correspond to a penalty incurred for every faulty output in the system, for every time step until the system is fixed. In BRP, the goal is to minimize costs until the system is fixed, and there is no partial credit for repairing only some of the system outputs.

7 Conclusion and Future Work

We addressed the problem of troubleshooting with the possibility of performing a batch repair action: a repair action in which more than a single component is repaired. Batch repair makes sense only if repairing a set of components in a single repair action is cheaper than repairing each of them separately. We proposed several algorithms for selecting which batch of components to repair. Experimental results clearly show the benefit of batch repair over single repair actions, and the benefit of the algorithms we suggested for choosing these sets of components to repair. Future work will investigate when batch repair should be considered, and how to detect such cases upfront. Additionally, expanding beyond Boolean circuits is also needed, as well as addressing uncertainty in the outcome of repair actions.

References

[1] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–57, 1995.

[2] Gerhard Friedrich and Wolfgang Nejdl. Choosing observations and actions in model-based diagnosis/repair systems. In KR, pages 489–498, 1992.

[3] Anna Pernestål, Mattias Nyberg, and Håkan Warnquist. Modeling and inference for troubleshooting with interventions applied to a heavy truck auxiliary braking system. Engineering Applications of Artificial Intelligence, 25(4):705–719, June 2012.

[4] Gianluca Torta, Luca Anselma, and Daniele Theseider Dupré. Exploiting abstractions in cost-sensitive abductive problem solving with observations and actions. AI Communications, 27(3):245–262, 2014.

[5] Marie-Odile Cordier, Yannick Pencolé, Louise Travé-Massuyès, and Thierry Vidal. Self-healability = diag-
nosability + repairability. In the International Workshop on Principles of Diagnosis (DX), pages 251–258, 2007.

[6] Roni Stern and Meir Kalech. Repair planning with batch repair. In the International Workshop on Principles of Diagnosis (DX), 2014.

[7] Brian C. Williams and Robert J. Ragno. Conflict-directed A* and its role in model-based embedded systems. Discrete Applied Mathematics, 155(12):1562–1595, 2007.

[8] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Simultaneous debugging of software faults. Journal of Systems and Software, 84(4):573–586, 2011.

[9] O. J. Mengshoel, M. Chavira, K. Cascio, S. Poll, A. Darwiche, and S. Uckun. Probabilistic model-based diagnosis: An electrical power system case study. IEEE Transactions on Systems, Man and Cybernetics, Part A: Systems and Humans, 40(5):874–885, 2010.

[10] Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach (3rd international ed.). Pearson Education, 2010.

[11] Stefan Edelkamp and Stefan Schroedl. Heuristic Search: Theory and Applications. Elsevier, 2011.

[12] Roni Stern, Meir Kalech, Shelly Rogov, and Alexander Feldman. How many diagnoses do we need? In AAAI, 2015.

[13] Alexander Feldman, Gregory Provan, and Arjan van Gemund. Approximate model-based diagnosis using greedy stochastic search. Journal of Artificial Intelligence Research (JAIR), 38:371, 2010.

[14] Tom Zamir, Roni Stern, and Meir Kalech. Using model-based diagnosis to improve software testing. In AAAI (to appear), 2014.

[15] Håkan Warnquist, Jonas Kvarnström, and Patrick Doherty. Planning as heuristic search for incremental fault diagnosis and repair. In the Scheduling and Planning Applications Workshop (SPARK) at the International Conference on Automated Planning and Scheduling (ICAPS), 2009.
Formulating Event-Based Critical Observations in Diagnostic Problems

Cody James Christopher¹,² and Alban Grastien²,¹
¹ Artificial Intelligence Group, The Australian National University.
² Optimisation Research Group, NICTA*

Abstract

We claim that in scenarios involving a human operator with responsibility over systems being monitored by a diagnoser, presenting said operator with a concise set of observations capturing the essence of a failure improves the operator's understanding of the diagnosis.

We take this in the context of Discrete Event Systems and demonstrate how the idea can be applied to systems utilising event-based observations, which can contain implicit information. We introduce the notion of an abstracted event stream, called a sub-observation, that makes the implicit information explicit for the operator and allows a diagnoser to arrive at the same diagnosis. We call the most abstract of these the critical observation. We provide relevant definitions, properties, and a procedure for computing the critical observation in a diagnosis problem.

1 Introduction

Diagnosis problems are concerned with the detection and identification of occurrences of specific events in a system, generally called faults or failures. These occurrences are difficult to detect as the fault events are typically not directly observable; however, they can be inferred from the system model (a description of the system behaviour) and the observations produced by the system.

Diagnosis is the first step in the fault recovery process. Once a fault has been detected and identified, the appropriate actions can be taken to mitigate its effects. The issue, however, is that this procedure acts as a black box; given a model and a sequence of observations, a diagnoser asserts a fault by claiming that there is no possible nominal execution of the system that would produce the observation sequence.

The present work is written under the assumption that a diagnosis procedure is fundamentally built for a human operator in charge of taking actions after a fault is identified. In this scenario, a black-box approach does not allow for the presentation of the information relevant to the diagnosis. We assume that providing the operators with explanatory evidence is useful in convincing them of the validity of the diagnosis, in addition to providing information as to the causes of the fault.

Further, we assume that a more concise explanation is strictly preferred to a more verbose explanation, and consequently that there is merit to isolating the "smallest" amount of supporting evidence, or what we call the critical observations. In cognitive psychology, the seminal paper on the topic of working memory in humans supports this view, giving the average working memory capacity as 7 ± 2 distinct pieces of information [1]. Providing only the observations critical to the diagnosis also has the additional benefit of ameliorating privacy concerns in systems where privacy is considered important.

We extend the results of Christopher et al. [2] to event-based observations. We first present preliminary theory and notation, before going on to show that event-based observations contain implicit information. We then introduce what we call sub-observations, which can capture this implicit information and make it available for use in diagnosis procedures. We then provide formal definitions of sufficiency and criticality, in addition to several important properties that allow for a terminating algorithm. We present an algorithm for computing the critical observation and discuss its complexity. A discussion of alternate ways of defining sub-observations precedes a brief discussion of related work and a conclusion.

2 Preliminaries and Notations

The present work takes place in the context and standard framework of discrete event systems (DES) [3]. We denote by Σ the set of events that can take place on the system. A system run is a finite sequence of events, w = e1 e2 … ek, and the system is modeled as the prefix-closed language LM ⊆ Σ⋆ that represents all possible runs.

The set of events is partitioned into observable events Σo (events that are recorded) and unobservable events Σu (those that are not). The observation o generated by run w = e1 e2 … ek, hereafter called the trace of w, is the projection of w on the set of observable events (i.e., all unobservable events of the run are deleted):

    o = PΣo(w) =  ε                    if k = 0
                  e1 PΣo(e2 … ek)      if k > 0 and e1 ∈ Σo
                  PΣo(e2 … ek)         otherwise.

The observed language of a trace o, denoted Lo, is the set of finite sequences of events that could produce the observed sequence: Lo = PΣo⁻¹(o) = {w ∈ Σ⋆ | PΣo(w) = o}.

* NICTA is funded by the Australian Government through the Department of Communications and the Australian Research Council through the ICT Centre of Excellence Program.
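The projection PΣo admits a direct implementation. The following sketch is illustrative; the event names and the observable/unobservable split are hypothetical.

```python
# Sketch of the projection P_{Sigma_o}: the trace of a run keeps only
# the observable events, in order (an iterative equivalent of the
# recursive definition above). Event names are illustrative.

def project(run, observable):
    """Return the trace of a run: unobservable events are deleted."""
    return [e for e in run if e in observable]

sigma_o = {"a", "b", "c"}            # observable events
run = ["a", "f1", "a", "u1", "b"]    # f1 and u1 are unobservable
trace = project(run, sigma_o)
# trace == ["a", "a", "b"]
```

Note that the inverse projection, and hence the observed language Lo, is infinite in general (arbitrarily many unobservable events may be interleaved), so Lo is represented symbolically rather than enumerated.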
The set of unobservable events includes a subset of fault events, Σf ⊆ Σu. With slight abuse of notation we write f ∈ w as short for w ∈ Σ⋆ f Σ⋆ (or "f appears in w") and F ∩ w as short for {f ∈ F | f ∈ w} (or "the subset of events from F that appear in w").

A set δ ⊆ Σf of faults is consistent with the model LM and the trace o if there exists a run w ∈ LM that would produce this trace (PΣo(w) = o) and that exhibits exactly these faults (w ∩ Σf = δ). The diagnosis of trace o, denoted ∆(o), is the collection of all consistent sets of faults:

$$\Delta(o) = \left\{ \delta \subseteq \Sigma_f \;\middle|\; \exists w \in L_M.\; P_{\Sigma_o}(w) = o \wedge \delta = w \cap \Sigma_f \right\} \qquad (1)$$

[Figure 1: Example DES — an NFA with states 1–8, transitions labelled a, b, c, f1, f2, and state annotations such as {b, c, d}, {b, e}, {c, d}, {a, d}, and {b, c, d, e}.]

Hereafter we use the hat notation (ˆ) to indicate that the given symbol represents what actually occurred. Given a run ŵ, δ̂ = ŵ ∩ Σf is the set of faults that occurred during the run; then the following result is trivial: ŵ ∈ LM ⇒ δ̂ ∈ ∆(PΣo(ŵ)). (The premise, completeness of the model, is assumed.)

We find it more convenient to define the diagnosis in terms of emptiness of languages. Let Lδ be the language that represents all sequences that contain exactly δ:

$$L_\delta = \{ w \in \Sigma^\star \mid w \cap \Sigma_f = \delta \} = \bigcap_{f \in \delta} \Sigma^\star f \Sigma^\star \;\cap \bigcap_{f \in \Sigma_f \setminus \delta} (\Sigma \setminus \{f\})^\star$$

That is, Lδ represents the set of all runs containing all of the faults of δ, intersected with all possible runs where the faults not in δ never occur—the result is the set of all runs where the only faults that occur are those in δ. With Lδ defined, we can equivalently express the diagnosis as an emptiness-of-languages problem:

$$\delta \in \Delta(o) \iff L_M \cap L_o \cap L_\delta \neq \emptyset. \qquad (2)$$

3 Sub-Observations

We first discuss event-based observations, and in particular that event-based observations contain implicit information that must be taken into consideration when performing diagnosis. We then introduce the notion of sub-observations, providing formal definitions and an explanatory example. Once this has been established, a procedure is given for diagnosing with sub-observations.

3.1 Event-Based Diagnosis and Implicit Information

Event-based diagnosis, contrasted with state-based diagnosis, comes with a subtlety; specifically, there is a type of implicit information encoded in the trace. Take for example the repeated observation of a window being closed without there ever being an observation of the window opening; in this case, the fact that we never observed an open event is distinctly relevant to a diagnosis procedure.

To further illustrate this, we provide a simple abstract example in the form of a DES: take Σ = {a, b, c, d, e, f1, f2}, with Σo = {a, b, c, d, e} and Σu = Σf = {f1, f2}. We provide the system model in the form of an NFA in Figure 1 and consider some example traces over it:

o1 = abababc. The model specifies that f2 must have occurred in strings containing a followed by c. In this case, the intervening sequence is long (babab), and could be much longer. The important information, however, is that a was at some point followed by c. Reporting in some abstract sense that a was followed by c is enough to convince an operator of the correctness of the diagnosis.

o2 = ababaa. The model specifies that f1 must have occurred for there to be two a events that are not separated by another observable event. More specifically, the lack of an intervening event is the crucial piece of information that determines the fault. In this case, reporting in some abstract sense that multiple a events occurred consecutively is enough to indicate the fault convincingly.

3.2 Framework

We first present a general framework for sub-observations, which is then further specified for our particular choice of implementation.

General Definition

Definition 1 We define a framework for sub-observations as a tuple ⟨O, ⪯, sub⟩:
1. A sub-observation, θ, is an abstraction over a trace that represents an intentional relaxation (or weakening) of the concrete knowledge contained in the trace.
2. O is the space of possible sub-observations.
3. The symbol ⪯ is a binary relation and partial order over O, and relates two sub-observations θ, θ′ such that θ′ ⪯ θ iff θ′ is a more abstracted form of θ.
4. sub is an injective function, mapping traces to maximal (w.r.t. ⪯) sub-observations θ ∈ O: sub : Σo⋆ → O.

A sub-observation θ implicitly represents the set of traces for which it is a more abstract form: ψ(θ) = {o ∈ Σo⋆ | θ ⪯ sub(o)}. Therefore, θ′ ⪯ θ ⇒ ψ(θ′) ⊇ ψ(θ).

The language of a sub-observation, denoted Lθ, represents the set of all possible runs θ could represent. However, these runs are already captured by Lo, and so Lθ can be expressed as the union of the languages of the traces of which it is a more abstract form:

$$L_\theta = \bigcup_{o \in \psi(\theta)} L_o \qquad (3)$$

Specific Definition

For the purposes of our specific definition of sub-observations, it is necessary to distinguish between what we call hard and soft events. A hard event is a singleton observable event, x ∈ Σo, and represents the firm occurrence of an
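To make the definitions concrete, the diagnosis ∆(o) of Equation (1) can be computed for a small finite model by enumerating runs whose projection onto Σo equals the trace. The sketch below is illustrative only: the transition relation is a made-up DES (not the Figure 1 model), and the bound `slack` on unobservable events is an assumption added to guarantee termination.

```python
# Hypothetical DES: states 1-4, observable events a, b, c; faults f1, f2.
# This transition relation is illustrative only, not the Figure 1 model.
TRANS = {
    1: [("a", 2), ("f1", 3)],
    2: [("b", 1), ("f2", 4)],
    3: [("a", 3)],
    4: [("c", 4)],
}
OBS = {"a", "b", "c"}
FAULTS = {"f1", "f2"}

def diagnose(obs, init=1, slack=2):
    """Return Delta(obs): all fault sets delta such that some run w of the
    model satisfies P(w) = obs and w ∩ Σf = delta.  Runs are bounded to at
    most `slack` unobservable events, purely so the search terminates."""
    results = set()
    # stack entries: (state, index into obs, faults seen, #unobservable used)
    stack = [(init, 0, frozenset(), 0)]
    while stack:
        state, i, delta, u = stack.pop()
        if i == len(obs):
            results.add(delta)  # run whose projection equals the whole trace
        for ev, nxt in TRANS.get(state, []):
            if ev in OBS:
                if i < len(obs) and obs[i] == ev:
                    stack.append((nxt, i + 1, delta, u))
            elif u < slack:  # unobservable event, possibly a fault
                stack.append((nxt, i, delta | ({ev} & FAULTS), u + 1))
    return results
```

For this toy model, `diagnose(("a", "b", "a"))` yields the three consistent fault sets ∅, {f1} and {f2}; the emptiness test of Equation (2) then corresponds to asking whether a given δ appears in this set.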
event in the system. A soft event is a subset of observable events, y ⊆ Σo, any number of which (including zero) may have occurred, along with any number of unobservable events.

We now explicitly characterize our construction of sub-observations based on the general framework presented in Definition 1:

Definition 2 A sub-observation, θ, is a strict time-ordered alternating sequence of soft and hard events, commencing and ending with a soft event: θ = y0 x1 y1 . . . xn yn. We denote by O(o) the space of sub-observations for a given trace o. θ ∈ O has length |θ| = n. For readability, sub-observations may occasionally be written as a comma-separated list. The language of θ can then also be expressed:

$$L_\theta = (y_0 \cup \Sigma_u)^\star\, x_1\, (y_1 \cup \Sigma_u)^\star \ldots x_n\, (y_n \cup \Sigma_u)^\star$$

By way of example, take the sub-observation θ = ({b, d}, a, ∅, c, {a}) – in this case, we say the singleton events x1 = a and x2 = c are hard and occurred in the specified order. The first soft event, y0 = {b, d}, represents the possibility of any number of b or d events in any order having occurred before the first hard event – similarly, y1 = ∅ indicates that no events occurred between the hard events x1 and x2, and y2 = {a} that any number of a events could have occurred after the final hard event. There are multiple traces ô that this could represent – ac being the simplest, but also traces such as ddacaa or bac, among infinitely many (or boundedly many, depending) other possibilities.

Definition 3 The function sub generates a sub-observation in O from a given trace by inserting empty soft events at the head of the trace, and after every hard event: for o = e1 . . . en,

$$sub(o) = \emptyset\, x_1\, \emptyset \ldots x_n\, \emptyset \in O, \quad \text{where } \forall i : x_i = e_i.$$

Definition 4 The relation ⪯ over O is defined such that θ′ ⪯ θ if and only if there exists a mapping function f: given |θ′| = n and |θ| = m,

$$f : \{0, \ldots, n+1\} \to \{0, \ldots, m+1\}$$

such that f(i) < f(i + 1), f(0) = 0, f(n + 1) = m + 1, and

$$x'_i = x_{f(i)}, \qquad y'_i \supseteq \bigcup_{f(i) \le j \le f(i+1)-1} (y_j \cup x_j)$$

[Figure 2: An example map satisfying ⪯, relating the sub-observations ({a, c}, b, {c, d}, a, {c}, d, {c}, a, ∅) and ({a, b, c, d}, a, {b, c, d}, a, ∅).]

3.3 Diagnosis of Sub-Observations

We now formalize the usage of sub-observations in a diagnosis procedure by extending the procedure introduced for event-based diagnosis presented in §2. This involves checking the consistency of a set of possible faults. We therefore provide the construction of the diagnosis of θ, ∆(θ), the set of faults consistent with a given sub-observation:

Definition 5 The diagnosis of a sub-observation θ is the union of the diagnoses of the traces of which θ is a more abstract form, represented by ψ(θ) as given in Definition 1:

$$\Delta(\theta) = \bigcup_{o \in \psi(\theta)} \Delta(o)$$

From Definition 5 we note that, given δ̂ ∈ ∆(ô), if θ ⪯ sub(ô) then δ̂ ∈ ∆(θ). That is, the actual diagnosis δ̂ of the actual trace ô will, by definition, be in ∆(θ) if θ is an abstraction of ô.

First, we observe the following lemma:

Lemma 3.1 The traces permitted by the language of a more abstracted sub-observation contain all the permitted traces of all its ascendants:

$$\theta' \preceq \theta \implies L_\theta \subseteq L_{\theta'}$$

Proof This is a direct consequence of Equation 3.

Equation 2 provided a formulation of the diagnosis as a question of emptiness of the intersection of languages – that is, is there some run that is simultaneously possible according to the system model, the observations, and the faults that occurred during the run. This can be extended to a similar question for sub-observations. With Lθ as defined in Definition 2, ∆(θ) can be equivalently expressed:

$$\Delta(\theta) \equiv \{ \delta \mid L_M \cap L_\delta \cap L_\theta \neq \emptyset \} \qquad (4)$$
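Restricted to observable events, the language Lθ of Definition 2 is a simple regular pattern y0⋆ x1 y1⋆ . . . xn yn⋆, so the test of whether an observable trace o belongs to ψ(θ) can be sketched as a small NFA-style scan. This is our illustrative reading of the definitions, not the authors' implementation; θ is assumed to be given as an alternating list [y0, x1, y1, ..., xn, yn] of sets (soft events) and single events (hard events).

```python
def in_psi(theta, trace):
    """Check whether an observable trace matches the sub-observation pattern
    y0* x1 y1* ... xn yn*, i.e. whether theta abstracts the trace.
    theta alternates soft events (sets) and hard events (single events)."""
    softs = theta[0::2]   # y0 .. yn (sets of observable events)
    hards = theta[1::2]   # x1 .. xn (single observable events)
    states = {0}          # number of hard events matched so far
    for e in trace:
        nxt = set()
        for k in states:
            if e in softs[k]:
                nxt.add(k)          # absorbed by the soft event y_k
            if k < len(hards) and e == hards[k]:
                nxt.add(k + 1)      # matches the next hard event
        states = nxt
        if not states:
            return False            # no way to parse the prefix
    return len(hards) in states     # all hard events must have occurred

# The example from the text: theta = ({b, d}, a, ∅, c, {a}).
theta = [{"b", "d"}, "a", set(), "c", {"a"}]
```

With this θ, traces such as `"ac"`, `"ddacaa"` and `"bac"` are accepted, while `"abc"` is rejected because y1 = ∅ forbids any event between the hard a and c.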
8 Appendix

We provide proof sketches that will not be included in the final version of the paper.

Proof of Lemma 4.3

The proof is three-part:
a) proving that the event-softening operation produces only children;
b) proving that the collapse operation produces only children;
c) proving that there is no other child.

Event-Softenings It is easy to see that θ2 ≝ es(θ1, i, e) ≺ θ1. Assume now that θ2 ⪯ θ3 ⪯ θ1 and let f23 and f31 be the two mapping functions—as presented in Definition 4—used to verify the two ordering relations. By definition of ⪯, |θ2| ≤ |θ3| ≤ |θ1|. However, since |θ2| = |θ1| (by definition of event-softening), the sizes of all three sub-observations are equal and f23 = f31 are the identity function. As a consequence, $x_j^3 = x_j^2 = x_j^1$ for all j. Furthermore, $y_j^2 \supseteq y_j^3 \supseteq y_j^1$ for all j. In particular, if j ≠ i, since $y_j^2 = y_j^1$, then $y_j^3 = y_j^2 = y_j^1$. For i, $y_i^2 = y_i^1 \cup \{e\}$, meaning that either $y_i^3 = y_i^2$ or $y_i^3 = y_i^1$. Therefore either θ3 = θ2 or θ3 = θ1.

Collapse Similarly, it is easy to see that θ2 ≝ coll(θ1, i) ≺ θ1. Again assume that θ2 ⪯ θ3 ⪯ θ1 and let f23 and f31 be the functions defined as before. The size of θ3 now either equals that of θ2 or that of θ1; let ℓ ∈ {1, 2} denote the index such that |θ3| = |θℓ|. Notice that either f23 or f31 is the identity function.

… $x_{f(i)}$ and $x_{f(i)+1}$ (such an index exists if the two sizes differ). If $y_{i+1} \setminus y_i \neq \emptyset$, then let θ″ = es(θ, i, e) (where e ∈ $y_{i+1} \setminus y_i$) be the sub-observation obtained by softening yi with e; then θ′ ≺ θ″ ≺ θ. Similarly if $y_i \supseteq y_{i+1}$, with θ″ = es(θ, i + 1, e) (where e ∈ $y_i \setminus y_{i+1}$). Lastly, the same applies if $y_i = y_{i+1}$, with θ″ = coll(θ, i).

If θ and θ′ have the same size, then all x′i equal the corresponding xi, and all the y′i are supersets of the corresponding yi. Let i be an index such that y′i ≠ yi (if no such index exists, then θ′ = θ). Let θ″ = es(θ, i, e) where e ∈ y′i \ yi. Then θ′ ≺ θ″ ≺ θ.

Complexity of FindCriticalObservation

We show that the number of ∆(·) calls in FindCriticalObservation could be in the order of n²m²/4, where n is the length of the trace and m is the number of observable events.

[Figure 4: Example of a system: a fault is diagnosed if there are more cs than bs after the occurrence of the last ai (A stands for {a1, . . . , am−2}).]

We use the example of Figure 4, which involves faulty event f and observable events {a1, . . . , am−2, b, c}. Consider the trace of (odd) length n:

$$\hat o = \underbrace{a_1 \cdots a_1}_{n/2}\; \underbrace{bc \cdots bc}_{n/2}\; c.$$

Clearly the trace reveals a faulty system, since the number of cs exceeds the number of bs in this instance. The critical observation here is

$$\Sigma_o\, a_1\, \{c\}\, b\, \{c\}\, c\, \{c\} \ldots \{c\}\, b\, \{c\}\, c\, \{c\}\, c\, \Sigma_o,$$

i.e., all of the second half of the trace needs to be kept.

We assume that FindCriticalObservation always tries to perform event-softening from the end of the sub-observation first, and only tries to collapse when no softening is possible. Neglecting the first steps where the c softenings are successful, the algorithm will need to make U = (n/2) × (m − 1) calls to ∆(·), unsuccessfully trying to soften the second half of the sub-observation. The number of successful softenings, however, is S = (n/2) × m (all of the first half of the sub-observation), meaning that the number of ∆(·) calls will be at least U × S = n²m(m − 1)/4.
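The arithmetic of this lower bound is easy to re-trace numerically. The helper below only checks the counting argument under the stated softening schedule; it does not implement FindCriticalObservation itself, and an even n is assumed for simplicity.

```python
def lower_bound_calls(n, m):
    """Re-trace the counting argument for the assumed schedule:
    U unsuccessful softening attempts over the second half of the trace,
    repeated after each of S successful softenings over the first half."""
    U = (n // 2) * (m - 1)   # unsuccessful Delta(.) calls per sweep
    S = (n // 2) * m         # successful softenings (first half)
    return U * S             # = n^2 * m * (m - 1) / 4 for even n

# e.g. a trace of length 10 over m = 4 observable events
calls = lower_bound_calls(10, 4)
```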
A Framework For Assessing Diagnostics Model Fidelity

Gregory Provan¹ and Alex Feldman²

¹ Computer Science Department, University College Cork, Cork, Ireland
e-mail: g.provan@cs.ucc.ie
² PARC Inc., Palo Alto, CA 94304, USA
e-mail: afeldman@parc.com
Abstract

"All models are wrong but some are useful" [1]. We address the problem of identifying which diagnosis models are more useful than others. Models are critical to diagnostics inference, yet little work exists to be able to compare models. We define the role of models in diagnostics inference, propose metrics for models, and apply these metrics to a tank benchmark system. Given the many approaches possible for model metrics, we argue that only information-theoretic methods address how well a model mimics real-world data. We focus on some well-known information-theoretic modelling metrics, demonstrating the trade-offs that can be made on different models for a tank benchmark system.

1 Introduction

A core goal of Model-Based Diagnostics (MBD) is to accurately diagnose a range of systems in real-world applications. There has been significant progress in developing algorithms for systems of increasing complexity. A key area where further work is needed is scaling up to real-world models, as multiple-fault diagnostics algorithms are currently limited by the size and complexity of the models to which they can be applied. In addition, there is still a great need for defining metrics to measure diagnostics accuracy, and to measure the computational complexity of inference and of the models' contribution to inference complexity.

This article addresses the modeling side of MBD: we focus on methods for measuring the size and complexity of MBD models. We explore the role that diagnostics model fidelity can play in being able to generate accurate diagnostics. We characterise model fidelity and examine the trade-offs of fidelity and inference complexity within the overall MBD inference task.

Model fidelity is a crucial issue in diagnostics [2]: models that are too simple can be inaccurate, yet highly detailed and complex models are expensive to create, have many parameters that require significant amounts of data to estimate, and are computationally intensive to perform inference on. There is an urgent need to incorporate inference complexity within modelling, since even relatively simple models, such as some of the combinational ISCAS-85 benchmark models, pose computational challenges to even the most advanced solvers for multiple-fault tasks. In addition, higher-fidelity models can actually perform worse than lower-fidelity models on real-world data, as can be explained using over-fitting arguments within a machine learning framework.

To our knowledge, there is no theory within Model-Based Diagnostics that relates notions of model complexity, model accuracy, and inference complexity. To address these issues, we explore several of the factors that contribute to model complexity, as well as a theoretically sound approach for selecting models based on their complexity and diagnostics performance, i.e., their accuracy in diagnosing faults.

Our contributions are as follows:
• We characterise the task of selecting a diagnosis model of appropriate fidelity as an information-theoretic model selection task.
• We propose several metrics for assessing the quality of a diagnosis model, and derive approximation versions of a subset of these metrics.
• We use a dynamical systems benchmark model to demonstrate how the metrics assess models relative to the accuracy of diagnostics output based on using the models.

2 Related Work

This section reviews work related to our proposed approach.

Model-Based Diagnostics There is some seminal work on modelling principles within the Model-Based Diagnosis (MBD) community, e.g., [2; 3]; this early work adopts an approach based on logic or qualitative physics for model specification. However, this work provides no means for comparing models in terms of diagnostics accuracy. More recent work ([4]) provides a logic-based specification of model fidelity. There is also work specifying metrics for diagnostics accuracy, e.g., [5]. However, none of this work defines precise metrics for computing both diagnostics accuracy and model complexity, and their trade-offs. This article adopts a theoretically well-founded approach for integrating multiple MBD metrics.

Multiple-Fidelity Modeling There is limited work describing the use of models of multiple levels of fidelity. Examples of such work include [6; 7; 8]. In this article we focus on methods for evaluating multi-fidelity models and their impact on diagnostics accuracy, as opposed to developing methodologies for modelling at multiple levels of fidelity.

Multiple-Mode Modeling One approach to MBD is to use a separate model for every failure mode, rather than to
define a model containing all failure modes. Examples of this approach include [9; 10; 11; 12]. Note that this work does not specify metrics for computing both diagnostics accuracy and model complexity, or their trade-offs.

Model Selection The metrics that we adopt and extend have been used extensively to compare different models, e.g., [13]. These metrics are used to compare the simulation performance of models only. In contrast, we extend this framework to examine diagnostics performance. In the process, we explore the use of multiple loss functions for penalising models, in addition to the standard penalty functions based on the number of model parameters.

Model-Order Reduction Model-order reduction [14] aims to reduce the complexity of a model while limiting the performance losses of the reduced model. The reduction methods are theoretically well-founded, although they are highly domain-specific. In contrast to this approach, we assume a model-composition approach from a component library containing hand-constructed models of multiple levels of fidelity.

3 Diagnostics Modeling and Inference

This section formalises the notion of a diagnostics model within the process of diagnostics inference. We first introduce the task, and then define it more precisely.

3.1 Diagnosis Task

Assume that we have a system S that can operate in a nominal state, ξN, or a faulty state, ξF, where Ξ is the set of possible states of S. We further assume that we have a discrete vector of measurements, Ỹ = {ỹ1, ..., ỹn}, observed at times t = {1, ..., n}, that summarizes the response of the system S to control variables U = {u1, ..., un}. Let Yφ = {y1, ..., yn} denote the corresponding predictions from a dynamic (nonlinear) model, φ, with parameter values θ: this can be represented by Yφ = φ(x0, θ, ξ, Ũ), where x0 signifies the initial state of the system at t0.

We assume that we have a prior probability distribution P(Ξ) over the states Ξ of the system. This distribution denotes the likelihood of the failure states of the system.

We define a residual vector R(Ỹ, Yφ) to capture the difference between the actual and model-simulated system behaviour. An example of a residual vector is the mean-squared error (MSE). We assume a fixed diagnosis task T throughout this article, e.g., computing the most likely diagnosis, or a deterministic multiple-fault diagnosis.

The classical definition of diagnosis is as a state estimation task, whose objective is to identify the system state that minimises the residual vector:

$$\xi^* = \operatorname*{argmin}_{\xi \in \Xi} R(\tilde Y, Y_\phi) \qquad (1)$$

Since this is a minimisation task, we typically need to run multiple simulations over the space of parameters and modes to compute ξ∗. We can abstract this process as performing model inversion, i.e., computing some ξ∗ = φ⁻¹(x0, θ, ξ, Ũ) that minimises R(Ỹ, Yφ).

During this diagnostics inference task, a model φ can play two roles: (a) simulating a behaviour to estimate R(Ỹ, Yφ); (b) enabling the computation of ξ∗ = φ⁻¹(x0, θ, ξ, Ũ). It is clear that diagnostics inference requires a model that has good fidelity and is computationally efficient for performing these two roles.

We generalise that notion to incorporate inference efficiency as well as accuracy. We can define an inference complexity measure as C(Ỹ, φ). We can then define our diagnosis task as jointly minimising a function g that incorporates the accuracy (based on the residual function) and the inference complexity:

$$\xi^* = \operatorname*{argmin}_{\xi \in \Xi} g\big(R(\tilde Y, Y_\phi),\, C(\tilde Y, \phi)\big). \qquad (2)$$

Here g specifies a loss or penalty function that induces a non-negative real-valued penalty based on the lack of accuracy and computational cost.

In forward simulation, a model φ, with parameters θ, can generate multiple observations Ỹ = {ỹ1, ..., ỹn}. The diagnostics task involves performing the inverse operation on these observations. Our objective thus involves optimising the state estimation task over a future set of observations, Ỹ = {Ỹ1, ..., Ỹn}. Our model φ and inference algorithm A have different performance based on Ỹi, i = 1, ..., n: for example, [15] shows that both inference accuracy and inference time vary based on the fault cardinality. As a consequence, to compute ξ∗ we want to optimise the mean performance over future observations. This notion of mean performance optimisation has been characterised using the Bayesian model selection approach, which we examine in the following section.

3.2 Diagnosis Model

We specify a diagnosis model as follows:

Definition 1 (Diagnosis Model). We characterise a Diagnosis Model φ using the tuple ⟨V, θ, Ξ, E⟩, where
• V is a set of variables, consisting of variables denoting the system state (X), control (U), and observations (Y).
• θ is a set of parameters.
• Ξ is a set of system modes.
• E is a set of equations, with a subset Eξ ⊆ E for each mode ξ ∈ Ξ.

We will assume that we can use a physics-based approach to hand-generate a set E of equations to specify a model. Obtaining good diagnostics accuracy, given a fixed E, entails estimating the parameters θ to optimise that accuracy.

3.3 Running Example: Three-Tank Benchmark

In this paper, we use the three-tank system shown in Fig. 1 to illustrate our approach. The three tanks are denoted as T1, T2, and T3. Each tank has the same area A1 = A2 = A3. For i = 1, 2, 3, tank Ti has height hi, a pressure sensor pi, and a valve Vi that controls the flow of liquid out of Ti. We assume that gravity g = 10 and the liquid has density ρ = 1.

Tank T1 gets filled from a pipe, with measured flow q0. Using Torricelli's law, the model can be described by the following non-linear equations:

$$\frac{dh_1}{dt} = \frac{1}{A_1}\left[-\kappa_1\sqrt{h_1 - h_2} + q_0\right], \qquad (3)$$

$$\frac{dh_2}{dt} = \frac{1}{A_2}\left[\kappa_1\sqrt{h_1 - h_2} - \kappa_2\sqrt{h_2 - h_3}\right], \qquad (4)$$

$$\frac{dh_3}{dt} = \frac{1}{A_3}\left[\kappa_2\sqrt{h_2 - h_3} - \kappa_3\sqrt{h_3}\right]. \qquad (5)$$
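Equations (3)–(5) can be integrated with a basic forward-Euler scheme, and the state-estimation view of Equation (1) realised by scanning a set of candidate modes for the smallest residual. The sketch below is ours, not the authors' benchmark code: the parameter values (κi, q0, initial heights, step size) are invented, and faults are injected via the κi(1 + βi) scaling used in the paper's fault model.

```python
import math

def simulate(kappa, beta=(0.0, 0.0, 0.0), h0=(4.0, 2.0, 1.0),
             q0=0.5, A=1.0, dt=0.01, steps=2000, g=10.0):
    """Forward-Euler integration of equations (3)-(5); a fault scales
    kappa_i by (1 + beta_i).  Returns the pressure traces p_i = g * h_i."""
    h1, h2, h3 = h0
    k1, k2, k3 = [k * (1 + b) for k, b in zip(kappa, beta)]
    out = []
    for _ in range(steps):
        f12 = k1 * math.sqrt(max(h1 - h2, 0.0))   # flow T1 -> T2
        f23 = k2 * math.sqrt(max(h2 - h3, 0.0))   # flow T2 -> T3
        f3 = k3 * math.sqrt(max(h3, 0.0))         # outflow from T3
        h1 += dt * (-f12 + q0) / A
        h2 += dt * (f12 - f23) / A
        h3 += dt * (f23 - f3) / A
        out.append((g * h1, g * h2, g * h3))
    return out

def mse(ya, yb):
    """Mean-squared-error residual R between two pressure traces."""
    return sum((a - b) ** 2
               for ra, rb in zip(ya, yb)
               for a, b in zip(ra, rb)) / len(ya)

# Equation (1): pick the mode whose simulated pressures minimise the residual.
kappa = (0.3, 0.3, 0.3)                       # invented nominal parameters
modes = {"nominal": (0.0, 0.0, 0.0),
         "leak_V1": (0.5, 0.0, 0.0),
         "block_V1": (-0.5, 0.0, 0.0)}
observed = simulate(kappa, beta=modes["leak_V1"])  # stand-in for sensor data
best = min(modes, key=lambda m: mse(observed, simulate(kappa, beta=modes[m])))
```

Because the "observed" trace is generated from the leak mode itself, the residual scan recovers `"leak_V1"` exactly; with real sensor data the minimum would merely be the best-fitting mode.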
[Figure 1: Diagram of the three-tank system: inflow q0 into T1, valves V1, V2, V3, and pressure sensors p1, p2, p3.]

In eq. 3, the coefficient κ1 denotes a parameter that captures the product of the cross-sectional area of the tank A1, the area of the drainage hole, a gravity-based constant (√(2g)), and the friction/contraction factor of the hole. κ2 and κ3 can be defined analogously.

Finally, the pressure at the bottom of each tank is obtained from the height: pi = g hi, where i is the tank index (i ∈ {1, 2, 3}).

We emphasize the use of the κi, i = 1, 2, 3, because we will use these parameter values as a means for "diagnosing" our system in terms of changes in κi. Consider a physical valve R1 between T1 and T2 that constrains the flow between the two tanks. We can say that the valve changes proportionally the cross-sectional drainage area of q1 and hence κ1. The diagnostic task will be to compute the true value of κ1, given p1, and from κ1 we can compute the actual position of the valve R1.

We now characterise our nominal model in terms of Definition 1:
• variables V consist of variables denoting the system state (X = {h1, h2, h3}), control (U = {q0, V1, V2, V3}), and observations (Y = {p1, p2, p3}).
• θ = {{A1, A2, A3}, {κ1, κ2, κ3}} is the set of parameters.
• Ξ consists of a single nominal mode.
• E is a set of equations, given by equations 3 through 5.
Note that this model has a total of 6 parameters.

Fault Model In this article we focus on valve faults, where a valve can have a blockage or a leak. We model this class of faults by including in equations 3 to 5 an additive parameter β, which is applied to the parameter κ, i.e., as κi(1 + βi), i = 1, 2, 3, where −1 ≤ βi ≤ 1/κi − 1. β > 0 corresponds to a leak, such that β ∈ (0, 1/κ − 1]; β < 0 corresponds to a blockage, such that β ∈ [−1, 0). The fault equations can be written as:

$$\frac{dh_1}{dt} = \frac{1}{A_1}\left[-\kappa_1(1+\beta_1)\sqrt{h_1 - h_2} + q_0\right], \qquad (6)$$

$$\frac{dh_2}{dt} = \frac{1}{A_2}\left[\kappa_1(1+\beta_1)\sqrt{h_1 - h_2} - \kappa_2(1+\beta_2)\sqrt{h_2 - h_3}\right],$$

$$\frac{dh_3}{dt} = \frac{1}{A_3}\left[\kappa_2(1+\beta_2)\sqrt{h_2 - h_3} - \kappa_3(1+\beta_3)\sqrt{h_3}\right].$$

The fault equations allow faults for any combination of the valves {V1, V2, V3}, resulting in system modes Ξ = {ξN, ξ1, ξ2, ξ3, ξ12, ξ13, ξ23, ξ123}, where ξN is the nominal mode, and ξ· is the mode where · denotes the combination of valves (taken from a combination of {1, 2, 3}) which are faulty. This fault model has 9 parameters.

4 Modelling Metrics

This section describes the metrics that can be applied to estimate properties of a diagnosis model. We describe two types of metrics, dealing with accuracy (fidelity) and complexity.

4.1 Model Accuracy

Model accuracy concerns the ability of a model to mimic a real system. From a diagnostics perspective, this translates to the use of a model to simulate behaviours that distinguish nominal and faulty behaviours sufficiently well that appropriate fault isolation algorithms can identify the correct type of fault when it occurs. As such, a diagnostics model needs to be able to simulate behaviours for multiple modes with "appropriate" fidelity.

Note that we distinguish model accuracy from diagnosis inference accuracy. As noted above, model accuracy concerns the ability of a model to mimic a real system through simulation, and to assist in diagnostics isolation. Diagnosis inference accuracy concerns being able to isolate the true fault given an observation and the simulation output of a model.

A significant challenge for a diagnosis model is the need to simulate behaviours for multiple modes. Two approaches that have been taken are to use a single model with multiple modes explicitly defined (a multi-mode approach), or to use multiple models [9; 16; 17], each of which is optimised for a single or small set of modes (a multi-model approach).

The AI-based MBD approach typically uses a single model φ with multiple modes explicitly defined [18], or a single model with just nominal behaviour [19]. From a diagnostics perspective, accuracy must be defined with respect to the task T. We adopt here the task of computing the most-likely diagnosis.

Given evidence suggesting that model fidelity for a multi-mode approach varies depending on the mode, it is important to explicitly consider the mean performance of φ over the entire observation space Y (the space of possible observations of the system).

In this article we adopt the expected residual approach, i.e., given a space Y = {Ỹ1, ..., Ỹn} of observations, the expected residual is the average over the n observations, e.g., as given by:

$$\bar R = \frac{1}{n}\sum_{i=1}^{n} R(\tilde Y_i, Y_\phi).$$

4.2 Model Complexity

At present, there is no commonly accepted definition of model complexity, whether the model is used purely for simulation or if it is used for diagnostics or control. Defining the complexity of a model is inherently tricky, due to the number of factors involved.

Less complex models are often preferred either due to their low computational simulation costs [20], or to minimise model over-fitting given observed data [21; 22]. Given the task of simulating a variable of interest conditioned by certain future values of input (control) variables, overfitting can lead to high uncertainty in creating accurate simulations. Overfitting is especially severe when we have limited observation variables for generating a model representing the underlying process dynamics. In contrast, models with low
parameter dimensionality (i.e. fewer parameters) are con- Statistical model selection is commonly based on Oc-
sidered less complex and hence are associated with low pre- cam’s parsimony principle (ca.1320), namely that hypothe-
diction uncertainty [23]. ses should be kept as simple as possible. In statistical terms,
Several approaches have been used, based on issues like this is a trade-off between bias (distance between the aver-
(a) number of variables [24], (b) model structure [25], (c) age estimate and truth) and variance (spread of the estimates
number of free parameters [23], (d) number of parameters around the truth).
that the data can constrain [26], (e) a notion of model weight The idea is that by adding parameters to a model we ob-
[27], or (f) type and order of equations for a non-linear dy- tain improvement in fit, but at the expense of making pa-
namical model [14], where type corresponds to non-linear, rameter estimates “worse"’ because we have less data (i.e.,
linear, etc.; e.g., order for a non-linear model is such that a information) per parameter. In addition, the computations
k-th order system has k-th derivates in E. typically require more time. So the key question is how to
Factors that contribute to the true cost of a model include: identify how complex a model works best for a given prob-
(a) model-generation; (b) parameter estimation; and (c) sim- lem.
ulation complexity, i.e., the computational expense (in terms If the goal is to compute the likelihood of a given model
of CPU-time and memory) needed to simulate the model φ(x0 , θ, ξ, U ), then θ and U are nuisance parameters.
given a set of initial conditions Rather than try to formu- These parameters affect the likelihood calculation but are
late this notion in terms of the number of model variables or not what we want to infer. Consequently, these parameters
parameters, or a notion of model structural complexity, we should be eliminated from the inference. We can remove
specify model complexity in terms of a measure based on nuisance parameters by assigning them prior probabilities
parameter estimation, and inference complexity, assuming a and integrating them out to obtain the marginal probability
construction cost of zero. of the data given only the model, that is, the model likeli-
A thorough analysis of model complexity will need to hood (also called integrative, marginal, or predictive like-
take into consideration the model equation class, since lihood). In equational form, this looks like: P (Y |φ) =
R R
model complexity is class-specific. For example, for non-linear dynamical models, complexity is governed by the type and order of equations [14]. In contrast, for linear dynamical models, which have only matrices and variables in equations (no derivatives), it is the order of the matrices that determines complexity. In this article, we assume that models are of appropriate complexity, and hence do not address model order reduction techniques [14], which aim to generate lower-dimensional systems that trade off fidelity for reduced model complexity.

4.3 Diagnostics Model Selection Task
The model in this model selection problem corresponds to a system with a single mode. Given a space Φ of possible models, we can define this model selection task as follows:

φ* = argmin_{φ∈Φ} [ g1 R(Ỹ, Yφ) + g2 C(Ỹ, φ) ],   (7)

adopting the simplifying assumption that our loss function g is additively decomposable.

4.4 Information-Theoretic Model Complexity
The Information-Theoretic (or Bayesian) model complexity approach, which is based on the model likelihood, measures whether the increased "complexity" of a model with more parameters is justified by the data. The Information-Theoretic approach chooses a model (and a model structure) from a set of competing models (from the set of corresponding model structures, respectively) such that the value of a Bayesian criterion is maximized (or prediction uncertainty in choosing a model structure is minimized).

The Information-Theoretic approach addresses prediction uncertainty by specifying an appropriate likelihood function. In other words, it specifies the probability with which the observed values of a variable of interest are generated by a model. The marginal likelihood of a model structure, which represents a class of models capturing the same processes (and hence having the same parameter dimensionality), is obtained by integrating over the prior distribution of the model parameters; this measures the prediction uncertainty of the model structure [28]:

∫_θ ∫_U P(φ | Y, θ, U) P(θ, U | φ) dθ dU.

However, this multi-dimensional integral can be very difficult to compute, and it is typically approximated using computationally intensive techniques like Markov chain Monte Carlo (MCMC).

Rather than try to solve such a computationally challenging task, we adopt an approximation to the multi-dimensional integral. In the statistics literature several decomposable approximations have been proposed.

Spiegelhalter et al. [26] have proposed a well-known such decomposable framework, termed the Deviance Information Criterion (DIC), which measures the number of model parameters that the data can constrain: DIC = D + p_D, where D is a measure of fit (expected deviance), and p_D is a complexity measure, the effective number of parameters. The Akaike Information Criterion (AIC) [29; 30] is another well-known measure: AIC = −2L(θ̂) + 2k, where θ̂ is the Maximum Likelihood Estimate (MLE) of θ and k is the number of parameters.

To compensate for a small sample size n, a variant of AIC, termed AICc, is typically used:

AICc = −2L(θ̂) + 2k + 2k(k + 1)/(n − k − 1).   (8)

Another, computationally more tractable, approach is the Bayesian Information Criterion (BIC) [31]: BIC = −2L(θ̂) + k log n, where k is the number of estimable parameters, and n is the sample size (number of observations). BIC was developed as an approximation to the log marginal likelihood of a model, and therefore the difference between two BIC estimates may be a good approximation to the natural log of the Bayes factor. Given equal priors for all competing models, choosing the model with the smallest BIC is equivalent to selecting the model with the maximum posterior probability. BIC assumes that the (parameters') prior is the unit information prior (i.e., a multivariate normal prior with mean at the maximum likelihood estimate and variance equal to the expected information matrix for one observation).

Wagenmakers [32] shows that one can convert the BIC
metric to

BIC = n log(SSE / SS_total) + k log n,

where SSE is the sum of squares for the error term. In our experiments, we assume that the non-linear model is the "correct" model (or the null hypothesis H0), and either the linear or qualitative models are the competing model (or alternative hypothesis H1). Hence what we do is use BIC to compare the non-linear model to each of the competing models.

Suppose that we obtain the BIC values for the alternative and the correct models, using the relevant SS terms. When computing ∆BIC = BIC(H1) − BIC(H0), note that both the null (H0) and the alternative hypothesis (H1) models share the same SS_total term (both models attempt to explain the same collection of scores), although they differ with respect to SSE. The SS_total term common to both BIC values cancels out in computing ∆BIC, producing

∆BIC = n log(SSE_1 / SSE_0) + (k1 − k0) log n,   (9)

where SSE_1 and SSE_0 are the sums of squares for the error terms in the alternative and the null hypothesis models, respectively.

5 Experimental Design
This section compares three tank benchmark models according to various model-selection measures. We adopt as our "correct" model the non-linear model. We will examine the fidelity and complexity tradeoffs of two simpler models over a selection of failure scenarios.

The diagnostic task will be to compute the fault state of the system, given an injected fault, which is one of (ξN, ξB, ξP), denoting nominal, blocked and passing valves, respectively. This translates to different tasks given the different models:

non-linear model: estimate the true value of κ1 given p1, which corresponds to a most-likely failure mode assignment of one of (ξN, ξB, ξP).
linear model: estimate the true value of κ1 given p1, which corresponds to a most-likely failure mode assignment of one of (ξN, ξB, ξP).
qualitative model: estimate the failure mode assignment of one of (ξN, ξB, ξP).

5.1 Alternative Models
This section describes the two alternative models that we compare to the non-linear model: a linear and a qualitative model.

Linear Model
We compare the non-linear model with a linearised version. We can perform this linearisation in a variety of ways [33]. In this simple tank example, we can perform the linearisation directly by replacing non-linear operators with linear ones, as shown below.

Nominal Model: We can linearise the non-linear 3-tank model by replacing the non-linear sub-function √(hi − hj) with the linear sub-function γij(hi − hj), where γij is a parameter (to be estimated) governing the flow between tanks i and j. The linear model has 4 parameters: γ12, γ12, γ23, γ3.

Fault Model: The fault model introduces a parameter βi associated with κi, i.e., we replace κi with κi(1 + βi), i = 1, 2, 3, where −1 ≤ βi ≤ 1/κi − 1, i = 1, 2, 3. This model has 7 parameters, adding parameters β1, β2, β3.

Qualitative Model
Nominal Model: For the model we replace the non-linear sub-function √(hi − hj) with the qualitative sub-function M+(hi − hj), where M+ is the set of reasonable functions f such that f′ > 0 on the interior of its domain [34].

The tank heights are constrained to be non-negative, as are the parameters κi. As a consequence, we can discretize the hi to take on values {+, 0}, which means that M+(hi − hj) can take on values {+, 0, −}. The domain for dh1/dt must be {+, 0, −}, since the qualitative version of q0, Q, is non-negative (domain of {+, 0}) and each M+(hi − hj) can take on values {+, 0, −}. We see that this model has no parameters to estimate.

Fault Model: The qualitative fault model has different M+ functions for the modes where the valve is passing and blocked. We derive these functions as follows. From a qualitative perspective, the domain of βi is {0, +} for a passing valve, and {−, 0} for a blocked valve. To create a new M+ function for the cases of passing and blocked valve, we qualitatively apply these corresponding domains to the standard M+ function with domain {−, 0, +} to obtain fault-based M+ functions: M+_P(hi − hj) denotes the M+ function when the valve is passing, and M+_B(hi − hj) denotes the M+ function when the valve is blocked.

5.2 Simulation Results
We have compared the simulation performance of the models under nominal and faulty conditions, considering faults to individual valves V1, V2 and V3, as well as double-fault combinations of the valves. In the following we present some plots for simulations of faults and fault-isolation for different model types.

Figure 2 shows the results from a single-fault scenario, where valve V1 is stuck at 50% at t = 250 s, based on the non-linear model. The plot from this simulation shows that at the time of the fault injection, the water level in tank T1 starts increasing while the water levels at tanks T2 and T3 start decreasing due to the lower inflow.

[Figure 2: Simulation with non-linear model for the scenario of a fault in valve 1 at t = 250 s. The plot shows the pressures p_1, p_2 and p_3 against time [s].]

Table 1 shows the simulation error-difference between the non-linear and linear models, for the nominal case and the faulty case (where valve 1 is faulted). Given that we measure the pressure levels for p1, p2 and p3 every second, we use the difference in these outputs to identify the sum-of-squared-error (SSE) values for the simulations.
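The criteria used in this comparison (AIC, AICc of Eq. (8), BIC, and the ∆BIC of Eq. (9)) can be computed directly from a fitted model's maximum log-likelihood or SSE. The following sketch is illustrative only; the numeric values are invented, not taken from our experiments.

```python
import math

def aic(loglik, k):
    # AIC = -2 L(theta_hat) + 2k
    return -2.0 * loglik + 2.0 * k

def aicc(loglik, k, n):
    # AICc = AIC + 2k(k+1)/(n - k - 1), the small-sample correction of Eq. (8)
    return aic(loglik, k) + 2.0 * k * (k + 1) / (n - k - 1)

def bic(loglik, k, n):
    # BIC = -2 L(theta_hat) + k log n
    return -2.0 * loglik + k * math.log(n)

def delta_bic(sse_alt, k_alt, sse_null, k_null, n):
    # Eq. (9): the shared SS_total term cancels, leaving
    # Delta-BIC = n log(SSE_1 / SSE_0) + (k_1 - k_0) log n.
    # Positive values favour the null (higher-fidelity) model.
    return n * math.log(sse_alt / sse_null) + (k_alt - k_null) * math.log(n)

# Invented illustration: a 4-parameter linear alternative versus a
# 7-parameter non-linear null model over n = 500 observations.
d = delta_bic(sse_alt=3500.0, k_alt=4, sse_null=3000.0, k_null=7, n=500)
```

Here the higher SSE of the alternative model outweighs its parameter saving, so ∆BIC comes out positive, favouring the non-linear null model.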
            p1       p2      p3      Total
Nominal     2600.3   316.2   118.1   3034.6
V1-fault    2583.1   347.5   137.2   3067.8

Table 1: Data for SSE values for simulations using Non-linear and Linear representations, given two scenarios: nominal and faulty (valve V1 at 50% after 250 s)

Figure 3 shows the results for diagnosing the V1-fault using the non-linear model. We can see that the diagnostic accuracy is high, as P(V1) converges to almost 1 with little time lag.

[Figure 3: Simulation of fault isolation of fault in valve 1 with non-linear model. The figure depicts the probability of valve 1 being faulty against time [s].]

In contrast, Figure 4 shows the diagnostic accuracy and isolation time with a linear model. First, note that there is a false-positive identified early in the simulation, and the model incorrectly identifies both valves 2 and 3 as being faulty. This linear model thus delivers both poor diagnostic accuracy (classification errors) and poor isolation time (there is a lag between when the fault occurs and when the model identifies the fault). After the fault injection at t = 250 s, the predictive accuracy improves and the correct fault becomes the most likely fault.

[Figure 4: Simulation of fault isolation of fault in valve 1 with linear model. The figure depicts the probability of valves 1, 2 and 3 being faulty against time [s].]

Figure 5 depicts the diagnostic performance with a mixed linear/non-linear model (T1 is non-linear, while T2 and T3 are linear). The diagnostic accuracy is almost the same as that of the non-linear model (cf. Figure 3), except for a false-positive detection at the beginning of the scenario.

[Figure 5: Simulation of fault isolation of fault in valve 1 with mixed non-linear/linear model (T1 non-linear and both T2 and T3 linear). The figure depicts the probability of valves 1, 2 and 3 being faulty against time [s].]

6 Experimental Results
This section describes our experimental results, summarising the data first and then discussing the implications of the results.

6.1 Model Comparisons
We have empirically compared the diagnostics performance of several multi-tank models. In our first set of experiments, we ran a simulation over 500 seconds, and induced a fault (valve V1 at 50%) after 250 s. The model combinations involved a non-linear (NL) model, a model (denoted M) with tank T1 being linear (and the other tanks non-linear), a fully linear model (denoted L), and a qualitative model (denoted Q).

To compare the relative performance of the models, we compute a measure of diagnostics error (or loss), using the difference between the true fault (which is known for each simulation) and the computed fault. We denote the true fault existing at time t using the pair (ω, t); the computed fault at time t is denoted using the pair (ω̂, t̂). The inference system that we use, LNG [35], computes an uncertainty measure associated with each computed fault, denoted P(ω̂). Hence, we define a measure of diagnostics error over a time window [0, T] using

γ1 = Σ_{t=0}^{T} Σ_{ξ∈Ξ} |P(ω̂_t) − ω_t|,   (10)

where Ξ is the set of failure modes for the model, and ω_t denotes ω at time t.

Our second metric covers the fault latency, i.e., how quickly the model identifies the true fault (ω, t): γ2 = t − t̂.

Table 2 summarises our results. The first columns compare the number of parameters for the different models, followed by comparisons of the error (γ1) and the CPU-time (γ2). The data show that the error (γ1) does not grow very much as we increase model size, but it increases as we decrease model fidelity from non-linear through to qualitative models. In contrast, the CPU-time (a) increases as we increase model size, and (b) is proportional to model fidelity, i.e., it decreases as we decrease model fidelity from non-linear through to qualitative models.

In a second set of experiments, we focused on multiple model types for a 3-tank system, with simulations running over 50 s, and we induced a fault (valve V1 at 50%) after 25 s. The model combinations involved a non-linear (NL) model, a model with tank 3 linear (and the other tanks non-linear), a model with tanks 2 and 3 linear and tank 1 non-linear, a fully linear model, and a qualitative model. Table 3 summarises our results.

The data show that, as model fidelity decreases, the error γ1 increases significantly and the inference times γ2 decrease modestly. If we examine the outputs from AICc, we see that the best model is the mixed model (T3-linear).
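The diagnostics-error measure of Eq. (10) above sums, over time steps and failure modes, the absolute gap between the inferred fault probability and the true fault indicator. A minimal sketch (the function and variable names are ours, and the three-step scenario is invented for illustration):

```python
def diagnosis_error(posteriors, truth):
    """Eq. (10): sum over time steps and failure modes of
    |P(mode faulty at t) - true 0/1 indicator at t|.
    posteriors, truth: one dict per time step, keyed by failure mode."""
    return sum(abs(p[m] - tr[m])
               for p, tr in zip(posteriors, truth)
               for m in tr)

# Invented toy scenario: a fault in valve V1 injected at the second step.
truth      = [{"V1": 0, "V2": 0}, {"V1": 1, "V2": 0}, {"V1": 1, "V2": 0}]
posteriors = [{"V1": 0.1, "V2": 0.0}, {"V1": 0.6, "V2": 0.2}, {"V1": 0.9, "V2": 0.1}]
err = diagnosis_error(posteriors, truth)
```

A perfect diagnoser scores 0; the score grows with both classification errors and isolation lag, which is why slow-converging low-fidelity models accumulate large γ1 values.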
BIC indicates the qualitative model as the best; it is worth noting that BIC typically will choose the simplest model.

Tanks            2      3      4
# Parameters NL  7      9      11
             M   6      8      10
             L   5      7      9
             Q   2      3      4
γ1           NL  242    242    242
             M   997    1076   1192
             L   1236   1288   1342
             Q   3859   3994   4261
γ2           NL  10.59  23.7   39.5
             M   8.52   17.96  34.6
             L   6.11   10.57  32.0
             Q   4.64   7.31   26.4

Table 2: Data for 2-, 3-, and 4-tank models using Non-linear (NL), Mixed (M), Linear (L) and Qualitative (Q) representations

                 γ1      γ2     AICc   BIC
Non-Linear       0.97    23.7   29.45  43.7
T3-linear        3.12    17.96  26.77  42.9
T2,T3-linear     21.96   13.21  31.12  39.56
Linear           77.43   10.57  35.76  37.55
Qualitative      304.41  9.74   43.01  29.13

Table 3: Data for 3-tank model, using Non-linear, Mixed, Linear and Qualitative representations, given a fault (valve V1 at 50%) after 25 s

6.2 Discussion
Our results show that MBD is a complex task with several conflicting factors.

• The diagnosis error γ1 is inversely proportional to model fidelity, given a fixed diagnosis task.
• The error γ1 increases with fault cardinality.
• The CPU-time γ2 increases with model size (i.e., number of tanks).

This article has introduced a framework that can be used to trade off the different factors governing MBD "accuracy". We have shown how one can extend a set of information-theoretic metrics to combine these competing factors in diagnostics model selection. Further work is necessary to identify how best to extend the existing information-theoretic metrics to suit the needs of different diagnostics applications, as it is likely that the "best" model may be domain- and task-specific.

It is important to note that we conducted experiments with un-calibrated models, and we have ignored the cost of calibration in this article. The literature suggests that linear models can be calibrated to achieve good performance, although performance inferior to that of calibrated non-linear models. This class of qualitative models does not possess calibration factors, so calibration will not improve their performance.

7 Conclusions
This article has presented a framework for evaluating the competing properties of models, namely fidelity and computational complexity. We have argued that model performance needs to be evaluated over a range of future observations, and hence we need a framework that considers the expected performance. As such, information-theoretic methods are well suited.

We have proposed some information-theoretic metrics for MBD model evaluation, and conducted some preliminary experiments to show how these metrics may be applied. This work thus constitutes a start to a full analysis of model performance. Our intention is to initiate a more formal analysis of modeling and model evaluation, since there is no framework in existence for this task. Further, the experiments are only preliminary, and are meant to demonstrate how a framework can be applied to model comparison and evaluation.

Significant work remains to be done, on a range of fronts. First, a thorough empirical investigation of diagnostics modeling is needed. Second, the real-world utility of our proposed framework needs to be determined. Third, a theoretical study of the issues of mode-based parameter estimation and its use for MBD is necessary.

References
[1] George E. P. Box. Statistics and science. Journal of the American Statistical Association, 71:791–799, 1976.
[2] Peter Struss. What's in SD? Towards a theory of modeling for diagnosis. Readings in Model-Based Diagnosis, pages 419–449, 1992.
[3] Peter Struss. Qualitative modeling of physical systems in AI research. In Artificial Intelligence and Symbolic Mathematical Computing, pages 20–49. Springer, 1993.
[4] Nuno Belard, Yannick Pencolé, and Michel Combacau. Defining and exploring properties in diagnostic systems. System, 1:R2, 2010.
[5] Alexander Feldman, Tolga Kurtoglu, Sriram Narasimhan, Scott Poll, and David Garcia. Empirical evaluation of diagnostic algorithm performance using a generic framework. International Journal of Prognostics and Health Management, 1:24, 2010.
[6] Steven D. Eppinger, Nitin R. Joglekar, Alison Olechowski, and Terence Teo. Improving the systems engineering process with multilevel analysis of interactions. Artificial Intelligence for Engineering Design, Analysis and Manufacturing, 28(04):323–337, 2014.
[7] Sanjay S. Joshi and Gregory W. Neat. Lessons learned from multiple fidelity modeling of ground interferometer testbeds. In Astronomical Telescopes & Instrumentation, pages 128–138. International Society for Optics and Photonics, 1998.
[8] Roxanne A. Moore, David A. Romero, and Christiaan J. J. Paredis. A rational design approach to Gaussian process modeling for variable fidelity models. In ASME 2011 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages 727–740. American Society of Mechanical Engineers, 2011.
[9] Peter D. Hanlon and Peter S. Maybeck. Multiple-model adaptive estimation using a residual correlation Kalman filter bank. IEEE Transactions on Aerospace and Electronic Systems, 36(2):393–406, 2000.
[10] Redouane Hallouzi, Michel Verhaegen, Robert Babuška, and Stoyan Kanev. Model weight and state estimation for multiple model systems applied to fault detection and identification. In IFAC Symposium on System Identification (SYSID), Newcastle, Australia, 2006.
[11] Amardeep Singh, Afshin Izadian, and Sohel Anwar. Fault diagnosis of Li-Ion batteries using multiple-model adaptive estimation. In Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE, pages 3524–3529. IEEE, 2013.
[12] Amardeep Singh Sidhu, Afshin Izadian, and Sohel Anwar. Nonlinear model based fault detection of lithium ion battery using multiple model adaptive estimation. In World Congress, volume 19, pages 8546–8551, 2014.
[13] Aki Vehtari, Janne Ojanen, et al. A survey of Bayesian predictive methods for model assessment, selection and comparison. Statistics Surveys, 6:142–228, 2012.
[14] Athanasios C. Antoulas, Danny C. Sorensen, and Serkan Gugercin. A survey of model reduction methods for large-scale systems. Contemporary Mathematics, 280:193–220, 2001.
[15] Alexander Feldman, Gregory M. Provan, and Arjan J. C. van Gemund. Computing observation vectors for max-fault min-cardinality diagnoses. In AAAI, pages 919–924, 2008.
[16] Amardeep Singh, Afshin Izadian, and Sohel Anwar. Nonlinear model based fault detection of lithium ion battery using multiple model adaptive estimation. In 19th IFAC World Congress, Cape Town, South Africa, 2014.
[17] Youmin Zhan and Jin Jiang. An interacting multiple-model based fault detection, diagnosis and fault-tolerant control approach. In Proceedings of the 38th IEEE Conference on Decision and Control, volume 4, pages 3593–3598. IEEE, 1999.
[18] Peter Struss and Oskar Dressler. "Physical negation": integrating fault models into the general diagnostic engine. In IJCAI, volume 89, pages 1318–1323, 1989.
[19] Johan de Kleer, Alan K. Mackworth, and Raymond Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2):197–222, 1992.
[20] Elizabeth H. Keating, John Doherty, Jasper A. Vrugt, and Qinjun Kang. Optimization and uncertainty assessment of strongly nonlinear groundwater models with high parameter dimensionality. Water Resources Research, 46(10), 2010.
[21] Saket Pande, Mac McKee, and Luis A. Bastidas. Complexity-based robust hydrologic prediction. Water Resources Research, 45(10), 2009.
[22] G. Schoups, N. C. van de Giesen, and H. H. G. Savenije. Model complexity control for hydrologic prediction. Water Resources Research, 44(12), 2008.
[23] S. Pande, L. Arkesteijn, H. H. G. Savenije, and L. A. Bastidas. Hydrological model parameter dimensionality is a weak measure of prediction uncertainty. Natural Hazards and Earth System Sciences Discussions, 11, 2014.
[24] Martin Kunz, Roberto Trotta, and David R. Parkinson. Measuring the effective complexity of cosmological models. Physical Review D, 74(2):023503, 2006.
[25] Gregory M. Provan and Jun Wang. Automated benchmark model generators for model-based diagnostic inference. In IJCAI, pages 513–518, 2007.
[26] David J. Spiegelhalter, Nicola G. Best, Bradley P. Carlin, and Angelika van der Linde. Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 64(4):583–639, 2002.
[27] Jing Du. The "weight" of models and complexity. Complexity, 2014.
[28] Jasper A. Vrugt and Bruce A. Robinson. Treatment of uncertainty using ensemble methods: Comparison of sequential data assimilation and Bayesian model averaging. Water Resources Research, 43(1), 2007.
[29] Hirotugu Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[30] Hirotugu Akaike. Likelihood of a model and information criteria. Journal of Econometrics, 16(1):3–14, 1981.
[31] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–466, 1978.
[32] Eric-Jan Wagenmakers. A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14(5):779–804, 2007.
[33] Pol D. Spanos. Linearization techniques for non-linear dynamical systems. PhD thesis, California Institute of Technology, 1977.
[34] Benjamin Kuipers and Karl Åström. The composition and validation of heterogeneous control laws. Automatica, 30(2):233–249, 1994.
[35] Alexander Feldman, Helena Vicente de Castro, Arjan van Gemund, and Gregory Provan. Model-based diagnostic decision-support system for satellites. In Proceedings of the IEEE Aerospace Conference, Big Sky, Montana, USA, pages 1–14, March 2013.
Posters
A General Process Model: Application to Unanticipated Fault Diagnosis
Jiongqi WANG1, Zhangming HE2, Haiyin ZHOU3 and Shuxing LI1
1 College of Science, National University of Defense Technology, Changsha, Hunan, P. R. China
emails: wjq_gfkd@163.com, lishuxingok@163.com
2 Institute for Automatic Control and Complex Systems, University of Duisburg-Essen, Duisburg, Germany
email: hezhangming2008@sina.com
3 Beijing Institute of Control Engineering, Beijing, P. R. China
email: gfkd_zhy@sina.com
Abstract
The improvement of detection and diagnosis capability for unanticipated faults is a trend in the research and application of fault diagnosis. In this paper, some notions and basic principles for unanticipated fault detection and diagnosis are given. A general process model for the diagnosis of unanticipated faults is designed by adopting a three-layer progressive structure, which is comprised of an inherent detection layer, an unanticipated isolation layer and an unanticipated recognition layer. Several key problems in the general process model are analyzed. The model and methods proposed in this paper are purely data-driven, and they can detect and diagnose unanticipated faults. The approach is evaluated using the example of a satellite's attitude control system, and excellent results have been obtained.

1 Introduction
At present, in the research field of fault diagnosis, the great majority of proposed methods are based on the premise of a complete fault pattern database, and fault detection and diagnosis are carried out for anticipated faults (AF) [1-3]. However, due to the high complexity and uncertainty of the technical structure, the process environment and the working state of the system, the occurrence of faults which cannot be anticipated in advance (Unanticipated Fault, UF) is inevitable in actual operation [4]. A UF is not included in the anticipated fault database, and its occurrence affects the normal operation of the system and may even lead to complete system failure. Improving unanticipated fault detection and diagnosis (UFDD) capability is a difficult issue, as well as a developing direction in the research and application of fault diagnosis [5-8].

In the existing literature, rather little attention has been paid to UF detection and diagnosis. Consequently, no mature solution scheme has taken shape, either for the problem itself or for its technical realization [9-12]. Most research on UF focuses on the recognition of, and matching between, different patterns based on a known fault pattern database [13-14]. For example, Tom Brotherton and Tom Johnson (2001) [15] proposed a neural network anomaly detector, which was essentially a single neural network classifier and could not identify the UF. Z. H. Duan (2006) [16] proposed that UF diagnosis be carried out by utilizing a particle filter for incomplete patterns; as the transmission mechanism of the UF could not be obtained in advance, UF diagnosis could not be realized based on model inference. George Vachtsevanos et al. (2008) [17] proposed a robust UF detection method; however, isolation of the UF could not be realized. Furthermore, Z. M. He (2012) [18] proposed a one-class principal component analysis (OC-PCA) method, which could only be used for processing systems with stable data in a normal pattern, and did not relate to UF diagnosis at all. The majority of currently published articles involve only UF detection; the fault isolation between the UF and the AF, as well as the recognition (i.e. identification) of the UF, has not yet been performed.

For an actual system, impacts such as nonlinearity, uncertainty and external interference are inevitable in its operation, which makes it difficult to set up a precise model of the system. Consequently, the applicability of fault detection and diagnosis methods based on model inference is very limited [19-20]. With the development of sensor technology, the input and output data, and the system's status under real-time monitoring, are easier to obtain. The data are redundant, real-time and reliable. As a result, the fault diagnosis ideology of extracting information from data, instead of establishing a system model, will play a positive role.

This paper proposes a data-driven fault diagnosis method for UF. Combined with the fault diagnosis process, a general process model (GPM) is advanced, which is comprised of an inherent detection layer (IDL), an unanticipated isolation layer (UIL) and an unanticipated recognition layer (URL). Firstly, according to the different characteristics of the monitoring data, the corresponding residual statistics are built and a detection criterion of the IDL is provided for fault detection. Secondly, a statistic of angle similarity is constructed on the basis of the fault feature direction, and the isolation between the UF and the AF is realized in the UIL. Finally, in the URL, by the adoption of the contribution factor, the UF is recognized. The method, as a fault diagnosis method driven purely by data, is capable of carrying out detection, isolation and recognition of the UF.

The paper is organized as follows. In Section 2, some notions and basic principles for UF and UFDD are discussed. A three-layer GPM for UFDD is introduced in Section 3. Section 4 analyzes some key problems in the GPM and advances the corresponding solutions. In Section
5, performance evaluation of the proposed GPM and methods on a satellite's attitude control system is presented. Conclusions are drawn in Section 6.

2 Notions and Basic Principles for UFDD
2.1 Notion of UF
Faults can be divided into anticipated faults (AF) and unanticipated faults (UF).

Explanation 1: An anticipated fault (AF) is a fault which has been recognized by people, existing in the fault pattern database together with the relevant monitoring data and the processing strategy.

Explanation 2: An unanticipated fault (UF) is a fault which lacks prior knowledge, without any fault samples or with few fault data. A UF does not exist in the fault pattern database, and the corresponding elimination strategy for it has not been devised.

A perfect fault pattern database should be a set including all AF patterns and UF patterns. However, for various practical reasons, the acquisition of a perfect fault pattern database is extremely difficult. The AF rarely occurs, and most of the faults occurring in the actual working process are UF [21]. At present, detecting and, moreover, diagnosing the UF is one of the most difficult issues in the fault diagnosis field, and it is also a great challenge for fault diagnosis technology.

2.2 Notion for UF Detection
Explanation 3: UF detection is a process for judging whether a UF occurs.

The tasks of UF detection and AF detection are different. Both methods apply previous normal monitoring data to train a discriminator, and then the current monitoring data is used as the testing data, input into the discriminator to judge whether the current status is a fault. However, UF detection is carried out after the completion of fault detection, further judging whether the fault is a UF. Obviously, for AF detection, all faults are always assumed to be anticipated. Consequently, if a UF occurs, it will be misjudged as a certain anticipated fault.

2.3 Notion for UF Diagnosis
Explanation 4: UF diagnosis is a process of determining whether a UF occurs (i.e. UF detection). In addition, UF diagnosis further includes the isolation and the recognition of the UF after UF detection is completed.

Compared with AF diagnosis, due to the lack of prior knowledge of the UF, the mapping relationship from fault data to fault part (essentially, the fault pattern is a function between fault data and fault part) cannot be found. Therefore, the key for UF diagnosis is to quickly establish a cognition process. The cognition comprises the recognition of superficial data characteristics or the mapping recognition from data to a physical layer. Based on a fault diagnosis method driven purely by data, this paper focuses on the recognition of superficial data characteristics.

3 General Process Model (GPM) for UFDD
By combining the notions and basic principles of the UF and UFDD, this paper proposes a multi-layer general process model (GPM) for UF diagnosis on the basis of a purely data-driven method. The structure of the GPM is shown in Figure 1. The first layer is the IDL, which establishes a detection discriminator for fault detection; the second layer is the UIL, which applies the detection residual to establish a fault feature direction so as to build an isolation discriminator to realize the isolation of the AF and the UF; the third layer is the URL, which applies a contribution factor to analyze the variable which is most relevant to the current UF and to realize the fault recognition based on superficial data characteristics.

[Figure 1: The GPM for UFDD]

3.1 Inherent Detection Layer (IDL)
The first issue that a diagnosis system faces is to carry out normal/abnormal recognition for a feature vector of the monitoring data. The task of the IDL is to determine whether the monitoring data is normal or abnormal. The detection discriminator can be used to reflect the characteristics of the normal system. Given a threshold, the testing data is input to the detection discriminator to judge whether a fault exists. If the value of the discriminator is smaller than the given threshold, the system is considered normal; otherwise, a fault is considered to have occurred. Meanwhile the occurrence time (fault time) and the feature direction of the fault (current fault direction) should be determined, and the testing data is presented to the UIL.

Essentially, the IDL is a single discriminator, which can be applied to capture the characteristics of the system in the normal pattern as well as to complete the detection and discrimination of the testing data. Two key problems are involved: the first is residual generation and the second is residual evaluation. The specific techniques can be seen in Section 4.1.
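The layered flow just described (IDL detection, UIL isolation against the AF pattern database, URL learning of a new pattern) can be sketched as follows. This is a minimal illustration only: the magnitude test, the cosine matching rule and both thresholds are simplifying assumptions standing in for the statistics developed in Section 4.

```python
# A toy sketch of one GPM step: IDL -> UIL -> URL.
# All function names and thresholds are illustrative assumptions.
import math

def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def gpm_step(residual, detect_threshold, af_patterns, match_threshold=0.9):
    """Run one monitoring sample through the three layers.

    residual         -- feature/residual vector of the current sample
    detect_threshold -- IDL threshold on the residual magnitude
    af_patterns      -- list of known anticipated fault directions
    Returns (status, pattern_index_or_None).
    """
    # IDL: normal/abnormal discrimination on the residual magnitude.
    if math.sqrt(sum(r * r for r in residual)) <= detect_threshold:
        return ("normal", None)
    # UIL: match the current fault direction against every AF pattern
    # (abs() treats a direction and its opposite as one pattern).
    for i, xi in enumerate(af_patterns):
        if abs(cosine(residual, xi)) >= match_threshold:
            return ("anticipated", i)
    # URL: nothing matched, so this is an unanticipated fault; its
    # direction is learned as a new pattern.
    af_patterns.append(residual)
    return ("unanticipated", len(af_patterns) - 1)

patterns = [[1.0, 0.0, 0.0]]
print(gpm_step([0.01, 0.0, 0.0], 0.1, patterns))  # -> ('normal', None)
print(gpm_step([2.0, 0.05, 0.0], 0.1, patterns))  # -> ('anticipated', 0)
print(gpm_step([0.0, 1.5, 0.0], 0.1, patterns))   # -> ('unanticipated', 1)
```

The third call shows the learning side effect: the unmatched direction is appended to the pattern database, so a recurrence of the same fault would later be isolated as anticipated.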
Proceedings of the 26th International Workshop on Principles of Diagnosis
3.2 Unanticipated Isolation Layer (UIL)
The task of the UIL is to carry out the isolation between the UF
and the AF. After detection, the current fault shall be judged as
to whether it is an AF or a UF. If it is an AF, the current fault
will be classified as some sort of AF. All AF patterns are saved
in the AF pattern database. The isolation discriminator matches
the feature of the current fault pattern with those of all the AF
patterns successively, so as to realize the isolation between the
UF and the AF. If the feature of the current fault cannot be
matched with any AF pattern, it indicates that a UF has occurred,
and the testing data is passed to the URL. The key problem of the
UIL lies in the establishment of an isolator and the design of an
isolation criterion. The specific techniques can be seen in
Section 4.2.

3.3 Unanticipated Recognition Layer (URL)
The task of the URL is to perform online learning and analysis of
the UF data, so as to generate the fault pattern. The function of
the URL is to learn and summarize the pattern found in the
unknown data. As a UF is different from an AF, it is difficult to
find the mapping relationship from the fault data to the fault
part. Therefore, the key point of recognition lies in
establishing the corresponding relationship between the data and
the unknown fault. Due to insufficient knowledge of the UF and
the lack of historical information and prior knowledge, it is
usually more difficult to establish the mapping relationship on
the physical layer. The key point of this paper is to analyze UF
recognition based on the superficial data layer. According to the
contribution factor, the variable which is most relevant to the
current UF can be found, so that the UF recognition is completed.
The specific techniques can be seen in Section 4.3.

4 Some Key Problems in GPM
In the above section, a basic framework of UF diagnosis is
provided. The task of UF diagnosis is to detect, isolate and
recognize the UF. The detection is the starting point of fault
diagnosis, and the target of fault detection is to judge whether
a UF has occurred; the isolation is the core of fault diagnosis;
and the recognition is the terminal point of fault diagnosis.
Additionally, the recognition is also the starting point of
fault-tolerant control (fault processing). The specific
techniques for detecting, isolating and recognizing the UF are
given below.

4.1 Detection Statistic Construction
Just as Section 3 shows, the basic task of the IDL is to judge
whether the testing data is normal. If there is a fault, the
occurrence time and the feature direction of the fault shall be
determined at the same time. The key point of the IDL lies in the
detection residual generation as well as the residual evaluation.
The detection statistic is established according to the residual,
and the fault detection is performed according to the given
criterion. For different monitoring data, different residual
generation approaches exist, including simple T2 detection
[18, 22], baseline data smoothing detection [23], and time-series
modeling and prediction detection [24-25].
The characteristics of the monitoring system and the monitoring
data can be used to select the corresponding detection method.
The simple T2 statistic detection is applied to stable data [22].
The baseline data smoothing detection is suitable for systems
capable of obtaining baseline data; its calculation amount is
small, the detection speed is fast, and the detection effect is
the best [23]. The time-series modeling prediction is suitable
for systems with continuous output and without input; it is also
suitable for iterative updating of the pattern, while its defect
is that the prediction horizon is short [25].
Besides, for the three methods analyzed above, only the
characteristics of the data output are considered. However, for
some systems (such as the satellite's attitude control system),
the object of the fault detection always comprises the control
input as well as the measured output, and the control input has a
certain response relationship with the measured output. In the
situation where there is no baseline training data, an
input-output system identification method is needed to search for
a model structure of the system, and thus the fault detection on
both control input and measured output will be performed in the
IDL.
Assume that $(U_{n-1}, Y_{n-1}) \in (R^{(n-1)\times p}, R^{(n-1)\times m})$
are respectively the system input and system output before the
nth time period, taken as the training data, and that
$(u_n, y_n) \in (R^{1\times p}, R^{1\times m})$ is the current
testing data. The training purpose is to find the model structure
of the system, usually with the following rule:

$\min_f \| Y_{n-1} - f(U_{n-1}) \|$   (1)

Let $\hat Y_{n-1} = f(U_{n-1})$ be the tendency term and
$\tilde Y_{n-1} = Y_{n-1} - \hat Y_{n-1} = Y_{n-1} - f(U_{n-1})$
the residual term; $\hat y_n = f(u_n, U_{n-1}, Y_{n-1})$ is the
one-step prediction, and $r_n = y_n - \hat y_n$ is the prediction
residual. Then the key point of the minimization problem in (1)
is to construct the function f between the system input and
system output.
If a mathematical model can be obtained for the system equation
from the physical mechanism, the estimation of f can be converted
into parameter estimation (gray-box model); if there is no
physical background, f can be estimated only from experiments and
system identification (black-box model). Common linear black-box
models comprise the autoregressive model with external input (ARX
model), the autoregressive moving average model with external
input (ARMAX model), the output error model (OE model), the
Box-Jenkins model (BJ model) and the prediction error
minimization model (PEM model); common nonlinear black-box models
comprise the nonlinear autoregressive moving average model with
external input (NLARMA model) and the nonlinear
Hammerstein-Wiener model (NLHW model) [26-29].
After obtaining the prediction residual, the detection statistic
is

$T^2(y_n) = r_n^T \,\mathrm{cov}(\tilde Y)^{-1}\, r_n$   (2)

where $\mathrm{cov}(\tilde Y)$ is the covariance of the residual
term $\tilde Y$, and the judging threshold is set to be

$T_\alpha^2 = \dfrac{m\, n\, (n-2)}{(n-1)(n-1-m)}\, F_{(1-\alpha)}(m,\, n-1-m)$   (3)
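The detection step built from the prediction residuals can be sketched numerically as follows: a Hotelling-type $T^2$ as in Equation (2), confirmed by a consecutive-alarm rule of the kind formalized below as Criterion 1. The synthetic residuals and the fixed threshold 12.0 are illustrative assumptions; in practice the threshold comes from the F quantile of Equation (3) (e.g. `scipy.stats.f.ppf(1 - alpha, m, n - 1 - m)`).

```python
# Sketch of the IDL detection statistic: T^2 on prediction residuals
# plus a W-consecutive-alarm confirmation. Synthetic data, numpy only.
import numpy as np

def t2_statistic(r, cov_inv):
    """T^2(y_n) = r_n^T cov(Y~)^(-1) r_n for one residual vector."""
    r = np.asarray(r, dtype=float)
    return float(r @ cov_inv @ r)

def detect_fault_time(residuals, residual_cov, t2_threshold, W=3):
    """Return the index of the W-th consecutive alarm (the fault time),
    or None if no fault is confirmed."""
    cov_inv = np.linalg.inv(residual_cov)
    consecutive = 0
    for k, r in enumerate(residuals):
        if t2_statistic(r, cov_inv) > t2_threshold:
            consecutive += 1
            if consecutive >= W:
                return k
        else:
            consecutive = 0  # isolated noise alarms are discarded
    return None

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(200, 3))          # healthy residuals
cov = np.cov(train, rowvar=False)
samples = np.vstack([rng.normal(0.0, 1.0, (50, 3)),  # healthy stretch
                     rng.normal(6.0, 1.0, (20, 3))]) # fault from index 50
tf = detect_fault_time(samples, cov, t2_threshold=12.0, W=3)
print("fault confirmed at index:", tf)
```

With W = 3 the confirmation lags the fault onset by a couple of samples, which is the small detection delay traded for fewer false alarms.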
where $F_{(1-\alpha)}(m, n-1-m)$ indicates the quantile of the F
distribution function at significance level $\alpha$ with degrees
of freedom $(m, n-1-m)$.
If $T^2(y_n) > T_\alpha^2$, then $y_{n-1}$ is considered a fault
point. However, a false alarm is inevitable because of noise,
thus we need a more reliable criterion for detection as follows.
Criterion 1: If $T^2(y_n) > T_\alpha^2$ holds continuously for W
times, then the fault has really happened, where W is called the
time threshold. The W-th alarm time is considered the fault time
($t_f$), i.e. the occurrence time of the fault, and the residual
r at the fault time is called the current fault direction, or
current direction (i.e. the feature direction of the fault).
The detection statistic threshold is decided by Equation (3). The
time threshold (usually 2 to 4) is introduced to avoid false
alarms; a larger time threshold makes the decision more reliable,
but it causes some detection delay, which is harmful to the
system. The current fault direction is the key information of
each fault, and it is the basis for fault isolation. According to
Criterion 1, the current fault is detectable if and only if

$\| r_n \| > \sqrt{ T_\alpha^2 \,/\, \big( \bar r_n^{\,T}\, \mathrm{cov}(\tilde Y)^{-1}\, \bar r_n \big) }, \quad \bar r_n = r_n / \| r_n \|$   (4)

In the IDL, the fault detection is realized by adopting the
input-output system identification method. Moreover, the
occurrence time and the feature direction of the fault can also
be obtained. Obviously, the input-output system identification
method has all the advantages of the time-series modeling
prediction method. It is particularly suitable for systems with
discontinuous input and discontinuous output at the same time;
its defects are that the calculation amount is large and the
iteration process is relatively difficult.

4.2 Directional Similarity and Isolation Criterion
The basic task of the UIL is to utilize the feature direction of
the fault obtained in the IDL to establish the isolation
discriminator, and then to realize the isolation between the AF
and the UF. The key point lies in the establishment of the
isolator. Here the concept of direction similarity is introduced,
and a fault isolation criterion is given. In Criterion 1, the
definition of the current fault direction, or current direction
(i.e. the feature direction of a fault), is given. We adopt the
true fault feature direction, as defined below, as the fault's
pattern characteristic on the superficial data layer.
Explanation 5: The true (fault) direction of a fault pattern is
defined as the unified mean of all possible current fault
directions from the same pattern.
The relationship between the current directions and the true
direction is just like that between a discrete random variable
and its expectation. It is easy to understand that

$\xi = \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} r_i \,\Big/\, \Big\| \frac{1}{n} \sum_{i=1}^{n} r_i \Big\|_2$   (5)

$r = \| r \|\, \xi + \varepsilon$   (6)

where $\{ r_i \}_{i=1}^{n}$ are all possible current directions
from the same pattern, $\varepsilon$ is the noise and $\| r \|$
is the magnitude of the current direction.
It is shown in Figure 2 that there are two opposite true
directions for each fault pattern; e.g. the true direction
$\xi_1$ is in the center of a symmetric cone, around which are
the current directions from the same pattern. $\xi_2$ is another
true direction, corresponding to another fault pattern. The
origin of the coordinates can be regarded as the true direction
of the normal pattern.

Figure 2 True directions and current directions

Denote by $\theta(r, \xi)$ the angle between the current
direction and the true direction;
$D_{disc}(r, \xi) = 1 - \cos(\theta(r, \xi))$ is called the
directional discrepancy between them. We can find that if they
are from the same pattern, $D_{disc}(r, \xi)$ will be small;
otherwise, it will be large.
Suppose that $\varepsilon \sim N(0, \Omega)$, the current
direction is $r = \varepsilon + \| r \| \xi$, and
$\{ \xi_i \}_{i=1}^{q}$ are all the anticipated true directions,
with $\xi_{i_0} = \arg\min_{\xi_i} \{ 1 - \cos(r, \xi_i) \}_{i=1}^{q}$;
then the isolation statistic is given as follows:

$\mathrm{Iso}(r) = \dfrac{ \| r \| \big( 1 - \cos(r, \xi_{i_0}) \big) }{ \sqrt{ \xi_{i_0}^{T} \Omega\, \xi_{i_0} } }$   (7)

Theorem 1: If Iso(r) is defined in Equation (7), then

$\mathrm{Iso}(r) \sim N(0, 1)$   (8)

Proof: Suppose that the current direction is
$r = \varepsilon + \| r \| \xi$, where $\xi$ is the true
direction and $\varepsilon$ is the observation noise,
$\varepsilon \sim N(0, \Omega)$. According to Explanation 5 we
have $\| \xi \| = 1$. If $\cos(r, \xi) \ge 0$, we can
approximately obtain

$\cos(r, \xi) = \dfrac{\xi^T r}{\| \xi \| \| r \|} = \dfrac{\xi^T \varepsilon}{\| r \|} + 1 \sim N\big( 1,\; \| r \|^{-2}\, \xi^T \Omega \xi \big)$   (9)

i.e. $\cos(r, \xi)$ satisfies a truncated normal distribution.
Thus

$\| r \| \big( 1 - \cos(\xi_{i_0}, r) \big) \sim N\big( 0,\; \xi_{i_0}^T \Omega\, \xi_{i_0} \big)$   (10)

Similarly, if $\cos(r, \xi) < 0$, we can prove that

$\| r \| \big( 1 + \cos(\xi, r) \big) \sim N\big( 0,\; \xi^T \Omega \xi \big)$   (11)

According to Equations (10) and (11), we obtain

$\| r \| \big( 1 - \cos(\xi, r) \big) \sim N\big( 0,\; \xi^T \Omega \xi \big)$   (12)

Then

$\mathrm{Iso}(r) = \dfrac{ \| r \| \big( 1 - \cos(r, \xi_{i_0}) \big) }{ \sqrt{ \xi_{i_0}^T \Omega\, \xi_{i_0} } } \sim N(0, 1)$   (13)

and thus the theorem is proved. Therefore, the threshold for
Iso(r) is $\Phi_{(1-\alpha)}$, where $\alpha$ is the significance
level and $\Phi$ is the inverse of the normal cumulative
distribution function. We provide the isolation criterion as
follows.
Criterion 2: If $\mathrm{Iso}(r) > \Phi_{(1-\alpha)}$ holds, the
current fault is unanticipated; otherwise, it is anticipated.
Criterion 2 indicates that a UF with too small a magnitude cannot
be isolated. If the current fault is unanticipated, a new fault
pattern is found, and the unified current direction is regarded
as its true direction. If the current fault is anticipated, then
the current direction should be added to the corresponding AF
direction database in the UIL of the GPM, and the true direction
shall be updated.

4.3 Calculation of the Contribution Factor
The basic task of the URL is to carry out online learning and
analysis of the UF data. The key point of recognition or
identification is to establish the corresponding relationship
from the monitoring data to the unknown fault or the
characteristics of the unknown fault. The UF diagnosis discussed
in this paper is an approach driven by pure data, thus the
characteristic recognition on the data layer is the focus.
According to the contribution factor, the variable which is most
relevant to the current UF can be found, and then the UF
recognition is completed.
It is known from Criterion 1 that, after the residual detection
statistic is established, if $T^2(y_n) > T_\alpha^2$ it is
considered that a fault occurs at time period n-1. For a system
with control input and measured output, first the residual
covariance matrix R (i.e. $\mathrm{cov}(\tilde Y)$ in Equation
(2)) is subjected to the singular value decomposition

$R = P^T \mathrm{diag}(\lambda)\, P$   (14)

where $\lambda = (\lambda_1, \ldots, \lambda_m)$,
$P = (p_1, \ldots, p_m)$, $p_i$ indicates the ith column of P,
and $p_{ji}$ indicates the jth component of $p_i$. Let
$t_i = r^T p_i$, and let $r_j$ indicate the jth component of the
current fault feature direction r, where $1 \le j \le m$.
Explanation 6: The contribution factor of the jth variable to the
current fault feature direction r is

$\mathrm{Cont}(j) = \sum_{i=1}^{m} t_i\, r_j\, p_{ji} / \lambda_i$   (15)

From the aspect of characteristic recognition in the data layer,
the variable with the largest contribution factor is the fault
variable. If it is a sensor fault, the sensor corresponding to
the variable with the largest contribution factor is the faulty
sensor hardware.

5 Simulation and Performance Evaluation
The effectiveness of the proposed GPM and of the corresponding UF
detection, isolation and recognition methods is demonstrated in
this section on a satellite attitude control system model.

5.1 Input and Output of the Satellite Control System
The satellite's attitude control system is a main part of a
satellite; it consists of four main parts: a satellite body, a
controller, an execution mechanism and a measuring mechanism
[30]. Due to the complexity of the satellite's attitude control
system, faults, particularly in the measuring mechanism and the
execution mechanism, occur rather frequently. Here we consider
the monitoring data of the satellite's attitude control system,
provided by China Aerospace Science and Technology Corporation
(CASA).
The monitoring data comprise not only the output data of the
measuring mechanism, but also the control input of the execution
mechanism. The dimension of the data output by the measuring
mechanism is m = 7, and the dimension of the data input by the
execution mechanism is p = 4, as shown in Table 1. There are
altogether 10 batches of monitoring data, listed in Table 2. The
first batch is the normal data, and the normal pattern data is
discontinuous and unstable (Figure 3). The subsequent 9 batches
are used for testing, and different fault patterns (a
sudden-change fault, a gradual-change fault and so on) are given.
In Figure 3, the monitoring data with the drift-increasing fault
of the gyro at the roll axis are compared with the normal
pattern. The time span of each batch of data is 45000 s to
48000 s; one data point is collected per second, and the data
length is n = 3000.
Additionally, the public parameters used in the simulation are
assigned as follows: the significance level is alpha = 0.01, and
the time threshold defined in Criterion 1 is W = 3.

Table 1 Data description of the attitude control system

  Group    Variable subscript   Code         Description
  Input    1                    Wheel1       Output of the first momentum wheel
  Input    2                    Wheel2       Output of the second momentum wheel
  Input    3                    Wheel3       Output of the third momentum wheel
  Input    4                    Wheel4       Output of the fourth momentum wheel
  Output   1                    EarthPhi     Output of earth sensor at roll axis
  Output   2                    EarthTheta   Output of earth sensor at pitch axis
  Output   3                    SunPhi       Output of sun sensor at roll axis
  Output   4                    SunTheta     Output of sun sensor at pitch axis
  Output   5                    GeoPhi       Output of gyro at roll axis
  Output   6                    GeoTheta     Output of gyro at pitch axis
  Output   7                    GeoPsi       Output of gyro at yaw axis

Table 2 Batch numbers of the monitoring data

  Batch number   Data description                                          Fault time
  1              Normal data                                               Null
  2              Sudden-change fault data of earth sensor at roll axis     46000s
  3              Gradual-change fault data of earth sensor at roll axis    46000s
  4              Sudden-change fault data of earth sensor at pitch axis    46000s
  5              Gradual-change fault data of earth sensor at pitch axis   46000s
  6              Loss fault data of sun sensor at roll axis                46000s
  7              Loss fault data of sun sensor at pitch axis               46000s
  8              Drift-increasing fault data of gyro at roll axis          46000s
  9              Drift-increasing fault data of gyro at pitch axis         46000s
  10             Drift-increasing fault data of gyro at yaw axis           46000s

5.2 Performance Evaluation
The monitoring data are relatively complex, comprising the output
data of the measuring mechanism and the control input of the
execution mechanism (see Table 1). The normal pattern data is
discontinuous and unstable (see Figure 3), and the fault patterns
are diversified (sudden-change faults, gradual-change faults and
so on). Therefore, the normal pattern data is difficult to
discriminate from the fault pattern data (see Figure 3).
With the input-output system identification method, the
Hammerstein-Wiener model (NLHW) is adopted: Equation (1) is
optimized and the response function f between the input and
output is estimated. For the drift-increasing fault data of the
gyro at the roll axis (batch number 8 in Table 2), the detection
result of the IDL is
141
Proceedings of the 26th International Workshop on Principles of Diagnosis
given in Figure 4, from which it can be seen that the fault
detection is timely, the detection effect is remarkable, and the
4 s detection delay is caused by the time threshold W = 3.

Figure 3 Drift-increasing fault of gyro at roll axis (the blue
line shows the output in the normal pattern while the green line
shows the output in the fault pattern; panels: es-x, es-y, ss-x,
ss-y, w-x, w-y, w-z, T-wheel-1 to T-wheel-4, and x/y/z-esti-attitude)

By adopting the input-output system identification method, the
detection results in the IDL for the data in Table 2 are shown in
Table 3. The fault detection is timely, and the detection effect
is obvious: both the FAP (false alarm probability) and the MAP
(missing alarm probability) are much lower.
In the IDL, the fault detection can be realized, and the fault
time and the current fault direction are also determined. In the
UIL, Criterion 2 is adopted to realize the isolation between the
UF and the AF. In the initial stage, the AF pattern database is
assumed to be empty; therefore, when the second batch of data in
Table 2 is fed into the UIL, the detected fault must be a UF, and
the isolation result is then transferred into the URL. When the
third batch of data in Table 2 is fed into the IDL, the fault
time is t = 1001 s, the statistic of the directional similarity
is
$\| r \| \big( 1 - \cos(r, \xi_1) \big) / \sqrt{ \xi_1^T R\, \xi_1 } = 7.3179$,
and the isolation threshold of the UF is
$\Phi_{0.99} = 2.3263$. Obviously
$\| r \| \big( 1 - \cos(r, \xi_1) \big) / \sqrt{ \xi_1^T R\, \xi_1 } > \Phi_{0.99}$,
so the current fault pattern is different from the first fault
pattern and a UF has occurred. The UF is then transferred into
the URL. The fault isolation results for all the tested data in
Table 2 can be seen in Table 4. From Table 4, we know that the
isolator based on the fault feature direction and the direction
similarity is valid, and the isolation between the UF and the AF
can be truly realized.

Figure 4 The detection result (with the input-output system
identification method) for the drift-increasing fault data of the
gyro at the roll axis (fault time $t_f$: 1004, $\ln(T^2)$: 5.483)
Table 3 Unanticipated fault diagnosis - Inherent Detection Layer (IDL)

  Batch    Normal     FAP    MAP    Fault       Current fault direction
  number   or Fault   (%)    (%)    time (s)
  1        N          5      0      -           ( 0, 0, 0, 0, 0, 0, 0 )
  2        F          3      2      1000+2      ( 0.9876, -0.0042, 0.041, -0.053, 0.0453, -0.1342, 0.0678 )
  3        F          4      1      1000+1      ( -0.9997, 0.0005, -0.034, 0.049, 0.0049, -0.0036, 0.0222 )
  4        F          5      1      1000+2      ( -0.1510, -0.9747, -0.0097, 0.0105, 0.0442, -0.1550, 0.0345 )
  5        F          4      1      1000+2      ( -0.0018, 1.0000, 0.0007, 0.0006, -0.0009, -0.0022, -0.0077 )
  6        F          5      1      1000+2      ( 0.0086, -0.0093, -0.9752, 0.0046, -0.0007, 0.0003, 0.0008 )
  7        F          3      2      1000+3      ( -0.0067, 0.0052, 0.0016, -0.9925, -0.1553, 0.0028, -0.0016 )
  8        F          5      1      1000+4      ( -0.0769, 0.0051, 0.0037, 0.0018, 0.9682, -0.0139, -0.0549 )
  9        F          3      1      1000+2      ( -0.0742, 0.0215, -0.0029, 0.0016, 0.0454, -0.9968, 0.0447 )
  10       F          3      1      1000+2      ( 0.0627, -0.0201, -0.0079, 0.0086, -0.0476, -0.0441, -0.9849 )
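The UIL decision that turns the current fault directions of Table 3 into the isolation results of Table 4 can be sketched numerically as follows. This is a minimal illustration with invented numbers, not the satellite data; the noise covariance Omega and the known direction are assumptions, and the absolute value in the cosine treats a direction and its opposite as one pattern, per the two opposite true directions noted in Section 4.2.

```python
# Sketch of the isolation statistic Iso(r) of Equation (7) and the
# Criterion-2 decision Iso(r) > Phi^{-1}(1 - alpha). numpy only.
import numpy as np

def iso_statistic(r, true_dirs, Omega):
    """Iso(r) = ||r|| (1 - cos(r, xi_i0)) / sqrt(xi_i0^T Omega xi_i0),
    where xi_i0 is the best-matching anticipated true direction."""
    r = np.asarray(r, dtype=float)
    norm_r = np.linalg.norm(r)
    best, best_disc = None, np.inf
    for xi in true_dirs:
        xi = np.asarray(xi, dtype=float)
        # directional discrepancy; abs() folds xi and -xi together
        disc = 1.0 - abs(r @ xi) / (norm_r * np.linalg.norm(xi))
        if disc < best_disc:
            best, best_disc = xi, disc
    return norm_r * best_disc / np.sqrt(best @ Omega @ best)

Omega = 0.01 * np.eye(3)                 # observation-noise covariance
known = [np.array([1.0, 0.0, 0.0])]      # one anticipated true direction
phi_099 = 2.3263                         # Phi^{-1}(0.99), as in the text

same = np.array([5.0, 0.02, -0.01])      # close to the known direction
new = np.array([0.1, 4.0, 0.0])          # far from every known direction
print(iso_statistic(same, known, Omega) > phi_099)   # False: anticipated
print(iso_statistic(new, known, Omega) > phi_099)    # True: unanticipated
```

A fault aligned with a stored pattern yields a small statistic and is declared anticipated; a direction far from every stored pattern exceeds the quantile and is declared unanticipated.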
Table 4 Unanticipated fault diagnosis - Unanticipated Isolation Layer (UIL)

  Batch    Anticipated or    Fault pattern    Updated true fault direction
  number   Unanticipated     code
  1        Null              0                ( 0, 0, 0, 0, 0, 0, 0 )
  2        U                 1                ( 1, -0.0043, 0.0415, -0.0537, 0.0459, -0.1359, 0.0687 )
  3        U                 2                ( -1, 0, -0.0340, 0.049, 0, 0, 0.0223 )
  4        U                 3                ( -0.1549, -1, -0.01, 0.0108, 0.0453, -0.1590, 0.0354 )
  5        U                 4                ( -0.0018, 1, 0, 0, 0, -0.0022, -0.0077 )
  6        U                 5                ( 0.0088, -0.0095, -1, 0.0047, 0, 0, 0 )
  7        U                 6                ( -0.0068, 0.0052, 0.0016, -1, -0.1565, 0.0028, -0.0016 )
  8        U                 7                ( -0.0794, 0.0053, 0.0038, 0.0019, 1, -0.0144, -0.0567 )
  9        U                 8                ( -0.0744, 0.0216, 0.0029, 0.0016, 0.0455, -1, 0.0447 )
  10       U                 9                ( 0.0637, -0.0204, -0.0080, 0.0087, -0.0483, -0.0448, -1 )
After isolating the UF, the recognition of the UF should be
carried out on the data layer. For the data in Table 2, the
recognition result is that the fault feature direction is
$(0.9876, -0.0042, 0.041, -0.053, 0.0453, -0.1342, 0.0678)^T$.
The variable with the largest contribution factor is the first
dimension. According to Explanation 6, the contribution factor
reaches 97 percent, which indicates that the fault occurs in the
earth sensor at the roll axis. Similarly, the results of the UF
recognition in the URL for the other batches of data are shown in
Table 5. From Table 5, the recognition of the fault variable
corresponding to each UF is correct, and the UF recognition on
the data layer is achieved.

Table 5 Unanticipated fault diagnosis - Unanticipated Recognition Layer (URL)

  Batch    Anticipated or    Fault pattern    Variable subscript
  number   Unanticipated     code             (as in Table 1)
  1        Null              0                0
  2        U                 1                1
  3        U                 2                1
  4        U                 3                2
  5        U                 4                2
  6        U                 5                3
  7        U                 6                4
  8        U                 7                5
  9        U                 8                6
  10       U                 9                7

6 Conclusion
This paper takes the UF as the main diagnosis object. A
data-driven detection and diagnosis method for UFs has been
researched, and the GPM for UF diagnosis has been designed. The
GPM comprises the IDL, the UIL and the URL, and provides a
framework for UF diagnosis. For systems with both control input
and measured output, the system identification detection method
corresponding to the IDL has been provided. The current fault
feature direction and the feature directions of the AF patterns
have been used to establish the statistic of directional
similarity, and the isolation between the AF and the UF has been
realized in the UIL. Based on the singular value decomposition,
the fault contribution factor of each variable has been obtained,
and the fault recognition on the data layer has been completed.
The application to fault diagnosis of a satellite's control
system has demonstrated the validity of the approach.
Our research shall be furthered in two directions. Firstly, based
on the framework of the GPM, fault detection, isolation and
recognition methods founded on model inference shall be
researched. Secondly, the GPM and its methods shall be applied to
the diagnosis of other complex systems for both military and
civil use.

Acknowledgments
This work was supported in part by the National Natural Science
Foundation of China (NSFC) under Grant No. 61304119. Besides, we
would like to especially thank China Aerospace Science and
Technology Corporation (CASA) for providing the satellite control
system data.

References
[1] P. Nomikos, J. F. MacGregor. Monitoring of batch processes
using multiway principal component analysis. AIChE Journal, 1994,
40(8): 1361-1375.
[2] R. Isermann. Supervision, fault detection and fault diagnosis
methods - an introduction. Control Engineering Practice, 1997,
5(5): 639-652.
[3] D. M. J. Tax. One-class classification. Ph.D. thesis, Delft
University of Technology, the Netherlands, 2001.
[4] S. Gayaka, B. Yao. Accommodation of unknown actuator faults
using output feedback-based adaptive robust control.
International Journal of Adaptive Control and Signal Processing,
2008, 25(11): 965-982.
[5] P. Smyth. Markov monitoring with unknown states. IEEE Journal
on Selected Areas in Communications, 1994, 12(9): 1600-1610.
[6] M. Hofbaur, B. C. Williams. Hybrid diagnosis with unknown
behavioral modes. Proceedings of the 13th International Workshop
on Principles of Diagnosis (DX-02), May 2002.
[7] V. J. Hodge, J. Austin. A survey of outlier detection
methodologies. Artificial Intelligence Review, Kluwer Academic
Publishers, Vol. 22, 2004: 85-124.
[8] A. Patcha, J. M. Park. An overview of anomaly detection
techniques: existing solutions and latest technology trends.
Computer Networks, 2007, 51: 3448-3470.
[9] K. Kojima, K. Ito. Autonomous learning of novel patterns by
utilizing chaotic dynamics. IEEE International Conference on
Systems, Man, and Cybernetics (SMC '99), 1999, 1: 284-289.
[10] P. Perner. Concepts for novelty detection and handling based
on a case-based reasoning process scheme. Springer-Verlag, Berlin
Heidelberg, 2007.
[11] S. Singh, H. Tu, W. Donat. Anomaly detection via
feature-aided tracking and hidden Markov models. IEEE
Transactions on Systems, Man, and Cybernetics, Part A: Systems
and Humans, 2009, 39(1): 144-159.
[12] C.-F. Lin. Predictive fault diagnosis system for intelligent
and robust health monitoring. AIAA Infotech@Aerospace, 20-22
April 2010, Atlanta, Georgia.
[13] E. Sobhani-Tehrani, H. A. Talebi, K. Khorasani. Neural
parameter estimators for hybrid fault diagnosis and estimation in
nonlinear systems. IEEE International Conference on Systems, Man
and Cybernetics, Montreal, 7-10 October 2007: 3171-3176.
[14] A. Barua. Hierarchical fault diagnosis and health monitoring
in satellites formation flight. IEEE Transactions on Systems, Man
and Cybernetics, Part C: Applications and Reviews, 2011, 41(2):
223-239.
[15] B. Tom, J. Tom. Anomaly detection for advanced military
aircraft using neural networks. IEEE Aerospace Conference
Proceedings, 2001, 6: 3113-3134.
[16] Z. H. Duan. Theoretic and methodological research on fault
diagnosis of mobile robots based on adaptive particle filters.
Ph.D. thesis, Central South University, 2007: 63-89 [in Chinese].
[17] B. Zhang, C. Sconyers, C. Byington, R. Patrick, M. Orchard.
Anomaly detection: a robust approach to detection of
unanticipated faults. International Conference on Prognostics and
Health Management, 6-9 October 2008, Denver, CO: 1-8.
[18] Z. M. He, H. Y. Zhou, J. Q. Wang. Model for unanticipated
fault detection by OCPCA. Advanced Materials Research, Vols.
591-593, 2012: 2108-2113.
[19] J. Chen, R. J. Patton. Robust model-based fault diagnosis
for dynamic systems. Boston: Kluwer Academic Publishers, 1999.
[20] B. Zhang, C. Sconyers, C. Byington. A probabilistic fault
detection approach: application to bearing fault detection. IEEE
Transactions on Industrial Electronics, 2010, 58(5): 2011-2018.
[21] P. Sens. An unreliable failure detector for unknown and
mobile networks. OPODIS 2008, LNCS 5401, 2008: 555-559.
[22] A. M. Bartkowiak. Anomaly, novelty, one-class
classification: a short introduction. International Conference on
Computer Information Systems and Industrial Management
Applications (CISIM), Wrocław, Poland, 8-10 October 2010: 1-6.
[23] F. N. Zhou. Extended DCA method for unknown multiple faults
diagnosis. Journal of Huazhong University of Science & Technology
(Natural Science Edition), 2009, 37(4): 84-94 [in Chinese].
[24] N. Gebraeel, J. Pan. Prognostic degradation models for
computing and updating residual life distributions in a
time-varying environment. IEEE Transactions on Reliability, 2008,
57(4): 539-550.
[25] Z. M. Wang, D. Y. Yi, X. J. Duan. Measurement data modeling
and parameter estimation. CRC Press, 2011.
[26] A. Wills, B. Ninness. On gradient-based search for
multivariable system estimates. IEEE Transactions on Automatic
Control, 2008, 53(1): 298-306.
[27] E. Wernholt, S. Moberg. Nonlinear gray-box identification
using local models applied to industrial robots. Automatica,
2011, 47(4): 650-660.
[28] L. Ljung. System identification: theory for the user.
Linköping University, Sweden, 1998.
[29] I. Goethals, K. Pelckmans, J. A. K. Suykens, B. De Moor.
Subspace identification of Hammerstein systems using least
squares support vector machines. IEEE Transactions on Automatic
Control, 2005, 50(10): 1509-1519.
[30] S. C. Tu. Satellite attitude dynamics and control. Beijing:
Chinese Astronautic Publishing House, 2003: 125-168 [in Chinese].
A SCADA Expansion for Leak Detection in a Pipeline∗
Rolando Carrera∗ Cristina Verde∗ and Raúl Cayetano∗∗
∗
Universidad Nacional Autónoma de México, Instituto de Ingeniería
e-mail: rcarrera@unam.mx, verde@unam.mx
∗∗
Universidad Nacional Autónoma de México, Posgrado de Ingeniería
email: rcayetanos@ii.unam.mx
Abstract

A solution for expanding an already existing pipeline SCADA for real time leak detection is presented. The work consisted in attaching an FDI scheme to an industrial SCADA that regulates liquid distribution from its source to the end user. For isolation of the leak, a lateral extraction is proposed instead of the traditional pressure profile of the pipeline. The friction value is a function of the pipe's physical parameters, but on-line friction estimation achieved better results. Aspects that were important in the integration of the FDI scheme into the SCADA were the non-synchrony of the pipeline variables (flow, pressure) and their accessibility, which led to data extrapolation and the use of database techniques. The vulnerability of the location algorithm to sensor bandwidth and sensitivity is shown, hence the importance of selecting the sensors carefully. The FDI scheme was programmed in LabVIEW and executed on a personal computer.

1 Introduction

Leak detection and isolation in pipelines is an old problem that has attracted the attention of the scientific community for decades. A paradigmatic example is the oil leakage in the Siberian region [1], where the effects on the surrounding nature have been disastrous. In Mexico, a semi-desert country, there is the need to transport water to the population over long distances via aqueducts; this requires complex supervision systems that detect leakages early. There also exists a complex net of pipelines that transport oil and its by-products; in this net, besides the leakage problem, there is also the illegal extraction of the product transported in the pipeline; this forces the distribution system to have a leak detection and location monitoring system.

Since the 1970s several works have been issued that have been fundamental for the detection and location of leakages, such as the one by Siebert [2], where, on the basis of the steady state pressure profile along the pipeline, simple expressions based on correlations are derived that detect and locate a leakage. Later, Isermann [3] published a survey showing the state of the art on fault detection using the plant model and parameter identification. Recently, Verde published a book [4] with emphasis on signal processing, pattern recognition and analytical models for failure diagnosis.

But all the above is purely academic; our aim here is to share some of our practical experiences acquired during a re-engineering project that consisted in adding a real time leak detection and location layer to an already existing SCADA. The original objectives of that SCADA were the administration and delivery of some products, through pipelines, from the source to the end user. As this was our first approach to integrating FDI into an existing SCADA and we had no experience on this subject, we proposed a solution that involves simple algorithms for detecting and locating a leak. In future work we will use more elaborate algorithms, such as dedicated observers, or detect two simultaneous leaks.

In order to show how we met the targets of the project, we divided the solution into five major parts (each one covered in sections 2 to 6 below). Some of them are extracted from available theory, such as the dynamical model for flow in a pipe and the expression for leak location, and others are a consequence of the experience achieved in our lab facilities, such as the calculation of the pipe friction and the choice of sensors, and finally the data acquisition imposed by the nature of the available SCADA.

Delivering a fluid to clients means steady operation, so our solution required a suitable model for that condition; section two describes how to obtain a simple steady state model for a pipeline. Once the model is at hand, an appropriate expression for leak location is needed; for that purpose, section three presents a simple method for locating a leak. From our experience, pipe friction plays a fundamental role in the exact location of the leak, and real time estimated friction is better than a constant value fixed beforehand; an on-line expression for calculating the pipeline friction is shown in section four. In this project we did not have the option of choosing sensors, but we consider it appropriate to share our experience in this matter; a comparative study of how different types of sensors affect the leak location is presented in section five. The data acquisition system of the SCADA is based on a MODBUS system and a database with the information of the pipe variables; we did not have the right to get into the MODBUS, only into the database. Section six shows how the issue of indirect measurement of the pipe variables was solved by using ethernet and databases; also, the extrapolation of missing data during sample times is presented. Finally, the concluding remarks of this work are presented in section seven.

∗ Supported by II-UNAM and IT100414-DGAPA-UNAM.
2 Pipeline steady state model

In most applications a dynamical model of the system is required, but not here: because of the steady operation of the pipeline, a steady state model is more suitable. Besides, the pipeline lies buried in the field and has an irregular topography, but it is possible to derive a model that handles it like a horizontal one. This model is simpler, as will be shown.

In the following we modify the model of a pipeline with the topographical profile shown in Figure 1 into one with a flat piezometric head profile, where the pressure variable depends on a reference value h, the height over sea level along the pipeline. Consider the one-dimensional simplified flow model in a pipeline with n sections [5],

\frac{1}{A^i}\frac{\partial Q^i(z^i,t)}{\partial t} + g\frac{\partial H^i(z^i,t)}{\partial z^i} + \frac{f^i\, Q^i(z^i,t)\,|Q^i(z^i,t)|}{2D^i(A^i)^2} + g\sin\alpha^i = 0   (1)

\frac{\partial H^i(z^i,t)}{\partial t} + \frac{b^2}{gA^i}\frac{\partial Q^i(z^i,t)}{\partial z^i} = 0   (2)

which assumes that the fluid is slightly compressible, the pipe walls are slightly deformable, and convective changes in velocity are negligible. Q is the volumetric flow, H is the pressure head, A is the pipe cross-sectional area, g is gravity, f^1 is the D'Arcy-Weissbach friction [6], b is the velocity of the pressure wave, D is the pipe diameter, z is the distance variable and t the time. The superindex i = 1, 2, ..., n indicates the pipeline section characterized by its slope with angle \alpha^i; n is the total number of sections.

Figure 1: 60 km pipeline topographical layout (height over sea level [m] vs. length [m]; sensor locations marked)

We start with the following hypothesis: the system works in steady state and the pipeline lies on a horizontal surface. Therefore we need a steady state model that takes these conditions into account.

In order to describe the behaviour of the pressure head H^i(z^i,t) along a section without branches, steady state flow is assumed, so from (2) one gets

\frac{\partial Q^i(z^i,t)}{\partial z^i} = 0 \;\Rightarrow\; Q^i \text{ constant.}   (3)

Combining (1) and (2),

\frac{dH^i(z^i)}{dz^i} + M^i(Q^i) = 0,   (4)

with

M^i(Q^i) = \mu^i Q^i|Q^i| + \sin(\alpha^i) = m^i(Q^i) + \sin(\alpha^i)   (5)

which is independent of the spatial coordinate z^i, and \mu^i := f^i/2D^i(A^i)^2 g. Then the solution of (4) reduces to

H^i(z^i) = -M^i(Q^i)z^i + H^i(0) \quad \text{for } 0 \le z^i \le L^i   (6)

with H^i(0) the pressure head at the beginning of section i. Defining boundary conditions for section i in terms of the pressure at the ends,

H^i(z^i = 0) := H^i_{in}, \qquad H^i(z^i = L^i) := H^i_{out},   (7)

with (7) in (6) we obtain

H^i_{in} - H^i_{out} = M^i(Q^i)L^i = m^i(Q^i)L^i + \Delta H_i,   (8)

where \Delta H_i = L^i\sin(\alpha^i) is the height difference between the section ends.

It is reported in [7] and [8] that the pressure head

H^i(z^i) = \frac{P^i(z^i)}{\rho g}   (9)

can be written in terms of the piezometric head \tilde{H}^i(z^i), which depends on a height h that can be related to sea level, i.e.

\tilde{H}^i(z^i) = H^i(z^i) + h(z^i),   (10)

with h(z^i) in m over the reference datum or sea level, and \rho the fluid density. Then the pressure profile (8) is equivalent to

\tilde{H}^i_{in} - \tilde{H}^i_{out} = m^i(Q^i)L^i   (11)

for section i and sea level h(z^i) along the section. Finally, considering that the boundary conditions are related by

\tilde{H}^i_{out} = \tilde{H}^{i+1}_{in},   (12)

from this equation and (11) one gets

\tilde{H}^1_{in} - \tilde{H}^n_{out} = \sum_{i=1}^{n} L^i m^i(Q^i)   (13)

which is a function of the piezometric head for a pipeline with n sections without branches.

The profile of Figure 1 corresponds to the topography of the pipeline under study. The pressure head H(z) and the resulting piezometric head \tilde{H}(z) are shown in Figures 2 and 3, respectively. Note the uniformity of \tilde{H}(z), similar to that of a horizontal pipeline. The reference datum was the height of the first sensor location.

As a consequence, if \tilde{H}^1_{in} = \tilde{H}_{in} and \tilde{H}^n_{out} = \tilde{H}_{out}, and if besides m^i(Q^i) = m(Q) = M(Q) for all i, then Equation (13) becomes

\tilde{H}_{in} - \tilde{H}_{out} = L\,M(Q)   (14)

where L = \sum_{i=1}^{n} L^i is the total length of the pipeline. Equation (14) is the steady state piezometric model for the pipeline viewed as a horizontal one.

^1 This friction characterizes the shear stress exerted by the conduit walls on the flowing fluid.

3 Leak location

We consider a leakage as an outlet pipe at the leak location, as shown in Figure 4. A branch or lateral pipe in section i breaks the continuity of the variables Q(z,t) and H(z,t), therefore new boundary conditions must be satisfied [9]. In
particular, the union of three pipes is associated with the geometry shown in Figure 4, and the corresponding conditions that describe the action of separating the flow reduce to

H_2 = H_1 + \kappa_{12}(H_2, H_1)   (15)

H_3 = H_1 + \kappa_{13}(H_3, H_1)   (16)

where H_2 and H_3 are the pressures at the beginning of pipes 2 and 3, and the functions \kappa_{1\eta}(\cdot,\cdot) with \eta = 2, 3 represent losses caused by friction and the change of flow direction. To adjust the order of magnitude of these functions, flow simulations were run with PipelineStudio [10] on the topology of the study case shown in Figure 1. The simulations reported that the terms \kappa_{12} and \kappa_{13} were negligible, so H_1 = H_2 = H_3. Thereafter only the balance

Q_1 - Q_2 - Q_3 = 0   (17)

was included in the study; as a consequence,

Q_1 = Q_{in}, \qquad Q_3 = Q_{out}   (18)

with Q_{in} and Q_{out} the flows at the ends of the pipeline. So the differential equation (4) transforms into two equations,

\frac{dH^1(z)}{dz} + M(Q_1) = 0 \quad \text{for } 0 \le z \le z_b
\frac{dH^3(z)}{dz} + M(Q_3) = 0 \quad \text{for } z_b < z \le L,   (19)

describing the pressure head along the section with a branch at point z_b. As the equations (19) have the same form as (4), their solutions also have the same form as (6). Therefore, with the boundary conditions

1. H^1(z = 0) = H_{in},
2. H^3(z = L) = H_{out},
3. Q_{in} = Q_{out} + Q_{z_b}, and
4. H_{z_b-\epsilon} = H_{z_b+\epsilon} with \epsilon \to 0,

and assuming that all pipes have the same diameter, the solutions of (19) evaluated at the ends reduce to

\frac{H_{in} - H_{z_b}}{z_b} - M(Q_{in}) = 0
\frac{H_{z_b} - H_{out}}{L - z_b} - M(Q_{out}) = 0.   (20)

Solving for the variable z_b associated with the position of the branch,

z_b = \frac{M(Q_{out})L + H_{out} - H_{in}}{M(Q_{out}) - M(Q_{in})} = \frac{L\sin\alpha + m(Q_{out})L + H_{out} - H_{in}}{m(Q_{out}) - m(Q_{in})},   (21)

or, in terms of the piezometric head,

z_b = \frac{m(Q_{out})L + \tilde{H}_{out} - \tilde{H}_{in}}{m(Q_{out}) - m(Q_{in})}.   (22)

Equation (22) is the key for leak isolation. In order to assess the performance of this leak location method, experiments were run in our pipe prototype [11], an iron pipe 200 m long and 4 inches in diameter, with six valves attached to it for leak simulations. Table 1 shows the percent deviations in locating the leak position. In each experiment one valve was fully open. Coriolis sensors were used.

Figure 2: Pipeline pressure head profile H(z)

Figure 3: Profile of the piezometric head \tilde{H}(z)
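As a quick illustration, Equation (22) can be coded directly. This is a sketch: the function names are ours, and the numerical values are hypothetical, chosen only so that the arithmetic is self-consistent with Equations (5) and (20).

```python
def m(Q, mu):
    """Friction loss term m(Q) = mu * Q * |Q|, as in Equation (5)."""
    return mu * Q * abs(Q)

def leak_position(Q_in, Q_out, Hp_in, Hp_out, L, mu):
    """Leak position z_b from Equation (22).

    Q_in, Q_out  : steady-state flows at the pipe ends [m^3/s]
    Hp_in, Hp_out: piezometric heads at the pipe ends [m]
    L            : total pipe length [m]
    mu           : friction coefficient f / (2 D A^2 g)
    """
    return (m(Q_out, mu) * L + Hp_out - Hp_in) / (m(Q_out, mu) - m(Q_in, mu))

# A leak makes Q_in > Q_out; with no leak the denominator vanishes,
# so detection must precede location.  Hypothetical but consistent data:
z_b = leak_position(Q_in=0.020, Q_out=0.018, Hp_in=20.0, Hp_out=2.164,
                    L=200.0, mu=260.0)
print(round(z_b, 1))  # → 50.0
```

Note that the formula only applies once a leak has been declared; in normal operation Q_in ≈ Q_out and the expression is ill-conditioned.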
Figure 4: Union of three branches at point z_b of the pipeline, with transversal section areas A_1, A_2 and A_3 (flows Q_1, Q_2, Q_3, heads H_1, H_2, H_3; segment lengths z_b and L − z_b, with z = 0 and z = L at the pipe ends)

4 Pipeline friction

The D'Arcy-Weissbach friction is a function of the pipe parameters, [6] and [12], and of the operating conditions, such as the Reynolds number. For practical purposes the friction f is obtained from tables provided by the pipe manufacturers. But we observed that this value differs from the real one of a working pipeline where, even in steady state operation, the value is influenced by noise caused by pipe inner surface imperfections and attachments (nipples, elbows, etc.); therefore, a fixed value of f chosen beforehand is of no use in Equation (1).
Table 1: Location error as a percentage of the total pipe length

Experiment | Δz_b [%] | Valve position [m]
1 | 1.66 | 11.54
2 | 2.93 | 49.83
3 | 0.135 | 80.36
4 | 0.54 | 118.37
5 | 0.375 | 148.93
6 | 3.42 | 186.95
Mean | 1.0 |

To overcome the problem of not having the right friction value, we proposed an on-line friction estimation. In the following we show how to calculate this friction. For that, we start from the steady state momentum equation, Equation (4). Restoring the original parameters we get

g\frac{dH}{dz} + \frac{f}{2DA^2}Q|Q| + g\sin\alpha = 0.   (23)

Solving the integral, and considering that H_0 and H_L are the pressures at the beginning and at the end of the pipeline and L its length, results in

g(H_L - H_0) = -\left(\frac{f}{2DA^2}Q_\infty^2 + g\sin\alpha\right)L   (24)

where Q_\infty is the volumetric flow in steady state; the absolute-value term disappears when the flow goes in one direction only. The friction then has the following expression:

f = \frac{2DA^2 g\,(H_0 - H_L - L\sin\alpha)}{L\,Q_\infty^2}.   (25)

Equation (25) is used to calculate the friction value on line, as shown in Figure 5, an experiment realized in our pipeline prototype. The calculated friction has a considerable amount of noise, but this noise can be attenuated via a weighted mean value with forgetting factor (MVFF, continuous line in the figure). We are currently working on the use of recursive identification procedures for a better friction estimate.

Figure 5: Friction estimate, raw and filtered

5 Influence of sensors on location

Flow measurement in a pipeline is fundamental for leak location, given that most pipeline leak detection methods are based on processing a residual that is a flow difference. Due to our lack of experience, and at the suggestion of a supplier, we started our flow measurements with a paddle wheel flow sensor [13]. Later on, as ultrasonic sensors are widely used in the field, we decided to change to them [14], expecting that our measurements would be better. Finally, we reached the conclusion that success in leak detection and location depends strongly on the sensor quality (make and sensing principle), so we acquired sensors based on the Coriolis effect [15].

An experiment we made in our pipe prototype was to cause a leakage (an outflow at an extraction point) and estimate its location with the measurements of the three sensors. Figure 6 shows the deviation of the calculated location depending on the type of sensor. Oscillations are observed around the operating point, which leads to the need for signal filtering in the diagnosis process. Table 2 shows the leak location errors; the paddle wheel and Coriolis sensors have similar errors, but the standard deviation is bigger with the paddle wheel. For performance comparison, the accuracy of the instruments is presented in the fourth column; remark that the Coriolis error standard deviation is about seventy times bigger than the sensor accuracy. The observation here is that the quality of the results depends more on the behavior of the flow than on the accuracy of the instrument used.

Figure 6: Leak location with the three sensors (real position = 49.8 m)

Table 2: Leak location errors

Sensor | Error [%] | Error STD [%] | Accuracy [% FS]
Paddle wheel | -0.28 | 3.36 | 0.50
Ultrasonic | 2.12 | 1.39 | 2.00
Coriolis | 0.28 | 0.84 | 0.05

One of our goals in the SCADA expansion project was to deliver results in real time. For this, sensor experiments were performed to determine which one would have the fastest response. An index to take into account is the time response; it can be appreciated in Figure 6 but is practically the same for all sensors, therefore we measured the settling time from the moment the leakage valve is opened. In Figures 7, 8 and 9 the flow development is observed; the dotted line indicates the time when the leakage valve is opened to 100%. Table 3 lists the measured times, the ultrasonic sensor being the one that requires more time (due to the number of points used to calculate a mean value).
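The on-line friction estimate of Equation (25), and the smoothing of its noisy output, can be sketched as follows. This is a minimal illustration: the exponential forgetting filter below stands in for the MVFF filter of Figure 5 (whose exact weighting the paper does not detail), and all numerical values are hypothetical.

```python
import math

def friction(H0, HL, Q_inf, D, L, alpha, g=9.81):
    """Raw D'Arcy-Weissbach friction from Equation (25).

    H0, HL : pressure heads at the pipe ends [m]
    Q_inf  : steady-state volumetric flow [m^3/s]
    D, L   : pipe diameter and length [m]; alpha: slope angle [rad]
    """
    A = math.pi * D ** 2 / 4.0  # pipe cross-sectional area
    return 2.0 * D * A ** 2 * g * (H0 - HL - L * math.sin(alpha)) / (L * Q_inf ** 2)

def smooth(prev, raw, lam=0.95):
    """One filter step with forgetting factor lam:
    f_k = lam * f_{k-1} + (1 - lam) * raw_k."""
    return lam * prev + (1.0 - lam) * raw

# Noisy raw estimates are filtered sample by sample:
f_est = friction(H0=20.0, HL=10.0, Q_inf=0.02, D=0.1016, L=200.0, alpha=0.0)
for raw in (1.05 * f_est, 0.95 * f_est, 1.02 * f_est):  # synthetic noise
    f_est = smooth(f_est, raw)
```

When the leak alarm fires, the last filtered value would be frozen, as the authors describe in the conclusions.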
Table 3: Sensors settling time

Sensor | ts [s]
Paddle wheel | 3
Ultrasonic | 35
Coriolis | 4

Figure 7: Flow measurement at the pipe ends, paddle wheel sensors

Figure 8: Flow measurement at the pipe ends, ultrasonic sensors

Considering the settling time and the noise in the measurements (taking the STD as the measure for that), the Coriolis sensor has the best performance. The experiments shown in this section were made with a 1 s sampling period.

6 Asynchronous data and databases

In academia, we are used to working with benchmark systems or laboratory facilities with ad hoc data acquisition systems, sufficient sensors, controlled environments, etc. But these conditions do not necessarily hold in practice, as was the case of the SCADA expansion, where access to the flow and pressure sensors of the pipeline was not available, except through a database. So the solution adopted was as follows:

1. The leak locator runs on a dedicated computer, independent of the system that regulates the distribution of the fluid; it connects to the database server, see Figure 10, via intranet or a VPN (Virtual Private Network) connection in a LAN (Local Area Network) system.

2. With proper permission, a program made with the Visual Studio 2010 tool, which runs every minute (it is a program without GUI -Graphical User Interface- that runs silently), brings the system data and creates a database with the pipeline flow and pressure information, the data required by the locator for proper operation.

3. The locator program (made on the LabVIEW platform, [16]) periodically takes data (through Microsoft's SQL data server), applies the detection algorithm and, when it detects a leak, proceeds to locate it, displays the leak location on screen (Figures 12 and 13), generates a visual warning and creates a file with the leakage data.

Figure 10: Communication scheme between leak locator and database
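The three steps above can be condensed into a small sketch of one locator cycle. This is schematic only: the sample dictionary stands in for a row fetched from the SQL database, locate stands for the Equation (22) routine, and the threshold is a hypothetical value.

```python
LEAK_THRESHOLD = 0.5  # flow-balance residual threshold [L/s] (hypothetical)

def locator_cycle(sample, locate):
    """One cycle: residual-based detection, then location on alarm.

    sample : latest flows/heads fetched from the database
    locate : function implementing the location formula, Equation (22)
    """
    residual = sample["Q_in"] - sample["Q_out"]  # a leak makes Q_in > Q_out
    if residual > LEAK_THRESHOLD:
        return {"leak": True, "position_m": locate(sample), "size": residual}
    return {"leak": False}

# Canned sample standing in for a database fetch; a fixed locate() for brevity.
sample = {"Q_in": 18.6, "Q_out": 17.2, "Hp_in": 940.0, "Hp_out": 870.0}
result = locator_cycle(sample, locate=lambda s: 10250.0)
```

In the real system this cycle runs every 3 min, the nominal polling period of the SCADA discussed below.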
Figure 9: Flow measurement at the pipe ends, Coriolis sensors

But the data acquisition system of the SCADA does not meet the condition of sampling the system variables with a constant sampling period. The nominal sampling period was 3 min, but in reality it varies from one to several tens of minutes. On the other hand, the locator was assigned a sampling period of 3 min, determined by the condition that nominally the SCADA performs a polling of all measuring stations in that time span. To solve the problem of having a value of flow and pressure for each station at every sampling time, an algorithm that extrapolates the missing data when they are not available was added to the localizer. Two algorithms were tested: one that retains the last value during the following sampling periods, and one that generates a straight line through the last two values available, so that when the value brought from the database is not a new one, the value determined by the straight line is used. In order to compare the results of both proposals, a simulation with real data with
three leaks was carried out; Figure 11 shows the real and extrapolated input flow data. It can be seen that at certain intervals the extrapolation by a straight line delivers values that may be beyond the normal range of measurements; this situation is exacerbated in large intervals with empty data, as the line grows monotonically, delivering data outside the region of validity. Figure 12 shows the location of a leak when extrapolated data are used, and Figure 13 when retained data are used. The pipe length is about 20 km, so retention outperformed extrapolation, since the latter yields values higher than the length of the pipe. The original leak location was about 10 km.
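The two gap-filling strategies can be sketched as below (a minimal illustration; the function names are ours, and None marks a sampling instant with no fresh value from the database):

```python
def hold_last(samples):
    """Zero-order hold: a missing sample (None) repeats the last known value."""
    out, last = [], None
    for s in samples:
        last = s if s is not None else last
        out.append(last)
    return out

def linear_extrapolate(samples):
    """Extend the straight line through the last two known values into gaps.
    In long gaps the line keeps growing, which is how out-of-range
    locations such as those in Figure 12 arise."""
    out = []
    for s in samples:
        if s is not None:
            out.append(s)
        elif len(out) >= 2:
            out.append(2 * out[-1] - out[-2])  # y_k = y_{k-1} + (y_{k-1} - y_{k-2})
        else:
            out.append(out[-1] if out else None)
    return out

flows = [16.0, 16.5, None, None, 16.0]           # hypothetical flow samples
print(hold_last(flows))           # [16.0, 16.5, 16.5, 16.5, 16.0]
print(linear_extrapolate(flows))  # [16.0, 16.5, 17.0, 17.5, 16.0]
```

The second printout shows the monotonic growth during the gap that makes retention the safer choice here.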
Figure 11: Original, extrapolated and retained input flow data with three leaks

Figure 12: Leak locations with extrapolated data

Figure 13: Leak locations with retained data

6.1 Alternate database communication

As part of the project requirements, an alternate way of communicating with the SCADA database was tried. In the previous section the communication between the leak locator and the database was direct, through a LAN system; the alternate way was through a third party via internet and a VPN connection. Figure 14 shows the principal elements of this scheme.

The client is the computer with the locator program built on the LabVIEW platform, which performs basically two activities: leak detection and location, and requesting and sending data to the communications broker using JSON strings. The remote client interface is a Java process that runs locally and handles communication, authentication, data formatting, encryption and security of the communication with the data server. It connects to the database in the SCADA through TCP sockets and VPN.

Figure 14: Communications between client and database

For data handling the JSON format is used, which is broadly used for information interchange through the internet. JSON (JavaScript Object Notation) is a text-based data interchange format, easy for humans to read and write [17]. JSON is a collection of pairs {variable name : value}, realized as an object, record, structure, dictionary, hash table, keyed list, or associative array; see an object example in Figure 15.

Figure 15: JSON data format for an object

An example of a JSON string for reporting a leak is the following:

{"service":"event",
 "options": {
  "action":"new",
  "vector": {
   "Module":XXX,
   "EventID":XXX,
   "Quantity":XXX,
   "PipeID":XXX,
   "Location":XXX,
   "TimeEvent":"yyyymmddhhmmss"}
}}

The communications broker attends client requests (the leak locator is not the only client) and also SCADA requests. The database attached to the broker contains not only pipeline data but also data generated by the other clients. In the end, the SCADA has an interface in which information on leakage events is displayed.

Figure 16 shows a test run with real data but off line. That experience showed that the locator did not always receive answers from the broker. This communications scheme is still under development.
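Assembling such an event string can be sketched as follows. This is an illustration only: the leak_event name and the field values are hypothetical, and just the key layout follows the example string above.

```python
import json
from datetime import datetime

def leak_event(module, event_id, quantity, pipe_id, location, when):
    """Build the event string sent to the communications broker,
    following the key layout of the example in the text."""
    vector = {
        "Module": module,
        "EventID": event_id,
        "Quantity": quantity,
        "PipeID": pipe_id,
        "Location": location,                        # leak position [m]
        "TimeEvent": when.strftime("%Y%m%d%H%M%S"),  # yyyymmddhhmmss
    }
    return json.dumps({"service": "event",
                       "options": {"action": "new", "vector": vector}})

msg = leak_event(1, 42, 0.8, 7, 10250.0, datetime(2015, 8, 31, 12, 0, 0))
print(msg)
```

Being plain text, the string travels unchanged over the TCP/VPN link, and the receiving side recovers the fields with an ordinary JSON parser.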
Figure 16: Off line experiment with real data. Detail of the graph; the y axis is the leak location in km

7 Conclusions

An interesting result is that a pipeline with a certain topography may be analyzed as a horizontal pipe in which the piezometric head is the sum of the measurements and the terrain heights, Equation (10), as seen in section 2.

Compared with traditional methods for locating a leak in a pipe, the method shown here, Equation (22), requires less computational effort and has a simple expression for calculating the location.

Another relevant result is the expression for on-line calculation of the pipeline friction, Equation (25), as it is enough to measure the pressure at the ends and the steady state flow. The value of the friction was found to be a key parameter for the exact location of the leak. It is worth remarking that when a leak occurs the pressures change, modifying the friction value; in order to avoid a wrong location of the leak we keep a delayed value of the friction that is frozen when the leak alarm occurs.

On the other hand, it is worth highlighting the importance of choosing the appropriate sensor. It is not enough to choose a sensor capable of measuring a certain physical variable; the purpose for which the measurements are needed must also be included in the selection process.

The world of measurements for control targets is not limited to direct measurement of the physical variable; it is possible to achieve the control objectives with indirect measurements, as was the case of reading the variables from the plant via the network to a database. Also, with the partial absence of data we cannot use the plant model to predict data, so the use of extrapolation methods proved to be a powerful tool that helped achieve the goal of this project; in this paper we use two simple methods, but this is an area that we continue to explore.

The experience with JSON format strings showed that it is easier to work with text characters than with specialized database commands and that, regardless of the VPN connection and data encryption, the scheme depends strongly on internet conditions. If the internet fails, the leak detection scheme fails, a situation that scarcely appears when the locator connects to the database through a LAN system.

At the time this paper was written, our FDI system is in the proof stage at the SCADA facilities and we are waiting for in-the-field results.

8 Acknowledgments

The authors are very thankful to Jonathán Velázquez, who helped us by solving the database issues that emerged in this project.

References

[1] I. Bazilescu and B. Lyhus. Russia oil spill. http://www1.american.edu/ted/KOMI.HTM, 1994.
[2] H. Siebert and R. Isermann. Leckerkennung und -lokalisierung bei Pipelines durch On-line-Korrelation mit einem Prozessrechner. Regelungstechnik, 25:69–74, 1977.
[3] R. Isermann. Process fault detection based on modeling and estimation methods - a survey. Automatica, 20(4):387–404, July 1984.
[4] C. Verde, S. Gentil, and R. Morales. Monitoreo y diagnóstico automático de fallas en sistemas dinámicos. Trillas, 2013.
[5] M. Hanif Chaudhry. Applied Hydraulic Transients. Springer, third edition, 2014.
[6] C. F. Colebrook and C. M. White. Experiments with fluid friction in roughened pipes. Proceedings of the Royal Society of London, 1937.
[7] J. Saldarriaga. Hidráulica de acueductos. McGraw-Hill, 2003.
[8] R. Bansal. Fluid mechanics and hydraulic machines. Laxmi Publications (P) LTD, 2005.
[9] H. Mahgerefteh, A. Oke, and O. Atti. Modelling outflow following rupture in pipeline networks. Chemical Engineering Science, (61):1811–1818, 2006.
[10] PipelineStudio. Software by Energy Solutions International. http://www.energy-solutions.com/, 2010.
[11] R. Carrera and C. Verde. Prototype for leak detection in pipelines: User's Manual. Instituto de Ingeniería, UNAM, Ciudad Universitaria, D.F., November 2010. In Spanish.
[12] G. Papaevangelou, C. Evangelides, and C. Tzimopoulos. A new explicit relation for the friction coefficient in the Darcy-Weisbach equation. In PRE10: Protection and Restoration of the Environment, Corfu, 05-09 July 2010, 2010.
[13] G. Fisher. Signet 2540 high performance flow sensor. Georg Fischer Signet LLC, El Monte, CA, 2004.
[14] Panametrics. Two-Channel TransPort Mod. 2PT868 Portable Flowmeter. User's Manual. PANAMETRICS, Inc., Waltham, MA, USA, 1997.
[15] E+H. Proline Promass 83 Operating Instructions. Endress + Hauser Flowtec, Greenwood, IN, USA, 2010.
[16] National Instruments. LabVIEW user manual. National Instruments Corporation, Austin, 2013.
[17] JSON Organization. ECMA-404 The JSON data interchange standard. http://www.json.org/.
Automatic Model Generation to Diagnose Autonomous Systems

Jorge Santos Simón¹, Clemens Mühlbacher¹ and Gerald Steinbauer¹

¹ Institute for Software Technology
e-mail: {jsantos, cmuehlba, gstein}@ist.tugraz.at
Abstract RoboEarth [7]) make semi-automated derivation of mod-
els possible. Despite recent advances on this area [8; 9;
Autonomous systems’ dependability can be im- 10], most techniques focus on very specific applications of
proved by performing diagnosis during run-time. the generated formal models. Thus, we pose the problem
This can be achieved through model-based diag- of generating a common knowledge base as an interme-
nosis (MBD) techniques. The required models of diate representation with a well defined semantics out of
the system are for the most part handcrafted. This documents used during the system design process. From
task is time consuming and error prone. To over- this central repository, different algorithms can extract dif-
come this issue, we propose a framework to gen- ferent formal models for particular needs. We believe that
erate formal models out of natural language doc- this work can increase the acceptance of model-based tech-
uments, such as technical requirements or FMEA, niques and broaden their use.
using natural language processing (NLP) tools
and techniques from the knowledge representa- The motivation for this work came during the develop-
tion and reasoning (KRR) domain. Therefore, we ment of a model-based diagnosis and repair (MBDR) sys-
aim to enable the usage of MBD in autonomous tem for an industrial application. The aim is to improve the
systems with few extra burden. So doing, we ex- dependability of a fleet of robots that automatically deliver
pect a significant increase in the usage of MBD goods in a warehouse. As stated in [11], even minor fail-
techniques on real-world systems. ures often prevent a robot from accomplishing its task, de-
creasing the overall performance of the system. Moreover,
the frequent need of human intervention increases costs and
1 Introduction customer dissatisfaction. Using MBDR techniques, many
Dependability is a key feature of modern autonomous sys- of these failures can be automatically handled, allowing the
tems. It can be achieved by sound design and implemen- robot to remain on service, perhaps with its capabilities
tation, thorough testing and runtime diagnosing. To date, all these processes are still not completely automated and need substantial manual work. However, all these fields can greatly benefit from the use of model-based techniques. Design and implementation can be greatly improved through model-driven engineering, as stated in [1]. Model-based testing (MBT) has been demonstrated [2] to outperform traditional testing techniques in both invested time and number of errors found. Model-based diagnosis (MBD) is the main target of this work. It has been successfully used in industrial settings [3], reducing the need for human intervention. Although it has been increasingly adopted in recent years, we believe that its full potential is still to be developed.
All model-based techniques require appropriate models of the system. As stated in [4; 5], creating these models is the most prevalent limiting factor for their adoption. To overcome this barrier, we propose a method that automates model creation from the documents used during the system design. These comprise requirements documents, architectural designs, FMEA and FTA, among others. The content of these documents is often given in natural language and in semi-structured form and lacks a common semantics. Thus, the contained information is not accessible for a computer. However, advances in natural language processing (NLP) and the availability of common sense and domain-specific knowledge bases (e.g. Cyc [6],

gracefully degraded [12; 13]. In extreme cases, diagnosing a failure on time can prevent robot behaviors harmful to humans, the robot itself or other elements in the environment.
Confronted with the lack of any formal model of the system, we were forced to manually code the models we need. However, this is both a time-consuming and error-prone task, and it also imposes a maintenance burden as the system evolves. Accordingly, we believe that a mostly automated approach is not only convenient for the intended project but can also help extending the use of MBDR techniques to other projects and domains. Following this idea, we propose a framework that, in a first step, gathers the information from the project together with domain and common-sense knowledge in a machine-understandable knowledge base. Then, a suite of algorithms can extract formal models from this knowledge base for particular purposes. Though our aim is to automate the process as much as possible, human assistance will be requested whenever some pieces of information are missing or contradictory [14; 15].
The novelty of our proposal is two-fold: first, we emphasize the usability of the resulting models for MBD. Second, we aim to integrate all the sources of information typically available in an industrial development process, such as requirements, architecture, and failure modes. As a result, we expect to boost the range and applicability of the automat-
Proceedings of the 26th International Workshop on Principles of Diagnosis
ically generated models. To better illustrate the proposed framework, we will use a small running example extracted from a real-world application. It concerns the robot's box loading operation, performed by the robot's load handling device (LHD).
The remainder of the paper is organized as follows: Related research on model generation is discussed in Section 2. Section 3 provides an overview of the proposed process. Section 4 describes the inputs used, while Section 5 describes the proposed NLP and KRR tool-chain to interpret them. Section 6 provides an example of an output model and its use for MBD. Finally, Section 7 summarizes the presented framework and discusses future work.

2 Related research
We start the brief discussion of related research with the work using NLP methods to derive models. The work of [9] uses NLP methods to derive a formal model out of requirements. This formal model can afterwards be transformed into different representations to test or synthesize the system. The method proposed in [10] uses NLP methods to derive design documents (class diagrams, etc.) out of requirements. These design documents can afterwards be used to implement the system. The authors of [8] propose a method to extract action recipes from websites. These action recipes comprise the desired behavior in order to achieve a given goal. The method uses how-to instructions and NLP tools to derive an action recipe which can be executed by a robot. Missing parts are inferred with the help of common sense knowledge about actions. In contrast to all these approaches, we propose a framework which incorporates different information sources to get a better understanding of the system. Furthermore, our framework generates different models out of an internal formal description depending on the needs of the intended diagnosis and testing tasks.
Besides NLP methods, machine learning can also be used to generate a model of the system. The work in [4] presented a method to statistically learn the model of the system under nominal conditions. The model describes the static interaction of the system components. In contrast, the method proposed in [5] learns the behavior of a system. The method infers similar/different states from observed events and merges similar ones. Furthermore, the variables in the system for each state are estimated. Both methods are only applicable if the system is already built. Instead, we create a model during the design phase, and so the model can be used right at the first stages of the life-cycle.
Missing or contradicting information must be detected and handled when generating models. The method in [15] tries to avoid faults in the requirements document. This is done through the transformation of the requirements into so-called boilerplates. Through this semi-structured text, ambiguities are removed and a consistent naming is enforced. A different approach was proposed in [14] to diagnose a knowledge base for consistency. If the knowledge base is inconsistent, the user is asked as an oracle to pinpoint the problem. Afterwards, the user needs to fix this issue. In our framework, we will use ideas from both methods to derive a consistent knowledge base of the system.

3 Framework overview
We propose the framework depicted in Figure 1 to transform informal documents and knowledge into models suitable for MBD. The informal inputs (white squares with solid lines) are processed into intermediate representations (light gray squares with dashed lines) using techniques from NLP and KRR, as well as ontologies (e.g. Cyc). We condense them into a knowledge base together with all our knowledge about the system and its domain. Finally, a variety of algorithms can produce formal models suitable for MBD (gray squares with dot-dash lines).

4 Sources of information
The proposed framework takes artifacts from the design phase as inputs. We propose the use of the following four inputs, though additional sources can be incorporated if available:
1. Requirements document: The technical requirements document describes the expected system behavior. Therefore, it is a mandatory input. The quality of the models, and thus of the resulting MBD, heavily depends on the quality of the requirements. Thus, iterative improvement of the requirements and models is used, as proposed in [15]. For our running example, we have taken four requirements that describe the box loading process of a robot:
   (a) When the robot is docked, it lowers the barrier.
   (b) When the robot is ready to load, the load handling device starts rotating backward.
   (c) The load handling device stops rotating backwards when the laser beam is triggered.
   (d) After stopping the load handling device the barrier is raised.
2. Domain knowledge: This is the fuzziest input, as it is available not as an artifact but as the knowledge and experience of the engineers involved. We distinguish three kinds of knowledge. Common sense knowledge can be provided by existing ontologies such as Cyc [16]. Generic knowledge about the autonomous systems domain can be provided by dedicated ontologies such as KnowRob [17]. Particular knowledge about the targeted system itself can be partially inferred from the system architecture, though other parts must be provided by the project engineers. The use of ontologies ranges from providing meaning to natural language concepts to inferring missing pieces of information.
3. Architecture: The architecture of the system defines its composing elements plus the relations between them. It is typically described as a set of diagrams generated during the design phase of the system. For our running example, we use the architecture excerpt depicted in Figure 2. It states that a robot consists of a LHD and other unspecified elements. Furthermore, the LHD consists of a laser beam, rollers and a barrier.
4. Failure Modes and Effects Analysis: FMEA looks at all potential failure modes, their effects and causes, and determines a risk priority factor. FMEA can be used to determine which potential errors are critical, how they can be pinpointed, and how the effects thereof can be avoided [18]. We incorporate the failure modes into the resulting behavior models to diagnose these known
Figure 1: Abstract work-flow for the proposed framework. Starting from left with inputs in natural language, we generate models that can be applied for diagnosis (right).

Figure 2: Robot architecture excerpt. The figure shows relations of the type "part of" for components of the Robot.

failures. For our running example, we include the two failure modes that can occur during the load operation, depicted in Table 1.
The biggest challenge for handling all these inputs is to understand semi-structured information. Therefore, we describe an NLP/KRR tool-chain using state-of-the-art techniques in the following section.

5 NLP/KRR tool chain
The process generates three intermediate artifacts: semi-formal text (boilerplates), syntax trees and semantic categories. As a showcase, we will concentrate on the requirements of our running example, though these techniques can be extended to other textual inputs, as we will see at the end of this section.

5.1 Boilerplates
This is a semi-formal representation where most of the spelling errors, poor grammar and ambiguities have been removed. Boilerplates also enforce the use of a consistent naming scheme. There exist tools such as [19] to perform this task semi-automatically. In our example, the four requirements become the four equivalent boilerplates:
(a) when the robot is docked, it lower the barrier.
(b) when the robot is ready to load, the lhd start rotating backward.
(c) when the lb is triggered, the lhd stop backward rotation.
(d) after stopping the lhd, the barrier is raised.

Figure 3: Sample syntax tree of the first sentence (a) of the running example.

Note for example that the 3rd person "s" has been removed from the verbs. Furthermore, complex terms such as "load handling device" have been replaced by lhd. Finally, the order of the propositions is rearranged into a consistent structure.

5.2 Syntax trees
A syntax tree comprises the information about the type of each word in the sentence, e.g. "lower" is a verb. Furthermore, the tree specifies how the sentence is constructed with these words. For example, the syntax tree of the first requirement in our running example is depicted in Figure 3. In this syntax tree we can identify that "robot" is a noun and "the robot" is a so-called noun phrase. An example of a tool to extract syntax trees is the probabilistic context-free grammar parser described in [20].

5.3 Semantic categories
The semantic categories conceptually describe our system, e.g. a transition describing the motion of an actuator. These semantic categories are hierarchical in nature, as more complex and abstract concepts are composed of simpler ones, e.g. a transition is composed of an action, pre and post conditions, etc. We obtain the semantic categories by parsing the syntax trees and applying transformation rules in a
bottom up fashion, following [8]. We start at the leaves of the syntax tree, containing single words. Each word is assigned a part-of-speech (POS) label describing its grammatical role in the sentence. Furthermore, each word has an additional label with its WordNet [21] synset, used to derive its semantics from the common sense knowledge base. From the leaves, higher level transformations can be applied to create more complex semantic categories. For example, in our running example we create a semantic category for each word in the sentence "lower the barrier". Then, we can derive that "lower" is an action acting on something. We can then use the semantic category of the word together with its position in the syntax tree to apply further transformation rules. This process is repeated until the root node is reached. Then, a new semantic category is assigned to the sentence capturing its semantics. For the running example, the semantic category for "lower the barrier" is a transition. A transition must contain a precondition, a postcondition, an action and optionally an object of the action. The semantic category specifies that the action "lower" is performed on the object "barrier". With the help of common sense (Cyc ontology [16]) we can reason that this action causes the "barrier" to move from state "up" to state "down". Thus, we can infer the pre and post conditions of "lower". Finally, the semantic category together with the reasoning results are packed into statements in our knowledge base, as depicted in Figure 4.

Figure 4: Concepts created from the syntax tree in Figure 3. The word in quotes is the word as it appears in the sentence. The word in parentheses is the Cyc concept it belongs to.

            Component                     Failure            Observations
Failure 1   Barrier                       Barrier stuck up   Barrier stuck up regardless of commands
Failure 2   Load Handling Device (LHD)    Rotation fail      Laser beam not triggered

Table 1: FMEA from the running example.

We can incorporate other documents into the knowledge base by using a similar NLP tool chain. However, how the information is treated depends heavily on the context inherent to each document type.

6 Model generation for behavior diagnosis
To illustrate how the framework can be used to diagnose the behavior of the robot, we create an automaton as output model. To use techniques such as [22], the automaton must describe both nominal and faulty behaviors of the system. To generate this automaton from the knowledge base, we use four different relations stated on it as transitions:
1. Relations representing a direct transition, as depicted in Figure 4. Such a transition can be directly mapped into a transition on the automaton, as can be seen in Figure 5 through the transitions from state 1 to 2.
2. Relations representing an action with a duration. Such a relation must be translated into several transitions: the start of the action, the termination event and a transition to a final state. Such a transformed relation is depicted in Figure 5 through the transition from state 2 to 5.
3. Relations representing a failure of the system. The failure event is represented as a divergent path from a normal transition. Thus, the start state is the same as the one of the normal transition. Afterwards, we need a state representing the failure. Finally, we need an observation transition that leads to a final state representing a general failure of the system. The observable transition is caused by the fact that we use a fault model derived from the FMEA; thus, every fault has an observable discrepancy to the real system. Additionally, it is important to notice that the state representing the general failure is a state where the system can exhibit arbitrary behavior. Thus, we can model the lack of knowledge about the impact the fault has on the system. The transformed failure is depicted in Figure 5 through the transitions from state 2 to 9.
4. Relations representing a failure of a system component. The failure event is represented as a divergent path from a normal transition. To determine all the possible affected transitions, we must perform an inference of the effects each transition has. This inference is based on common sense and domain knowledge. In our running example, we can infer that lowering the barrier causes the barrier to be finally down. A failure such as barrier_stuck_up can prevent this transition, and so they can share a common source state. Then, as before, we need an observation transition that leads to a final state representing a general failure of the system. Such a sequence is depicted in Figure 5 through the transitions from state 1 to 9 through the states 7 and 8.

7 Conclusion and future work
In this paper we propose a framework to automatically generate formal models out of documents represented in semi-structured form and natural language (requirements, domain knowledge, architecture, failure modes, etc.). The parsed information is gathered together with domain knowledge in a knowledge base. Accessing this common repository, a variety of algorithms can generate different kinds of models for different purposes. Our main target is to derive models suitable for state-of-the-art MBD techniques applied to autonomous systems. We plan to implement this framework to assist us in creating the models required for MBD. Doing so, we expect to improve the dependability of the industrial application of a fleet of transport robots in a warehouse.
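To make the four relation-to-transition mappings of Section 6 concrete, the following is a minimal sketch of how such an automaton could be assembled. The state numbers, event labels and helper names are our own hypothetical illustration, loosely inspired by Figure 5; this is not the authors' implementation.

```python
# Sketch: building automaton transitions from knowledge-base relations.
# States are integers, events are strings; all names are hypothetical.

transitions = []  # list of (source_state, event, target_state)

def add_direct(src, event, dst):
    # Mapping 1: a direct relation becomes a single automaton transition.
    transitions.append((src, event, dst))

def add_durative(src, start_evt, mid, stop_evt, dst):
    # Mapping 2: an action with a duration becomes a start transition,
    # a termination event, and a transition to a final state.
    transitions.append((src, start_evt, mid))
    transitions.append((mid, stop_evt, dst))

def add_failure(src, fault_evt, fault_state, obs_evt, general_failure):
    # Mappings 3/4: a fault diverges from the source state of a normal
    # transition; an observable discrepancy (from the FMEA fault model)
    # then leads to the general failure state.
    transitions.append((src, fault_evt, fault_state))
    transitions.append((fault_state, obs_evt, general_failure))

# Hypothetical fragment of the box-loading automaton:
add_direct(1, "lower_barrier", 2)
add_durative(2, "start_rotating_backward", 3, "laser_beam_triggered", 5)
add_failure(2, "rotation_fail", 8, "laser_beam_not_triggered", 9)

print(len(transitions))  # -> 5
```

A diagnoser in the style of [22] would then consume such a transition list; the point of the sketch is only that each knowledge-base relation type expands into a fixed pattern of transitions.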
Figure 5: Automaton generated from the running example. Shaded states are reached through some fault. Double circled states represent final states. State number 9 is the general failure state; for readability, the self loops with all possible labels are omitted.

Besides this immediate result, we expect that the proposed framework will ease the creation of formal models for other applications. Thus, we hope to contribute to the widespread use of MBD techniques, with the consequent improvement of autonomous systems dependability.

Acknowledgments
The research presented in this paper has received funding from the Austrian Research Promotion Agency (FFG) under grant 843468 (Guaranteeing Service Robot Dependability During the Entire Life Cycle (GUARD)).

References
[1] Stuart Kent. Model driven engineering. In Michael Butler, Luigia Petre, and Kaisa Sere, editors, Integrated Formal Methods, volume 2335 of Lecture Notes in Computer Science, pages 286–298. Springer Berlin Heidelberg, 2002.
[2] Mark Utting and Bruno Legeard. Practical model-based testing: a tools approach. Morgan Kaufmann, 2010.
[3] Peter Struss, Raymond Sterling, Jesús Febres, Umbreen Sabir, and Marcus M. Keane. Combining engineering and qualitative models to fault diagnosis in air handling units. In European Conference on Artificial Intelligence (ECAI) - Prestigious Applications of Intelligent Systems (PAIS 2014), pages 1185–1190, 2014.
[4] Safdar Zaman and Gerald Steinbauer. Automated generation of diagnosis models for ROS-based robot systems. In International Workshop on Principles of Diagnosis (DX), Jerusalem, Israel, 2013.
[5] Dennis Klar, Michaela Huhn, and J. Gruhser. Symptom propagation and transformation analysis: A pragmatic model for system-level diagnosis of large automation systems. In Emerging Technologies & Factory Automation (ETFA), 2011 IEEE 16th Conference on, pages 1–9. IEEE, 2011.
[6] Cynthia Matuszek, John Cabral, Michael Witbrock, and John Deoliveira. An introduction to the syntax and content of Cyc. In Proceedings of the 2006 AAAI Spring Symposium on Formalizing and Compiling Background Knowledge and Its Applications to Knowledge Representation and Question Answering, pages 44–49, 2006.
[7] Markus Waibel, Michael Beetz, Raffaello D'Andrea, Rob Janssen, Moritz Tenorth, Javier Civera, Jos Elfring, Dorian Gálvez-López, Kai Häussermann, J.M.M. Montiel, Alexander Perzylo, Björn Schießle, Oliver Zweigle, and René van de Molengraft. RoboEarth - A World Wide Web for Robots. Robotics & Automation Magazine, 18(2):69–82, 2011.
[8] Moritz Tenorth, Daniel Nyga, and Michael Beetz. Understanding and executing instructions for everyday manipulation tasks from the world wide web. In Robotics and Automation (ICRA), 2010 IEEE International Conference on, pages 1486–1491. IEEE, 2010.
[9] Shalini Ghosh, Daniel Elenius, Wenchao Li, Patrick Lincoln, Natarajan Shankar, and Wilfried Steiner. Automatically extracting requirements specifications from natural language. arXiv preprint arXiv:1403.3142, 2014.
[10] Sven J. Körner and Mathias Landhäußer. Semantic enriching of natural language texts with automatic thematic role annotation. In Natural Language Processing and Information Systems, pages 92–99. Springer, 2010.
[11] Gerald Steinbauer. A survey about faults of robots used in RoboCup. In Xiaoping Chen, Peter Stone, Luis Enrique Sucar, and Tijn van der Zant, editors, RoboCup 2012: Robot Soccer World Cup XVI, volume 7500 of Lecture Notes in Computer Science, pages 344–355. Springer Berlin Heidelberg, 2013.
[12] Gerald Steinbauer, Franz Wotawa, et al. Detecting and locating faults in the control software of autonomous mobile robots. In IJCAI, pages 1742–1743, 2005.
[13] Mathias Brandstötter, Michael Hofbaur, Gerald Steinbauer, and Franz Wotawa. Model-based fault diagnosis and reconfiguration of robot drives. In 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), San Diego, CA, USA, 2007.
[14] Kostyantyn Shchekotykhin, Gerhard Friedrich, Patrick Rodler, and Philipp Fleiss. A direct approach to sequential diagnosis of high cardinality faults in knowledge-bases. In International Workshop on Principles of Diagnosis (DX), Graz, Austria, 2014.
[15] Bernhard K. Aichernig, Klaus Hormaier, Florian Lorber, Dejan Nickovic, Rupert Schlick, Didier Simoneau, and Stefan Tiran. Integration of requirements engineering and test-case generation via OSLC. In Quality Software (QSIC), 2014 14th International Conference on, pages 117–126. IEEE, 2014.
[16] Stephen L. Reed, Douglas B. Lenat, et al. Mapping ontologies into Cyc. In AAAI 2002 Conference Workshop on Ontologies For The Semantic Web, pages 1–6, 2002.
[17] Moritz Tenorth, Alexander Clifford Perzylo, Reinhard Lafrenz, and Michael Beetz. The RoboEarth language: Representing and exchanging knowledge about actions, objects, and environments. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 1284–1289. IEEE, 2012.
[18] Hongkun Zhang, Wenjun Li, and Jun Qin. Model-based functional safety analysis method for automotive embedded system application. In International Conference on Intelligent Control and Information Processing, 2010.
[19] Stefan Farfeleder, Thomas Moser, Andreas Krall, Tor Stålhane, Herbert Zojer, and Christian Panis. DODT: Increasing requirements formalism using domain ontologies for improved embedded systems development. In Design and Diagnostics of Electronic Circuits & Systems (DDECS), 2011 IEEE 14th International Symposium on, pages 271–274. IEEE, 2011.
[20] Dan Klein and Christopher D. Manning. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics - Volume 1, pages 423–430. Association for Computational Linguistics, 2003.
[21] George Miller and Christiane Fellbaum. WordNet: An electronic lexical database, 1998.
[22] Meera Sampath, Raja Sengupta, Stéphane Lafortune, Kasim Sinnamohideen, and Demosthenis Teneketzis. Diagnosability of discrete-event systems. Automatic Control, IEEE Transactions on, 40(9):1555–1575, 1995.
Methodology and Application of Meta-Diagnosis on Avionics Test Benches
R. Cossé1,2, D. Berdjag2, S. Piechowiak2, D. Duvivier2, C. Gaurel1
1 AIRBUS HELICOPTERS, Marseille International Airport, 13725 Marignane, France
{ronan.cosse, christian.gaurel}@airbus.com
2 LAMIH UMR CNRS 8201, University of Valenciennes, 59313 Valenciennes, France
{denis.berdjag, sylvain.piechowiak, david.duvivier}@univ-valenciennes.fr
Abstract
This paper addresses Model Based Diagnosis for the test of avionics systems that combine aeronautic computers with simulation software. Just like the aircraft, those systems are complex since additional tools, equipment and simulation software are needed to be consistent with the test requirements. We propose a structural diagnostic framework based on the lattice concept to reduce the time of unscheduled maintenance when the tests cannot be performed. Here, we also describe a diagnosis algorithm that is based on the formal lattice description and designed for test systems. The benefit is to capture the system structure and communication specificities to diagnose the configuration, the equipment, the connections, and the simulation software.

1 Introduction
Avionics systems are complex since tens of subsystems and components interact to achieve required functions. Existing devices for aircraft fault monitoring are based on dedicated avionics functions, but the existing solutions are insufficiently flexible for test systems and can be improved. In [1], the framework of a health management algorithm for maintenance is described and implemented on an aircraft. In [2], the diagnostic of avionics equipment is performed through dynamic fault trees. To prevent important failures on the aircraft, avionics systems are checked on rigs called Avionics Test Bench (ATB), composed of the avionics equipment and flight simulation software.
The environment of the ATB needs to be compliant with the configuration of the avionics equipment. Faults of the ATB can concern the avionics equipment, their configurations, or the ATB itself, i.e. the movable connections and the simulation software. Since no monitoring functions exist for the ATB itself, a new method needs to be applied to prevent long periods of unavailability. In fact, during the development of embedded software, its architecture and the test environment surrounding the ATB are redesigned by adapting the test means to the specification's requirements. Since the ATB is a test system, and the main knowledge is based on its embedded systems, we need a new approach to deal with the ATB issues. As the embedded systems are already tested on the ATB, and the test results are used to focus on the ATB issues thanks to a new representation based on the model of the test system, the diagnosis of the ATB is what we call a meta-diagnosis.
Many diagnosis approaches have been proposed to deal with specific avionics problems. Two different classes of representation are applied: data-based diagnosis or model-based diagnosis. The first one, as studied by Berdjag et al. [3], is used to recognize faulty behaviors of an Inertial Reference System (IRS) thanks to normal or faulty categories of input/output data. In this work, data fusion of output sensors is computed to eliminate faulty sources. In [2], time dependency is introduced in the data of failure messages to improve problem detection.
In Model Based Diagnosis (MBD), Kuntz et al. [4] have studied an avionics system using minimal cuts notions. Belard et al. have defined a new approach based on the MBD hypotheses, called Meta-Diagnosis, in [5], dealing with model issues. Berdjag et al. [6] present an algebraic decomposition of the model to reduce the complexity of the required model-based diagnosers. Giap [7] has proposed a formalism of an iterative process to give a solution when models are not complete, but it lacks applications on more complex industrial systems. Nevertheless, it gives clues for an iterative diagnosis. Another diagnostic software has been developed by Pulido et al. in [8] to perform consistency-based diagnosis of dynamic systems, simulating diagnosis scenarios. The architecture is quite novel and is applied to the three-tank system.
Structural approaches such as graph theory are also popular for MBD to describe the structure of the system, as with Bayesian Networks in [9]. They enable us to incorporate the system complexity, as with the lattice concept to integrate the sub-model dependencies. For example, in [10], the lattice model represents fault modes to compute testable subsystems from redundancy equations. We want to get the main ideas that will serve our proposal. To our knowledge, there is no method for the diagnostic of test systems based on embedded software behaviour. Moreover, our proposition has been adapted from embedded systems to the ATB behaviour. Its complexity is relevant to the objectives of the avionics embedded systems certification, as for example high levels of safety requirements, or the simulation of specific test conditions. In our model, we must consider the fact that our representation must put forward the ATB behaviour in case of failures concerning embedded systems, connections, communications, simulation software and all settings to configure the test. Considering those features and the high number of needed ATB reconfigurations, we propose a structural representation associated with hierarchical verifications that reduce the faulty candidates. The motiva-
tion of the proposed meta-diagnosis approach was presented 2.2 Diagnostic function
in [11]. Here, we propose an extended diagnosis methodol- A basic diagnostic function is defined to help the diagno-
ogy originally defined by De Kleer, Williams [12], [13] and sis: the check function. Depending on the granularity, the
Davis [14] and we present a software implementation run- check function is applied on a component, a subsystem or
ning on a real ATB. It differs from the Belard et al.’s meta- a partition. First, the checkC function is used to deter-
diagnosis definition because the ATB is still defined as the mine if a component is faulty or not. However, we do not
main system under study. Here, we extend the diagnostic- know precisely how a unique component behaves regarding
world tools for a specific system and due to the lack of a fault. So we need to define the checkS function of a sub-
knowledge and data in case of issues, our proposal is based system. The behaviour of a faulty subsystem may also not
on a MBD representation with a structural and functional be sufficient to explain a fault. In fact, subsystems are inter-
decomposition without fault models. connected making the system structure and the partitioning
First, we describe the diagnostic framework, the lattice- concept allows us to focus on different levels of abstrac-
based representation used to model the ATB system and the tion that we call granularities. In our study, we only focus
diagnostic algorithm. In the third section, we provide a de- on faults with observable and measurable symptoms. These
scription of the ATB and the application of the lattice con- faults can only be localized by testing a functionality on a
cept. In the fourth section, we illustrate the approach with a specific architecture. That is why, functional and structural
case study of the ATB. In the final section, we describe the partitions are used to decompose the system into testable
development of a software application to perform automati- partitions.
cally the ATB diagnosis.
Definition 3. The checkC function of a component ci is
defined by:
2 Diagnostic framework checkC : COM P S → {0, 1, −1} s.a checkC(c) = 0 if
the component c is faulty, checkC(c) = 1 if the component
2.1 System representation c is unfaulty and checkC(c) = −1 if the component state is
unknown.
The system is composed of several subsystems that interact to achieve a global function. The decomposition into subsystems is guided by the communication between components needed to fulfill this goal. Partitions are used to decompose the system into functional and communication categories. There are thus two classes of partitions: the partitions that represent the structure and the connections of the system, and the partitions that represent the functions of the system. As an example, P1 is associated with a functionality of the system: P1 = {σ1; σ2}, with σ1 = {C1} and σ2 = {C2, C3}. If a problem appears, i.e. the functionality is not performed, then a fault is detected for this partition P and symptoms are observed and linked to subsystems σ.

In the following paragraphs, we use this notation: P for a partition, σ for a subsystem and ci for a component. S = {ci, i ∈ [1, n]} is the set of all n components of a system. We write Σ for the set of all subsystems, i.e. the power set of components. A partition P is a set of np subsystems σi ∈ Σ: P = {σi, i ∈ [1, np] | ∀i ≠ j, σi ∩ σj = ∅, and ∪_{i=1}^{np} σi = S}. We write P for the set of all partitions. We recall in Definition 1 the inclusion relation between partitions and in Definition 2 the multiplication of partitions.

Definition 1. Two partitions P1 and P2 are in inclusion relation P1 ⊆ P2 if and only if every subsystem of P1 is contained in a subsystem of P2. The relation ⊆ means that P1 is a sub-partition of P2.

Definition 2. The subsystems σk of the multiplication of two partitions P = {σi, i ∈ [1, np]} and Q = {σj, j ∈ [1, nq]} are defined by: ∀σk ∈ P × Q, ∃σi ∈ P, ∃σj ∈ Q, σk = σi ∩ σj.

This operation is used to order subsystems with respect to the proposed diagnostic algorithm. The inclusion relation ⊆ is used to organize the components into the lattice L(Σ, ⊆) with a partial ordering relation. It differs from the usual notion of partially ordered set (poset) because the arrangement of elements is based on partitions rather than on sets.

Definition 4. The checkP function of a partition P is defined by:
checkP : P → {0, 1, −1} s.a. checkP(P) = 1 ⇔ ∀σi ∈ P, checkS(σi) = 1; checkP(P) = 0 ⇔ ∃σi ∈ P, checkS(σi) = 0; and checkP(P) = −1 ⇔ the checked value is unknown.

Some partitions cannot be checked. The set of partitions that can be checked defines a constraint. A constraint Cons is a subset of P s.a. ∀P ∈ Cons, checkP(P) ≠ −1.

Once the checkP value of a partition is known, we have to define the checkS function for subsystems that are not singletons, σi ≠ {ci}. If the partition is faulty, either there exists a component ci ∈ σi such that checkC(ci) = 0, or the communication between the components in σi is faulty; the latter is modeled by checkCom(σi) = 0. If the partition is unfaulty, then all communications between the components in the subsystems σi ≠ {ci} are unfaulty and all singletons σi = {ci} are unfaulty.

Definition 5. The checkCom function of a subsystem σi ⊆ COMPS is defined by:
checkCom : Σ → {0, 1, −1} s.a. checkCom(σi) = 1 ⇔ the communication between components in σi is unfaulty; checkCom(σi) = 0 ⇔ the communication between components in σi is faulty.

To support the diagnosis of the system, we decompose it into subsystems and introduce the checkS function of a subsystem σi ⊆ COMPS:

Definition 6. checkS : Σ → {0, 1, −1} s.a. checkS(σi) = 1 ⇔ ∀ci ∈ σi, checkC(ci) = 1 ∧ checkCom(σi) = 1; checkS(σi) = 0 ⇔ ∃ci ∈ σi, checkC(ci) = 0 ∨ checkCom(σi) = 0; and checkS(σi) = −1 ⇔ ∃ci ∈ σi, checkC(ci) = −1 ∧ checkCom(σi) = −1.

With the above definitions, we can now state the diagnosis problem. Given a system representation with the lattice L(Σ, ⊆) and the set of constraints Cons =
Proceedings of the 26th International Workshop on Principles of Diagnosis
{P ∈ P, checkP(P) ≠ −1}, the problem is defined by the consistency between L(Σ, ⊆), which contains the system representation, and Cons, which describes the system issues.

Definition 7. The problem formulation is to find the faulty components whose current state may explain the constraints. It is defined as a function DIAG(L(Σ, ⊆)) under the constraints Cons.

There are two kinds of faults: the fault of a component Ci, modeled with checkC(Ci) = 0, and the communication fault of a subsystem σi = {Ci, Cj, ...}, modeled with checkCom(σi) = 0. With the P1 partition, suppose that C2 and C3 are linked with an ARINC 429 link that is not working. The constraint is checkP(P1) = 0 because the global function is broken. The reason is that checkCom(σ2) = 0. Knowing that checkCom(σ2) = 0 for the P1 functionality gives the information needed to fix the system.

2.3 Diagnostic algorithm

It is now necessary to introduce a diagnostic method that solves the above problem. The algorithm is based on Proposition 1, which extends the verification from the multiplication of partitions to the partitions themselves. A functional verification is then propagated from partitions to subsystems, and from subsystems to components.

Proposition 1. ∀P, Q ∈ P², checkP(P × Q) = 0 ⇒ checkP(P) = 0 ∧ checkP(Q) = 0.

In order to increase the readability of the algorithm, it has been split into three parts. DIAG(L(Σ, ⊆)) is the main algorithm: it initializes the framework with the partitions of the system {pi, i ∈ [1, n]} and the constraints Cons = {P ∈ P, checkP(P) ≠ −1}.

Algorithm 1: DIAG(L(Σ, ⊆))
    Input: d = {pi, i ∈ [1, n]}, Cons = {consi}
    Output: ∆ (diagnosis)
    Global variables: End,
      Fc (faulty components), Uc (unfaulty components),
      Σ− (faulty subsystems), Σ+ (unfaulty subsystems),
      P− (faulty partitions), P+ (unfaulty partitions)
    ∆, Fc, Uc, P+, P−, Σ−, Σ+ ← {}; End ← false;
    NCons ← {};
    while ¬End do
        FindFaultyElements(d, Cons);
        Verification(Fc, Σ−);
        if ¬End then
            foreach pi ∈ NCons do
                GET checkP(pi);
                Cons ← Cons ∪ {pi}

FindFaultyElements checks the partitions that are defined as a constraint. If the checked value of a partition pmult is faulty (resp. unfaulty), we add it to the faulty (resp. unfaulty) partition set P− (resp. P+); every subsystem σi of the partition is then possibly faulty (resp. unfaulty), and we add it to Σ− (resp. Σ+). If another partition pmult can help to find more faulty or unfaulty components, a new constraint is proposed and added to NCons.

Algorithm 2: FindFaultyElements
    Input: d = {pi}, Cons = {consi}
    Outputs: Fc, P−, Σ−, Σ+
    foreach (pj, pk) ∈ P² : pj ≠ pk do
        pmult ← pj × pk;
        if pmult ∈ Cons then
            if checkP(pmult) = 0 then
                P− ← P− ∪ {pmult};
                foreach σi ∈ pmult do
                    foreach ck ∈ Uc do
                        σi ← σi \ {ck}
                    if σi = {ci} then
                        Fc ← Fc ∪ σi
                    else if σi ∉ Σ+ then
                        Σ− ← Σ− ∪ {σi}
            if checkP(pmult) = 1 then
                P+ ← P+ ∪ {pmult};
                foreach σi ∈ pmult do
                    if σi = {ci} then
                        Uc ← Uc ∪ σi
                    else
                        Σ+ ← Σ+ ∪ {σi}
        if pmult ∉ Cons then
            if ∃{ci} ∈ pmult then
                if ¬(ci ∈ Uc ∪ Fc) then
                    NCons ← NCons ∪ {pmult}

Verification checks the components that may be faulty, i.e. those included in Fc, with the checkC function, and the communication of the subsystems in Σ− with the checkCom function.

Two functions have been introduced: the checkP(pi) value of a partition pi and the checkCom(σi) value of a subsystem. Their values can be computed automatically by a program deployed on the system to automate the diagnosis. This is performed by the GET function, whose purpose is to model the computation of checkP(pi) or checkCom(σi).

2.4 Formal example

In order to illustrate the problem formulation and the diagnostic algorithm, a formal example is provided. It is composed of eight components {Ci, i ∈ [1, 8]} organized into three partitions:
P1 = {{C1, C2, C3, C4}, {C5, C6, C7, C8}},
P2 = {{C1, C2}, {C3, C4, C5, C6, C7, C8}},
P3 = {{C1}, {C2, C4, C6, C8}, {C3, C5, C7}}.
P3 describes the topology of the system; P1 and P2 describe functionalities. We set the C2 component as faulty. The idea is to combine the topology of the system with its functionalities to find the faulty component or subsystem. A choice function is introduced to choose the next topology and the next functionality to be tested. It is guided by the minimum number of tests to perform in order to fix the system. For a set of partitions P, we define Choose : {P} → P × P.

As the two functionalities are modeled by P1 and P2, and the topology is modeled by P3, we have two possibilities. We assume that P2 is prior to P1; the first iteration is defined with Choose(P) = (P1, P3). We begin with checkP(P1 × P3) = 0, s.a. P1 × P3 = {{C1}, {C2, C4}, {C3}, {C6, C8}, {C5, C7}}. The possible faulty components are C1 and C3. We check the C1 and C3 components and
find them as unfaulty, see Table 1. The possible faulty subsystems are {C2, C4}, {C6, C8} and {C5, C7}, and they are unfaulty. The diagnosis is not sufficient; we must relax the constraint P2 × P3.

Algorithm 3: Verification
    Inputs: Fc
    Outputs: ∆, Fc, Uc, End
    Initialization: σ+, σ− ← I;
    foreach ci ∈ Fc do
        if checkC(ci) = 0 then
            ∆ ← ∆ ∪ {ci};
            End ← true
        else
            Fc ← Fc \ {ci};
            Uc ← Uc ∪ {ci}
    foreach Σi ∈ Σ− do
        GET checkCom(Σi);
        if checkCom(Σi) = 0 then
            ∆ ← ∆ ∪ {Σi};
            End ← true
        else
            Σ− ← Σ− \ {Σi};
            Σ+ ← Σ+ ∪ {Σi}

The second iteration is defined with Choose(P) = (P2, P3), s.a. P2 × P3 = {{C1}, {C2}, {C4, C6, C8}, {C3, C5, C7}}. We get checkP(P2 × P3) = 0; the possible faulty components are C1 and C2, but C1 has already been checked in the previous iteration. So, the possible faulty subsystems are {C3, C5, C7} and {C4, C6, C8}. We check the C2 component and find it faulty. For this example, the computed faulty component is C2 in P2 × P3, see Table 2. If no component had been found faulty, the upper topological level would be treated, i.e. the subsystems {C2, C4}, {C6, C8}, {C5, C7}, {C4, C6, C8} and {C3, C5, C7}. Here, they are unfaulty.

Components | CheckC
C1         | 1
C2         | −1
C3         | 1
C4         | −1
C5         | −1
C6         | −1
C7         | −1
C8         | −1

Table 1: Diagnostic results for components in P1 × P3

Components | CheckC
C1         | 1
C2         | 0
C3         | 1
C4         | −1
C5         | −1
C6         | −1
C7         | −1
C8         | −1

Table 2: Diagnostic results for components in P2 × P3

The method has permitted to quickly detect the faulty component using a functional partition and a structural partitioning. Thanks to this result, possible faults regarding either the topology or the functionality are checked.

3 The Automatic Test Benchmark

3.1 Avionics system

The avionics system of the NH90 helicopter is designed to support multiple hardware and software platforms from more than twelve national customers in over twenty different basic helicopter configurations. The NH90 avionics system consists of two major subsystems: the CORE System and the MISSION System. A computer acts as the bus controller and manages each subsystem's communications: the Core Management Computer (CMC) for the CORE System and the Mission Tactical Computer (MTC) for the MISSION System. Each computer is connected to one or both subsystems via a multiplex data bus (MIL-STD-1553), point-to-point connections (ARINC 429) and serial RS-485 lines. Additional redundant computers are used as backups. One of the two CMCs is the Bus Controller (BC) of the CORE multiplex data bus. The avionics system of the ATB is composed of fourteen computers with the above connections: two CMC: c1 = CMC1 and c2 = CMC2; two Plant Management Computers (PMC): c3 = PMC1 and c4 = PMC2; five Multifunction Displays (MFD): c5 = MFD1, c6 = MFD2, c7 = MFD3, c8 = MFD4, c9 = MFD5; two Display and Keyboard Units (DKU): c10 = DKU1, c11 = DKU2; two IRS: c12 = IRS1, c13 = IRS2; one Radio Altimeter (RA): c14 = RA. Formally, COMPS_ATB = {ci, i ∈ [1, 14]}.

The avionics system under test COMPS_SUT is a subsystem of COMPS_ATB. It is described in Figure 1. COMPS_SUT = {c1, c2, c3, c4, c5, c10, c12, c14}. For the rest of the article, COMPS_SUT will be the primary system under study.

Figure 1: Architecture of the avionics subsystem

From  | To    | Messages | Subsystems
DKU1  | CMC1  | Mode on  | σSerial1
CMC1  | IRS1  | Mode on  | σMIL
IRS1  | RA    | Mode on  | σNAV; σARINC
RA    | IRS1  | Alert    | σNAV; σARINC
IRS1  | CMC1  | Alert    | σMIL; σNAV
CMC1  | DKU1  | Alert    | σSerial1; σNAV

Table 3: Messages
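The partition multiplication of Definition 2, which drives the iterations above, can be reproduced in a few lines. This is an illustrative Python sketch, not the authors' C++/Java implementation; component names follow the formal example of Section 2.4:

```python
def multiply(p, q):
    """Partition multiplication (Definition 2): the subsystems of
    p x q are the non-empty intersections of a subsystem of p
    with a subsystem of q."""
    return {si & sj for si in p for sj in q if si & sj}

# Formal example of Section 2.4: eight components, three partitions.
P1 = {frozenset({"C1", "C2", "C3", "C4"}), frozenset({"C5", "C6", "C7", "C8"})}
P2 = {frozenset({"C1", "C2"}), frozenset({"C3", "C4", "C5", "C6", "C7", "C8"})}
P3 = {frozenset({"C1"}), frozenset({"C2", "C4", "C6", "C8"}),
      frozenset({"C3", "C5", "C7"})}

# First iteration, Choose(P) = (P1, P3):
# P1 x P3 = {{C1}, {C2,C4}, {C3}, {C6,C8}, {C5,C7}}; the singleton
# subsystems {C1} and {C3} are the directly checkable components.
singletons = {s for s in multiply(P1, P3) if len(s) == 1}
```

The same helper reproduces the second iteration: multiply(P2, P3) yields {{C1}, {C2}, {C4, C6, C8}, {C3, C5, C7}}, whose singleton {C2} exposes the injected fault.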
The PMC is used to monitor the status of all the avionics computers. It displays the alert information on the MFD. We define the performance partition pPERF = {σPERF, σ¬PERF} with:
σPERF = {PMC1, PMC2, RA, IRS1, MFD1}
σ¬PERF = {CMC1, CMC2, DKU1}
and the navigation partition pNAV = {σNAV, σ¬NAV} with:
σNAV = {RA, IRS1, MFD1}
σ¬NAV = {CMC1, CMC2, DKU1, PMC1, PMC2}.

The test consists in the simulation of a high roll. Normally the RA should be deactivated above the value of forty degrees. The procedure contains the following actions: engage the RA with the DKU1; simulate a roll of 50 degrees; check that the RA functionality is deactivated on the DKU1. Several messages are sent to achieve this functionality, see Table 3, defining a data flow for two messages, "Mode on" and "Alert": from DKU1 to CMC1 via serial communication to activate the radio altimeter's specific mode ("Mode on" message); from CMC1 to IRS1 via MIL-STD-1553 communication to relay the activation information; from IRS1 to RA via ARINC communication to send a request to the RA to get the roll angle; from RA to IRS1 via ARINC communication to send the response to the IRS, which computes the angle; from IRS1 to CMC1 via ARINC communication, then from CMC1 to DKU1 via serial communication, to display the alert and disable the functionality ("Alert" message).

3.2 System Under Test (SUT) decomposition

The ATB is used to perform the realization of the avionics functions with the necessary equipment and a simulated environment needed to check the system specification.

The ATB is described as a structural decomposition with component subsets. These sets provide partitions of the whole system. We define the subsystems σi and the partitions pi with regard to the connections of the avionics system of Figure 1. For the serial communication:
σSerial1 = {CMC1, CMC2, DKU1}
σSerial2 = {PMC1, PMC2}
σ¬Serial = {MFD1, IRS1, RA}
pSerial = {σSerial1; σSerial2; σ¬Serial}
For the ARINC communications:
σARINC = {CMC1, CMC2, PMC1, PMC2, MFD1, IRS1, RA}
σ¬ARINC = {DKU1}
pARINC = {σARINC; σ¬ARINC}
For the MIL-STD-1553 communications:
σMIL = {CMC1, CMC2, PMC1, PMC2, IRS1}
σ¬MIL = {MFD1, DKU1, RA}
pMIL = {σMIL; σ¬MIL}

The above partitions describe the topology of the problem. We classify the partitions into two categories: functional partitions and communication partitions. The functional partitions contain the subsystems that compute and send the information. The communication partitions contain the subsystems that relay this information. In our example, the navigation functionality is tested. The functional partitions are {pNAV, pPERF}; the connection partitions are {pMIL, pSerial, pARINC}. We need to define additional partitions that can be checked with the check function on the system thanks to this representation:
pNAV.MIL = pNAV × pMIL = {{MFD1, RA}; {IRS1}; {CMC1, CMC2, PMC1, PMC2}; {DKU1}};
pNAV.Serial = pNAV × pSerial = {{CMC1, CMC2, DKU1}; {PMC1, PMC2}; {MFD1, IRS1, RA}};
pNAV.ARINC = pNAV × pARINC = {{MFD1, IRS1, RA}; {CMC1, CMC2, PMC1, PMC2}; {DKU1}}.

The performance function can give insights about the fault. We compute the partitions with this functionality:
pPERF.MIL = pPERF × pMIL = {{MFD1, RA}; {DKU1}; {CMC1, CMC2}; {PMC1, PMC2, IRS1}};
pPERF.Serial = pPERF × pSerial = {{CMC1, CMC2, DKU1}; {PMC1, PMC2}; {MFD1, IRS1, RA}};
pPERF.ARINC = pPERF × pARINC = {{PMC1, PMC2, MFD1, IRS1, RA}; {CMC1, CMC2}; {DKU1}}.
Those partitions will serve to improve the diagnosis.

3.3 Outlooks about the decompositions

We describe an iterative method to update the diagnostic result by providing new topologies of the system. We need precise observations to find the faulty components. The subsystems are computed with the framework of the previous section.

Given the components, the messages sent between them, and the protocol of these messages, we can obtain an overview of the system decomposition: pSUT can be decomposed into dprotocol = {pSUT × pMIL; pSUT × pSerial; pSUT × pARINC}. This hierarchical structure is provided with a dependency graph, see Figures 2 and 3.

Figure 2: Navigation function decomposition with dprotocol
Figure 3: Performance function decomposition with dprotocol

The following partitions are used:
σcom1 = {DKU1, CMC1, IRS1, RA};
σ¬com1 = {MFD1, CMC2, PMC1, PMC2};
pcom1 = {σcom1, σ¬com1}.

The path of the information "RA mode on" and "RA alert" on the copilot side defines another decomposition:
σcom2 = {CMC2, IRS1, RA, DKU1};
σ¬com2 = {MFD1, CMC1, PMC1, PMC2};
pcom2 = {σcom2, σ¬com2}.

We describe the decomposition dcom = {pcom1, pcom2} in Figures 4 and 5. We compute partitions with the navigability functionality and this structural decomposition:
pNAV.com1 = pNAV × pcom1 = {{RA, IRS1}; {MFD1}; {CMC1, DKU1}; {CMC2, PMC1, PMC2}};
pNAV.com2 = pNAV × pcom2 = {{RA, IRS1}; {DKU1, CMC2}; {MFD1}; {CMC1, PMC1, PMC2}};
pPERF.com1 = pPERF × pcom1 = {{RA, IRS1}; {CMC2}; {CMC1, DKU1}; {MFD1, PMC1, PMC2}};
pPERF.com2 = pPERF × pcom2 = {{RA, IRS1}; {DKU1, CMC2}; {CMC1}; {MFD1, PMC1, PMC2}}.

4 Illustration of the Meta-Diagnostic Approach

4.1 Application of the meta-diagnosis approach

An iterative approach is very helpful in this case of distributed systems since the diagnosis can use new subsystems and partitions. The results of the diagnosis are re-injected in the upper system to refine the results.
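The products listed in Section 3.2 can be cross-checked mechanically with the same multiplication rule. This is an illustrative Python sketch under the subsystem definitions above, not the authors' implementation:

```python
def multiply(p, q):
    # Definition 2: keep the non-empty pairwise intersections.
    return {si & sj for si in p for sj in q if si & sj}

# Navigation partition pNAV and MIL-STD-1553 partition pMIL (Section 3.2).
s_nav = frozenset({"RA", "IRS1", "MFD1"})
s_not_nav = frozenset({"CMC1", "CMC2", "DKU1", "PMC1", "PMC2"})
s_mil = frozenset({"CMC1", "CMC2", "PMC1", "PMC2", "IRS1"})
s_not_mil = frozenset({"MFD1", "DKU1", "RA"})

# pNAV.MIL = pNAV x pMIL
# = {{MFD1, RA}; {IRS1}; {CMC1, CMC2, PMC1, PMC2}; {DKU1}}
p_nav_mil = multiply({s_nav, s_not_nav}, {s_mil, s_not_mil})
```

Each product is again a partition of COMPS_SUT, so every component of the system under test appears in exactly one subsystem of pNAV.MIL.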
The first symptom is the misbehavior of the navigation functionality. We describe the iterations of the algorithms with two topologies. We have launched the meta-diagnostic algorithm with the topology dNAV.protocol = {pNAV.MIL, pNAV.ARINC, pNAV.SERIAL} and dNAV.com = {pNAV.com1, pNAV.com2}. The constraint is CONS = {checkP(pi), ∀pi ∈ dNAV.protocol ∪ dNAV.com}. The iterations of the algorithms are described in Tables 4 and 5.

pi           | checkP(pi) | Uc | Fc
pNAV.ARINC   | 0          | ∅  | {DKU1}
pNAV.SERIAL  | 1          | ∅  | {DKU1}
pNAV.MIL     | 0          | ∅  | {IRS1, DKU1}

Table 4: Iterations of CheckMultiplicationPartition with dprotocol

The third step gives a state of the components in the Fc set that can be faulty: DKU1 and IRS1 in Table 5. If the components are faulty, this may explain the system behavior and the algorithm ends. At the same time, the communications of the subsystems in Σ− can be faulty. They are checked in Table 6.

ci    | checkC(ci) | Fc      | Uc
DKU1  | 1          | {IRS1}  | {DKU1}
IRS1  | 0          | {IRS1}  | {DKU1}

Table 5: Iterations of CheckComponents with dprotocol

Subsystems               | checkCom | Partition
{MFD1, RA}               | 1        | pNAV.ARINC
{CMC1, CMC2, PMC1, PMC2} | 1        | pNAV.ARINC

Table 6: Diagnostic results for subsystems

The IRS1 is not faulty; the algorithm is relaunched with Uc = {DKU1, IRS1} and the other decomposition dcom = {pNAV.com1, pNAV.com2}. The algorithm iterations are described in Tables 7 and 8.

Figure 4: Navigation function decomposition with dcom
Figure 5: Performance function decomposition with dcom

pi         | checkP(pi) | Uc                  | Fc
pNAV.com1  | 0          | {DKU1, IRS1}        | {RA, MFD1}
pNAV.com2  | 1          | {DKU1, IRS1, MFD1}  | {RA}

Table 7: Iterations of CheckMultiplicationPartition with dcom

Subsystems           | checkCom | Partition
{RA, IRS1}           | 1        | pNAV.com1
{CMC1, DKU1}         | 1        | pNAV.com1
{CMC2, PMC1, PMC2}   | 1        | pNAV.com1

Table 8: Diagnostic results of subsystems with pNAV.com1

Once checkP(pNAV.com2) = 1, we deduce that MFD1 is not faulty, see Table 7. At this step, the unfaulty components are {DKU1, IRS1, MFD1}, and the diagnosis is {RA}.

Here the RA is faulty with pNAV.com1, and the algorithm ends. The solution is RA for pNAV.com1. The data flows of the messages are checked, as are the impacted connections, wiring and routing. The system specificities of the communication modeled with com1 give five clues about the possible faults. Thanks to the impacted functionality, we know that only messages concerning the IRS roll are concerned. At this stage, the simulation of the message or the bad connection of the IRS are the two main solutions.

4.2 Application with updated constraints

We describe a new problem: the navigation functionality and the performance function do not behave normally. The new constraint is CONS = {checkP(pi), ∀pi ∈ dNAV.protocol ∪ dNAV.com ∪ dPERF.protocol ∪ dPERF.com}. The algorithm is loaded from CheckMultiplicationPartition with the decomposition dcom. The algorithm iterations are described in Table 9. Once checkP(pPERF.com2) = 1, we deduce that CMC1 is not faulty. We continue with dprotocol, knowing that CMC1 is not faulty, in Table 10. We deduce that we have to check DKU1 and CMC2.

pi          | checkP(pi) | Uc      | Fc
pPERF.com1  | 0          | ∅       | {CMC2}
pPERF.com2  | 1          | {CMC1}  | {CMC2}

Table 9: Algorithm 2's iterations with dcom

pi            | checkP(pi) | Uc      | Fc
pPERF.ARINC   | 0          | {CMC1}  | {DKU1, CMC2}
pPERF.SERIAL  | 1          | {CMC1}  | {DKU1, CMC2}
pPERF.MIL     | 0          | {CMC1}  | {DKU1, CMC2}

Table 10: Iterations of CheckMultiplicationPartition with dprotocol

At this state, we check the components on the system. Since the repair of CMC2 has fixed the problem, we conclude that CMC2 was faulty. We also check the DKU1 configuration and find nothing. The diagnosis is ∆ = {CMC2}.

The evolution of the number of faulty and unfaulty components is reviewed in Figure 6. As expected, the number of unfaulty components is increasing with new tests, i.e. tests
of partitions. It reveals that the algorithm is converging to a solution because the number of components is limited.

Figure 6: Evolution of the number of faulty and unfaulty components

5 Software implementation

5.1 Diagnostic software architecture

The algorithms are implemented in a spy software of ARINC and MIL-STD-1553 buses, see Figure 7. They are developed in C++ for effective diagnosis and for integration into the AIRBUS software. The user interfaces are developed with Java 1.7 and the Swing Graphical User Interface (GUI) widget toolkit. The architecture of the diagnostic framework has been adapted to the ATB specificities as described with the Model-View-Controller (MVC) paradigm in Figure 8. Three main objects are defined for the Model: the Component, the Set, and the Partition objects. Four main objects are defined in the View to define specific panels: the diagnosisPanel, the constraintsPanel, the initialStatePanel and the resultsPanel objects. The model is implemented with the ArrayList class; it is used to define the list of components, the subsystems and the list of partitions. eXtensible Markup Language (XML) files describe the system structure. The Controller dispatches the user requests and selects the panels for presentation; the diagnosis algorithm is implemented in it. A GUI is provided for handling user inputs such as partition check values and component observation values.

Figure 7: Data flow of the diagnosis software
Figure 8: Architecture of the diagnosis software

5.2 User interfaces

The panels are displayed one after the other for each step of the algorithm defined in the Controller. The initialStatePanel panel, Figure 9, defines the status of the equipment before launching the diagnosis and a button to run the algorithm. The check values computed by the algorithm defined in the Controller are provided to the operator in Figure 11. The constraintsPanel panel lets the user edit and update constraints, see Figure 10. The result of the diagnostic algorithm is provided in Figure 11: it gives the faulty components (observation equal to zero) and the impacted functionality. If a component is suspected, the data flow of the functional chain described by the partition must be checked. As described in the case study, it gives insights about the possible connections, wiring and routing that can be wrong.

Figure 9: Initial state of the diagnosis
Figure 10: State of the constraints
Figure 11: Diagnosis results

We compute the results ∆ = {IRS1, DKU1, CMC2, RA} and display them in Figure 11. If some components are unfaulty, we can update their status in Figure 9. The algorithm is relaunched using the "GO" button in Figure 9. The good diagnosis rate is evaluated in Figure 12. It is defined by the number of faulty components that the operator has to fix over the number of proposed faulty components.

5.3 Discussion

We have proposed a solution for the diagnosis of a complex system in aeronautics based on the MBD paradigm and the
lattice concept. It is another solution for the meta-diagnosis problem as described in [5], since we consider the test system environment as the main system. Belard has extended the framework; here we use the original one, with the lattice concept to represent the system description. We also provide a diagnostic algorithm implemented on the system to evaluate our method. Since hundreds of diagnoses are possible on the ATB, and since it is not possible to check all those possibilities, we have introduced a methodology for the ATB diagnosis that reduces the number of iterations needed to get the diagnosis. We have upgraded the applications of MBD for avionics systems evaluated in [4] and [2], proposing the integration and evaluation of a diagnostic algorithm for an ATB that takes the test system environment into account. It differs from other applications of MBD like [8] because the model decomposition is driven by the test system specificities, which are represented with the lattice concept.

Figure 12: Good diagnosis rate

6 Conclusion

This paper extends the MBD approach to propose a diagnostic software developed for the diagnosis of test systems. The current framework is based on the lattice decomposition and is used to model a test system. First, the lattice decomposition has been used to decompose the system into its functionalities and connections. The second contribution consists in the proposal of an algorithm that reduces the diagnostic ambiguity. The lattice description has been implemented with Java native packages. The software architecture and diagnostic iterations are provided for a formal example and an industrial case study. The diagnostic algorithm has been shown to reduce the number of faulty candidates. The result is either a faulty equipment or a group of equipment with the associated system functionality that is unable to meet its goal. Together, they are sufficient to point out the repairs that will fix the system. The tests on the Avionics Test Systems at AIRBUS HELICOPTERS have shown good results. The development of models may confront our solution with many other real problems. In future work, the algorithms will be improved with adaptable decompositions and automatic tests. Furthermore, as the method is generic, we want to demonstrate the validity of our method for other test systems used at AIRBUS HELICOPTERS.

References

[1] Canh Ly, Kwok Tom, Carl S. Byington, Romano Patrick, and George J. Vachtsevanos. Fault Diagnosis and Failure Prognosis for Engineering Systems: A Global Perspective. In Proceedings of the Fifth Annual IEEE International Conference on Automation Science and Engineering, CASE'09, pages 108–115, Piscataway, NJ, USA, 2009. IEEE Press.

[2] Arnaud Lefebvre, Zineb Simeu-Abazi, Jean-Pierre Derain, and Mathieu Glade. Diagnostic of the avionic equipment based on dynamic fault tree. In Proceedings of the IFAC-CEA conference, October 2007.

[3] Denis Berdjag, Jérôme Cieslak, and Ali Zolghadri. Fault detection and isolation of aircraft air data/inertial system. pages 317–332. EDP Sciences, 2013.

[4] Fabien Kuntz, Stéphanie Gaudan, Christian Sannino, Éric Laurent, Alain Griffault, and Gérald Point. Model-based diagnosis for avionics systems using minimal cuts. 22nd International Workshop on Principles of Diagnosis, DX 2011, 2011.

[5] Nuno Belard, Yannick Pencolé, and Michel Combacau. A theory of meta-diagnosis: reasoning about diagnostic systems. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, IJCAI'11, pages 731–737, Barcelona, Catalonia, Spain, 2011.

[6] Denis Berdjag, Vincent Cocquempot, Cyrille Christophe, Alexey Shumsky, and Alexey Zhirabok. Algebraic approach for model decomposition: Application for fault detection and isolation in discrete-event systems. International Journal of Applied Mathematics and Computer Science (AMCS), 21(1):109–125, March 2011.

[7] Quang-Huy Giap, Stéphane Ploix, and Jean-Marie Flaus. Managing Diagnosis Processes with Interactive Decompositions. In Artificial Intelligence Applications and Innovations III, IFIP International Federation for Information Processing, pages 407–415. 2009.

[8] Belarmino Pulido, Carlos Alonso-González, Anibal Bregon, Alberto Hernández Cerezo, and David Rubio. DXPCS: A software tool for consistency-based diagnosis of dynamic systems using Possible Conflicts. 25th Annual Workshop Proceedings, DX-14, 2014.

[9] Véronique Delcroix, Mohamed-Amine Maalej, and Sylvain Piechowiak. Bayesian Networks versus Other Probabilistic Models for the Multiple Diagnosis of Large Devices. International Journal on Artificial Intelligence Tools, 16(3):417–433, 2007.

[10] Mattias Krysander, Jan Åslund, and Erik Frisk. A Structural Algorithm for Finding Testable Sub-models and Multiple Fault Isolability Analysis. 21st Annual Workshop Proceedings, DX-10, 2010.

[11] Ronan Cossé, Denis Berdjag, David Duvivier, Sylvain Piechowiak, and Christian Gaurel. Meta-Diagnosis for a Special Class of Cyber-Physical Systems: the Avionics Test Benches. In The 28th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems, [Accepted], IEA/AIE 2015, Seoul, Korea, 2015.

[12] Johan de Kleer and B.C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.

[13] Johan de Kleer, Alan K. Mackworth, and Raymond Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3):197–222, 1992.

[14] Randall Davis and Walter C. Hamscher. Model-Based Reasoning: Troubleshooting. pages 297–346, July 1988. San Francisco, CA, USA.
SAT-Based Abductive Diagnosis

Roxane Koitz¹∗ and Franz Wotawa¹
¹ Graz University of Technology, Graz, Austria
e-mail: {rkoitz, wotawa}@ist.tugraz.at
(∗ Authors are listed in alphabetical order.)

Abstract

Increasing complexity and magnitude of technical systems demand an accurate fault localization in order to reduce maintenance costs and system down times. Resting on solid theoretical foundations, model-based diagnosis provides techniques for root cause identification by reasoning on a description of the system to be diagnosed. Practical implementations in industries, however, are sparse due to the initial modeling effort and the computational complexity. In this paper, we utilize a mapping function automating the modeling process by converting fault information available in practice into propositional Horn logic sentences to be used in abductive model-based diagnosis. Furthermore, the continuing performance improvements of SAT solvers motivated us to investigate a SAT-based approach to abductive diagnosis. While an empirical evaluation did not indicate a computational benefit over an ATMS-based algorithm, the potential to diagnose more expressive models than Horn theories encourages future research in this area.

1 Introduction

Fault identification of technical systems is becoming increasingly difficult due to their rising complexity and scale. Economic and safety considerations have put accurate diagnosis not only into research focus but have led to a growing interest in practice as well.

Model-based diagnosis has been presented as a method to derive root causes for observable anomalies utilizing a description of the system to be diagnosed [1, 2]. Reiter [1] proposed a component-oriented model encompassing the correct system behavior and structure. Discrepancies, i.e. conflicts, arise when the observed and expected system performance diverge. Based on the minimal conflict sets, root causes for the inconsistencies are obtained by hitting set computation. Hence, fault diagnosis is a two-step process, where first contradicting assumptions on component health, given a set of symptoms and the model, are identified. Then the sets intersecting all conflict sets are computed, which constitute the diagnoses. At the same time, [2] presents the General Diagnosis Engine (GDE) for multiple fault identification, drawing on the connection between inconsistencies and causes as well. Their approach employs an assumption-based truth maintenance system (ATMS) to detect conflicts and thereon compute diagnoses. Over the years much work has concentrated on model-based diagnosis applications in various domains, such as space probes [3] or the automotive industry [4].

Besides the consistency-based approach, a second method emerged within the field of model-based diagnosis, which exploits the concept of entailment to infer explanations for given observables. While related to the more traditional technique based on consistency, abductive model-based diagnosis requires a system formalization representing faults and their manifestations [5].

Even though based on a well-defined theory, widespread acceptance of model-based diagnosis among industries has not been achieved yet. Two main contributing factors can be identified: the initial model development and the computational complexity of diagnosis [6]. In order to diminish the modeling effort, [7] formulates a conversion of failure assessments available in practice into a propositional logic representation suitable for abductive diagnosis. Failure mode and effect analysis (FMEA) is an established reliability evaluation method utilized in various industrial fields. It considers possible component faults as well as their implications on the system's behavior [8]. Whereas there has been extensive research on the automatic generation of FMEAs from system models [9], we argue in favor of the inverse process. As these assessments report on failures and how they reveal themselves in the artifact's behavior, they provide knowledge requisite for abductive reasoning. In this paper, we present a compilation of FMEAs to models which can be used in abductive diagnosis.

Apart from discovering inconsistencies, an ATMS is capable of inferring abductive diagnoses. However, it may face computational challenges and is restricted to operate on propositional Horn clauses. In the case of the models we are extracting from the FMEAs, this is not a limitation so far. Nevertheless, as we anticipate exploiting more expressive representations, a different approach is required.

The performance of Boolean satisfiability (SAT) solvers has improved immensely over the last years and
Proceedings of the 26th International Workshop on Principles of Diagnosis
The performance of Boolean satisfiability (SAT) solvers has improved immensely over the last years, and several applications of SAT solvers in practice have proven successful. Furthermore, we are able to encode a greater variety of models in SAT. Thus, we propose a SAT-based approach to abductive diagnosis and empirically compare its performance to a procedure dependent on an ATMS.

The remainder of this paper is structured as follows. After formally providing the theoretical background on abductive diagnosis as well as relevant definitions in the context of SAT, we formulate the modeling process based on FMEAs and give information on the properties of the obtained system descriptions. In Section 5 we describe our SAT-based approach to abductive diagnosis and present an algorithm computing explanations for a given abduction problem. An empirical evaluation comparing our method to an ATMS-based diagnosis engine follows in Section 6. Subsequently, we provide some concluding remarks and give an outlook on future research possibilities.

2 Related Work

Mechanizing logic-based abduction has been an active research field for several decades, with different approaches for generating explanations emerging, such as proof tree completion [10] and consequence finding [11]. While the former exploits a refutation proof involving hypotheses, the latter computes causes as logical consequences of the theory. As resolution is not complete for consequence finding, [12] devised a procedure based on linear resolution which is sound and complete for consequence finding in propositional as well as first-order logic.

While the number of practical applications in the context of abductive model-based diagnosis is rather small, in [13] the authors describe abductive reasoning in environmental decision support systems.

Most recently, [14] present a SAT encoding for consistency-based diagnosis. The system description is compiled into a Boolean formula such that the formula's satisfying assignments correspond to the solutions of the diagnosis problem. Based on the encoding, a SAT solver directly computes the diagnoses. In order to improve the solver's performance, the authors utilize several preprocessing techniques. An empirical comparison of their approach to other model-based diagnosis algorithms indicates that their SAT encoding yields performance benefits. Contrasting these results, [15] propose a translation to Max-SAT which could not outperform the stochastic model-based diagnosis algorithm SAFARI [16].

In [17] the authors present an algorithm which ties constraint solving to diagnosis, thus rendering the detection of inconsistencies and the subsequent hitting set computation unnecessary. Another direct approach by [18] computes minimal diagnoses for over-constrained problems by finding the sets of constraints to be relaxed in order to restore consistency. For Boolean formulas, those relaxations correspond to Minimal Correction Subsets (MCSes). Their hitting set dual, minimal unsatisfiable subsets (MUSes), constitute the sets of subformulas explaining the unsatisfiability, i.e. they refer to conflicts. While there are several algorithms for efficiently computing MCSes, most recently [19] develop three techniques for reducing the number of SAT solver calls for existing methods as well as a novel algorithm for MCSes computation.

As stated by [20], the complexity of abduction precludes a polynomial-time transformation to SAT. Thus, in their work the authors present a fixed-parameter tractable transformation from propositional abduction to SAT exploiting backdoors and describe how to use their transformations to enumerate all solutions for a given abduction instance.

3 Preliminaries

This section provides a brief introduction to abductive model-based diagnosis. In particular, we describe the propositional Horn clause abduction problem (PHCAP) which provides the basis for our research. Note that throughout the paper we consider the closed-world assumption. In addition to the background on abductive model-based diagnosis, we formally define MUSes and MCSes.

3.1 Abductive Diagnosis

In contrast to the traditional consistency-based approach, abductive model-based diagnosis depends on a stronger relation between faults and observable symptoms, namely entailment. Hence, whereas consistency-based diagnosis reasons on the description of the correct system operation, abductive reasoning requires the model to capture the behavior in presence of a fault. By exploiting the notion of entailment and the causal links between defects and their corresponding effects, we can reason about explanations for observed anomalies. In general, abductive diagnosis is an NP-hard problem. However, there are certain subsets of logic, such as propositional definite Horn theory, which are tractable [21]. On these grounds we consider the PHCAP as defined in [22], which represents the connections between causes and effects as propositional Horn sentences. Similar to [22], we define a knowledge base as a set of Horn clauses over a finite set of propositional variables.

Definition 1 (Knowledge base (KB)). A knowledge base (KB) is a tuple (A, Hyp, Th) where A denotes the set of propositional variables, Hyp ⊆ A the set of hypotheses, and Th the set of Horn clause sentences over A.

The set of hypotheses contains the propositions which can be assumed to be either true or false and refer to possible causes. In order to form an abduction problem, a set of observations has to be considered for which explanations are to be computed.

Definition 2 (Propositional Horn Clause Abduction Problem (PHCAP)). Given a knowledge base (A, Hyp, Th) and a set of observations Obs ⊆ A, the tuple (A, Hyp, Th, Obs) forms a Propositional Horn Clause Abduction Problem (PHCAP).

Definition 3 (Diagnosis; Solution of a PHCAP). Given a PHCAP (A, Hyp, Th, Obs), a set ∆ ⊆ Hyp is a solution if and only if ∆ ∪ Th ⊨ Obs and ∆ ∪ Th ⊭ ⊥. A solution ∆ is parsimonious or minimal if and only if no set ∆′ ⊂ ∆ is a solution.
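Definitions 1-3 can be made concrete with a small sketch: since Th is a set of definite Horn clauses, entailment is decidable by forward chaining, and minimal solutions can be found by enumerating hypothesis subsets in order of cardinality. The following Python fragment is our own illustration (exponential enumeration, for exposition only), not part of any implementation discussed in the paper:

```python
from itertools import combinations

def closure(assumptions, th):
    """Forward chaining: all propositions derivable from the assumptions.
    th is a set of (body, head) pairs, where body is a frozenset of propositions."""
    derived = set(assumptions)
    changed = True
    while changed:
        changed = False
        for body, head in th:
            if body <= derived and head not in derived:
                derived.add(head)
                changed = True
    return derived

def is_solution(delta, th, obs):
    """Definition 3: Delta together with Th entails Obs (for definite Horn
    theories, consistency with Th holds trivially)."""
    return set(obs) <= closure(delta, th)

def minimal_diagnoses(hyp, th, obs):
    """Subset-minimal solutions, enumerated by increasing cardinality."""
    diagnoses = []
    for k in range(len(hyp) + 1):
        for delta in combinations(sorted(hyp), k):
            d = set(delta)
            if any(m <= d for m in diagnoses):
                continue  # a subset is already a diagnosis, so d is not minimal
            if is_solution(d, th, obs):
                diagnoses.append(d)
    return diagnoses

# Toy knowledge base: two causes with overlapping effects
th = {(frozenset({"h1"}), "e1"), (frozenset({"h1"}), "e2"),
      (frozenset({"h2"}), "e2")}
print(minimal_diagnoses({"h1", "h2"}, th, {"e2"}))  # [{'h1'}, {'h2'}]
```

Observing only e2 yields two competing single-fault diagnoses, while observing e1 as well would single out h1.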
A solution to a PHCAP is equivalent to an abductive diagnosis, as it comprises the set of hypotheses explaining the observations. Even though Definition 3 does not impose the constraint of minimality on a solution, in practice only parsimonious explanations are of interest. Hence, we refer to minimal diagnoses simply as diagnoses. Notice that finding solutions for a given PHCAP is NP-complete [22].

As aforementioned, an ATMS derives abductive explanations for propositional Horn theories; thus it can be utilized to find solutions to a PHCAP. Based on a graph structure where hypotheses, observations, and the contradiction are represented as nodes, the Horn clause sentences defined in Th determine the directed edges in the graph. Each node is assigned a label containing the sets of hypotheses said node can be inferred from. By updating the labels, the ATMS maintains consistency.

Algorithm abductiveExplanations exploits an ATMS and returns consistent abductive explanations for a set of observations [23]. In case the observation consists of a single effect, the label of the corresponding proposition already contains the abductive diagnoses. To account for multiple observables, i.e. Obs = {o1, o2, ..., on}, an individual implication is added, such that o1 ∧ o2 ∧ ... ∧ on → obs, where obs is a new proposition not yet considered in A. Every set contained in the label of obs constitutes a solution to the particular PHCAP.

Algorithm 1 abductiveExplanations [23]
procedure abductiveExplanations(A, Hyp, Th, Obs)
    Add Th to ATMS
    Add (o1 ∧ o2 ∧ ... ∧ on) → obs to ATMS    ▷ obs ∉ A
    return the label of obs
end procedure

3.2 Minimal Unsatisfiable Subset and Minimal Correction Subset

We assume standard definitions for propositional logic [24]. A propositional formula φ in CNF, defined over a set of Boolean variables X = {x1, x2, ..., xn}, is a conjunction of m clauses (C1, C2, ..., Cm). A clause Ci = (l1 ∨ l2 ∨ ... ∨ lk) is a disjunction of literals, where each literal l is either a Boolean variable or its complement. A truth assignment is a mapping µ : X → {0, 1}, and a satisfying assignment for φ is a truth assignment µ such that φ evaluates to 1 under µ. Given a formula φ, the decision problem SAT consists of deciding whether there is a satisfying assignment for the formula.

In case φ is unsatisfiable, there are subsets of φ which are of special interest in the diagnosis context, namely the MUSes and MCSes. A Minimal Unsatisfiable Subset (MUS) comprises a subset of clauses which cannot be satisfied simultaneously. Notice that every proper subset of an MUS is satisfiable. A Minimal Correction Subset (MCS) is a set of clauses which corrects the unsatisfiable formula, i.e. by removing any MCS the formula becomes satisfiable.

Given an unsatisfiable formula φ, an MUS and an MCS are defined as follows [25]:

Definition 4 (Minimal Unsatisfiable Subset (MUS)). A subset U ⊆ φ is an MUS if U is unsatisfiable and ∀Ci ∈ U : U \ {Ci} is satisfiable.

Definition 5 (Minimal Correction Subset (MCS)). A subset M ⊆ φ is an MCS if φ \ M is satisfiable and ∀Ci ∈ M : φ \ (M \ {Ci}) is unsatisfiable.

Since an MCS is a set of clauses correcting the unsatisfiable formula when removed, a single clause of an MUS is an MCS for this MUS. Note that the hitting set duality of MUSes and MCSes has been established [26].

Example. Consider the unsatisfiable formula φ in CNF, with clauses C1 = (¬a ∨ ¬b ∨ c), C2 = (¬c ∨ d), C3 = (c), and C4 = (¬d):

    φ = C1 ∧ C2 ∧ C3 ∧ C4

It is apparent that the combination of clauses C2, C3 and C4 results in φ being unsatisfiable, hence

    MUSes(φ) = {{C2, C3, C4}}.

By hitting set computation we arrive at the following set of MCSes:

    MCSes(φ) = {{C2}, {C3}, {C4}}.

Removing any MCS of φ results in the formula being satisfiable.

It is worth noticing that utilizing subsets of unsatisfiable formulas has been proposed in regard to consistency-based diagnosis. In this context, a diagnosis is defined as the set of components which, assumed faulty, retains the consistency of the system. Thus, a consistency-based diagnosis corresponds to an MCS. For instance, [18] presents a direct diagnosis method computing MCSes for over-constrained systems. In conflict-directed algorithms, as proposed by Reiter [1], the minimal conflicts, arising from the deviations of the modeled from the experienced behavior, equate to the MUSes. In Section 5 we discuss our abductive diagnosis approach based on MUSes and MCSes.

4 Modeling Methodology

As mentioned before, model-based diagnosis depends on a formal description of the system to be examined. The generation of appropriate models, however, is still an issue preventing a wide industrial adoption, since the modeling process is time-consuming and typically demanding for system engineers.

Therefore, we present a modeling methodology relying on FMEAs available in practice. An FMEA comprises a systematic component-oriented analysis of possible faults and the way they manifest themselves in the artifact's behavior and functionality [8]. This type of assessment is gaining importance and has become a mandatory task in certain industries, especially for systems that require a detailed safety analysis. Due to the knowledge capturing the causal dependencies between specific fault modes and symptoms, an FMEA provides information suitable for abductive reasoning [7].

Definition 6 (FMEA). An FMEA is a set of tuples (C, M, E) where C ∈ COMP is a component, M ∈ MODES is a fault mode, and E ⊆ PROPS is a set of effects.

Running Example. In order to illustrate our modeling process, we use the converter of an industrial wind turbine as our running example [27].
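The MUS/MCS example above can be checked mechanically. The brute-force Python sketch below is our own illustration (real tools employ SAT solvers rather than exhaustive enumeration); it applies Definitions 4 and 5 directly to all subsets of the four clauses:

```python
from itertools import combinations, product

# Clauses of the worked example; a literal is a (variable, polarity) pair.
C1 = [("a", False), ("b", False), ("c", True)]   # (¬a ∨ ¬b ∨ c)
C2 = [("c", False), ("d", True)]                 # (¬c ∨ d)
C3 = [("c", True)]                               # (c)
C4 = [("d", False)]                              # (¬d)
PHI = [C1, C2, C3, C4]

def satisfiable(clauses):
    """Brute-force SAT check over all assignments (fine for four variables)."""
    variables = sorted({v for cl in clauses for v, _ in cl})
    for bits in product([False, True], repeat=len(variables)):
        mu = dict(zip(variables, bits))
        if all(any(mu[v] == pol for v, pol in cl) for cl in clauses):
            return True
    return False

def subsets(clauses):
    idx = range(len(clauses))
    for k in range(len(clauses) + 1):
        for comb in combinations(idx, k):
            yield set(comb)

def muses(clauses):
    """All MUSes: unsatisfiable subsets whose proper subsets are satisfiable."""
    return [s for s in subsets(clauses)
            if not satisfiable([clauses[i] for i in s])
            and all(satisfiable([clauses[i] for i in s - {j}]) for j in s)]

def mcses(clauses):
    """All MCSes: inclusion-minimal removals that restore satisfiability."""
    keep_sat = [s for s in subsets(clauses)
                if satisfiable([clauses[i] for i in set(range(len(clauses))) - s])]
    return [s for s in keep_sat if not any(t < s for t in keep_sat)]

# Clause indices: 0 = C1, 1 = C2, 2 = C3, 3 = C4
print(muses(PHI))   # [{1, 2, 3}]  i.e. MUSes(φ) = {{C2, C3, C4}}
print(mcses(PHI))   # [{1}, {2}, {3}]  i.e. MCSes(φ) = {{C2}, {C3}, {C4}}
```

The output reproduces the sets stated in the text, and each MCS indeed hits the single MUS.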
Table 1 illustrates a simplified FMEA neglecting all parts affiliated with reliability analysis, such as severity ratings. Each row specifies a particular failure mode (i.e. Corrosion, Thermo-mechanical fatigue (TMF), or High-cycle fatigue (HCF)) of a subsystem and determines its corresponding symptoms, such as P_turbine referring to a deviation between expected and measured turbine power output.

    Component | Fault Mode | Effect
    Fan       | Corrosion  | T_cabinet, P_turbine
    Fan       | TMF        | T_cabinet, P_turbine
    IGBT      | HCF        | T_inverter_cabinet, T_nacelle, P_turbine

    Table 1: Excerpt of the FMEA of the converter

Consider the FMEA of the converter in Table 1. We can map the columns to their corresponding representations from Definition 6. The entries in the column Component constitute the elements of COMP, the entries in Fault Mode those of MODES, and PROPS subsumes the entries of Effect.

    COMP = {Fan, IGBT}

    MODES = {Corrosion, TMF, HCF}

    PROPS = {T_cabinet, P_turbine, T_inverter_cabinet, T_nacelle}

Through Definition 6 we obtain

    FMEA_Converter = {
        (Fan, Corrosion, {T_cabinet, P_turbine}),
        (Fan, TMF, {T_cabinet, P_turbine}),
        (IGBT, HCF, {T_inverter_cabinet, T_nacelle, P_turbine})
    }

Since the FMEA already represents the relation between defects and their manifestations, the conversion to a suitable abductive KB is straightforward. It is worth noting that FMEAs usually consider single faults; thus, the resulting diagnostic system holds the single fault assumption. Let HC be the set of Horn clauses. We define a mapping function M : 2^FMEA → HC generating a corresponding propositional Horn clause for each entry of the FMEA [7].

Definition 7 (Mapping function M). Given an FMEA, the function M is defined as follows:

    M(FMEA) =def ∪_{t ∈ FMEA} M(t)

    where M(C, M, E) =def {mode(C, M) → e | e ∈ E}.

We utilize the proposition mode(C, M) to denote that component C experiences fault mode M. Thus, the set of component-fault mode couples forms the set of hypotheses:

    Hyp =def ∪_{(C,M,E) ∈ FMEA} {mode(C, M)}.

In regard to the running example, the following elements compose the set Hyp:

    Hyp = {mode(Fan, Corrosion), mode(Fan, TMF), mode(IGBT, HCF)}

The set of propositional variables A is defined as the union of all effects stored in the FMEA as well as all hypotheses, that is, the set of component-fault mode pairs, i.e.:

    A =def ∪_{(C,M,E) ∈ FMEA} (E ∪ {mode(C, M)})

Continuing our converter example:

    A = {T_cabinet, P_turbine, T_inverter_cabinet, T_nacelle,
         mode(Fan, Corrosion), mode(Fan, TMF), mode(IGBT, HCF)}

Applying M results in the following set of propositional Horn clauses representing Th and thus completing KB_Converter:

    Th = {mode(Fan, Corrosion) → T_cabinet,
          mode(Fan, Corrosion) → P_turbine,
          mode(Fan, TMF) → T_cabinet,
          mode(Fan, TMF) → P_turbine,
          mode(IGBT, HCF) → T_inverter_cabinet,
          mode(IGBT, HCF) → T_nacelle,
          mode(IGBT, HCF) → P_turbine}

On account of the mapping function M and the underlying structure of the FMEAs, the compiled models feature a certain topology. First, the sets of hypotheses and symptoms are disjoint. Second, since there is a causal link from faults to effects but not vice versa, the descriptions exhibit a forward and acyclic structure. Specifically, each implication connects one hypothesis to one effect; the clauses are thus bijunctive. In order to account for impossible observations, we append additional implications to the KB stating that an effect and its negation cannot occur simultaneously, i.e. e ∧ ¬e ⊨ ⊥.

The question remains whether the generated models are suitable for the diagnostic task. Abductive explanations are consistent by definition and complete given an exhaustive search. Thus, the appropriateness of the system description is determined by whether a single fault diagnosis can be obtained given that all necessary information is available.

Definition 8 (One Single Fault Diagnosis Property (OSFDP)). Given a KB (A, Hyp, Th), the KB fulfills the OSFDP if the following holds: ∀m ∈ Hyp : ∃Obs ⊆ A : {m} is a diagnosis of (A, Hyp, Th, Obs) and ¬∃m′ ∈ Hyp : m′ ≠ m such that {m′} is a diagnosis for the same PHCAP.

The property ensures that, under the assumption that enough knowledge is available, all single fault diagnoses can be distinguished and subsequently unnecessary replacement activities are avoided. To verify whether the OSFDP holds, we compute the set of propositions δ(h) implied by each hypothesis h and the theory. The property is not fulfilled if we record the same δ(h) for two or more hypotheses. [7] describes a polynomial algorithm testing for the property. Note that the OSFDP check can be done on the side of the FMEA before compiling the model. This is advantageous as the absence of the property indicates that internal variables or observations have not been considered in the FMEA.
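The mapping M of Definition 7 can be sketched in a few lines of Python. This is our own illustration of the compilation step; identifiers such as fmea_to_kb are not taken from the paper's implementation:

```python
# Compile an FMEA (a list of (component, fault mode, effects) tuples,
# Definition 6) into a KB (A, Hyp, Th) of Horn clauses (Definition 7).

def mode(component, fault_mode):
    """Proposition stating that a component experiences a fault mode."""
    return f"mode({component},{fault_mode})"

def fmea_to_kb(fmea):
    """Return (A, Hyp, Th); an implication mode(C, M) -> e is represented
    as the pair (antecedent, consequent)."""
    hyp = {mode(c, m) for c, m, _ in fmea}
    effects = {e for _, _, es in fmea for e in es}
    th = {(mode(c, m), e) for c, m, es in fmea for e in es}
    return effects | hyp, hyp, th

# Table 1, the converter of the industrial wind turbine
fmea_converter = [
    ("Fan", "Corrosion", {"T_cabinet", "P_turbine"}),
    ("Fan", "TMF", {"T_cabinet", "P_turbine"}),
    ("IGBT", "HCF", {"T_inverter_cabinet", "T_nacelle", "P_turbine"}),
]

A, Hyp, Th = fmea_to_kb(fmea_converter)
print(len(Hyp), len(Th))  # 3 7 -> three hypotheses, seven Horn clauses
```

The three FMEA rows yield exactly the three hypotheses and seven implications of KB_Converter listed above.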
Assume the set of hypotheses {h1, h2, ..., hn} share the same δ(h). We cannot distinguish h1, h2, ..., hn from one another, and thus all corresponding components have to be repaired or replaced in case they are part of the diagnosis. Therefore, we can treat them as a unit by replacing h1, h2, ..., hn with a new hypothesis h′. Once all indistinguishable hypotheses have been replaced in this manner, the KB satisfies the OSFDP. Regarding the hypotheses which cannot be differentiated as one cause during diagnosis also reduces the computational effort, as fewer hypotheses are to be considered.

Algorithm distinguishHypotheses replaces all indistinguishable causes and ensures that after termination the given KB satisfies the OSFDP. Evidently, the algorithm's complexity is determined by the three nested loops, hence O(|Hyp|² |A − Hyp|). Since there is a finite number of hypotheses and effects possibly included in δ(h), the algorithm must terminate.

Algorithm 2 distinguishHypotheses
procedure distinguishHypotheses(A, Hyp, Th)
    Ψ ← Hyp
    for all h1 ∈ Ψ do
        for all h2 ∈ Ψ do
            if h1 ≠ h2 then
                if δ(h1) = δ(h2) and δ(h1) ≠ ∅ then
                    Create new hypothesis h′    ▷ h′ ∉ Hyp
                    Add h′ to Ψ
                    Add h′ to A
                    for all e ∈ δ(h1) do
                        Add (h′ → e) to Th
                        Remove (h1 → e) from Th
                        Remove (h2 → e) from Th
                    end for
                    Remove h1 and h2 from Ψ
                    Remove h1 and h2 from A
                end if
            end if
        end for
    end for
    return KB(A, Ψ, Th)
end procedure

Our running example of the converter does not fulfill the OSFDP, since mode(Fan, Corrosion) and mode(Fan, TMF) are not distinguishable. By removing both hypotheses and introducing h′ = mode((Fan, Corrosion), (Fan, TMF)) the property is fulfilled.

Notice that abductive diagnosis is premised on the assumption that the model is complete; thus, we presume that all significant fault modes for each contributing part of the system have been contemplated in the FMEA. Furthermore, we expect on the one hand that the symptoms described within the FMEA are detectable in order to constitute observations. On the other hand, the automated mapping demands a consistent effect denotation throughout the analysis.

5 Abductive Diagnosis via SAT

Although an ATMS derives abductive diagnoses, it is limited to propositional Horn theories and subject to performance issues. Both problems have been accommodated through ATMS extensions and focus strategies. Nevertheless, the advances in the development of SAT solvers and their application to a vast number of different AI problems and industrial domains have motivated us to consider a SAT-based approach for abductive diagnosis.

Recall Definition 3 of a diagnosis: ∆ is an abductive explanation if ∆ ∪ Th ⊨ Obs and ∆ ∪ Th ⊭ ⊥. Through logical equivalence we recast the first condition as ∆ ∪ Th ∪ {¬Obs} ⊨ ⊥, where {¬Obs} denotes the set containing the complement of each observation in Obs, i.e. ∀o ∈ Obs : ¬o ∈ {¬Obs} [10]. In general, we can state the relation as follows: given the theory, assuming the hypotheses to be true while stating the absence of the set of observations results in an inconsistency, due to the fact that the causes entail the effects, i.e. Hyp ∪ Th ∪ {¬Obs} ⊨ ⊥. Thus, we draw on this relationship and reformulate the problem of generating minimal abductive explanations for a set of observations as computing minimal unsatisfiable subformulas.

Since the MUSes contain several unsatisfiable subsets irrelevant for the diagnostic task, we define the set MUSes_Hyp, which only contains the subset-minimal MUSes comprising clauses referring to hypotheses:

Definition 9 (MUSes_Hyp). Let MUSes be the set of MUSes of Hyp ∪ Th ∪ {¬Obs}; then ∀M ∈ MUSes_Hyp : ∃U ∈ MUSes : M = U ∩ Hyp and ¬∃M′ ∈ MUSes_Hyp : M′ ⊂ M.

Corollary 1. Given a PHCAP (A, Hyp, Th, Obs), let MUSes_Hyp be the set of interesting MUSes. A set ∆ ⊆ Hyp is a minimal abductive diagnosis if ∃M ∈ MUSes_Hyp : ∆ = M and ∆ ∪ Th ⊭ ⊥.

Proof. We can restate the problem of computing inconsistencies as finding the set of prime implicates of Th ∧ Hyp ∧ {¬Obs}. By definition, the prime implicates are equivalent to the MUSes of said formula.

Deriving a minimal abductive explanation corresponds to computing a minimal subset of the hypotheses which cannot be simultaneously satisfied with the theory and the negation of the observations.

We devised the algorithm satAB, which computes the set of abductive diagnoses for a given PHCAP based on MUS enumeration. First, in order to take advantage of the MUSes, which correspond to the solutions of the PHCAP, we create an unsatisfiable CNF encoding of the problem. Since Th consists of Horn clauses, a conversion into CNF is straightforward. Note that we are, however, not limited to Horn clause models, as we can create a CNF representation based on the Tseitin transformation [28]. We refer to the set of clauses associated with the theory as T. For each h ∈ Hyp we create a single unit clause assuming h to be true. Additionally, we generate a disjunction containing the negated observations. The resulting unsatisfiable formula is referred to as φ. ∆-Set is the set of diagnoses obtained from the PHCAP.

The diagnostic task consists in computing the sets of hypotheses which are responsible for the unsatisfiability of φ, i.e. MUSes_Hyp(φ). Since finding satisfiable subsets is an NP-hard problem whereas UNSAT resides in Co-NP, we employ an MCSes enumeration algorithm on the unsatisfiable formula and then derive the diagnoses via hitting set computation [25].
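The indistinguishability test underlying the OSFDP can be sketched compactly. In the flat models compiled from FMEAs, the effects implied by a hypothesis coincide with its direct effects, so δ(h) is a simple projection of Th; function names below are ours, and the code is an illustration rather than the polynomial algorithm of [7]:

```python
def delta(h, th):
    """δ(h): the effects implied by hypothesis h; in these single-step
    models the implied set equals the direct effects of h."""
    return {e for cause, e in th if cause == h}

def indistinguishable_groups(hyp, th):
    """Group hypotheses sharing the same non-empty δ(h); the KB fulfills
    the OSFDP iff no such group exists."""
    groups = {}
    for h in hyp:
        groups.setdefault(frozenset(delta(h, th)), set()).add(h)
    return [g for d, g in groups.items() if d and len(g) > 1]

# Converter theory as (cause, effect) pairs
th = {("mode(Fan,Corrosion)", "T_cabinet"), ("mode(Fan,Corrosion)", "P_turbine"),
      ("mode(Fan,TMF)", "T_cabinet"), ("mode(Fan,TMF)", "P_turbine"),
      ("mode(IGBT,HCF)", "T_inverter_cabinet"), ("mode(IGBT,HCF)", "T_nacelle"),
      ("mode(IGBT,HCF)", "P_turbine")}
hyp = {"mode(Fan,Corrosion)", "mode(Fan,TMF)", "mode(IGBT,HCF)"}

print([sorted(g) for g in indistinguishable_groups(hyp, th)])
# [['mode(Fan,Corrosion)', 'mode(Fan,TMF)']] -> the two Fan modes must be merged
```

As in the running example, the two Fan hypotheses share δ(h) = {T_cabinet, P_turbine} and would be replaced by a single merged hypothesis.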
As we are only interested in the conflicts stemming from the assumption that all hypotheses are true, we select each MCS only containing clauses referring to explanations. For this reason, we create the set MCSes_Hyp such that ∀m ∈ MCSes_Hyp : m ⊆ Hyp. This has one practical rationale: it diminishes the number of sets to be considered by the hitting set algorithm. The corresponding MUSes derived via hitting set computation of MCSes_Hyp already constitute the abductive diagnoses.

Algorithm 3 satAB
procedure satAB(A, Hyp, Th, Obs)
    MCSes ← ∅
    MCSes_Hyp ← ∅
    T ← CNF(Th)    ▷ CNF representation of Th
    φ ← T ∪ Hyp ∪ { ∨_{o ∈ Obs} ¬o }
    MCSes ← MCSes(φ)    ▷ MCS enumeration algorithm
    for all m ∈ MCSes do
        if m ⊆ Hyp and m ∪ Th is consistent then
            MCSes_Hyp ← {m} ∪ MCSes_Hyp
        end if
    end for
    ∆-Set ← MHS(MCSes_Hyp)    ▷ Minimal hitting set algorithm
    return ∆-Set
end procedure

Consider again our running example of the converter. We already obtained the KB via the mapping function M. Let us assume that the condition monitoring system of the wind turbine detected that the turbine's power output is lower than expected (P_turbine) and that the cabinet temperature exceeds a certain threshold (T_cabinet), i.e. Obs = {P_turbine, T_cabinet}. In Figure 1 we depict the CNF representation φ of the abduction problem. Clauses C1 to C7 refer to T, C8 to C10 to the set Hyp, and clause C11 contains the negation of the set of observations.

Figure 1: SAT encoding of the running example

Computing the MCSes of φ we obtain:

    MCSes = {{C11}, {C1, C3}, {C1, C9}, {C3, C8}, {C9, C8},
             {C4, C7, C2}, {C4, C10, C2}, {C4, C7, C8},
             {C4, C10, C8}, {C2, C9, C7}, {C2, C9, C10}}

Extracting the MCSes which only contain clauses from Hyp and are consistent with regard to the theory results in

    MCSes_Hyp = {{C9, C8}}.

By computing the hitting sets of MCSes_Hyp, we obtain the set of MUSes solely referring to explanations, which is in fact the set of diagnoses:

    ∆-Set = {{C9}, {C8}}.

Hence the abductive diagnoses are ∆1 = {mode(Fan, Corrosion)} and ∆2 = {mode(Fan, TMF)}.

6 Empirical Evaluation

To determine whether computing abductive diagnoses via SAT yields any computational advantages in the case of our models, we conducted an empirical evaluation, comparing abductiveExplanations to satAB on several instances of FMEAs. In case of the former, we employed a Java implementation of an unfocused ATMS. The algorithm satAB exploits on the one hand an MCS enumeration procedure and on the other hand an implementation of a hitting set algorithm. We utilized the MCSLS tool by [19] to compute the MCSes. MCSLS is written in C++, employs Minisat 2.2 as the SAT solver, and provides the possibility to apply several MCS enumeration algorithms. We decided for the CLD approach of MCSLS, which takes advantage of disjoint unsatisfiable cores and showed the best overall performance in a preliminary experimental set-up. Regarding the hitting set computation, we engaged a Java implementation of the Binary Hitting Set Tree algorithm [29], which performed well in a comparison of minimal hitting set algorithms [30]. All the numbers presented in this section were obtained on a Lenovo ThinkPad T540p with an Intel Core i7-4700MQ processor (2.60 GHz) and 8 GB RAM running Ubuntu 14.04 (64-bit).

Several publicly available as well as project-internal FMEAs provide the basis for our evaluation. They cover various technical systems and subsystems with different underlying structures. In particular, they describe faults in electrical circuits, a connector system by Ford (FCS), the Focal Plane Unit (FPU) of the Heterodyne Instrument for the Far Infrared (HIFI) built for the Herschel Space Observatory, printed circuit boards (PCB), the Anticoincidence Detector (ACD) mounted on the Large Area Telescope of the Fermi Gamma-ray Space Telescope, the Maritim ITStandard (MiTS), and the rectifier, inverter, transformer, backup components, as well as the main bearing of an industrial wind turbine. By applying the mapping function M, we generated the corresponding abductive knowledge base KB for each FMEA. Table 2 provides an overview of the FMEAs' structure and the evaluation results. It is worth noting that the FMEAs vary in the number of hypotheses, i.e. component-fault mode couples, the number of effects, and the number of rules, i.e. the links between faults and symptoms. Due to the Th of an abductive KB comprising Horn clauses, a conversion into a CNF representation suitable for the MCSLS tool is straightforward.
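The satAB pipeline on the running example can be reproduced by brute force. The sketch below is our own illustration (the actual implementation relies on MCSLS and a Binary Hitting Set Tree); the assignment of C1-C7 to individual theory clauses is our assumption, chosen to be consistent with the MCSes listed in the text:

```python
from itertools import combinations, product

# C1-C7 encode T, C8-C10 the hypotheses, C11 the negated observations.
# A literal is a (variable, polarity) pair.
H1, H2, H3 = "mode(Fan,Corrosion)", "mode(Fan,TMF)", "mode(IGBT,HCF)"
TC, PT, TIC, TN = "T_cabinet", "P_turbine", "T_inverter_cabinet", "T_nacelle"

clauses = {
    "C1": [(H1, False), (TC, True)], "C2": [(H1, False), (PT, True)],
    "C3": [(H2, False), (TC, True)], "C4": [(H2, False), (PT, True)],
    "C5": [(H3, False), (TIC, True)], "C6": [(H3, False), (TN, True)],
    "C7": [(H3, False), (PT, True)],
    "C8": [(H1, True)], "C9": [(H2, True)], "C10": [(H3, True)],
    "C11": [(PT, False), (TC, False)],  # negation of Obs = {P_turbine, T_cabinet}
}
HYP_CLAUSES = {"C8", "C9", "C10"}
THEORY = {"C1", "C2", "C3", "C4", "C5", "C6", "C7"}

def satisfiable(names):
    """Brute-force SAT check of the selected clauses (at most 7 variables)."""
    cls = [clauses[n] for n in names]
    variables = sorted({v for cl in cls for v, _ in cl})
    for bits in product([False, True], repeat=len(variables)):
        mu = dict(zip(variables, bits))
        if all(any(mu[v] == p for v, p in cl) for cl in cls):
            return True
    return False

def all_mcses(names):
    """All MCSes: inclusion-minimal removal sets restoring satisfiability."""
    names = sorted(names, key=lambda n: int(n[1:]))
    removals = [set(c) for k in range(len(names) + 1)
                for c in combinations(names, k)
                if satisfiable(set(names) - set(c))]
    return [r for r in removals if not any(t < r for t in removals)]

def minimal_hitting_sets(sets, universe):
    """Minimal hitting sets via cardinality-ordered enumeration."""
    hs = [set(c) for k in range(1, len(universe) + 1)
          for c in combinations(sorted(universe), k)
          if all(set(c) & s for s in sets)]
    return [h for h in hs if not any(t < h for t in hs)]

mcses = all_mcses(clauses)                      # the text lists 11 MCSes
mcses_hyp = [m for m in mcses if m <= HYP_CLAUSES and satisfiable(m | THEORY)]
diagnoses = minimal_hitting_sets(mcses_hyp, HYP_CLAUSES)
print([sorted(m) for m in mcses_hyp])           # [['C8', 'C9']]
print(sorted(sorted(d) for d in diagnoses))     # [['C8'], ['C9']]
```

The filter retains only {C8, C9}, and its minimal hitting sets {C8} and {C9} correspond to the diagnoses ∆1 = {mode(Fan, Corrosion)} and ∆2 = {mode(Fan, TMF)}.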
We do not address the model compilation times, since the system description would be compiled offline and the mapping execution consumed less than one second for the examples we utilized so far.

Table 2 shows that none of the original models, except the one resulting from the transformer's FMEA, satisfy the OSFDP. Therefore, we compiled a second set of models fulfilling the property by exchanging each set of indistinguishable hypotheses for a new single hypothesis representing said set; Algorithm distinguishHypotheses, for example, ensures that the resulting KB satisfies the OSFDP. In Table 2 the original models are identified accordingly, and the adapted models are provided with the label OSFDP. Note that the number of hypotheses and rules diminishes for the adapted models.

In the experiments, we computed the abductive explanations for |Obs| ranging from one to the maximum number of effects possible. The observations were generated randomly; however, the same set was used for satAB and abductiveExplanations on the original as well as the adapted model. The results reported in Table 2 have been obtained from ten trials, and both algorithms faced a 200-second runtime limit. Whereas some of the small runtimes are arguable due to the measurement in the milliseconds range, Table 2 reveals that satAB (Mean = 703.73 ms, SD = 8432.07 ms, Median = 0.59 ms, Skewness = 18.61) does not outperform abductiveExplanations (Mean = 3.08 ms, SD = 16.38 ms, Median = 1 ms, Skewness = 12.68) in general. From the statistical data we can infer that the underlying runtime distribution of both algorithms is highly right-skewed; thus the bulk of values is located towards the lower runtimes. We can even observe that for certain instances the SAT-based approach performs rather poorly. Amongst these are the models of an inverter and a rectifier of an industrial wind turbine; satAB exceeded the given timeout four times for the former. Notice that in all these cases the MCSes generation already reached the time threshold. According to [19], CLD requires |φ| − p + 1 SAT solver calls, where p refers to the size of the smallest MCS of φ. In our case p = 1, as the clause representing the set of negated observations always constitutes an MCS. Thus, |φ| SAT solver calls are necessary, where |φ| is determined by |Th| + |Hyp| + 1, with 1 referring to the clause containing the observations. Unsurprisingly, the larger FMEAs are more computationally demanding; the hitting set computation accounted for a negligible fraction of the total runtime.

Figure 2: Cumulative runtimes of abductiveExplanations and satAB for the FMEA instances

Figure 2 illustrates the cumulative log runtimes for satAB and abductiveExplanations on the generated FMEA models. Although abductiveExplanations performs on average better, the first model requires a longer computation time for both algorithms. Moreover, the illustration reveals the high computational effort necessary for satAB to compute the diagnoses for the model of the inverter. As expected, we observe particularly high runtimes when the set of observations contains effects corresponding to different hypotheses. This has a greater impact on satAB than on the ATMS implementation. For the section from the models FCS to PCB in Figure 2, however, we can see that the cumulative runtime for abductiveExplanations rises at a steeper angle. Generally, the data gathered in the experiment do not suggest a performance benefit of the SAT-based approach over an ATMS implementation.

7 Conclusion and Future Work

In the course of the paper, we presented a mapping from failure assessments available in practice to propositional Horn clause models. The modeling methodology relies on FMEAs, as they comprise information on faults and their symptoms. Hence, they provide a suitable source for model compilation. Although in our case an ATMS can be used to compute abductive diagnoses, it is limited to propositional Horn theories. We proposed a SAT-based approach to abductive model-based diagnosis which allows us to reason on more expressive representations. Our method is based on computing conflict sets, i.e. MUSes, resulting from a rewritten, unsatisfiable system description. Subsets of these unsatisfiable cores constitute the minimal abductive explanations. Since the computation of MUSes is computationally demanding, our proposed algorithm exploits their hitting set dual, the MCSes, in order to derive minimal diagnoses.

We empirically compared an implementation of a diagnosis engine employing an ATMS to our SAT-based algorithm. The results indicate that, while the algorithm performs well for some of the models, in general we could not observe a performance advantage. Particular examples even led to longer computation times than the ATMS-based implementation. Despite the fact that the data provided no evidence of a computational benefit in employing a SAT-based approach, we believe that the possibility to utilize more expressive models provides an interesting incentive for future research in this area.

Since the evaluation results did not indicate a superiority of the SAT-based approach on the grounds of MCSes enumeration, we are currently investigating direct conflict generation methods. Additionally, due to the model structure and the experiment data, we are planning on employing compilation methods [31, 32] in order to divert some of the computational inefficiency to the model generation process.

Acknowledgments

The work presented in this paper has been supported by the FFG project Applied Model Based Reasoning (AMOR) under grant 842407. We would further like to express our gratitude to our industrial partner, Uptime
It is worth mentioning that in the majority of cases Engineering GmbH.
173
Proceedings of the 26th International Workshop on Principles of Diagnosis
(Model structure: #Hyp, #Effects, #Rules. #Diagnoses: MAX, AVG. SF, DF, TF as defined in the caption. Runtimes MIN/MAX/AVG in ms.)

Electrical circuit
  Original (#Hyp 32, #Effects 17, #Rules 52; #Diagnoses MAX 792, AVG 197.15; SF 11, DF 11, TF 66):
    abductiveExplanations: MIN <1, MAX 425, AVG 27.87
    satAB: MIN <1, MAX 181.33, AVG 76.05
  OSFDP (#Hyp 15, #Effects 17, #Rules 35; #Diagnoses MAX 1, AVG 1; SF 1, DF 1, TF 1):
    abductiveExplanations: MIN <1, MAX 8, AVG 0.33
    satAB: MIN <1, MAX 1.91, AVG 0.16

FCS
  Original (#Hyp 17, #Effects 17, #Rules 51; #Diagnoses MAX 18, AVG 2.93; SF 3, DF 6, TF 18):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.42
    satAB: MIN <1, MAX 6.41, AVG 1.28
  OSFDP (#Hyp 15, #Effects 17, #Rules 49; #Diagnoses MAX 18, AVG 2.75; SF 3, DF 6, TF 18):
    abductiveExplanations: MIN <1, MAX 61, AVG 2.04
    satAB: MIN <1, MAX 4.73, AVG 0.56

ACD
  Original (#Hyp 13, #Effects 16, #Rules 41; #Diagnoses MAX 15, AVG 2.89; SF 5, DF 15, TF 15):
    abductiveExplanations: MIN <1, MAX 84, AVG 1.38
    satAB: MIN <1, MAX 2.89, AVG 0.35
  OSFDP (#Hyp 12, #Effects 16, #Rules 39; #Diagnoses MAX 10, AVG 2.04; SF 5, DF 10, TF 10):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.29
    satAB: MIN <1, MAX 2.435, AVG 0.28

Main bearing
  Original (#Hyp 3, #Effects 5, #Rules 20; #Diagnoses MAX 3, AVG 2.54; SF 3, DF 0, TF 0):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.16
    satAB: MIN <1, MAX 1, AVG 0.09
  OSFDP (#Hyp 2, #Effects 5, #Rules 15; #Diagnoses MAX 2, AVG 1.54; SF 2, DF 0, TF 0):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.12
    satAB: MIN <1, MAX 0.61, AVG 0.03

HIFI - FPU
  Original (#Hyp 17, #Effects 11, #Rules 36; #Diagnoses MAX 63, AVG 8.64; SF 3, DF 7, TF 21):
    abductiveExplanations: MIN <1, MAX 86, AVG 2.54
    satAB: MIN <1, MAX 8.33, AVG 3
  OSFDP (#Hyp 9, #Effects 11, #Rules 27; #Diagnoses MAX 6, AVG 1.55; SF 2, DF 2, TF 3):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.15
    satAB: MIN <1, MAX 1, AVG 0.09

MiTS 1
  Original (#Hyp 18, #Effects 21, #Rules 48; #Diagnoses MAX 24, AVG 8.40; SF 3, DF 2, TF 6):
    abductiveExplanations: MIN <1, MAX 94, AVG 3.40
    satAB: MIN <1, MAX 3.02, AVG 0.39
  OSFDP (#Hyp 13, #Effects 21, #Rules 43; #Diagnoses MAX 1, AVG 1; SF 1, DF 1, TF 1):
    abductiveExplanations: MIN <1, MAX 100, AVG 1.54
    satAB: MIN <1, MAX 2.15, AVG 0.16

MiTS 2
  Original (#Hyp 22, #Effects 15, #Rules 48; #Diagnoses MAX 288, AVG 39.98; SF 4, DF 8, TF 18):
    abductiveExplanations: MIN <1, MAX 109, AVG 4.49
    satAB: MIN <1, MAX 15.16, AVG 3.43
  OSFDP (#Hyp 14, #Effects 15, #Rules 37; #Diagnoses MAX 5, AVG 2.02; SF 1, DF 5, TF 2):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.33
    satAB: MIN <1, MAX 1.68, AVG 0.20

PCB
  Original (#Hyp 10, #Effects 11, #Rules 24; #Diagnoses MAX 2, AVG 1.49; SF 2, DF 2, TF 2):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.21
    satAB: MIN <1, MAX 1.49, AVG 0.1
  OSFDP (#Hyp 9, #Effects 11, #Rules 23; #Diagnoses MAX 1, AVG 1; SF 1, DF 1, TF 1):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.11
    satAB: MIN <1, MAX 1, AVG 0.1

Inverter
  Original (#Hyp 30, #Effects 38, #Rules 144; #Diagnoses MAX 450, AVG 23.73; SF 19, DF 5, TF 50):
    abductiveExplanations: MIN <1, MAX 107, AVG 6.15
    satAB: MIN <1, MAX 166593, AVG 5007.37
  OSFDP (#Hyp 23, #Effects 38, #Rules 124; #Diagnoses MAX 66, AVG 5.89; SF 14, DF 3, TF 6):
    abductiveExplanations: MIN <1, MAX 94, AVG 1.67
    satAB: MIN <1, MAX 1110.82, AVG 38.23

Rectifier
  Original (#Hyp 20, #Effects 17, #Rules 93; #Diagnoses MAX 88, AVG 10.83; SF 8, DF 24, TF 32):
    abductiveExplanations: MIN <1, MAX 6, AVG 1.07
    satAB: MIN <1, MAX 24236.9, AVG 1070.88
  OSFDP (#Hyp 14, #Effects 17, #Rules 66; #Diagnoses MAX 22, AVG 3.06; SF 5, DF 18, TF 8):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.63
    satAB: MIN <1, MAX 44.74, AVG 4.88

Transformer
  Original (#Hyp 5, #Effects 8, #Rules 22; #Diagnoses MAX 2, AVG 1.06; SF 2, DF 2, TF 1):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.16
    satAB: MIN <1, MAX 1.69, AVG 0.06
  OSFDP (#Hyp 5, #Effects 8, #Rules 22; #Diagnoses MAX 2, AVG 1.06; SF 2, DF 2, TF 1):
    abductiveExplanations: MIN <1, MAX 1, AVG 0.13
    satAB: MIN <1, MAX 1.91, AVG 0.08

Backup components
  Original (#Hyp 25, #Effects 30, #Rules 114; #Diagnoses MAX 252, AVG 23.06; SF 8, DF 12, TF 21):
    abductiveExplanations: MIN <1, MAX 138, AVG 5.24
    satAB: MIN <1, MAX 41.98, AVG 12.89
  OSFDP (#Hyp 19, #Effects 30, #Rules 95; #Diagnoses MAX 48, AVG 3.29; SF 7, DF 7, TF 10):
    abductiveExplanations: MIN <1, MAX 4, AVG 0.79
    satAB: MIN <1, MAX 10.06, AVG 3.09
Table 2: Features of the FMEAs and experimental results. For each component we conducted the experiment
using an implementation of abductiveExplanations and satAB. The columns SF, DF, TF display the maximum
number of single faults, double faults, and triple faults, respectively.
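The duality exploited by the algorithm compared above — every minimal conflict (MUS) is a minimal hitting set of the collection of MCSes, and vice versa — can be sketched with a brute-force enumeration. This is an illustrative sketch only: the hypothesis names are hypothetical, and the exhaustive search does not scale to the FMEA models of Table 2.

```python
from itertools import combinations

def minimal_hitting_sets(collections):
    """Enumerate all minimal sets that intersect every set in `collections`.

    Candidates are tried smallest-first, so a candidate is minimal exactly
    when no previously found (smaller) hitting set is contained in it.
    """
    universe = sorted(set().union(*collections))
    results = []
    for size in range(1, len(universe) + 1):
        for cand in combinations(universe, size):
            cs = set(cand)
            if all(cs & c for c in collections):
                # keep only candidates not covered by a smaller hitting set
                if not any(r < cs for r in results):
                    results.append(cs)
    return results

# Hypothetical MCSes over hypotheses h1..h3; their minimal hitting sets
# are the minimal conflicts (MUSes) of the dual problem: {h2} and {h1, h3}.
mcses = [{"h1", "h2"}, {"h2", "h3"}]
print(minimal_hitting_sets(mcses))
```

Running the same routine in the other direction, on the MUSes, returns the MCSes again, which is what makes enumerating the cheaper of the two collections attractive.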
References
[1] Raymond Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1):57–95, 1987.
[2] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.
[3] Brian C. Williams and P. Pandurang Nayak. A model-based approach to reactive self-configuring systems. In Proceedings of the National Conference on Artificial Intelligence, pages 971–978, 1996.
[4] Peter Struss, Andreas Malik, and Martin Sachenbacher. Case studies in model-based diagnosis and fault analysis of car-subsystems. In Proc. 1st Int'l Workshop Model-Based Systems and Qualitative Reasoning, pages 17–25, 1996.
[5] Luca Console, Daniele Theseider Dupre, and Pietro Torasso. On the relationship between abduction and deduction. Journal of Logic and Computation, 1(5):661–690, 1991.
[6] Peter Zoeteweij, Jurryt Pietersma, Rui Abreu, Alexander Feldman, and Arjan J.C. van Gemund. Automated fault diagnosis in embedded systems. In Secure System Integration and Reliability Improvement, SSIRI'08, Second International Conference on, pages 103–110. IEEE, 2008.
[7] Franz Wotawa. Failure mode and effect analysis for abductive diagnosis. In Proceedings of the International Workshop on Defeasible and Ampliative Reasoning (DARe-14), volume 1212. CEUR Workshop Proceedings, ISSN 1613-0073, 2014. http://ceur-ws.org/Vol-1212/.
[8] Peter G. Hawkins and Davis J. Woollons. Failure modes and effects analysis of complex engineering systems using functional models. Artificial Intelligence in Engineering, 12:375–397, 1998.
[9] Chris Price and Neil Taylor. Automated multiple failure FMEA. Reliability Engineering & System Safety, 76:1–10, 2002.
[10] Sheila A. McIlraith. Logic-based abductive inference. Knowledge Systems Laboratory, Technical Report KSL-98-19, 1998.
[11] Pierre Marquis. Consequence finding algorithms. In Handbook of Defeasible Reasoning and Uncertainty Management Systems, pages 41–145. Springer, 2000.
[12] Katsumi Inoue. Linear resolution for consequence finding. Artificial Intelligence, 56(2):301–353, 1992.
[13] Franz Wotawa, Ignasi Rodriguez-Roda, and Joaquim Comas. Environmental decision support systems based on models and model-based reasoning. Environmental Engineering and Management Journal, 9(2):189–195, 2010.
[14] Amit Metodi, Roni Stern, Meir Kalech, and Michael Codish. A novel SAT-based approach to model based diagnosis. Journal of Artificial Intelligence Research, pages 377–411, 2014.
[15] Alexander Feldman, Gregory Provan, Johan de Kleer, Stephan Robert, and Arjan van Gemund. Solving model-based diagnosis problems with Max-SAT solvers and vice versa. In DX-10, International Workshop on the Principles of Diagnosis, 2010.
[16] Alexander Feldman, Gregory M. Provan, and Arjan J.C. van Gemund. Computing minimal diagnoses by greedy stochastic search. In AAAI, pages 911–918, 2008.
[17] Iulia Nica and Franz Wotawa. ConDiag - computing minimal diagnoses using a constraint solver. In International Workshop on Principles of Diagnosis, pages 185–191, 2012.
[18] Alexander Felfernig and Monika Schubert. FastDiag: A diagnosis algorithm for inconsistent constraint sets. In Proceedings of the 21st International Workshop on the Principles of Diagnosis (DX 2010), Portland, OR, USA, pages 31–38, 2010.
[19] Joao Marques-Silva, Federico Heras, Mikolás Janota, Alessandro Previti, and Anton Belov. On computing minimal correction subsets. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 615–622. AAAI Press, 2013.
[20] Andreas Pfandler, Stefan Rümmele, and Stefan Szeider. Backdoors to abduction. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1046–1052. AAAI Press, 2013.
[21] Gustav Nordh and Bruno Zanuttini. What makes propositional abduction tractable. Artificial Intelligence, 172:1245–1284, 2008.
[22] Gerhard Friedrich, Georg Gottlob, and Wolfgang Nejdl. Hypothesis classification, abductive diagnosis and therapy. In Expert Systems in Engineering Principles and Applications, pages 69–78. Springer, 1990.
[23] Franz Wotawa, Ignasi Rodriguez-Roda, and Joaquim Comas. Abductive reasoning in environmental decision support systems. In AIAI Workshops, pages 270–279, 2009.
[24] Chin-Liang Chang and Richard Char-Tung Lee. Symbolic Logic and Mechanical Theorem Proving. Academic Press, 1973.
[25] Mark H. Liffiton and Karem A. Sakallah. Algorithms for computing minimal unsatisfiable subsets of constraints. Journal of Automated Reasoning, 40(1):1–33, 2008.
[26] Elazar Birnbaum and Eliezer L. Lozinskii. Consistent subsets of inconsistent systems: structure and behaviour. Journal of Experimental & Theoretical Artificial Intelligence, 15(1):25–46, 2003.
[27] Christopher S. Gray, Roxane Koitz, Siegfried Psutka, and Franz Wotawa. An abductive diagnosis and modeling concept for wind power plants. In International Workshop on Principles of Diagnosis, 2014.
[28] Gregory Tseitin. On the complexity of proofs in propositional logics. In Seminars in Mathematics, volume 8, pages 466–483, 1970.
[29] Li Lin and Yunfei Jiang. The computation of hitting sets: review and new algorithms. Information Processing Letters, 86(4):177–184, 2003.
[30] Ingo Pill, Thomas Quaritsch, and Franz Wotawa. From conflicts to diagnoses: An empirical evaluation of minimal hitting set algorithms. In 22nd Int. Workshop on the Principles of Diagnosis, pages 203–210, 2011.
[31] Adnan Darwiche. Decomposable negation normal form. Journal of the ACM (JACM), 48(4):608–647, 2001.
[32] Pietro Torasso and Gianluca Torta. Computing minimum-cardinality diagnoses using OBDDs. In KI 2003: Advances in Artificial Intelligence, pages 224–238. Springer, 2003.
Fault Tolerant Control for a 4-Wheel Skid Steering Mobile Robot

George K. Fourlas 1, George C. Karras 2 and Kostas J. Kyriakopoulos 2

1 Department of Computer Engineering, Technological Educational Institute (T.E.I.) of Central Greece, Lamia, Greece
email: gfourlas@teiste.gr
2 Control Systems Laboratory, School of Mechanical Eng., National Technical University of Athens (NTUA), Athens, Greece
email: karrasg@mail.ntua.gr, kkyria@mail.ntua.gr
Abstract

This paper studies a fault tolerant control strategy for a four wheel skid steering mobile robot (SSMR). In this work the fault diagnosis procedure is accomplished using a structural analysis technique, while fault accommodation is based on a Recursive Least Squares (RLS) approximation. The goal is to detect faults as early as possible and recalculate command inputs in order to achieve fault tolerance, which means that despite the occurrence of faults the system is able to recover its original task with the same or degraded performance. Fault tolerance can be considered to consist of two basic tasks, fault diagnosis and control redesign. In our research, using the diagnosis approach presented in our previous work, we mainly address the second task, proposing a framework for fault tolerant control which allows retaining acceptable performance under system faults. In order to prove the efficacy of the proposed method, an experimental procedure was carried out using a Pioneer 3-AT mobile robot.

1 Introduction

The higher demands for more reliable performance in modern robotic systems have necessitated the development of appropriate fault diagnosis methods. The appearance of faults is inevitable in all systems, such as wheeled robots, either because their elements are worn out or because the environment in which they operate presents unanticipated situations [4].

In a large number of applications, as for example search and rescue, planetary exploration, nuclear waste cleanup or mine decommissioning, wheeled robots operate in environments where human intervention can be very costly, slow or even impossible. They can move freely in such dynamic environments. It is therefore essential for the robots to monitor their behavior so that faults may be addressed before they result in catastrophic failures.

A wheeled mobile robot is usually an embedded control platform, which consists of an on-board computer, power, a motor control system, communications, sonars, cameras, a laser radar system and sensors such as gyroscopes, encoders, accelerometers etc. (Fig. 1).

Figure 1. 4-Wheel Skid Steering Mobile Robot.

Fault diagnosis and accommodation for wheeled mobile robots is a complex problem due to the large number of faults that can be present, such as faults of sensors and actuators [10]–[20].

Model based fault detection and isolation is a method to perform fault diagnosis using a certain model of the system. The goal is to detect faults as early as possible in order to provide a timely warning [8]. The aim of timely handling of a fault occurrence is to accommodate its consequences so that the system remains functional. This can be achieved with fault tolerance.

In cases where a fault cannot be tolerated, it is necessary to use redundant hardware. In practice there exist two different approaches to fault tolerant control, static redundancy and dynamic redundancy [8].

In [10] and [16], the research is focused only on the problem of fault detection and identification in a mobile robot, and different approaches related to state estimation were introduced. In [9] and [15], the research interest is focused only on the problem of fault detection, which is a separate problem in the fault diagnosis domain. The research efforts in [7] and [12]–[14] are primarily intended to detect faults in the sensors of a wheeled robot. Concerning the research area of detection and accommodation on wheeled robots there is also a small number of efforts [18] with different approaches and methodologies.

A fault can be considered as any unpermitted deviation from the normal behavior of a system. Fault diagnosis is the procedure of determining the component which is faulty. Consequently, the aim of fault diagnosis is to produce the suitable fault statement regarding the malfunction of a wheeled robot.
Fault diagnosis includes fault detection, which is the indication that something is going wrong in the system, and fault isolation, which is the determination of the magnitude of the fault by evaluating symptoms and which follows fault detection. Fault detection and isolation tasks together are referred to as fault diagnosis (FDI - Fault Detection and Isolation).

Among the various methods for the design of a residual generator, only few deal with nonlinear systems. Structural analysis is a technique that provides feasible solutions to the residual generation of nonlinear systems. Structural analysis methods are used in research publications [2] and [6]. Paper [3] presents a structural analysis for complex systems such as a ship propulsion benchmark. In [13] and [14] the authors discuss how the structural analysis technique is applied to an unmanned ground vehicle for residual generation.

In this research, a model based fault diagnosis for a four wheel skid steering mobile robot (SSMR) is presented. The basic idea is to use a structural analysis based technique in order to generate residuals. For this purpose we use the kinematic model of the mobile robot, which serves for the design of the structural model of the system. This technique provides the parity equations which can be used as residual generators. The advantage of the proposed method is that it offers a feasible solution to the residual generation of nonlinear systems. Additionally, we propose a fault accommodation technique based on an RLS approximation in order to provide recalculated control inputs in the case that the left or right set of the robot tires becomes flat.

The mobile robot is supposed to be equipped with two high resolution optical quadrature shaft encoders mounted on reversible-DC motors, which provide the rotational speeds of the left and right wheels ωL and ωR respectively, and an inertial measurement unit (IMU) which provides the forward linear acceleration and the angular velocity as well as the angle ϑ between the mobile robot axle and the x axis of the mobile robot. The absolute pose (horizontal position and orientation) of the robot is available via a camera system mounted on the workspace of the robot. A distinctive marker is placed at the top side of the robot.

The paper is organized as follows. We start by presenting the mathematical model of a Pioneer 3-AT mobile robot in Section 2. Section 3 describes the fault diagnosis procedure. Section 4 describes the methodology of fault accommodation. In Section 5 we present the application results of the proposed method on the robotic platform. Conclusions and directions for future work are presented in Section 6.

2 Mathematical Model of Pioneer 3-AT Mobile Robot

In this work, the mobile robot Pioneer 3-AT was used as the robotic platform. This robot is a four wheel skid-steering vehicle actuated by two motors, one for the left side wheels and the other for the right side wheels. The wheels on the same side are mechanically coupled and thus have the same velocity. Also, they are equipped with encoders and the angular readings are available through routine calls.

The kinematic model describes the motion constraints of the system, as well as the relationship of the sensor measurements with the system states, and it is crucial for the fault diagnosis procedure.

2.1 Kinematic Model

The geometry of the robot is presented in Fig. 2. To consider the model of the four wheel skid steering mobile robot (SSMR) it is assumed that the robot is placed on a plane surface, where (XI, YI) is the inertial reference frame and (X, Y) is a local coordinate frame fixed on the robot at its center of mass (COM). The position of the COM is (x, y) with respect to the inertial frame and ϑ is the orientation of the local coordinate frame with respect to the inertial frame.

Figure 2. Mobile Robot Geometry (showing the wheel velocities v1–v4 and their components, the ICR, the COM, the track half-width c and the axle distances a and b).

As depicted in Fig. 2, a is the distance between the center of mass and the front wheels axle along X, b is the distance between the center of mass and the rear wheels axle along X, c is half the distance between the wheels along Y, and RL, RR are the radii of the left and right wheels respectively. The coordinates of the instantaneous center of rotation (ICR) are (xICR, yICR).

Assuming that the robot moves on a horizontal plane, the linear velocity with respect to the local frame is given by

    υ = [υx, υy, 0]^T    (1)

and its angular velocity is given by

    ω = [0, 0, ω]^T    (2)

The state vector with respect to the inertial frame is

    q = [x, y, ϑ]^T    (3)
The time derivative of (3) denotes the robot's velocity vector and is given by

    [ẋ, ẏ, ϑ̇]^T = [cos ϑ, −sin ϑ, 0; sin ϑ, cos ϑ, 0; 0, 0, 1] [υx, υy, ω]^T    (4)

Assuming that longitudinal slip between the wheels and the surface can be neglected, we have the following equation,

    υix = Ri ωi    (5)

where υix is the longitudinal component of the total velocity vector υi of the i-th wheel expressed with respect to the local frame and Ri is the rolling radius of that wheel.

If we take into account all wheels (Fig. 2), the following relationships between the wheels can be obtained [11],

    υL = υ1x = υ2x
    υR = υ3x = υ4x
    υF = υ1y = υ4y
    υB = υ2y = υ3y    (6)

where υL refers to the longitudinal coordinates of the left wheels velocities, υR refers to the longitudinal coordinates of the right wheels velocities, υF refers to the lateral coordinates of the front wheels velocities and υB refers to the lateral coordinates of the rear wheels velocities.

Unlike other mobile robots, the lateral velocities of the four wheel skid steering mobile robot are generally nonzero, since due to its mechanical structure lateral skidding is necessary if the robot changes its orientation. Therefore, in order to complete the kinematic model, the following nonholonomic constraint in Pfaffian form is introduced

    [−sin ϑ, cos ϑ, −xICR] [ẋ, ẏ, ϑ̇]^T = A(q) q̇ = 0    (7)

Then we have

    q̇ = S(q) η    (8)

where

    S(q) = [cos ϑ, xICR sin ϑ; sin ϑ, −xICR cos ϑ; 0, 1]    (9)

    η = [υx, ω]^T    (10)

S(q) is a full rank matrix whose columns are in the null space of A(q),

    S^T(q) A^T(q) = 0    (11)

It is noted that since dim(η) = 2 < dim(q) = 3, equation (8) describes the kinematics of an under-actuated robot with the nonholonomic constraint given by (7).

We suppose that the mobile robot localization is calculated via the following measurement devices:
• two high resolution optical quadrature shaft encoders mounted on reversible-DC motors, which provide the rotational speeds of the left and right wheels ωL and ωR respectively,
• an Inertial Measurement Unit (IMU), which provides the forward linear acceleration and the angular velocity as well as the angle ϑ between the mobile robot axle and the x axis of the mobile robot,
• a camera system, which calculates the pose of the robot by tracking a marker placed at the top side of it.

In this work we are only interested in abrupt faults which occur in the actuators of the mobile robot and, as a consequence, we make the following assumptions.
• Assumption 1: When the mobile robot starts functioning, all its components are in normal mode.
• Assumption 2: The magnitude of the noise is assumed to be significantly smaller than the magnitude of the faults.
• Assumption 3: Regarding the wheel radii, the following inequalities are satisfied: RR + δRR > 0 and RL + δRL > 0.
According to this assumption, faults that result in the complete loss of a wheel are not considered.

3 Fault Detection and Isolation

Among the several techniques for generating residuals, a limited number concerns nonlinear systems. One such technique is structural analysis. Using this method we can extract information about system components that we are not able to measure. We can also obtain the parity equations that allow generating residuals.

The structure of the mobile robot is described using the following sets of constraints C and variables V

    C = {c1, c2, ..., c9}    (12)
    V = X ∪ K    (13)

X is the subset of unknown variables and K is the subset of known ones, that is, measurements and inputs. The above subsets are

    X = {ẋ, ẏ, ϑ̇, υx, υy}    (14)
    K = {x, y, ϑ, ẍ, ÿ, ω, ωL, ωR}    (15)

The constraint set of the mobile robot is

    c1: ẋ = cos ϑ υx − sin ϑ υy    (16)
    c2: ẏ = sin ϑ υx + cos ϑ υy    (17)
    c3: ϑ̇ = ω    (18)
    c4: ẋ = dx/dt    (19)
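The kinematic relation (4) and the Pfaffian constraint (7) can be checked numerically. Below is a minimal sketch; the function and variable names are our own illustrative choices, and the sample values are not taken from the paper's experiments.

```python
import math

def body_to_inertial(theta, vx, vy, omega):
    """Rotate the body-frame velocities (vx, vy) into the inertial frame and
    pass the angular rate through, following the kinematic relation (4)."""
    x_dot = math.cos(theta) * vx - math.sin(theta) * vy
    y_dot = math.sin(theta) * vx + math.cos(theta) * vy
    theta_dot = omega
    return x_dot, y_dot, theta_dot

def pfaffian_residual(theta, x_dot, y_dot, theta_dot, x_icr):
    """Left-hand side of the nonholonomic constraint (7); it vanishes when
    the motion is consistent with the skid-steering kinematics."""
    return -math.sin(theta) * x_dot + math.cos(theta) * y_dot - x_icr * theta_dot

# Substituting (4) into (7) reduces the constraint to vy - x_icr * omega = 0,
# so choosing vy = x_icr * omega makes the residual (numerically) zero.
theta, vx, omega, x_icr = 0.3, 1.0, 0.5, 0.2
vy = x_icr * omega
xd, yd, td = body_to_inertial(theta, vx, vy, omega)
print(abs(pfaffian_residual(theta, xd, yd, td, x_icr)) < 1e-9)  # prints: True
```

This also makes explicit why the robot is under-actuated: only vx and omega are free inputs, while vy is dictated by the constraint.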
    c5: ẏ = dy/dt    (20)
    c6: ϑ = ∫₀ᵗ ω dt    (21)
    c7: υx = (r/2)(ωR + ωL)    (22)
    c8: ẍ = dẋ/dt    (23)
    c9: ÿ = dẏ/dt    (24)

Through the above technique we create the following incidence matrix that describes the robot structure, Table 1.

Table 1. Incidence Matrix

          KNOWN                            UNKNOWN
        x  y  ϑ  ẍ  ÿ  ω  ωL  ωR  |  ẋ  ẏ  ϑ̇  υx  υy
    c1        1                   |  1          1   1
    c2        1                   |     1       1   1
    c3                 1          |        1
    c4  1                         |  1
    c5     1                      |     1
    c6        1        1          |
    c7                     1   1  |             1
    c8           1                |  1
    c9              1             |     1

Applying the matching algorithm [1] to the incidence matrix, we obtain the following matched M and unmatched U constraints

    M = {c1, c2, c3, c4, c7}    (25)
    U = {c5, c6, c8, c9}    (26)

In order to obtain residual generators we use the following parity equations

    c5(ẏ, y) = 0    (27)
    c6(ϑ, ω) = 0    (28)
    c8(ẍ, ẋ) = 0    (29)
    c9(ÿ, ẏ) = 0    (30)

By starting from the unknown variables and backtracking to known variables, the residuals are:

    r1 = dy/dt − sin ϑ (r/2)(ωR + ωL) + (cos ϑ / sin ϑ) [cos ϑ (r/2)(ωR + ωL) − ∫ ẍ dt]    (31)
    r2 = ϑ − ∫₀ᵗ ω dt    (32)
    r3 = ∫ ẍ dt − dx/dt    (33)
    r4 = ∫ ÿ dt − dy/dt    (34)

4 Fault Accommodation

Fault accommodation is the phase that follows fault diagnosis. One of the most important issues to consider in the design of fault tolerant control relates to the performance and functionality of the system under consideration. More specifically, one should take into consideration the degree of performance degradation that is acceptable. There are two aspects of system performance, dynamic and steady state. In our approach we take into account the second one. We also use the aforementioned fault diagnosis method to monitor the system. The goal is to have the necessary information about the fault occurrence for timely counteraction. Figure 3 shows the overall structure of the proposed fault tolerant mechanism. It consists of two parts: i) the fault detection module, which accepts as inputs the measurements of the linear and angular velocity of the SSMR and decides about the type of fault according to the method described in Section 3, and ii) the fault accommodation module, which accepts as inputs the type of fault as well as the measurements of the linear and angular velocity and recalculates accordingly the command inputs in order to compensate for the fault.

Figure 3. Fault Tolerance System Architecture.

When a fault occurs, the appropriate action is undertaken (e.g. maintenance, repair, reconfiguration, stop of operation) in such a way as to prevent system failures. At that level, the performance degradation that is acceptable is relative to the minimum requirements that ensure system functionality. There is always the case that the malfunction may cause a hazard for the process or the environment, and then a decision for stopping the operation is unavoidable.

In this work, we propose a fault accommodation technique which is employed when either the left or the right set of tires becomes flat during the operation of a SSMR. It is obvious that when a flat tire fault occurs, the total nominal radius RNOM (rim and tire) of the faulty wheel changes to RF, where RF < RNOM. The proposed fault accommodation strategy relies on the online estimation of the new radius RF, in order to correct the commanded rotational speeds of the faulty wheel and compensate for the fault, which otherwise will inevitably lead the vehicle to diverge from its nominal course.

As explained in [11], the kinematic model of the SSMR can be considered equivalent to the unicycle differential drive one, mainly due to the existence of a single motor drive and a transmission belt for each set of wheels (left and right), which impose the same rotational speed on each set of wheels. According to this assumption we can safely assume that:

    [ux, ω]^T = (1/2c) [c, c; −1, 1] [uL, uR]^T    (35)

where uL = υL = ωL RL and uR = υR = ωR RR are the equivalent linear velocities of the left and right wheels respectively, in relation to the rotational speeds and radii. If we consider that the fault will occur only at one set of the wheels (left or right), we may consider only the angular velocity equation for the accommodation. Thus, only the angular velocity measurement is needed. The fault accommodation is based on the online estimation of the new radius RF employing a Recursive Least Squares algorithm. More specifically, we may consider the following linear equation for the measurement of the mobile robot's body angular velocity, in case a left side fault occurs:

    ω̂k − (0.5/c) ωR RR = Hk R̂Fk + vk
    Hk = HLk = −(0.5/c) ωL    (36)
    vk ∼ (0, Rk)

while in the case of a right side fault:

    ω̂k + (0.5/c) ωL RL = Hk R̂Fk + vk
    Hk = HRk = (0.5/c) ωR    (37)
    vk ∼ (0, Rk)

where vk is zero-mean measurement noise with variance Rk.

Having defined the measurement model of the robot angular velocity in the body frame, we proceed to the online estimation of the faulty wheel radius employing the following Recursive Least Squares approximation algorithm:

1. Initialize the estimator:

    R̂F0 = E(RF) = RL|R
    P0 = E((RF − R̂F0)²)    (38)

where RL|R is the nominal radius of the left or right wheel set.

2. Obtain a new measurement ωk, assuming that it is given by equation (36) or (37).

3. Update the estimate R̂Fk and the covariance Pk of the estimation error sequentially according to:

    Kk = Pk−1 Hk^T (Hk Pk−1 Hk^T + Rk)^(−1)
    R̂Fk = R̂Fk−1 + Kk (ωk − Hk R̂Fk−1)    (39)
    Pk = (I − Kk Hk) Pk−1

where ωk is the actual measurement of the body angular velocity as delivered by the IMU sensor.

4. Using the estimated wheel radius R̂Fk we correct the commanded wheel angular velocity as follows:

    ωL_cor = (−2 ωk c + ωR RR) / R̂Fk    (40)
    ωR_cor = (2 ωk c + ωL RL) / R̂Fk    (41)

in case there is a left or a right wheel fault respectively.

5 Application Results

The proposed method has been implemented and tested experimentally on a Pioneer 3-AT mobile robot. All experiments have been performed indoors. We consider a faulty situation where the right wheel set is flat (front and rear wheels). We apply a command of ωL = ωR = 5 rad/s to both sets of wheels. In the nominal situation (no faults) the robot should move (almost) straight forward without any deviation. The robot starts from the origin of the inertial frame and moves for 2.5 m. The time interval dt between successive IMU measurements is 2.5 msec. The nominal radius of the wheels (proper inflation) is RL = RR = 0.115 m.

In the first experiment (Fig. 4), the fault accommodation algorithm is not enabled and, as we can observe from the trajectory of the vehicle, the SSMR significantly diverges from its nominal course to the right.

Figure 4. Robot's position while the right wheel set is flat (SSMR position along the X-axis vs. the Y-axis, in m).

The fault detection algorithm is enabled and, as we can see from Fig. 5, the fault was successfully detected by the proposed structural analysis algorithm.
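Steps 1-4 above amount to a scalar recursive least-squares filter. The following is a minimal sketch for the right-side fault model of equations (37) and (39); the geometry value c, the noise variance, and the faulty radius used here are illustrative assumptions, not the experimental data.

```python
class WheelRadiusRLS:
    """Scalar RLS estimator for the faulty wheel radius (cf. (38)-(39))."""

    def __init__(self, r_nominal, p0=1.0, meas_var=1e-4):
        self.r_hat = r_nominal    # step 1: initialize with the nominal radius
        self.p = p0               # initial estimation-error covariance P0
        self.meas_var = meas_var  # measurement noise variance R_k (assumed)

    def update(self, z, h):
        """One RLS step: z is the adjusted angular-velocity measurement
        (left-hand side of (36) or (37)), h the regressor H_k."""
        k = self.p * h / (h * self.p * h + self.meas_var)  # gain K_k
        self.r_hat += k * (z - h * self.r_hat)             # estimate update
        self.p *= (1.0 - k * h)                            # covariance update
        return self.r_hat

# Right-side flat tire: true radius drops from 0.115 m to 0.10 m (illustrative).
c, r_left, r_true = 0.2, 0.115, 0.10
w_l = w_r = 5.0
est = WheelRadiusRLS(r_nominal=0.115)
for _ in range(50):
    body_omega = (0.5 / c) * (w_r * r_true - w_l * r_left)  # measured rate under fault
    z = body_omega + (0.5 / c) * w_l * r_left               # left-hand side of (37)
    est.update(z, h=(0.5 / c) * w_r)
print(round(est.r_hat, 4))  # prints 0.1, the true faulty radius
```

With the estimate converged, the corrected command of equation (41) can then be computed from the same quantities; the convergence is fast here because the sketch uses noiseless measurements.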
Figure 5. Fault signal as the right wheel set is flat.

In the second experiment we impose the same control inputs (ωL = ωR = 5 rad/s) to the SSMR, but this time not only the fault detection but also the proposed fault accommodation algorithm is enabled. As we can see in Fig. 6, the online estimation algorithm quickly converges to the new radius of the faulty wheel set and consequently the fault accommodation algorithm provides modified inputs to the right wheel set (Fig. 7).

Figure 6. Online estimation of the faulty radius.

Figure 7. Recalculated input from the fault accommodation algorithm.

As we can observe from Fig. 8, the trajectory of the SSMR was successfully maintained in an almost straight line.

Figure 8. SSMR Corrected Planar Trajectory.

6 Conclusion

The notion of fault tolerant control for a 4-wheel skid steering mobile robot is an important problem to deal with, since the appearance of faults is inevitable in such systems. The most significant challenge arises from the complexity of the system. In this paper we have introduced the underlying concepts of our approach to fault tolerant control for mobile robots, focusing our attention mainly on control reconfiguration. Concerning fault diagnosis, a structural analysis based technique is used in order to generate residuals. We use the kinematic model of the mobile robot, which serves for the development of the structural model of the system. This technique provides the parity equations which can be used as residual generators, since the model based fault diagnosis approach is based on residuals. The advantage of this method is that it offers a feasible solution to the residual generation of nonlinear systems. The fault accommodation procedure targets the case where one of the two wheel tire sets becomes flat. The proposed accommodation method is based on an RLS approximation of the new faulty wheel radius; via this information a new control input is calculated in order to compensate for the fault.

The efficacy of the proposed method is demonstrated through an extensive experimental procedure using a Pioneer 3-AT mobile robot.

Acknowledgments

This research is implemented through the Operational Program "Education and Lifelong Learning" and is co-financed by the European Union (European Social Fund) and Greek national funds. The work is part of the research project entitled «DIAGNOR - Fault Diagnosis and Accommodation for Wheeled Mobile Robot» of the Act "Archimedes III - Strengthening Research Groups in TEI Lamia".
References

[1] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki, "Diagnosis and Fault-Tolerant Control", Springer-Verlag, Heidelberg, 2003.

[2] M. Blanke, H. Niemann, and T. Lorentzen, "Structural analysis – a case study of the Rømer satellite", in Proc. of IFAC Safeprocess 2003, Washington, DC, USA, 2003.

[3] M. Blanke, V. Cocquempot, R. Izadi-Zamanabadi, and M. Staroswiecki, "Residual generation for the ship benchmark using structural approach", in Proc. of Int. Conference on CONTROL'98, Swansea, UK, September 1998.

[4] S. X. Ding, "Model-based Fault Diagnosis Techniques: Design Schemes, Algorithms, and Tools", Springer-Verlag, Berlin, 2008.

[5] G. K. Fourlas, K. J. Kyriakopoulos, and N. J. Krikelis, "Fault Diagnosis of Hybrid Systems", in Proc. of the 2005 IEEE International Symposium on Intelligent Control / 13th IEEE Mediterranean Conference on Control and Automation, Limassol, Cyprus, 2005.

[6] G. K. Fourlas, "Theoretical Approach of Model Based Fault Diagnosis for a 4-Wheel Skid Steering Mobile Robot", in Proc. of the 21st IEEE Mediterranean Conference on Control and Automation (MED '13), Platanias-Chania, Crete, Greece, June 25-28, 2013.

[7] G. Ippoliti, S. Longhi, and A. Monteriù, "Model-based sensor fault detection system for a smart wheelchair", IFAC, 2005.

[8] R. Isermann, "Fault-Diagnosis Systems – An Introduction from Fault Detection to Fault Tolerance", Springer-Verlag, Berlin Heidelberg, 2006.

[9] B. Halder and N. Sarkar, "Robust Fault Detection of Robotic Systems: New Results and Experiments", in Proc. of the 2006 IEEE International Conference on Robotics and Automation, Orlando, Florida, USA, May 2006.

[10] Z. Kira, "Modeling Cross-Sensory and Sensorimotor Correlations to Detect and Localize Faults in Mobile Robots", in Proc. of the 2007 IEEE/RSJ International Conference on Intelligent Robots and Systems, San Diego, CA, USA, October 29 - November 2, 2007.

[11] K. Kozlowski and D. Pazderski, "Modeling and Control of a 4-Wheel Skid-Steering Mobile Robot", Int. J. Appl. Math. Comput. Sci., vol. 14, no. 4, pp. 477-496, 2004.

[12] Y. Morales, E. Takeuchi, and T. Tsubouchi, "Vehicle Localization in Outdoor Woodland Environments with Sensor Fault Detection", in Proc. of the 2008 IEEE International Conference on Robotics and Automation, Pasadena, CA, USA, May 19-23, 2008.

[13] A. Monteriù, P. Asthan, K. Valavanis, and S. Longhi, "Model-Based Sensor Fault Detection and Isolation System for Unmanned Ground Vehicles: Theoretical Aspects (Part I)", in Proc. of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, April 10-14, 2007.

[14] A. Monteriù, P. Asthan, K. Valavanis, and S. Longhi, "Model-Based Sensor Fault Detection and Isolation System for Unmanned Ground Vehicles: Experimental Validation (Part II)", in Proc. of the 2007 IEEE International Conference on Robotics and Automation, Roma, Italy, April 10-14, 2007.

[15] P. Sundvall and P. Jensfelt, "Fault detection for mobile robots using redundant positioning systems", in Proc. of the 2006 IEEE International Conference on Robotics and Automation, Orlando, Florida, USA, May 2006.

[16] C. Valdivieso and A. Cipriano, "Fault Detection and Isolation System Design for Omnidirectional Soccer-Playing Robots", in Proc. of the 2006 IEEE Conference on Computer Aided Control Systems Design, Munich, Germany, October 4-6, 2006.

[17] R. Izadi-Zamanabadi, "Structural Analysis Approach to Fault Diagnosis with Application to Fixed-wing Aircraft Motion", in Proc. of the American Control Conference, USA, 2002.

[18] D. Zhuo-hua, Cai Zi-xing, and Yu Jin-xia, "Fault Diagnosis and Fault Tolerant Control for Wheeled Mobile Robots under Unknown Environments: A Survey", in Proc. of the 2005 IEEE International Conference on Robotics and Automation, Barcelona, Spain, April 2005.

[19] S. Zaman, G. Steinbauer, J. Maurer, P. Lepej, and S. Uran, "An integrated model-based diagnosis and repair architecture for ROS-based robot systems", in Proc. of the IEEE International Conference on Robotics and Automation (ICRA), Karlsruhe, Germany, 2013.

[20] D. Portugal and R. P. Rocha, "Scalable, Fault-Tolerant and Distributed Multi-Robot Patrol in Real World Environments", in Proc. of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Tokyo, Japan, November 3-7, 2013.
Data-Driven Monitoring of Cyber-Physical Systems
Leveraging on Big Data and the Internet-of-Things for Diagnosis and Control
Oliver Niggemann1,3 , Gautam Biswas2 , John S. Kinnebrew2 , Hamed Khorasgani2 ,
Sören Volgmann1 and Andreas Bunte3
1 Fraunhofer Application Center Industrial Automation, Lemgo, Germany
e-mail: {oliver.niggemann, soeren.volgmann}@iosb-ina.fraunhofer.de
2 Vanderbilt University and Institute for Software Integrated Systems, Nashville, TN, USA
e-mail: {john.s.kinnebrew, hamed.g.khorasgani, gautam.biswas}@vanderbilt.edu
3 Institute Industrial IT, Lemgo, Germany
e-mail: {andreas.bunte}@hs-owl.de
Abstract

The majority of projects dealing with monitoring and diagnosis of Cyber-Physical Systems (CPSs) relies on models created by human experts. But these models are rarely available, are hard to verify and maintain, and are often incomplete. Data-driven approaches are a promising alternative: they leverage the large amount of data that is collected nowadays in CPSs; this data is then used to learn the necessary models automatically. For this, several challenges have to be tackled, such as real-time data acquisition and storage solutions, data analysis and machine learning algorithms, task-specific human-machine interfaces (HMIs), and feedback/control mechanisms. In this paper, we propose a cognitive reference architecture which addresses these challenges. This reference architecture should both ease the reuse of algorithms and support scientific discussion by providing a comparison schema. Use cases from different industries are outlined and support the correctness of the architecture.

1 Motivation

The increasing complexity and the distributed nature of technical systems (e.g. power generation plants, manufacturing processes, aircraft and automobiles) have provided traction for important research agendas, such as Cyber-Physical Systems (CPSs) [1; 2], the US initiative on the "Industrial Internet" [3] and its German counterpart "Industrie 4.0" [4]. In these agendas, a major focus is on self-monitoring, self-diagnosis and adaptivity to maintain both operability and safety, while also taking into account humans-in-the-loop for system operation and decision making. Typical goals of such self-diagnosis approaches are the detection and isolation of faults and anomalies, identifying and analyzing the effects of degradation and wear, providing fault-adaptive control, and optimizing energy consumption [5; 6].

So far, the majority of projects and papers for analysis and diagnosis has relied on manually created diagnosis models of the system's physics and operations [6; 7; 8]: if a drive is used, this drive is modeled; if a reactor is installed, the associated chemical and physical processes are modeled. However, the last 20 years have clearly shown that such models are rarely available for complex CPSs; when they do exist, they are often incomplete and sometimes inaccurate, and it is hard to maintain the effectiveness of these models during a system's life-cycle.

A promising alternative is the use of data-driven approaches, where monitoring and diagnosis knowledge can be learned by observing and analyzing system behavior. Such approaches have only recently become possible: CPSs now collect and communicate large amounts of data (see Big Data [9]) via standardized interfaces, giving rise to what is now called the Internet of Things [10]. This large amount of data can be exploited for the purpose of detecting and analyzing anomalous situations and faults in these large systems: the vision is developing CPSs that can observe their own behavior, recognize unusual situations during operations, inform experts, who can then update operations procedures, and also inform operators, who use this information to modify operations or plan for repair and maintenance.

In this paper, we take on the challenges of proposing a common data-driven framework to support monitoring, anomaly detection, prognosis (degradation modeling), diagnosis, and control. We discuss the challenges for developing such a framework, and then discuss case studies that demonstrate some initial steps toward data-driven CPSs.

2 Challenges

In order to implement data-driven solutions for the monitoring, diagnosis, and control of CPSs, a variety of challenges must be overcome to enable the learning pathways illustrated in Figure 1:

Figure 1: Challenges for the analysis of CPSs.

Data Acquisition: All data collected from distributed CPSs (e.g. sensors, actuators, software logs, and business data) must meet real-time requirements, including time synchronization and spatial labeling when relevant. Often sensors and actuators operate at different rates, so data alignment, especially for high-velocity data, becomes an issue. Furthermore, data must be annotated semantically to allow for later data analysis.

Data Storage, Curation, and Preprocessing: Data will be stored and preprocessed in a distributed way. Environmental factors and the actual system configuration (e.g., for the current product in a production system) must also be stored. Depending on the application, a relational database format, or increasingly distributed NoSQL technologies [11], may need to be adopted, so that the right subsets of data can be retrieved for different analyses. Real-world data can also be noisy, partially corrupted, and have missing values. All of these need to be accommodated in the curation, storage, and pre-processing applications.

Data Analysis and Machine Learning: Data must be analyzed to derive patterns and abstract the data into condensed, usable knowledge. For example, machine learning algorithms can generate models of normal system behavior in order to detect anomalous patterns in the data [12]. Other algorithms can be employed to identify root causes of observed problems or anomalies. The choice and design of appropriate analyses and algorithms must consider factors like the ability to handle large volumes and sometimes high velocities of heterogeneous data. At a minimum, this generally requires machine learning, data mining, and other analysis algorithms that can be executed in parallel, e.g., using the Spark [13], Hadoop [14], and MapReduce [15] architectures. In some cases, this may be essential to meet real-time analysis requirements.

Task-specific Human-Machine-Interfaces: Tasks such as condition monitoring, energy management, predictive maintenance or diagnosis require specific user interfaces [16]. One set of interfaces may be tailored for offline analysis to allow experts to interact with the system. For example, experts may employ information from data mining and analytics to derive new knowledge that is beneficial to the future operations of the system. Another set of interfaces would be appropriate for system operators and maintenance personnel. For example, appropriate operator interfaces would be tailored to provide analysis results in interpretable and actionable forms, so that operators can use them to drive decisions when managing a current mission or task, as well as to determine future maintenance and repair.

Feedback Mechanisms and Control: As a reaction to recognized patterns in the data or to identified problems, the user may initiate actions such as a reconfiguration of the plant or an interruption of production for the purpose of maintenance. In some cases, the system may react without user interaction; in this case, the user is only informed.

3 Solutions

As Section 4 will show, the challenges from Section 2 reappear in the majority of CPS examples. While details, such as the machine learning algorithms employed or the nature of data and data storage formats, can vary, the primary steps are about the same. Most CPS solutions re-implement all of these steps and even employ different solution strategies, raising the overall effort, preventing any reuse of hardware/software and impeding a comparison between solutions.

To achieve better standardization, efficiency, and repeatability, we suggest a generic cognitive reference architecture for the analysis of CPSs. Please note that this is a pure reference architecture which does not constrain later implementations or the introduction of application-specific methods.

Figure 2 shows its main components:

Figure 2: A cognitive architecture as a solution for the analysis of CPSs.

Big Data Platform (I/F 1 & 2): This layer receives all relevant system data, e.g., configuration information as well as raw data from sensors and actuators. This is done by means of domain-dependent, often proprietary interfaces, here called interface 1 (I/F 1). This layer then integrates, often in real-time, all of the data, time-synchronizes it and annotates it with meta-data that will support later analysis and interpretation. For example, sensor meta-data may consist of the sensor type, its position in the system and its precision. This data is provided via I/F 2, which, therefore, must comprise the data itself and also the meta-data (i.e., the semantics). A possible implementation approach for I/F 2 may be the mapping into and use of existing Big Data platforms, such as Spark or Hadoop, for storing the data, and the Data Distribution Service (DDS) for acquiring the data (and meta-data).

Learning Algorithms (I/F 2 & 3): This layer receives all
data via I/F 2. Since I/F 2 also comprises meta-data, the machine learning and diagnosis algorithms need not be implemented specifically for a domain but may adapt themselves to the data provided. In this layer, unusual patterns in the data (used for anomaly detection), degradation effects (used for condition monitoring) and system predictions (used for predictive maintenance) are computed and provided via I/F 3. Given the rapid changes in data analysis needs and capabilities, this layer may be a toolbox of algorithms where new algorithms can be added by means of plug-and-play mechanisms. I/F 3 might again be implemented using DDS.

Conceptual Layer (I/F 3 & 4): The information provided via I/F 3 must be interpreted according to the current task at hand, e.g. computing the health state of the system. Therefore, the provided information about unusual patterns, degradation effects and predictions is combined with domain knowledge to identify faults and their causes, and to rate them according to the urgency of repair. A semantic notation is added to the information, e.g. the time for the next maintenance or a repair instruction, which is provided at I/F 4 in a human-understandable manner. From a computer science perspective, this layer provides reasoning capabilities on a symbolic or conceptual level and adds a semantic context to the results.

Task-Specific HMI (I/F 4 & 5): The user is at the center of the architecture presented here and, therefore, requires task-, context- and role-specific Human-Machine-Interfaces (HMIs). This HMI uses I/F 4 to get all needed analysis results and presents them to the user. Adaptive interfaces, rather than always showing the results of the same set of analyses, could allow a wider range of information to be provided, while maintaining efficiency and preventing information overload. Beyond obvious dynamic capabilities like alerts for detected problems or anomalies, the interfaces could further adapt the information displayed to be more relevant to the current user context (e.g. the user's location within a production plant, recognition of tasks the user may be engaged in, observed patterns of the user's previous information-seeking behavior, and knowledge of the user's technical background). If the user decides to influence the system (e.g. shutdown of a subsystem or adaptation of the system behavior), I/F 5 is used to communicate this decision to the conceptual layer. Again, I/F 4 and I/F 5 might be implemented using DDS.

Conceptual Layer (I/F 5 & 6): The user's decisions are received via I/F 5. The conceptual layer uses its knowledge to identify the actions needed to carry out these decisions. For example, a decision to decrease the machine's cycle time by 10% could lead to actions such as decreasing the robot speed by 10% and the conveyor speed by 5%, or to the shutdown of a subsystem. These actions are communicated via I/F 6 to the adaption layer.

Adaption (I/F 6 & 7): This layer receives system adaption commands on the conceptual level via I/F 6, which again might be based on DDS. Examples are the decrease of robot speed by 10% or the shutdown of a subsystem. The adaption layer takes these commands and computes, in real-time, the corresponding changes to the control system. For example, a subsystem shutdown might require a specific network signal, or a machine's timing may be changed by adapting parameters of the control algorithms, again by means of network signals. I/F 7 therefore uses domain-dependent interfaces.

4 Case Studies

We present a set of case studies that cover the manufacturing and process industries, as well as complex CPSs such as aircraft.

4.1 Manufacturing Industry

The modeling and learning of discrete timing behavior for the manufacturing industry (e.g., the automotive industry) is a new field of research. Due to their intuitive interpretation, Timed Automata are well-suited to model the timing behavior of these systems. Several algorithms have been introduced to learn such Timed Automata, e.g. RTI+ [17] and BUTLA [18]. Please note that the expert still has to provide structural information about the system (e.g. asynchronous subsystems) and that only the temporal behavior is learned.

Figure 3: Learned Timed Automata for a manufacturing plant (two module automata whose transitions, such as "Aspirator on [25…2500]", "Muscle on [8…34]", "Silo empty" and "Conveyor off [8…25]", carry learned timing intervals).

The data acquisition for this solution (I/F 1 in Figure 2) has been implemented by directly capturing Profinet signals, including IEEE 1588 time-synchronization. The data is offered via OPC UA (I/F 2). On the learning layer, timed automata are learned from historical data and compared to the observed behavior. Both the sequential behavior of the observed events and the timing behavior are checked; anomalies are signaled via I/F 3. On the conceptual layer it is decided whether an anomaly is relevant. Finally, a graphical user interface is connected to the conceptual layer via OPC UA (I/F 4).

Figure 3 shows learned automata for a manufacturing plant: the models correspond to modules of the plant, and transitions are triggered by control signals and annotated with a learned timing interval.

4.2 Energy Analysis in Process Industry

Analyzing the energy consumption in production plants poses some special challenges: unlike the discrete systems described in Section 4.1, continuous signals such as the energy consumption must also be learned and analyzed. But the discrete signals must be taken into consideration as well, because continuous signals can only be interpreted with respect to the current system status, e.g. it is crucial to know whether a valve is open or whether a robot is turned on. And the system status is usually defined by the history of discrete control signals.
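Section 4.2's point that continuous signals can only be interpreted against the discrete system status presumes that the streams share a common timeline, even though they are sampled at different rates. A minimal, hypothetical sketch of such an alignment (zero-order hold; all names are ours, not part of any of the platforms mentioned):

```python
from bisect import bisect_right

def align_streams(streams, period, t_end):
    """Resample timestamped streams onto one common grid (zero-order hold).

    streams: dict name -> sorted list of (timestamp, value) pairs.
    period:  sampling period of the common output grid.
    t_end:   end of the time window to cover.

    Returns a list of dicts, one per grid tick, each holding the most
    recent value of every stream (None before a stream's first sample).
    """
    grid = []
    t = 0.0
    while t <= t_end:
        row = {"t": round(t, 9)}
        for name, samples in streams.items():
            times = [s[0] for s in samples]
            i = bisect_right(times, t)          # last sample with ts <= t
            row[name] = samples[i - 1][1] if i else None
        grid.append(row)
        t += period
    return grid
```

The zero-order hold matches the observation above that the system status is defined by the history of discrete control signals: a valve stays "open" on every grid tick until a closing event is recorded.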
Figure 4: A learned hybrid automaton modeling a pump.

In [19], an energy anomaly detection system is described which analyzes three production plants. Ethercat and Profinet are used for I/F 1 and OPC UA for I/F 2. The collected data is then condensed on the learning layer into hybrid timed automata. Also on this layer, the current energy consumption is compared to the energy prediction. Anomalies in the continuous variables are signaled to the user via mobile platforms using web services (I/F 3 and 4).

In Figure 4, a pump is modeled by means of such automata using the flow rate and switching signals. The three states S0 to S2 separate the continuous function into three linear pieces which can then be learned automatically. Figure 5 shows a typical learned energy consumption (here for bulk-good production).

Figure 5: A measured (black line) and a learned power consumption (red line).

4.3 Big Data Analysis in Manufacturing Systems

Analyzing historical process data over the whole production cycle requires new architectures and platforms for handling the enormous volume, variety and velocity of the data. Such analysis pushes classical data acquisition and storage to its limits, i.e. big data platforms are needed.

In the assembly line of the SmartFactoryOWL, a small factory used for production and research, a big data platform has been established to acquire, store and visualize the data from the production cycles. In Figure 6 the architecture of the big data platform is depicted.

Figure 6: Data Analysis Platform in Manufacturing.

The CPS is connected through OPC UA (I/F 1 in Figure 2) with a Hadoop ecosystem. Hadoop itself is a software framework for scalable distributed computing. The process data is stored in a non-relational database (HBase) which is based on a distributed file system (HDFS). On top of HBase, the time-series database OpenTSDB is used as an interface to explore and analyze the data (I/F 2 in Figure 2). Through this database it is possible to compute simple statistics such as mean values, sums or differences, which is usually not possible within non-relational data stores.

Using the interfaces of OpenTSDB or Hadoop, it becomes possible to analyze the data directly on the storage system. Hence, the volume of a historical dataset need not be loaded into a single computer system; instead, the algorithms can work distributively on the data. A web interface can be used to visualize the data as well as the computed results; in Figure 6, Grafana is used for data visualization. In the SmartFactoryOWL this big data platform is currently being connected to the application scenarios from Sections 4.1 and 4.2.

4.4 Anomaly Detection in Aircraft Flight Data

Fault detection and isolation schemes are designed to detect the onset of adverse events during operations of complex systems, such as aircraft and industrial processes. In other work, we have discussed approaches using machine learning classifier techniques to improve the diagnostic accuracy of the online reasoner on board the aircraft [20]. In this paper, we discuss an anomaly detection method to find previously undetected faults in aircraft systems [21].

The flight data used for improving the detection of existing faults and discovering new faults was provided by Honeywell Aerospace and recorded from a former regional airline that operated a fleet of 4-engine aircraft, primarily in the Midwest region of the United States. Each plane in the fleet flew approximately 5 flights a day, and data from about 37 aircraft was collected over a five-year period. This produced over 60,000 flights. Since the airline was a regional carrier, most flight durations were between 30 and 90 minutes. For each flight, 182 features were recorded at sample rates that varied from 1 Hz to 16 Hz. Overall this produced about 0.7 TB of data.

Situations may occur during flight operations where the aircraft operates in previously unknown modes that could be attributed to the equipment, the human operators, or environmental conditions (e.g., the weather). In such situations, data-driven anomaly detection methods [12], i.e., finding patterns in the operations data of the system that were not expected before, can be applied. Sometimes, anomalies may represent truly aberrant, undesirable and faulty behavior; however, in other situations they may represent behaviors that are just unexpected. We have developed unsupervised learning or clustering methods for off-line detection of anomalous situations. Once detected and analyzed, relevant information is presented to human experts and mission controllers to interpret and classify the anomalies.

Figure 7 illustrates our approach. We started with curated raw flight data (layer "Big Data Platform" in Figure 2), transforming the time-series data associated with the different flight parameters to a compressed vector form using wavelet transforms. The next step built a dissimilarity matrix of pairwise flight segments using the Euclidean distance measure; the pairwise between-flight distances were then used to run a 'complete link' hierarchical clustering algorithm [22] (layer "Learning" in Figure 2). Run on the flight data, the algorithm produced a number of large clusters that we considered to represent nominal flights, and a number of smaller clusters and outlier flights that we initially labeled as anomalous. By studying the feature-value differences between the larger nominal and smaller anomalous clusters with the help of domain experts, we were able to interpret and explain the anomalous nature ("Conceptual Layer" in Figure 2).

These anomalies or faults represented situations that the experts had not considered before; therefore, this unsupervised or semi-supervised data-driven approach provided a mechanism for learning new knowledge about unanticipated system behaviors. For example, when analyzing the aircraft data, we found a number of anomalous clusters. One of them turned out to be situations where one of the four engines of the aircraft was inoperative. On further study of additional features, the experts concluded that these were test flights conducted to test aspects of the aircraft; therefore, they represented known situations and not an interesting anomaly. A second group of flights was interpreted to be take-offs where the engine power was set much higher than in most flights under the same take-off conditions. Further analysis of environmental features related to this set of take-offs revealed that these were take-offs from a high-altitude airport at 7900 feet above sea level.

A third cluster provided a more interesting situation. The experts, when checking the features that had significantly different values from the nominal flights, realized that the auto throttle disengaged in the middle of the aircraft's climb trajectory. The automatic throttle is designed to maintain either constant speed during takeoff or constant thrust for other modes of flight. This was an unusual situation in which the auto throttle switched from maintaining speed for a takeoff to a setting that applied constant thrust, implying that the aircraft was on the verge of a stall. This situation was verified by the flight-path acceleration sensor shown in Figure 7. By further analysis, the experts determined that in such situations the automatic throttle would switch to a possibly lower thrust setting to level the aircraft and compensate for the loss in velocity. By examining the engine parameters, the experts verified that all the engines responded in an appropriate fashion to this throttle command. Whereas this analysis did not lead to a definitive conclusion other than the fact that the auto throttle, and therefore the aircraft equipment, responded correctly, the experts determined that further analysis was required to answer the question "why did the aircraft accelerate in such a fashion and come so close to a stall condition?". One initial hypothesis to explain these situations was pilot error.

4.5 Reliability and Fault Tolerant Control

Most complex CPSs are safety-critical systems that operate with humans-in-the-loop. In addition to equipment degradation and faults, humans can also introduce erroneous decisions, which become a new source of failure in the system. Figure 8 represents possible faults and cyber-attacks that can occur in a CPS.

There are several model-based fault tolerant control strategies for dynamic systems in the literature (see for example [23] and [24]). Research has also been conducted to address network security and robust network control problems (see for example [25] and [26]). However, these methods need mathematical models of the system, which may not exist for large-scale complex systems. Therefore, data-driven control [27] and data-driven fault tolerant control [28] have become important research topics in recent years. For CPSs, there are more aspects of the problem that need to be considered: as shown in Figure 8, there are many sources of failure in these systems.

We propose a hybrid approach that uses an abstract model of the complex system and utilizes the data to ensure the compatibility between the model and the complex system. Data abstraction and machine learning techniques are employed to extract patterns between different control configurations and system outputs by computing the correlation between control signals and the outputs of the physical subsystems. The highly correlated subsystems (layer "Learning" in Figure 2) become candidates for further study of the effects of failure and degradation at the boundary of these interacting subsystems. For complex systems, all possible interactions and their consequences are hard to pre-determine, and data-driven approaches help fill this gap in knowledge to support more informed decision-making and control. A case-based reasoning module can be designed to provide input on past successes and failed opportunities, which can then be translated by human experts into operational monitoring, fault diagnosis, and control situations ("Conceptual Layer" in Figure 2). Some of the control paradigms that govern appropriate control configurations, such as modifying the sequence of mission tasks, switching between different objectives, or changing the controller parameters (layer "Adaptation" in Figure 2), are being studied in a number of labs including ours [29].

Example: Fault Tolerant Control of a Fuel Transfer System. The fuel system supplies fuel to the aircraft engines. Each individual mission will have its own set of requirements; however, common requirements such as maintaining the aircraft Center of Gravity (CG), safety, and system reliability are always critical. A set of sensors is included in the system to measure different system variables such as the fuel quantity contained in each tank, the engine fuel flow rates, the boost pump pressures, the position of the valves, etc.

There are several failure modes, such as the total loss of or degradation in the electrical pumps, or a leakage in the tanks or connecting pipes of the system. Using the data and the abstract model, we can detect and isolate the fault and estimate its parameters. Then, based on the type of fault and its severity, the system reconfiguration unit chooses the proper control scenario from the control library. For example, in a normal situation the transfer pumps and valves are controlled to maintain a transfer sequence that keeps the aircraft center of gravity within limits. This control includes maintaining a balance between the left and right sides of the aircraft. When there
Proceedings of the 26th International Workshop on Principles of Diagnosis
[Figure: pipeline diagram. Raw Flight Data, Wavelet Transform, Flight Dissimilarity Matrix (Dii, Dij, ..., Din), Hierarchical Clustering; the current flight is labeled anomalous based on max(dAN).]
Figure 7: Data Driven Anomaly Detection Approach for Aircraft Flights
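The pipeline in Figure 7 can be sketched end-to-end. The following toy example is our own illustration, not the authors' implementation: short feature vectors stand in for the wavelet coefficients, and the distance threshold is an assumption. It builds a flight dissimilarity matrix, clusters it with single-linkage agglomerative clustering, and flags flights outside the largest ("nominal") cluster as candidate anomalies.

```python
# Toy sketch of the data-driven anomaly detection pipeline of Figure 7:
# per-flight feature vectors (stand-ins for wavelet coefficients) ->
# pairwise dissimilarity matrix -> agglomerative clustering ->
# flights outside the nominal cluster are flagged as anomalous.
import math

def dissimilarity_matrix(flights):
    """Pairwise Euclidean distances between flight feature vectors."""
    n = len(flights)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = math.dist(flights[i], flights[j])
            d[i][j] = d[j][i] = dist
    return d

def single_linkage_clusters(d, threshold):
    """Merge clusters while the closest pair of members is within threshold."""
    clusters = [{i} for i in range(len(d))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(d[i][j] for i in clusters[a] for j in clusters[b])
                if link <= threshold:
                    clusters[a] |= clusters.pop(b)
                    merged = True
                    break
            if merged:
                break
    return clusters

def anomalous_flights(flights, threshold=1.0):
    """Flights outside the largest ('nominal') cluster are candidate anomalies."""
    d = dissimilarity_matrix(flights)
    clusters = single_linkage_clusters(d, threshold)
    nominal = max(clusters, key=len)
    return sorted(i for c in clusters if c is not nominal for i in c)

# Four near-nominal flights and one outlier (e.g. a high-altitude take-off).
flights = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.0, 2.2], [5.0, 9.0]]
print(anomalous_flights(flights))  # -> [4]
```

In practice the dissimilarity would come from wavelet-domain comparisons of multivariate flight curves rather than raw Euclidean distance, and the anomaly threshold would be calibrated against the spread of the nominal cluster, as the max(dAN) box in the figure suggests.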
[Figure: fault classes in a CPS (cyber-attacks, human error, actuator faults, sensor faults, communication errors and noise, system reconfiguration) affecting the operator, controller, controller library and parameters, actuators, physical system, and sensors, all connected over a communication network.]
Figure 8: Possible faults in a CPS.
is a small leak, the system can normally tolerate it, depending on where the leak is, but the leak usually grows over time. Therefore, we need to estimate the leakage rate and reconfigure the system to move the fuel from the tank or close the pipe before a critical situation arises.

5 Conclusions

Data-driven approaches to the analysis and diagnosis of Cyber-Physical Systems (CPSs) are always inferior to classical model-based approaches, where models are created manually by experts: experts have background knowledge which cannot be learned from data, and experts automatically think about a larger set of system scenarios than can be observed during a system's normal lifetime.
So the question is not whether data-driven or expert-driven approaches are superior. The question is rather which kind of models we can realistically expect to exist in real-world applications—and which kind of models must therefore be learned automatically. This becomes especially important in the context of CPSs, since these systems adapt themselves to their environment and therefore show a changing behavior, i.e., the models would also have to be adapted frequently.
In Sections 4.1 and 4.2, structural information about the plant is imported from the engineering chain and the temporal behavior is learned in the form of timed automata. In Section 4.5, an abstract system model describing the input/output structure and the main failure types is provided, and again the behavior is learned. These approaches are typical because in most applications structural information can be gained from earlier engineering phases, while behavior models hardly exist and are almost never validated against the real system.
Looking at the learning phase, the first thing to notice is that all described approaches work and deliver good results: for CPSs, data-driven approaches have moved into the focus of research and industry. And they are well suited to CPSs: they adjust automatically to new system configurations, they do not need manual engineering effort, and they make use of the now available large number of data signals—connectivity being a typical feature of CPSs.
Another common denominator of the described application examples is that the focus is on anomaly detection rather than on root cause analysis: for data-driven approaches it is easier to learn a model of the normal behavior than to learn erroneous behavior. It is also typical that the only root cause analysis uses a case-based approach (Section 4.5), case-based approaches being suitable for data-driven solutions to diagnosis.
Finally, the examples show that the proposed cognitive architecture (Figure 2) matches the given examples:
Big Data Platform: Only a few examples (e.g. Section 4.3) make use of explicit big data platforms; so far, solutions are often proprietary. But with the growing size of the data involved, new platforms for storing and processing the data are needed.
Learning: All examples employ machine learning technologies—with a clear focus on unsupervised learning techniques which require no a-priori knowledge, such as clustering (Section 4.4) or automata identification (Sections 4.1, 4.2).
Conceptual Layer: In all examples, the learned models are evaluated on a conceptual or symbolic level: in Section 4.4, clusters are compared to new observations and data-cluster
distances are used for decision making. In Sections 4.1 and 4.2, model predictions are compared to observations. And again, deviations are decided on by a conceptual layer.
Task-Specific HMI: None of the given examples works completely automatically; in all cases the user is involved in the decision making.
Adaptation: In most cases, reactions to detected problems are up to the expert. The use case from Section 4.5 is an example of an automatic reaction and of the usage of analysis results for the control mechanism.
Using such a cognitive architecture would bring several benefits to the community: First of all, algorithms and technologies in the different layers can be changed quickly and can be re-used; e.g., learning algorithms from one application field can be put on top of different big data platforms. Furthermore, currently most existing approaches mix the different layers, making the comparison of approaches to the analysis of CPSs difficult. Finally, such an architecture helps to clearly identify open issues for the development of smart monitoring systems.

Acknowledgments: The work was partly supported by the German Federal Ministry of Education and Research (BMBF) under the project "Semantics4Automation" (funding code: 03FH020I3) and under the project "Analyse großer Datenmengen in Verarbeitungsprozessen (AGATA)" (funding code: 01IS14008 A-F), and by NASA NRA NNL09AA08B from the Aviation Safety program. We also acknowledge the contributions of Daniel Mack, Dinkar Mylaraswamy, and Raj Bharadwaj on the aircraft fault diagnosis work.

References

[1] E.A. Lee. Cyber physical systems: Design challenges. In 2008 11th IEEE International Symposium on Object Oriented Real-Time Distributed Computing (ISORC), pages 363–369, 2008.
[2] Ragunathan (Raj) Rajkumar, Insup Lee, Lui Sha, and John Stankovic. Cyber-physical systems: The next computing revolution. In Proceedings of the 47th Design Automation Conference, DAC '10, pages 731–736, New York, NY, USA, 2010. ACM.
[3] Peter C. Evans and Marco Annunziata. Industrial internet: Pushing the boundaries of minds and machines. Technical report, GE, 2012.
[4] Promotorengruppe Kommunikation. Im Fokus: Das Industrieprojekt Industrie 4.0, Handlungsempfehlungen zur Umsetzung. Forschungsunion Wirtschaft-Wissenschaft, March 2013.
[5] L. Christiansen, A. Fay, B. Opgenoorth, and J. Neidig. Improved diagnosis by combining structural and process knowledge. In 2011 IEEE 16th Conference on Emerging Technologies and Factory Automation (ETFA), Sept 2011.
[6] Rolf Isermann. Model-based fault detection and diagnosis: Status and applications. In 16th IFAC Symposium on Automatic Control in Aerospace, St. Petersburg, Russia, 2004.
[7] Johan de Kleer, Bill Janssen, Daniel G. Bobrow, Tolga Kurtoglu, Bhaskar Saha, Nicholas R. Moore, and Saravan Sutharshana. Fault augmented Modelica models. In The 24th International Workshop on Principles of Diagnosis, pages 71–78, 2013.
[8] D. Klar, M. Huhn, and J. Gruhser. Symptom propagation and transformation analysis: A pragmatic model for system-level diagnosis of large automation systems. In 2011 IEEE 16th Conference on Emerging Technologies and Factory Automation (ETFA), pages 1–9, Sept 2011.
[9] GE. The rise of big data: Leveraging large time-series data sets to drive innovation, competitiveness and growth; capitalizing on the big data opportunity. Technical report, General Electric Intelligent Platforms, 2012.
[10] A. Katasonov, O. Kaykova, O. Khriyenko, S. Nikitin, and V. Terziyan. Smart semantic middleware for the internet of things. In 5th International Conference on Informatics in Control, Automation and Robotics (ICINCO), 2008.
[11] Michael Stonebraker. SQL databases v. NoSQL databases. Communications of the ACM, 53(4):10–11, 2010.
[12] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–72, Sept 2009.
[13] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, page 10, June 2010.
[14] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The Hadoop distributed file system. In Proceedings of the 26th IEEE Symposium on Mass Storage Systems and Technologies (MSST), pages 1–10, May 2010.
[15] M. Jayasree. Data mining: Exploring big data using Hadoop and MapReduce. International Journal of Engineering Sciences Research (IJESR), 4(1), 2013.
[16] Friedhelm Nachreiner, Peter Nickel, and Inga Meyer. Human factors in process control systems: The design of human–machine interfaces. Safety Science, 44(1):5–26, 2006.
[17] Sicco Verwer. Efficient Identification of Timed Automata: Theory and Practice. PhD thesis, Delft University of Technology, 2010.
[18] Oliver Niggemann, Benno Stein, Asmir Vodenčarević, Alexander Maier, and Hans Kleine Büning. Learning behavior models for hybrid timed systems. In Twenty-Sixth Conference on Artificial Intelligence (AAAI-12), pages 1083–1090, Toronto, Ontario, Canada, 2012.
[19] Bjoern Kroll, David Schaffranek, Sebastian Schriegel, and Oliver Niggemann. System modeling based on machine learning for anomaly detection and predictive maintenance in industrial plants. In 19th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Sep 2014.
[20] D.L.C. Mack, G. Biswas, X. Koutsoukos, and D. Mylaraswamy. Learning Bayesian network structures to augment aircraft diagnostic reference models. IEEE Transactions on Automation Science and Engineering, 17:447–474, 2015. To appear.
[21] Daniel LC Mack. Anomaly Detection from Complex
Temporal Spaces in Large Data. PhD thesis, Vander-
bilt University, Nashville, TN. USA, 2013.
[22] Stephen C Johnson. Hierarchical clustering schemes.
Psychometrika, 32(3):241–254, 1967.
[23] Jiang Jin. Fault tolerant control systems - an intro-
ductory overview. Acta Automatica Sinica, 31(1):161–
174, 2005.
[24] M. Blanke, M. Kinnaert, J. Lunze, and
M. Staroswiecki. Diagnosis and fault-tolerant
control. Springer-Verlag, Sep 2003.
[25] L. Schenato, B. Sinopoli, M. Franceschetti, K. Poolla,
and S. S. Sastry. Foundations of control and estimation
over lossy networks. In Proceedings of the IEEE,
volume 95, pages 163–187, Jan 2007.
[26] B. Schneier. Security monitoring: Network security
for the 21st century. In Computers Security, 2001.
[27] Zhong-Sheng Hou and Zhuo Wang. From model-
based control to data-driven control: Survey, classi-
fication and perspective. Information Sciences, 235:3–
35, 2013.
[28] Hong Wang, Tian-You Chai, Jin-Liang Ding, and
Martin Brown. Data driven fault diagnosis and fault
tolerant control: some advances and possible new
directions. Acta Automatica Sinica, 25(6):739–747,
2009.
[29] Z. S. Hou and J. X. Xu. On data-driven control theory:
the state of the art and perspective. Acta Automatica
Sinica, 35:650–667, 2009.
Diagnosing Advanced Persistent Threats: A Position Paper
Rui Abreu and Danny Bobrow and Hoda Eldardiry and Alexander Feldman and
John Hanley and Tomonori Honda and Johan de Kleer and Alexandre Perez
Palo Alto Research Center
3333 Coyote Hill Rd
Palo Alto, CA 94304, USA
{rui,bobrow,hoda.eldardiry,afeldman,john.hanley,tomo.honda,dekleer,aperez}@parc.com
Dave Archer and David Burke
Galois, Inc.
421 SW 6th Avenue, Suite 300
Portland, OR 97204, USA
{dwa,davidb}@galois.com
Abstract

When a computer system is hacked, analyzing the root-cause (for example the entry-point of penetration) is a diagnostic process. An audit trail, as defined in the National Information Assurance Glossary, is a security-relevant chronological (set of) record(s), and/or destination and source of records, that provides evidence of the sequence of activities that have affected, at any time, a specific operation, procedure, or event. After detecting an intrusion, system administrators manually analyze audit trails to both isolate the root-cause and perform damage impact assessment of the attack. Due to the sheer volume of information and low-level activities in the audit trails, this task is rather cumbersome and time intensive. In this position paper, we discuss our ideas to automate the analysis of audit trails using machine learning and model-based reasoning techniques. Our approach classifies audit trails into the high-level activities they represent, and then reasons about those activities and their threat potential in real-time and forensically. We argue that, by using the outcome of this reasoning to explain complex evidence of malicious behavior, we are equipping system administrators with the proper tools to promptly react to, stop, and mitigate attacks.

1 Introduction

Today, enterprise system and network behaviors are typically "opaque": stakeholders lack the ability to assert causal linkages in running code, except in very simple cases. At best, event logs and audit trails can offer some partial information on temporally and spatially localized events as seen from the viewpoint of individual applications. Thus current techniques give operators little system-wide situational awareness, nor any viewpoint informed by a long-term perspective. Adversaries have taken advantage of this opacity by adopting a strategy of persistent, low-observability operation from inside the system, hiding effectively through the use of long causal chains of system and application code. We call such adversaries advanced persistent threats, or APTs.
To address current limitations, this position paper discusses a technique that aims to track causality across the enterprise and over extended periods of time, identify subtle causal chains that represent malicious behavior, localize the code at the roots of such behavior, trace the effects of other malicious actions descended from those roots, and make recommendations on how to mitigate those effects. By doing so, the proposed approach aims to enable stakeholders to understand and manage the activities going on in their networks. The technique exploits both current and novel forms of local causality to construct higher-level observations of long-term causality in system information flow. We propose to use a machine learning approach to classify segments of low-level events by the activities they represent, and to reason over these activities, prioritizing candidate activities for investigation. The diagnostic engine investigates these candidates looking for patterns that may represent the presence of APTs. Using pre-defined security policies and related mitigations, the approach explains discovered APTs and recommends appropriate mitigations to operators. We plan to leverage models of APT and normal business logic behavior to diagnose such threats. Note that the technique is not constrained by the availability of human analysts, but can benefit from human-on-the-loop assistance.
The approach discussed in this paper will offer unprecedented capability for observation of long-term, subtle system-wide activity by automatically constructing such global, long-term causality observa-
tions. The ability to automatically classify causal chains of events in terms of abstractions such as activities will provide operators with a unique capability to orient to long-term, system-wide evidence of possible threats. The diagnostic engine will provide a unique capability to identify whether groups of such activities likely represent active threats, making it easier for operators to decide whether long-term threats are active, and where they originate, even before those threats are identified by other means. Thus, the approach will pave the way for the first automated, long-horizon, continuously operating system-wide support for an effective defender Observe, Orient, Decide, and Act (OODA) loop.

2 Running Example

The methods proposed in this article are illustrated on a realistic running example. The attackers in this example use sophisticated and recently discovered exploits to gain access to the victim's resources. The attack is remote and does not require social engineering or opening a malicious email attachment. The methods that we propose, however, are not limited to this class of attacks.

[Figure: network topology diagram: a hacker on the Internet reaches, through a router into the victim's local network, the web server front-end, the data storage back end, and the administrator.]
Figure 1: Network topology for the attack

The network topology used for our running example is shown in Figure 1. The attack is executed over several days. It starts by (1) compromising the web server front-end, followed by (2) a reconnaissance phase and (3) compromising the data storage back end, and ultimately extracting and modifying sensitive information belonging to the victim.
Both the front-end and the back end in this example run an unpatched Ubuntu 13.1 Linux OS on an Intel Sandy Bridge architecture.
What follows is a detailed chronology of the events:

1. The attacker uses the Apache httpd server, a cgi-bin script, and the Shellshock vulnerability (a GNU bash exploit registered in the Common Vulnerabilities and Exposures database as CVE 2014-6271; see https://nvd.nist.gov/) to gain remote shell access to the victim's front-end. It is now possible for the attacker to execute processes on the front-end as the non-privileged user www-data.

2. The attacker notices that the front-end is running an unpatched Ubuntu Linux OS version 13.1. The attacker uses the nc Linux utility to copy an exploit for obtaining root privileges. The particular exploit that the attacker uses utilizes the x32 recvmmsg() kernel vulnerability registered in the Common Vulnerabilities and Exposures (CVE) database as CVE 2014-0038. After running the copied binary for a few minutes, the attacker gains root access to the front-end host.

3. The attacker installs a root-kit utility that intercepts all input to ssh;

4. A system administrator uses the compromised ssh to connect to the back-end, revealing his back-end password to the attacker;

5. The attacker uses the compromised front-end to bypass firewalls and uses the newly acquired back-end administrator's password to access the back-end;

6. The attacker uses a file-tree traversing utility on the back-end that collects sensitive data and consolidates it in an archive file;

7. The attacker sends the archive file to a third-party hijacked computer for analysis.

3 Auditing and Instrumentation

Almost all computing systems of sufficiently high level (with the exception of some embedded systems) leave detailed logs of all system and application activities. Many UNIX variants such as Linux log via the syslog daemon, while Windows uses the event log service. In addition to the usual logging mechanisms, there is a multitude of projects related to secure and detailed auditing. An audit log is a more detailed trail of any security- or computation-related activity such as file or RAM access, system calls, etc.
Depending on the level of security we would like to provide, there are several methods for collecting input security-related information. At one extreme, it is possible to use the existing log files. At the other extreme, there are applications for collecting detailed information about the application execution. One such approach [1] runs the processes of interest through a debugger and logs every memory read and write access.
It is also possible to automatically inject logging calls in the source files before compiling them, allowing us to have static or dynamic logging or a combination of the two. Log and audit information can be signed, encrypted, and sent in real time to a remote server to make system tampering and activity-hiding more difficult. All these configuration decisions impose different trade-offs in security versus computational and RAM load [2] and depend on the organizational context.
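Before any causal or activity analysis, raw audit records such as the auditd SYSCALL entry in Figure 2 must be parsed into structured fields. A minimal sketch of such a parser follows; the field names (comm, exe, uid, syscall) follow Linux auditd conventions, but the parser itself is our own illustration, not part of the paper's system.

```python
# Minimal sketch: turn a raw audit record (like the third entry in
# Figure 2) into a key/value dict that later pipeline stages can use.
import re

# One "key = value" pair; values may be quoted strings or bare tokens.
_PAIR = re.compile(r'(\w+)\s*=\s*("[^"]*"|\S+)')

def parse_audit_record(line):
    """Extract key=value fields from one audit-trail line."""
    fields = {}
    for key, value in _PAIR.findall(line):
        fields[key] = value.strip('"')
    return fields

record = ('type = SYSCALL msg = audit(1310392408.506:36): arch = c000003e '
          'syscall = 2 success = yes exit = 3 uid = 0 euid = 0 '
          'comm = "grep" exe = "/bin/grep"')
fields = parse_audit_record(record)
print(fields["exe"], fields["uid"], fields["syscall"])  # /bin/grep 0 2
```

A real deployment would also normalize timestamps and join records by pid/ppid, which is exactly the information the provenance-graph construction discussed below relies on.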
...
front_end.secure_access_log: 11.239.64.213 - [22/Apr/2014 06:30:24 +0200] "GET /cgi-bin/test.cgi HTTP/1.1" 401 381
...
front_end.rsyslogd.log: recvmsg(3, msg_name(0) = NULL, msg_iov(1) = ["29/Apr/2014 22:15:49 ...", 8096], msg_controllen = 0, msg_flags = MSG_CTRUNC, MSG_DONTWAIT) = 29
...
back_end:auditctl: type = SYSCALL msg = audit(1310392408.506:36): arch = c000003e syscall = 2 success = yes exit = 3 a0 = 7fff2ce9471d a1 = 0 a2 = 61f768 a3 = 7fff2ce92a20 items = 1 ppid = 20478 pid = 21013 auid = 1000 uid = 0 gid = 0 euid = 0 suid = 0 fsuid = 0 egid = 0 sgid = 0 fsgid = 0 ses = 1 comm = "grep" exe = "/bin/grep"
...

Figure 2: Part of the log files related to the attack from the running example

Figure 2 shows part of the logs collected for our running example. The first entry is from when the attacker exploits the Shellshock vulnerability through a CGI script of the web server. The second entry shows the syslog strace-like message resulting from the kernel escalation. Finally, the attacker uses the grep command on the back-end server to search for sensitive information, and the call is recorded by the audit system.
It is often the case that the raw system and security log files are preprocessed and initial causal links are computed. If we trace the exec, fork, and join POSIX system calls, for example, it is possible to add a graph-like structure to the log files, computing provenance graphs. Another method for computing local causal links is to consider shared resources, e.g., two threads reading and writing the same memory address [1].

4 Activity Classification

The Activity Classifier continuously annotates audit trails with semantic tags describing the higher-order activity they represent. For example, 'remote shell access', 'remote file overwrite', and 'intra-network data query' are possible activity tags. These tags are used by the APT Diagnostics Engine to enable higher-order reasoning about related activities, and to prioritize activities for possible investigation.

4.1 Hierarchical semantic annotation of audit trails

A key challenge in abstracting low-level events into higher-order activity patterns that can be reasoned about efficiently is that such patterns can be described at multiple levels of semantic abstraction, all of which may be useful in threat analysis. Indeed, higher-order abstractions may be composed of lower-order abstractions that are in turn abstractions of low-level events. For example, a sequential set of logged events such as 'browser forking bash', 'bash initiating Netcat', and 'Netcat listening to a new port' might be abstracted as the activity 'remote shell access'. The set of activities 'remote shell access' and 'escalation of privilege' can be abstracted as the activity 'remote root shell access'.
We approach activity annotation as a supervised learning problem that uses classification techniques to generate activity tags for audit trails. Table 1 shows multiple levels of activity classifications for the above APT example.
Table 1 represents one possible classification-enriched audit trail for such an APT. There can be many relatively small variations; for example, obscuring the password file could be done using other programs. A single classifier only allows for a single level of abstraction, and a single leap from low-level events to very abstract activities (for example, from the 'bash execute perl' level to the 'extracting modified file' level) will have higher error caused by these additional variations.
To obtain several layers of abstraction to reason over, and thus reduce overall error in classification, we use a multi-level learning strategy that models information at multiple levels of semantic abstraction using multiple classifiers. Each classifier solves the problem at one abstraction level, by mapping from a lower-level (fine) feature space to the next higher-level conceptual (coarse) feature space.
The activity classifier relies on both a vocabulary of activities and a library of patterns describing these activities, both of which will be initially defined manually. This vocabulary and pattern set reside in a Knowledge Base. In our training approach, results from training lower-level classifiers are used as training data for higher-level classifiers. In this way, we coherently train all classifiers by preventing higher-level classifiers from being trained with patterns that will never be generated by their lower-level precursors. We use an ensemble learning approach to achieve accurate classification. This involves stacking together both bagged and boosted models to reduce both variance and bias error components [3]. The classification algorithm will be trained using an online-learning technique and integrated within an Active Learning Framework to improve classification of atypical behaviors.
Generating Training Data for Classification. To build the initial classifier, training data is generated using two methods. First, an actual deployed system is used to collect normal behavior data, and a Subject Matter Expert manually labels it. Second, a testing platform is used to generate data in a controlled environment, particularly for platform-dependent vulnerability-related behavior. In addition, to generate new training data of previously unknown behavior, we use an Active Learning framework as described in Section 5.
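The multi-level abstraction idea of Section 4.1 can be made concrete with a toy, hand-written pattern library standing in for the trained classifiers. The patterns below loosely follow the examples in the text ('browser forking bash' etc.); all names and the matching logic are illustrative assumptions, not the authors' learned models.

```python
# Toy two-level abstraction of low-level events into activity tags.
# Level 1 maps short event sequences to low-level activities; level 2 maps
# sets of level-1 activities (plus raw events) to higher-order activities.

# Level 1: event patterns -> low-level activity tags.
LEVEL1 = {
    ("browser fork bash", "bash fork netcat", "netcat listen"): "remote shell access",
    ("netcat receive binary", "binary overwrites libns.so"): "trojan installation",
}
# Level 2: sets of lower-level activities -> higher-order activity tags.
LEVEL2 = {
    frozenset({"remote shell access", "escalation of privilege"}): "remote root shell access",
}

def classify(events):
    """Map a raw event stream to activity tags, one abstraction level at a time."""
    level1 = [tag for pattern, tag in LEVEL1.items()
              if all(e in events for e in pattern)]
    level2 = [tag for pattern, tag in LEVEL2.items()
              if pattern <= set(level1 + events)]
    return level1, level2

events = ["browser fork bash", "bash fork netcat", "netcat listen",
          "escalation of privilege"]
print(classify(events))
# (['remote shell access'], ['remote root shell access'])
```

The staged structure mirrors the paper's training discipline: a level-2 pattern can only ever see tags that level 1 is able to emit, which is what prevents higher-level classifiers from being trained on patterns their precursors never generate.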
Table 1: Sample classification problem for running example

Activity 1: Remote Shell Access (Shell Shock)
  - Browser (Port 80) fork bash
  - bash fork Netcat
  - Netcat listen to port 8080

Activity 2: Remote File Overwrite (Trojan Installation)
  - Netcat listen to Port 8443
  - Port 8443 receive binary file
  - binary file overwrites libns.so

Activity 3: Modified File Download (Password Exfiltration)
  - Netcat listen to Port 8443
  - Port 8443 fork bash
  - bash execute perl
  - Perl overwrite /tmp/stolen_pw
  - Port 8443 send /tmp/stolen_pw
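The prioritizer described next must merge rankings produced by several rankers. As a self-contained illustration, here is a weighted Borda count, a common rank-aggregation baseline; the paper proposes a probabilistic latent-variable model instead, and all activity names and weights below are made up.

```python
# Weighted Borda count: each ranker orders activities from most to least
# risky; a ranker's weight scales the points it can award. This is a
# baseline illustration, not the paper's probabilistic aggregation model.

def weighted_borda(rankings, weights):
    """Combine several rankings (most-risky first) into one consensus list."""
    scores = {}
    for ranking, w in zip(rankings, weights):
        n = len(ranking)
        for pos, item in enumerate(ranking):
            # an item in position pos of an n-item ranking earns w * (n - pos)
            scores[item] = scores.get(item, 0.0) + w * (n - pos)
    return sorted(scores, key=scores.get, reverse=True)

supervised   = ["act3", "act1", "act2"]   # known-threat ranker
unsupervised = ["act1", "act3", "act2"]   # normalcy-based ranker
# The less reliable ranker is down-weighted rather than ignored.
consensus = weighted_borda([supervised, unsupervised], [1.0, 0.5])
print(consensus)  # -> ['act3', 'act1', 'act2']
```

Down-weighting an unreliable ranker (here 0.5 vs. 1.0) is the simplest version of the reliability handling that the prioritizer's rank-combination step calls for.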
5 Prioritizer

As the Activity Classifier annotates audit trails with activity descriptors, the two (parallel) next steps in our workflow are to 1) prioritize potential threats to be referred to the Diagnostic Engine (see Section 6) for investigation, and 2) prioritize emergent activities that (after suitable review and labeling) are added to the activity classifier training data. This module prioritizes activities by threat severity and confidence level. This prioritization process presents three key challenges.

5.1 Threat-based rank-annotation of activities

One challenge in ranking activities according to their threat potential is the complex (and dynamic) notion of what constitutes a threat. Ranking based on matching to known prior threats is necessary, but not sufficient. An ideal ranking approach should take known threats into account, while also proactively considering the unknown threat potential of new kinds of activities. Another such challenge is that risk may be assessed at various levels of activity abstraction, requiring that the overall ranking be computed by aggregating risk assessments at multiple abstraction levels.
We implement two ranking approaches: a supervised ranker based on previously known threats and an unsupervised ranker that considers unknown potential threats.
Supervised ranking using APT classification to catch known threats. The goal of APT classification is to provide the diagnostic engine with critical APT-related information such as the APT phase, severity of attack, and confidence level associated with APT tagging for threat prioritization. Since the audit trails are annotated hierarchically into different granularities of actions, multiple classifiers will be built to consider each hierarchical level separately. APT classifiers are used to identify entities that are likely to be instances of known threats or phases of an APT attack. Two types of classifiers are used: the first is hand-coded and the second is learned from training data.
The hand-coded classifier is designed to have high precision, using hand-coded rules, mirroring SIEM and IDS systems. Entities tagged by this classifier are given the highest priority for investigation. The second classifier, which is learned from training data, will provide higher recall at the cost of precision. Activities are ranked according to their threat level by aggregating a severity measure (determined by the classified threat type) and a confidence measure. We complement the initial set of training data to calibrate our classifiers by using an Active Learning Framework, which focuses on improving the classification algorithm through occasional manual labeling of the most critical activities in the audit trails.
Unsupervised ranking using normalcy characterization to catch unknown threats. The second component of the prioritizer is a set of unsupervised normalcy rankers, which rank entities based on their statistical "normalcy". Activities identified as unusual will be fed to the Active Learning framework to check if any of them are "unknown" APT activities. This provides a mechanism for detecting "unknown" threats while also providing feedback to improve the APT classifier.

5.2 Combining Multiple Rankings

One of the key issues with combining the outputs of multiple risk rankings is dealing with two-dimensional risk (severity, confidence) scores that may be on very different scales. A diverse set of score normalization techniques has been proposed [4; 5; 6] to deal with this issue, but no single technique has been found to be superior over all the others. An alternative to combining scores is to combine rankings [7]. Although converting scores to rankings does lose information, it remains an open question whether the loss in information is compensated for by the convenience of working with the common scale of rankings.
We will develop combination techniques for weighted risk rankings based on probabilistic rank aggregation methods. This approach builds on our own work [8], which shows the robustness of the weighted ranking approach. We also build on principled methods for combining ranking data found in the statistics and information retrieval literature.
Traditionally, the goal of rank aggregation [9; 10] is to combine a set of rankings of the same candidates into a single consensus ranking that is "better" than the individual rankings. We extend the traditional approach to accommodate the specific context of weighted risk ranking. First, unreliable rankers will be identified and either ignored or down-weighted, lest their rankings decrease the quality of the overall consensus [7; 10]. Second, we will discount excessive correlation among rankers, so that a set of
highly redundant rankers does not completely outweigh the contribution of other alternative rankings. To address these two issues, we will associate a probabilistic latent variable Zi with the i-th entity of interest, which indicates whether the entity is anomalous or normal. Then, we will build a probabilistic model that allows us to infer the posterior distribution over the Zi based on the observed rankings produced by each of the input weighted risk rankings. This posterior probability of Zi being normal will then be used as the weighted risk rank. Our model will make the following assumptions to account for both unreliable and correlated rankers: 1) anomalies are ranked lower than all normal instances, and these ranks tend to be concentrated near the lower rankings of the provided weighted risk rankings, and 2) normal data instances tend to be uniformly distributed near the higher rankings of the weighted risk rankings.

There are various ways to build a probabilistic model that reflects the above assumptions and allows for the inference of the Zi variables through Expectation-Maximization [11]. In addition to these assumptions, we will explore allowing other factors to influence the latent Zi variables, such as features of the entities as well as feedback provided by expert analysts.

6 Diagnosis

We view the problem of detecting, isolating, and explaining complex APT campaign behavior from rich activity data as a diagnostic problem. We will use AI-based diagnostic reasoning to guide the global search for possible vulnerabilities that enabled the breach. Model-based diagnosis (MBD) [12] is a particularly compelling approach as it supports reasoning over complex causal networks (for example, having multiple conjunctions, disjunctions, and negations) and identifies often subtle combinations of root causes of the symptoms (the breach).

6.1 An MBD approach for APT detection and isolation: Motivation

Attack detection and isolation are two distinct challenges. Often, diagnostic approaches use separate models for detection and isolation [13]. MBD, however, uses a single model to combine these two kinds of reasoning. The security model contains both part of the security policy (that communicating with certain blacklisted hosts may indicate an information leak) and information about the possible locations and consequences of a vulnerability (a privilege escalation may lead to an information leak). The security model also contains abstract security constraints, such as: if a process requires authentication, a password must be read and compared against.

The diagnostic approach takes into consideration the bootstrapping of an APT, which we consider the root cause of the attack. What enables a successful APT is either a combination of software component vulnerabilities or the combined use of social engineering and insufficiency of the organizational security policies. We use MBD for computing the set of simultaneously exploited vulnerabilities that allowed the deployment of the APT. Computing such explanations is possible because MBD reasons in terms of multiple faults [14]. In our running example this set would include both the fact that the web server has been exploited due to the Shellshock vulnerability and the fact that the attacker gained privileged access on the front-end due to the use of the X64_32 escalation vulnerability.

The abstract security model is used to gather information about the types of attacks the system is vulnerable to, and to aid in deciding the set of actions required to stop an APT campaign (policy enforcement). Various heuristics exist to find the set of meaningful diagnosis candidates. As an example, one might be interested in the minimal set of actions to stop the attack [15; 16] or select those candidates that capture significant probability mass [17]. In the rest of this section, for illustration purposes, we use minimality as the heuristic of interest. MBD is the right tool for computing diagnosis candidates as it offers several ways to address the modeling and computational complexity [18; 19].

6.2 Detection and Isolation of Attacks from Abstract Security Model and Sensor Data

The abstract security model provides an abstraction mechanism that is originally missing in the audit trails. More precisely, what is not in the audit trails and what is in the security model is how to connect (possibly disconnected) activities for the purpose of global reasoning. The abstract security model and the sensor data collected from the audit trails are provided as inputs to an MBD algorithm that performs the high-level reasoning about possible vulnerabilities and attacks, similar to what a human security officer would do.

The information in the “raw” audit trails is of too high fidelity [2] and too low abstraction to be used by a “crude” security model. That is the reason the diagnostic engine needs the machine learning module to temporally and spatially group nodes in the audit trails and to provide semantically rich variable/value sensor data about actions, suitable for MBD. Notice that in this process, the audit trail structure is translated to semantic categories, i.e., the diagnostic engine receives as observations time series of sensed actions.

The listing that follows shows an abstract security model for the running example in the LYDIA language [20]. This bears some resemblance to PROLOG, except that LYDIA is a language for model-based diagnosis of logical circuits while PROLOG is for Horn-style reasoning. The use of LYDIA is for illustration purposes only; in reality, computer systems can be much more easily modeled as state machines. There is a significant body of literature dealing with diagnosis of discrete-event systems [21; 22; 23], to name just a few.
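Before turning to the listing, the consistency-based reasoning just described can be sketched in a few lines of Python. This is a hypothetical toy, not the LYDIA engine: it brute-forces the three assumable (health) variables of the front-end against the weak-fault constraints of the running example and keeps the subset-minimal fault sets.

```python
from itertools import product

# Assumable (health) variables of the front-end, as in the running example.
ASSUMABLES = ["httpd_shell_vuln", "escalation_vuln", "buffer_overflow_vuln"]

def consistent(a, obs):
    """Weak-fault model: a healthy component forbids its abnormal effect;
    a faulty one leaves it unconstrained. Check whether some assignment of
    the internal variables is consistent with the observation."""
    for httpd_shell, leak_passwd, root_shell in product([False, True], repeat=3):
        if not a["httpd_shell_vuln"] and httpd_shell:
            continue  # no shell via httpd when healthy
        if not a["escalation_vuln"] and root_shell:
            continue  # no root shell when healthy
        if not a["buffer_overflow_vuln"] and leak_passwd:
            continue  # passwords don't leak when healthy
        # know_root_password => (httpd_shell or leak_passwd) and root_shell
        if obs["know_root_password"] and not ((httpd_shell or leak_passwd) and root_shell):
            continue
        return True
    return False

def minimal_diagnoses(obs):
    """Enumerate all fault-assumption candidates and keep the minimal ones."""
    cands = [dict(zip(ASSUMABLES, bits)) for bits in product([False, True], repeat=3)]
    sats = [c for c in cands if consistent(c, obs)]
    faults = [frozenset(v for v, bad in c.items() if bad) for c in sats]
    return sorted({f for f in faults if not any(g < f for g in faults)}, key=sorted)

print(minimal_diagnoses({"know_root_password": True}))
```

For the observation know_root_password = true, this returns the two minimal fault sets {escalation_vuln, httpd_shell_vuln} and {buffer_overflow_vuln, escalation_vuln}, matching the two ambiguous diagnoses reported by LYDIA later in this section; a real MBD engine replaces the brute-force loop with conflict-directed search.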
system front_end(bool know_root_password)
{
    bool httpd_shell_vuln;       // vulnerability
    bool buffer_overflow_vuln;   // vulnerability
    bool escalation_vuln;        // vulnerability

    bool httpd_shell;
    bool root_shell;
    bool leak_passwd;

    // weak-fault models
    if (!httpd_shell_vuln) {     // if healthy
        !httpd_shell;            // forbid shells via httpd
    }

    if (!escalation_vuln) {      // if healthy
        !root_shell;             // no root shell is possible
    }

    if (!buffer_overflow_vuln) { // if healthy
        !leak_passwd;            // passwords don't leak
    }

    bool access_passwd;
    attribute observable(access_passwd) = true;

    !access_passwd => !leak_passwd;

    /**
     * Knowing the root password can be explained
     * by a root shell (for example there is a
     * password sniffer).
     */
    know_root_password =>
        ((httpd_shell || leak_passwd) && root_shell);
}

system back_end(bool know_root_password)
{
    bool comm;
    attribute observable(comm) = true;

    /**
     * Normal users can only communicate with a
     * list of permitted hosts.
     */
    if (!know_root_password) {
        comm == true;
    }
}

system main()
{
    bool know_root_password;

    system front_end fe(know_root_password);
    system back_end be(know_root_password);
}

LYDIA translates the model to an internal propositional logic formula. Part of this internal representation is shown in figure 3, which uses the standard VLSI notation [24] to denote AND-gates, OR-gates, and NOT-gates. Wires are labeled with variable names. Boolean circuits (matching propositional logic), however, have limited expressiveness, and modeling security constraints in them is notoriously difficult; hence we plan to create or use a specialized modal logic similar to the one proposed in [25].

[Figure 3: Part of the abstract security model for the running example — a Boolean circuit relating know root password, root shell, httpd shell, leak_pw, and buffer overflow vuln through internal wires p, q, and r; the legend distinguishes assumable variables from internal variables.]

Notice that the format of the Boolean circuit shown in figure 3 is very close to the one used in a Truth Maintenance System (TMS) [26]. The only assumable variable in figure 3 is buffer_overflow_vuln and its default value is false (i.e., there is no buffer overflow vulnerability in the web server process).

We next show how a reasoning engine can discover a conflict through forward and backward propagation. Looking at figure 3, it is clear that r must be true because it is an input to an AND-gate whose output is set to true. Therefore either p or q (or both) must be true. This means that either buffer_overflow_vuln or leak_pw must be false. If we say that leak_pw is assumed to be true (measured or otherwise inferred), then leak_pw and buffer_overflow_vuln are together part of a conflict. It means that the reasoning engine has to change one of them to resolve the contradiction.

Based on the observation from our running example and a TMS constructed from the security model shown in figure 3, the hitting set algorithm computes two possible diagnostic hypotheses: (1) the attacker gained shell access through a web-server vulnerability and performed privilege escalation, or (2) the attacker injected binary code through a buffer overflow and performed privilege escalation.

If we use LYDIA to compute the set of diagnoses for the running example, we get the following two (ambiguous) diagnoses for the root cause of the penetration:

    $ lydia example.lm example.obs
    d1 = { fe.escalation_vuln,
           fe.httpd_shell_vuln }
    d2 = { fe.buffer_overflow_vuln,
           fe.escalation_vuln }

MBD uses probabilities to compute a sequence of possible diagnoses ordered by likelihood. This probability can be used for many purposes: to decide which diagnosis is more likely to be the true fault explanation, to decide whether there is a need to consider further evidence from the logs, or to limit the number of diagnoses that need to be identified. Many policies exist to compute these probabilities [27; 28].

For illustration purposes we consider that the diagnoses for the running example are ambiguous. Before we discuss methods for dealing with this ambiguity, we address the major research challenge of model generation.

6.3 Model Generation

The abstract vulnerability model can be constructed either manually or semi-automatically. The challenge with modeling is that an APT campaign generally exploits unknown vulnerabilities. Therefore, our approach to address this issue is to construct a model that captures the expected behavior (known goods) of the system. Starting from generic parameterized vulnerability models and security objectives, the abstract vulnerability model can be extended with information related to known vulnerabilities (known bads).

Generating the model can be done either manually or semi-automatically. We will explore avenues to generate this model manually, which requires significant knowledge about potential security vulnerabilities, while being error-prone and not detailed enough. Amongst company-specific requirements, we envisage the abstract vulnerability model to capture the most common attacks that target software systems, as described in the Common Attack Pattern Enumeration and Classification (CAPEC, http://capec.mitre.org/). This comprehensive list of known attacks has been designed to better understand the perspective of an attacker exploiting the vulnerabilities and, from this knowledge, to devise appropriate defenses.

As modeling is challenging, we propose to explore semi-automatic approaches to construct models. The semi-automatic method is suitable for addressing the modeling because in security, similarly to diagnosis, there are (1) component models and (2) structure. While it is difficult to automate the building of component models (this may even require natural-language parsing of databases such as CAPEC), it is feasible to capture diagnosis-oriented information from structure (physical networking or network communication).

Yet another approach to semi-automatically generate the model is to learn it from executions of the system (e.g., during regression testing, just before deployment). This approach to system modeling is inspired by work in automatic software debugging [29], where program behavior is modeled in terms of abstractions of program traces, known as spectra [30], abstracting away from specific components and data dependencies.

The outlined approaches to constructing the abstract vulnerability model entail different costs and diagnostic accuracies. As expected, manually building the model is the most expensive: it is a time-consuming and error-prone task. The two semi-automatic ways also entail different costs: one exploits the available, static information and the other requires the system to be executed to compute a meaningful set of executions. We will investigate the trade-offs between modeling approaches and their diagnostic accuracy in the context of transparent computing.

7 Conclusions

Identifying the root cause and performing damage impact assessment of advanced persistent threats can be framed as a diagnostic problem. In this paper, we discuss an approach that leverages machine learning and model-based diagnosis techniques to reason about potential attacks.

Our approach classifies audit trails into high-level activities, and then reasons about those activities and their threat potential in real time and forensically. By using the outcome of this reasoning to explain complex evidence of malicious behavior, system administrators are provided with the proper tools to promptly react to, stop, and mitigate attacks.

References

[1] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. High accuracy attack provenance via binary-based execution partition. In Proceedings of the 20th Annual Network and Distributed System Security Symposium, San Diego, CA, February 2013.

[2] Kyu Hyung Lee, Xiangyu Zhang, and Dongyan Xu. LogGC: Garbage collecting audit log. In Proceedings of the 2013 ACM SIGSAC Conference on Computer and Communications Security, pages 1005–1016, Berlin, Germany, 2013.

[3] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, and F. Herrera. A review on ensembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 42(4):463–484, July 2012.

[4] Charu C. Aggarwal. Outlier ensembles: Position paper. ACM SIGKDD Explorations Newsletter, 14(2):49–58, 2013.

[5] Jing Gao and Pang-Ning Tan. Converting output scores from outlier detection algorithms into probability estimates. In Proceedings of the Sixth International Conference on Data Mining, pages 212–221. IEEE, December 2006.

[6] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek. Interpreting and unifying outlier scores. In Proceedings of the Eleventh SIAM International Conference on Data Mining, pages 13–24, April 2011.

[7] Erich Schubert, Remigius Wojdanowski, Arthur Zimek, and Hans-Peter Kriegel. On evaluation of outlier rankings and outlier scores. In Proceedings of the Twelfth SIAM International Conference on Data Mining, pages 1047–1058, April 2012.
[8] Hoda Eldardiry, Kumar Sricharan, Juan Liu, John Hanley, Bob Price, Oliver Brdiczka, and Eugene Bart. Multi-source fusion for anomaly detection: using across-domain and across-time peer-group consistency checks. Journal of Wireless Mobile Networks, Ubiquitous Computing, and Dependable Applications (JoWUA), 5(2):39–58, 2014.

[9] Yoav Freund, Raj D. Iyer, Robert E. Schapire, and Yoram Singer. An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4(Nov):933–969, 2003.

[10] Ke Deng, Simeng Han, Kate J. Li, and Jun S. Liu. Bayesian aggregation of order-based rank data. Journal of the American Statistical Association, 109(507):1023–1039, 2014.

[11] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38, 1977.

[12] Johan de Kleer, Olivier Raiman, and Mark Shirley. One step lookahead is pretty good. In Readings in Model-Based Diagnosis, pages 138–142. Morgan Kaufmann Publishers, San Francisco, CA, 1992.

[13] Alexander Feldman, Tolga Kurtoglu, Sriram Narasimhan, Scott Poll, David Garcia, Johan de Kleer, Lukas Kuhn, and Arjan van Gemund. Empirical evaluation of diagnostic algorithm performance using a generic framework. International Journal of Prognostics and Health Management, pages 1–28, 2010.

[14] Johan de Kleer and Brian Williams. Diagnosing multiple faults. Artificial Intelligence, 32(1):97–130, 1987.

[15] Oleg Sheyner, Joshua Haines, Somesh Jha, Richard Lippmann, and Jeannette M. Wing. Automated generation and analysis of attack graphs. In Proceedings of the 2002 IEEE Symposium on Security and Privacy, pages 273–284. IEEE, May 2002.

[16] Seyit Ahmet Camtepe and Bülent Yener. Modeling and detection of complex attacks. In Proceedings of the Third International Conference on Security and Privacy in Communications Networks, pages 234–243, September 2007.

[17] Rui Abreu and Arjan J. C. van Gemund. A low-cost approximate minimal hitting set algorithm and its application to model-based diagnosis. In Proceedings of the Eighth Symposium on Abstraction, Reformulation and Approximation, pages 2–9, July 2009.

[18] Alexander Feldman, Gregory Provan, and Arjan van Gemund. Approximate model-based diagnosis using greedy stochastic search. Journal of Artificial Intelligence Research, 38:371–413, 2010.

[19] Nuno Cardoso and Rui Abreu. A distributed approach to diagnosis candidate generation. In Progress in Artificial Intelligence, pages 175–186. Springer, 2013.

[20] Alexander Feldman, Jurryt Pietersma, and Arjan van Gemund. All roads lead to fault diagnosis: Model-based reasoning with LYDIA. In Proceedings of the Eighteenth Belgium-Netherlands Conference on Artificial Intelligence (BNAIC'06), Namur, Belgium, October 2006.

[21] Meera Sampath, Raja Sengupta, Stephane Lafortune, Kasim Sinnamohideen, and Demosthenis C. Teneketzis. Failure diagnosis using discrete-event models. IEEE Transactions on Control Systems Technology, 4(2):105–124, 1996.

[22] Alban Grastien, Marie-Odile Cordier, and Christine Largouët. Incremental diagnosis of discrete-event systems. In DX, 2005.

[23] Alban Grastien, Patrik Haslum, and Sylvie Thiébaux. Conflict-based diagnosis of discrete event systems: theory and practice. 2012.

[24] Behrooz Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, Inc., New York, NY, USA, 2nd edition, 2009.

[25] Janice Glasgow, Glenn Macewen, and Prakash Panangaden. A logic for reasoning about security. ACM Transactions on Computer Systems, 10(3):226–264, August 1992.

[26] Kenneth Forbus and Johan de Kleer. Building Problem Solvers. MIT Press, 1993.

[27] Johan de Kleer. Diagnosing multiple persistent and intermittent faults. In Proceedings of the 2009 International Joint Conference on Artificial Intelligence, pages 733–738, July 2009.

[28] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. A new Bayesian approach to multiple intermittent fault diagnosis. In Proceedings of the 2009 International Joint Conference on Artificial Intelligence, pages 653–658, July 2009.

[29] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Spectrum-based multiple fault localization. In Proceedings of the 24th IEEE/ACM International Conference on Automated Software Engineering, pages 88–99, November 2009.

[30] Mary Jean Harrold, Gregg Rothermel, Kent Sayre, Rui Wu, and Liu Yi. An empirical investigation of the relationship between spectra differences and regression faults. Software Testing, Verification and Reliability, 10(3):171–194, 2000.
A Structural Model Decomposition Framework for Hybrid Systems Diagnosis

Matthew Daigle (1) and Anibal Bregon (2) and Indranil Roychoudhury (3)

(1) NASA Ames Research Center, Moffett Field, CA 94035, USA
    e-mail: matthew.j.daigle@nasa.gov
(2) Department of Computer Science, University of Valladolid, Valladolid, 47011, Spain
    e-mail: anibal@infor.uva.es
(3) Stinger Ghaffarian Technologies Inc., NASA Ames Research Center, Moffett Field, CA 94035, USA
    e-mail: indranil.roychoudhury@nasa.gov
Abstract allows the modeler to focus on the discrete behavior only at
the component level, the pre-enumeration of all the system-
Nowadays, a large number of practical systems in level modes can be avoided [6, 7]. Additionally, building
aerospace and industrial environments are best rep- models in a compositional way facilitates reusability and
resented as hybrid systems that consist of discrete maintenance, and allows the validation of the components
modes of behavior, each defined by a set of contin- individually before they are composed to create the global
uous dynamics. These hybrid dynamics make the hybrid system model.
on-line fault diagnosis task very challenging. In
this work, we present a new modeling and diagno- In a system model, the effects of mode changes in individ-
sis framework for hybrid systems. Models are com- ual components may force other components to reconfigure
posed from sets of user-defined components using their computational structures, or causality, during the sim-
a compositional modeling approach. Submodels ulation process, which requires developing efficient online
for residual generation are then generated for a causality reassignment procedures. As an example of this
given mode, and reconfigured efficiently when the kind of approach, Hybrid Bond Graphs (HBGs) [8] have
mode changes. Efficient reconfiguration is estab- been used by different authors [9, 10], and efficient causality
lished by exploiting causality information within reassignment has been developed previously for such mod-
the hybrid system models. The submodels can then els [11]. However, the main limitation of HBGs is that the
be used for fault diagnosis based on residual gen- set of possible components is restricted (e.g., resistors, ca-
eration and analysis. We demonstrate the efficient pacitors, 0-junctions, etc.), with each component having to
causality reassignment, submodel reconfiguration, conform to a certain set of mathematical constraints, and
and residual generation for fault diagnosis using modelers do not have the liberty to define and use their own
an electrical circuit case study. components. Another example is that of [7], which uses a
more general modeling framework, and tackles the causality
reassignment problem from a graph-theoretic perspective.
1 Introduction In this work, we propose a compositional modeling ap-
Robust and efficient fault diagnosis plays an important role in proach for hybrid systems, where models are made up of
ensuring the safe, correct, and efficient operation of complex sets of user-defined components. Here, a component is con-
engineering systems. Many engineering systems are modeled structed by defining a set of discrete modes, with a different
as hybrid systems that have both continuous and discrete- set of mathematical constraints describing the continuous
event dynamics, and for such systems, the complexity of dynamics in each mode. Then, we borrow ideas for efficient
fault diagnosis methodologies increases significantly. In this causality reassignment in HBGs [11], and propose algorithms
paper, we develop a new modeling framework and structural for efficient causality assignment in our component-based
model decomposition approach that enable efficient online models, extending and generalizing those from HBGs. We
fault diagnosis of hybrid systems. then apply structural model-decomposition [12] to compute
During the last few years, different proposals have been minimal submodels for the initial mode of the system. These
made for hybrid systems diagnosis, focusing on either hy- submodels are used for fault diagnosis based on residual gen-
brid modeling, such as hybrid automata [1–3], hybrid state eration and analysis. Based on efficient causality reassign-
estimation [4], or a combination of on-line state tracking and ment, submodels can be reconfigured upon mode changes
residual evaluation [5]. However, in all these cases, the pro- efficiently. Using an electrical circuit as a case study, we
posed solutions involve modeling and pre-enumeration of the demonstrate efficient causality reassignment and submodel
set of all possible system-level discrete modes, which grows reconfiguration and show that these submodels can correctly
exponentially with the number of switching components. compute system outputs for residual generation in the pres-
Both steps are computationally very expensive or unfeasible ence of known mode changes.
for hybrid systems with complex interacting subsystems. The paper is organized as follows. Section 2 presents the
One solution to the mode pre-enumeration problem is to modeling approach and introduces the case study. Section 3
build hybrid system models in a compositional way, where presents the overall approach for hybrid systems fault diag-
discrete modes are defined at a local level (e.g., at the com- nosis based on structural model decomposition. Section 4
ponent level), in which the system-level mode is defined develops the causality analysis and assignment algorithms.
implicitly by the local component-level modes. Since this Section 5 presents the structural model decomposition ap-
201
Proceedings of the 26th International Workshop on Principles of Diagnosis
Sw1 C1 R1 Sw2 C2 R2
Example 1. Consider the component R1 (δ6 ). It has only
a single mode with a single constraint v5 = i5 ∗ R1 over
i11
variables {v5 , i5 , R1 }.
v(t) i3 L1 v8 L2
Example 2. Consider the component Sw2 (δ10 ). It has two
modes: on and off. In the off mode, it has three constraints
setting each of its currents (i9 , i10 , i11 ) to 0. In the on mode,
it has also three constraints, setting the three currents equal
Figure 1: Electrical circuit running example. to each other and establishing that the voltages sum up (it
acts like a series connection when in the on mode).
proach. Section 6 describes efficient causality reassignment. We can define a system model by composing components:
Section 7 demonstrates the approach for the electrical case Definition 3 (Model). A model M = {δ1 , δ2 , . . . , δd } is a
study. Section 8 reviews the related work and current ap- finite set of d components for d ∈ N.
proaches for hybrid systems fault diagnosis. Finally, Sec-
tion 9 concludes the paper. Example 3. The model of the electrical system is made up
of the components detailed in Table 1, i.e., M = {δ1 , δ2 , . . . ,
δ15 }. For each component, the variables and constraints are
2 Compositional Hybrid Systems Modeling defined for each component mode (third column).
We define hybrid system dynamics in a general composi- Note that the set of variables for a model does not change
tional way, where the system is made up of a set of com- with the mode, hence we need only a variable set in a com-
ponents. Each component is defined by a set of discrete ponent and not a set of variable sets as with constraints.
modes, with a different set of constraints describing the con- The set of variables for a model, VM , is simply the union
tinuous dynamics of the component in each mode. Here, of all the component variable sets, i.e., for d components,
system-level modes are defined implicitly through the com- VM = Vδ1 ∪ Vδ2 ∪ . . . ∪ Vδd . The interconnection struc-
position of the component-level modes. Because the number ture of the model is captured using shared variables between
of system-level modes is exponential in the number of switch- components, i.e., we say that two components are connected
ing components, we want to avoid generating and reasoning if they share a variable, i.e., components δi and δj are con-
over the system-level hybrid model, instead working directly nected if Vδi ∩ Vδj 6= ∅. VM consists of five disjoint sets,
with the component models. namely, the set of state variables, XM ; the set of parame-
To illustrate our proposal, throughout the paper we will ters, ΘM ; the set of inputs (variables not computed by any
use a circuit example, shown in Fig. 1. The components of constraint), UM ; the set of outputs (variables not used to
the circuit are a voltage source, V, two capacitors, C1 and C2 , compute any other variables), YM ; and the set of auxiliary
two inductors, L1 and L2 , two resistors, R1 and R2 , and two variables, AM . Parameters, ΘM , include explicit model pa-
switches, Sw1 and Sw2 , as well as components for series and rameters that are used in the model constraints (e.g., fault
parallel connections. Sensors measure the current or voltage parameters). Auxiliary variables, AM , are additional vari-
in different locations (i3 , v8 , and i11 , as indicated in Fig. 1). ables that are algebraically related to the state, parameter,
Because each switch has two modes (on and off), there are four total modes in the system.
In the following, we present the main details of our hybrid system modeling framework, which may be viewed as an extension of our modeling approach described in [12], extended with the notion of components and with hybrid system dynamics.

2.1 System Modeling
At the basic level, the continuous dynamics of a component in each mode are modeled using a set of variables and a set of constraints. A constraint is defined as follows:

Definition 1 (Constraint). A constraint c is a tuple (εc, Vc), where εc is an equation involving variables Vc.

A component is defined by a set of constraints over a set of variables. The constraints are partitioned into different sets, one for each component mode. A component is then defined as follows:

Definition 2 (Component). A component δ with n discrete modes is a tuple δ = (Vδ, Cδ), where Vδ is a set of variables and Cδ is a set of constraint sets, defined as Cδ = {Cδ^1, Cδ^2, ..., Cδ^n}, with a constraint set, Cδ^m, defined for each mode m = {1, ..., n}.

The components of the circuit are defined in Table 1 (first three columns).

and input variables, and are used to simplify the structure of the equations.

Example 4. In the circuit model, we have XM = {i3, v6, i8, v11}, ΘM = {L1, R1, C1, L2, R2, C2}, UM = {uv}, and YM = {i3*, i11*, v8*}. The remaining variables belong to AM. Here, the * superscript is used to denote a measured value of a physical variable, e.g., i3 ∈ XM is the current and i3* ∈ YM is the measured current. Since i3 is used to compute other variables, like i2, it cannot belong to YM, and a separation of the variables is required. Connected components are known by shared variables, e.g., R1 and Series Connection1 are connected because they share i5 and v5.

The model constraints, CM, are the union of the component constraints over all modes, i.e., CM = Cδ1 ∪ Cδ2 ∪ ... ∪ Cδd, where Cδi = Cδi^1 ∪ Cδi^2 ∪ ... ∪ Cδi^n for n modes. Constraints are exclusive to components, that is, a constraint c ∈ CM belongs to exactly one Cδ for δ ∈ M.

To refer to a particular mode of a model we use the concept of a mode vector. A mode vector m specifies the current mode of each of the components of a model. So, the constraints for a mode m are denoted as CM^m.

Example 5. Consider a model with five components. Then, if m = [1, 1, 3, 2, 1], components δ1, δ2, and δ5 use the constraints of their mode 1, component δ3 uses the constraints of its mode 3, and component δ4 uses the constraints of its mode 2.
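As an illustration of Definitions 1 and 2 and of mode vectors, the component representation can be sketched with ordinary data structures. The following Python is our own illustration, not the paper's implementation; all class and variable names are hypothetical:

```python
# Sketch (assumed representation) of Definitions 1-2 and mode vectors.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Constraint:
    """Definition 1: an equation over a set of variables V_c."""
    equation: str        # kept symbolic here, e.g., "v5 = i5 * R1"
    variables: frozenset

@dataclass
class Component:
    """Definition 2: constraint sets indexed by discrete mode 1..n."""
    name: str
    modes: dict = field(default_factory=dict)  # mode number -> set of Constraint

def active_constraints(components, mode_vector):
    """C_M^m: the union of each component's constraints in its current mode."""
    active = set()
    for comp, m in zip(components, mode_vector):
        active |= comp.modes[m]
    return active

# Example: switch Sw1 (off: i1 = 0, i2 = 0; on: i1 = i2, v1 = v2).
c_off = {Constraint("i1 = 0", frozenset({"i1"})),
         Constraint("i2 = 0", frozenset({"i2"}))}
c_on = {Constraint("i1 = i2", frozenset({"i1", "i2"})),
        Constraint("v1 = v2", frozenset({"v1", "v2"}))}
sw1 = Component("Sw1", {1: c_off, 2: c_on})
print(len(active_constraints([sw1], [2])))  # → 2
```

Selecting a different mode vector swaps in a different constraint set for each multi-mode component, which is exactly the mode-dependent behavior the algorithms below must handle.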
Proceedings of the 26th International Workshop on Principles of Diagnosis
Table 1: Components of the electrical circuit.

Component                  Mode  Constraints
δ1: V                      1     v1 = uv
δ2: Sw1                    1     i1 = 0;  i2 = 0
                           2     i1 = i2;  v1 = v2
δ3: Parallel Connection1   1     v2 = v3;  v2 = v4;  i2 = i3 + i4
δ4: L1                     1     i̇3 = v3/L1;  i3 = ∫[t0,t] i̇3
δ5: Series Connection1     1     i4 = i5;  i4 = i6;  i4 = i7;  v4 = v5 + v6 + v7
δ6: R1                     1     v5 = i5 · R1
δ7: C1                     1     v̇6 = i6/C1;  v6 = ∫[t0,t] v̇6
δ8: Parallel Connection2   1     v7 = v8;  v7 = v9;  i7 = i8 + i9
δ9: L2                     1     i̇8 = v8/L2;  i8 = ∫[t0,t] i̇8
δ10: Sw2                   1     i9 = 0;  i10 = 0;  i11 = 0
                           2     i9 = i10;  i9 = i11;  v9 = v10 + v11
δ11: R2                    1     v10 = i10 · R2
δ12: C2                    1     v̇11 = i11/C2;  v11 = ∫[t0,t] v̇11
δ13: Current Sensor11      1     i11* = i11
δ14: Voltage Sensor8       1     v8* = v8
δ15: Current Sensor3       1     i3* = i3

The remaining eight columns of the table, A[1 2], A[1 2]_i3*, A[1 2]_v8*, A[1 2]_i11*, A[2 1], A[2 1]_i3*, A[2 1]_v8*, A[2 1]_i11*, give, for each constraint, the dependent variable of its causal assignment in the corresponding mode (and, for the sensor-indexed columns, in the minimal submodel for that sensor's output); an empty entry indicates that the constraint is not used there.
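The algorithms that follow repeatedly need the set of constraints involving a given variable (written CM(v) in the pseudocode), and connected components are identified by shared variables. A small sketch of such a variable-to-constraint index, using a few of the circuit's constraints (the constraint naming scheme is our own, not the paper's):

```python
# Sketch: index from variables to the constraints involving them, C_M(v).
from collections import defaultdict

constraints = {
    "R1:1": {"v5", "i5"},               # v5 = i5 * R1
    "SC1:1": {"i4", "i5"},              # i4 = i5
    "SC1:4": {"v4", "v5", "v6", "v7"},  # v4 = v5 + v6 + v7
}

def build_index(constraints):
    """Map each variable v to the set of constraint names involving it."""
    index = defaultdict(set)
    for name, variables in constraints.items():
        for v in variables:
            index[v].add(name)
    return index

index = build_index(constraints)
# v5 couples R1 to Series Connection1, mirroring the shared-variable
# connection between the two components described in Example 4.
print(sorted(index["v5"]))  # → ['R1:1', 'SC1:4']
```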
For shorthand, we will refer only to the modes of the components with multiple modes. So, for the circuit, we will refer only to components δ2 and δ10, and we will have four possible mode vectors, [1 1], [1 2], [2 1], and [2 2].

The switching behavior of each component can be defined using a finite state machine or a similar type of control specification. The state transitions may be attributed to controlled or autonomous events. However, for the purposes of this paper, we view the switching behavior as a black box where the mode change event is given, and refer the reader to the many approaches already proposed in the literature for modeling the switching behavior [1, 8].

2.2 Causality
Given a constraint c, which belongs to a specific mode of a specific component, the notion of a causal assignment is used to specify a possible computational direction, or causality, for the constraint c. This is done by defining which v ∈ Vc is the dependent variable in equation εc.

Definition 4 (Causal Assignment). A causal assignment αc to a constraint c = (εc, Vc) is a tuple αc = (c, vc^out), where vc^out ∈ Vc is assigned as the dependent variable in εc. We use Vc^in to denote the independent variables in the constraint, where Vc^in = Vc − {vc^out}.

In general, the set of possible causal assignments for a constraint c is as big as Vc, because each variable in Vc can act as vc^out. However, in some cases some causal assignments may not be possible, e.g., if we have noninvertible nonlinear constraints. Also, if we assume integral causality, then state variables must always be computed via integration, and so derivative causality is not allowed. Further, when placed in the context of a model, additional causalities may not be applicable, because the causal assignments of other constraints may limit the potential causal assignments. To denote this concept, we use Ac to refer to the set of permissible causal assignments of a constraint c.

For a given mode, we have the set of (specific) causal assignments over the entire model in its mode, denoted by Am. So, some α ∈ Am refers to the causal assignment of some constraint in some component of the model in its current mode. The consistency of the causal assignments Am is defined as follows.

Definition 5 (Consistent Causal Assignments). Given a mode m, we say that a set of causal assignments Am for a model M is consistent if (i) for all v ∈ UM ∪ ΘM, Am does not contain any α such that α = (c, v), i.e., input or parameter variables cannot be the dependent variables in the causal assignment; (ii) for all v ∈ YM, Am does not contain any α = (c, vc^out) where v ∈ Vc^in, i.e., an output variable can only be used as the dependent variable; and (iii) for all v ∈ VM − UM − ΘM, Am contains exactly one α = (c, v), i.e., every variable that is not an input or parameter is computed by only one (causal) constraint.

With causality information, we can efficiently derive a set of submodels for residual generation [12].

3 Hybrid Systems Diagnosis Approach
We propose a hybrid systems diagnosis approach based on structural model decomposition. In this approach, we generate submodels for the purpose of computing residuals. Residuals can then be used for diagnosis.

For hybrid systems, however, the problem is that these submodels may change as the result of a mode change. That is, we may obtain two different submodels when decomposing the model in two different modes. There are two approaches to this problem. One is to find a set of submodels that work for all modes, and can be easily reconfigured by executing only local mode changes within the submodels [10]. This approach requires the least online effort, with some offline effort in finding these submodels, which exist only in limited cases. The other approach is to generate submodels for the current mode, and, when a mode change occurs, reconfigure the submodels to be consistent with the new system mode. This is the approach we develop in this paper.

In order to execute this type of approach, however, we must be able to efficiently reconfigure submodels online. To do this, we take advantage of causality in two ways. First, we perform an offline model analysis to determine which causalities of the hybrid system model are not permissible, i.e., will never be used in any mode of the system (determine AM for a model). Second, we use an efficient causality reassignment algorithm, so that the causality of a hybrid systems model is updated incrementally when the mode changes (given A for the previous mode, compute it for the new mode). Since causal changes usually propagate only in a local area of the model, causality does not need to be reassigned at the global model level. Together, these algorithms reduce the number of potential causalities to search within the model decomposition algorithm and allow efficient submodel reconfiguration.

4 Causality Assignment
In order to compute minimal submodels for residual generation, we need a model M with a valid causal assignment Am. As described in Section 2, a causality assignment can only be defined for a given mode. However, there are some causal assignments that are independent of the system mode, i.e., they are valid for all system modes. We capture this through the notion of permissible causal assignments, introduced as AM in Section 2.

Given a model with a number of modes, some constraints will always have the same causal assignment in all modes, and we say these constraints are in fixed causality.

Definition 6 (Fixed Causality). A constraint cδ is in fixed causality if (i) component δ has only a single mode, i.e., |Cδ| = 1, and (ii) for cδ in the single C ∈ Cδ, it always has the same causal assignment in all system modes.

If a constraint is in fixed causality, then |Ac| = 1, i.e., there is only one permissible causal assignment. For example, if we make the integral causality assumption, then constraints computing state variables will always be in integral causality, and thus they are in fixed causality.

Additionally, when the constraint is viewed in the context of the model, the concept of fixed causality can be propagated from one constraint to the related constraints (those sharing a variable with the fixed causality constraint). This helps to reduce the number of permissible causal assignments. For example, if we again assume integral causality, then any constraint involving a state variable cannot be in a causal assignment where the state variable is the dependent/output variable, because the integration constraint is the one that must compute it. For such a constraint, 1 < |Ac| < |Vc|.

Given a system model and a set of outputs, Algorithm 1 searches over the model constraints to reduce the set of permissible causal assignments based on system-level information.¹ First, it determines which constraints are mode-variant, i.e., they can appear/disappear from the model depending on the mode (so they belong to components with multiple modes), and which are mode-invariant, i.e., they are present in all system modes (so they belong to components with a single mode). It is only for the mode-invariant constraints that causal assignments can be removed. We then construct a queue of variables from which to propagate. This queue contains the inputs and parameters (which must always be independent/input variables in constraints), and the outputs (which must always be dependent/output variables in constraints). We create a variable set V that refers to the variables that are resolved, i.e., either they are inputs/parameters or there is a constraint with a single causal assignment that will compute the variable. So, V is initially set to include UM and ΘM. Further, for any mode-invariant constraint that has only a single causal assignment, the output variable is added to V, and all variables of the constraint are added to the queue.

The main idea is to analyze the causality restrictions imposed by variables in the queue, which are propagated throughout the model. While the queue is nonempty, we pop a variable v off the queue. We then count the number of constraints involving v that have no set causal assignment yet, including constraints that are both mode-variant and mode-invariant. We then go through all mode-invariant constraints involving v, and remove causal assignments that will never be possible. There are three conditions in which this holds: a causal assignment is not possible in any system mode if (i) the output variable is already computed by another constraint, or is an input/parameter (i.e., in V); (ii) any of the input variables are in the model outputs (i.e., in Y); or (iii) v is not yet computed by any constraint (i.e., not in V), there is only one noncausal constraint involving v remaining, and v is not the output in this causality (in this case, v needs to be computed by some constraint and there is only one option left, so this constraint must be in the causality computing v). These causal assignments are removed. If only one is left, then we add the output for that causal assignment to V, and add the constraint's variables to the queue. The algorithm stops when causalities can no longer be removed, i.e., there are not enough restrictions imposed by the current permissible causalities to reduce AM further.

Example 6. For the circuit, we assume integral causality, so all constraints with the state variables are limited to causal assignments in which the states are computed via integration. Further, the constraint with uv is also fixed so that uv is the independent variable. For any specified outputs, AM is also

¹ For structural model decomposition, some output variables may become input variables, and so the causal assignments permitting that must be retained. Therefore, the algorithm only reduces the permissible set of causal assignments for a given set of outputs Y ⊆ YM.
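The three conditions of Definition 5 can be checked mechanically for a candidate set of causal assignments. The following is our own illustrative sketch (not the paper's code); assignments are (constraint, dependent variable) pairs and constraint_vars gives each constraint's variable set Vc:

```python
# Sketch: check Definition 5 (consistent causal assignments) on a toy model.
def is_consistent(assignments, constraint_vars, U, Theta, Y, V_M):
    deps = [v for (_, v) in assignments]
    # (i) input and parameter variables are never dependent variables
    if any(v in U | Theta for v in deps):
        return False
    # (ii) output variables may only appear as the dependent variable
    for c, v_out in assignments:
        if (constraint_vars[c] - {v_out}) & Y:
            return False
    # (iii) every non-input, non-parameter variable is computed exactly once
    return all(deps.count(v) == 1 for v in V_M - U - Theta)

# Toy chain u -> x -> y: constraint c1 relates {u, x}, c2 relates {x, y}.
cv = {"c1": {"u", "x"}, "c2": {"x", "y"}}
ok = is_consistent({("c1", "x"), ("c2", "y")}, cv,
                   U={"u"}, Theta=set(), Y={"y"}, V_M={"u", "x", "y"})
print(ok)  # → True
```

Assigning u as a dependent variable instead would violate condition (i) and make the same check return False.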
Algorithm 1 AM ← ReduceCausality(M, AM, Y)
 1: Cinvariant ← ∅
 2: Cvariant ← ∅
 3: for all δ ∈ M do
 4:   if |Cδ| = 1 then
 5:     Cinvariant ← Cinvariant ∪ Cδ^1
 6:   else
 7:     Cvariant ← Cvariant ∪ ⋃_{C ∈ Cδ} C
 8: Q ← UM ∪ ΘM ∪ Y
 9: V ← UM ∪ ΘM
10: for all c ∈ Cinvariant do
11:   if |Ac| = 1 then
12:     (c, v) ← Ac(1)
13:     Q ← Q ∪ Vc
14:     V ← V ∪ {v}
15: while |Q| > 0 do
16:   v ← pop(Q)
17:   nnoncausal ← 0
18:   for all c ∈ Cinvariant(v) do
19:     if |Ac| > 1 or (|Ac| = 1 and vAc(1) ∉ V) then
20:       nnoncausal ← nnoncausal + 1
21:   for all c ∈ Cvariant(v) do
22:     nnoncausal ← nnoncausal + 1
23:   for all c ∈ Cinvariant(v) do
24:     if |Ac| > 1 or (|Ac| = 1 and vAc(1) ∉ V) then
25:       for all (c′, v′) ∈ Ac do
26:         if v′ ∈ V then
27:           Ac ← Ac − (c′, v′)
28:         if (Vc − {v′}) ∩ Y ≠ ∅ then
29:           Ac ← Ac − (c′, v′)
30:         if nnoncausal = 1 and v′ ∉ V and v′ ≠ v then
31:           Ac ← Ac − (c′, v′)
32:       if |Ac| = 1 then
33:         (c′, v′) ← Ac(1)
34:         Q ← Q ∪ (Vc′ − V)
35:         V ← V ∪ {v′}

Algorithm 2 A ← AssignCausality(M, m, A)
 1: A ← ∅
 2: V ← UM ∪ ΘM
 3: Q ← UM ∪ ΘM ∪ YM
 4: for all c ∈ CM^m do
 5:   if |Ac| = 1 then
 6:     (c, v) ← Ac(1)
 7:     Q ← Q ∪ {v}
 8: while |Q| > 0 do
 9:   v ← pop(Q)
10:   for all c ∈ CM^m(v) do
11:     if c ∉ {c′ : (c′, v′) ∈ A} then
12:       α∗ ← ∅
13:       for all α ∈ Ac do
14:         if Vc − {vα} − V = ∅ then
15:           α∗ ← α
16:         else if vα ∈ Y then
17:           α∗ ← α
18:         else if vα = v and |CM^m(v)| − |{c′ : (c′, v′) ∈ A ∧ v ∈ Vc′}| = 1 then
19:           α∗ ← α
20:       if α∗ ≠ ∅ then
21:         A ← A ∪ {α∗}
22:         Q ← Q ∪ (Vc − V)
23:         V ← V ∪ {vα∗}

reduced so that they can appear only as dependent variables.

With AM defined, we can perform causality assignment for a given mode, m. Because AM was reduced as much as possible, causality assignment (and, later, reassignment) will be more efficient than otherwise. Algorithm 2 describes the causality assignment process for a model given a mode. Here, the model is assumed to not have an initial causal assignment. Causal assignment works by propagating causal restrictions throughout the model. The process starts at inputs, which must always be independent variables in constraints; outputs, which must always be the dependent variables in constraints; and variables involved in fixed causality constraints. From these variables, we should be able to propagate throughout the model and compute a valid causal assignment for the model in the given mode. For the purposes of this paper, we assume integral causality and that the model possesses no algebraic loops.² In this case, there is only one valid causal assignment (this is a familiar concept within bond graphs) [13].

Specifically, the algorithm works as follows. Similar to Algorithm 1, we keep a queue of variables to propagate causality restrictions, Q, and a set of variables that are computed in the current causality, V. Initially, V is set to U and Θ, because these variables are not to be computed by any constraint. Q is set to U, Θ, and Y, since the causality of constraints is restricted to U and Θ variables being independent variables and Y variables being dependent variables. We also add to Q any variables involved in constraints that have only one permissible causal assignment, because this will also restrict other causal assignments. The set of causal assignments is maintained in A.

The algorithm goes through the queue, inspecting variables. For a given variable, we obtain all constraints it is involved in, and for each one that does not yet have a causal assignment (in A), we go through all permissible causal assignments, and determine if the causality is forced into one particular causal assignment, α∗. If so, we assign that causality and propagate by adding the involved variables to the queue. A causal assignment α = (c, v) is forced in one of three cases: (i) v is in Y; (ii) all variables other than v of the constraint are already in V; and (iii) v is not yet in V, and all but one of the constraints involving v have an assigned causality, in which case no constraint is computing v and there is only one remaining constraint that must compute v.

Example 7. Consider the mode m = [1 2]. Here, A[1 2] is given in column 4 of Table 1, denoted by the vc^out in the causal assignment. In this mode, the first switch is off, so i1 and i2 act as inputs. Given the integral causality assumption, a unique causal assignment to the model exists and is specified in the column.

Example 8. Consider the mode m = [2 1]. Here, A[2 1] is given in column 8 of Table 1. In this mode, the second switch is off, so i9, i10, and i11 act as inputs. Given the integral causality assumption, a unique causal assignment to the model exists and is specified in the column. Note that some causal assignments are the same as in m = [1 2], while others are different. In changing from one mode to another, an efficient causality reassignment should be able to determine which constraints need to change causality, and do the work for only that portion of the model.³ Causal assignments that do not change from mode to mode are in fixed causality and found by Algorithm 1.

² If algebraic loops exist, the algorithm will terminate before all constraints have been assigned a causality. Extending the algorithm to handle algebraic loops is similar to that for bond graphs; a constraint without a causality assignment is assigned one arbitrarily, and then the effects of this assignment are propagated until nothing more is forced. This process repeats until all constraints have been assigned causality.
³ Note that this particular circuit was carefully chosen so that causality does propagate across much of the circuit, in order to demonstrate the causality reassignment algorithm.
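The forced-assignment logic of Algorithm 2 can be illustrated on a toy two-constraint model. This is a simplified sketch of the propagation idea, written by us rather than taken from the paper; it covers only cases (i) and (ii) of the text and omits case (iii) and conflict handling:

```python
# Simplified sketch of forced causal assignment and propagation.
from collections import deque

def assign_causality(constraint_vars, permissible, U, Theta, Y):
    A = {}                                   # constraint -> dependent variable
    V = set(U) | set(Theta)                  # resolved variables
    Q = deque(set(U) | set(Theta) | set(Y))
    while Q:
        v = Q.popleft()
        for c, vars_c in constraint_vars.items():
            if v not in vars_c or c in A:
                continue
            forced = None
            for cand in permissible[c]:      # candidate dependent variables
                if cand in Y:                # (i) outputs must be dependent
                    forced = cand
                elif not (vars_c - {cand} - V):  # (ii) all others resolved
                    forced = cand
            if forced is not None:
                A[c] = forced
                V.add(forced)
                Q.extend(vars_c - V)         # propagate to unresolved variables
    return A

# Toy chain u -> x -> y: c1 relates {u, x}, c2 relates {x, y}.
cv = {"c1": {"u", "x"}, "c2": {"x", "y"}}
result = assign_causality(cv, cv, U={"u"}, Theta=set(), Y={"y"})
print(result == {"c1": "x", "c2": "y"})  # → True
```

Here c1 is forced to compute x (all its other variables are inputs) and c2 is forced to compute the output y, mirroring cases (ii) and (i) respectively.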
5 Structural Model Decomposition
For a given causal model in a given mode, we have the equivalent of a continuous systems model for the purpose of structural model decomposition, and we can compute minimal submodels using the GenerateSubmodel algorithm described in our previous work [12]. The algorithm finds a submodel, which computes a set of local outputs given a set of local inputs, by searching over the causal model. It starts at the local inputs, and propagates backwards through the causal constraints, finding which constraints and variables must be included in the submodel. When possible, causal constraints are inverted in order to take advantage of local inputs. Additional information and the pseudocode are provided in [12].

In the context of residual generation, we set the local output set to a single measured value, and the local inputs to all other measured values and the (known) system inputs. That is, we exploit the analytical redundancy provided by the sensors in order to find minimal submodels to compute estimated values of sensor outputs. In this framework, we consider one submodel per sensor, each producing estimated values for that sensor.

Assuming that the set of sensors does not change from mode to mode, then for a hybrid system we have one submodel for each sensor.⁴ However, since the set of constraints changes from mode to mode, the result of the GenerateSubmodel algorithm will also change. When a mode changes, we first reassign causality to the model for the new mode. Then, we generate new updated submodels for that mode using GenerateSubmodel. In order to reduce the work performed by this algorithm when a mode changes, we use an efficient causality reassignment algorithm. That, coupled with the reduced set AM, significantly reduces the work of the algorithm compared to a naive approach, where the submodels are completely regenerated for a new mode.

Additionally, when the system transitions to a new mode, the causal assignments for the previous mode can be stored, so that when the system changes to a mode that has already been visited, it just takes the causal assignments that were stored previously. Similarly, submodels generated in previously visited modes can be saved and reused when the mode reappears.

Example 9. The causal assignments for the submodels in the different modes are shown in Table 1. For example, consider the submodel for i11* in m = [2 1]. Here, i11 is zero, since Sw2 is off, and therefore we have just two constraints needed to compute i11*. In mode m = [1 2], i3* can be computed using 16 constraints, where v8* is used as a local input to the submodel.

Note that a submodel for an output may have different states in two different modes (e.g., in moving from m = [2 1] to m = [1 2], the i3* submodel adds state v6). In order to continue tracking, new states must be initialized. For the purposes of this paper, we assume that in any one system mode, all states are included in at least one submodel.⁵ Therefore, a submodel that gets a state added in a new mode can initialize it using the estimated value from another submodel in the previous mode.

Algorithm 3 Am′ ← ReassignCausality(M, m′, Am, A)
 1: Am′ ← ∅
 2: for all (c, v) ∈ Am do
 3:   if c ∈ CM^m′ then
 4:     Am′ ← Am′ ∪ {(c, v)}
 5: V ← ∅
 6: Q ← ∅
 7: for all δ ∈ M where mδ ≠ m′δ do
 8:   Q ← Q ∪ Vδ
 9: while |Q| > 0 do
10:   v ← pop(Q)
11:   for all c ∈ CM^m′(v) do
12:     if c ∉ {c′ : (c′, v′) ∈ Am′} then
13:       α∗ ← ∅
14:       for all α ∈ Ac do
15:         if Vc − {vα} − V = ∅ then
16:           α∗ ← α
17:         else if vα ∈ Y then
18:           α∗ ← α
19:         else if vα = v and |CM^m′(v)| − |{c′ : (c′, v′) ∈ A ∧ v ∈ Vc′}| = 1 then
20:           α∗ ← α
21:       if α∗ ≠ ∅ then
22:         if ∃α ∈ Am′ where vα = vα∗ then
23:           Am′ ← Am′ − {α}
24:           Q ← Q ∪ (Vcα − V)
25:         Am′ ← Am′ ∪ {α∗}
26:         Q ← Q ∪ (Vc − V)
27:         V ← V ∪ {vα∗}
28:       else if Vc − V = {v} then
29:         V ← V ∪ {v}
30:         Q ← Q ∪ {v}

6 Online Causality Reassignment
As we mentioned before, from the initial mode in the system, with a valid set of causal assignments, we compute minimal submodels. However, when the system transitions to a different mode, any submodel containing constraints of a switching component will no longer be consistent, and must be recomputed. In order to do this, we need to know the causal assignments for the new mode. We can reassign causality in an efficient incremental process to avoid having to reassign causality to the whole model, as causal changes typically propagate only to a small local area in the model [11].

Algorithm 3 presents the causality reassignment procedure. The main ideas are based on the hybrid sequential causality assignment procedure (HSCAP) developed for hybrid bond graphs in [11]. In our more general modeling framework, we find that similar ideas apply. Essentially, we start with a causal model in a given mode. We then switch to a different mode, so for the switching components we have a new set of constraints in the model. We need to find causal assignments for these constraints. It is likely that some of the necessary causal assignments will conflict with causal assignments from the old mode; therefore, we have to resolve the conflict and propagate the change. The change will propagate only as far as it needs to in order to obtain a valid causal assignment for the model in the new mode. Here, propagation stops along a computational path when a new causal assignment does not conflict with a previous assignment.

⁴ By assuming that the sensor set does not change, we mean only that sensors are not added/removed to/from the physical system upon a mode change. They are still allowed to be connected/disconnected, but they still appear in the system model even when disconnected. For example, if a disconnected sensor outputs 0, then that needs to still be in the model.
⁵ If this is not the case, then a state is not observable in some mode. Estimation techniques to handle that situation are outside the scope of this paper.
The algorithm works similarly to Algorithm 2. It maintains a queue of variables to inspect and a set V of variables that are known to be computed in the causality for the new mode (so it includes only variables from new assignments or confirmed assignments made in the new mode). In this case, we initialize the queue only to variables involved in the constraints of the switching components. If no components are switching, the queue will be empty and no work will be done. The main idea is that the required causal changes from the variables placed in the queue will, on average, be limited to a very small area. The causal assignments for the new mode are initialized to those for the previous mode, for any constraints that still exist in the new mode. Some of these may conflict with the new mode and will be removed and replaced with different assignments.

As in all the other causality algorithms, we go through the queue and propagate the restrictions we find on causality. We pop a variable off the queue, and look at all involved constraints. If the constraint is not causal, then we need to assign causality. We do the same analysis as before to find whether a causality is forced, but checking only with respect to V, which includes only variables with confirmed causal assignments computing them in the new mode. If we find a constraint that is forced into a particular causal assignment for the new mode, we make the assignment. If it conflicts with one already in the set of causal assignments (copied from the old mode), then we remove the old assignment and add the new one, adding the involved variables to the queue so that changes are propagated.

7 Demonstration of Approach
For the circuit example, we consider two modes: one where Sw1 is on and Sw2 is off (i.e., m = [2 1]), and one where Sw1 is off and Sw2 is on (i.e., m = [1 2]). To demonstrate the approach, we consider a scenario in which the system starts in m = [2 1], switches to m = [1 2] at t = 10 s, and switches back to m = [2 1] at t = 20 s. Additionally, at t = 15 s, a fault is injected, specifically, an increase in R1.

Figure 2: Measured and estimated values with an increase in R1 at t = 15 s. (a) i3*. (b) v8*. (c) i11*.

Fig. 2 shows the measured and submodel-estimated values for the sensors. Up through the first mode change, the outputs are correctly tracked by the submodels. At the first mode change at 10 s, the submodels reconfigure and track correctly up to 15 s, when the fault is injected, and a discrepancy is observed in the i3* submodel. Specifically, the current increases above what is expected. The other submodels in this mode are independent of the fault, and so continue to track correctly. When the second mode change occurs, i11* can still be tracked correctly, since its estimation remains independent of the fault. However, we now see a discrepancy in v8*, as the measurement increases above what is expected. This transient occurs because we switch from a mode in which the submodel is independent of the fault to one where it is dependent on the fault. Fault isolation can be performed by using the information that in m = [1 2], an increase in R1 would produce an increase in the residual for i3*, and in m = [2 1], it would also produce an increase in the v8* residual.
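The residual test used in this demonstration can be sketched as a simple threshold check on the difference between each measured and submodel-estimated value. The values and the threshold below are illustrative, not taken from Fig. 2:

```python
# Sketch: one residual per sensor, r = measured - estimated, with a
# threshold test flagging faulty sensors (illustrative numbers).
def residuals(measured, estimated):
    return {s: measured[s] - estimated[s] for s in measured}

def faulty_sensors(measured, estimated, threshold=0.1):
    r = residuals(measured, estimated)
    return sorted(s for s, val in r.items() if abs(val) > threshold)

measured = {"i3*": 4.7, "v8*": -2.0, "i11*": 0.0}
estimated = {"i3*": 4.2, "v8*": -2.01, "i11*": 0.0}
print(faulty_sensors(measured, estimated))  # → ['i3*']
```

Only the i3* residual exceeds the threshold here, consistent with the fault signature of an R1 increase in mode m = [1 2] described above.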
8 Related Work
Modeling and diagnosis for hybrid systems have been an important focus of study for researchers from both the FDI and DX communities during the last 15 years. In the FDI community, several hybrid system diagnosis approaches have been developed. In [14], parameterized ARRs are used. However, the approach is not suitable for systems with high nonlinearities or a large set of modes. A different approach is presented in [15], but it uses purely discrete models.

In the DX community, some approaches have used different kinds of automata to model the complete set of modes and transitions between them. In those cases, the main research topic has been hybrid system state estimation, which has been done using probabilistic (e.g., some kind of filter [16] or hybrid automata [4]) or set-theoretic approaches [5].

Another solution has been to use an automaton to track the system mode, and then use a different technique to diagnose the continuous behavior (for example, using a set of ARRs for each mode [3], or parameterized ARRs for the complete set of modes [17]). Nevertheless, one of the main difficulties regarding state estimation using these techniques is the need to pre-enumerate the set of possible system-level modes and mode transitions, which is difficult for complex systems. We avoid this problem by using a compositional approach.

Regarding hybrid systems modeling, there are several proposals. For HBGs [8, 18], there are two main approaches: those that use switching elements with fixed causality [18–20], and those that use ideal switching elements that change causality [8]. The advantages of the latter are that the modeling of hybrid systems is done through a special kind of hybrid component (which avoids the mode pre-enumeration in the system), and also that changes are handled in a very efficient way [11]. Finally, in [10] HBGs are used to compute minimal submodels (Hybrid Possible Conflicts, HPCs) similar to the minimal submodels presented in this paper. HPCs can track hybrid systems behavior, efficiently changing the PC simulation model online for each mode, by using block diagrams as in [11], and performing diagnosis without pre-enumerating the set of modes in the system. However, HPCs rely on HBG modeling and do not provide a generalized framework for hybrid systems.

9 Conclusions
In this work, we have developed a compositional modeling framework for hybrid systems. Using computational causality, we developed efficient causality assignment algorithms. Given this causal information, submodels obtained through structural model decomposition can be computed and reconfigured efficiently. The approach was demonstrated with a circuit system. In future work, we will further develop the hybrid systems diagnosis approach for the single and multiple fault cases, and we will approach the diagnosis task in a distributed manner. The assumption of one submodel per sensor can also be dropped, using the extended framework developed in [21, 22].

Acknowledgments
This work has been funded by the Spanish MINECO DPI2013-45414-R grant and the NASA SMART-NAS project in the Airspace Operations and Safety Program of the Aeronautics Mission Directorate.

References
[1] T. A. Henzinger. The theory of hybrid automata. Springer, 2000.
[2] T. Rienmüller, M. Bayoudh, M.W. Hofbaur, and L. Travé-Massuyès. Hybrid estimation through synergic mode-set focusing. In 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, pages 1480–1485, Barcelona, Spain, 2009.
[3] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Coupling continuous and discrete event system techniques for hybrid system diagnosability analysis. In 18th European Conference on Artificial Intelligence, pages 219–223, 2008.
[4] M.W. Hofbaur and B.C. Williams. Hybrid estimation of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(5):2178–2191, 2004.
[5] E. Benazera and L. Travé-Massuyès. Set-theoretic estimation of hybrid system configurations. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39:1277–1291, October 2009.
[8] P.J. Mosterman and G. Biswas. A comprehensive methodology for building hybrid models of physical systems. Artificial Intelligence, 121(1-2):171–209, 2000.
[9] S. Narasimhan and G. Biswas. Model-based diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 37(3):348–361, May 2007.
[10] A. Bregon, C. Alonso, G. Biswas, B. Pulido, and N. Moya. Hybrid systems fault diagnosis with possible conflicts. In Proceedings of the 22nd International Workshop on Principles of Diagnosis, pages 195–202, Murnau, Germany, October 2011.
[11] I. Roychoudhury, M. Daigle, G. Biswas, and X. Koutsoukos. Efficient simulation of hybrid systems: A hybrid bond graph approach. SIMULATION: Transactions of the Society for Modeling and Simulation International, 87(6):467–498, June 2011.
[12] I. Roychoudhury, M. Daigle, A. Bregon, and B. Pulido. A structural model decomposition framework for systems health management. In Proceedings of the 2013 IEEE Aerospace Conference, March 2013.
[13] D. C. Karnopp, D. L. Margolis, and R. C. Rosenberg. System Dynamics: Modeling and Simulation of Mechatronic Systems. John Wiley & Sons, Inc., NY, 2000.
[14] V. Cocquempot, T. El Mezyani, and M. Staroswiecki. Fault detection and isolation for hybrid systems using structured parity residuals. In 5th Asian Control Conference, volume 2, pages 1204–1212, July 2004.
[15] J. Lunze. Diagnosis of quantised systems by means of timed discrete-event representations. In Proceedings of the Third International Workshop on Hybrid Systems: Computation and Control, HSCC '00, pages 258–271, London, UK, 2000. Springer-Verlag.
[16] X. Koutsoukos, J. Kurien, and F. Zhao. Estimation of distributed hybrid systems using particle filtering methods. In Hybrid Systems: Computation and Control (HSCC 2003), Lecture Notes in Computer Science, pages 298–313. Springer, 2003.
[17] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Diagnosis of a class of nonlinear hybrid systems by on-line instantiation of parameterized analytical redundancy relations. In 20th International Workshop on Principles of Diagnosis, pages 283–289, 2009.
[18] W. Borutzky. Representing discontinuities by means of sinks of fixed causality. In F.E. Cellier and J.J. Granda, editors, Proceedings of the International Conference on Bond Graph Modeling, pages 65–72, 1995.
[19] M. Delgado and H. Sira-Ramirez. Modeling and simulation of switch regulated dc-to-dc power converters of the boost type. In IEEE International Conference on Devices, Circuits and Systems, pages 84–88, December 1995.
[20] P.J. Gawthrop. Hybrid bond graphs using switched I and C components. CSC report 97005, Centre for Sys.
[6] S. Narasimhan and L. Brownston. HyDE: A General and Control, Faculty of Eng., Glasgow, U.K., 1997.
Framework for Stochastic and Hybrid Model-based [21] A. Bregon, M. Daigle, and I. Roychoudhury. An inte-
Diagnosis. In Proc. of the 18th Int. WS. on Principles grated framework for distributed diagnosis of process
of Diagnosis, pages 186–193, May 2007. and sensor faults. In 2015 IEEE Aerospace Conf., 2015.
[7] L. Trave-Massuyes and R. Pons. Causal ordering [22] M. Daigle, I. Roychoudhury, and A. Bregon.
for multiple mode systems. In Proceedings of the Diagnosability-based sensor placement through struc-
Eleventh International Workshop on Qualitative Rea- tural model decomposition. In Second Euro. Conf. of
soning, pages 203–214, 1997. the PHM Society 2014, pages 33–46, 2014.
Proceedings of the 26th International Workshop on Principles of Diagnosis
Device Health Estimation by Combining
Contextual Control Information with Sensor Data
Tomonori Honda and Linxia Liao and Hoda Eldardiry and Bhaskar Saha and Rui Abreu
Palo Alto Research Center, Palo Alto, California, USA
e-mail: {tomo.honda, linxia.liao, hoda.eldardiry, bhaskar.saha, rui.maranhao}@parc.com
Radu Pavel and Jonathan C. Iverson
TechSolve, Inc., Cincinnati, Ohio, USA
e-mail: {pavel, iverson}@TechSolve.org
Abstract

The goal of this work is to bridge the gap between business decision making and real-time factory data. Beyond real-time data collection, we aim to provide analysis capability to obtain insights from the data and to convert the learnings into actionable recommendations. We focus on analyzing device health conditions and propose a data fusion method that combines sensor data carrying limited diagnostic signal with the device's operating context. We propose a segmentation algorithm that provides a temporal representation of the device's operating context, which is combined with sensor data to facilitate device health estimation. Sensor data is decomposed into features by time-domain and frequency-domain analysis. Principal component analysis (PCA) is used to project the high-dimensional feature space into a low-dimensional space, followed by linear discriminant analysis (LDA) to find the optimal separation among different device health conditions. Our industrial experimental results show that, by combining device operating context with sensor data, our proposed segmentation and PCA-LDA approach can accurately identify various device imbalance conditions even from limited sensor data that could not be used to diagnose imbalance on its own.

1 Introduction

The growing Internet of Things is predicted to connect 30 billion devices by 2020 [1]. This will bring in tremendous amounts of data and drive the innovations needed to realize the vision of Industry 4.0: cyber-physical systems monitoring physical processes, and communicating and cooperating with each other and with humans in real time. One of the key challenges to be addressed is how to analyze large amounts of data to provide useful and actionable information for business intelligence and decision making, in particular to prevent unexpected downtime and its significant impact on overall equipment effectiveness (OEE) and total cost of ownership (TCO) in many industries. Continuous monitoring of equipment and early detection of incipient faults can support optimal maintenance strategies, prevent downtime, increase productivity, and reduce costs.

A significant number of anomaly detection and diagnosis methods have been proposed for machine fault detection and machine health condition estimation. Chandola et al. [2] discuss various categories of anomaly detection technologies, their assumptions, and their computational complexity. Several approaches, such as statistical methods [3], neural network methods [4] and reliability methods [5], have been applied to detect anomalies for various types of equipment. The philosophies and techniques of monitoring and predicting machine health, with the goal of improving reliability and reducing unscheduled downtime of rotary machines, are presented by Lee et al. [6].

Many of these methods focus on analyzing, combining, and modeling sensor data (e.g. vibration, current, acoustic signals) to detect machine faults. One issue that remains mostly unaddressed in these methods is that they rarely consider the varying operating context of the machine. In many cases, false alarms are generated due to a change in machine operation (e.g. rotational speed) rather than a change in machine condition. A major challenge in addressing this issue is that most machine controllers are built with proprietary communication protocols, which is a barrier to obtaining the control parameters needed to understand the context under which the machine is operating. Recently, the MTConnect open protocol [7] was developed to connect various legacy machines independently of the controller providers. MTConnect provides an unprecedented opportunity to monitor machine operating context in real time. In this paper, we leverage MTConnect to diagnose machine health condition by combining sensor data with operating context information. Additionally, we investigate whether it is possible to diagnose machine health condition using less sensor data when it is combined with context information.

Prior work [8] has demonstrated that vibration data can be used for diagnosing machine imbalance fault conditions. Our study extends that work by exploring various types of sensor and control data for diagnosing the imbalance of machine tools.

Our contribution includes the following extensions:

• Combining control and sensor signals to improve accuracy.

• Utilizing a different set of sensor data, such as temperature, power, flow, and lubricant/coolant pH.

Our hypothesis is that these advancements to prior work will aid in improving diagnosis capability as well as reducing the cost of machine diagnostics by utilizing cheaper sensors.
2 Experimental Data

The data under study has been collected from experiments utilizing a machine tool monitoring system implemented on a horizontal machining center manufactured by Milltronic with Fanuc 0i-MC control. We have two main sources of data: (i) data from additional sensors installed on the machine, and (ii) data from the machine tool controller. The data has been collected using National Instruments equipment and software (LabVIEW).

The external sensors used for data collection include:

• a power sensor that measures power using the Hall effect,

• accelerometers that capture machine tool motion in 6 degrees of freedom,

• thermocouples that measure temperatures at 10 locations on the machine tool,

• a pH sensor for detecting the pH level of the metalworking fluid, and

• a flow rate sensor to measure metalworking fluid pump flow.

The second category consists of data collected from the controller. This data includes drive loads, absolute and relative positions, servo delays, and feed rate. The complete list of the components of the control data is given in Pavel et al. [8].

Data has been collected in two sessions, one in 2009 and the other in 2010. Although the basic control signals are similar, they are offset by constant values (see Figure 1). Since the positional offset could cause a difference in the motion dynamics, we have treated them as separate data sets for this study.

3 Technical Approach

For each extension to prior work listed in Section 1, we have performed two main steps for creating appropriate diagnostics:

• Feature Extraction & Synthesis

• Model Selection

3.1 Feature Extraction & Synthesis

There are various approaches for condensing time series information into data mining features. Prior work has utilized transfer functions to map control signals to vibrational sensor data [8]. The diagnosis step is then reduced to comparing the features of the transfer-function-predicted vibration data and the sensor-derived vibration data. This approach makes sense when the control signal directly impacts the output variables of the machine. For motion control of machine tools, the estimated transfer function should be similar to the transfer function of the implemented control (like PI or PID). Typical vibration data features would include average, standard deviation, and maximum FFT values [9].

However, we would like to diagnose the state of the machine using not only accelerometers, but also other sensors, such as temperature sensors. Since temperatures at various locations are not part of active control loops, there may not exist well-defined transfer functions that can map control signals to temperature sensor data very accurately. In such cases, where conventional features extracted from temperature signals are not correlated with the fault (imbalance) to a sufficient degree, or where the associated sensors are too expensive to install, data fusion may be applied.

There are three data fusion approaches typically used in machinery diagnostics [10; 11]: data-level fusion, feature-level fusion, and decision-level fusion. Data-level fusion involves combining sensor data before feature extraction, such that features contain information gathered from multiple sensors. Feature-level fusion involves generating features from each sensor separately, then fusing the set of features generated from all of the sensors coherently for diagnostics. Finally, decision-level fusion creates diagnostics from each sensor separately, then aggregates these diagnostics into a single diagnostic output.

The choice among the three types of data fusion methods is often application specific. In our application, we found that temperature sensor data cannot resolve imbalance conditions by itself and control signal data is too coarse-grained to aid in classifying imbalance conditions using the standard data-fusion techniques. Note that we did not focus on spindle acceleration data, which could diagnose imbalance on its own (see Subsection 4.1), since that would require retrofitting existing machine tools with new expensive sensors and data acquisition hardware. Ideally we would like to use the readily accessible control signals and data from inexpensive temperature sensors to diagnose imbalance. To achieve this goal, we proposed a different type of data fusion approach: we used the control signal to provide the contextual information for the temperature sensor data. The control signal is used for the segmentation of sensor data, but does not directly map into feature vectors (see Subsection 4.2).

3.2 Model Selection

Since the data sets are statistically small and the dimensionality of the data is increased by feature synthesis, the models to be used for imbalance classification need to be carefully chosen to avoid over-fitting. The high-dimensional data needs to be projected to a much smaller sub-space to prevent over-fitting.¹ To accomplish this, the main techniques used in this study are Principal Component Analysis (PCA) [12] and Linear Discriminant Analysis (LDA) [13]. These techniques are based on linear coordinate transformation, which makes them more likely to under-fit and less likely to over-fit [14].

4 Results

We have explored two types of imbalance diagnostics to investigate the hypothesis posed in Section 1:

• Sensor-based Diagnostics

• Control-based Temporal Segmentation followed by Sensor-based Diagnostics

4.1 Sensor-based Diagnostics

In this case, each sensor signal was analyzed separately to determine if any of the sensor signals contains enough diagnostic information to detect imbalance on its own. By plotting the time series data we find that spindle acceleration sensors (which capture vibration) show higher oscillation

¹ Note that the complexity of a model is positively correlated with the likelihood of over-fitting. Thus, a classifier that takes high-dimensional input has a higher degree of freedom (i.e. higher complexity) compared to one with low-dimensional input, which results in a higher likelihood of over-fitting.
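The four per-signal features named in Section 3.1 (average, standard deviation, maximum FFT amplitude, and the frequency at which that maximum occurs) can be sketched as follows; the sampling rate and the test signal here are illustrative assumptions, not the paper's data.

```python
import numpy as np

def extract_features(signal, fs):
    """Return the four features used in this kind of analysis:
    (i) mean, (ii) standard deviation, (iii) maximum FFT
    amplitude, (iv) frequency at that maximum amplitude."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    # Skip the DC bin so the mean level does not dominate the peak.
    peak = 1 + np.argmax(spectrum[1:])
    return np.array([signal.mean(), signal.std(),
                     spectrum[peak], freqs[peak]])

# Hypothetical example: a 25 Hz tone sampled at 1 kHz for 1 s.
fs = 1000.0
t = np.arange(0, 1.0, 1.0 / fs)
x = np.sin(2 * np.pi * 25 * t)
features = extract_features(x, fs)
```

The same routine applies unchanged to any of the sensor channels; only the sampling rate differs per sensor.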
Figure 1: Primary Control Signals. (a) Absolute X position; (b) Absolute Y position; (c) Absolute Z position; (d) Spindle Motor Speed.
amplitudes (see Figure 2) with increasing imbalance. Since imbalance actually impacts the moment of inertia of the spindle, this change in acceleration is expected.

We also considered measuring imbalance through temperature. From the energy flow perspective, the additional acceleration caused by imbalance should result in higher energy consumption from the power source and higher energy dissipation to thermal inertias due to friction, which should result in a temperature increase in parts of the machine tool. However, the time series data from each of the temperature sensors did not show distinguishing features similar to the acceleration sensors. An example of temperature sensor time series data is shown in Figure 3.

Figure 3: Sample Temperature Sensor Data (Fluid Temperature): blue and red traces indicate nominal and faulty conditions, respectively

For this sensor data analysis, the features extracted are (i) average, (ii) standard deviation, (iii) maximum amplitude of FFT, and (iv) frequency at maximum amplitude of FFT. These four features are inspected visually to determine whether imbalance can be classified by a simple linear classifier. The spindle acceleration (X, Y, and Z) feature (maximum amplitude of FFT) showed easily visible characteristics that can distinguish between degrees of imbalance. See Figure 4 for an example of visual classification based on X-axis acceleration data. Other sensor signals like power, pH, flow, and temperature did not exhibit such classification capability.

4.2 Control-based Segmentation followed by Sensor-based Diagnostics

The second diagnostic approach that we explored combines both sensor and control data in a coherent manner. The first step in this approach is to utilize the control signal to provide temporal segmentation; i.e., assuming quasi-steady state, the goal is to find the time intervals in which the following conditions are satisfied: (i) all experiments display the same values for the primary control signal (actual spindle speed), and (ii) all the control signals are constant over the same period. Note that, to investigate the dynamic response rather than the quasi-steady-state response, the control signals should be consistent across the experiments so that responses are compared under the same set of control inputs. Figure 5 (a) shows the result of this temporal segmentation scheme. For each of the control signals, we have computed the standard deviation at each time step and identified the periods with standard deviation below a set threshold to find the consistent time intervals (shown as colored segments along the time axis in Figure 5 (b)). Then we find the intersection of the sets of consistent time intervals over all the control signals to determine the aggregate time intervals over which the control signals are statistically consistent (shown as black segments along the time axis in Figure 5 (c)).

These temporal segments are then mapped to sensor data to facilitate diagnostics. For each of the 16 temporal segments, we computed features including (i) average, (ii) standard deviation, (iii) maximum FFT value, and (iv) FFT frequency at maximum amplitude. This step produces a 64-dimensional feature space to diagnose machine imbalance. As mentioned before, to avoid overfitting we focus on linear-transformation-based approaches. We implemented Principal Component Analysis (PCA) to reduce the dimensionality from 64 to 4 (postulating that there should be 4 unique dimensions given the 4 uncorrelated features that we have selected). The PCA step is followed by Linear Discriminant Analysis to find the optimal coordinate transformation that provides maximum separation between classes. The result of this PCA-LDA analysis is shown in Figure 6 for the fluid temperature sensor data. Another temperature sensor, located at the spindle motor, also exhibits similar diagnostic capability after application of control-based temporal segmentation. This demonstrates that control data can be used to provide context to sensor data in a way that helps diagnose machine imbalance. Thus, the temperature sensor, which had inferior diagnostic performance without context data, could classify imbalance perfectly when combined with additional context from the control signal.

5 Conclusion and Discussion

This work explores various types of sensor and control data for diagnosing the imbalance of machine tools. Our proposed approaches utilize sensor data that has not been used before for this purpose, including temperature, power, flow, and lubricant/coolant pH. In addition, our proposed techniques combine control and sensor signals to improve accuracy. Namely, by combining context information gained from the control signal, the temperature sensor was able to classify machine imbalance conditions with much higher accuracy than when used on its own.

For future work, we will explore diagnostics based on the control signal alone. Given that relying on sensor data typically requires adding sensors to existing machine tools, it would be ideal if we could diagnose imbalance of the machine from control signals that are usually recorded (i.e. no additional hardware required). The expectation is that if a machine tool uses feedback controls, then the control signal should be impacted by any change in the operational characteristics (in this case the imbalance of the machine tools).

References

[1] Carrie MacGillivray, Vernon Turner, and Denise Lund. Worldwide internet of things (IoT) 2013–2020 forecast: Billions of things, trillions of dollars. IDC Doc, 243661(3), 2013.

[2] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):15, 2009.
Figure 2: Spindle Acceleration Data for different imbalance levels. (a) Spindle X Acceleration: 2009 Data; (b) Spindle X Acceleration: 2010 Data; (c) Spindle Z Acceleration: 2009 Data; (d) Spindle Z Acceleration: 2010 Data.
Figure 4: Visual Classification using Spindle X Acceleration Sensor. (a) 2009 Data; (b) 2010 Data.
[3] Markos Markou and Sameer Singh. Novelty detection: a review, part 1: statistical approaches. Signal Processing, 83(12):2481–2497, 2003.

[4] Markos Markou and Sameer Singh. Novelty detection: a review, part 2: neural network based approaches. Signal Processing, 83(12):2499–2521, 2003.

[5] Haitao Guo, Simon Watson, Peter Tavner, and Jiangping Xiang. Reliability analysis for wind turbines with incomplete failure data collected from after the date of initial installation. Reliability Engineering & System Safety, 94(6):1057–1063, 2009.

[6] Jay Lee, Fangji Wu, Wenyu Zhao, Masoud Ghaffari, Linxia Liao, and David Siegel. Prognostics and health management design for rotary machinery systems: reviews, methodology and applications. Mechanical Systems and Signal Processing, 42(1):314–334, 2014.
Figure 5: Time Series Segmentation. (a) Raw Spindle Speed Control; (b) Spindle Speed Control with Consistent Time Segment; (c) Aggregating Control Signals.
Figure 6: PCA-LDA Result using Fluid Temperature. (a) Group 1; (b) Group 2.
[7] MTConnect Standard. Part 1: Overview and protocol, version 1.01. MTConnect Institute, 2009.

[8] Radu Pavel, John Snyder, Nick Frankle, Gary Key, and Loran Miller. Machine tool health monitoring using prognostic health monitoring software. In MFPT 2010 Conference, Huntsville, AL, April 2010.
[9] Houtao Deng, George Runger, Eugene Tuv, and Vladimir Martyanov. A time series forest for classification and feature extraction. Information Sciences, 239:142–153, 2013.

[10] Qing Charlie Liu and Hsu-Pin Ben Wang. A case study on multisensor data fusion for imbalance diagnosis of rotating machinery. AI EDAM, 15(03):203–210, 2001.

[11] Andrew K.S. Jardine, Daming Lin, and Dragan Banjevic. A review on machinery diagnostics and prognostics implementing condition-based maintenance. Mechanical Systems and Signal Processing, 20(7):1483–1510, 2006.

[12] Svante Wold, Kim Esbensen, and Paul Geladi. Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1):37–52, 1987.

[13] Gary J. Koehler and S. Selcuk Erenguc. Minimizing misclassifications in linear discriminant analysis. Decision Sciences, 21(1):63–85, 1990.

[14] Bo Yang, Songcan Chen, and Xindong Wu. A structurally motivated framework for discriminant analysis. Pattern Analysis and Applications, 14(4):349–367, 2011.
On the Learning of Timing Behavior for Anomaly Detection in Cyber-Physical
Production Systems
Alexander Maier¹, Oliver Niggemann¹,² and Jens Eickmeyer¹
¹ Fraunhofer Application Center Industrial Automation IOSB-INA
e-mail: {alexander.maier, jens.eickmeyer}@iosb-ina.fraunhofer.de
² inIT - Institute Industrial IT
e-mail: oliver.niggemann@hs-owl.de
Abstract

Model-based anomaly detection approaches have by now established themselves in the field of engineering sciences. Algorithms from the fields of artificial intelligence and machine learning are used to identify a model automatically based on observations. Many algorithms have been developed to manage different tasks such as monitoring and diagnosis. However, the use of the factor of time in modeling formalisms has not yet been duly investigated, though many systems depend on time.

In this paper, we evaluate the requirements that the factor of time places on modeling formalisms and their suitability for automatic identification. Based on these features, which characterize the timing modeling formalisms, we classify the formalisms concerning their suitability for automatic identification and the use of the identified models for diagnosis in Cyber-Physical Production Systems (CPPS). We argue the reasons for choosing timed automata for this task, propose a new timing learning method which differs from existing approaches, and prove the improved calculation runtime. The presentation of a use case in a real plant setup completes this paper.

1 Introduction

Many learning algorithms have been developed for the identification of behavior models of CPPS, e.g. [1], [2], [3]. However, most of the learning algorithms do not include timing information, not least because the modeling formalisms do not consider timing information.

Indeed, technical systems mostly depend on time, e.g. the filling of a bottle or the moving of a part on a conveyor belt. Therefore, many applications (such as anomaly detection) require a model with timing information. Some faults can only be detected using timing information (especially degradation faults, e.g. a worn conveyor belt runs slower).

In this paper, we use the term "Cyber-Physical Systems (CPS)" for "systems that associate (real) objects and processes with information processing (virtual) objects and processes through open, partly global, anytime interconnected information networks". Further, a CPPS is a CPS in the context of an industrial production environment.

In this paper, we give a taxonomy of modeling formalisms. These formalisms are evaluated according to specific features. The taxonomy is then used to evaluate whether the models can be identified automatically and used for anomaly detection.

Based on this evaluation, we present a timing learning method, which is used to learn the timing behavior as a timed automaton. In contrast to other approaches, we use the underlying timing distribution function to differentiate between transitions with equal events that belong to different processes.

By calculating the computation runtime, we prove that our approach runs faster than other existing methods for timed automaton learning.

The presented learning method is used in an exemplary plant setup to demonstrate its suitability for anomaly detection in CPPS.

The paper is organized as follows: In Section 2 we evaluate some timing learning features and give a taxonomy of how these features are met by three categories of timing modeling formalisms, namely (i) dynamic system models, (ii) operational formalisms and (iii) descriptive formalisms. In Section 3, we argue why we use timed automata as the formalism, point out some challenges in timed automaton learning, and present our timing learning approach. Further, we formally prove the improved calculation runtime of our approach. Section 4 completes the contribution with the presentation of a use case in a real plant. Finally, in Section 5, we conclude this paper with a short discussion and give an outlook on future work.
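As a generic illustration of why timing matters for anomaly detection (e.g. the worn conveyor belt mentioned above), per-transition timing ranges can be learned from observations and checked at runtime. This is a schematic sketch under assumed names and values, not the learning method the paper develops in Section 3.

```python
from collections import defaultdict

class TimingModel:
    """Learn, per (state, event) transition, the range of observed
    delays; report an anomaly when a new delay falls outside the
    learned range (plus a tolerance). A generic sketch of
    timing-based anomaly detection."""

    def __init__(self):
        self.ranges = defaultdict(lambda: [float("inf"), float("-inf")])

    def learn(self, state, event, delay):
        lo, hi = self.ranges[(state, event)]
        self.ranges[(state, event)] = [min(lo, delay), max(hi, delay)]

    def is_anomalous(self, state, event, delay, tolerance=0.1):
        if (state, event) not in self.ranges:
            return True  # never-observed transition
        lo, hi = self.ranges[(state, event)]
        return delay < lo * (1 - tolerance) or delay > hi * (1 + tolerance)

# Training: transport between two sensors takes 2.0 to 2.2 seconds.
model = TimingModel()
for d in (2.0, 2.1, 2.2):
    model.learn("s0", "part_detected", d)

normal = model.is_anomalous("s0", "part_detected", 2.1)  # within range
worn = model.is_anomalous("s0", "part_detected", 3.0)    # degradation fault
```

The method of the paper goes further, using the timing distribution function (rather than a plain min-max range) to separate transitions with equal events belonging to different processes.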
2.1 Evaluation of Timing Modeling Features

Before choosing an appropriate timing modeling formalism, some key issues have to be considered, which are listed below. These and additional features are given in [5], where the authors provide a comprehensive analysis of timing modeling features and corresponding modeling formalisms.

Discrete or dense time domain
The separation of formalisms concerning the usage of discrete and dense time domains is a first natural categorization. Discrete time models comprise a set of isolated points, whereas dense time means that in a dense set, ordered by "<", for every two points t1 and t2 with t1 < t2 there is always a third point t3 in between, such that t1 < t3 < t2.

Explicit or implicit modeling of time
Another major distinctive feature is the possibility of implicit and explicit modeling of time. Modeling formalisms with explicit time allow the modeling of concrete time values for some specific event, e.g. "if the sensor is activated, start the conveyor belt within two seconds". Implicit modeling of time only gives information about the time duration as a whole.

One clock or many clocks
Furthermore, time modeling formalisms can be differentiated according to their number of clocks. When dealing with independent modules within a system, the question arises whether to use one or many clocks. The usage of many clocks leads to the need for clock synchronization in the simulation step, whereas the usage of one clock only requires a transformation from an n-clock model to a 1-clock model.

Concurrency and composition
Most real systems are too complex to model in one overall model. The behavior has to be divided into several subsystems, so that the overall model is a composition of its sub-models. For finite state machines, the number of states is reduced enormously if the system is decomposed into subsystems. This is also referred to as modularization.

The decomposition is a less mature process. Difficulties can arise in the synchronization step. Mostly, the separated models of subsystems have equal or identical properties. Furthermore, the time bases can differ between the modules, discrete or continuous, or the time base is implicit for one module and explicit for another.

Single mode and multiple modes
The distinction between models that can only cope with a single mode and models that can additionally deal with multiple modes goes a step deeper than concurrency and decomposition. A system may, at some point in time, abruptly change its behavior. In technical systems, this happens for reasons such as shifting a gear or stopping a conveyor belt. All state-based models (e.g. statecharts, Petri nets or finite state machines) are able to describe multiple-

(a set of) linear behaviors, where the future behavior from a given state at a given time is always identical. Branching-time formalisms are interpreted over trees of states. That means, in contrast to linear-time models, the future behavior of a given state at a given time can follow different behaviors according to the tree.

A linear behavior can be regarded as a special case of a tree. Conversely, a tree can be treated as a set of linear behaviors that share common prefixes (i.e., that are prefix-closed); this notion is captured formally by the notion of fusion closure [8]. Thus, linear and branching models can be put on a common ground and compared.

2.2 Taxonomy of Timing Modeling Formalisms

Mainly, the timing modeling formalisms can be subdivided into three categories: (i) dynamic system models, (ii) operational formalisms and (iii) descriptive formalisms:

Dynamic system models
In various engineering disciplines (like mechanical or electrical engineering) and especially in control engineering, the so-called state-space representation is a common way to model the timing behavior of technical systems [9].

Three key elements are essential for the state-based representation: the vector x with the state variables, the vector u with the input variables, and the vector y with the output variables. All these values explicitly depend on the time at which they are evaluated (usually represented as x(t), u(t), and y(t)); however, the timing information is not explicitly described in a form such as "the filling of the bottle takes five seconds", i.e. it uses implicit timing.

The main advantage of dynamical system models is that very detailed physical models can be created using established mathematical methods. But this can also turn into a disadvantage. For many purposes, the models are too detailed, i.e. they are unsuitable for high-level description, since some expert knowledge is required to read and understand the models. As proposed in [10], dynamical systems can be used for the diagnosis of distributed systems.

Various methods exist to identify dynamic system models. These methods are grouped under the term model identification (sometimes the term "system identification" is also used), although the model is not identified completely: a model structure is presumed and the identification methods only determine the parameters. So, still some expert knowledge is necessary and manual work has to be done. In [6], Isermann describes some methods, e.g. by means of parameter estimation. The states themselves are not identified.

Dynamic system models can also be used for fault detection (e.g. [11]). Model-based fault detection uses the inputs u and the outputs y to generate residuals r, the parameter estimates Φ or state estimates x, which are called features. A comparison of these features with the nominal values (normal behavior) detects changes of features, which lead to analytical symptoms s. The symptoms are then used
mode systems, where equation based formalisms (e.g. ordi- to determine the faults.
nary differential equation) can only describe the behavior of Despite their suitability for the modeling of timing be-
single-mode systems. havior, dynamic system models can hardly be learned auto-
matically based on observations only, since the structure of
Linear- and branching-time models the model has be given and mostly only the parameters are
A difference can also be made between linear and branch- identified.
ing time models [7]. Linear-time formalisms are interpreted
over linear sequences of states. Each description refers to Operational Formalisms
218
Proceedings of the 26th International Workshop on Principles of Diagnosis
Operational formalisms further can be subdivided into (i) Petri nets are named according to Carl Adam Petri, who
synchronous state machines and (ii) asynchronous abstract initially developed this modeling formalism [20]. A vari-
machines: ety of Petri nets exists [21]. The most common type is
place/transition-nets. It basically consists of states and tran-
Synchronous state machines: sitions. Places store tokens and hand them over to the tran-
A large variety of synchronous state machines exists: fi- sitions. If all incoming places hold at least one token, a
nite state machine, statecharts, timed automaton, hybrid au- transition is enabled. An enabled transition will be fired.
tomaton, Büchi automaton, Muller automaton, and others After firing the transition, tokens from incoming transitions
(see [12]). Here, we confine our self to finite state machines are moved to outgoing transitions.
and timed automata, the timing extension of finite state ma- Petri nets also have been extended to handle timing infor-
chines. mation. Merlin and Farber proposed the first Timed Petri net
The main strength and the reason for the wide usage of in [22]. Each transition is extended with the minimum and
finite state machines is their accessibility for humans and maximum firing time, where the minimum firing time can
their simplicity. Often, processes or timing behavior are be 0 and the maximum can be ∞. A comprehensive sur-
described by a sequence of events. In fact, technical sys- vey on several timed extensions to Petri nets can be found
tems are often programmed in state machines, e.g. using in [23] and [24].
the standardized programming language from IEC 61131. Furthermore, several approaches exist to identify Petri
Therefore, modeling the timing behavior of such technical nets from sampled data. However, some requirements are
systems, in the sense of finite state machines or timed au- put on the language to be identified or some assumptions
tomata, is consequential. are made, e.g. in [25], Petri nets are identified from knowl-
Some algorithms already exist to identify timed automata edge of their language, where it is assumed that the set of
from observations (e.g. in [13] , [14], [15], [16], [1]). Most transitions and the number of places is known. Only the net
automata identification algorithms are based on the state structure and the initial marking are identified.
merging method. The basic procedure is illustrated in Fig- Petri nets in general are suited for fault detection
ure 1. It works as follows: (e.g. in [26] or [2]). The different types of Petri nets
(mainly condition/event-systems, place/transition-nets and
Data high-level Petri nets) have different time and space com-
Data
Acquisition
Measure- plexity.
ments
1
Descriptive Formalisms
2
Prefix As the name suggests, descriptive formalisms describe
Detection
the model using a natural language, mostly based on mathe-
State matical logic [27]. Such formalisms are especially suited if
Merging some conditions have to be described.
3 Example 1. If it is raining or if it was raining in the last
two hours, then the street is wet.
Finite Prefix Tree
Automaton Acceptor Similar rules can also be created for the prediction of
output signals (actuators) based on the inputs (sensors) in
a CPPS.
Figure 1: The principle of offline automaton learning algo-
As already shown in Example 1, the conditions can also
rithms using the state merging approach.
contain time information.
There exist different types of descriptive formalisms, e.g.
First, in step (1), the data is acquired from the system first order logics, temporal logics, explicit-time logics or al-
and stored into a database. In step (2), the observations are gebraic formalisms. Further details can be found in the lit-
used to create a prefix tree acceptor (PTA) in a dense form, erature, e.g. [27].
whereas equal prefixes are stored only once. Then, in step Some algorithms exist to identify descriptive models. For
(3), in an iterative manner all pairs of states are checked for the prediction of the behavior of CPPS, a timed decision
compatibility. If a compatible pair of states is found, the tree can be learned for instance. Examples for such learning
states are merged. In [13], additionally a transition split- algorithms are ID3 [28], the C4.5 algorithm as extension of
ting operation is introduced, which is executed when the the ID3 algorithm [29] or a generic algorithm for building a
resulting subtrees are different enough. The result is a fi- decision tree by Console [3].
nite automaton the generalizes the observed behavior in an Note that the rule can not always be interpreted back-
appropriate way. wards. Using Example 1, a reason for the wet street could
Finite state machines can also be used for fault detec- be that somebody has washed his car on the street. There-
tion and diagnosis (e.g. in [17], [18], [19]). Depending on fore, descriptive formalisms have a limited suitability for
the used formalism, different errors can be detected: wrong anomaly detection. The usage of descriptive formalisms for
event sequence, improper event, timing deviation and error anomaly detection puts additional requirements on the rules,
in continuous signals. they have to be more concrete. Using the given example, it
can be modified as follows:
Asynchronous abstract machines:
Beside the finite state machines, which work syn- Example 2. The street is wet if and only if it is raining or it
chronously, there exist formalisms that work asyn- was raining in the last two hours.
chronously, called the asynchronous abstract machines. The This rule allows a backward interpretation, if the reason
most popular formalism in this group is Petri nets. for the wet street is unknown. However, the meaning of the
rule has now changed. Additionally, this kind of rule is hardly identifiable from observations only.

Comparison of Modeling Formalisms
Table 1 shows how the mentioned timing modeling features are met by the corresponding modeling formalisms. It can be seen that operational and descriptive formalisms allow a similar level of timing modeling, while dynamic system models differ in nearly all features. In contrast to the other formalisms, dynamic system models use a dense time domain, only allow implicit modeling of time, use one clock only and can only model linear-time behavior.
Please note the different possibilities to handle concurrent behavior. Petri nets are the first choice for this task. Using tokens, concurrent behavior can be modeled in one model. Timed automata and hidden Markov models (HMM) are able to decompose the behavior into several subsystems.

3 Automaton Learning
The decision of which formalism to use is based on several factors. These can differ based on the individual use case. Here, we consider the models to be used for learning and diagnosis of CPPS.
Although several algorithms exist for the identification of timed behavior, it can be seen in Table 2 that the usage of timed automata is a good choice:
• Understandability: In contrast to many other automatically identified models, identified finite state machines can be better understood by third persons. They can be verified by experts.
• Wide usage: Finite state machines are widely used, e.g. for modeling or programming.
• Learnability: Finite state machines are suitable for automatic learning. The goal is to use as little expert knowledge as possible.
• Diagnosability: Finite state machines are suitable for fault detection. This applies to both manually created and automatically identified finite state machines.
• Suitability for verification: The identified finite state machines can be used for automatic verification.
• Modification: The identified finite state machine can be manually modified and adapted after learning. This can also be done automatically.

3.1 Challenges in Automaton Learning
Some algorithms have already been introduced for the identification of timed automata, see Section 2.2. However, there are still some challenges in learning timed automata. This applies in particular to the time factor.
• Identification of states and events: The timing behavior includes not only the time stamps for some observations, but also some states and transitions with timed events in between. Many learning algorithms (especially for learning of Markov chains) assume the states and transitions as given and only learn the transition probabilities. Here, the structure (states and events) is not given but has to be identified from observations.
• Timing representation method: Additionally, an appropriate timing representation method has to be chosen, which is able to correctly describe the technical processes. At the beginning of Section 3.2 we review some state-of-the-art timing representation methods and propose our solution.
• Relative or absolute time base: The time base is also a very important issue. The base can be either absolute, e.g. referring to the beginning of a production cycle, or relative to the last event.
• Number of clocks: Technical systems may be programmed using a certain number of clocks. These have to be identified, or the behavior has to be expressed using only one clock. Timed automata allow both one and many clocks. However, in [13] Verwer showed that 1-clock timed automata and n-clock timed automata are language-equivalent, but in contrast to n-clock timed automata, 1-clock timed automata can be identified efficiently.
• Event splitting: When do events with different timing belong to the same event, and when do they describe different events? As can be seen in Figure 2, events can be split based on their timing, which here depends on the container size: the robot needs more time to move the big container compared to the small one; this is captured in the given probability distribution function over time. More formally: the event's timing distribution function can comprise several modes that have to be identified.

Figure 2: The timing behavior changes based on the container size.

• Event splitting or timing preprocessing: Continuing from the previous point, the question additionally arises whether the modes are identified during the learning process itself, or whether a preprocessing step can be used to identify multiple modes and use this information in the learning process, avoiding the additional splitting operation.

3.2 Timed Automaton Learning Algorithm
Several algorithms have been introduced to learn an automaton based on observations of the normal behavior only. While most automaton identification algorithms do not consider time (e.g. MDI [30] and Alergia [31]), recently a few algorithms have been introduced that identify a timed automaton. RTI+ [13] and BUTLA/HyBUTLA [16] learn in an offline manner, i.e. first the data is acquired and stored, and then the automaton is learned. However, for the case that observations cannot be stored, an online learning algorithm is desirable, which incorporates each observed event online, without preprocessing. OTALA [1] is an extension of BUTLA and learns a timed automaton in an online manner.
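The offline procedure these algorithms share (steps (1) to (3) of Figure 1) can be sketched as follows. This is a minimal illustration, not the actual RTI+ or BUTLA implementation: the compatibility test below is a deliberately naive stand-in that merges states with identical outgoing event sets, whereas the cited algorithms decide merges with statistical tests, and each transition's timing is summarized only as the observed [min, max] interval:

```python
from collections import defaultdict

def build_pta(observations):
    """Step (2): build a prefix tree acceptor (PTA) from timed words,
    where a timed word is a sequence of (event, delay) pairs.  Equal
    prefixes share states; every transition records its observed delays."""
    trans = {}                  # (state, event) -> successor state
    delays = defaultdict(list)  # (state, event) -> observed delay values
    counter = [0]               # state 0 is the root

    def fresh():
        counter[0] += 1
        return counter[0]

    for word in observations:
        s = 0
        for event, delay in word:
            delays[(s, event)].append(delay)
            if (s, event) not in trans:
                trans[(s, event)] = fresh()
            s = trans[(s, event)]
    return trans, delays

def merge_compatible(trans, delays):
    """Step (3): iteratively merge compatible state pairs.  The test used
    here (identical outgoing event sets) is a naive placeholder for the
    statistical compatibility tests of RTI+/BUTLA."""
    def outgoing(s):
        return frozenset(e for (q, e) in trans if q == s)

    merged = True
    while merged:
        merged = False
        states = sorted({0} | set(trans.values()) | {q for (q, _) in trans})
        for i, a in enumerate(states):
            for b in states[i + 1:]:
                if outgoing(a) != outgoing(b):
                    continue
                # redirect every edge pointing to b onto a ...
                for key in list(trans):
                    if trans[key] == b:
                        trans[key] = a
                # ... and move b's outgoing edges to a; if a already has an
                # edge for the same event, keep a's target (a real learner
                # would recursively merge the two targets instead)
                for (q, e) in list(trans):
                    if q == b:
                        tgt = trans.pop((q, e))
                        ds = delays.pop((q, e))
                        trans.setdefault((a, e), tgt)
                        delays[(a, e)].extend(ds)
                merged = True
                break
            if merged:
                break
    # annotate each remaining transition with its [min, max] timing interval
    return {k: (trans[k], min(d), max(d)) for k, d in delays.items()}
```

For example, the two timed words ((go, 2), (stop, 5)) and ((go, 3), (stop, 6)) yield a two-transition automaton whose go-transition carries the interval [2, 3] and whose stop-transition carries [5, 6].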
Table 1: Taxonomy of the timing modeling features and how they are satisfied by the corresponding modeling formalisms.

| Feature | Timed automata (operational) | HMM (operational) | Petri nets (operational) | e.g. rule-based system (descriptive) | e.g. state-space representation (dynamic) |
| Discrete or dense time domain | discrete | discrete | discrete | discrete | dense |
| Explicit or implicit modeling of time | explicit | explicit | explicit | explicit | implicit |
| One clock or many clocks | one/many | one/many | one/many | one/many | one |
| Concurrency and composition | ++ | ++ | +++ | + | + |
| Single-mode and multiple-modes | single/multiple | single/multiple | single/multiple | single | single |
| Linear- and branching-time models | linear/branching | linear/branching | linear/branching | linear/branching | linear |
Table 2: Satisfiability of the mentioned properties by different timing modeling formalisms.

| Property | Timed automata | HMM | Petri nets | Rule-based system | State-space representation |
| Understandability | +++ | ++ | ++ | +++ | + |
| Wide usage | +++ | +++ | ++ | ++ | ++ |
| Learnability | +++ | ++ | + | +++ | + |
| Diagnosability | +++ | ++ | ++ | ++ | ++ |
| Suitability for verification | +++ | +++ | ++ | ++ | + |
| Modification | +++ | ++ | ++ | +++ | + |
A crucial issue for the modeling formalism of timed systems is the representation of the timing information. Usually, timed automata use a single clock only and therefore a relative time base is required, where a relative time stamp represents the elapsed time from entering until leaving a state. The timing information is annotated at the transition next to an event. The usual way is to use intervals recording the minimum and maximum observed time values for a specific event [13], [14], [15], [1].
RTI+, the first algorithm for the identification of timed automata [13], included a transition splitting operation in addition to the merging operation. The timing in the transitions is represented with histograms using bins and a uniform distribution [13]. During the state merging procedure, it is also checked whether a transition can be split. A transition is split when the resulting subtrees are different enough. However, the splitting operation is associated with a high calculation time, since, depending on the bin size, all possible splits have to be calculated. The disadvantage of this approach is that the bin size has to be set manually by experts. Further, it does not take the underlying distribution into account.
In contrast to other existing algorithms for the identification of timed automata, our proposed identification algorithm BUTLA [16] uses probability density functions over time (PDFs) to express the timing behavior. Unlike other approaches, we base our decision on the timing information itself, not on the subtree resemblance.
The identification algorithm BUTLA follows the methodology from Figure 1. Additionally, instead of the splitting operation, a preprocessing step is introduced, which identifies the timing behavior and captures different behavior patterns as shown in Figure 2.

Timing preprocessing
The timing of events is analyzed in a preprocessing step. The relative time values of each event are collected in a histogram. Based on this histogram and the resulting probability density distribution over time, it is decided whether the timing behavior is subdivided into multiple modes. In the case of multiple modes, an event is separated according to the number of modes in the PDF such that each event consists of only one mode. For instance, an event e_i with 2 modes is separated into e_{i,1} and e_{i,2}, as can be seen in Figure 3.

Figure 3: An event with a multi-mode timing behavior is separated into its modes.

For the detection of multiple modes in events, three methods have been evaluated:
• Kernel density estimation: This version is straightforward: the density of the distribution function is estimated and subdivided at the local minima. It is optimized for efficient computation time. Nevertheless, it delivers useful results.
• Expectation-maximization (EM) algorithm: This method is well known from the state of the art. It performs well, but the number of mixed distribution functions has to be known or determined subsequently by trying all values and taking the best fit.
• Variational Bayesian inference: This version has the weakest runtime performance but delivers the best results. The number of overlapping distribution functions is calculated in an iterative manner.

Due to the high computation effort of the EM algorithm and variational Bayesian inference, we chose to use kernel density estimation for the timing preprocessing in BUTLA. The determination of the timing modes using kernel density estimation works as follows:
First, for each event e, all timing values t_1, t_2, ..., t_k are collected and stored in a list {e, {t_1, t_2, ..., t_k}}, where k ∈ N is the number of collected timing values for one event. Then, the PDFs are calculated using the kernel density estimation method for each event. Density estimation methods use a set of observations to find the subjacent density function. Given a vector t with the time values of the observations, the underlying density distribution for a time value t can be estimated as

f(t) = \frac{1}{N} \sum_{i=1}^{N} k(t_i; t)    (1)

where N ∈ N is the number of time values in the vector of observations and k(t_i; t) is a non-negative kernel function with

\int_{-\infty}^{\infty} k(t_i; t) \, dt = 1.    (2)

As underlying probability distribution, we use the Gaussian distribution, which is defined as

G(\mu, \sigma^2, t) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(t-\mu)^2}{2\sigma^2}}    (3)

where σ² is the bandwidth (smoothing factor), µ the mean value and t the time value for which the probability is calculated.
The choice of the bandwidth is important for the correctness of the results and is the subject of research in different publications (e.g. [32]). In the case of identifying the normal behavior of production plants, it is useful not to use a fixed value for the smoothing factor but to keep it variable. Here, the variable smoothing factor is 5% of the current value. This results in a greater variance for greater time values and a smaller variance for smaller time values. Therefore, the density is estimated as

f(t) = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\sqrt{2\pi \cdot 0.05\, t_i}} \, e^{-\frac{(t-t_i)^2}{2 \cdot 0.05\, t_i}}.    (4)

In the next step, the local minima in the calculated PDF are localized. One mode is assumed to lie between two neighboring local minima.
Finally, referring to the original data (discrete time values) and based on the assumption of normally distributed data, the needed statistical parameters (mean µ and standard deviation σ) are calculated. This is done for each mode, i.e. between the minimum value, all local minima and the maximum value.
Using this preprocessing of the timing information, the time-consuming splitting operation during the state merging procedure is not necessary, since the transitions are already split according to the identified timing modes.

3.3 Analysis of the Timing Preprocessing
Figure 2 illustrates that a state can be a starting point for different processes: when the robot is started, the size of the containers determines which of the sub-trees is taken for the further process, based on the time that is needed to move the container. Different possibilities exist to identify the different timing behavior of the sub-trees.
The algorithm RTI+ uses a splitting operation, which calculates a p-value for all possible splits and their sub-trees. If the lowest p-value of one split is less than 0.05, the transition is split.
Figure 4 illustrates the problem of the splitting operation. The main drawback of using the splitting operation is that it requires additional computation time. First, all possible splits have to be evaluated. Depending on the number of observations, this can be a huge amount. And after finding the best splitting point based on the smallest p-value, the transition has to be split. Here, for all postfixes of the corresponding transition, it has to be decided which path to follow. Since all these paths are mixed in the previous states, the information which path follows which states, based on the original data, has to be stored somehow. This leads to a huge memory consumption. To avoid this high memory consumption, RTI+ renews the prefix tree acceptor beginning with the corresponding state after each splitting operation. However, this is still time and space consuming.

Figure 4: The problem of the splitting operation.

Proposition 1. The time complexity of calculating and performing a splitting operation is O(m² · n²), where m is the number of input samples and n is the number of states in the PTA.

Proof. For each transition (in the worst case there are n − 1 transitions in the PTA, if it is a linked list of states with only one input sample or all input samples follow the same path), the p-value has to be calculated (which has to be done for each input sample using the corresponding transition). Therefore, the complexity for calculating the p-values is O(m · n).
One splitting operation itself also needs time in O(m · n) for the creation of the PTA with m input samples, each of which can have n states.
In the worst case, if each transition has to be split, the complexity is in O(m² · n²).

BUTLA instead uses a preprocessing of the timing values to avoid this splitting operation. This version is based on the assumption that events with the same changing signals but different timing behavior describe different behaviors.
In the preprocessing step, events with multiple timing modes are identified. These modes are used for the creation of the prefix tree. Events with the same symbol but arising from different timing modes are handled as different events and lead to different states in the prefix tree. In the identification phase, these events are also handled as different. Using this preprocessing step, the splitting process can be omitted. This leads to an increase in computation speed.

Proposition 2. The time complexity of calculating the timing modes in a preprocessing step is in O(n), where n is the number of observed events.

Proof. Since this happens during the preprocessing step and the PTA does not exist yet, the worst case does not depend on the PTA structure, but only on the number of incoming events and the number of symbols.
First, the time stamps for each symbol in the alphabet a ∈ Σ have to be collected. This takes time O(n).
Then, for each a ∈ Σ, the probability density distribution over time has to be calculated. For this, Equation 4 is computed. Note that not all events are considered for a single symbol a ∈ Σ, but only those that belong to this symbol a. All computations together need time O(n). Additionally, the local minima have to be identified, which is also done in O(n).
All these steps are performed sequentially, and therefore the overall time complexity for the preprocessing step is O(n).

Using the preprocessing step, the computation time can be reduced compared to the splitting version. While the splitting version runs in polynomial time, we could reduce this additional timing computation to linear time using the preprocessing step.

4 Learning Automata Results
As mentioned before, the goal of the identified automata is their usage for anomaly detection. An exemplary plant at the institute has been used for experimental results. Figure 5 shows a part of the Lemgo Model Factory and the identified models of two modules.

Figure 5: Example plant with identified models for two modules.

During the anomaly detection phase, the running plant's timing behavior is compared to the prognosis of the automaton. A timing anomaly is signaled whenever a measured timing is outside the timing interval in the learned timed automaton. Here, the interval is defined as [µ − k·σ, µ + k·σ], k ∈ R⁺, where µ is the mean value of the corresponding original observations' timings and σ is the standard deviation.
In a first experiment, the Lemgo Model Factory (see Figure 5) is used. A frequently occurring error, for example, is the wear of a conveyor belt, which leads to a decrease in the system's throughput. 12 production cycles are used to identify a normal behavior model. The PTA comprises 6221 states. BUTLA reduces this to 13 states; this corresponds to a compression rate of 99.79%.
To verify the model learning algorithm with a high amount of data, in a second experiment data is generated artificially using the modified Reber grammar (extended with timing information). 1000 samples are generated to learn the model; then 2000 test samples are created, of which 1000 comprise timing errors. From the initial 5377 states in the PTA, a model with 6 states is learned.
Table 3 shows the error rates for the anomaly detection applied to both data sets using different factors k in the timing intervals.

Table 3: Experimental results using real and artificial data.

|                                 | k=1 | k=2 | k=3 | k=4 |
| false negative rate (%) - LMF   |  2  | 5.3 | 12.8 |  30 |
| false positive rate (%) - LMF   | 12  | 4.2 |  2   |   0 |
| false negative rate (%) - Reber |  0  | 1.3 | 7.5  |  21 |
| false positive rate (%) - Reber |  9  | 3.1 | 1.1  |   0 |

The experimental results in Table 3 show that the false positive rate can be reduced by enlarging the time bounds. But at the same time, the false negative rate rises. Enlarging the time bounds therefore requires a trade-off between the false positive and the false negative rate. This has to be done separately for each application.

5 Conclusion
In this paper we analyzed the possibilities of learning the timing behavior for anomaly detection in CPPS. First, we gave a taxonomy of timing modeling formalisms. Based on this taxonomy, we analyzed whether the models can be identified automatically and whether they are suitable for anomaly detection.
Timed automata are often the first choice for the modeling of timed behavior of CPPS, especially for the modeling of sequential timed behavior. Due to their intuitive interpretation, timed automata are well suited to model the timing behavior. In our proposed learning method, we used probability density functions over time for the timing representation. In a preprocessing step, multiple modes in single transitions are identified; this enables the omission of the time-consuming splitting operation.
We proved the runtime enhancement formally and gave experimental results which demonstrate the practicability of timed automata for automatic identification and for anomaly detection.

References
[1] A. Maier. Online passive learning of timed automata for cyber-physical production systems. In The 12th IEEE International Conference on Industrial Informatics (INDIN 2014), Porto Alegre, Brazil, Jul 2014.
[2] M.M. Mansour, M. Wahab, and W.M. Soliman. Petri nets for fault diagnosis of large power generation station. Ain Shams Engineering Journal, 4(4):831–842, 2013.
[3] L. Console, C. Picardi, and D.T. Dupré. Temporal decision trees: Model-based diagnosis of dynamic systems on-board. CoRR, abs/1106.5268, 2011.
[4] A. Maier. Identification of Timed Behavior Models for Diagnosis in Production Systems. PhD thesis, University of Paderborn, 2015.
[5] C. A. Furia, D. Mandrioli, A. Morzenti, and M. Rossi. Modeling time in computing: A taxonomy and a comparative survey. ACM Comput. Surv., 42(2):6:1–6:59, March 2010.
[6] R. Isermann and M. Münchhof. Identification of Dynamic Systems: An Introduction with Applications. Advanced Textbooks in Control and Signal Processing. Springer, 2010.
[7] M. Y. Vardi. Branching vs. linear time: Final showdown. In Proceedings of the 7th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS 2001, pages 1–22, London, UK, 2001. Springer-Verlag.
[8] R. Alur and T. A. Henzinger. Back to the future: Towards a theory of timed regular languages. In Proceedings of the 33rd Annual Symposium on Foundations of Computer Science, pages 177–186. IEEE Computer Society Press, 1992.
[9] H. Khalil. Nonlinear Systems. Prentice Hall, January 2002.
[10] S. Indra. Decentralized diagnosis with isolation on request for spacecraft. In Astorga Zaragoza, editor, Proceedings of the 8th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, pages 283–288, August 2012.
[11] R. Isermann. Model-based fault detection and diagnosis - status and applications. In 16th IFAC Symposium on Automatic Control in Aerospace, St. Petersburg, Russia, 2004.
[12] W. Thomas. Automata on infinite objects. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science (Vol. B), pages 133–191. MIT Press, Cambridge, MA, USA, 1990.
[13] S. Verwer. Efficient Identification of Timed Automata: Theory and Practice. PhD thesis, Delft University of Technology, 2010.
[14] M. Roth, J. Lesage, and L. Litz. Black-box identification of discrete event systems with optimal partitioning of concurrent subsystems. In American Control Conference (ACC), 2010, pages 2601–2606, June 2010.
[15] M. Roth, S. Schneider, J.-J. Lesage, and L. Litz. Fault detection and isolation in manufacturing systems with
[17] S. Tripakis. Fault diagnosis for timed automata. In Werner Damm and Ernst-Rüdiger Olderog, editors, FTRTFT, volume 2469 of Lecture Notes in Computer Science, pages 205–224. Springer, 2002.
[18] P. Supavatanakul, C. Falkenberg, and J. J. Lunze. Identification of timed discrete-event models for diagnosis, 2003.
[19] Z. Simeu-Abazi, M. Di Mascolo, and M. Knotek. Diagnosis of discrete event systems using timed automata. In International Conference on Cost Effective Automation in Networked Product Development and Manufacturing, Monterrey, Mexico, 2007.
[20] C. A. Petri. Fundamentals of a theory of asynchronous information flow. In IFIP Congress, pages 386–390, 1962.
[21] T. Murata. Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541–580, April 1989.
[22] P. M. Merlin and D. J. Farber. Recoverability of communication protocols - implications of a theoretical study. IEEE Transactions on Communications, 24(9):1036–1043, Sep 1976.
[23] A. Cerone. A Net-based Approach for Specifying Real-time Systems. Serie TD. Ed. ETS, 1993.
[24] A. Cerone and A. Maggiolo-Schettini. Time-based expressivity of time Petri nets for system specification. Theoretical Computer Science, 216(1-2):1–53, 1999.
[25] M.P. Cabasino, A. Giua, and C. Seatzu. Identification of Petri nets from knowledge of their language. Discrete Event Dynamic Systems, 17(4):447–474, 2007.
[26] P. Nazemzadeh, A. Dideban, and M. Zareiee. Fault modeling in discrete event systems using Petri nets. ACM Trans. Embed. Comput. Syst., 12(1):12:1–12:19, January 2013.
[27] F. Baader, D. Calvanese, D. L. McGuinness, D. Nardi, and P. F. Patel-Schneider, editors. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, USA, 2003.
[28] J. R. Quinlan. Induction of decision trees. In Jude W. Shavlik and Thomas G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann, 1990. Originally published in Machine Learning, 1:81–106, 1986.
[29] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[30] F. Thollard, P. Dupont, and C. de la Higuera. Probabilistic DFA inference using Kullback-Leibler divergence and minimality. In Proc. of the 17th International Conf. on Machine Learning, pages 975–982. Morgan Kaufmann, 2000.
[31] R. C. Carrasco and J. Oncina. Learning stochas-
an identified discrete event model. Int. J. Systems Sci-
ence, 43(10):1826–1841, 2012. tic regular grammars by means of a state merging
method. In GRAMMATICAL INFERENCE AND AP-
[16] O. Niggemann, B. Stein, A. Vodenčarević, A. Maier, PLICATIONS, pages 139–152. Springer-Verlag, 1994.
and H. Kleine Büning. Learning behavior models for
[32] Z. I. Botev, J. F. Grotowski, and D. P. Kroese. Kernel
hybrid timed systems. In Twenty-Sixth Conference on
density estimation via diffusion. Annals of Statistics,
Artificial Intelligence (AAAI-12), pages 1083–1090,
38(5):2916–2957, 2010.
Toronto, Ontario, Canada, 2012.
224
Proceedings of the 26th International Workshop on Principles of Diagnosis
The Case for a Hybrid Approach to Diagnosis: A Railway Switch
Ion Matei and Anurag Ganguli and Tomonori Honda and Johan de Kleer
Palo Alto Research Center, Palo Alto, California, USA
e-mail: {imatei,aganguli,thonda,dekleer}@parc.com
Abstract

Behavioral models are at the core of Fault Detection and Isolation (FDI) and Model-Based Diagnosis (MBD) methods. In some practical applications, however, building and validating such models may not always be possible, or only partially validated models can be obtained. In this paper we present a diagnosis solution for the case when only a partially validated model is available. The solution uses a fault-augmented physics-based model to extract meaningful behavioral features corresponding to the normal and abnormal behavior. These features, together with experimental training data, are used to build a data-driven statistical model for classifying the behavior of the system based on observations. We apply this approach to a railway switch diagnosis problem.

1 Introduction

Consider the case of developing diagnostic software for a complex system (for this paper our example is a railway switch). The task is to determine from operational data whether the switch is operating correctly or in one of a fixed number of fault modes. We are given the following very limiting (but all too common) conditions: (a) very limited resources to complete the project (a few man-months); (b) a limited number of sensors; (c) unavailability of a model of the system; (d) unavailability of the system itself (testing would require an instrumented private rail system); (e) unavailability of the parameters of the system components; (f) limited nominal data; (g) extremely limited fault data (supplied as time series); (h) a highly non-linear multi-physics system with multiple operating modes. Broadly speaking, there are three approaches to this type of problem: Model-Based Diagnosis (MBD), Fault Detection and Isolation (FDI) and Machine Learning (ML). None of these approaches alone is adequate for this task. MBD and FDI require models and parameters which are unavailable. ML approaches require a large amount of training data, and most would require extensive feature engineering. In this paper we demonstrate a hybrid approach to this task which was ultimately fully satisfactory for the train company. Many real-world diagnostic tasks have similar limitations, and we believe our approach is one that yields good diagnostic algorithms for many such cases.

At a high level our approach is as follows. First, we build by hand an approximate model in Modelica (our switch model ultimately has 56 continuous-time states and more than 2000 time-varying variables). We require this model to contain the key mechanisms which comprise a switch. Under the limiting conditions above, building an accurate model of the system proved to be impractical, and we therefore used simplified models for the system's components. For example, we model the controller as a PID controller, while the actual mechanism surely has a more complex one. The Modelica model is fault-augmented [Minhas et al., 2014], including parameters which represent the fault amounts for wear, etc. Second, we develop ML classifiers to detect and diagnose faults by running the Modelica model repeatedly with various fault amounts. We mix noise into the simulation to avoid over-fitting. For the ML classifier to work, a set of features must be developed for the signals. Each time series is segmented at defined conditions and a set of features is designed (e.g., mean in segment, max in segment). Multiple ML techniques can produce a classifier; the best we found are based on random forests. Third, we throw away the model; it was only important for developing the features and the classifier. We now use the classifiers developed for the synthetic data on the real data. We were able to detect faults with a high level of accuracy, but were only partially successful in identifying the correct fault mode (or nominal) for the operating system. Independently, we showed that, given enough data for the various fault modes and using the same set of features, an ML classifier can be designed that also achieves high diagnostic accuracy. The latter effort is not the subject of this paper. Overall, the customer was very satisfied with the results of the project. Throughout the rest of the paper we describe the above procedure in detail.

1.1 FDI and MBD

In model-based approaches (FDI and MBD), the diagnosis engine is provided with a model of the system, values of the parameters of the model and values of some of its inputs and outputs. Its main goal is to determine from only this information whether the system is malfunctioning, which components might be faulty, and what additional information needs to be gathered (if any) to identify the faulty components with relative certainty. The distinguishing features of the MBD [de Kleer et al., 1992] approach are an emphasis on general diagnostic reasoning engines that perform a variety of diagnostic tasks via on-line reasoning, and inference of a system's global behavior from the automatic combination of physical components. Hence, MBD models are compositional: the model of a combination of two systems
is directly constructed from the models of the constituent systems. FDI methods can work with both physics-based and empirical models. The physics-based models are usually flattened, that is, the component and sub-component structure is lost in an overall behavioral model. Often, the faults are seen as separate inputs that need to be computed by the diagnosis engine. The disadvantage of this approach is that the physical semantics of the faults is ignored. In addition, treating the faults as exogenous inputs ignores the fact that the abnormal behavior may in fact depend on the variables of the system. Nevertheless, many FDI techniques were shown to be effective in diagnosing dynamical systems [Gertler, 1998; Isermann, 1997; 2005; Patton et al., 2000].

The above discussion emphasizes the need for a model when using either an FDI or MBD approach. As we will see later in the paper, there are cases when such a model is very difficult to obtain and (more importantly) validate, or only a partial model is available. Naturally, both FDI and MBD approaches would not fare well in such a scenario. When no model is available, data-driven methods can be used to learn the behavior of the system and use this knowledge to predict the system behavior. Such methods require experimental data corresponding to the normal and abnormal behavior for classification purposes; data that is used to extract features representative of the system's behavior. The set of features, together with observations of the system (output measurements), is used to learn a data-driven statistical model that is further used to classify the current observed behavior. Namely, when new data is available it is fed into the data-driven model, which in turn provides a "best guess" as to which class of behavior (normal or abnormal) the data corresponds. It is well recognized that in data-driven approaches the effectiveness of the classification is highly dependent on the quality of the features used for learning.

In this paper, we begin to bridge the gap between pure model-based and data-driven methods with a more hybrid approach. We propose the use of a partially validated model to help us determine a set of features that are representative of the normal and abnormal behavior. In this approach we build a physics-based model of the system, emphasizing its components and sub-components. Due to the lack of sufficient technical specifications and measurement data, only partial validation is achieved. By this we mean that only a subset of the variables of interest match their counterparts in the experimental data. The rest of the variables, although not completely matching the real data, exhibit similar characteristics, e.g., the same number of maxima and minima, or common regions of increasing/decreasing values. In other words, they are qualitatively equivalent. The physics-based model is further extended to include behaviors under different fault operating modes. In particular, physics-based models of the faults are included in the nominal model. The fault-augmented model is then used to generate synthetic simulated normal and abnormal (including multiple-fault) behavior and to extract representative features that are used in a data-driven approach. Note that although ideally we would like to execute the feature extraction step automatically, in this paper it is performed manually, as automatic feature extraction is a challenging problem in its own right. The diagnosis procedure described above is pictorially presented in Figure 1.

The rest of the paper is organized as follows: in Section 2 we motivate and describe the railway switch diagnosis problem. Sections 3 and 4 present the physics-based model, its fault-augmented version and the partial validation of the system. Section 5 describes the diagnosis solution under a partially validated physics-based model, while Section 6 puts our solution in the context of existing work on railway switch diagnostics.

2 Problem Description

Railway signaling equipment (including switches) generates approximately 60% of the failure statistics related to traffic disruptions due to signalling problems. As a consequence, more and more attention is paid to railway safety and optimal railway maintenance. As a result of the rapid technological advances in microelectronics and communication technologies in the past decades, it has become possible to add sensing and communication capabilities to railway equipment such as switches, to detect equipment failure and therefore to enhance the quality of the railway service. Although these sensing capabilities allow for easy detection of faults in the electrical components of the equipment, a significant number of faults related to the mechanical components affect parameters whose monitoring would be difficult either due to cost or impracticality of sensor placement.

The rail switch assembly considered in this paper is shown in Figure 2. The component responsible for moving the switch blades is the point machine. The point machine has two sub-components: a servo-motor (generates rotational motion) and a gear-cam mechanism (amplifies the torque generated by the motor and transforms the rotational motion into a translational motion).

The adjuster transfers the motion from the point machine to the load (switch blades) through a drive rod. In particular, by adjusting two bolts, the adjuster controls the time when the switch blades start moving, having as reference the time when the drive rod commences moving. The switch blades are supported by a set of rolling bearings to minimize motion friction. The manufacturer of the point machine endowed the equipment with a series of sensors that can measure the motor's angular velocity and torque, and the cam's angle and stroke (linear position). These sensors log data in real time, which is then sent to a central station for analysis. These sensors were installed by design on the point machine to monitor its safety. Although the operator of the railway switch is also interested in the diagnosis of the point machine, other possible faults are of interest as well. The faults considered in this paper are as follows: a loose lock-pin fault (at the connection between the drive rod and the point machine), adjuster bolt misalignment (the bolts move away from their nominal position), missing bearings, and the presence of an obstacle preventing the completion of the switch blade motion. Adding new sensors measuring the forces applied to the switch blades or the position of the switch blades may facilitate immediate detection of such faults. However, due to the sheer number and possible configurations of switches in the railway transportation network, this is not a scalable solution. Therefore, the challenge is to diagnose the aforementioned faults using only the available measurements.

3 System Modeling

This section presents the fault-augmented physics-based model of the railway switch assembly, together with some
Figure 1: Diagnosis procedure with partially validated model

Figure 2: Diagnosis procedure with partially validated model

model validation results. Such models provide deeper insight into the behavior of the physical system. Simulated behavior helps with learning normal and abnormal behavior patterns. The abnormal patterns are especially useful when not enough experimental data describing the abnormal behavior is available. The modeling process consists of decomposing the system into its main components, building physical models and combining them into an overall model of the system. We used the Modelica language to construct the model; Modelica is a non-proprietary, object-oriented, equation-based language for modeling complex physical systems [Tiller, 2001]. Models for the three main components of the railway switch (the point machine, the adjuster and the switch blades) are presented in what follows.

3.1 Point machine

The point machine is the component of the railway switch system that is responsible for moving the switch blades and locking them in the final position until a new motion action is initiated. It is composed of two sub-components: a servo-motor and a gear-cam mechanism. The electrical motor transforms electrical energy into mechanical energy and generates a rotational motion. The gear-cam mechanism scales down the angular velocity of the motor and amplifies the torque generated by the motor. In addition, it transforms the rotational motion into a translational motion.

Servomotor

No technical details were provided on this component, such as the type of motor or the type of controller. Values for technical parameters (e.g., armature resistance, motor shaft inertia) were not available either. This information was not available to the switch operator either. Therefore, as a result of a literature review on the types of motors used in railway switches, a DC permanent-magnet motor was chosen as the most likely candidate. The dynamical model for this component is given by

  La di(t)/dt = -Ra i(t) - Ke ω(t) + v(t),
  J dω(t)/dt = Kt i(t) - B ω(t) - τ(t),

where v(t) acts as input signal, ω(t) is the angular velocity at the motor flange that acts as output, τ(t) is the torque load of the motor and i(t) is the current through the armature. Generic motor parameters from the literature were also chosen [Zattoni, 2006]. One question that may arise is whether an empirical model can be estimated instead. Unfortunately, since only the output ω(t) is available and no voltage measurements exist, an empirical model based on system identification cannot be estimated. No information on the type of controller was available to us either. As a consequence, we used a PID controller for the feedback loop. Based on the observed profile of the motor output we determined that the controlled variable is the angular velocity ω(t). Indeed, Figure 3 shows the motor's angular velocity¹ that is maintained at a constant value by the controller. To compute the parameters of the PID controller we estimated metrics corresponding to the transient component of the output (angular velocity), such as rise time and overshoot.

¹The angular velocity profile shown in the graph is similar but not exactly the observed one, due to proprietary information restrictions.
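To make the servomotor model concrete, the sketch below integrates the two differential equations above with forward Euler and closes the loop with a PID controller on the angular velocity, as the paper does. All numerical values (La, Ra, Ke, Kt, J, B, the PID gains, and the constant load torque) are generic placeholders, not the parameters used in the paper.

```python
def simulate_motor(omega_ref=10.0, t_end=2.0, dt=1e-4,
                   La=1e-2, Ra=1.0, Ke=0.1, Kt=0.1, J=1e-2, B=1e-3,
                   kp=5.0, ki=20.0, kd=0.0, tau_load=0.05):
    """Forward-Euler simulation of
        La di/dt = -Ra i - Ke w + v
        J  dw/dt =  Kt i - B w - tau
    with the voltage v(t) supplied by a PID controller that regulates
    the angular velocity w to omega_ref (placeholder parameter values)."""
    i = w = integ = prev_err = 0.0
    ws = []
    for _ in range(int(t_end / dt)):
        err = omega_ref - w
        integ += err * dt
        deriv = (err - prev_err) / dt
        prev_err = err
        v = kp * err + ki * integ + kd * deriv   # PID voltage command
        i += dt / La * (-Ra * i - Ke * w + v)    # electrical dynamics
        w += dt / J * (Kt * i - B * w - tau_load)  # mechanical dynamics
        ws.append(w)
    return ws

w = simulate_motor()
print(round(w[-1], 2))  # settles near the 10 rad/s reference, as in Figure 3
```

With these placeholder values the integral action drives the speed to the reference despite the load torque; the time step must stay well below the electrical time constant La/Ra for the explicit Euler scheme to remain stable.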
Figure 3: Motor angular velocity

The Gear-Cam mechanism

As mentioned earlier, the gear-cam mechanism amplifies the torque generated by the motor and transforms the rotational motion into a translational motion. The technical details provided to us confirmed only the presence of the cam, but not of the gear. We inferred the presence of the latter by comparing the angular velocity of the motor with the cam's angular velocity, estimated from the measured cam angle. This allowed us to estimate the ratio between the two velocities, and therefore the gear ratio. The cam diagram is shown in Figure 4, where a wheel rotates as a result of the torque transmitted through the gear and acts on a lever that pushes the drive rod.

Figure 4: Cam schematics

Using the geometry of the cam, the relation between the rotational motion and the linear motion (that is, the relation between the angle and the stroke) is given by

  stroke = R × sin(angle),

where R denotes the radius of the cam. In addition, the map between the applied torque and the generated force is

  force = (1/R) × torque × cos(angle).

As both the cam angle and the stroke were included in the available measurements, we used a least-squares method to estimate the radius of the cam.

3.2 Adjuster

The adjuster links the drive rod connected to the point machine to the switch blades, and hence it is responsible for transferring the translational motion. There is a delay between the time instants at which the drive rod and the switch blades start moving. This delay is controlled by setting the positions of two bolts on the drive rod. A tighter bolt setting means a smaller delay, while a looser bolt setting produces a larger delay. The high-level diagram of the adjuster is depicted in Figure 5.

Figure 5: Adjuster diagram

The most challenging part in constructing the adjuster was modeling the non-sticking contact between the drive rod and the adjuster extremes. Stiff contact between two bodies is usually modeled using a spring-damper component with very large values for the elasticity and damping constants. However, under this approach, once contact takes place it is permanent. To solve this challenge, we built a custom component that models the non-sticking contact.

3.3 Switch blades

The adjuster is connected to two switch blades that are moved from left to right or right to left, depending on the traffic needs. We look at a switch blade as a flexible body and used an approximation method for modeling beams, namely the lumped parameter approximation. This method assumes that the beam deflection is small and in the linear regime. The lumped parameter approach approximates a flexible body as a set of rigid bodies coupled with springs and dampers. It can be implemented by a chain of alternating bodies and joints. The springs and dampers act on the bodies or the joints. The spring stiffness and damping coefficients are functions of the material properties and the geometry of the flexible elements. Parameters such as rail length, mass and mass moment of inertia were provided to us through technical documentation. To model the effect of the rail moving on rolling bearings, we included a friction component that accounts for energy loss due to friction. Although the component can represent different friction models, the default model is Coulomb friction.

3.4 Fault augmentation

In this section we describe the modeling artifacts that were used to include in the behavior of the system the four fault operating modes: loose lock-pin, misaligned adjuster bolts, obstacle and missing bearings.

Loose lock-pin

The lock-pin referred to in this fault mode connects the point machine with the drive rod that transfers the motion to the switch blades. More precisely, it locks the drive rod to the point machine. When this lock-pin becomes loose due to wear, it introduces a slackness in the way the motion is transferred to the switch blades. The lock-pin fault affects the stability of the connection point between the drive rod and the point machine. In time, if not fixed, this can lead to a complete failure of the pin, after which the point machine can no longer act upon the blades. A custom-built component whose main characteristic is that it implements non-sticking pushing and pulling between two rods was built to model the effects of this fault. The impact between the two rods is assumed to be elastic, that is, we use a spring-damper assembly with large values for its parameters to model the contact. There are two types of contact: contact of the rods with the boundaries of the locking mechanism and contact between the rods. Both these types of contact must exhibit non-sticking pushing and pulling properties.
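The least-squares estimate of the cam radius described in Section 3.1 reduces, for the model stroke = R × sin(angle), to a one-parameter linear fit through the origin. The sketch below illustrates this on synthetic data; the radius value, angle range and noise level are invented for the example.

```python
import math
import random

def estimate_cam_radius(angles, strokes):
    """Least-squares fit of stroke = R * sin(angle): minimizing
    sum_i (stroke_i - R sin(angle_i))^2 over R gives the closed form
    R = sum_i stroke_i * sin(angle_i) / sum_i sin(angle_i)^2."""
    s = [math.sin(a) for a in angles]
    return sum(x * y for x, y in zip(strokes, s)) / sum(x * x for x in s)

# Synthetic check: a hypothetical true radius of 0.12 m and noisy strokes.
random.seed(1)
true_R = 0.12
angles = [i * math.pi / 200 for i in range(1, 100)]   # sweep up to ~pi/2
strokes = [true_R * math.sin(a) + random.gauss(0, 1e-3) for a in angles]
R_hat = estimate_cam_radius(angles, strokes)
print(round(R_hat, 3))  # recovers a value close to 0.12
```

Because the model is linear in R, no iterative optimization is needed; the same closed form applies to the real (angle, stroke) measurement pairs.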
Misaligned adjuster bolts
In this fault mode the bolts of the adjuster deviate from their
nominal position. As a result, the instant at which the drive
rod meets the adjuster (and therefore the instant at which the
the switch rail starts moving) happens either earlier or later.
For example in a left-to-right motion, if the left bolt moves
to the right, the contact happens earlier. The reason is that
since the distance between the two bolts decreases, the left
bolt reaches the adjuster faster. As a result, when the drive
rod reaches its final position, there may be a gap between
the right switch blade and the right stock rail. In contrast, if
the left bolt moves to the left the contact happens later. The
model of the adjuster includes parameters that can set the
positions of the bolts, and therefore the effects of this fault
mode can be modeled without difficulty.
Figure 6: Motor torque with its five operating zones
Obstacle
In this fault mode, an obstacle prevents the switch blades
reach their final nominal position, and therefore a gap be- where the drive rod catches up again with switch blades an
tween the switch blades and the stock rail appears. The ef- pushes them to their final position. Finally, in Zone 5 the
fect on the motor torque is a sudden increase in value, as the switch blades are pushed against the stock rails for a short
motor tries to overcome the obstacle. To model this fault period of time, hence the increase in torque. In support of
we included a component that implements a hard stop for the validation of these five operating zone, a set of movies
the position of the switch blades. This component has two depicting the motion of the switch blades were used. With
parameters for setting the left and right limits within motion respect to the fault operating modes, we managed to gener-
of the switch blades is allowed. By changing the values of ate similar effects in the simulated data, as the ones observed
these parameters, the presence of an obstacle can be simu- in the measured data. Figure 7 shows the effect of the mis-
lated. aligned bolts fault, and in particular the case where the left
bolt moves to the left. The effect is a delay applied on the
Missing bearings time instant the drive rod reaches the switch blades. In ad-
To minimize friction, the rails are supported by a set of dition, Zone 5 is also affected since due to the decreased
rolling bearings. When they become stuck or lost, the en- distance, the switch blades are no longer pushed against the
ergy losses due to friction increase. As mentioned in the stock rails. In the case of an obstacle, the switch blades (and
section describing the switch blades modeling, a component
was included to account for friction. This component has a
parameter that sets the value for the friction coefficient. By
increasing the value of this parameter, the effect of the miss-
ing bearings fault can be simulated.
4 Model Validation
Motor angular velocity, cam angle and stroke, together with
the motor torque were used in the validation process. To
these measurements, we added the rail position that was
estimated from a set of movies depicting the rail motion,
to which image processing techniques were applied. We
achieved partial validation of the model. The simulated mo-
tor angular velocity, cam angle and stroke closely match
the measured data. The simulated motor torque however
matches in a qualitative sense its measured counterpart. The
main reason is the fact that we had to make assumptions on
the type controller motor and controller, without no way to Figure 7: Motor torque in the normal and misaligned bolts
validate these assumptions. In addition, the available mea- fault modes
surements did not allowe for the estimating the parameters
in the assumed models, as this problem is ill posed. Figure 6 hence the drive rod) push against an obstacle that does not
depicts the simulated torque, emphasizing the five operating allow the completion of the motion. Therefore, the electric
zone. In Zone 1, the motor rotates the cam and the drive rod motor develops the maximum allowable torque as seen in
moves freely. No contact with the switch blades takes place Figure 8. In the case of the missing bearing fault mode, the
in this zone, and the (small) energy loss is due to friction in motion friction of the switch blades increases, and hence
the mechanical components. Zone 2 corresponds to the case the torque generated by the motor must accommodate this
where the drive rod pushes the two switch blades. The elas- increase. We obtained this effect in simulation as shown in
ticity in the switch blades can be noticed in the toque profile Figure 9. Finally, Figure 10 shows the effects of the lock-
in this zone. In Zone 3, the switch blades accelerate (as they pin fault. The slackness introduced by the looseness of the
drop off the rolling bearings) and again the drive rod moves pin induces a delay in the rail motion which also affects the
freely (note the drop in torque). Zone 4 depicts the case behavior in Zone 5. In terms of the changes in the five op-
229
Proceedings of the 26th International Workshop on Principles of Diagnosis
effects in simulation. The choice of features described in the
next section was supported by this understanding.
5 Fault Detection and Diagnosis
In the case of a railway switch, our measurements include
the motor torque and motor angular velocity. As the switch
moves from one extreme position to the other, these quan-
tities are measured at a fixed sampling rate. Thus, we
obtain a time series for each of the measurements. Let
{τ (t1 ), . . . , τ (tN )} denote torque measured at time instants
{t1 , . . . , tN }. Likewise, let {ω(t1 ), . . . , ω(tN )} denote the
angular velocity. For simplicity’s sake, we denote the two
time series of measurements by X. The diagnosis objective
is to determine the underlying condition of the system from
these time series. In other words, the objective is to deter-
mine a classifier f : X → {N, F1 , F2 , F3 , F4 , F5 }, where
N refers to the class label corresponding to the normal con-
Figure 8: Motor torque in the normal and obstacle fault dition and F1 , F2 , F3 and F4 denote the class labels loose
modes bolt, tight bolt, loose lock-pin, missing bearings, and obsta-
cle respectively.
We adopt a machine learning approach to constructing the
above mentioned classifier. The two main steps in building
a machine learning classifier are feature selection and clas-
sifier type selection. These two steps are discussed next.
5.1 Feature selection
As seen in Figure 6, the motor torque profile shows five dis-
tinct operating zones. Moreover, we notice from Figures 7,
8, 9 and 10 that a given fault’s impact on the torque pro-
file seems limited to only some of the five zones. With this
observation, our feature selection strategy is as follows.
1. Identify the approximate time instants that define the
boundaries of the five zones. For example, Zone 1 is
defined to be between times 0.8 seconds and 2 seconds,
zone 2 is defined to be between times 2 seconds and 4.1
seconds, and so on.
Figure 9: Motor torque in the normal and missing bearings
2. Within each zone, compute a set of measures. An ex-
fault modes
ample of a measure is the total energy dissipated within
the zone. This is computed as instantaneous power in-
tegrated over the duration of the zone. The instanta-
neous power is the product of instantaneous torque and
angular velocity. Other examples of features include
maximum and minimum torque values within the zone.
The disclosure of the full set of measures used is not
possible at this time for proprietary reasons. The fea-
tures are normalized to have zero mean and unit stan-
dard deviation.
Note that it might be possible to combine one or more zones
into one for feature selection.
5.2 Classifier selection
To map the features to the classes {N, F1, F2, F3, F4, F5},
we use machine learning. Examples of classifier types commonly used include k-nearest neighbors, support vector machines, neural networks and decision trees. We chose Random Forest, an ensemble classifier, because of its robustness to overfitting. For a more detailed discussion of the advantages of Random Forest, we refer the reader to [Breiman, 2001]. In addition, we also developed a binary classifier for fault detection based on the Alternating Decision Tree (AD Tree). The advantage of the AD Tree is that the results are human interpretable.

Figure 10: Motor torque in the normal and lock-pin fault modes

In all operating zones, the simulated behavior showed similar characteristics as in the case of the real data. The understanding of these behaviors comes as a result of building the model, augmenting the model with fault modes, and analyzing their effects.
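The interpretability of the AD Tree comes from its additive scoring: a sample accumulates signed scores along every branch it traverses, and only the sign of the total matters (the traversal rule is described with Figure 12 in Sec. 5.3). A simplified sketch with made-up splitter nodes and scores; a full AD tree also nests splitters under prediction nodes, which we ignore here:

```python
def adtree_score(sample, splitters, base_score=0.0):
    """Total score over all traversed paths: each splitter node contributes
    one edge score depending on which side of its threshold the sample falls."""
    total = base_score
    for feature, threshold, below, above in splitters:
        total += below if sample[feature] < threshold else above
    return total

def adtree_decide(sample, splitters, base_score=0.0):
    # Negative total score -> normal, otherwise abnormal (the rule in Sec. 5.3).
    return "normal" if adtree_score(sample, splitters, base_score) < 0 else "abnormal"
```

A fault analyst can read each splitter as a rule of thumb ("high torque in zone 2 pushes toward abnormal"), which is the human-interpretability the text refers to.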
230
Proceedings of the 26th International Workshop on Principles of Diagnosis
5.3 Results

For each fault type, we introduce varying magnitudes of fault and simulate the switch model described earlier. The fault magnitude is parameterized by a factor k which is varied over a pre-specified range. A value of k equal to zero corresponds to the normal case. Higher values of k correspond to the faulty cases. In addition, we also add representative noise to the measurements. Figure 11 shows some example torque profiles generated by the simulation.

Figure 11: Simulated torque measurements with added noise.

The data generated is recorded and used to train and test the machine learning classifier. We use leave-one-out cross-validation for training and testing the classifiers. In this approach, one data sample is used for testing whereas all the rest of the data is used for training. This is repeated until each data sample has been tested once. Table 1 shows the confusion matrix for the simulated data described earlier. The (i, j)th entry of the confusion matrix refers to the percentage of cases where the true class was i but was classified as j by the classifier. A matrix with 100 along all the diagonal entries would correspond to a perfect classifier. In the results shown in Table 1, we observe some misclassification between classes N and F4. Recall that N is the normal class and F4 is the missing bearing class. On further investigation, we determined that the misclassification occurs between the normal data and data corresponding to low magnitudes of the missing bearing fault.

Table 1: Fault diagnosis confusion matrix on simulated data

       N     F1    F2    F3    F4    F5
  N    97.2  0     0     2     0.8   0
  F1   0     100   0     0     0     0
  F2   0     0     99    1     0     0
  F3   9     0     4     87    0     0
  F4   11    0     0     0     89    0
  F5   0     0     0     0     0     100

The binary classification or fault detection result using the AD Tree is shown in Table 2. As in the multi-class classification case, the false positives (normal classified as abnormal) and false negatives (abnormal classified as normal) are primarily due to confusion between missing bearings and normal. Figure 12 shows part of the fault detection AD Tree. A pink oval represents a feature node. Depending on the value of the feature, one of two branches is followed until a leaf node is reached. Each edge that is traversed results in a score shown within the blue rectangles. For every root-to-leaf traversal, the total score is the sum of the scores accumulated on each edge. For a given data sample, multiple root-to-leaf paths may be traversed. In that case, the final score is the sum of the scores accumulated over all the paths. If the final score is negative, the decision is normal; otherwise the decision is abnormal.

Table 2: Fault detection confusion matrix on simulated data

            Normal  Abnormal
  Normal    94.6    5.4
  Abnormal  9.6     90.4

Next, we test the classifiers on real data. A key preprocessing step is to compute a linear transformation that transforms the mean and standard deviation of the features of the nominal (normal) real data to make them equal to the mean and standard deviation of the features of the nominal simulated data. The same transformation is then applied to the real faulty data before testing with the ML classifier. We emphasize here that to compute the transformation we only require examples of real data showing normal behavior. We do not use any real fault data for training the ML classifier. Table 3 shows the fault detection results on real data. As can be seen, we achieve a high accuracy of greater than 80 percent. We also tested the multi-class random forest classifier to diagnose the various faults. We were able to correctly diagnose all missing bearing faults but were unable to correctly diagnose the other faults.

Table 3: Fault detection confusion matrix on real data

            Normal  Abnormal
  Normal    85.5    14.5
  Abnormal  20      80

6 Related Work

A malfunctioning railway switch assembly can have a high impact on railway transportation safety, and therefore the problem of diagnosing such systems has been addressed in other works. [Zattoni, 2006] proposes a detection system based on off-line processing of the armature current and voltage. The system implements an algorithm that realizes a finite impulse response system designed on the basis of an H2-norm criterion, and allows for detection of incremental faults (e.g., loss of lubrication, increasing obstructions, etc.). The approach hinges on the availability of a validated model of the point machine, which was not the case in our setup. [Zhou et al., 2001; 2002] propose a remote monitoring system for railway point machines. The system includes a variety of sensors for acquiring trackside data related to parameters such as distance, driving force, voltage, electrical noise, or temperature. The monitoring system logs data for offline analysis that offers detailed information on the condition of the system in the form of event
Figure 12: Part of the fault detection AD Tree (the visible splitter nodes include the maximum torque in zone 2 and the total energy dissipated)
analysis and data trends. Hence, unlike in our setup, the focus is on detection rather than isolation. In addition, due to scalability constraints, our solution is based on the embedded sensors, no other sensor being added. In [Asada et al., 2013] a classification-based fault detection and diagnosis algorithm is developed using measurements such as drive force, electrical current and voltage. In particular, a classifier based on support vector machines is used. Our work also uses classification for diagnosis, but considers a wider variety of classifiers, such as Multiclass Random Forest or Logitboosted Random Forest, that were proved to be more robust [Opitz and Maclin, 1999]. The classification step in [Asada et al., 2013] depends on a set of features extracted by applying the discrete wavelet transform to the active power. This step is oblivious to the operating modes of the point machine, which we showed to be relevant in our case. Hence, the diagnosis approach in [Asada et al., 2013] is purely data driven. Since we had no access to current and voltage measurements, this avenue for feature construction was not available to us. Depending on the type of electrical motor, the current and the voltage could be computed from the angular velocity and torque, respectively. However, knowledge of motor parameters is needed. [Asada et al., 2013] consider two types of faults: underdriving and overdriving of the drive rod. Overdriving refers to the case where the switch blades are pushed against the stock rails due to misalignment, and a higher force than normal appears between the stock rails and the switch blades. Overdriving maps to misaligned bolts, missing bearings and obstacles in our setup. All these fault modes exhibit higher forces than normal. Underdriving maps to a particular instance of the misaligned bolts fault (the left bolt moves to the left, for example). Therefore, our solution differentiates between more possible causes of higher forces, since we take advantage of the particular signature these forces have in each fault corresponding to overdriving. Another purely data-driven approach for railway point machine monitoring was proposed in [Oyebande and Renfrew, 2002], where a net energy analysis technique was used to discriminate between normal and abnormal behavior. This approach relies on a set of sensor measurements, such as motor voltage, current or switch blade positions, not all of them being available in our case. In addition, the computation of the net energy requires parameters of the electrical motor (armature resistance and motor shaft inertia) that again are not available in our setup. Moreover, unlike our diagnosis objective, the focus is on detecting abnormalities within the point machine.

7 Conclusions

The three main general approaches to developing diagnostic software (FDI, MBR, and ML) all have severe limitations in many real-world applications. We believe we will see many more hybrid approaches to diagnosis that include the best of these three approaches to build accurate diagnosers. The railway switch is a critical and complex piece of equipment requiring extremely high diagnostic accuracy (the main reason this project was initiated), and the approach outlined in this paper was ultimately successful. Ultimately, deployment of this approach will depend on expanding the set of faults detected and on the installation of more sensor-rich switches in railroad infrastructures.

References

[Asada et al., 2013] T. Asada, C. Roberts, and T. Koseki. An algorithm for improved performance of railway condition monitoring equipment: Alternating-current point machine case study. Transportation Research Part C: Emerging Technologies, 30(0):81-92, 2013.

[Breiman, 2001] Leo Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[de Kleer et al., 1992] J. de Kleer, A. Mackworth, and R. Reiter. Characterizing diagnoses and systems. Artificial Intelligence, 56(2-3):197-222, 1992.

[Gertler, 1998] J. Gertler. Fault Detection and Diagnosis in Engineering Systems. New York: Marcel Dekker, 1998.
[Isermann, 1997] R. Isermann. Supervision, fault-detection
and fault-diagnosis methods - An introduction. Control
Engineering Practice, 5(5):639 – 652, 1997.
[Isermann, 2005] Rolf Isermann. Model-based fault-
detection and diagnosis - status and applications. Annual
Reviews in Control, 29(1):71 – 85, 2005.
[Minhas et al., 2014] R. Minhas, J. de Kleer, I. Matei, B. Saha, B. Janssen, D.G. Bobrow, and T. Kurtoglu. Using fault augmented Modelica models for diagnostics. In Proceedings of the 10th International Modelica Conference, 2014.
[Opitz and Maclin, 1999] David Opitz and Richard Maclin.
Popular ensemble methods: an empirical study. Journal
of Artificial Intelligence Research, 11:169–198, 1999.
[Oyebande and Renfrew, 2002] B.O. Oyebande and A.C.
Renfrew. Condition monitoring of railway electric point
machines. Electric Power Applications, IEE Proceedings
-, 149(6):465–473, Nov 2002.
[Patton et al., 2000] Ron J. Patton, Paul M. Frank, and
Robert N. Clark. Issues of Fault Diagnosis for Dynamic
Systems. Springer-Verlag London, 2000.
[Tiller, 2001] Michael Tiller. Introduction to Physical Mod-
eling with Modelica. Kluwer Academic Publishers, Nor-
well, MA, USA, 2001.
[Zattoni, 2006] Elena Zattoni. Detection of incipient failures by using an H2-norm criterion: Application to railway switching points. Control Engineering Practice, 14(8):885-895, 2006.
[Zhou et al., 2001] F. Zhou, M. Duta, M. Henry, S. Baker,
and C. Burton. Condition monitoring and validation
of railway point machines. In Intelligent and Self-
Validating Instruments – Sensors and Actuators (Ref. No.
2001/179), IEE Seminar on, pages 6/1–6/7, Dec 2001.
[Zhou et al., 2002] F.B. Zhou, M.D. Duta, M.P. Henry,
S. Baker, and C. Burton. Remote condition monitoring
for railway point machine. In Railroad Conference, 2002
ASME/IEEE Joint, pages 103–108, April 2002.
Design of PD Observer-Based Fault Estimator Using a Descriptor Approach
Dušan Krokavec, Anna Filasová, Pavol Liščinský, Vladimír Serbák
Department of Cybernetics and Artificial Intelligence,
Technical University of Košice, Faculty of Electrical Engineering and Informatics,
Košice, Slovakia
e-mail: {dusan.krokavec, anna.filasova, pavol.liscinsky, vladimir.serbak}@tuke.sk
Abstract

A generalized principle of PD fault observer design for continuous-time linear MIMO systems is presented in the paper. The problem addressed is formulated as a descriptor system approach to PD fault observer design, implying the asymptotic convergence of both the state observer error and the fault estimate error. Presented in the sense of the second Lyapunov method, an associated structure of linear matrix inequalities is outlined to guarantee the asymptotic dynamic properties of the observer. The proposed design conditions are verified by simulations in a numerical illustrative example.

1 Introduction

As is well known, observer design is an active research field owing to its particular importance in observer-based control, residual fault detection and fault estimation [1], where, especially from the standpoint of active fault tolerant control (FTC) structures, the problem of simultaneous state and fault estimation is highly desirable. In that sense various effective methods have been developed to take into account the effect of faults on control structure reconfiguration and fault detection [16], [22]. Fault detection filters, usually relying on the use of a particular type of state observer, are mostly used to produce fault residuals in FTC. Because it is generally not possible to totally decouple fault effects from the perturbation influence in residuals, different approaches are used to tackle this conflict in part and to create residuals that are as a rule zero in the fault-free case, maximally sensitive to faults, as well as robust to disturbances [2], [8]. Since faults are usually detected by setting a threshold on the generated residual signal, the determination of an actual threshold is often formulated in adaptive frameworks [3]. A generalized method to solve the problem of actuator fault detection and isolation in over-actuated systems is given in [14], [15].

To estimate actuator faults for linear time-invariant systems without external disturbance, principles based on adaptive observers are frequently used, which estimate actuator faults by integrating the system output errors [25]. In particular, proportional-derivative (PD) observers introduce a design freedom giving an opportunity for generating state and fault estimates with good sensitivity properties and improving the observer design performance [6], [18], [19]. Since derivatives of the system outputs can be exploited in the fault estimator design to achieve faster fault estimation, proportional multi-integral derivative estimators are proposed in [7], [24].

Although state observers for linear and nonlinear systems have received considerable attention, the descriptor design principles have not been studied extensively for non-singular systems. Modifying the descriptor observer design principle [13], the first result giving sufficient design conditions, but for linear time-delay systems, can be found in [5]. Reflecting the same problems concerning the observers for descriptor systems, linear matrix inequality (LMI) methods were presented e.g. in [9], but a hint of this method can be found in [23], [25]. The extension for a class of nonlinear systems which can be described by Takagi-Sugeno models is presented in [12].

Adapting the approach to observer-based fault estimation for descriptor systems as well as its potential extension, the main issue of this paper is to apply the descriptor principle in PD fault observer design. Preferring the LMI formulation, the stability condition proofs use standard arguments in the sense of the Lyapunov principle, with the design conditions requiring to solve only LMIs without additional constraints. This presents a method for designing the PD observer derivative and proportional gain matrices such that the design is non-singular and ensures that the estimation error dynamics is asymptotically convergent. From the viewpoint of application, although the descriptor principle is used, it is not necessary to transform the system parameters into a descriptor form or to use matrix inversions in the design task formulation. Despite a partly conservative form, the design conditions can be transformed to LMIs with a minimal number of symmetric LMI variables.

The paper is organized as follows. Placed after the Introduction, Sec. 2 gives a basic description of the PD fault observer and Sec. 3 presents the design problem formulation in the descriptor form for a standard Luenberger observer. A new LMI structure, describing the PD fault observer design conditions, is theoretically explained in Sec. 4. An example is provided to demonstrate the proposed approach in Sec. 5 and Sec. 6 draws some conclusions.

The notations used are conventional: $x^T$, $X^T$ denote the transpose of the vector $x$ and matrix $X$, respectively, $X = X^T > 0$ means that $X$ is a symmetric positive definite matrix, $\|X\|_\infty$ designates the $H_\infty$ norm of the matrix $X$, the symbol $I_n$ represents the $n$-th order unit matrix, $\rho(X)$ and $\mathrm{rank}(X)$ indicate the eigenvalue spectrum and rank of a square matrix $X$, $\mathbb{R}$ denotes the set of real numbers and $\mathbb{R}^n$, $\mathbb{R}^{n\times r}$ refer to the set of all $n$-dimensional real vectors and $n \times r$ real matrices, respectively.
2 The Problem Statement

The systems under consideration are linear continuous-time dynamic systems represented in state-space form as
$$\dot{q}(t) = Aq(t) + Bu(t) + Ff(t)\,, \quad (1)$$
$$y(t) = Cq(t)\,, \quad (2)$$
where $q(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^r$, $y(t) \in \mathbb{R}^m$ are the vectors of the state, input and output variables, $f(t) \in \mathbb{R}^p$ is the fault vector, $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times r}$, $C \in \mathbb{R}^{m\times n}$ and $F \in \mathbb{R}^{n\times p}$ are real finite-valued matrices, $m, r, p < n$ and
$$\mathrm{rank}\begin{bmatrix} A & F \\ C & 0 \end{bmatrix} = n + p\,. \quad (3)$$
It is considered that the fault $f(t)$ may occur at an uncertain time, the size of the fault is unknown but bounded, and the pair $(A, C)$ is observable.

Focusing on the fault estimation task for slowly-varying faults, the fault PD observer is considered in the following form [19]
$$\dot{q}_e(t) = Aq_e(t) + Bu(t) + Ff_e(t) + J(y(t) - y_e(t)) + L(\dot{y}(t) - \dot{y}_e(t))\,, \quad (4)$$
$$y_e(t) = Cq_e(t)\,, \quad (5)$$
$$\dot{f}_e(t) = M(y(t) - y_e(t)) + N(\dot{y}(t) - \dot{y}_e(t))\,, \quad (6)$$
where $q_e(t) \in \mathbb{R}^n$, $y_e(t) \in \mathbb{R}^m$, $f_e(t) \in \mathbb{R}^p$ are estimates of the system state vector, the output variable vector and the fault vector, respectively, and $J, L \in \mathbb{R}^{n\times m}$, $M, N \in \mathbb{R}^{p\times m}$ is the set of observer gain matrices to be determined.

To explain and concretize the obtained results, the following well-known lemma on the Schur complement property is suitable.

Lemma 1. [20] Considering the matrices $Q = Q^T$, $R = R^T$ and $S$ of appropriate dimensions, where $\det R \neq 0$, the following statements are equivalent:
$$\begin{bmatrix} Q & S \\ S^T & -R \end{bmatrix} < 0 \;\Leftrightarrow\; Q + SR^{-1}S^T < 0,\; R > 0\,. \quad (7)$$
This shows that the above block matrix inequality has a solution if the implied set of inequalities has a solution.

3 Descriptor Principle in Luenberger Observer Design

To formulate the proposed PD observer design approach, the descriptor principle in the observer stability analysis is presented.

If the fault-free system (1), (2) is considered, the Luenberger observer is given as
$$\dot{q}_e(t) = Aq_e(t) + Bu(t) + J(y(t) - y_e(t))\,, \quad (8)$$
$$y_e(t) = Cq_e(t)\,, \quad (9)$$
and using (1), (2), (8), (9), it yields
$$\dot{e}(t) = (A - JC)e(t)\,, \quad (10)$$
$$(A - JC)e(t) - \dot{e}(t) = 0\,, \quad (11)$$
respectively, where
$$e(t) = q(t) - q_e(t)\,. \quad (12)$$
Using the descriptor principle, the following lemma presents the Luenberger observer design conditions in terms of LMIs for the fault-free system (1), (2).

Lemma 2. The Luenberger observer (8), (9) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $P_1 \in \mathbb{R}^{n\times n}$, a regular matrix $P_3 \in \mathbb{R}^{n\times n}$ and a matrix $Y \in \mathbb{R}^{n\times m}$ such that
$$P_1 = P_1^T > 0\,, \quad (13)$$
$$\begin{bmatrix} A^TP_3 + P_3^TA - YC - C^TY^T & * \\ P_1 - P_3 + \delta P_3^TA - \delta YC & -\delta(P_3 + P_3^T) \end{bmatrix} < 0\,. \quad (14)$$
When the above conditions hold, the observer gain matrix $J$ is given as
$$J = (P_3^T)^{-1}Y\,. \quad (15)$$
Hereafter, $*$ denotes the symmetric item in a symmetric matrix.

Proof. Denoting the observer system matrix as
$$A_e = A - JC\,, \quad (16)$$
then with the equality
$$\dot{e}(t) = \dot{e}(t) \quad (17)$$
the equivalent form of (11) can be written as
$$\begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} \dot{e}(t) \\ \ddot{e}(t) \end{bmatrix} = \begin{bmatrix} \dot{e}(t) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & I_n \\ A_e & -I_n \end{bmatrix}\begin{bmatrix} e(t) \\ \dot{e}(t) \end{bmatrix}\,, \quad (18)$$
or, more generally,
$$E^\diamond\dot{e}^\diamond(t) = A_e^\diamond e^\diamond(t)\,, \quad (19)$$
where
$$e^{\diamond T}(t) = \begin{bmatrix} e^T(t) & \dot{e}^T(t) \end{bmatrix}\,, \quad (20)$$
$$E^\diamond = E^{\diamond T} = \begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}\,, \quad A_e^\diamond = \begin{bmatrix} 0 & I_n \\ A_e & -I_n \end{bmatrix}\,. \quad (21)$$
Defining the Lyapunov function of the form
$$v(e^\diamond(t)) = e^{\diamond T}(t)E^{\diamond T}P^\diamond e^\diamond(t) > 0\,, \quad (22)$$
where
$$E^{\diamond T}P^\diamond = P^{\diamond T}E^\diamond \geq 0\,, \quad (23)$$
then the derivative of (22) becomes
$$\dot{v}(e^\diamond(t)) = \dot{e}^{\diamond T}(t)E^{\diamond T}P^\diamond e^\diamond(t) + e^{\diamond T}(t)P^{\diamond T}E^\diamond\dot{e}^\diamond(t) < 0 \quad (24)$$
and, inserting (19) into (24), it yields
$$\dot{v}(e^\diamond(t)) = e^{\diamond T}(t)(P^{\diamond T}A_e^\diamond + A_e^{\diamond T}P^\diamond)e^\diamond(t) < 0\,, \quad (25)$$
$$P^{\diamond T}A_e^\diamond + A_e^{\diamond T}P^\diamond < 0\,, \quad (26)$$
respectively. Introducing the matrix
$$P^\diamond = \begin{bmatrix} P_1 & P_2 \\ P_3 & P_4 \end{bmatrix}\,, \quad (27)$$
then, with respect to (23), it has to be
$$\begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} P_1 & P_2 \\ P_3 & P_4 \end{bmatrix} = \begin{bmatrix} P_1^T & P_3^T \\ P_2^T & P_4^T \end{bmatrix}\begin{bmatrix} I_n & 0 \\ 0 & 0 \end{bmatrix} \geq 0\,, \quad (28)$$
which gives
$$\begin{bmatrix} P_1 & P_2 \\ 0 & 0 \end{bmatrix} = \begin{bmatrix} P_1^T & 0 \\ P_2^T & 0 \end{bmatrix} \geq 0\,. \quad (29)$$
It is evident that (29) can be satisfied only if
$$P_1 = P_1^T > 0\,, \quad P_2 = P_2^T = 0\,. \quad (30)$$
After simple algebraic operations, (26) can be transformed into the following form
$$\begin{bmatrix} A_e^TP_3 + P_3^TA_e & * \\ P_1 - P_3 + P_4^TA_e & -P_4 - P_4^T \end{bmatrix} < 0 \quad (31)$$
and, owing to the emerging products $P_3^TA_e$, $P_4^TA_e$ in (31), the restriction on the structure of $P_4$ can be enunciated as
$$P_4 = \delta P_3\,, \quad (32)$$
where $\delta > 0$, $\delta \in \mathbb{R}$. Since now
$$P_4^TA_e = \delta P_3^T(A - JC)\,, \quad (33)$$
then, with the notation
$$Y = P_3^TJ\,, \quad (34)$$
(31) implies (14). This concludes the proof.

Remark 1. It is natural to point out that the symmetrical form of Lemma 2, defined for $P_1 = P$, $P_3 = P_3^T = Q$, is an inequality equivalent to the enhanced Lyapunov inequality for Luenberger observer design [11].

The above results can be generalized to formulate the descriptor principle in fault PD observer design. The main reason is to eliminate matrix inverse notations from the design conditions.

4 PD Observer Design

If the observer errors between the system state vector and the observer state vector as well as between the fault vector and the vector of its estimate are defined as follows
$$e_q(t) = q(t) - q_e(t)\,, \quad (35)$$
$$e_f(t) = f(t) - f_e(t)\,, \quad (36)$$
then, for slowly-varying faults, it is reasonable to consider [12]
$$\dot{e}_f(t) = 0 - \dot{f}_e(t) = -MCe_q(t) - NC\dot{e}_q(t)\,. \quad (37)$$
Note, since $f_e(t)$ can be obtained as the integral of $\dot{f}_e(t)$, an adapting parameter matrix $G$ can be adjusted interactively to set the amplitude of $f_e(t)$, i.e., as a result it is
$$f_e(t) = G\int_0^t \dot{f}_e(\tau)\,\mathrm{d}\tau\,. \quad (38)$$
To express the time derivative of the system state error $e_q(t)$, the equations (1), (4) together with (2), (5) can be integrated as
$$\dot{e}_q(t) = A_ee_q(t) + Fe_f(t) - LC\dot{e}_q(t)\,, \quad (39)$$
where $A_e$ is given in (16) and the PD observer system matrix is
$$A_{PDe} = (I_n + LC)^{-1}A_e = (I_n + LC)^{-1}(A - JC)\,. \quad (40)$$
Since (37), (39) can be rewritten in the following composed form
$$\begin{bmatrix} \dot{e}_q(t) \\ \dot{e}_f(t) \end{bmatrix} = \begin{bmatrix} A_e & F \\ -MC & 0 \end{bmatrix}\begin{bmatrix} e_q(t) \\ e_f(t) \end{bmatrix} - \begin{bmatrix} LC & 0 \\ NC & 0 \end{bmatrix}\begin{bmatrix} \dot{e}_q(t) \\ \dot{e}_f(t) \end{bmatrix}\,, \quad (41)$$
then, by denoting
$$e^{\circ T}(t) = \begin{bmatrix} e_q^T(t) & e_f^T(t) \end{bmatrix}\,, \quad (42)$$
$$A^\circ = \begin{bmatrix} A & F \\ 0 & 0 \end{bmatrix}, \quad J^\circ = \begin{bmatrix} J \\ M \end{bmatrix}, \quad L^\circ = \begin{bmatrix} L \\ N \end{bmatrix}\,, \quad (43)$$
$$I^\circ = \begin{bmatrix} I_n & 0 \\ 0 & I_p \end{bmatrix}, \quad C^\circ = \begin{bmatrix} C & 0 \end{bmatrix}\,, \quad (44)$$
where $A^\circ, I^\circ \in \mathbb{R}^{(n+p)\times(n+p)}$, $J^\circ, L^\circ \in \mathbb{R}^{(n+p)\times m}$, $C^\circ \in \mathbb{R}^{m\times(n+p)}$, the equation (41) can be written as
$$(I^\circ + L^\circ C^\circ)\dot{e}^\circ(t) = (A^\circ - J^\circ C^\circ)e^\circ(t)\,, \quad (45)$$
$$A_e^\circ e^\circ(t) - D_e^\circ\dot{e}^\circ(t) = 0\,, \quad (46)$$
respectively, where
$$A_e^\circ = A^\circ - J^\circ C^\circ\,, \quad D_e^\circ = I^\circ + L^\circ C^\circ\,. \quad (47)$$
Introducing the equality
$$\dot{e}^\circ(t) = \dot{e}^\circ(t)\,, \quad (48)$$
in analogy to the equation (18), then (48), (46) can be written as
$$\begin{bmatrix} I^\circ & 0 \\ 0 & 0 \end{bmatrix}\begin{bmatrix} \dot{e}^\circ(t) \\ \ddot{e}^\circ(t) \end{bmatrix} = \begin{bmatrix} \dot{e}^\circ(t) \\ 0 \end{bmatrix} = \begin{bmatrix} 0 & I^\circ \\ A_e^\circ & -D_e^\circ \end{bmatrix}\begin{bmatrix} e^\circ(t) \\ \dot{e}^\circ(t) \end{bmatrix}\,. \quad (49)$$
Thus, by denoting
$$E^\bullet = \begin{bmatrix} I^\circ & 0 \\ 0 & 0 \end{bmatrix}, \quad A_e^\bullet = \begin{bmatrix} 0 & I^\circ \\ A_e^\circ & -D_e^\circ \end{bmatrix}, \quad e^\bullet(t) = \begin{bmatrix} e^\circ(t) \\ \dot{e}^\circ(t) \end{bmatrix}\,, \quad (50)$$
the obtained descriptor form of the PD fault observer is
$$E^\bullet\dot{e}^\bullet(t) = A_e^\bullet e^\bullet(t)\,, \quad (51)$$
where $A_e^\bullet, E^\bullet \in \mathbb{R}^{2(n+p)\times 2(n+p)}$.

The following solvability theorem is proposed to design the PD fault observer in the structure proposed in (4)-(6).

Theorem 1. The PD fault observer (4)-(6) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $P_1^\circ \in \mathbb{R}^{(n+p)\times(n+p)}$, a regular matrix $P_3^\circ \in \mathbb{R}^{(n+p)\times(n+p)}$ and matrices $Y^\circ \in \mathbb{R}^{(n+p)\times m}$, $Z^\circ \in \mathbb{R}^{(n+p)\times m}$ such that
$$P_1^\circ = P_1^{\circ T} > 0\,, \quad (52)$$
$$\begin{bmatrix} A^{\circ T}P_3^\circ + P_3^{\circ T}A^\circ - Y^\circ C^\circ - C^{\circ T}Y^{\circ T} & * \\ V_{21}^\circ & V_{22}^\circ \end{bmatrix} < 0\,, \quad (53)$$
where
$$V_{21}^\circ = P_1^\circ - P_3^\circ + \delta P_3^{\circ T}A^\circ - \delta Y^\circ C^\circ - C^{\circ T}Z^{\circ T}\,, \quad (54)$$
$$V_{22}^\circ = -\delta P_3^\circ - \delta P_3^{\circ T} - \delta Z^\circ C^\circ - \delta C^{\circ T}Z^{\circ T}\,. \quad (55)$$
If the above conditions hold, the set of observer gain matrices is given by the equations
$$J^\circ = (P_3^{\circ T})^{-1}Y^\circ\,, \quad L^\circ = (P_3^{\circ T})^{-1}Z^\circ \quad (56)$$
and the matrices $J$, $L$, $M$, $N$ can be separated with respect to (43).
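Equations (37), (38) translate directly into a discrete-time estimator update: the output-error terms drive the fault-estimate derivative, which is integrated and scaled by the adapting matrix G. A forward-Euler sketch for the scalar case p = m = 1 (our own illustration of the update rule, not the authors' implementation; all values are hypothetical):

```python
def fault_estimate(ey_samples, dey_samples, M, N, G, dt):
    """Forward-Euler realization of (37)-(38) for scalar output/fault:
    fe_dot = M*ey + N*dey, fe(t) = G * integral of fe_dot.

    ey_samples, dey_samples: sampled output error and its derivative.
    Returns the fault-estimate trajectory.
    """
    integral = 0.0
    history = []
    for ey, dey in zip(ey_samples, dey_samples):
        integral += (M * ey + N * dey) * dt   # accumulate fe_dot
        history.append(G * integral)          # scale by the adapting gain G
    return history
```

With a constant output error the estimate ramps linearly, which is why G must be tuned against the expected fault amplitude, as the example in Sec. 5 does interactively.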
Proof. Defining the Lyapunov function of the form
$$v(e^\bullet(t)) = e^{\bullet T}(t)E^{\bullet T}P^\bullet e^\bullet(t) > 0\,, \quad (57)$$
where
$$E^{\bullet T}P^\bullet = P^{\bullet T}E^\bullet \geq 0\,, \quad (58)$$
then, using the property (58), the time derivative of (57) along the trajectory of (51) becomes
$$\dot{v}(e^\bullet(t)) = \dot{e}^{\bullet T}(t)E^{\bullet T}P^\bullet e^\bullet(t) + e^{\bullet T}(t)P^{\bullet T}E^\bullet\dot{e}^\bullet(t) < 0\,. \quad (59)$$
Thus, substituting (51) into (59), it yields
$$\dot{v}(e^\bullet(t)) = e^{\bullet T}(t)(P^{\bullet T}A_e^\bullet + A_e^{\bullet T}P^\bullet)e^\bullet(t) < 0\,, \quad (60)$$
which implies
$$P^{\bullet T}A_e^\bullet + A_e^{\bullet T}P^\bullet < 0\,. \quad (61)$$
Defining the Lyapunov matrix
$$P^\bullet = \begin{bmatrix} P_1^\circ & P_2^\circ \\ P_3^\circ & P_4^\circ \end{bmatrix}\,, \quad (62)$$
in analogy with (29), then (58) implies
$$P_1^\circ = P_1^{\circ T} > 0\,, \quad P_2^\circ = P_2^{\circ T} = 0 \quad (63)$$
and, using (50) and (62), (63) in (61), it yields
$$\begin{bmatrix} 0 & A_e^{\circ T} \\ I^\circ & -D_e^{\circ T} \end{bmatrix}\begin{bmatrix} P_1^\circ & 0 \\ P_3^\circ & P_4^\circ \end{bmatrix} + \begin{bmatrix} P_1^\circ & P_3^{\circ T} \\ 0 & P_4^{\circ T} \end{bmatrix}\begin{bmatrix} 0 & I^\circ \\ A_e^\circ & -D_e^\circ \end{bmatrix} < 0\,. \quad (64)$$
After some algebraic manipulations, (64) takes the following form
$$\begin{bmatrix} U_1^\bullet & U_2^{\bullet T} \\ U_2^\bullet & U_3^\bullet \end{bmatrix} < 0\,, \quad (65)$$
where, with the notation (47),
$$U_1^\bullet = (A^\circ - J^\circ C^\circ)^TP_3^\circ + P_3^{\circ T}(A^\circ - J^\circ C^\circ)\,, \quad (66)$$
$$U_2^\bullet = P_4^{\circ T}(A^\circ - J^\circ C^\circ) + P_1^\circ - P_3^\circ - C^{\circ T}L^{\circ T}P_3^\circ\,, \quad (67)$$
$$U_3^\bullet = -P_4^\circ - P_4^{\circ T} - P_4^{\circ T}L^\circ C^\circ - C^{\circ T}L^{\circ T}P_4^\circ\,. \quad (68)$$
By setting
$$P_4^\circ = \delta P_3^\circ\,, \quad Y^\circ = P_3^{\circ T}J^\circ\,, \quad Z^\circ = P_3^{\circ T}L^\circ\,, \quad (69)$$
where $\delta > 0$, $\delta \in \mathbb{R}$, then (65)-(68) imply (53)-(55).

Writing (68) as follows
$$U_3^\bullet = -P_4^{\circ T}(I^\circ + L^\circ C^\circ) - (I^\circ + L^\circ C^\circ)^TP_4^\circ = -R^\bullet \quad (70)$$
and comparing (7) and (65), then, if the inequalities (52)-(53) are satisfied, the Schur complement property (7) applied to (65) implies that $R^\bullet$ is positive definite.

Since $P_4^\circ$ is regular, $(I^\circ + L^\circ C^\circ)$ is also regular and so $A_{PDe}$ given by (40) exists. This concludes the proof.

Since there is no restriction on the structure of $P_3^\circ$ in Theorem 1, it follows that the problem of checking the existence of a stable system matrix of the PD adaptive fault observer in a given matrix space may also be formulated with the symmetric matrices $P_1^\circ = P_3^\circ$. This limit case of the LMI structure design condition, bound to a single symmetric matrix, is given by the following theorem.

Theorem 2. The PD observer (4)-(6) is stable if for a given positive scalar $\delta \in \mathbb{R}$ there exist a symmetric positive definite matrix $Q^\circ \in \mathbb{R}^{(n+p)\times(n+p)}$ and matrices $Y^\circ \in \mathbb{R}^{(n+p)\times m}$, $Z^\circ \in \mathbb{R}^{(n+p)\times m}$ such that
$$Q^\circ = Q^{\circ T} > 0\,, \quad (71)$$
$$\begin{bmatrix} A^{\circ T}Q^\circ + Q^\circ A^\circ - Y^\circ C^\circ - C^{\circ T}Y^{\circ T} & * \\ W_{21}^\circ & W_{22}^\circ \end{bmatrix} < 0\,, \quad (72)$$
where
$$W_{21}^\circ = \delta Q^\circ A^\circ - \delta Y^\circ C^\circ - C^{\circ T}Z^{\circ T}\,, \quad (73)$$
$$W_{22}^\circ = -2\delta Q^\circ - \delta Z^\circ C^\circ - \delta C^{\circ T}Z^{\circ T}\,. \quad (74)$$
If the above conditions are affirmative, the extended observer gain matrices are given by the equations
$$J^\circ = (Q^\circ)^{-1}Y^\circ\,, \quad L^\circ = (Q^\circ)^{-1}Z^\circ\,. \quad (75)$$
Proof. Since there is no restriction on the structure of $P_3^\circ$, it can be set
$$P_1^\circ = P_3^\circ = P_3^{\circ T} = Q^\circ > 0 \quad (76)$$
and the conditioned structure of $P_4^\circ$, with respect to $P_3^\circ$ and $A_e^\circ$, can be taken into account as
$$P_4^\circ = \delta P_3^\circ = \delta Q^\circ\,, \quad (77)$$
where $\delta > 0$, $\delta \in \mathbb{R}$. If these conditions are incorporated into (66)-(68), then
$$P_3^{\circ T}A_e^\circ = Q^\circ(A^\circ - J^\circ C^\circ) = Q^\circ A^\circ - Y^\circ C^\circ\,, \quad (78)$$
$$P_4^{\circ T}L^\circ C^\circ = \delta P_3^\circ L^\circ C^\circ = \delta Q^\circ L^\circ C^\circ = \delta Z^\circ C^\circ\,, \quad (79)$$
where
$$Y^\circ = Q^\circ J^\circ\,, \quad Z^\circ = Q^\circ L^\circ\,. \quad (80)$$
Thus, with these modifications, (65)-(68) imply (72)-(74). This concludes the proof.

Note, the design conditions formulated in Theorem 2 give potentially more conservative solutions.

5 Illustrative Example

The considered system is represented by the model (1), (2) with the model parameters [10]
$$A = \begin{bmatrix} 1.380 & -0.208 & 6.715 & -5.676 \\ -0.581 & -4.290 & 0.000 & 0.675 \\ 1.067 & 4.273 & -6.654 & 5.893 \\ 0.048 & 4.273 & 1.343 & -2.104 \end{bmatrix},$$
$$B = \begin{bmatrix} 0.000 & 0.000 \\ 5.679 & 0.000 \\ 1.136 & -3.146 \\ 1.136 & 0.000 \end{bmatrix}, \quad C = \begin{bmatrix} 4 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
To consider single actuator faults it was set $F = B$, and the matrix variables $Q^\circ$, $Y^\circ$, $Z^\circ$ satisfying (71)-(74) for $\delta = 0.75$ were as follows
$$Q^\circ = \begin{bmatrix} Q_1^\circ & Q_2^\circ \end{bmatrix}\,,$$
$$Q_1^\circ = \begin{bmatrix} 0.1737 & 0.0012 & 0.1409 \\ 0.0012 & 0.1615 & 0.0195 \\ 0.1409 & 0.0195 & 0.1794 \\ -0.1316 & 0.0252 & -0.1439 \\ -0.0118 & -0.1975 & -0.0464 \\ 0.1461 & -0.0026 & 0.1557 \end{bmatrix}\,,$$
−0.1316 −0.0118 0.1461 2.5
0.0252 −0.1975 −0.0026
2
f
1
−0.1439 −0.0464 0.1557
Q◦2 = , f2
0.2177 −0.1136 −0.1255 1.5
−0.1136 1.4490 −0.1904 fe1
f(t), fe(t)
1
−0.1255 −0.1904 1.3479 fe2
0.5
0.1162 −0.0220
−0.0094 0.1404 0
◦ 0.0814 0.1439 −0.5
Y = ,
−0.0719 0.1072 0 20 40 60 80 100 120 140 160 180
0.0060 0.0171 t[s]
0.0003 0.2159
−0.0164 −0.0445 Figure 1: Adaptive fault estimator responses
0.0013 −0.0528
◦ −0.0728 0.1181
Z = ,
0.0678 0.0229
Although many actuator faults can cause the gain to drift,
0.0015 0.1434 in practice the faults lead to an abrupt change in gain [21].
−0.1062 0.1758 To simulate this phenomena, it was assumed that the fault in
actuators for (1) was given by
[ ]
where the SeDuMi package 17 was used to solve given set
of LMIs. 0, t ≤ tsa ,
The PD observer extended matrix gains are then com-
tsb −t
fh
(t − tsa ), t sa < tsb ,
puted using (56) as f (t) =
sa
fh , tsb ≤ tca ,
0.8777 −1.5720
− fh (t − tcb ), tca < tcb ,
tcb −tca
−0.0801 0.5621 0, t ≥ tcb ,
◦ −0.0624 3.7385
J = ,
0.1229 2.2486
where, analyzing the single first actuator fault estimation, it
−0.0021 0.3934 was set
−0.0767 0.1649
fh = 2, tsa = 30s, tsb = 35s, tea = 65s, teb = 70s ,
0.7391 −2.0549
0.0663 −0.6605 and for the single second actuator fault these parameters
were
−0.7915 3.4010
L◦ = .
0.1994 1.3731
−0.0000 0.2244 fh = 2, tsa = 100s, tsb = 105s, tea = 135s, teb = 140s .
−0.0488 0.1187
It is demonstrates that for equal fh in the first and the sec-
Verifying the PD observer system matrix eigenvalue spec- ond actuator faults it is possible for given B to adjust the
trum, the results were common adapting parameter matrix G in (38) as follows
ρ(Ae) = { −0.7731, −2.8914, −4.7816, −8.9188 },
ρ(APDe) = { −1.1194, −1.6912, −1.9969, −2.9765 }.
That means the PD observer is stable and its "P" part is stable, too. Moreover, the descriptor form (45) of the PD observer is also stable, where
ρ((I° + L°C°)⁻¹(A° − J°C°)) = { −1.7763, −2.0966, −0.6629 ± 0.7872 i, −1.3632 ± 0.4931 i }.
Comparing with a solution of (52)-(55) for δ = 0.95, it is possible to verify that in this case
ρ(Ae) = { −6.8230, −10.3876, −81.5789, −472.0230 },
ρ(APDe) = { −0.9562, −0.9774, −7.2561, −9.8300 },
ρ((I° + L°C°)⁻¹(A° − J°C°)) = { −1.0240, −1.0748, −6.4810, −9.1501, −0.9650 ± 0.0068 i },
which implies in this case a faster dynamics of the descriptor form of the PD observer but a slower one for the PD observer. Note that using δ = 0.75 leads in this case to an unstable "P" part of the PD observer.

G = | 40.0   5.9 |
    |  5.9  22.0 | .

The obtained results are illustrated in Fig. 1 where, purely for rendering, all fault responses and their estimates were combined into a single image, so the demonstration cannot be seen as a progressive sequence of single faults in the actuator system. This figure presents the fault signals, as well as their estimates, reflecting the first actuator fault starting at the time instant t = 30 s and applied for 40 s, and then the fault of the second actuator beginning at the time instant t = 100 s and lasting for 40 s. The presented simulation was carried out in the system autonomous mode; practically the same results were obtained for the forced regime of the system.
The adapting parameter G and the tuning parameter δ were set interactively, considering the maximal value of the fault signal amplitude fh and the fault observer dynamics. It can be seen that there are very small differences between the signals reflecting single actuator faults and the observer approximations for slowly varying piecewise constant actuator faults. The principle can be used directly in control structures with fault compensation [4], but cannot be directly used to localize actuator faults [14].
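The stability statements above amount to checking that each reported spectrum lies strictly in the open left half of the complex plane. A minimal numerical sketch of that check (Python with NumPy rather than the Matlab-style environment the paper implies; the 2x2 matrices are illustrative placeholders, not the paper's system matrices):

```python
import numpy as np

def is_hurwitz(A):
    """A matrix of error dynamics is stable iff its spectrum rho(A)
    lies strictly in the open left half of the complex plane."""
    return bool(np.all(np.linalg.eigvals(A).real < 0))

# Illustrative placeholders (NOT the paper's observer matrices):
A_stable = np.array([[-1.7763, 0.5],
                     [0.0, -2.0966]])
A_unstable = np.array([[0.1, 0.0],
                       [0.0, -1.0]])
print(is_hurwitz(A_stable), is_hurwitz(A_unstable))  # True False
```

The same test applies to each matrix whose spectrum is listed above: all reported eigenvalues have negative real parts, hence the stated stability.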
Proceedings of the 26th International Workshop on Principles of Diagnosis
6 Concluding Remarks
Based on the descriptor system approach, a new PD fault observer design method for continuous-time linear systems and slowly-varying actuator faults is introduced in the paper. The presented version is derived in terms of optimization over LMI constraints, using standard LMI numerical procedures to manipulate the fault observer stability and the fault estimation dynamics. Formulated in the sense of the second Lyapunov method and expressed through LMIs, the design conditions guarantee the asymptotic convergence of the state as well as of the fault estimation errors. The numerical simulation results show good estimation performance.

Acknowledgments
The work presented in this paper was supported by VEGA, the Grant Agency of the Ministry of Education and the Academy of Science of Slovak Republic, under Grant No. 1/0348/14. This support is very gratefully acknowledged.

References
[1] M. Blanke, M. Kinnaert, J. Lunze, and M. Staroswiecki. Diagnosis and Fault-Tolerant Control. Springer-Verlag, Berlin, 2006.
[2] W. Chen and M. Saif. Observer-based strategies for actuator fault detection, isolation and estimation for certain class of uncertain nonlinear systems. IET Control Theory & Applications, 1(6):1672-1680, 2007.
[3] S. Ding. Model-Based Fault Diagnosis Techniques. Design Schemes, Algorithms, and Tools. Springer-Verlag, Berlin, 2008.
[4] A. Filasová and D. Krokavec. Design principles of active robust fault tolerant control systems. In Robust Control. Theory and Applications, A. Bartoszewicz (Ed.), InTech, Rijeka, 309-338, 2011.
[5] E. Fridman and U. Shaked. A descriptor system approach to H∞ control of linear time-delay systems. IEEE Transactions on Automatic Control, 47(2):253-270, 2002.
[6] Z. Gao. PD observer parametrization design for descriptor systems. Journal of the Franklin Institute, 342(5):551-564, 2005.
[7] Z. Gao and S.X. Ding. Fault estimation and fault-tolerant control for descriptor systems via proportional, multiple-integral and derivative observer design. IET Control Theory & Applications, 1(5):1208-1218, 2007.
[8] J. Guo, X. Huang, and Y. Cui. Design and analysis of robust fault detection filter using LMI tools. Computers and Mathematics with Applications, 57(11-12):1743-1747, 2009.
[9] K. Ilhem, J. Dalel, B.H.A. Saloua, and A.M. Naceur. Observer design for Takagi-Sugeno descriptor system with Lipschitz constraints. International Journal of Instrumentation and Control Systems, 2(2):13-25, 2012.
[10] J. Kautsky, N.K. Nichols, and P. Van Dooren. Robust pole assignment in linear state feedback. International Journal of Control, 41(5):1129-1155, 1985.
[11] D. Krokavec and A. Filasová. On observer design methods for a class of Takagi-Sugeno fuzzy systems. Proceedings of the 3rd International Conference on Control, Modelling, Computing and Applications CMCA 2014, Dubai, UAE, 1-11, 2014.
[12] D. Krokavec and A. Filasová. Design of PD observer-based fault estimator for a class of Takagi-Sugeno descriptor systems. The 1st IFAC Conference on Modelling, Identification and Control of Nonlinear Systems MICNON 2015, Saint-Petersburg, Russia, 2015. (in press)
[13] G. Lu, D.W.C. Ho, and Y. Zheng. Observers for a class of descriptor systems with Lipschitz constraint. Proceedings of the 2004 American Control Conference, Boston, MA, USA, 3474-3479, 2004.
[14] N. Meskin and K. Khorasani. Fault detection and isolation of actuator faults in overactuated systems. Proceedings of the 2007 American Control Conference, New York City, NY, USA, 2527-2532, 2007.
[15] N. Meskin and K. Khorasani. Actuator fault detection and isolation for a network of unmanned vehicles. IEEE Transactions on Automatic Control, 54(4):835-840, 2009.
[16] R.J. Patton and S. Klinkhieo. Actuator fault estimation and compensation based on an augmented state observer approach. Proceedings of the 48th IEEE Conference on Decision and Control, Shanghai, P.R. China, 8482-8487, 2009.
[17] D. Peaucelle, D. Henrion, Y. Labit, and K. Taitz. User's Guide for SeDuMi Interface. LAAS-CNRS, Toulouse, France, 2002.
[18] J. Ren and Q. Zhang. PD observer design for descriptor systems. An LMI approach. International Journal of Control, Automation, and Systems, 8(4):735-740, 2010.
[19] F. Shi and R.J. Patton. Simultaneous state and fault estimation for descriptor systems using an augmented PD observer. Preprints of the 19th IFAC World Congress, Cape Town, South Africa, 8006-8011, 2014.
[20] J.G. VanAntwerp and R.D. Braatz. A tutorial on linear and bilinear matrix inequalities. Journal of Process Control, 10:363-385, 2000.
[21] H. Wang and S. Daley. Actuator fault diagnosis. An adaptive observer-based technique. IEEE Transactions on Automatic Control, 41(7):1073-1078, 1996.
[22] F. Zhang, G. Liu, and L. Fang. Actuator fault estimation based on adaptive H∞ observer technique. Proceedings of the 2009 IEEE International Conference on Mechatronics and Automation, Changchun, P.R. China, 352-357, 2009.
[23] K. Zhang, B. Jiang, and V. Cocquempot. Adaptive observer-based fast fault estimation. International Journal of Control, Automation, and Systems, 6(3):320-326, 2008.
[24] K. Zhang, B. Jiang, and P. Shi. Fast fault estimation and accommodation for dynamical systems. IET Control Theory & Applications, 3(2):189-199, 2009.
[25] K. Zhang, B. Jiang, and P. Shi. Observer-Based Fault Estimation and Accommodation for Dynamic Systems. Springer-Verlag, Berlin, 2013.
Chronicle based alarm management in startup and shutdown stages
John W. Vasquez¹,³,⁴, Louise Travé-Massuyès¹,², Audine Subias¹,³, Fernando Jimenez⁴ and Carlos Agudelo⁵
¹ CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
² Univ de Toulouse, LAAS, F-31400 Toulouse, France
³ Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
⁴ Universidad de los Andes, Colombia
⁵ ECOPETROL ICP, Colombia
e-mail: jwvasque@laas.fr, louise@laas.fr, subias@laas.fr, fjimenez@uniandes.edu.co, carlos.agudelo@ecopetrol.com.co
Abstract

The transitions between operational modes (startup/shutdown) in chemical processes generate alarm floods and cause critical alarm saturation. We propose in this paper an approach to alarm management based on a diagnosis process. This diagnosis step relies on situation recognition to provide the operators with relevant information on the failures inducing the alarm flows. The situation recognition is based on chronicle recognition. We propose to use the information issued from the modeling of the system to generate temporal runs from which the chronicles are extracted. An illustrative example in the field of petrochemical plants ends the article.

1 Introduction
The petrochemical industry's losses have been estimated at 20 billion dollars each year in the U.S. alone, and AEM (Abnormal Events Management) has been classified as a problem that needs to be solved. Hence alarm management is one of the aspects of great interest in the safety planning of the different plants. In process state transitions such as startup and shutdown stages, the alarm flood increases and generates critical conditions in which the operator does not respond efficiently, so a dynamic alarm management is required [1]. Currently, many fault detection and diagnosis techniques for multimode processes have been proposed; however, these techniques cannot indicate fundamental faults in the basic alarm system [2]. On the other hand, the technical report "Advance Alarm System Requirements" of EPRI (The Electric Power Research Institute) suggests a cause-consequence and event-based processing. In this perspective, diagnosis approaches based on complex event processing or situation recognition are interesting issues. Therefore, in this paper, a dynamic alarm management strategy is proposed in order to deal with alarm floods happening during transitions of chemical processes. This approach relies on situation recognition (i.e. chronicle recognition). As the efficiency of alarm management approaches depends on the operator expertise and process knowledge, our final objective is to develop a diagnosis approach as a decision tool for operators. The paper is divided into 6 sections. Section 2 gives an overview of the relevant literature. Section 3 concerns the modeling of the system. Section 4 is about the chronicle principle and the temporal runs used for the chronicle design. Section 5 is devoted to the chronicle generation. Finally, an illustrative application on real data from a petrochemical plant is given in Section 6.

2 State of the art: Alarm management
Alarm management has recently focused the attention of many researchers on themes such as:

Alarm historian visualization and analysis: A combined analysis of plant connectivity and alarm logs to reduce the number of alerts in an automation system was presented by [3]; the aim of that work is to reduce the number of alerts presented to the operator: if alarms are related to one another, those alarms should be grouped and presented as one alarm problem. Graphical tools for routine assessment of industrial alarm systems were proposed by [4], who presented two new alarm data visualization tools for the performance evaluation of alarm systems, known as the high density alarm plot (HDAP) and the alarm similarity color map (ASCM). Event correlation analysis and a two-layer cause-effect model were used to reduce the number of alarms in [5]. A Bayesian method has been introduced for multimode process monitoring in [6]. This type of techniques helps to recognize alarm chattering, group many alarms or estimate the alarm limits in transition stages, but the time and the procedure actions are not included.

Process data-based alarm system analysis and rationalization: The evaluation of plant alarm systems by behavior simulation using a virtual subject was proposed by [7]. [8] introduced a technique for optimal design of alarm limits by analyzing the correlation between process variables and alarm variables. In 2009, a framework based on the receiver operating characteristic (ROC) curve was proposed to optimally design alarm limits, filters, dead bands, and delay timers; this work was presented in [9], and a dynamic risk analysis methodology that uses alarm databases to improve process safety and product quality was presented in [10]. In [11] the Gaussian mixture model was employed to extract a series of operating modes from the historical process data, and then the local statistic and its normalized contribution chart were derived for detecting abnormalities early and for isolating faulty variables. We can see that virtual subjects could be applied to probe the alarm system, and that historical information about the alarm behavior can be used for detecting abnormalities. The problem arises when the simulation requires a lot of time to probe the totality of scenarios, and when new plants do not yet have historical data.
Plant connectivity and process variable causality analysis (causal methods): In the literature, transition monitoring of chemical processes has been reported by many researchers. In [12] a fault diagnosis strategy for the startup process based on standard operating procedures was presented; this approach proposes a behavior observer combined with dynamic PCA (Principal Component Analysis) to estimate process faults and operator errors at the same time. In [13] a framework for managing transitions in chemical plants was presented, where a trend analysis-based approach for locating and characterizing the modes and transitions in historical data is proposed. Finally, in [14] a hybrid model-based framework was used for alarm anticipation, where the user can prepare for the possibility of a single alarm occurrence. For transition monitoring, these types of techniques are the most used in industrial processes, and the hybrid model-based framework could be a good representation of our system. We can observe that a causal model allows identifying the root of the failures and checking the correct evolution in a transitional stage. Our proposal is closer to the third type of approach and seeks to exploit the causal relationships, as presented in the next sections.

3 Representation of the system

3.1 Hybrid Causal Model
The hybrid system is represented by an extended transition system [15], whose discrete states represent the different modes of operation, for which the continuous dynamics are characterized by a qualitative domain. Formally, a hybrid causal system is defined as the tuple:

(X, D, Conf, Tr, E, CSD, Init)   (1)

Where
• X = {vi} is a set of continuous process variables which are functions of time t.
• D is a set of discrete variables, D = Q ∪ K ∪ VQ. Q is a set of states qi of the transition system which represent the system operation modes. The set of auxiliary discrete variables K = {Ki, i = 1, ..., nc} represents the system configuration in each mode qi, as defined below by Conf(qi). VQ = {Vi} is a set of qualitative variables whose values are obtained from the behavior of each continuous variable vi.
• Conf(qi): Q → ⊗i D(Ki), where ⊗ is the Cartesian product and D(Ki) is the domain of Ki ∈ K, provides the configuration associated to the mode, i.e. the modes of the underlying multimode components (typically, a valve has two normal modes, opened and closed).
• E = Σ ∪ Σc is a finite set of event types noted σ, where:
  – Σ is the set of event types associated to the procedure actions in the startup or shutdown stages.
  – Σc is the set of event types associated to the behavior of the continuous process variables.
• Tr: Q × Σ → Q is the transition function. The transition from mode qi to mode qj with associated event σ is noted (qi, σ, qj) or qi →σ qj. We assume, without loss of generality, that the model is deterministic, i.e. whenever qi →σ qj and qi →σ qk then qj = qk, for each (qi, qj, qk) ∈ Q³ and each σ ∈ Σ.
• CSD = ∪i CSDi is the Causal System Description, i.e. the causal model used to represent the constraints underlying the continuous dynamics of the hybrid system. Every CSDi, associated to a mode qi, is given by a graph Gc = (X ∪ K, I). I is the set of influences: there is an edge e(vi, vj) ∈ I from vi ∈ X to vj ∈ X if the variable vi influences the variable vj. Then, the vertices represent the variables, the edges represent the influences between variables, and each edge is associated with a component of the system. The set of components is noted COMP.
• Init is the initial condition of the hybrid system.

3.2 Qualitative abstraction of continuous behavior
In each mode of operation, variables evolve according to the corresponding dynamics. This evolution is represented with qualitative values. The domain D(Vi) of a qualitative variable Vi ∈ VQ is obtained through the function fqual: D(vi) → D(Vi) that maps the continuous values of variable vi to ranges defined by limit values (High Hi and Low Li):

f(vi)qual =
  ViH  if vi ≥ Hi ∧ dvi/dt > 0
  ViM  if (vi < Hi ∧ dvi/dt < 0) ∨ (vi ≥ Li ∧ dvi/dt > 0)
  ViL  if vi < Li ∧ dvi/dt < 0                               (2)

dvi/dt > 0 represents that the continuous variable vi is increasing and dvi/dt < 0 that it is decreasing. The behavior of these qualitative variables is represented in Figure 1 by the graph GVi = (VQ, Σc, δ), where VQ is the set of the possible qualitative states (ViL: Low, ViM: Medium, ViH: High) of the continuous variable vi, Σc is the finite set of events associated to the transitions, and δ: VQ × Σc → VQ is the transition function.

Figure 1: Qualitative values of the process variables

The corresponding event generator is defined by the abstraction function fVQ→Σc:

fVQ→Σc : VQ × (VQ, Σc) → Σc
∀Vi ∈ VQ, (Vin, Vim) →
  l+(vi)  if ViL → ViM
  l−(vi)  if ViM → ViL
  h+(vi)  if ViM → ViH
  h−(vi)  if ViH → ViM
with Vin, Vim ∈ {ViL, ViM, ViH}                              (3)

Σc = ∪vi∈X {l+(vi), l−(vi), h+(vi), h−(vi)}                  (4)
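The abstraction of Eq. (2) and the event generator of Eqs. (3)-(4) can be sketched as executable code. The Python fragment below is an illustrative reading, not the authors' implementation (the paper's simulations are done in Matlab); the function names, the 'L'/'M'/'H' encoding of ViL/ViM/ViH and the (value, derivative) sample format are assumptions:

```python
def f_qual(v, dvdt, lo, hi):
    """Qualitative abstraction of Eq. (2): map a continuous value and the
    sign of its derivative to 'L' (Low), 'M' (Medium) or 'H' (High).
    Branch order resolves the overlapping conditions of the equation."""
    if v >= hi and dvdt > 0:
        return "H"
    if v < lo and dvdt < 0:
        return "L"
    if (v < hi and dvdt < 0) or (v >= lo and dvdt > 0):
        return "M"
    return None  # dvdt == 0: Eq. (2) generates no qualitative value

# Event generator of Eqs. (3)-(4): each qualitative state change emits one event.
EVENTS = {("L", "M"): "l+", ("M", "L"): "l-",
          ("M", "H"): "h+", ("H", "M"): "h-"}

def gen_events(name, samples, lo, hi):
    """Emit dated events along a sampled trajectory of (value, derivative)."""
    states = [f_qual(v, d, lo, hi) for v, d in samples]
    events, prev = [], states[0]
    for t, s in enumerate(states[1:], start=1):
        if s is not None and s != prev:
            e = EVENTS.get((prev, s))
            if e:
                events.append((f"{e}({name})", t))
            prev = s
    return events

# A variable rising through the High limit, then draining below Low:
traj = [(3, +1), (6, +1), (9, +1), (9.5, +1), (6, -1), (1, -1)]
print(gen_events("v1", traj, lo=2, hi=8))
# [('l+(v1)' is not emitted here; the run starts in 'M')]
```

Running the example yields the event sequence h+(v1) at t=2, h−(v1) at t=4 and l−(v1) at t=5, i.e. exactly the Σc events a chronicle over this variable would observe.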
3.3 Automatic derivation of the causal model
To obtain the causal model of a system in a given operating mode implies collecting the equations that represent the behavior of the system in this mode. The theory of causal ordering, issued from the Qualitative Reasoning community, can be applied to obtain automatically the causal structure associated to a set of equations. Associating activation conditions to the equations extends the causal ordering to systems with several operating modes [16]. These activation conditions can then be attached to the influences of the resulting causal graph. The proposed algorithm, implemented in the Causalito software, makes use of conditions that avoid recomputing a totally new perfect matching for every operating mode, thus reducing the computational cost. In this work, the Causal System Description is given by CSD = (X, I), where each influence of I is labeled with:
• an activation condition indicating the modes in which it is active (or no label if it is active in all modes),
• the corresponding equation,
• the component whose behavior is expressed by the equation.
In the following section we expose the principle of the chronicle generation, where concepts such as event, chronicle and temporal run are described.

4 Chronicles

4.1 Events and chronicles
Let us consider time as a linearly ordered discrete set of instants. The occurrence of different events in time represents the system dynamics, and a model can be determined to diagnose the correct evolution. An event is defined as a pair (σi, ti), where σi ∈ E is an event type and ti is a variable of integer type called the event date. We define E as the set of all event types, and a temporal sequence on E is an ordered set of events denoted S = ⟨(σi, ti)⟩j with j ∈ Nl, where l is the size of the temporal sequence S and Nl is a finite set of linearly ordered time points of cardinal l. A chronicle is a triplet C = (ξ, CT, G) such that ξ ⊆ E, CT is the set of temporal constraints, and G = (N, It) is a directed graph where the nodes N represent event types of E and the arcs It represent the relationships between events σ ∈ E: if the event σ1 occurs t time units after σ2, then there exists a directed link from σ1 to σ2 associated with a time constraint. Considering two events (σi, ti) and (σj, tj), we define the time interval as the pair τij = [t−, t+], τij ∈ CT, corresponding to the lower and upper bounds on the temporal distance between the two event dates ti and tj [17]. The idea of our proposal is to design the chronicles from the hybrid causal model of the system. Indeed, the evolution of the system can be captured with temporal runs from which chronicles can be learned (see Figure 2). More precisely, the system initiates in the state q0 and evolves according to the transitions resulting from the events defined by the procedure actions for specific scenarios (startup/shutdown). For a given system mode qi ∈ Q, the associated CSDi is used to generate the set of event types corresponding to the evolution of the continuous process variables. A run is defined by a sequence of event types α1, α2, ..., αn, where αi ∈ E, generated for each scenario using the startup/shutdown procedures. These runs with time constraints permit to construct the chronicle database of the system. In this preliminary approach, time constraints are obtained by simulation.

Figure 2: Principle of chronicle generation

4.2 Temporal runs
We denote a temporal run as ⟨R, T⟩, where R is a run and T is the time graph of the run, which includes the time constraints CT between each pair of time points at which the event types must occur. Figure 3 gives time graph examples and the possible composition of time graphs.

Figure 3: Time graphs example

In our approach the runs are issued from the system evolution from one operation mode to another. The interleaved sequence of event types α1, α2, ..., αn represents the procedure actions and the behavior evolution of the process variables. The time constraints between each pair of event types are determined by simulation of the continuous behavior of each process variable, responding to the procedure actions.

5 Generation of Chronicles

5.1 Chronicle database
An industrial or complex process Pr is composed of different areas, Pr = {Ar1, Ar2, ..., Arn}, where each area Ark has different operational modes such as startup, shutdown, slow march, fast march, etc. The set CArk of chronicles Cijk for each area Ark is presented in the matrix below, where the rows represent the operating modes (i.e. O1: Startup, O2: Shutdown, O3: Startuptype, O4: Startuptype, etc.)
and the columns the different faults:

           N      f1     f2    ...   fn
  O1   [ C01k   C11k   C21k   ...  Ci1k ]
  O2   [ C02k   C12k   C22k   ...  Ci2k ]
  O3   [ C03k   C13k   C23k   ...  Ci3k ]
CArk = O4 [ C04k   C14k   C24k   ...  Ci4k ]
  ...  [ ...    ...    ...    ...  ...  ]
  Oj   [ C0jk   C1jk   C2jk   ...  Cijk ]          (5)

The chronicle database used for diagnosis is composed of the entries of all the matrices {CArk}. This chronicle database is submitted to a chronicle recognition system that identifies, in an observable flow of events, all the possible matchings with the set of chronicles, from which the situation (normal or faulty) can be assessed.

5.2 Chronicle learning
As explained previously, when the system changes mode of operation, a set of event types occurs, forming a run R. As this evolution is due to procedure actions, not only a unique temporal run can occur. Hence, we need to set up the maximal number of temporal runs that could occur in each scenario represented in the matrix (5). To obtain the chronicle in each scenario, it is necessary to obtain the largest time graph, with as many event types as possible and with the minimal values of the constraints. [18] proposes to determine the chronicles from the temporal runs. They define a partial order relation between two temporal runs as ⟨R, T⟩ ≤ ⟨R′, T′⟩ when the set of event types in R′ is a subset of the event types in R and the time graphs T and T′ are related by T ≤ T′, determining the resulting graph in which there exists a unique equivalent constraint that is minimal. The relation expresses that the set of constraints in the time graph T′ is a subset of the constraints in T, CT(t, t′) ⊆ CT′(t, t′). Therefore, we apply the composition (see Figure 3) between the time graphs in order to merge the constraints, obtaining the larger and more constrained time graph that represents the chronicle in that scenario. Figure 4 gives an example of a chronicle generation from a maximal temporal run.

Figure 4: Chronicle example

In the next section a case study is presented in which the chronicle generation from the temporal runs is illustrated.

6 Case study

6.1 HTG (Hydrostatic Tank Gauging) system
In the Cartagena Refinery, new units and elements are currently being implemented. In the startup stage, the operators will need a tool that helps them recognize dangerous conditions. We will analyze the startup and shutdown stages in the water injection unit. This process is an HTG (Hydrostatic Tank Gauging) system composed of the following components: one tank (TK), two normally closed valves (V1 and V2), one pump (Pu), a level sensor (LT), a pressure sensor (PT), an inflow sensor (FT1) and an outflow sensor (FT2), see Figure 5.

Figure 5: Process diagram

Assuming this system is modeled as a hybrid causal model, the underlying discrete event system and the different process operation modes are described in Figure 6, where we can see a possible correct evolution for the startup procedure. The events V1c,o and V2c,o represent the valves V1 and V2 moving from the state closed to the state opened; the events V1o,c and V2o,c represent, on the contrary, the valves moving from the state opened to the state closed. The event Puf,n indicates that the pump Pu is turned on and the event Pun,f indicates that the pump Pu is turned off.

6.2 Identification of causal relationships
The level (L) in the tank is related to the weight (m) of the liquid inside, its density (ρ) and the tank area (A). The density (ρ) is the relationship of the pressures (Pmed, Pinf) measured at points separated by a distance (h). Based on the global material balance, we define that the input flow is equal to the outlet flow. Then, the variation of the weight (dm(t)/dt) in the tank is proportional to the difference between the inflow (QiTK) and the outflow (QoV2). The differential pressures across the pump and across V2 are specified as ΔPPu and ΔPV2. The outlet pressure of the pump (Po) is related to the tank outlet flow (QoTK), the revolutions per minute of the pump (RPMPu), its capacity (C) and the radius of the outlet pipe (r). The outflow (QoV2) and inflow (QiTK) controls are related to the percentage aperture of the valves V1 (LV1) and V2 (LV2) and the differential pressures (ΔPV1, ΔPV2). In Figure 7 we can see the CSD of the system in the modes q1, q5 and q7. For example, the mode q1 activates the influence of QiTK on L. The mode q5 activates the influence of QiTK on L and the influence of L on Po, and finally the mode q7 activates the influences of QiTK on L, of L on Po and of Po on QoV2.
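The time-graph operations of Sections 5.1-5.2 can be read as interval bookkeeping: learning composes the time graphs of several runs, keeping for each shared pair of event types the unique tightest equivalent constraint (here read as interval intersection), and recognition checks an observed dated sequence against the merged constraints. The sketch below is a hedged toy version under that reading (the dict-of-intervals representation is an assumption, and a real recognizer processes the event flow incrementally with partial instances):

```python
# A time graph maps ordered pairs of event types to [lo, hi] constraints
# on the distance between their occurrence dates.

def compose(T1, T2):
    """Merge two time graphs (Sec. 5.2): shared pairs keep the tightest
    equivalent constraint, i.e. the intersection of the two intervals."""
    merged = dict(T1)
    for pair, (lo2, hi2) in T2.items():
        lo1, hi1 = merged.get(pair, (lo2, hi2))
        lo, hi = max(lo1, lo2), min(hi1, hi2)
        if lo > hi:
            raise ValueError(f"inconsistent constraints on {pair}")
        merged[pair] = (lo, hi)
    return merged

def matches(time_graph, sequence):
    """Recognition check (Sec. 5.1): every constrained pair of event types
    must occur with a temporal distance inside its interval."""
    dates = dict(sequence)  # event type -> occurrence date
    return all(a in dates and b in dates
               and lo <= dates[b] - dates[a] <= hi
               for (a, b), (lo, hi) in time_graph.items())

# Two observed runs of the same startup scenario, then recognition:
run1 = {("V1c,o", "l+(L)"): (1, 9), ("l+(L)", "Puf,n"): (0, 20)}
run2 = {("V1c,o", "l+(L)"): (3, 12)}
chronicle = compose(run1, run2)   # ("V1c,o","l+(L)") tightens to (3, 9)
obs = [("V1c,o", 0), ("l+(L)", 5), ("Puf,n", 15)]
print(matches(chronicle, obs))    # True
```

The event-type names here echo the case study's notation (valve and pump events) but the numeric intervals are invented for illustration.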
6.3 Event identification
One of the most important steps for fault diagnosis based on chronicle recognition is to determine the set of events that can carry the system to a failure. Each situation pattern (normal or abnormal) is a set of events and temporal constraints between them; then a situation model may also specify events to be generated and actions to be triggered as a result of the situation occurrence. For a startup procedure in the example process, the set of event types Σ that represents the procedure actions is:

Σ = {V1c,o, V2c,o, Puf,n, V1o,c, V2o,c, Pun,f}   (6)

According to the causal graphs associated to the modes involved in the sequence of procedure actions (i.e. q1, q5 and q7, indicated by red arrows on Figure 6), the event types of Σc correspond to the behavior of the variables L, Po and QoV2:

Σc = {l+(L), l−(L), h+(L), h−(L),
      l+(Po), l−(Po), h+(Po), h−(Po),
      l+(QoV2), l−(QoV2), h+(QoV2), h−(QoV2)}   (7)

Figure 6: Underlying DES of the HTG system

Figure 7: CSD in the modes q1, q5 and q7

From the startup/shutdown procedures the different temporal runs are determined, and these temporal runs are related to the normal and abnormal situations. The chronicle resulting from a normal startup procedure is presented in Figure 8.

Figure 8: Chronicle C01 for normal behavior startup

The model system was developed in Matlab, including the injection water process area. The continuous behavior is related to the evolution of the level L, the outlet pump pressure Po and the outlet flow QoV2 in the system. The discrete evolution is related to the event evolution of the procedures in the startup and shutdown stages. From the different failure modes of the process, the dynamic behavior of the variables is shown with a detection for the possible process states, including the normal procedure without failure. The simulation includes 3 types of startup procedures (OK, fail1 and fail2) with 4 types of fault modes (V1, V2, Pump and Drainopen) and 3 types of shutdown procedures (OK, Non-activated and Fail). The evolution of the continuous variables in the startup procedure without failure is shown in Figure 9. The events are generated by the program through the evolution of the differential equations, the
variable conditions and the procedural actions. Recognition of the chronicles was done using the Stateflow tool.

Figure 9: Normal behavior in the startup procedure without failure. Blue: level, Green: pressure, Red: outlet flow

7 Conclusion
A preliminary method for alarm management based on automatically learned chronicles has been proposed. The proposal is based on a hybrid causal model of the system and a chronicle based approach for diagnosis. An illustrative example of a hydrostatic tank gauging system has been considered to introduce the main concepts of the approach. In this paper the design of the temporal constraints of the chronicles was performed from simulation results, but further research aims to generate the chronicles from the model of the system. Learning approaches are currently considered for acquiring the chronicle base directly from the sequences of events representing the situations. For this purpose the algorithm HCDAM (Heuristic Chronicle Discovery Algorithm Modified [17]) may be used. The use of HIL (Hardware In the Loop) simulation to validate the proposal is also in our prospects.

8 Acknowledgements
The ECOPETROL-ICP engineers Jorge Prada, Francisco Cala and Gladys Valderrama helped us to develop and validate the simulations.

References
[1] D. Beebe, S. Ferrer and D. Logerot. The connection of peak alarm rates to plant incidents and what you can do to minimize. Process Safety Progress, 2013.
[2] J. Zhu, Y. Shu, J. Zhao and F. Yang. A dynamic alarm management strategy for chemical process transitions. Journal of Loss Prevention in the Process Industries, 30:207-218, 2014.
[3] M. Schleburg, L. Christiansen, N.F. Thornhill and A. Fay. A combined analysis of plant connectivity and alarm logs to reduce the number of alerts in an automation system. Journal of Process Control, 23:839-851, 2013.
[4] S.R. Kondaveeti, I. Izadi, S.L. Shah, T. Black and T. Chen. Graphical tools for routine assessment of industrial alarm systems. Computers and Chemical Engineering, 46:39-47, 2012.
[5] F. Higuchi, I. Yamamoto, T. Takai, M. Noda and H. Nishitani. Use of event correlation analysis to reduce the number of alarms. Computer Aided Chemical Engineering, 27:1521-1526, 2009.
[6] Z. Ge and Z. Song. Multimode process monitoring based on Bayesian method. Journal of Chemometrics, 23:636-650, 2009.
[7] X. Liu, M. Noda and H. Nishitani. Evaluation of plant alarm systems by behavior simulation using a virtual subject. Computers & Chemical Engineering, 34:374-386, 2010.
[8] F. Yang, S.L. Shah, D. Xiao and T. Chen. Improved correlation analysis and visualization of industrial alarm data. 18th IFAC World Congress, Milano, Italy, 2011.
[9] I. Izadi, S.L. Shah, D.S. Shook, S.R. Kondaveeti and T. Chen. A framework for optimal design of alarm systems. 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Barcelona, Spain, 2009.
[10] A. Pariyani, W.D. Seider, U.G. Oktem and M. Soroush. Dynamic risk analysis using alarm databases to improve process safety and product quality: Part II. Bayesian analysis. AIChE Journal, 58:826-841, 2012.
[11] J. Liu and D. Chen. Nonstationary fault detection and diagnosis for multimode processes. AIChE Journal, 56:207-219, 2010.
[12] Z. Jing, L. Boang and Y. Hao. Fault diagnosis strategy for startup process based on standard operating procedures. 25th Chinese Control and Decision Conference (CCDC), 2013.
[13] R. Srinivasan, P. Viswanathan, H. Vedam and A. Nochur. A framework for managing transitions in chemical plants. Computers & Chemical Engineering, 29:305-322, 2005.
[14] S. Xu, A. Adhitya and R. Srinivasan. Hybrid model-based framework for alarm anticipation. Industrial & Engineering Chemistry Research, 2014.
[15] R. Pons, A. Subias and L. Travé-Massuyès. Iterative hybrid causal model based diagnosis: Application to automotive embedded functions. Engineering Applications of Artificial Intelligence, 2015.
[16] L. Travé-Massuyès and R. Pons. Causal ordering for multiple mode systems. 11th International Workshop on Qualitative Reasoning, Cortona, Italy, 203-214, 1997.
[17] A. Subias, L. Travé-Massuyès and E. Le Corronc. Learning chronicles signing multiple scenario instances. 19th IFAC World Congress, Cape Town, South Africa, 26-29 August 2014; also 25th International Workshop on Principles of Diagnosis (DX-2014), Graz, Austria, 9-11 September 2014.
[18] B. Guerraz and C. Dousson. Chronicles construction starting from the fault model of the system to diagnose. DX'04, 15th International Workshop on Principles of Diagnosis, Carcassonne, France, 2004.
Proceedings of the 26th International Workshop on Principles of Diagnosis
Data-Augmented Software Diagnosis

Amir Elmishali, Roni Stern and Meir Kalech
Ben Gurion University of the Negev
e-mail: amir9979@gmail.com, roni.stern@gmail.com, kalech@bgu.ac.il
Abstract

The task of software diagnosis algorithms is to identify which software components are faulty, based on the observed behavior of the system. Software diagnosis algorithms have been studied in the Artificial Intelligence community, using model-based and spectrum-based approaches. In this work we show how software fault prediction algorithms, which have been studied in the software engineering literature, can be used to improve software diagnosis. Software fault prediction algorithms predict which software components are likely to contain faults using machine learning techniques. The resulting data-augmented diagnosis algorithm we propose is able to overcome key problems in software diagnosis algorithms: ranking diagnoses and distinguishing between diagnoses with high probability and low probability. This allows significantly reducing the output list of diagnoses. We demonstrate the efficiency of the proposed approach empirically on both synthetic bugs and bugs extracted from the Eclipse open source project. Results show that the accuracy of the found diagnoses is substantially improved when using the proposed combination of software fault prediction and software diagnosis algorithms.

1 Introduction

Software is prevalent in practically all fields of life, and its complexity is growing. Unfortunately, software failures are common and their impact can be very costly. As a result, there is a growing need for automated tools to identify software failures and isolate the faulty software components, such as classes and functions, that have caused the failure. We focus on the latter task, of isolating faults in software components, and refer to this task as software diagnosis.

Model-based diagnosis (MBD) is an approach to automated diagnosis that uses a model of the diagnosed system to infer possible diagnoses, i.e., possible explanations of the observed system failure. While MBD was successfully applied to a range of domains [1; 2; 3; 4], it has not yet been applied successfully to software. The reason for this is that in software development, there is usually no formal model of the developed software. To this end, a scalable software diagnosis algorithm called Barinel has been proposed [5]. Barinel is a combination of MBD and Spectrum Fault Localization (SFL). SFL considers traces of executions, and finds diagnoses by considering the correlation between execution traces and which executions have failed. While very scalable, Barinel suffers from one key disadvantage: it can return a very large set of possible diagnoses for the software developer to choose from. To handle this disadvantage, Abreu et al. [5] proposed a Bayesian approach to compute a likelihood score for each diagnosis. Then, diagnoses are prioritized according to their likelihood scores.

Thanks to the open source movement and current software engineering tools such as version control and issue tracking systems, there is much more information about a diagnosed system than revealed by the traces of performed tests. For example, version control systems store all revisions of every source file, and it is quite common that a bug occurs in a source file that was recently revised. Barinel is agnostic to this data. We propose a data-driven approach to better prioritize the set of diagnoses returned by Barinel.

In particular, we use methods from the software engineering literature to learn from collected data how to predict which software components are expected to be faulty. These predictions are then integrated into Barinel to better prioritize the diagnoses it outputs and provide more accurate estimates of each diagnosis's likelihood.

The resulting data-augmented diagnosis algorithm is part of a broader software troubleshooting paradigm that we call Learn, Diagnose, and Plan (LDP). In this paradigm, illustrated in Figure 1(a), the troubleshooting algorithm learns which source files are likely to fail from past faults, previous source code revisions, and other sources. When a test fails, a data-augmented diagnosis algorithm considers the observed failed and passed tests to suggest likely diagnoses, leveraging the knowledge learned from past data. If further tests are necessary to determine which software component caused the failure, such tests are planned automatically, taking into consideration the diagnoses found. This process continues until a sufficiently accurate diagnosis is found.

In this work we implemented this paradigm and simulated its execution on a popular open source software project, the Eclipse CDT. Information from the Git version control and the Bugzilla issue tracking systems was used, as illustrated in Figure 1(b) and explained in the experimental results.

Results show a huge advantage of using our data-augmented diagnoser over Barinel with uniform priors, both for finding more accurate diagnoses and for better selecting tests for troubleshooting. Moreover, to demonstrate the potential benefit of our data-augmented approach we also
(a) The Learn, Diagnose, and Plan paradigm: an AI engine connects the QA tester and the developer with the source code and server logs. (b) Our current implementation: the AI engine draws on the source code, the issue tracking system, and the version control system.
Figure 1: The learn, diagnose, and plan paradigm and our implementation.
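The LDP loop of Figure 1 can be sketched as follows. This is only a skeleton; the callback names (run_test, diagnose, plan_next_test) are hypothetical placeholders, not the authors' API.

```python
# Skeletal sketch of the Learn, Diagnose, and Plan (LDP) loop of Figure 1.
# The callbacks below are hypothetical placeholders, not the authors' API.

def ldp(initial_tests, run_test, diagnose, plan_next_test,
        threshold=0.7, budget=50):
    """Run tests until the top-scored diagnosis reaches the threshold."""
    observations = [run_test(t) for t in initial_tests]
    while True:
        # diagnose() returns scored diagnoses, e.g. [({"Foo.java"}, 0.6), ...]
        diagnoses = diagnose(observations)
        best = max(diagnoses, key=lambda d: d[1])
        if best[1] >= threshold or len(observations) >= budget:
            return best
        # Plan the next test based on the current diagnoses, then run it.
        observations.append(run_test(plan_next_test(diagnoses)))
```

In the paper's implementation the diagnose step is Barinel with learned priors and the planning step is a troubleshooting algorithm such as HP (Section 4.4); here both are left abstract.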
experimented with a synthetic fault prediction model that correctly identifies the faulty component. As expected, using the synthetic fault prediction model is better than using the learned fault prediction model, thus suggesting room for further improvements in future work. To our knowledge, this is the first work to successfully integrate a data-driven approach into software diagnosis.

2 Model-Based Diagnosis for Software

The input to classical MBD algorithms is a tuple ⟨SD, COMPS, OBS⟩, where SD is a formal description of the diagnosed system's behavior, COMPS is the set of components in the system that may be faulty, and OBS is a set of observations. A diagnosis problem arises when SD and OBS are inconsistent with the assumption that all the components in COMPS are healthy. The output of an MBD algorithm is a set of diagnoses.

Definition 1 (Diagnosis). A set of components ∆ ⊆ COMPS is a diagnosis if

    (∧_{C ∈ ∆} ¬h(C)) ∧ (∧_{C' ∉ ∆} h(C')) ∧ SD ∧ OBS

is consistent, i.e., if assuming that the components in ∆ are faulty, then SD is consistent with OBS.

The set of components (COMPS) in software diagnosis can be, for example, the set of classes, or all functions, or even a component per line of code. Low-level granularity of components, e.g., setting each line of code as a component, will result in very focused diagnoses (e.g., pointing to the exact line of code that was faulty). Focusing the diagnoses in such a way comes at the price of an increase in the computational effort. Automatically choosing the most suitable level of granularity is a topic for future work.

Observations (OBS) in software diagnosis are observed executions of tests. Every observed test t is labeled as "passed" or "failed", denoted by passed(t) and failed(t), respectively. This labeling is done manually by the tester or automatically in the case of automated tests (e.g., failed assertions).

There are two main approaches for applying MBD to software diagnosis, each defining SD somewhat differently. The first approach requires SD to be a logical model of the correct functionality of every software component [6]. This approach allows using logical reasoning techniques to infer diagnoses. The main drawbacks of this approach are that it does not scale well and that modeling the behavior of software components is often infeasible.

2.1 SFL for Software Diagnosis

An alternative approach to software diagnosis has been proposed by Abreu et al. [5; 7], based on spectrum-based fault localization (SFL). In this SFL-based approach, there is no need for a logical model of the correct functionality of every software component in the system. Instead, the traces of the observed tests are considered.

Definition 2 (Trace). A trace of a test t, denoted by trace(t), is the sequence of components involved in executing t.

Traces of tests can be collected in practice with common software profilers (e.g., Java's JVMTI). Recent work showed how test traces can be collected with low overhead [8]. Also, many deployed applications maintain a log with some form of this information.

In the SFL-based approach, SD is implicitly defined by the assumption that a test will pass if all the components in its trace are not faulty. Let h(C) denote the health predicate for a component C, i.e., h(C) is true if C is not faulty. Then we can formally define SD in the SFL-based approach with the following set of Horn clauses:

    ∀test: (∧_{C ∈ trace(test)} h(C)) → passed(test)

Thus, if a test failed then we can infer that at least one of the components in its trace is faulty. In fact, a trace of a failed test is a conflict.

Definition 3 (Conflict). A set of components Γ ⊆ COMPS is a conflict if ∧_{C ∈ Γ} h(C) ∧ SD ∧ OBS is inconsistent.

Many MBD algorithms use conflicts to direct the search towards diagnoses, exploiting the fact that a diagnosis must be a hitting set of all the conflicts [9; 10; 11]. Intuitively, since at least one component in every conflict is faulty, only a hitting set of all conflicts can explain the unexpected observation (failed test).

Barinel is a recently proposed software MBD algorithm [5] based on exactly this concept: considering traces of tests with failed outcome as conflicts and returning their hitting sets as diagnoses. With a fast hitting set algorithm, such as the STACCATO hitting set algorithm proposed by Abreu et al. [12], Barinel can scale well to large systems. The main drawback of using Barinel is that it often outputs a large set of diagnoses, thus providing weaker guidance to the programmer that is assigned to solve the observed bug.
2.2 Prioritizing Diagnoses

To address this problem, Barinel computes a score for every diagnosis it returns, estimating the likelihood that it is true. This serves as a way to prioritize the large set of diagnoses returned by Barinel.

The exact details of how this score is computed are given by Abreu et al. [5]. For the purpose of this paper, it is important to note that the score computation used by Barinel is Bayesian: it computes for a given diagnosis the posterior probability that it is correct given the observed passed and failed tests. As a Bayesian approach, Barinel also requires some assumption about the prior probability of each component to be faulty. Prior works using Barinel have set these priors uniformly for all components. In this work, we propose a data-driven way to set these priors more intelligently and demonstrate experimentally that this has a huge impact on the overall performance of the resulting diagnoser.

3 Data-Augmented Software Diagnosis

The prior probabilities used by Barinel represent the a-priori probability of a component to be faulty, without considering any observed system behavior. Fortunately, there is a line of work on software fault prediction in the software engineering literature that deals exactly with this question: which software components are more likely to have a bug. We propose to use these software fault predictions as the priors used by Barinel. First, we provide some background on software fault prediction.

3.1 Software Fault Prediction

Fault prediction in software is a classification problem. Given a software component, the goal is to determine its class: healthy or faulty. Supervised machine learning algorithms are commonly used these days to solve classification problems. They work as follows. As input, they are given a set of instances, in our case software components, and their correct labeling, i.e., the correct class for each instance. They output a classification model, which maps an instance to a class.

Learning algorithms extract features from a given instance, and try to learn from the given labeled instances the relation between the features of an instance and its class. Key to the success of machine learning algorithms is the choice of features used. Many possible features were proposed in the literature for software fault prediction.

Radjenovic et al. [13] surveyed the features used by existing software fault prediction algorithms and categorized them into three families.

Traditional. These features are traditional software complexity metrics, such as number of lines of code and the McCabe [14] and Halstead [15] complexity measures.

Object Oriented. These features are software complexity metrics that are specifically designed for object oriented programs. This includes metrics like cohesion and coupling levels and depth of inheritance.

Process. These features are computed from the software change history. They try to capture the dynamics of the software development process, considering metrics such as lines added and deleted in the previous version and the age of the software component.

It is not clear from the literature which combination of features yields the most accurate fault predictions. In a preliminary set of experiments we found that the combination that performed best consists of 68 of the features listed by Radjenovic et al. [13]. This list of features included the McCabe [14] and Halstead [15] complexity measures; several object oriented measures such as the number of methods overriding a superclass, the number of public methods, the number of other classes referenced, and whether the class is abstract; and several process features such as the age of the source file, the number of revisions made to it in the last release, the number of developers that contributed to its development, and the number of lines changed since the latest version.

As shown in the experimental results section, the resulting fault prediction model was accurate enough for the overall data-augmented software diagnoser to be more effective than Barinel with uniform priors. However, we are not sure that a better combination of features cannot be found, and this can be a topic for future work. The main novelty of our work is in integrating a software fault prediction model with Barinel.

3.2 Integrating the Fault Prediction Model

The software fault prediction model generated as described above is a classifier, accepting as input a software component and outputting a binary prediction: is the component predicted to be faulty or not. Barinel, however, requires a real number that estimates the prior probability of each component to be faulty.

To obtain this estimated prior from the fault prediction model, we rely on the fact that most prediction models also output a confidence score, indicating the model's confidence in the predicted class. Let conf(C) denote this confidence for component C. We use conf(C) for Barinel's prior if C is classified as faulty, and 1 − conf(C) otherwise.

4 Experimental Results

To demonstrate the benefits of the proposed data-augmented approach, we implemented it and evaluated it as follows.

4.1 Experimental Setup

As a benchmark, we used the source files, tests, and bugs reported for the Eclipse CDT open source software project (eclipse.org/cdt). Eclipse CDT is a popular open source Integrated Development Environment (IDE) for C/C++. The first release dates back to December 2003 and the latest release we consider, labeled CDT 8.2.0, was released in June 2013. It consists of 8,502 source code files and has had more than 10,129 bugs reported so far (for all releases). In addition, there are 3,493 automated tests coded using the JUnit unit testing framework.

Determining Faulty Files

Eclipse CDT is developed using the Git version control system and the Bugzilla issue tracking system. Git maintains all versions of each source file in a repository. This enables computing process metrics for every version of every source file. Similarly, Bugzilla is used to maintain all reported bugs. Some source file versions are marked in the Git repository as versions in which a specific bug was fixed. The Git repository for Eclipse CDT contained matching versions of source files for 6,730 out of the 10,129 bugs reported as fixed in Bugzilla. We performed our experiments on these 6,730 bugs.
For both learning and testing a fault prediction model, we require a mapping between reported bugs and the source files that were faulty and caused them. One possible assumption is that every source file revision that is marked as fixing bug X is a faulty file that caused X. We call this the "All files" assumption. The "All files" assumption may overestimate the number of faulty files, as some of these files may have been modified for other reasons, not related to the bug. Even if all changes in a revision are related to fixing a bug, it still does not mean that all these files are faulty; consider, for example, properties files and XML configuration files. As a crude heuristic to overcome this, we also experiment with an alternative assumption that we call the "Most modified" assumption. In the "Most modified" assumption, for a given bug X we only consider a single source file as faulty from all the files associated with bug X. We chose from these source files the one in which the revision made to it was the most extensive. The extensiveness of the revision is measured by the number of lines added, updated, and deleted in the source file in this revision. Below we present experiments for both the "All files" and "Most modified" assumptions. Śliwerski et al. [16] proposed a more elaborate method to heuristically identify the source files that caused the bug, when analyzing a similar data set.

Training and Testing Set

The source files and reported bugs from 5 releases, 8.0.0–8.1.1, were used to train the model of our data-augmented diagnoser, and the source files and reported bugs from release 8.1.2 were used to evaluate it.

4.2 Comparing Fault Prediction Accuracy

As a preliminary, we evaluated the quality of the fault prediction models used by our data-augmented diagnoser on our Eclipse CDT benchmark.

We used the Weka software package (www.cs.waikato.ac.nz/ml/weka) to experiment with several learning algorithms and compared the resulting fault prediction models. Specifically, we evaluated the following learning algorithms: Random Forest, J48 (Weka's implementation of a decision tree learning algorithm), and Naive Bayes.

All files         Precision  Recall  F-Measure  AUC
Random Forest     0.56       0.09    0.16       0.84
J48               0.44       0.17    0.25       0.61
Naive Bayes       0.27       0.31    0.29       0.80

Most modified     Precision  Recall  F-Measure  AUC
Random Forest     0.44       0.04    0.08       0.76
J48               0.15       0.03    0.05       0.55
Naive Bayes       0.08       0.31    0.12       0.715

Table 1: Fault prediction performance.

Table 1 shows the precision, recall, F-measure, and AUC of the fault prediction models generated by each of these learning algorithms. These are standard metrics for evaluating classifiers. In brief, precision is the ratio of faulty files among all files identified by the evaluated model as faulty. Recall is the number of faulty files identified as such by the evaluated model divided by the total number of faulty files. F-measure is a known combination of precision and recall. The AUC metric addresses the known tradeoff between recall and precision, where high recall often comes at the price of low precision. This tradeoff can be controlled by setting different sensitivity thresholds for the evaluated model. AUC is the area under the curve plotting the accuracy as a function of the recall (every point is a different threshold value). All metrics range between zero and one (where one is optimal) and are standard metrics in machine learning and information retrieval. The unfamiliar reader can find more details in machine learning books, e.g., Mitchell's classical book [17].

The results for both the "All files" and "Most modified" assumptions show that the Random Forest classifier obtained the overall best results. This corresponds to many recent works. Thus, in the results reported henceforth, we only used the model generated by the Random Forest classifier in our data-augmented diagnoser. The precision and especially the recall results are fairly low. This is understandable, as most files are healthy, and thus the training set is very imbalanced. This is a known inhibitor of the performance of standard learning algorithms. We have experimented with several known methods to handle this imbalanced setting, such as SMOTE and random under-sampling, but these did not produce substantially better results. However, as we show below, even this imperfect prediction model is able to improve the existing data-agnostic software diagnosis algorithm. Note that we also experimented with other popular learning algorithms such as Support Vector Machine (SVM) and Artificial Neural Network (ANN), but their results were worse than those shown in Table 1.

Next, we evaluate the performance of our data-augmented diagnoser in two diagnostic tasks: finding diagnoses and guiding test generation.

4.3 Diagnosis Task

First, we compared the data-agnostic diagnoser with the proposed data-augmented diagnoser in the task of finding accurate diagnoses. The input is a set of tests, with their traces and outcomes, and the output is a set of diagnoses, each diagnosis having a score that estimates its correctness. This score was computed by Barinel as described earlier in the paper, where the data-agnostic diagnoser uses uniform priors and the proposed data-augmented diagnoser uses the predicted fault probabilities from the learned model.

                  Most modified        All files
Diagnoser         Precision  Recall   Precision  Recall
Data-agnostic     0.72       0.27     0.55       0.26
Data-augmented    0.90       0.32     0.73       0.35
Syn. (0.6,0.01)   0.97       0.39     0.96       0.45
Syn. (0.6,0.1)    0.84       0.35     0.89       0.42
Syn. (0.6,0.2)    0.77       0.34     0.83       0.39
Syn. (0.6,0.3)    0.73       0.33     0.78       0.37
Syn. (0.6,0.4)    0.69       0.32     0.74       0.36

Table 2: Comparison of diagnosis accuracy.

To compare the sets of diagnoses returned by the different diagnosers, we computed the weighted average of their precision and recall. This was computed as follows. First, the precision and recall for every diagnosis was computed. Then, we averaged these values, weighted by the score given to the diagnoses by Barinel. This enables aggregating the precision and recall of a set of diagnoses while incorporating which diagnoses are regarded as more likely according to Barinel. For brevity, we will refer to this weighted average precision and weighted average recall as simply precision and recall.
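The weighted averaging just described can be reconstructed as the following sketch (our own reading of the described computation; the file names are hypothetical):

```python
# Sketch of the Section 4.3 evaluation metric: per-diagnosis precision and
# recall against the true faulty files, averaged with Barinel's diagnosis
# scores as weights.

def weighted_precision_recall(scored_diagnoses, true_faults):
    """scored_diagnoses: list of (diagnosis: set of files, score: float)."""
    total = sum(score for _, score in scored_diagnoses)
    precision = recall = 0.0
    for diag, score in scored_diagnoses:
        hits = len(diag & true_faults)
        precision += score * hits / len(diag)
        recall += score * hits / len(true_faults)
    return precision / total, recall / total

# Toy example: two scored diagnoses, two truly faulty files.
diagnoses = [({"Foo.java"}, 0.6), ({"Foo.java", "Bar.java"}, 0.4)]
p, r = weighted_precision_recall(diagnoses, {"Foo.java", "Baz.java"})
print(round(p, 2), round(r, 2))  # 0.8 0.5
```

High-scoring diagnoses thus dominate the aggregate, matching the intent of weighting by Barinel's likelihood estimates.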
Table 2 shows the precision and recall results of the data- agnosis algorithm with different fault priors. Barinel uses
agnostic diagnoser and our data-augmented diagnoser, for these priors only to prioritize diagnoses, but Barinel consid-
both “Most modified” and “All files” assumptions. Each ers as diagnoses hitting sets of faulty traces. Thus, if two
result in the table is an average over the precision and re- faulty components are used in the same trace, only one of
call obtained for 50 problem instances. A problem instance them will be detected even if both have very high likelihood
consists of (1) a bug from one of the bugs reported for re- of being faulty according to the fault prediction model.
lease 8.1.2. of Eclipse CDT, and (2) a set of 25 tests, chosen
randomly, while ensuring that at least one tests would pass Considering More Tests
through the faulty files. Next, we investigate the impact of adding more tests to the
Both precision and recall of the data-augmented and data- accuracy of the returned diagnoses.
agnostic diagnosers support the main hypothesis of this Figure 2 shows the precision and recall results (Figures 2
work: a data-augmented diagnoser can yield substantially (a) and (b), respectively), as a function of the number of
better diagnoses that a data-agnostic diagnoser. For exam- observed tests. We compared the different diagnosers, given
ple, the precision of the data-augmented diagnoser under the 25, 40, 70, 100, and 130 observed tests.
“Most modified” assumption is 0.9 while that of the data- The results show two interesting trends in both precision
agnostic diagnoser is only 0.72. The superior performance and recall. First, as expected, the data-agnostic diagnoser
of the data-augmented diagnoser is shown for both “Most performs worse than the data-augmented diagnoser, which
modified” and “All files” assumptions. Another observation in terms performs worse than the diagnoser using a synthetic
that can be made from the results in Table 2 is that while the fault prediction model, with Ph = 0.01. This supports our
precision of the data-augmented diagnoses is very high and main hypothesis — that data-augmented diagnosers can be
is substantially better than that of the data-agnostic diag- better than a data-agnostic diagnoser. Also, the better per-
noser, the improvement in recall is relatively more modest. formance of Syn. (0.6, 0.01) demonstrates that future re-
This can be explained by the precision and recall results of search on improving the fault prediction model will results
the learned model, shown in Table 1 and discussed earlier. in a better diagnoser.
There too, the recall results was far worse than the preci- The second trend is that adding more tests reduces the
sion results (recall that we are using the model learned by precision and recall of the returned diagnoses. This, at
the Random Forest learning algorithm). It is possible that first glance, seem counter-intuitive, as we would expect
learning a model with higher recall may result in higher re- more tests to allow finding more accurate diagnoses and
call for the resulting diagnoses. We explore the impact of thus higher recall and precision. This non-intuitive results
learning more accurate fault prediction model next. can be explained by how tests were chosen. As explained
above, the observed tests were chosen randomly, only veri-
Synthetic Priors fying that at least one test passes through each faulty source
The data-augmented diagnoser is based on the priors gen- file. Adding randomly selected tests adds noise to the di-
erated by the learned fault prediction model. Building bet- agnoser. By contrast, intelligent methods to choose which
ter fault prediction models is an active field of study [13] tests to add can improve the accuracy of the diagnoses [18].
and thus future fault prediction models may be more accu- This is explored in the next section. Another reason for the
rate than the ones used by our data-augmented diagnoser. degraded performance when adding more tests is that more
To evaluate the benefit of a more accurate fault prediction tests may pass through more fault source files, in addition
model on our data-augmented diagnoser, we created a syn- to those from the specific reported bug used to generate the
thetic fault prediction model, in which faulty source files problem instance in the first place. Thus, adding more tests
get Pf probability and healthy source files get Ph , where increases the amount of faulty source files to detect.
Pf and Ph are parameters. Setting Ph = Pf would cause
the data-augmented diagnoser to behave in a uniform distri- 4.4 Troubleshooting Task
bution exactly like the data-agnostic diagnoser, setting the Efficient diagnosers are key components of troubleshoot-
same prior probability for all source files to be faulty. By ing algorithms. Troubleshooting algorithms choose which
contrast, setting Ph = 0 and Pf = 1 represent an optimal tests to perform to find the most accurate diagnosis. Za-
fault prediction model, that exactly predicts which files are mir et al. [18] proposed several troubleshootings algorithms
faulty and which are healthy. specifically designed to work with Barinel for troubleshoot-
The lines marked “Syn. (X,Y)” in Table 2 mark the ing software bugs. In the below preliminary study, we eval-
performance of the data-augmented diagnoser when using uated the impact of our data-augmented diagnoser on the
this synthetic fault prediction model, where X = Pf and overall performance of troubleshooting algorithms. Specif-
Y = Ph . Note that we experimented with many values of ically, we implemented the so-called highest probability
Pf and Ph , and presented above a representative subset of (HP) troubleshooting algorithm, in which tests are chosen
these results. in the following manner. HP chooses a test that is expected
As expected, setting lowering the value of Ph results in to pass through the source file having the highest probability
more better diagnoses being found. Setting a very low Ph of being faulty, given the diagnoses probabilities.
value improves the precision significantly up to almost per- We run the HP troubleshooting algorithm with each of
fect precision (0.97 and 0.96 for the “Most modified” and the diagnosers mentioned above (all rows in Table 2). We
“All files”, respectively). The recall results, while also im- compared the HP troubleshooting algorithm using different
proving as we lower Ph , do not reach a very high value. For diagnosers by counting the number of tests were required to
Ph = 0.01, the obtained recall is almost 0.39 and 0.45 for reach a diagnoses of score higher than 0.7.
the “Most modified” and “All files”, respectively. Table 3 shows the average number of tests performed by
A possible explanation for these low recall results lays in the HP troubleshooting algorithm until it halts (with a suit-
the fact that all the evaluated diagnosers use the Barinel di- able diagnosis). The results show the same over-arching
251
Proceedings of the 26th International Workshop on Principles of Diagnosis
1 1 Syn. (0.6,0.01) Syn. (0.6,0.2)
Syn. (0.6,0.4) Data-agnostic
0.8 0.8 Data-augmented
0.6
Precision
0.6
Recall
0.4 0.4
Syn. (0.6,0.01) Syn. (0.6,0.2)
0.2 0.2
Syn. (0.6,0.4) Data-agnostic
Data-augmented
0 0
25 40 70 100 130 25 40 70 100 130
# Tests # Tests
(a) Precision results (b) Recall results
Figure 2: Diagnosis accuracy as a function of # tests given to the diagnoser.
theme: the data-augmented diagnoser is much better than the data-agnostic diagnoser for this troubleshooting task. Also, using the synthetic fault prediction model can result in even further improvement, thus suggesting future work on improving the learned fault prediction model.

Table 3: Avg. additional tests for troubleshooting.

Algorithm         Most modified   All files
Data-agnostic          20.24        18.06
Data-augmented         10.80        15.45
Syn. (0.6,0.01)         3.94        14.91
Syn. (0.6,0.1)         15.44        17.83
Syn. (0.6,0.2)         19.78        18.99
Syn. (0.6,0.3)         20.90        19.24
Syn. (0.6,0.4)         20.74        19.18

5 Conclusion and Future Work
We presented a method for using information about the diagnosed system to improve Barinel, a scalable, effective software diagnosis algorithm [7]. In particular, we incorporated a software fault prediction model into Barinel. The resulting data-augmented diagnoser is shown to outperform Barinel without such a fault prediction model. This was verified experimentally using a real source code system (Eclipse CDT), real reported bugs, and information from the software's source control repository. Results also suggest that future work on improving the learned fault prediction model will result in improved diagnosis accuracy. In addition, it is worthwhile to combine the proposed data-augmented diagnosis methods with other proposed improvements of SFL-based software diagnosis, such as those proposed by Hofer et al. [19; 20].

References
[1] Brian C. Williams and P. Pandurang Nayak. A model-based approach to reactive self-configuring systems. In Conference on Artificial Intelligence (AAAI), pages 971–978, 1996.
[2] Alexander Feldman, Helena Vicente de Castro, Arjan van Gemund, and Gregory Provan. Model-based diagnostic decision-support system for satellites. In IEEE Aerospace Conference, pages 1–14. IEEE, 2013.
[3] Peter Struss and Chris Price. Model-based systems in the automotive industry. AI Magazine, 24(4):17–34, 2003.
[4] Dietmar Jannach and Thomas Schmitz. Model-based diagnosis of spreadsheet programs: a constraint-based debugging approach. Automated Software Engineering, 1:1–40, 2014.
[5] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Simultaneous debugging of software faults. Journal of Systems and Software, 84(4):573–586, 2011.
[6] Franz Wotawa and Mihai Nica. Program debugging using constraints – is it feasible? Quality Software, International Conference on, 0:236–243, 2011.
[7] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. Spectrum-based multiple fault localization. In Automated Software Engineering (ASE), pages 88–99. IEEE, 2009.
[8] Alexandre Perez, Rui Abreu, and André Riboira. A dynamic code coverage approach to maximize fault localization efficiency. Journal of Systems and Software, 2014.
[9] Johan de Kleer and Brian C. Williams. Diagnosing multiple faults. Artif. Intell., 32(1):97–130, 1987.
[10] Brian C. Williams and Robert J. Ragno. Conflict-directed A* and its role in model-based embedded systems. Discrete Appl. Math., 155(12):1562–1595, 2007.
[11] Roni Stern, Meir Kalech, Alexander Feldman, and Gregory M. Provan. Exploring the duality in conflict-directed model-based diagnosis. In AAAI, 2012.
[12] Rui Abreu and Arjan J. C. van Gemund. A low-cost approximate minimal hitting set algorithm and its application to model-based diagnosis. In SARA, volume 9, pages 2–9, 2009.
[13] Danijel Radjenovic, Marjan Hericko, Richard Torkar, and Ales Zivkovic. Software fault prediction metrics: A systematic literature review. Information & Software Technology, 55(8):1397–1418, 2013.
[14] Thomas J. McCabe. A complexity measure. IEEE Trans. Software Eng., 2(4):308–320, 1976.
[15] Maurice H. Halstead. Elements of Software Science (Operating and Programming Systems Series). Elsevier Science Inc., New York, NY, USA, 1977.
[16] Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller. When do changes induce fixes? ACM SIGSOFT Software Engineering Notes, 30(4):1–5, 2005.
[17] Tom Mitchell. Machine Learning. McGraw Hill, 1997.
[18] Tom Zamir, Roni Stern, and Meir Kalech. Using model-based diagnosis to improve software testing. In AAAI Conference on Artificial Intelligence, 2014.
[19] Birgit Hofer, Franz Wotawa, and Rui Abreu. AI for the win: Improving spectrum-based fault localization. ACM SIGSOFT Software Engineering Notes, 37:1–8, 2012.
[20] Birgit Hofer and Franz Wotawa. Spectrum enhanced dynamic slicing for better fault localization. In ECAI, pages 420–425, 2012.
Fault isolation and identification of a heat-exchanger/reactor with parameter uncertainties

Mei ZHANG 1,4,5, Boutaïeb DAHHOU 2,3, Michel CABASSUD 4,5, Ze-tao LI 1

1 Guizhou University (gzgylzt@163.com)
2 CNRS LAAS, Toulouse, France (boutaib.dahhou@laas.fr)
3 Université de Toulouse, UPS, LAAS, Toulouse, France
4 Université de Toulouse, UPS, Laboratoire de Génie Chimique (michel.cabassud@ensiacet.fr)
5 CNRS, Laboratoire de Génie Chimique
Abstract

This paper deals with sensor and process fault detection, isolation (FDI) and identification of an intensified heat-exchanger/reactor. Extended high gain observers are adopted for identifying sensor faults and guaranteeing accurate dynamics, since they can simultaneously estimate both states and uncertain parameters; the uncertain parameter considered in this paper is the overall heat transfer coefficient. Moreover, in the proposed algorithm each extended high gain observer is fed by only one measurement. In this way, the observers are allowed to act as soft sensors that yield healthy virtual measures for faulty physical sensors. Then, the healthy measurements, together with a bank of parameter interval filters, are processed with the aim of isolating process faults and identifying faulty values. The effectiveness of the proposed approach is demonstrated on an intensified heat-exchanger/reactor developed by the Laboratoire de Génie Chimique, Toulouse, France.

1 Introduction

Nowadays, safety is a priority in the design and development of chemical processes, and large research efforts have contributed to the development of new safety tools and methodologies. Process intensification can be considered as an inherently safer design; for intensified heat exchanger (HEX) reactors such as in [1], the prospects are a drastic reduction of unit size and solvent consumption, while safety is increased thanks to their remarkable heat transfer capabilities. However, the risk assessment presented in [2] shows that a potential risk of thermal runaway exists in such an intensified process. Further, several kinds of failures may compromise safety and productivity: actuator failures (e.g., pump failures, valve failures), process failures (e.g., abrupt variations of some process parameters) and sensor failures. Therefore, supervision such as FDI is required prior to the implementation of an intensified process.

For complex systems (e.g. heat-exchanger/reactors), fault detection and isolation are more complicated because some sensors cannot be placed in a desirable location, and for some variables (concentrations) no sensor exists. In addition, complete state and parameter measurements (e.g. the overall heat transfer coefficient) are usually not available. Supervision studies in chemical reactors have been reported in the literature concerning process monitoring, fouling detection, and fault detection and isolation. Existing approaches can be roughly divided into data-based methods as in [3], neural networks as in [4], and model-based methods as in [5,6,7,8,9]. Among the model-based approaches, observer-based methods are said to be the most capable [10,11,12,13,14] if analytical models are available.

Most previous approaches focus on a particular class of failures. This paper deals with integrated fault diagnosis for both sensor and process failures. Using temperature measurements, together with state observers, an integrated diagnosis scheme is proposed to detect, isolate and identify faults. For sensor faults, an FDI framework is proposed based on the extended observer developed in [15]. Extended high gain observers are adopted in this paper due to their capability of simultaneously estimating both states and parameters, resulting in more accurate system dynamics. The estimates provided by the observers and the sensor measurements are processed so as to recognize the faulty physical sensors, thus achieving sensor FDI. Moreover, the extended high gain observers work as soft sensors that output healthy virtual measurements once sensor faults occur. Then, the healthy measures are used to feed a bank of parameter interval filters developed in [11], generating a bank of residuals. These residuals are processed to isolate and identify process faults, which in this work involve jumps in the overall heat transfer coefficient.

It should be pointed out that the contribution of this work does not lie in the soft sensor design or the parameter interval filter design, as each part has individually already been addressed in the existing literature. However, the authors are not aware of any studies where both tasks are combined for integrated FDI; besides, there is no report whereby the parameter estimation capability of the extended high gain observer is used to adapt the coefficient, rather than for parameter FDI. Together with the sensor FDI framework, this forms the contribution of this work.

2 System modelling

2.1 Process description

The key feature of the studied intensified continuous heat-exchanger/reactor is an integrated plate heat-exchanger
technology which allows for the thermal integration of several functions in a single device. Indeed, by combining a reactor and a heat exchanger in only one unit, the heat generated (or absorbed) by the reaction is removed (or supplied) much more rapidly than in a classical batch reactor. As a consequence, heat exchanger/reactors may offer better safety (by a better thermal control of the reaction) and better selectivity (by a more controlled operating temperature).

2.2 Dynamic model

Supervision studies such as FDI can be much more efficient if a dynamic model of the system under consideration is available to evaluate the consequences of variable deviations and the efficiency of the proposed FDI scheme. Generally speaking, an intensified continuous heat-exchanger/reactor is treated similarly to a continuous reactor [16,17]; flow modelling is therefore based on the same hypothesis as the one used for the modelling of real continuous reactors, namely representation by a series of N perfectly stirred tank reactors (cells). According to [18], the number of cells N should be greater than the number of heat transfer units, and the number of heat transfer units is related to the heat capacity flowrate.

The modelling of a cell is based on the expression of balances (mass and energy) which describe the evolution of the characteristic values: temperature, mass, composition, pressure, etc. Given the specific geometry of the heat-exchanger/reactor, two main parts are distinguished. The first part is associated with the reaction and the second part encompasses the heat transfer aspect. Without reaction, the basic mass balance expression for a cell is written as:

{Rate of mass flow in – Rate of mass flow out = Rate of change of mass within system}

The state and evolution of the homogeneous medium circulating inside cell k are described by the following balances.

2.2.1 Heat balance of the process fluid (J·s⁻¹)

ρp^k Vp^k Cpp^k dTp^k/dt = hp^k A^k (Tp^k − Tu^k) + ρp^k Fp^k Cpp^k (Tp^(k−1) − Tp^k)   (1)

where ρp^k is the density of the process fluid in cell k (in kg·m⁻³), Vp^k is the volume of the process fluid in cell k (in m³), Cpp^k is the specific heat of the process fluid in cell k (in J·kg⁻¹·K⁻¹), and hp^k is the overall heat transfer coefficient (in J·m⁻²·K⁻¹·s⁻¹).

2.2.2 Heat balance of the utility fluid (J·s⁻¹)

ρu^k Vu^k Cpu^k dTu^k/dt = hu^k A^k (Tu^k − Tp^k) + ρu^k Fu^k Cpu^k (Tu^(k−1) − Tu^k)   (2)

where ρu^k is the density of the utility fluid in cell k (in kg·m⁻³), Vu^k is the volume of the utility fluid in cell k (in m³), Cpu^k is the specific heat of the utility fluid in cell k (in J·kg⁻¹·K⁻¹), and hu^k is the overall heat transfer coefficient (in J·m⁻²·K⁻¹·s⁻¹).

Equations (1) and (2) represent the dynamic behaviour of the reactor: they describe the evolution of the two states (Tp: reactor temperature and Tu: utility fluid temperature). The heat transfer coefficient (h) is considered as a variable which undergoes either an abrupt jump (caused by an unexpected fault in the process) or a gradual variation (essentially due to degradation). The degradation can be attributed to fouling. Fouling in an intensified process is tiny, due to the micro-channel volume, and normally cannot be a failure that leads to a fatal accident, but it may influence the dynamics of the process, and it is rather difficult to calculate the changes online. In this paper, we treat the parameter uncertainty as an unmeasured state and employ an observer as a soft sensor to estimate it. Unlike other literature, the estimation here is not for fouling detection but for more accurate model dynamics, and to ensure that the value of the variable stays within an acceptable range (e.g., upper and lower bounds of the process variable value).

To rewrite the whole model in the form of state equations, given the assumption that every element behaves like a perfectly stirred tank, we suppose that one cell can keep the main features of the qualitative behavior of the reactor. For the sake of simplicity, only one cell has therefore been considered, and we drop the subscript k for this cell.

Define the state vector x1 = [x11, x12]ᵀ = [Tp, Tu]ᵀ and the unmeasured state x2 = [x21, x22]ᵀ = [hp, hu]ᵀ with dhp/dt = dhu/dt = ε(t), where ε(t) is an unknown but bounded function describing the variation of h; the control input is u = Tui, and the output vector of measurable variables is y = [y1, y2]ᵀ = [Tp, Tu]ᵀ. Then equations (1) and (2) can be rewritten in the following state-space form:

ẋ1 = F1(x1)x2 + g1(x1, u)
ẋ2 = ε(t)                                        (3)
y = x1

where

F1(x1) = [ A(Tp − Tu)/(ρp Cpp Vp)        0
                 0            A(Tu − Tp)/(ρu Cpu Vu) ]

g1(x) = [ (Tpi − Tp)Fp/Vp
          (Tui − Tu)Fu/Vu ]

and Tpi, Tui are the outputs of the previous cell; for the first cell, they are the inlet temperatures of the process fluid and the utility fluid.

In this case, the full state of the studied system is given as:

ẋ = F(x1)x + G(x1, u) + ε̄(t)
y = Cx                                           (4)

where x = [x1, x2]ᵀ, F(x1) = [0 F1(x1); 0 0], G(x1, u) = [g1(x, u); 0], C = (I 0), and ε̄(t) = [0; ε(t)].

3 Fault detection and diagnosis scheme

3.1 Observer design for sensor FDI

The extended high gain observer proposed in [15] can be used like an adaptive observer for estimating both states and parameters simultaneously. In this paper, the latter capability is used to estimate the incipient degradation of the overall heat transfer coefficient (due to fouling), thus guaranteeing a more accurate approximation of the temperatures. This is quite useful in chemical processes, since parameters usually carry uncertainties and cannot be measured.
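As an illustration, the one-cell state-space model (3) can be sketched with explicit Euler integration as follows. The constants follow Table 1; the value chosen for x2 = [hp, hu] and the inlet temperatures are illustrative stand-ins, and the exchange terms are oriented here so that heat flows from the hot process fluid toward the colder utility fluid.

```python
import numpy as np

# Euler-integration sketch of the one-cell model (3):
#   x1_dot = F1(x1) x2 + g1(x1, u),   x2_dot = eps(t),   y = x1,
# with x1 = [Tp, Tu] and x2 = [hp, hu]. Constants follow Table 1; x2 and the
# inlet temperatures are illustrative, and the exchange terms are oriented so
# heat flows from hot to cold (a modelling assumption of this sketch).

CP = 4180.0                    # specific heat, J kg^-1 K^-1
RHO = 1000.0                   # density, kg m^-3
VP, VU = 2.685e-5, 1.141e-4    # fluid volumes, m^3
FP, FU = 4.22e-5, 4.17e-6      # flow rates, m^3 s^-1
A = 4e-6                       # exchange area term

def F1(x1):
    Tp, Tu = x1
    return np.array([[A * (Tu - Tp) / (RHO * CP * VP), 0.0],
                     [0.0, A * (Tp - Tu) / (RHO * CP * VU)]])

def g1(x1, Tpi=76.0, Tui=15.0):
    Tp, Tu = x1
    return np.array([(Tpi - Tp) * FP / VP, (Tui - Tu) * FU / VU])

def simulate(x1, x2, eps, dt=0.01, steps=1000):
    """Integrate one cell; eps(t) is the unknown bounded drift of [hp, hu]."""
    for k in range(steps):
        x1 = x1 + dt * (F1(x1) @ x2 + g1(x1))
        x2 = x2 + dt * eps(k * dt)
    return x1, x2
```

With eps set to zero, both temperatures settle between the two inlet temperatures, as expected from the balance equations.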
Consider a nonlinear system of the form:

ẋ = F(x1)x + G(x1, u)
y = Cx                                           (5)

where x = (x1, x2)ᵀ ∈ R^(2n); x1 ∈ R^n is the state; x2 ∈ R^n is the unmeasured state, with ẋ2 = ϵ(t); u ∈ R^m and y ∈ R^p are the input and output; ϵ(t) is an unknown bounded function which may depend on u(t), y(t), noise, etc.; and

F(x1) = [0 F1(x1); 0 0],  G(x1, u) = [g1(x, u); 0],  C = (I 0),

where F1(x1) is a nonlinear matrix function and g1(x, u) is a vector function whose elements are nonlinear functions.

Supposing that the assumptions of [15] concerning the boundedness of the states, signals, functions, etc. are satisfied, the extended high gain observer for this system is given by:

x̂̇ = F(x̂1)x̂ + G(x̂1, u) − Λ⁻¹(x̂1) Sθ⁻¹ Cᵀ (ŷ − y)
ŷ = Cx̂                                          (6)

where Λ(x̂1) = [I 0; 0 F1(x̂1)] and Sθ is the unique symmetric positive definite matrix satisfying the following algebraic Lyapunov equation:

θSθ + AᵀSθ + SθA − CᵀC = 0                       (7)

where A = [0 I; 0 0] and θ > 0 is a parameter defined in [15]. The solution of eq. (7) is:

Sθ = [ (1/θ)I    −(1/θ²)I
       −(1/θ²)I   (2/θ³)I ]                      (8)

Then, the gain of the estimator is given by:

H = Λ⁻¹(x̂1) Sθ⁻¹ Cᵀ = [ 2θI ; θ² F1⁻¹(x̂1) ]     (9)

Notice that a larger θ ensures a smaller estimation error. However, very large values of θ are to be avoided in practice due to noise sensitivity. Thus, the choice of θ is a compromise between fast convergence and sensitivity to noise.

3.2 Sensor fault detection and isolation scheme

The above observer can reproduce the heat-exchanger/reactor dynamics ideally. A bank of the proposed observers, together with the sensor measurements, is therefore used to generate robust residuals for recognizing a faulty sensor. Thus, we propose an FDI scheme to detect, isolate and recover from sensor faults.

3.2.1 Sensor fault model

A sensor fault can be modeled as an unknown additive term in the output equation. Suppose θsj is the actual measured output from the jth sensor. If the jth sensor is healthy, θsj = yj, while if the jth sensor is faulty, θsj = yjf = yj + fsj (fsj is the fault) for t ≥ tf, with lim(t→∞) |yj − θsj| ≠ 0. That means yjf is the actual output of the jth sensor when it is faulty, while yj is the expected output when it is healthy; that is, for the ith sensor:

θsi = { yi,               ith sensor healthy
        yif = yi + fsi,   ith sensor faulty }    (10)

With this formulation, the faulty model becomes:

ẋ = F(x1)x + G(x1, u) + ε̄(t)
y = Cx + Fs fs                                   (11)

Fs is the fault distribution matrix, and we consider that the fault vector fs ∈ R^p (fsj is the jth element of the vector) is also a bounded signal. Notice that a faulty sensor may lead to incorrect estimation of the parameters; that is why we emphasized healthy measurements for parameter fault isolation, as mentioned above.

3.2.2 Fault detection and isolation scheme

The proposed sensor FDI framework is based on a bank of observers whose number equals the number of sensors. Each observer uses only one sensor output to estimate all the states and parameters. First, assuming the sensor used by the ith observer is healthy, let yi denote the ith system output used by the ith observer. Then we form the observer as:

x̂̇ⁱ = F(x̂1ⁱ)x̂ⁱ + G(x̂1ⁱ, u) + Hi (yi − ŷiⁱ)
ŷⁱ = Cx̂ⁱ,   1 ≤ i ≤ p                            (12)

Define eix = x̂ⁱ − x, eiy = C eix, eiyj = ŷjⁱ − yj, rji(t) = ‖ŷjⁱ − yj‖, and μi := sup(t≥0) rji(t).

Here i denotes the ith observer; ŷiⁱ and ŷjⁱ denote the ith and jth estimated system outputs generated by the ith observer; Hi is the gain of the ith observer, determined by:

Hi = Λ⁻¹(x̂1) Sθi⁻¹ Cᵀ = [ 2θi I ; θi² F1⁻¹(x̂1) ]

Then we get:

Theorem 1: If the lth sensor is faulty, then for the system of form (4), the observers (12) have the following properties: for i ≠ l, ŷⁱ = y asymptotically; for i = l, ŷⁱ ≠ y.

Proof: If the lth sensor is faulty, then:
For i ≠ l, fsi = 0 and yi = θsi, so we have

lim(t→∞) eix = lim(t→∞) (x̂ⁱ − x) = 0            (13)

and the estimated output vector ŷⁱ generated by the ith observer satisfies ŷⁱ = y after a finite time.
For i = l, θsl = ylf = yl + fsl with fsl ≠ 0. The observer is designed on the assumption that no fault occurs; because the fault fsl exists, the estimation error elx → 0 cannot be achieved, so

lim(t→∞) (x̂ˡ − x) ≠ 0                            (14)

and we have

ėlx = F(x̂1ˡ, u) elx − Hl G(x̂1ˡ, u, fsl) elx      (15)

Then the estimated output vector ŷˡ generated by the lth observer is different from y, that is, ŷˡ ≠ y. ⊡

As mentioned above, the observers are designed under the assumption that no fault occurs; furthermore, each observer is fed by just one sensor output. The residual rii is the difference between the ith output estimate ŷiⁱ determined by the ith observer and the ith system output yi. Theorem 2 then formulates the fault detection and isolation scheme.
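A scalar sketch of the observer (6) with the gain structure of eq. (9), H = [2θ, θ²/F1(x̂1)], is given below. The toy plant (the chosen F1 and stabilizing input) is invented for illustration; only the observer structure follows the text.

```python
import numpy as np

# Sketch of the extended high gain observer (6) with gain (9) for a scalar
# toy system: x1_dot = F1(x1)*x2 + g1(x1,u), x2_dot = eps(t) (here 0), y = x1,
# where x2 is an unknown constant parameter. F1 and the input are invented.

def F1(x1):
    return 1.0 + 0.1 * np.sin(x1)    # nonlinear, bounded away from zero

def g1(x1, u):
    return u

def run_observer(x2_true=3.0, theta=10.0, dt=1e-3, steps=5000):
    x1 = 0.0
    x1_hat, x2_hat = 0.0, 0.0
    for _ in range(steps):
        u = -x1                       # keeps the true state bounded
        y = x1                        # measured output
        # true plant (unknown parameter x2_true)
        x1 = x1 + dt * (F1(x1) * x2_true + g1(x1, u))
        # observer: model copy plus high-gain output-error injection
        e = x1_hat - y
        x1_hat = x1_hat + dt * (F1(x1_hat) * x2_hat + g1(x1_hat, u)
                                - 2 * theta * e)
        x2_hat = x2_hat + dt * (-(theta ** 2) / F1(x1_hat) * e)
    return x1, x1_hat, x2_hat
```

As the text notes, larger θ speeds up convergence of x̂2 toward the unknown parameter but amplifies measurement noise.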
Theorem 2: If the lth sensor is faulty, then:
For i ≠ l, we have

fsi = 0,  yi = θsi                               (16)

thus ŷiⁱ converges to yi asymptotically, and we get

rii = ‖ŷiⁱ − yi‖ ≤ μi                            (17)

For i = l, we have fsl ≠ 0 and θsl = ylf = yl + fsl ≠ yl, so ŷlˡ cannot track yl correctly:

rll = ‖ŷlˡ − yl‖ ≥ μl                            (18)

Therefore, in practice, we can check all the residuals rii for 1 ≤ i ≤ p: rii ≥ μi denotes that the ith sensor is faulty, and sensor fault detection and isolation is thereby achieved. The residuals are designed to be sensitive to a fault that comes from a specific sensor and as insensitive as possible to all the other sensor faults. These residuals permit us to treat not only single faults but also multiple and simultaneous faults. Let rsi denote the fault signature of the ith sensor, defined as:

rsi(t) = { 1 if rii ≥ μi (ith sensor faulty)
           0 if rii < μi (ith sensor healthy) }  (19)

3.2.3 Fault identification and handling mechanism

1) Fault identification
Suppose there are m healthy sensors and p − m faulty ones. To identify the fault size of the ith sensor, use the m estimated outputs ŷi^m generated by the m observers which use healthy measurements (1 ≤ m ≤ p − 1, m ≠ i), and define f̂si as the estimated fault value of the ith sensor. Then:

f̂si = (1/m) Σm |ŷi^m − θsi| → fsi               (20)

2) Fault recovery
As mentioned above, the extended high gain observer also works as a software sensor providing an adequate estimate of the process output, thus replacing the measurement given by a faulty physical sensor. θsi is the actual measured output from the ith sensor:

θsi = { yi
        yif = yi + fsi }                          (21)

Let the m observers using healthy measurements serve as the soft sensor for the ith sensor, and define:

ȳi = (1/m) Σm ŷi^m                               (22)

If the ith sensor is healthy, keep its actual output θsi, while if it is faulty, replace θsi by ȳi, that is:

yi = { θsi, if the ith sensor is healthy
       ȳi,  if the ith sensor is faulty }        (23)

3.3 Process fault diagnosis

In order to achieve process FDD, the healthy measurements are fed to a bank of parameter interval filters developed in [11] to generate a bank of residuals. These residuals are processed to identify parameter changes, which in this paper involve variation of the overall heat transfer coefficient. The main idea of the method is as follows. The practical domain of the value of each system parameter is divided into a certain number of intervals. After verifying, for all the intervals, whether or not one of them contains the faulty parameter value of the system, the faulty parameter value is found, and the fault is therefore isolated and estimated.

For example, parameter hp is partitioned into q intervals, whose bounds are denoted by hp(0), hp(1), …, hp(i), …, hp(q). The bounds of the ith interval are hp(i−1) and hp(i), also noted hp_bi and hp_ai, and the nominal value of hp is denoted by hp0.

To verify whether an interval contains the faulty parameter value of the post-fault system, a parameter filter is built for this interval. A parameter filter consists of two isolation observers which correspond to the two interval bounds, and each isolation observer serves two neighboring intervals. An interval which contains the parameter's nominal value cannot contain the faulty parameter value, so no parameter filter is built for it.

Rewrite Eq. (3) in the simple form:

ẋ1 = F1(x1)x2 + g1(x1, u) = f(x1, hp, u)
y = x1                                           (24)

The parameter filter for the ith interval of hp is given below. The isolation observers are:

x̂̇ai = f(x̂1, hp0_ai, u) + H(y − ŷai)
ŷai = Cx̂ai                                       (25)
εai = y − Cx̂ai

x̂̇bi = f(x̂1, hp0_bi, u) + H(y − ŷbi)
ŷbi = Cx̂bi                                       (26)
εbi = y − Cx̂bi

where

hp0_ai(t) = { hp0, t < tf ; hp(i), t ≥ tf },  hp0_bi(t) = { hp0, t < tf ; hp(i−1), t ≥ tf }   (27)

The isolation index of this parameter filter is calculated by:

νi(t) = sgn(εai) sgn(εbi)                        (28)

As soon as νi(t) = 1, the parameter filter sends the 'non-containing' signal to indicate that this interval does not contain the faulty parameter value. If the fault is in the ith interval, let

ĥA = (1/2)(hai A + hbi A)                         (29)

represent the faulty value; fault isolation and identification are then achieved.

4 Numerical simulation

A case study is developed to test the effectiveness of the proposed scheme. The real data come from a laboratory pilot of a continuous intensified heat-exchanger/reactor. The pilot is made of three process plates sandwiched between five utility plates, as shown in Fig. 1. More related information can be found in [2]. As previously said, the simulation model considers just one cell, which may lead to moderate inaccuracy in the dynamic behavior of the realistic reactor. However, this point does not affect the application and demonstration of the proposed FDD algorithm, and encouraging results are obtained.
Figure 1 (a) Reactive channel design; (b) utility channel design; (c) the heat exchanger/reactor after assembly.

The constants and physical data used in the pilot are given in Table 1.

Table 1. Physical data used in the pilot

Constant     Value       Units
hA           214.8       W·K⁻¹
A            4e−6        m³
Vp           2.685e−5    m³
Vu           1.141e−4    m³
ρp, ρu       1000        kg·m⁻³
Cpp, Cpu     4180        J·kg⁻¹·K⁻¹

4.1 Operating conditions

The inlet flow rates of the utility fluid and the process fluid are Fu = 4.17e−6 m³·s⁻¹ and Fp = 4.22e−5 m³·s⁻¹. The inlet temperature of the utility fluid is time-varying between 15.6 °C and 12.6 °C, which is a classical disturbance in the studied system, as shown in Fig. 2. The inlet temperature of the process fluid is 76 °C. The initial conditions for all observers and models are taken as T̂p0 = T̂u0 = 30 °C and hA = 214.8 W·K⁻¹.

Fig. 2 Utility inlet temperature Tui.

4.2 High gain observer performance

To prove the convergence of the observers and show their tracking capabilities, suppose the heat transfer coefficient is subject to a decrease h = (1 − 0.01t)h followed by a sudden jump of 15 at t = 100 s. These variations and the observer estimation results are reported in Fig. 3.

Fig. 3 Simulation and estimation of the heat transfer coefficient variation.

The black curve simulates the actual changes of the parameter while the red one illustrates the estimate generated by the proposed observer. It can be seen from Fig. 3 that the estimated value tracks the behavior of the real value with good accuracy, thus ensuring good dynamics.

4.3 Sensor FDI and recovery demonstration

In order to show the effectiveness of the proposed method for sensor FDI, multiple faults and simultaneous faults in the temperature sensors are considered in case 1 and case 2, respectively. Besides, the pilot is subject to parameter uncertainty caused by the heat transfer coefficient decreasing as h = (1 − 0.01t)h. Two extended high gain observers are designed to generate a set of residuals achieving fault detection and isolation for the individual sensors. Observer 1 is fed by the output of sensor Tp to estimate all the states and the parameter, while observer 2 uses the output of sensor Tu. An advantage of the proposed FDI methodology is that if one sensor is faulty, we can use the estimated value generated by the healthy one to replace the faulty physical value, thus providing a healthy virtual measure.

Case 1: abrupt faults occur at the output of sensor Tp at t = 80 s and t = 100 s, with amplitudes of 0.3 °C and 0.5 °C, respectively. The results are reported in Figs. 5-8.

Fig. 5 Output temperatures of both fluids in case 1 by observer 1; the red curve shows the estimated value while the black one is the measured value.

It is obvious that from t = 80 s, T̂u (red curve) cannot track Tu (black curve) correctly, while it takes about 0.2 s for T̂p to track Tp at t = 80 s and t = 100 s. This suggests that faults have occurred; the following task is then to identify the size and location of the faulty sensors. Fig. 6 and Fig. 7 achieve this goal: it takes 0.1 s and 0.3 s to isolate the faults at 80 s and 100 s, respectively.

Fig. 6 Isolation residual in case 1.
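The residual test behind these isolation results, eqs. (17)-(19), can be sketched as follows; the residual traces and thresholds are illustrative numbers, not the simulation data.

```python
import numpy as np

# Sketch of the fault signature of eq. (19): each observer's self-residual
# r_ii = |yhat_i - y_i| is compared against its healthy bound mu_i, and the
# signature is 1 wherever the bound is exceeded. Numbers are illustrative.

def fault_signature(residuals, mu):
    """residuals: (n_sensors, n_steps) traces; mu: per-sensor thresholds."""
    r = np.max(np.abs(np.asarray(residuals)), axis=1)   # peak per sensor
    return (r >= np.asarray(mu)).astype(int)            # 1 = faulty
```

A signature of [1, 0] would, for instance, isolate a fault on sensor Tp while declaring sensor Tu healthy.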
Fig. 7b Fault signature in case 1; obviously, faults only occur at the output of sensor Tp.

For fault recovery, we can employ observer 2 as a soft sensor to generate a healthy value for the faulty sensor Tp. Observer 2 uses only the measured Tu to estimate all states and parameters; therefore, the T̂u and T̂p generated by observer 2 are decided only by Tu. In case 1, faults occur only on sensor Tp and sensor Tu is healthy, that is to say, the T̂u and T̂p generated by observer 2 will match their expected values. As shown in Fig. 8, since Tu is healthy, the estimated value T̂u tracks the measured Tu perfectly, while the estimated value T̂p (red curve) does not track the faulty measured value Tp (black curve). T̂p (red curve) illustrates the expected value for sensor Tp, and we can use the estimate T̂p (red curve) to replace the measured faulty value Tp (black curve) for fault recovery.

Fig. 8 Fault recovery in case 1; the red curve shows the estimated value while the black one is the measured value.

If faults occur only on the output of sensor Tu, the same results can be obtained easily. For multiple and simultaneous faults on both sensors, we can still isolate the faults correctly; case 2 verifies this point.

Case 2: simultaneous faults are imposed on the outputs of sensor Tp as in case 1 and of sensor Tu at t = 80 s with an amplitude of 0.6 °C. The results are reported in Figs. 9-10. The residuals are obviously beyond their thresholds at times 80 s and 100 s.

It can be seen from Fig. 9 and Fig. 10 that the proposed FDI scheme isolates the faults correctly; it takes 0.25 s and 0.4 s to isolate the faults in sensor Tp at 80 s and 100 s, and 0.2 s to isolate the fault in sensor Tu at t = 80 s, respectively. Compared with case 1, more time is needed in case 2.

Fig. 9 Isolation residual in case 2.

Fig. 10 Fault signature in case 2.

4.4 Fast process fault isolation and identification

Process faults are related to variation of the overall heat transfer coefficient (h). The heat transfer coefficient is considered as a variable which undergoes either an abrupt jump (caused by an unexpected fault in the flow rate) or a gradual variation (essentially due to fouling). For incipient variation, since fouling in the intensified heat-exchanger/reactor is tiny and only influences the dynamics, we have employed extended high gain observers to track the dynamics influenced by this slow variation. Therefore, abrupt changes in the heat transfer coefficient h can only be caused by sudden changes in the mass flow rate, which implies that the root cause of a process fault in this system is an actuator fault.

Suppose an abrupt jump in h occurs at t = 40 s, from 214.8 to 167.

Fig. 11 Detection residual in the process fault case.

From Fig. 11, at t = 40 s, unlike the sensor fault cases, the residual leaves zero and never goes back; this indicates that a process fault has occurred. For fast fault isolation and identification, we use the parameter interval filter methodology developed in [11]. According to [2], the heat transfer coefficient h varies between 130.96 and 214.8, so h is divided into 4 intervals as shown in Table 2, and the simulation results are shown in Fig. 12. It can be seen that at t = 40 s, only the index for the interval 150-170 goes to zero rapidly, so there is a fault in this interval. The faulty value is estimated by ĥA = (1/2)(ha A + hb A) = (1/2)(150 + 170) = 160. We can see that it is close to the actual faulty value 167, and if more intervals are used, the estimated value may be closer to the actual faulty value.
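The midpoint estimate of eq. (29), and the effect of refining the partition, can be checked numerically as follows. The four-interval edges follow Table 2; the finer 10-unit grid is a hypothetical refinement used only to illustrate the accuracy gain.

```python
# Sketch of the interval-midpoint estimate of eq. (29): locate the interval
# flagged as containing the faulty parameter value and report its midpoint.
# Halving the interval width halves the worst-case estimation error.

def midpoint_estimate(lo, hi):
    return 0.5 * (lo + hi)

def containing_interval(value, bounds):
    """bounds: sorted interval edges; return the (lo, hi) pair holding value."""
    for lo, hi in zip(bounds, bounds[1:]):
        if lo <= value < hi:
            return lo, hi
    return None
```

With the Table 2 edges [130, 150, 170, 190, 214] and the faulty value 167, the containing interval is (150, 170) and the estimate is 160, matching the text; a hypothetical 10-unit grid would instead give 165, closer to 167.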
Table 2 parameter filter intervals

Interval NO.   1     2     3     4
h_aA          130   150   170   190
h_bA          150   170   190   214

Fig. 12 "non-containing fault" index sent by parameter filter

5 Conclusion
An integrated approach for fault diagnosis in an intensified heat-exchanger/reactor has been developed in this paper. The approach is capable of detecting, isolating and identifying failures due to both sensors and parameters. Robustness of the proposed FDI scheme for sensor faults is ensured by adopting a soft sensor that is robust with respect to parameter uncertainties. A good isolation speed for process faults is guaranteed by the adoption of the parameter interval filter. It should be noticed that the proposed method is suitable for a large class of nonlinear systems with dynamic models similar to the studied system. Application to the pilot heat-exchanger/reactor confirms the effectiveness and robustness of the proposed approach.

References
[1] F. Théron, Z. Anxionnaz-Minvielle, M. Cabassud, C. Gourdon, and P. Tochon, "Characterization of the performances of an innovative heat-exchanger/reactor," Chem. Eng. Process. Process Intensif., vol. 82, pp. 30–41, 2014.
[2] N. Di Miceli Raimondi, N. Olivier-Maget, N. Gabas, M. Cabassud, and C. Gourdon, "Safety enhancement by transposition of the nitration of toluene from semi-batch reactor to continuous intensified heat exchanger reactor," Chem. Eng. Res. Des., vol. 94, pp. 182–193, 2015.
[3] P. Kesavan and J. H. Lee, "A set based approach to detection and isolation of faults in multivariable systems," Chem. Eng., vol. 25, pp. 925–940, 2001.
[4] D. Ruiz, J. M. Nougues, Z. Calderon, A. Espuna, and L. Puigjaner, "Neural network based framework for fault diagnosis in batch chemical plants," Comput. Chem. Eng., vol. 24, pp. 777–784, 2000.
[5] O. A. Z. Sotomayor and D. Odloak, "Observer-based fault diagnosis in chemical plants," Chem. Eng. J., vol. 112, pp. 93–108, 2005.
[6] M. Du and P. Mhaskar, "Isolation and handling of sensor faults in nonlinear systems," Automatica, vol. 50, no. 4, pp. 1066–1074, 2014.
[7] F. Xu, Y. Wang, and X. Luo, "Soft sensor for inputs and parameters using nonlinear singular state observer in chemical processes," Chinese J. Chem. Eng., vol. 21, no. 9, pp. 1038–1047, 2013.
[8] F. Caccavale and F. Pierri, "An integrated approach to fault diagnosis for a class of chemical batch processes," J. Process Control, vol. 19, no. 5, pp. 827–841, 2009.
[9] M. Du, J. Scott, and P. Mhaskar, "Actuator and sensor fault isolation of nonlinear process systems," Chem. Eng. Sci., vol. 104, pp. 294–303, 2013.
[10] D. Fragkoulis, G. Roux, and B. Dahhou, "Detection, isolation and identification of multiple actuator and sensor faults in nonlinear dynamic systems: Application to a waste water treatment process," Appl. Math. Model., vol. 35, no. 1, pp. 522–543, 2011.
[11] Z. Li and B. Dahhou, "A new fault isolation and identification method for nonlinear dynamic systems: Application to a fermentation process," Appl. Math. Model., vol. 32, pp. 2806–2830, 2008.
[12] X. Zhang, M. M. Polycarpou, and T. Parisini, "Fault diagnosis of a class of nonlinear uncertain systems with Lipschitz nonlinearities using adaptive estimation," Automatica, vol. 46, no. 2, pp. 290–299, 2010.
[13] R. F. Escobar, C. M. Astorga-Zaragoza, J. A. Hernández, D. Juárez-Romero, and C. D. García-Beltrán, "Sensor fault compensation via software sensors: Application in a heat pump's helical evaporator," Chem. Eng. Res. Des., pp. 2–11, 2014.
[14] F. Bonne, M. Alamir, and P. Bonnay, "Nonlinear observer of the thermal loads applied to the helium bath of a cryogenic Joule–Thompson cycle," J. Process Control, vol. 24, no. 3, pp. 73–80, 2014.
[15] M. Farza, K. Busawon, and H. Hammouri, "Simple nonlinear observers for on-line estimation of kinetic rates in bioreactors," Automatica, vol. 34, no. 3, pp. 301–318, 1998.
[16] W. Benaïssa, N. Gabas, M. Cabassud, D. Carson, S. Elgue, and M. Demissy, "Evaluation of an intensified continuous heat-exchanger reactor for inherently safer characteristics," J. Loss Prev. Process Ind., vol. 21, pp. 528–536, 2008.
[17] S. Li, S. Bahroun, C. Valentin, C. Jallut, and F. De Panthou, "Dynamic model based safety analysis of a three-phase catalytic slurry intensified continuous reactor," J. Loss Prev. Process Ind., vol. 23, no. 3, pp. 437–445, 2010.
[18] P. S. Varbanov, J. J. Klemeš, and F. Friedler, "Cell-based dynamic heat exchanger models - Direct determination of the cell number and size," Comput. Chem. Eng., vol. 35, pp. 943–948, 2011.
LPV subspace identification for robust fault detection using a set-membership approach: Application to the wind turbine benchmark

H. Chouiref 1, B. Boussaid 1, M.N. Abdelkrim 1, V. Puig 2 and C. Aubrun 3
1 Research Unit of Modeling, Analysis and Control of Systems (MACS), Gabès University
e-mail: houda.chouiref@gmail.com, dr.boumedyen.boussaid@ieee.org, naceur.abdelkrim@enig.rnu.tn
2 Advanced Control Systems Group (SAC), Technical University of Catalonia
e-mail: vicenc.puig@upc.edu
3 Centre de Recherche en Automatique de Nancy (CRAN), Lorraine University
e-mail: christophe.aubrun@univ-lorraine.fr
Abstract
This paper focuses on robust fault detection for Linear Parameter Varying (LPV) systems using a set-membership approach. Since most of the models which represent real systems are subject to modeling errors, standard fault detection (FD) LPV methods should be extended to be robust against model uncertainty. To solve this robust FD problem, a set-membership approach based on an interval predictor is used, considering a bounded description of the modeling uncertainty. Satisfactory results of the proposed approach have been obtained using several fault scenarios in the pitch subsystem considered in the wind turbine benchmark introduced in IFAC SAFEPROCESS 2009.

Figure 1: Fault diagnosis with set estimator schema
1 Introduction
The fault diagnosis of industrial processes has become an important topic because of its great influence on the operational control of processes. Reliable diagnosis and early detection of incipient faults avoid harmful consequences. Typically, faults in sensors, actuators and the process itself are considered. In the case of the wind turbine benchmark, a set of pre-defined faults with different locations and types is proposed in [1], where the dynamic change in the pitch system is treated. The procedure of fault detection is based either on knowledge or on a model of the system [2]. Model-based fault detection is often necessary to obtain good performance in the diagnosis of faults. The methods used in model-based diagnosis can be classified according to whether they use state observers, parity equations or parameter estimation [3]. For linear time invariant (LTI) systems, the FD task is largely solved by powerful tools. However, physical systems generally present nonlinear behaviors, and using LTI models in many real applications is not sufficient for high performance design. In order to achieve good performance while still using linear-like techniques, Linear Parameter Varying systems have recently received considerable attention [4], and many model-based applications using such systems and the subspace identification method have been published [5]. In model-based FD, a residual vector is used to describe the consistency check between the predicted and the real behavior of the monitored system. Ideally, the residuals should only be affected by the faults. However, the presence of disturbances, noise and modeling errors causes the residual to become non-zero. To take these errors into account, the fault detection algorithm must be robust. When modeling uncertainty in a deterministic way, there are two robust estimation methods: the first is bounded error estimation, which assumes the parameters are time invariant and that there is only an additive error [6]; the second is the interval predictor, which takes into account the variation of the parameters and considers both additive and multiplicative errors [7], [8]. Here, the interval predictor is combined with the nominal LPV identification presented in [9], allowing robustness to be included and false alarms to be minimized (see Fig. 1) [10]. Thus, this paper contributes a new set-membership estimator approach that combines the interval predictor scheme with LPV identification through subspace methods in one step. To illustrate the proposed methodology, the pitch subsystem of the wind turbine system proposed as a benchmark in IFAC SAFEPROCESS 2009 is used. First, this subsystem is modeled as an LPV model using the hydraulic pressure as the scheduling variable. Under the hypothesis that the damping ratio and natural frequency have an affine variation with the hydraulic pressure, this affine LPV model is estimated by means of the subspace LPV estimation algorithm. Second, the residual is synthesized to take into account robustness against the uncertainties in the parameters. This work is organized as follows: in Section 2, the LPV subspace estimation method is recalled. In Section 3, the interval predictor approach combined with the LPV subspace method is proposed as a tool for robust fault detection. In Section 4, the modeling of the pitch system as an LPV model is introduced. Section 5 deals with simulation experiments that illustrate the implementation and performance of the proposed approach applied to the robust fault detection of the wind turbine pitch system. Finally, Section 6 gives some concluding remarks.
2 LPV Subspace Identification method
In the literature, there are two families of methods for LPV identification: those based on global LPV estimation, and those based on the interpolation of local models [11]. However, the latter approaches can lead to unstable representations of the LPV structure even when the original system is stable [12]. That is why, in this paper, we use the subspace identification algorithm proposed in [9] and [13], which identifies LPV systems without requiring interpolation or identification of local models and thus avoids instability problems.

2.1 Problem formulation
In the model used for identification in [9], the system matrices depend linearly on the time varying scheduling vector as follows:

$$x_{k+1} = \sum_{i=1}^{m} \mu_k^{(i)} \left( A^{(i)} x_k + B^{(i)} u_k + K^{(i)} e_k \right) \quad (1)$$

$$y_k = C x_k + D u_k + e_k \quad (2)$$

where $x_k \in \mathbb{R}^n$, $u_k \in \mathbb{R}^r$, $y_k \in \mathbb{R}^l$ are the state, input and output vectors, $e_k$ denotes the zero mean white innovation process, and $m$ is the number of local models or scheduling parameters:

$$\mu_k = [1, \mu_k^{(2)}, \ldots, \mu_k^{(m)}]^T$$

Eqs. (1) and (2) can be written in the predictor form:

$$x_{k+1} = \sum_{i=1}^{m} \mu_k^{(i)} \left( \tilde{A}^{(i)} x_k + \tilde{B}^{(i)} u_k + K^{(i)} y_k \right) \quad (3)$$

with $\tilde{A}^{(i)} = A^{(i)} - K^{(i)} C$ and $\tilde{B}^{(i)} = B^{(i)} - K^{(i)} D$.

2.2 Assumptions and notation
Defining $z_k = [u_k^T, y_k^T]^T$ and using a data window of length $p$, the following stacked vector is defined:

$$\bar{z}_k^p = [z_k^T, z_{k+1}^T, \ldots, z_{k+p-1}^T]^T$$

Introducing the matrices obtained using the Kronecker product $\otimes$:

$$P_{p/k} = \mu_{k+p-1} \otimes \cdots \otimes \mu_k \otimes I_{r+l}$$

we can define the block-diagonal matrix

$$N_k^p = \operatorname{diag}\left( P_{p/k}, P_{p-1/k+1}, \ldots, P_{1/k+p-1} \right)$$

Now, defining the matrices $U$, $Y$ and $Z$:

$$U = [u_{p+1}, \ldots, u_N] \quad (4)$$

$$Y = [y_{p+1}, \ldots, y_N] \quad (5)$$

$$Z = [N_1^p \bar{z}_1^p, \ldots, N_{N-p+1}^p \bar{z}_{N-p+1}^p] \quad (6)$$

the controllability matrix can be expressed as $\kappa^p = [l_p, \ldots, l_1]$ with

$$l_1 = [\bar{B}^{(1)}, \ldots, \bar{B}^{(m)}], \qquad l_j = [\tilde{A}^{(1)} l_{j-1}, \ldots, \tilde{A}^{(m)} l_{j-1}]$$

If the matrix $[Z^T, U^T]$ has full row rank, the matrices $C\kappa^p$ and $D$ can be estimated by solving the following linear regression problem [14]:

$$\min_{C\kappa^p, D} \; \| Y - C\kappa^p Z - D U \|_F^2 \quad (7)$$

where $\|\cdot\|_F$ represents the Frobenius norm. This problem can be solved using traditional least squares methods, as in the case of LTI identification for time varying systems. Moreover, the observability matrix for the first local model is calculated as follows:

$$\Gamma^p = \begin{bmatrix} C \\ C\tilde{A}^{(1)} \\ \vdots \\ C(\tilde{A}^{(1)})^{p-1} \end{bmatrix}$$

With $\bar{\kappa}_k^p = [\phi_{p-1,k+1} \breve{B}_k, \ldots, \phi_{1,k+p-1} \breve{B}_{k+p-2}, \breve{B}_{k+p-1}]$ and $\breve{B}_k = [\tilde{B}, K_k]$, Eq. (3) can be transformed into:

$$x_{k+p} = \phi_{p,k} x_k + \bar{\kappa}_k^p \bar{z}_k^p = \phi_{p,k} x_k + \kappa^p N_k^p \bar{z}_k^p$$

where $\phi_{p,k} = \tilde{A}_{k+p-1} \cdots \tilde{A}_{k+1} \tilde{A}_k$. If the system (3) is uniformly exponentially stable, the approximation error can be made arbitrarily small, so that:

$$x_{k+p} \approx \kappa^p N_k^p \bar{z}_k^p$$

To calculate the observability matrix $\Gamma^p$ times the state $X$, we first calculate the matrix $\Gamma^p \kappa^p$:

$$\Gamma^p \kappa^p = \begin{bmatrix} C l_p & C l_{p-1} & \cdots & C l_1 \\ 0 & C\tilde{A}^{(1)} l_{p-1} & \cdots & C\tilde{A}^{(1)} l_1 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 0 & C(\tilde{A}^{(1)})^{p-1} l_1 \end{bmatrix}$$

Then, the following Singular Value Decomposition (SVD) is used:

$$\Gamma^p \kappa^p Z = [\upsilon_n \;\; \upsilon_\perp] \begin{bmatrix} \Sigma_n & 0 \\ 0 & \Sigma_\perp \end{bmatrix} \begin{bmatrix} V \\ V_\perp \end{bmatrix}$$
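To make the construction above concrete, the following toy sketch (illustrative only, not the authors' code; all dimensions and random models are assumptions) simulates a noise-free LPV system of the form (1)-(2), builds the data matrices of Eqs. (4)-(6) with the Kronecker-product weights $N_k^p$, and solves the least-squares problem (7):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, l, m, p, N = 2, 1, 1, 2, 3, 300   # state/input/output dims, local models, window, samples

# Random contractive local models A^(i), B^(i); shared C, D as in Eqs. (1)-(2)
A  = [0.2 * rng.standard_normal((n, n)) for _ in range(m)]
Bm = [rng.standard_normal((n, r)) for _ in range(m)]
C  = rng.standard_normal((l, n))
D  = rng.standard_normal((l, r))

mu = np.ones((N, m))
mu[:, 1] = rng.uniform(0.0, 0.3, N)      # scheduling mu_k = [1, mu_k^(2)]
u = rng.standard_normal((N, r))
y = np.zeros((N, l))
x = np.zeros(n)
for k in range(N):                       # simulate Eqs. (1)-(2) with e_k = 0
    y[k] = C @ x + D @ u[k]
    x = sum(mu[k, i] * (A[i] @ x + Bm[i] @ u[k]) for i in range(m))

d = r + l
def z_col(k):
    """Column N_k^p zbar_k^p of Z (Eq. (6)), 0-based window start k."""
    blocks = []
    for t in range(p):
        P = np.eye(d)                    # P = mu_{k+p-1} x ... x mu_{k+t} x I_{r+l}
        for v in range(k + t, k + p):
            P = np.kron(mu[v].reshape(-1, 1), P)
        blocks.append(P @ np.concatenate([u[k + t], y[k + t]]))
    return np.concatenate(blocks)

Umat = u[p:].T                           # Eq. (4)
Ymat = y[p:].T                           # Eq. (5)
Zmat = np.column_stack([z_col(k) for k in range(N - p)])

# Eq. (7): min || Y - [C kappa^p, D][Z; U] ||_F via ordinary least squares
ZU = np.vstack([Zmat, Umat])
W = np.linalg.lstsq(ZU.T, Ymat.T, rcond=None)[0].T
rel_err = np.linalg.norm(Ymat - W @ ZU) / np.linalg.norm(Ymat)
print(rel_err)  # small for stable dynamics, since x_{k+p} ~ kappa^p N_k^p zbar_k^p
```

Note that `Zmat` already has $(r+l)\sum_{j=1}^p m^j$ rows for this tiny example, which is the exponential growth in $p$ discussed below and the motivation for the kernel method.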
the state is estimated by:

$$\hat{X} = \Sigma_n^{1/2} V$$

Finally, the $C$ and $D$ matrices are estimated using the output equation (2), and $A$ and $B$ are estimated using the state equation (1). This algorithm can be summarized as follows [9]:
• Create the matrices $U$, $Y$ and $Z$ using (4), (5) and (6),
• Solve the linear problem given in (7),
• Construct $\Gamma^p$ times the state $X$,
• Estimate the state sequence,
• With the estimated state, use the linear relations to obtain the system matrices.

With a very small $p$, the estimate is in general biased, and when the bias is too large this becomes a problem; that is why a large $p$ should be chosen. With a very large $p$, however, this method suffers from the curse of dimensionality [13]: the number of rows of $Z$ grows exponentially with the size of the past window. In fact, the number of rows is given by:

$$\rho_Z = (r + l) \sum_{j=1}^{p} m^j$$

To overcome this drawback, the kernel method is introduced in the next subsection [15].

2.3 Kernel method
Equation (7) has a unique solution if the matrix $[Z^T \; U^T]$ has full row rank, and it is given by:

$$[\widehat{C\kappa^p} \;\; \widehat{D}] = Y \begin{bmatrix} Z^T & U^T \end{bmatrix} \left( \begin{bmatrix} Z \\ U \end{bmatrix} \begin{bmatrix} Z^T & U^T \end{bmatrix} \right)^{-1}$$

When this is not the case, which occurs when $p$ is large, the solution is computed using the SVD of the matrix:

$$\begin{bmatrix} Z \\ U \end{bmatrix} = [\upsilon \;\; \upsilon_\perp] \begin{bmatrix} \Sigma_m & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V^T \\ V_\perp^T \end{bmatrix}$$

Then, the minimum norm solution is given by:

$$[\widehat{C\kappa^p} \;\; \widehat{D}] = Y V \Sigma_m^{-1} \upsilon^T$$

To avoid computations in a large dimensional space, the minimum norm problem is restated as:

$$\min_\alpha \; \|\alpha\|_F^2 \quad (8)$$

subject to $Y - \alpha \left[ Z^T Z + U^T U \right] = 0$, where $\alpha$ are the Lagrange multipliers and $\left[ Z^T Z + U^T U \right]$ is referred to as the kernel matrix.

The matrix $\Gamma$ times the state $X$ can be constructed as follows:

$$\Gamma \kappa^p Z = \begin{bmatrix} \alpha \sum_{j=1}^{p} (Z^{1,j})^T Z^{1,j} \\ \alpha \sum_{j=2}^{p} (Z^{2,j})^T Z^{1,j} \\ \vdots \\ \alpha \sum_{j=p}^{p} (Z^{p,j})^T Z^{1,j} \end{bmatrix} \quad (9)$$

with

$$(Z^{i,j})^T Z^{1,j} = \left( \prod_{v=0}^{p-j} \mu_{\tilde{N}+v+j-i}^T \mu_{\tilde{N}+v+j-1} \right) \left( z_{\tilde{N}+j-i}^T z_{\tilde{N}+j-1} \right), \qquad Z^T Z = \sum_{j=1}^{p} (Z^{1,j})^T Z^{1,j} \quad (10)$$

Finally, the estimated state sequence is obtained by solving the original SVD problem. The kernel method can be summarized as follows [9]:
• Create the matrices $U^T U$ using (4), and $Z^T Z$ and $(Z^{i,j})^T Z^{i,j}$ using (10),
• Solve the linear problem given in (8),
• Construct $\Gamma$ times the state $X$ using (9) and (10),
• Estimate the state sequence,
• With the estimated state, use the linear relations to obtain the system matrices.

3 Interval predictor approach
To add robustness to the LPV subspace identification approach presented in the previous section, it is combined with the interval predictor approach [16]. The interval predictor approach is an extension of classical system identification methods that provides the nominal model plus uncertainty bounds for the parameters, guaranteeing that all data collected from the system in non-faulty scenarios will be included in the model prediction interval. This approach considers the additive and multiplicative uncertainties separately. Additive uncertainty is taken into account in the additive error term $e(k)$, and modeling uncertainty is considered to be located in the parameters, which are represented by a nominal value plus some uncertainty set around it. In the literature, there are many approximations of the uncertain parameter set $\Theta$. In our case, this set is described by a zonotope [10]:

$$\Theta = \theta^0 \oplus H B^n = \{ \theta^0 + Hz : z \in B^n \} \quad (11)$$

where $\theta^0$ is the nominal model (here obtained with the identification approach), $H$ is the uncertainty shape matrix, $B^n$ is a unitary box composed of $n$ unitary ($B = [-1, 1]$) interval vectors, and $\oplus$ denotes the Minkowski sum. A particular case of the parameter set is used, corresponding to the case where the parameter set $\Theta$ is bounded by an interval box [17]:

$$\Theta = [\underline{\theta}_1, \overline{\theta}_1] \times \cdots \times [\underline{\theta}_i, \overline{\theta}_i] \times \cdots \times [\underline{\theta}_{n_\theta}, \overline{\theta}_{n_\theta}] \quad (12)$$

where $\underline{\theta}_i = \theta_i^0 - \lambda_i$ and $\overline{\theta}_i = \theta_i^0 + \lambda_i$, with $\lambda_i \geq 0$ and $i = 1, \ldots, n_\theta$. In particular, the interval box can be viewed as a zonotope with center $\theta^0$ and $H$ equal to an $n_\theta \times n_\theta$ diagonal matrix:

$$\theta^0 = \left( \frac{\underline{\theta}_1 + \overline{\theta}_1}{2}, \frac{\underline{\theta}_2 + \overline{\theta}_2}{2}, \ldots, \frac{\underline{\theta}_n + \overline{\theta}_n}{2} \right) \quad (13)$$

$$H = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_n) \quad (14)$$

For every output, a model can be extracted in the following regressor form:

$$y(k) = \varphi(k)\theta(k) + e(k) \quad (15)$$

where
• $\varphi(k)$ is the regressor vector of dimension $1 \times n_\theta$, which can contain any function of the inputs $u(k)$ and outputs $y(k)$,
• $\theta(k) \in \Theta$ is the parameter vector of dimension $n_\theta \times 1$,
• $\Theta$ is the set that bounds the parameter values,
• $e(k)$ is the additive error, bounded by a constant: $|e(k)| \leq \sigma$.

In the interval predictor approach, the set of uncertain parameters $\Theta$ should be obtained such that all data measured in the fault-free scenario are covered by the interval of the predicted output:

$$y(k) \in [\underline{\hat{y}}(k) - \sigma, \; \overline{\hat{y}}(k) + \sigma] \quad (16)$$

where

$$\underline{\hat{y}}(k) = \hat{y}_0(k) - \|\varphi(k)H\|_1 \quad (17)$$

$$\overline{\hat{y}}(k) = \hat{y}_0(k) + \|\varphi(k)H\|_1 \quad (18)$$

and $\hat{y}_0(k)$ is the model output prediction with the nominal parameters $\theta^0 = [\theta_1, \theta_2, \ldots, \theta_{n_\theta}]^T$ obtained using the LPV identification algorithm:

$$\hat{y}_0(k) = \varphi(k)\theta^0(k) \quad (19)$$

Then, fault detection is based on checking whether (16) is satisfied. If it is not satisfied, a fault can be indicated; otherwise, nothing can be said.

4 Case study: wind turbine benchmark system
In this work, a specific variable speed turbine is considered. It is a three-blade horizontal axis turbine with a full converter. The conversion from wind energy to mechanical energy can be controlled by changing the aerodynamics of the turbine, either by pitching the blades or by controlling the rotational speed of the turbine relative to the wind speed. The mechanical energy is converted to electrical energy by a generator fully coupled to a converter. Between the rotor and the generator, a drive train is used to increase the rotational speed from the rotor to the generator [18]. This model can be decomposed into submodels: Aerodynamics, Pitch, Drive train and Generator [19] [20]. In this paper, we focus on faults in the pitch subsystem, as explained in the following subsection.

4.1 Pitch system model
In the wind turbine benchmark model, the hydraulic pitch is a piston servo mechanism which can be modeled by a second order transfer function [21] [1]:

$$\frac{\beta(s)}{\beta_r(s)} = \frac{\omega_n^2}{s^2 + 2\zeta\omega_n s + \omega_n^2} \quad (20)$$

Notice that $\beta_r$ refers to the reference values of the pitch angles. The pitch model can be written in the following state space form:

$$\begin{cases} \dot{x}_1 = x_2 \\ \dot{x}_2 = -2\zeta\omega_n x_2 - \omega_n^2 x_1 + \omega_n^2 u \end{cases} \quad (21)$$

with $x_1 = \beta$, $x_2 = \dot{\beta}$, $u = \beta_r$, which can be discretized using an Euler approximation. Then, the following system is obtained:

$$\begin{cases} x(k+1) = A x(k) + B u(k) \\ y(k) = C x(k) \end{cases} \quad (22)$$

with

$$A = \begin{bmatrix} 1 & T_e \\ -T_e \omega_n^2 & -2T_e \zeta\omega_n + 1 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ T_e \omega_n^2 \end{bmatrix}, \qquad C = [\, 1 \;\; 0 \,]$$

4.2 LPV pitch system model
The pitch parameters $\omega_n$ and $\zeta$ vary with the hydraulic pressure $P$ [1] [22]. Then, the pitch model can be written as the following LPV model according to [23], using $P$ as the scheduling variable $\vartheta$:

$$\begin{cases} x(k+1) = A(\vartheta) x(k) + B(\vartheta) u(k) \\ y(k) = C x(k) \end{cases} \quad (23)$$

with

$$A(\vartheta) = \begin{bmatrix} 1 & T_e \\ -T_e \omega_n^2(P) & -2T_e \zeta(P)\omega_n(P) + 1 \end{bmatrix}, \qquad B(\vartheta) = \begin{bmatrix} 0 \\ T_e \omega_n^2(P) \end{bmatrix}$$

and $y(k) = x_1(k) = \beta(k)$.

4.3 Regressor form pitch system model
The pitch model can be transformed into the following regression form [24]:

$$y(k) = \varphi(k)\theta(k) \quad (24)$$

where $\varphi(k)$ is the regressor vector, which can contain any function of the inputs $u(k)$ and outputs $y(k)$, $\theta(k) \in \Theta$ is the parameter vector, and $\Theta$ is the set that bounds the parameter values. In particular:

$$\varphi(k) = [\, y(k-2) \;\; y(k-1) \;\; u(k-2) \,], \qquad \theta = [\, \theta_1 \;\; \theta_2 \;\; \theta_3 \,]^T$$

$$\theta_1 = -T_e^2 \omega_n^2 + (2\omega_n \zeta T_e - 1), \qquad \theta_2 = -2\omega_n \zeta T_e + 2, \qquad \theta_3 = T_e^2 \omega_n^2$$

5 Results
The pitch systems, which in this case are hydraulic, can be affected by faults in any of the three blades. The considered faults in the hydraulic system can result in changed dynamics due to a drop in the main line pressure. This dynamic change induces a change in the system parameters: the damping ratio varies between 0.6 and 0.9 and the natural frequency between 3.42 rad/s and 11.11 rad/s according to [23]. In this work, a fault detection subspace estimator is designed to determine the presence of a fault. To distinguish between faults and modeling errors, an interval predictor approach is applied and residual generation is used for deciding whether a fault is present.
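As a numerical cross-check of Section 4 (an illustrative sketch, not the authors' code: $T_e$ is an assumed sampling time, while $\omega_n = 11.11$ and $\zeta = 0.6$ are the nominal values quoted in Section 5), the Euler discretization (22) and the regressor coefficients of Section 4.3 describe the same dynamics:

```python
import numpy as np

wn, zeta, Te = 11.11, 0.6, 0.01   # Te is an assumed sampling time

# State-space form, Eq. (22)
A = np.array([[1.0, Te],
              [-Te * wn**2, -2 * Te * zeta * wn + 1]])
B = np.array([0.0, Te * wn**2])
C = np.array([1.0, 0.0])

# Regressor coefficients, Section 4.3
theta1 = -Te**2 * wn**2 + (2 * wn * zeta * Te - 1)
theta2 = -2 * wn * zeta * Te + 2
theta3 = Te**2 * wn**2

# Simulate Eq. (22) and verify the equivalent ARX form
# y(k) = theta2*y(k-1) + theta1*y(k-2) + theta3*u(k-2) at every step
rng = np.random.default_rng(1)
u = rng.standard_normal(50)
x = np.zeros(2)
y = []
for k in range(50):
    y.append(C @ x)
    x = A @ x + B * u[k]
for k in range(2, 50):
    y_pred = theta2 * y[k - 1] + theta1 * y[k - 2] + theta3 * u[k - 2]
    assert abs(y[k] - y_pred) < 1e-9
```

A small sanity property falls out of these expressions: $\theta_1 + \theta_2 + \theta_3 = 1$, consistent with the unit DC gain of the transfer function (20).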
Figure 2: Upper (red line) and lower (blue line) bounds
Figure 3: Damping ratio in non-faulty case

To illustrate the performance of this robust fault detection approach, $\zeta \in [0.6, 0.63]$ and $\omega_n \in [10.34, 11.11]$ rad/s are considered. Then, the parameter set $\Theta$ is bounded by an interval box:

$$\Theta = [\underline{\theta}_1, \overline{\theta}_1] \times [\underline{\theta}_2, \overline{\theta}_2] \times [\underline{\theta}_3, \overline{\theta}_3] \quad (25)$$

and, for $i = 1, \ldots, 3$:

$$\lambda_i = \frac{\overline{\theta}_i - \underline{\theta}_i}{2} \quad (26)$$

$$\theta_i^0 = \frac{\overline{\theta}_i + \underline{\theta}_i}{2} \quad (27)$$

Using equations (17) and (18), the output bounds used in the fault detection test are calculated; they are given in Fig. 2. $\hat{y}_0(k)$ is obtained using the identification approach described in Section 2. To validate this algorithm, two cases are used:

- Case 1: the pressure varies after time 10000 s while the parameters vary within the interval of parametric uncertainty, that is, the damping ratio varies between 0.6 and 0.63 and the natural frequency between 10.34 rad/s and 11.11 rad/s. These parameters are presented in Figures 3 and 4, respectively. The pitch angle in this case is given in Fig. 5, together with the prediction intervals.

Figure 4: Frequency in non-faulty case
Figure 5: Pitch angle in non-faulty case

For fault detection, the residual signal, based on the comparison between the measured pitch angle and the estimated one at each sampling instant, is calculated; it is shown in Fig. 6. For the fault decision, a fault indicator signal is used and the decision is taken as a function of this indicator: if the actual angle is not within the predicted interval given in Eq. (16), the fault indicator is equal to 1 and the system is faulty; otherwise, it is equal to 0 and the system is fault-free. The fault indicator signal given in Fig. 7 shows that there is no fault despite the pressure variation: the parameter variation is considered as a modeling error.

- Case 2: the pressure $P$ varies outside its nominal value between times $t = 10000$ s and $t = 17000$ s. In this time interval, the damping ratio varies between 0.63 and 0.72 and the frequency varies between 8.03 rad/s and 10.34 rad/s.
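The interval test of Eqs. (16)-(18) and the 0/1 fault indicator used in these experiments can be sketched as follows (the parameter bounds and $\sigma$ below are illustrative placeholders, not the identified values):

```python
import numpy as np

# Illustrative interval box for theta = [theta1, theta2, theta3]
theta_lo = np.array([-0.99, 1.85, 0.010])
theta_hi = np.array([-0.97, 1.88, 0.014])
theta0 = (theta_lo + theta_hi) / 2            # interval midpoints, Eq. (27)
H = np.diag((theta_hi - theta_lo) / 2)        # H = diag(lambda_i), Eqs. (14), (26)
sigma = 0.01                                  # additive error bound |e(k)| <= sigma

def fault_indicator(y_k, y_km1, y_km2, u_km2):
    """1 if the measured output leaves the predicted interval of Eq. (16), else 0."""
    phi = np.array([y_km2, y_km1, u_km2])     # phi(k) = [y(k-2) y(k-1) u(k-2)]
    yhat0 = phi @ theta0                      # nominal prediction, Eq. (19)
    radius = np.sum(np.abs(phi @ H)) + sigma  # ||phi(k) H||_1 + sigma, Eqs. (17)-(18)
    return 0 if yhat0 - radius <= y_k <= yhat0 + radius else 1

print(fault_indicator(0.9, 1.0, 1.0, 1.0))    # consistent measurement -> 0
print(fault_indicator(5.0, 1.0, 1.0, 1.0))    # far outside the interval -> 1
```

Because the interval radius $\|\varphi(k)H\|_1$ depends on the current regressor, the detection threshold adapts to the operating point, which is what keeps the parameter variations of Case 1 from raising false alarms.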
Outside this interval, the damping ratio varies between 0.6 and 0.63 and the natural frequency between 10.34 rad/s and 11.11 rad/s, as shown in Figures 8 and 9. In this case, the pitch angle is given in Fig. 10, while the residual and fault indicator signals are presented in Fig. 11 and Fig. 12, respectively.

Figure 6: Residual in non-faulty case
Figure 7: Fault indicator in non-faulty case
Figure 8: Damping ratio in faulty case
Figure 9: Frequency in faulty case

Fig. 12 shows that the fault indicator signal changes its signature between times 10000 s and 17000 s, which indicates that the parameters vary beyond the modeled range, due to the actuator fault introduced in the wind turbine benchmark system between the instants $t = 10000$ s and $t = 17000$ s.

6 Conclusions
The proposed approach is based on an LPV estimation approach to generate a residual as the difference between the real and the nominal behavior of the monitored system. When a fault occurs, this residual goes out of the interval which represents the uncertainty bounds in the non-faulty case. These bounds are generated by means of an interval predictor approach that adds robustness to this fault detection method by propagating the parameter uncertainty to the residual or predicted output.
The proposed approach is illustrated by implementing a robust fault detection scheme for the pitch subsystem of the wind turbine benchmark. Simulations show satisfactory fault detection performance despite model uncertainties.

Figure 10: Pitch angle in faulty case
Figure 11: Residual signal
Figure 12: Fault indicator

References
[1] P. Odgaard, J. Stoustrup, and M. Kinnaert. Fault tolerant control of wind turbines: a benchmark model. In 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Barcelona, Spain, 2009.
[2] R. Isermann. Fault diagnosis systems: an introduction from fault detection to fault tolerance. 2006.
[3] M. Blanke, M. Kinnaert, J. Lunze, M. Staroswiecki, and J. Schröder. Diagnosis and fault-tolerant control. 2006.
[4] G. Mercere, M. Lovera, and E. Laroche. Identification of a flexible robot manipulator using a linear parameter-varying descriptor state-space structure. In Proc. of the IEEE Conference on Decision and Control, Orlando, Florida, USA, 2011.
[5] J. Dong, B. Kulcsár, and M. Verhaegen. Fault detection and estimation based on closed-loop subspace identification for linear parameter varying systems. In DX, Stockholm, 2009.
[6] J. Bravo, T. Alamo, and E.F. Camacho. Bounded error identification of systems with time-varying parameters. IEEE Transactions on Automatic Control, 51:1144-1150, 2006.
[7] D. Efimov, L. Fridman, T. Raissi, A. Zolghadri, and R. Seydou. Interval estimation for LPV systems applying high order sliding mode techniques. Automatica, 48:2365-2371, 2012.
[8] D. Efimov, T. Raissi, and A. Zolghadri. Control of nonlinear and LPV systems: interval observer-based framework. IEEE Transactions on Automatic Control, 2013.
[9] J. van Wingerden and M. Verhaegen. Subspace identification of bilinear and LPV systems for open- and closed-loop data. Automatica, 45:371-381, 2009.
[10] J. Blesa, V. Puig, J. Romera, and J. Saludes. Fault diagnosis of wind turbines using a set-membership approach. In 18th IFAC World Congress, Milano, Italy, 2011.
[11] H. Tanaka, Y. Ohta, and Y. Okimura. A local approach to LPV-identification of a twin rotor MIMO system. In Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008.
[12] R. Toth, F. Felici, P. Heuberger, and P. Van den Hof. Discrete time LPV I/O and state-space representations, differences of behavior and pitfalls of interpolation. In Proceedings of the European Control Conference (ECC), Kos, Greece, 2007.
[13] J. van Wingerden and M. Verhaegen. Subspace identification of MIMO LPV systems: the PBSID approach. In Proceedings of the 47th IEEE Conference on Decision and Control, Cancun, Mexico, 2008.
[14] P. Gebraad, J. van Wingerden, G. van der Veen, and M. Verhaegen. LPV subspace identification using a novel nuclear norm regularization method. In American Control Conference, San Francisco, CA, USA, 2011.
[15] V. Verdult and M. Verhaegen. Kernel methods for subspace identification of multivariable LPV and bilinear systems. Automatica, 41:1557-1565, 2005.
[16] J. Blesa, V. Puig, and J. Saludes. Identification for passive robust fault detection using zonotope-based set-membership approaches. International Journal of Adaptive Control and Signal Processing, 25:788-812, 2011.
[17] V. Puig, J. Quevedo, T. Escobet, F. Nejjari, and S. de las Heras. Passive robust fault detection of dynamic processes using interval models. IEEE Transactions on Control Systems Technology, 16:1083-1089, 2008.
[18] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Set-point reconfiguration approach for the FTC of wind turbines. In 18th World Congress of the International Federation of Automatic Control (IFAC), Milano, Italy, 2011.
[19] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Two-level active fault tolerant control approach. In The Eighth International Multi-Conference on Systems, Signals and Devices (SSD'11), Sousse, Tunisia, 2011.
[20] B. Boussaid, C. Aubrun, and M.N. Abdelkrim. Active fault tolerant approach for wind turbines. In The International Conference on Communications, Computing and Control Applications (CCCA'11), Hammamet, Tunisia, 2011.
[21] P. Odgaard, J. Stoustrup, and M. Kinnaert. Fault tolerant control of wind turbines: a benchmark model. IEEE Transactions on Control Systems Technology, 21:1168-1182, 2013.
[22] P. Odgaard and J. Stoustrup. Results of a wind turbine FDI competition. In 8th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, Mexico, 2012.
[23] C. Sloth, T. Esbensen, and J. Stoustrup. Robust and fault tolerant linear parameter varying control of wind turbines. Mechatronics, 21:645-659, 2011.
[24] H. Chouiref, B. Boussaid, M.N. Abdelkrim, V. Puig, and C. Aubrun. LPV model-based fault detection: Application to wind turbine benchmark. In International Conference on Electrical Sciences and Technologies (CISTEM'14), Tunis, 2014.
Processing measure uncertainty into fuzzy classifier
Thomas Monrousseau1 , Louise Travé-Massuyès1 and Marie-Véronique Le Lann1,2
1
CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
2
Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
e-mails: thomas.monrousseau@laas.fr, louise@laas.fr ,mvlelann@laas.fr
Abstract

Machine learning techniques, such as data-based classification, are a useful diagnosis solution for monitoring complex systems when designing a model is a long and expensive process. When classification is used for process monitoring, the processed data are available thanks to sensors. But in many situations it is hard to get an exact measure from these sensors: measurements are corrupted by noise, which can be caused by the environment, improper use of the sensor, or the conversion from an analog signal to a numerical value. In this paper we propose a framework based on a fuzzy logic classifier that models the uncertainty on the data by means of crisp (non-fuzzy) or fuzzy intervals. Our objective is to increase the number of correct classification results in the presence of noisy data. The classifier, named LAMDA (Learning Algorithm for Multivariate Data Analysis), can perform machine learning and clustering on different kinds of data such as numerical values, symbols or interval values.

1 Introduction

Data classification is the process of dividing pattern space into a number of regions using hard, fuzzy or probabilistic partitions [1]. Classification algorithms are increasingly used nowadays, in a world where it is not always simple to obtain a model of a complex process; in contrast, it is easy to monitor a system and store the resulting data. Different types of classifiers can be used depending on the situation. The principal ones described in the literature are artificial neural networks, k-nearest neighbors, support vector machines, decision trees, fuzzy classifiers and statistical methods.

Most of the time, data come from sensor measurements and are corrupted by noise. This noise can have different origins, for example environmental disturbances, improper use of the sensor, hysteresis effects, or the numerical conversion and representation of the data. Many application domains have to deal with noise problems, such as medical diagnosis [2], biological identification [3] or image recognition [4]. Uncertainty can be understood in two ways: the first is the uncertainty directly present in the data, like noise; the second can be viewed as the reliability of a feature inside a class. In this paper we consider only the first case. To mitigate noise problems in classification, several solutions have been proposed previously, for example the transformation of data [5] [6] [7], the use of type-1 or type-2 fuzzy logic [3], or statistical models.

Fuzzy logic is a multi-valued logic framework introduced by Zadeh [8] that is known to represent uncertainty and imprecision better than binary logic. In previous work, a fuzzy classifier named Learning Algorithm for Multivariate Data Analysis (LAMDA) was proposed by Aguilar [9]. This classifier can natively process two different types of data simultaneously: quantitative data and qualitative data. A real number carries an infinite amount of precision whereas human knowledge is finite and discrete; LAMDA is interesting because no solution proposed in the literature processes heterogeneous data in a uniform way, and handling quantitative and qualitative data in the same problem is often a complex task. A new type of data, the interval, was introduced by Hedjazi [10] to model uncertainties by means of crisp intervals. In this paper we propose an extension to fuzzy intervals in order to improve the processing of noisy data measurements, while keeping the capacity to handle other feature types such as "clean" data or qualitative features. Moreover, the algorithm should remain low cost in terms of memory and computation time, so that the method can be embedded on small systems.

In the first part of the paper the LAMDA algorithm is briefly presented; then a method to classify noisy data with this algorithm is introduced. The method comes in two parts: the first is a general solution that models uncertainty on data with crisp intervals based on confidence intervals, and the second is an improvement that models Gaussian noise with fuzzy intervals. In both cases application examples show the improvement of the method compared to using the data without transformation.

2 LAMDA algorithm (Learning Algorithm for Multivariate Data Analysis)

This section presents the principle of the LAMDA algorithm.

2.1 General principle

LAMDA is a classification algorithm based on fuzzy logic, built on an original idea of Aguilar [9], that can achieve machine learning and clustering on large data sets.

The algorithm takes as input a sample x made up of N features. The first step is to compute, for each feature of x, an adequacy degree to each class Cj, j = 1..J, where J is the total number of classes. This is obtained by the use of a fuzzy adequacy function. So J vectors of N adequacy degrees are computed; these vectors are called Marginal Adequacy Degree vectors (MAD). At this point, all the features are in a common space. The second step is to take all the MADs and aggregate them into one Global Adequacy Degree (GAD) by means of a fuzzy aggregation function. Thus the J MAD vectors (each composed of N MADs) become J scalar GADs; the higher the GAD, the better the adequacy to the class. The simplest way to assign the sample x to a class is to keep as result the class with the biggest GAD.

The whole process is summarized in Fig. 1.

Figure 1: Summarized scheme of the LAMDA algorithm

2.2 Fuzzy membership computation

During the learning step, the algorithm creates prototype data for each class and for each feature. These data are called class descriptors or prototypes; they can be for example means or variances. We define as Cj,n the class prototype of the n-th feature for the class j.

As previously mentioned, the first step of the algorithm is a comparison between the sample vector x and all the Cj,n. This operation is performed with membership functions and gives as a result a marginal adequacy degree. Thus MADj,n is the MAD for the j-th class and the n-th feature. As the framework is based on fuzzy logic, all memberships are numbers in the [0, 1] interval. The general membership function is:

    MADj,n = f(Cj,n, xn)    (1)

The class prototype Cj,n depends on two things: the type of data and the function used. Some functions may require only one datum in Cj,n whereas others need a list of parameters.

In the following, some examples of membership functions are presented.

• Quantitative data:
  Many functions are available for this kind of data, for example the Gaussian:

      f(xn) = exp(−(xn − ρj,n)² / (2σj,n²))    (2)

  or the binomial function:

      f(xn) = ρj,n^xn · (1 − ρj,n)^(1−xn)    (3)

  where xn is the n-th feature of the sample x, ρj,n is the mean of the n-th feature for the class j, and σj,n is the standard deviation of the n-th feature for the class j.

• Qualitative data:
  Qualitative data can take values in a set of modalities. The membership function for qualitative data returns the frequency of the modality taken by the feature inside the class during the learning phase. We introduce a qualitative variable with K modalities {Q1, ..., QK} and the frequency Φj^k of the modality Qk for the class j. The membership is described by:

      f(xn) = (Φj,n^1)^q1 · ... · (Φj,n^K)^qK    (4)

  with q^k = 1 if xn = Qk and q^k = 0 if xn ≠ Qk.

• Intervals:
  The membership function for interval data tests the similarity between two fuzzy intervals. In this case similarity is defined by two components: the distance between the intervals and the surface that these intervals have in common. Indeed, the class prototype for crisp interval data is a mean interval. The similarity function is:

      S(A, B) = (1/2) · ( ∫V µA∩B(ξ)dξ / ∫V µA∪B(ξ)dξ + 1 − ∂[A, B]/ϖ[V] )    (5)

  where µX(x) is the value of x in the fuzzy set X, ∂[A, B] is the distance between the intervals A = [a−, a+] and B = [b−, b+], and ϖ[X] is the size of a fuzzy set in a universe V, described by:

      ϖ[X] = ∫V µX(ξ)dξ    (6)

  In the case of crisp intervals in a universe between 0 and 1:

      S(A, B) = (1/2) · ( ϖ[A ∩ B]/ϖ[A ∪ B] + 1 − ∂[A, B] )    (7)

  where ϖ[X] can in this case be replaced by the length of the interval:

      ϖ[X] = upperbound(X) − lowerbound(X)    (8)

  and the distance ∂[A, B] is defined as:

      ∂[A, B] = max[0, max(a−, b−) − min(a+, b+)]    (9)

  In the case where an interval feature is used, the prototype for a class j is given by [ρj^n−, ρj^n+], where ρj^n− (respectively ρj^n+) represents the mean value of the lower bounds (respectively upper bounds) of all the elements belonging to class j for this feature.

Once the MADs are computed, whatever the feature type, it is possible to perform any type of processing, as described in Fig. 2.
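The membership functions above can be sketched in code. The following is a minimal illustration of Eqs. (2), (3), (4) and (7)-(9), not the authors' implementation: the function names and the dictionary encoding of modality frequencies are our own, and the crisp-interval similarity assumes a universe normalized to [0, 1], with interval lengths standing in for the fuzzy-set sizes as in Eq. (8).

```python
import math

def mad_gaussian(x, rho, sigma):
    # Eq. (2): Gaussian membership with class mean rho and std sigma
    return math.exp(-((x - rho) ** 2) / (2 * sigma ** 2))

def mad_binomial(x, rho):
    # Eq. (3): binomial membership; x and rho are normalized to [0, 1]
    return rho ** x * (1 - rho) ** (1 - x)

def mad_qualitative(x, freq):
    # Eq. (4): the product reduces to the learned frequency of the
    # observed modality; freq maps each modality Qk to Phi_k
    return freq.get(x, 0.0)

def mad_crisp_interval(a, b):
    # Eq. (7): similarity of crisp intervals a = (a-, a+), b = (b-, b+)
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))  # length of A ∩ B
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter        # length of A ∪ B, per Eq. (8)
    dist = max(0.0, max(a[0], b[0]) - min(a[1], b[1]))   # Eq. (9)
    ratio = inter / union if union > 0 else 1.0
    return 0.5 * (ratio + 1 - dist)
```

Identical intervals give a similarity of 1, and disjoint intervals are penalized both by a zero intersection ratio and by the gap distance of Eq. (9).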
Figure 2: Projection principle for heterogeneous feature types

2.3 Marginal adequacy degree merging

Once all the features are grouped in the membership space, the next step of the algorithm is to transform the MAD vectors into a set of single values that depict the global membership of the sample to each class. These values were introduced in Section 2.1 and are called GADs. To perform this transformation a fuzzy aggregation function Ψ is used:

    Ψ(MAD) = α·γ(MAD) + (1 − α)·β(MAD)    (10)

where γ is a fuzzy T-norm and β is a fuzzy T-conorm. The parameter α is called the exigency indicator; it gives more or less significance to the intersection operation relative to the union operation. Two fuzzy T-norm/T-conorm pairs are currently implemented in the algorithm: the min-max and the probabilistic. For example, if min-max is used, (10) becomes:

    Ψ(MAD) = α·min(MAD) + (1 − α)·max(MAD)    (11)

When all GADs are computed, they give the membership of the sample x to each class. The final result depends on the application, but the simplest decision is to assign the sample to the class with the highest GAD. A membership threshold can also be fixed: if no GAD is higher than the threshold, the sample is declared unclassifiable.

3 Uncertainty modeled with crisp intervals

3.1 Method presentation

Every data measurement is corrupted by noise. In some cases the noise is strong enough to increase the classification error. The point is therefore to model the imprecision of the data in order to decrease the number of misclassifications.

A technique used in several fields of application is the use of intervals to represent data uncertainty [11] [12]. So we suggest a framework where numerical data are transformed into intervals to model imprecision.

In a situation where the probability law followed by the noise on a variable is unknown, it may still be possible to obtain a confidence interval, i.e., an interval in which the real value of the measure lies with a certain level of confidence (for example, a 95% confidence interval is an interval in which the exact value of the measure can be found with a probability of 95%). Introducing x̂ the measured value and l the length of a confidence interval centered on zero and based on the measurement error, the interval used by the algorithm is X = [x̂ − l/2; x̂ + l/2].

The main aim of the transformation is to improve the classification in the transition zones, where the data are really sensitive to noise and a small change can modify the output of the classifier. The use of intervals to model uncertainty is effective only if the "clean" data are relevant for the classification problem. If this is not the case, a better solution is to remove the irrelevant feature, which in most cases provides better output results. In other words, if the "clean" data are difficult to classify, confidence intervals do not improve the situation.

3.2 Experiments

A set of data has been created for an application test; it can be interpreted as the time evolution of the sensors of a continuous process. This data set is composed of three quantitative (numerical) features of 101 samples, shown in Fig. 3. Three classes are specified and used as targets for the classifier. These classes are chosen arbitrarily to represent different behaviors of a system, which could be healthy or failure modes. Nevertheless, the classes are built to make all the data relevant for system monitoring, which means the three features do not have a globally negative impact on the classification results.

The three features x, y and z are defined by the following time functions:

• x = e^(−t/2)
• y = (1/2)·e^(t/4) − 1
• z = tanh(t − 5)

Figure 3: Data used to test the intervals method

This example is used to measure the improvement in the classification results in the case where all data are noisy. Artificial noise is added as follows: x is the ideal variable without noise and x̂ the noisy variable; x̂ = x + Y, with Y a random variable following a uniform distribution on an interval I.
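As an illustration, the min-max aggregation of Eq. (11), the winner-take-all decision of Section 2.3, and the confidence-interval transformation X = [x̂ − l/2; x̂ + l/2] of Section 3.1 can be sketched as follows. The function names and the list-of-lists MAD layout are our own conventions, not part of LAMDA's actual interface.

```python
def gad_minmax(mads, alpha=0.8):
    # Eq. (11): exigent mix of the min T-norm and the max T-conorm
    return alpha * min(mads) + (1 - alpha) * max(mads)

def classify(mad_vectors, alpha=0.8, threshold=None):
    # mad_vectors[j] holds the N marginal adequacy degrees of class j;
    # the sample is assigned to the class with the highest GAD
    gads = [gad_minmax(v, alpha) for v in mad_vectors]
    best = max(range(len(gads)), key=gads.__getitem__)
    if threshold is not None and gads[best] < threshold:
        return None  # no GAD above the threshold: unclassifiable
    return best

def to_confidence_interval(x_hat, l):
    # Section 3.1: crisp interval centered on the measured value x_hat,
    # l = length of the zero-centered confidence interval of the error
    return (x_hat - l / 2, x_hat + l / 2)
```

With α = 0.8 the aggregation is dominated by the worst marginal adequacy, which matches the exigent setting used in the experiments of Section 3.2.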
Figure 4: An example of data corrupted with a noise in the interval [−0.5; 0.5]

The experiment has been performed under the following conditions: the α parameter of (10) is set to 0.8, the [min, max] functions are used to compute the fuzzy aggregation, and the membership function used for quantitative data is the binomial. The [min, max] aggregation is chosen because experiments on the algorithm showed that this kind of aggregation provides better results on noisy data than the probabilistic one. A first classification without any noise gives a result of 91% of good classifications. The experiment is then repeated a great many times to avoid statistical flukes: in this case it has been run fifty thousand times, x̂ being recomputed at each new run. Results are given in Table 1.

    Interval for random data      [−0.3; 0.3]   [−0.5; 0.5]   [−2; 2]
    Mean success percentage
    with binomial function           89.9%         84.7%       79.6%
    Mean success percentage
    with interval function           91.9%         89.8%       70.3%

Table 1: Table of results for the crisp intervals method

As can be seen, this method improves the results in the first two cases, where noise deteriorates the classification with the quantitative method but the data are still globally consistent. In these cases, the interval method gives better results than the binomial method 82% of the time. But when the noise amplitude is much higher than the data, as with the [−2; +2] error interval, the interval method generally does worse than the binomial function.

4 Modeling Gaussian noise with fuzzy intervals

4.1 Fuzzy interval method presentation

Most of the time, noise on a physical measure follows a Gaussian distribution centered on the real value, so it is interesting to model this specific kind of uncertainty. Nevertheless, it is difficult to handle fuzzy intervals with an exact Gaussian shape. That is why we suggest approximating the Gaussian with a triangular fuzzy interval. This interval is described with a lower boundary x− and an upper boundary x+, X = [x−; x+], which leads to a description similar to crisp intervals. So:

    µX(x−) = 0,  µX(x+) = 0  and  µX((x+ + x−)/2) = 1

with µX(x) the fuzzy value of x in the fuzzy set X. As a Gaussian of mean ρ is centered on the true measured value, the point of maximum fuzzy value of the triangle, (x+ + x−)/2, is equal to ρ. To compute x− and x+ we propose to use the full width at half maximum (FWHM), calculated as:

    FWHM = 2·√(2 ln 2)·σ    (12)

where σ is the standard deviation of the measure. Thus, for a Gaussian function with mean value ρ and standard deviation σ, the approximated interval X is defined by X = [ρ − 2√(2 ln 2)·σ; ρ + 2√(2 ln 2)·σ]. An example of this approximation is given in Fig. 5.

Figure 5: Example of approximation of a Gaussian fuzzy interval by a triangular fuzzy interval

Until now, all the implementations of the LAMDA algorithm used only crisp intervals, although the general method had been introduced. The class prototype is now a triangular interval computed with the means of the upper and lower boundaries of the data used to train the algorithm. The membership function is thus still a similarity measure between two fuzzy intervals, as in (5), but it is necessary to redefine the distance function between the intervals. A solution has been proposed to measure a distance with the centers of gravity of triangular fuzzy intervals [13]. In the present situation:

    ∂[A, B] = |(a+ + a−)/2 − (b+ + b−)/2|    (13)

with A = [a−; a+] and B = [b−; b+] being triangular fuzzy intervals as described in this section.
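The triangular approximation of Eqs. (12)-(13) can be sketched as follows. This is an illustrative sketch with names of our own choosing, and one deliberate simplification: the paper computes the intersection A ∩ B of Eq. (5) analytically, whereas this sketch falls back on a plain numerical integration of the membership curves.

```python
import math

def gaussian_to_triangle(rho, sigma):
    # Eq. (12): triangular fuzzy interval approximating a Gaussian,
    # with support rho +/- FWHM where FWHM = 2*sqrt(2 ln 2)*sigma
    half = 2 * math.sqrt(2 * math.log(2)) * sigma
    return (rho - half, rho + half)

def mu(tri, x):
    # Membership of x in a symmetric triangular fuzzy interval tri = (lo, hi):
    # 0 at the bounds, 1 at the midpoint (the apex)
    lo, hi = tri
    apex = (lo + hi) / 2
    if x <= lo or x >= hi:
        return 0.0
    return (x - lo) / (apex - lo) if x <= apex else (hi - x) / (hi - apex)

def cog_distance(a, b):
    # Eq. (13): distance between the centers of gravity of the triangles
    # (for a symmetric triangle the COG abscissa is the midpoint)
    return abs((a[0] + a[1]) / 2 - (b[0] + b[1]) / 2)

def similarity(a, b, universe=(0.0, 1.0), steps=2000):
    # Eq. (5) with numerical integration of min/max membership curves
    # (the paper uses an exact geometric solution instead)
    lo, hi = universe
    dx = (hi - lo) / steps
    inter = union = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        ma, mb = mu(a, x), mu(b, x)
        inter += min(ma, mb) * dx
        union += max(ma, mb) * dx
    ratio = inter / union if union > 0 else 1.0
    return 0.5 * (ratio + 1 - cog_distance(a, b) / (hi - lo))
```

Two identical triangular intervals have similarity 1; as the centers of gravity drift apart, both the overlap ratio and the distance term of Eq. (13) push the similarity down.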
The intersection A ∩ B needed in (5) is calculated with an analytical solution based on geometry and trigonometry. This avoids a numerical integration that could be less precise and longer to compute.

4.2 Experiments

As previously with the crisp method, a test is performed with a Gaussian noise on the same data set (Fig. 3). The test is done under the same conditions as in the previous section; the difference lies in the construction of the noisy data x̂ = x + Y. Y is now a random variable that follows a normal distribution with standard deviation σ, centered on 0. Results of the simulation are given in Table 2.

    σ                                0.2      0.5      0.7      1
    Mean success percentage
    with binomial function          83.2%    79.8%    79.8%    79.6%
    Mean success percentage
    with crisp interval function    86.8%    82.5%    77.2%    71.3%
    Mean success percentage
    with fuzzy interval function    93.1%    84.5%    79.3%    74.8%

Table 2: Table of results for the fuzzy intervals method

Similarly to the previous test, the interval method increases the rate of good classifications until the standard deviation σ becomes too high, at which point the binomial function provides better results. This point is reached here for σ = 0.7, which corresponds to a signal-to-noise ratio (SNR) of 6 dB for the signal with the smallest amplitude. It is also important to note that in all cases the fuzzy interval method provides better results than the crisp interval method.

4.3 Experiments on the iris dataset

As a second example we use the classical iris dataset [14]. This dataset contains four features: sepal length in cm, sepal width in cm, petal length in cm and petal width in cm. All these features are measured for three types of flower, iris Setosa, iris Versicolour and iris Virginica, which constitute three classes. It is easy to classify the iris dataset without any error by using only the petal information, which is in general more relevant than the sepal one. Thus only the sepal sizes are kept in this test, to simulate the noise. Figure 6 shows the repartition of the data in the 2D space of the sepal features.

Figure 6: Representation of iris data by class

We assume that the data follow a normal distribution centered on a mean µj,n and with a standard deviation σj,n. This hypothesis can be verified with a statistical test. The Kolmogorov-Smirnov test has been applied for each class with a 5% significance level; it shows that the hypothesis holds for iris Setosa and iris Versicolour but not for iris Virginica. Nevertheless, all the data are processed as if they followed a normal distribution.

The classifications are performed using the cross-validation method. The percentages of well-classified data for the two methods are:

• using the binomial function (scalar): 81.3%
• using fuzzy triangular intervals: 94.0%

Once again the classification rate is increased by the use of the fuzzy interval method instead of the binomial one.

5 Conclusion

We presented in this article two methods to model uncertainty for classification applications. An example showed that these methods can improve classification results even when the signal-to-noise ratio is high. The second method, based on fuzzy intervals, demonstrated that modeling the probability law of the noise more precisely can provide better results than using confidence intervals modeled by crisp intervals. However, this way of modeling uncertainty shows its limits when the SNR reaches a low level. An important future work is to bring the classification error of the interval method down to the level of the numerical method. These methods will now be tested on data coming from a real industrial process.

Another way to manage uncertainty in classifiers like LAMDA could be to use type-2 fuzzy functions [15]. This is an extension of classical fuzzy logic where the membership functions output a fuzzy interval, which can be used to model the variance of the data.

To provide a better solution to manage uncertainty in the LAMDA classifier, it can be useful to extend the problem to qualitative features. It is often difficult to determine whether a qualitative element is close to another; for example, the color "orange" is closer to "red" than to "blue". But on small training datasets, taking this kind of information into account can improve the final classification results. This could be done by using similarity matrices, which are already used in some artificial intelligence problems.

The LAMDA algorithm can work with a feature selection algorithm named MEMBAS (Membership Margin Based Feature Selection) [16]. This algorithm uses the LAMDA class definitions and membership functions to provide an analytical solution for feature selection. A future work will be to measure the impact of the interval representation on the MEMBAS algorithm, in order to perform selection on noisy data.
References

[1] J. C. Bezdek. A review of probabilistic, fuzzy, and neural models for pattern recognition. Journal of Intelligent and Fuzzy Systems, Vol. 1, No. 1, pp. 1-25, 1993.

[2] E. Alba, J. Garcia-Nieto, L. Jourdan, and E. Talbi. Gene selection in cancer classification using PSO/SVM and GA/SVM hybrid algorithms. In Evolutionary Computation, CEC 2007, IEEE Congress on, pp. 284-290, Sept. 2007.

[3] Scott Ferson, H. Resit Akqakaya, and Amy Dunham. Using fuzzy intervals to represent measurement error and scientific uncertainty in endangered species classification. In Fuzzy Information Processing Society, 1999, NAFIPS, 18th International Conference of the North American on, pp. 690-694, Jul 1999.

[4] Zhang Weiyu, S.X. Yu, and Shang-Hua Teng. Power SVM: Generalization with exemplar classification uncertainty. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 2144-2151, June 2012.

[5] Arafat Samer, Dohrmann Mary, and Skubic Marjorie. Classification of coronary artery disease stress ECGs using uncertainty modeling. In Computational Intelligence Methods and Applications, 2005 ICSC Congress, 2005.

[6] Kynan E. Graves and Romesh Nagarajah. Uncertainty estimation using fuzzy measures for multiclass classification. Neural Networks, IEEE Transactions on, Vol. 18, pp. 128-140, 2007.

[7] Prabha Verma and R.D.S. Yadava. Fuzzy c-means clustering based uncertainty measure for sample weighting boosts pattern classification efficiency. In Computational Intelligence and Signal Processing (CISP), 2012 2nd National Conference on, pp. 31-35, 2012.

[8] L.A. Zadeh. Fuzzy sets. Information and Control, Vol. 8, pp. 338-353, June 1965.

[9] Carrete N.P. and Aguilar-Martin J. Controlling selectivity in nonstandard pattern recognition algorithms. IEEE Transactions on Systems, Man and Cybernetics, Vol. 21, pp. 71-82, Jan/Feb 1991.

[10] Hedjazi L., Aguilar-Martin J., Le Lann M.V., and Kempowsky T. Towards a unified principle for reasoning about heterogeneous data: a fuzzy logic framework. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, Vol. 20, No. 2, pp. 281-302, 2012.

[11] B. Kuipers. Qualitative Reasoning: Modeling and Simulation with Incomplete Knowledge. The MIT Press, Cambridge, Massachusetts, 1994.

[12] Lynne Billard. Some analyses of interval data. Journal of Computing and Information Technology, CIT 16, pp. 225-233, 2008.

[13] Hsieh C. H. and Chen S. H. Similarity of generalized fuzzy numbers with graded mean integration representation. In Proceedings of the Eighth International Fuzzy Systems Association World Congress, Vol. 2, pp. 551-555, Taipei, Taiwan, Republic of China, 1999.

[14] Fisher R.A. {UCI} Machine Learning Repository, 1936. http://archive.ics.uci.edu/ml.

[15] J.M. Mendel, R.I. John, and F. Liu. Interval type-2 fuzzy logic systems made simple. Fuzzy Systems, IEEE Trans. on, Vol. 14, No. 6, pp. 808-821, Dec. 2006.

[16] L. Hedjazi, J. Aguilar-Martin, and M.V. Le Lann. Similarity-margin based feature selection for symbolic interval data. Pattern Recognition Letters, Vol. 32, No. 4, pp. 578-585, March 2012.
Tools/Benchmarks
Random generator of k-diagnosable discrete event systems

Yannick Pencolé1,2
1 CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
2 Univ de Toulouse, LAAS, F-31400 Toulouse, France
e-mail: yannick.pencole@laas.fr

Abstract

This paper presents a random generator of discrete event systems that are by construction k-diagnosable. The aim of this generator is to provide an almost infinite set of diagnosable systems for creating benchmarks. The goal of such benchmarks is to provide a solid set of examples to test and compare algorithms that solve the many problems around diagnosable discrete event systems.

1 Introduction

For many years, the problem of fault diagnosis in discrete event systems has been actively addressed by different scientific communities such as DX (AI-based diagnosis) [1; 2], FDI (Fault Detection and Isolation), and DES (Discrete Event Systems) [3]. Depending on the community, many different aspects of the same problem have been addressed, such as the design of efficient diagnosers, the checking of diagnosability properties, and the effective modelling of real systems. When dealing with performance, most of the contributions present experimental results on specific examples of their own, usually inspired by or based on real-world systems. The main problem with these contributions is that they are not really comparable, as they are not applied to the same benchmarks. Moreover, the benchmarks used may not always be completely defined in a paper, most of the time because of confidential data that cannot be published, so other academic contributors cannot use them for comparison purposes. In order to analyse and boost the effective performance of algorithms addressing the fault diagnosis problem in DES, common and fully available benchmarks become a necessity.

This paper addresses the random generation of k-diagnosable systems. We propose the possibility to generate (and store on a web page) k-diagnosable systems produced without any kind of bias that would come from a specific diagnosis/diagnosability method. By doing so, we propose to design a random category of benchmarks, as the SAT community did for SAT problems, and to get the same advantages by comparing different diagnosis/diagnosability approaches on the same but random systems. The choice of generating k-diagnosable systems is motivated by the fact that they can be used as examples for:

1. diagnosis algorithms: given a fault f, we know by construction that the most precise algorithm will determine its occurrence with certainty within the next k observations after the occurrence of f;

2. diagnosability algorithms: the fact that a fault f is k-diagnosable is usually the worst case for this type of algorithm (as they all look for the existence of an ambiguous scenario to conclude that the system is not diagnosable).

The paper is organised as follows. After formally recalling the problem that motivates the generation of benchmarks, we describe the fundamental property that is used for the effective generation of systems where a given fault f is k-diagnosable. Then the description of the algorithm of the generator is provided, as well as some details about its effective implementation.

2 Background

This paper addresses the random generation of benchmarks for the problem of fault diagnosis of discrete event systems. This problem is briefly recalled in this section. We assume that the reader is familiar with the notations of language theory (notions of Kleene closure, prefixes, ...).

2.1 Modelling

We suppose that the system under monitoring behaves as an event generator that can be modelled as an automaton.

Definition 1 (System description). The model (system description) SD of a discrete event system S is a finite state automaton SD = (Q, Σ, T, q0) where:

• Q is a finite set of states;
• Σ is a finite set of events;
• T ⊆ Q × Σ × Q is a finite set of transitions;
• q0 is the initial state of the system.

Σ is the set of events that the system can produce. Among Σ we distinguish the events that are observable, Σo ⊆ Σ, and the events that are not observable. When the system operates, its effective behaviour is represented by a trace of the automaton (also called a run).

Definition 2 (Trace). A trace τ ∈ Σ* of the system is a finite sequence of events associated with a transition path from the initial state q0 to a state q in the model of the system.

The set of traces of the system is the language generated by its model and is denoted L(S) (so the automaton SD generates the language L(S)). Let PΣ0(τ) be the classical projection of a sequence τ of Σ* on the alphabet Σ0, recursively defined as follows:

1. PΣ0(ε) = ε;
2. PΣ0(τ.e) = PΣ0(τ) if e ∉ Σ0;
3. PΣ0(τ.e) = PΣ0(τ).e if e ∈ Σ0.
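The recursive projection above erases every event outside the target alphabet while preserving the order of the others. As an illustration, it can be transcribed directly, encoding traces as Python sequences of event names (an encoding of our own choosing, not part of the paper):

```python
def project(trace, alphabet):
    # Projection P_Sigma0: keep only the events of the sub-alphabet,
    # preserving their relative order (rules 1-3 of Section 2.1)
    return [e for e in trace if e in alphabet]
```

Projecting a trace on the observable alphabet Σo yields its observable trace (Definition 3).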
2. PΣ0 (τ.e) = PΣ0 (τ ) if e 6∈ Σ0 ; Definition 8 (Diagnosability). The fault f is diagnosable in
a system S if:
3. PΣ0 (τ.e) = PΣ0 (τ ).e if e ∈ Σ0 .
∃n ∈ N+ , Diagnosable(n)
Based on this notion of projection, we can associate with
any trace of the system its observable part. where Diagnosable(n) stands for:
Definition 3 (Observable trace). Let τ be a trace of the sys- ∀τ1 .f ∈ L(S), ∀τ2 : τ1 .f.τ2 ∈ L(S)
tem, the observable trace στ is the projection of τ over the |PΣo (τ2 )| ≥ n ⇒
set of observable events Σo :
(∀τ ∈ L(S), (PΣo (τ ) = PΣo (τ1 .f.τ2 ) ⇒ f ∈ τ )).
στ = PΣo (τ ). Definition 9 (k-Diagnosability). The fault f is k-
diagnosable, k ∈ N+ , in a system S if:
2.2 Diagnosis problem and solution
Diagnosable(k) ∧ ¬Diagnosable(k − 1).
Now we are ready to define the classical Fault diagnosis
problem on DES. Diagnosability is a property that relies on the liveness
of the observability of the system which means that, to be
Definition 4 (Fault). A fault is a non-observable event f ∈ (k)-diagnosable, a system must not generate unbounded se-
Σ. quences of unobservable events (no cycle of unobservable
A fault is represented as a special type of non-observable events in SD). Throughout this paper, we consider that the
event that can occur on the underlying system. Once the observability of the system is live.
event has occurred, we say that the fault is active in the sys-
tem, otherwise it is inactive. We consider here the problem 3 Random Generator
of permanent faults as initially introduced in [4]. The aim of this section is to present the algorithm that is
Definition 5 (Diagnosis problem). A diagnosis problem is being used to randomly generate a discrete event systems
a triple (SD, OBS, FAULTS) where SD is the model of where a fault f is k-dignosable and that has been imple-
a system, OBS is the sequence of observations of Σ?o and mented inside the Diades software. We focus on the gener-
FAULTS is the set of fault events defined over SD. Informally speaking, (SD, OBS, FAULTS) represents the problem of finding the set of active faults from FAULTS that have occurred, relying on the model SD and the sequence of observations OBS.

Definition 6 (Diagnosis Candidate). A diagnosis candidate is a couple (q, F) where q is a state of SD (q ∈ Q) and F is a set of faults.

A diagnosis candidate represents the fact that the underlying system is in state q and that the set F of faults has occurred before reaching state q.

Definition 7 (Solution Diagnosis). The solution ∆ of the problem (SD, OBS, FAULTS) is the set of diagnosis candidates (q, F) such that there exists for each of them at least one trace τ of SD such that:
1. the observable trace of τ is exactly the sequence OBS = o1...om and the last event of τ is om;
2. the set of fault events that have occurred in τ is exactly F;
3. the final state of τ is q.

Informally, candidate (q, F) is part of the solution if it is possible to find in SD a behaviour of the system that satisfies OBS, leads to the state q after the last observation of OBS, and in which the faults F have occurred.

2.3 Diagnosability
Diagnosability is a property of the system that asserts whether a fault f of a system S can always be diagnosed with certainty after a finite sequence of observations [4]. In other words, once the fault f has occurred in S, it suffices to wait for a certain number of observations to ensure that any candidate (q, F) of the solution contains f (f ∈ F).

...ation of a system with one fault only (see the conclusion for the generation for n, n > 1 faults).

3.1 Signatures and fault ambiguity
The algorithm that generates a k-diagnosable system relies on the notion of signatures. Let f be a faulty event; the signature of f is the set of observable traces resulting from the projection of the system traces that contain at least one occurrence of the event f before the last observation of the trace.

Definition 10 (Signature). The signature of an event f in a system S is the language Sig(f) ⊆ Σo* such that
Sig(f) = {σ_τ | τ = τ1.o.τ2 ∈ L(S), f ∈ τ1, o ∈ Σo, τ2 ∈ Σ*, σ_τ = P_Σo(τ)}.

In the following, we also denote by Sig(¬f) the set of observable traces associated with the traces of the system that do not contain any fault f before the last observation. Intuitively speaking, as long as the current observable trace is in Sig(¬f) ∩ Sig(f), the system may have produced either a faulty or a non-faulty trace before the last observation. k-diagnosability ensures that this ambiguity can last for at most k observations. The principle of the generator relies on the following result, which formalizes this intuition. Let LARGESTPREFIXES(τ, n) = {τ'_i : τ = τ'_i.τ_i, |τ_i| = i, i ∈ {0, ..., n−1}} be the set of the n largest prefixes of τ (τ being a prefix of itself).

Theorem 1. In the system S, the event f is k-diagnosable if and only if:
1. For any observable trace σ in Sig(¬f) ∩ Sig(f), there exists n < k such that LARGESTPREFIXES(σ, n) ⊆ Sig(¬f) ∩ Sig(f) and LARGESTPREFIXES(σ, n + 1) ⊈ Sig(¬f) ∩ Sig(f).
2. There exists at least one observable trace σ in Sig(¬f) ∩ Sig(f) such that LARGESTPREFIXES(σ, k − 1) ⊆ Sig(¬f) ∩ Sig(f) and an observable o such that σ.o ∈ Sig(f).
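To make the two prefix conditions of Theorem 1 concrete, here is a small self-contained Python sketch (our own illustration, not the paper's DiaDes implementation): observable traces are tuples of events, the ambiguous language Sig(f) ∩ Sig(¬f) is a hand-made finite set, and LARGESTPREFIXES is computed exactly as in the definition above.

```python
# Hedged illustration of Theorem 1's prefix sets. The toy ambiguous
# language below is invented for the example.

def largest_prefixes(trace, n):
    """LARGESTPREFIXES(trace, n): the n largest prefixes of `trace`,
    obtained by removing i = 0 .. n-1 events from its end."""
    return {trace[:len(trace) - i] for i in range(n) if len(trace) - i >= 0}

def max_ambiguity_depth(ambiguous, trace):
    """Largest n such that LARGESTPREFIXES(trace, n) stays inside the
    ambiguous language Sig(f) ∩ Sig(¬f)."""
    n = 0
    while n <= len(trace) and largest_prefixes(trace, n + 1) <= ambiguous:
        n += 1
    return n

# Toy ambiguous language: after the fault, observing 'a' then 'b' can
# still be explained by a non-faulty behaviour; a third observation
# would resolve the ambiguity.
ambiguous = {('a',), ('a', 'b')}          # stands for Sig(f) ∩ Sig(¬f)
depth = max_ambiguity_depth(ambiguous, ('a', 'b'))   # 2
```

With this toy language, every ambiguous trace satisfies condition 1 with n ≤ 2, so the smallest k compatible with Theorem 1 is k = 3, provided some observation extends ('a', 'b') into Sig(f) only (condition 2).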
Proceedings of the 26th International Workshop on Principles of Diagnosis
Proof: (⇒) Let τ1.f be a trace of the system S. As S is k-diagnosable, there exists m ≤ k such that ∀τ2 : τ1.f.τ2 ∈ L(S), |P_Σo(τ2)| ≥ m ⇒ (∀τ ∈ L(S), (P_Σo(τ) = P_Σo(τ1.f.τ2) ⇒ f ∈ τ)). Consider one of these traces τ1.f.τ2 such that τ2 contains exactly m observations (P_Σo(τ2) = o1...om). k-diagnosability implies that there exists a minimal integer n ∈ {1, ..., m − 1} such that P_Σo(τ1.f).o1...on+1 ⇒ f ∈ τ as soon as τ ∈ L(S) and P_Σo(τ) = P_Σo(τ1.f).o1...on+1; therefore P_Σo(τ1.f).o1...on+1 ∈ Sig(f) \ Sig(¬f) and ∀i ∈ {1, ..., n}, P_Σo(τ1.f).o1...oi ∈ Sig(f) ∩ Sig(¬f). So LARGESTPREFIXES(P_Σo(τ1.f).o1...on, n) ⊆ Sig(¬f) ∩ Sig(f). So for any τ1.f there exists n < k such that LARGESTPREFIXES(P_Σo(τ1.f).o1...on, n) ⊆ Sig(¬f) ∩ Sig(f). Now, remark that for any observable sequence σ that belongs to Sig(¬f) ∩ Sig(f), there must exist a trace τ1.f.τ2 of the system, with τ2 containing at least one observable event, such that σ = P_Σo(τ1.f.τ2), so there must exist n < k such that σ ∈ LARGESTPREFIXES(P_Σo(τ1.f).o1...on, n) ⊆ Sig(¬f) ∩ Sig(f); so, for any σ that belongs to Sig(¬f) ∩ Sig(f), there is no set LARGESTPREFIXES(σ, n + 1) that only contains ambiguous signatures.

Finally, as S is k-diagnosable, we know that there exists at least one trace τ.f.τ1.o1 such that τ is a trace of the system that does not contain f, τ1 is a finite continuation of τ.f that is unobservable, o1 is observable, and there is a finite continuation τ2.o2.τ3.o3...τk.ok with P_Σo(τi) = ε such that for any i ∈ {1, ..., k − 1}, P_Σo(τ.f.τ1.o1...τi.oi) ∈ Sig(¬f) ∩ Sig(f) and P_Σo(τ.f.τ1.o1...τk.ok) ∈ Sig(f) \ Sig(¬f), which implies condition 2 with σ = P_Σo(τ.f.τ1.o1...τk−1.ok−1).

(⇐) Suppose now that conditions 1 and 2 hold. Consider an observable trace σ that is ambiguous (σ ∈ Sig(f) ∩ Sig(¬f)). Condition 1 states that there exists n < k such that LARGESTPREFIXES(σ, n) ⊆ Sig(¬f) ∩ Sig(f) and LARGESTPREFIXES(σ, n + 1) ⊈ Sig(¬f) ∩ Sig(f). Consider now any largest observable trace σ' such that |σ'| − |σ| = m and σ ∈ LARGESTPREFIXES(σ', m) ⊆ Sig(¬f) ∩ Sig(f); it follows that LARGESTPREFIXES(σ', m + n) ⊆ Sig(¬f) ∩ Sig(f) and LARGESTPREFIXES(σ', m + n + 1) ⊈ Sig(¬f) ∩ Sig(f). As σ' is one of the largest observable traces holding this condition, any observable trace σ'.o, o ∈ Σo, is either in Sig(f) or in Sig(¬f) but not in both of them. Condition 1 states that m + n < k, so at most k observations are required to solve the ambiguity. Condition 2 states that there exists at least one such observable trace σ' with m + n = k − 1 and an observation o so that σ'.o is definitively in Sig(f), so f can be diagnosed with certainty in this case with exactly k observations. Hence the result.

3.2 Algorithm
The principle of the random generator is depicted in Algorithm 1. Given a parameter k and a fault event f, the algorithm randomly generates a system S where the event f is k-diagnosable by construction. We also provide another parameter deg, which is the maximal number of output transitions allowed per state during the generation of the system. Parameter deg is important for the creation of benchmarks, as the output degree has a strong influence on the diagnosis/diagnosability computations.

Algorithm 1 General algorithm for the random generation of k-diagnosable systems.
Input: k ∈ N, k ≥ 1
Input: f an event
Input: deg maximal output degree
  (Σo, Σ) ← GENERATEEVENTS()
  S ← ∅
  AmbSig(f) ← GENERATEAMBSIGNATURE(k, Σo, deg)
  /* AmbSig(f) = (Q, Σo, T, q0, A) a deterministic automaton */
  MF[q0] ← GENERATESTATES()
  MNF[q0] ← GENERATESTATES()
  for all q ∈ Q in breadth-first order from q0 do
    (Σo^f, Σo^¬f) ← RANDOMSPLIT(Σo, q)
    S ← S ∪ GENFAULTEXTS*(MF[q], Σo^f, deg)
    S ← S ∪ GENNOMEXTS*(MNF[q], Σo^¬f, deg)
    for all q −o→ q' ∈ T do
      if MF[q'] = ∅ then
        MF[q'] ← GENERATESTATES()
        MNF[q'] ← GENERATESTATES()
      end if
      S ← S ∪ GENNOMEXTS(MNF[q], MNF[q'], o, deg)
      if q' ∉ A then
        S ← S ∪ GENNOMEXTS(MF[q], MF[q'], o, deg)
      else
        if q ∈ A then
          S ← S ∪ GENEXTS(MF[q], MF[q'], o, deg)
        else
          S ← S ∪ GENFAULTEXTS(MF[q], MF[q'], o, deg)
        end if
      end if
    end for
  end for
Output: S where f is k-diagnosable.

The generation is composed of two steps. The first one is the generation of the ambiguous signature with GENERATEAMBSIGNATURE. The result of this function is a deterministic automaton AmbSig(f) = (Q, Σo, T, q0, A) that actually generates the language Sig(f) ∩ Sig(¬f) (any transition path from state q0 to an accepting state of A represents a sequence of Sig(f) ∩ Sig(¬f)). The automaton AmbSig(f) is generated with respect to conditions 1 and 2 of Theorem 1 to ensure the k-diagnosability of the resulting system S. The second step is the effective generation of S based on the ambiguous signature AmbSig(f). The idea is to map every state q of AmbSig(f) to two sets of states in S, denoted MF[q] and MNF[q]. Given any path σ of AmbSig(f) that leads to state q with, as a last observation, the event o, any state of MF[q] (resp. MNF[q]) will be reached by at least one transition path τ of S starting from a state of MF[q0] (resp. MNF[q0]) that ends with a transition labelled with o and whose observable projection is exactly σ. The difference between MF and MNF is that any underlying path of S leading to a state of MF[q] (resp. MNF[q]) has an observable projection which is a prefix of Sig(f) (resp. a prefix of Sig(¬f)). To generate S, we explore AmbSig(f) from its initial state in a breadth-first manner. For a given state q, we have to consider three types of transition
generations going out of any state of MF[q], MNF[q]. The first ones are the transition paths that will lead to an observation o that belongs to AmbSig(f); the second one is the set of transition paths that do not lead to an observation o that belongs to AmbSig(f) but lead to an observation o' that belongs to Sig(f) only; and the third one is the set of transition paths that do not lead to an observation o that belongs to AmbSig(f) but lead to an observation o'' that belongs to Sig(¬f) only.

The second and third cases are handled by randomly splitting Σo into two subsets (Σo^f, Σo^¬f), each of them only containing observable events that are not output events of q in AmbSig(f) (RANDOMSPLIT(Σo, q)). Then, given Σo^f, we randomly generate faulty extensions for a subset of Σo^f (the selection of the subset is also random and might even be empty; however, if q has no output events in AmbSig(f), it must be extended to ensure that the observability of the system is live). An extension is a set of acyclic and unobservable transition paths that lead to a transition labeled with an observable event from Σo^f. A faulty extension ensures that an event f has occurred at least once on any generated transition path before the observable transition (GENFAULTEXTS). Given Σo^¬f, we proceed the same way to generate non-faulty extensions (GENNOMEXTS). As, in these two cases, the traces generated by these extensions are no longer associated with observable traces involved in AmbSig(f), it is sufficient to generate further extensions on these traces and guarantee that the observable language associated with these further extensions is live (this procedure is denoted by the * in GENNOMEXTS* and GENFAULTEXTS*).

The last case to handle is the case where the observable event o is an output event of q in AmbSig(f), which means that there exists one and only one transition q −o→ q' in AmbSig(f). If q' has never been visited, the sets of states MF[q'] and MNF[q'] are generated first. A nominal extension is generated from MNF[q] to MNF[q']. Depending on the status of q', the extension between MF[q] and MF[q'] is different. If q' ∉ A, it means that the prefixes generated by AmbSig(f) with paths from q0 to q' are prefixes of sequences in Sig(f) ∩ Sig(¬f) but are not themselves in Sig(f) ∩ Sig(¬f); they can therefore only be in Sig(¬f): extensions between MF[q] and MF[q'] are then nominal extensions. Now, if q' ∈ A, there are two cases. If q ∉ A, the system must become faulty between the states of MF[q] and MF[q'] so that any path of the system that reaches a state of MF[q'] is associated with an observable trace that belongs to Sig(f) (GENFAULTEXTS). If q ∈ A, any path that reaches a state of MF[q] is already faulty (its observable trace is already in Sig(f)); any type of extension from MF[q] to MF[q'] is therefore possible (faulty or not), hence the use of GENEXTS.

3.3 Implementation
Algorithm 1 is implemented with the help of the DiaDes library package [5]. DiaDes is a set of C++ libraries that implement discrete event systems in a component-based way, as well as different diagnosis algorithms as defined in the spectrum of [6] (from component-based algorithms to diagnoser-based algorithms). DiaDes also implements a diagnosability checker as well as an accuracy checker. The generator results in a Linux terminal command dd-diagnosable-des-generate with a set of parameters such as the number of (un)observable events, the output degree of transitions, the parameter k, the minimal number of observable events involved in the ambiguous signature, and the number of states (still experimental). One particular parameter is the seed parameter that allows the generation of the same system (the seed ensures the same generation of random numbers). By construction, the algorithm is linear in the number of states. A set of pre-computed benchmarks as well as the implemented generator are available at the following url:
http://homepages.laas.fr/ypencole/benchmarks

4 Conclusions
To test and compare diagnosis and/or diagnosability algorithms, fully detailed and available benchmarks are a necessity. In order to test how generic an algorithm is, we propose here an algorithm that randomly generates systems where a fault f is k-diagnosable. We also propose an implementation within the DiaDes framework. The extension to generate systems with n k-diagnosable faults is easy: it requires repeating the generation of the ambiguous signatures for the n faults and exploring them in parallel to generate the k-diagnosable system. Our short-term perspective is to improve the generator to allow a better control of the number of generated states. A fixed number of generated states requires adding new constraints that propagate during the generation process. Without any control of this propagation, the generation may simply fail, as it could become an over-constrained problem. Our perspective is also to go one step further by generating diagnosable systems that are component-based, in order to scale up the size of the generated system. The DiaDes framework already has a tool to generate component-based systems [5] which ensures that any component is globally consistent, but adding the constraint of diagnosability makes the generation far more complex to implement.

References
[1] Gianfranco Lamperti and Marina Zanella. Diagnosis of discrete-event systems from uncertain temporal observations. Artificial Intelligence, 137:91–163, 2002.
[2] Yannick Pencolé and Marie-Odile Cordier. A formal framework for the decentralised diagnosis of large scale discrete event systems and its application to telecommunication networks. Artificial Intelligence, 164(2):121–170, 2005.
[3] Janan Zaytoon and Stéphane Lafortune. Overview of fault diagnosis methods for discrete event systems. Annual Reviews in Control, 37:308–320, 2013.
[4] Meera Sampath, Raja Sengupta, Stéphane Lafortune, Kasim Sinnamohideen, and Demosthenis Teneketzis. Diagnosability of discrete-event systems. Transactions on Automatic Control, 40(9):1555–1575, 1995.
[5] Yannick Pencolé. Fault diagnosis in discrete-event systems: how to analyse algorithm performance? In Diagnostic Reasoning: Model Analysis and Performance, pages 19–25, Montpellier, France, 2012.
[6] Anika Schumann, Yannick Pencolé, and Sylvie Thiébaux. A spectrum of symbolic on-line diagnosis approaches. In 17th International Workshop on Principles of Diagnosis, pages 194–201, Nashville, TN, USA, 2007.
HyDiag: extended diagnosis and prognosis for hybrid systems

Elodie Chanthery(1,2), Yannick Pencolé(1), Pauline Ribot(1,3), Louise Travé-Massuyès(1)
(1) CNRS, LAAS, 7 avenue du colonel Roche, F-31400 Toulouse, France
e-mail: [firstname.name]@laas.fr
(2) Univ de Toulouse, INSA, LAAS, F-31400 Toulouse, France
(3) Univ de Toulouse, UPS, LAAS, F-31400 Toulouse, France

Abstract
HyDiag is a software developed in Matlab by the DISCO team at LAAS-CNRS. It is currently designed to simulate, diagnose and prognose hybrid systems using model-based techniques. An extension to active diagnosis is also provided. This paper presents the native HyDiag tool and its different extensions to prognosis and active diagnosis. Some results on an academic example are given.

1 Introduction
HyDiag is a software developed in Matlab, with Simulink. The development of this software was initiated in the DISCO team with contributions about the diagnosis of hybrid systems [1]. It has undergone many changes and is currently designed to simulate, diagnose and prognose hybrid systems using model-based techniques [2; 3; 4]. An extension to active diagnosis has also been realized [5; 6]. This article presents the native HyDiag tool and its different extensions to prognosis and active diagnosis.

Section 2 recalls the hybrid formalism used by HyDiag. Section 3 presents the native HyDiag tool that simulates and diagnoses hybrid systems. Section 4 explains how HyDiag has been extended in HyDiagPro to prognose and diagnose hybrid systems. Section 5 presents the extension to active diagnosis. Experimental results of HyDiag and its extension HyDiagPro are finally presented in Section 6.

2 Hybrid Model for Diagnosis
HyDiag deals with hybrid systems defined in a monolithic way. Such a system must be modeled by a hybrid automaton [7]. Formally, a hybrid automaton is defined as a tuple S = (ζ, Q, Σ, T, C, (q0, ζ0)) where:
• ζ is a finite set of continuous variables that comprises input variables u(t) ∈ R^nu, state variables x(t) ∈ R^nx, and output variables y(t) ∈ R^ny;
• Q is a finite set of discrete system states;
• Σ is a finite set of events;
• T ⊆ Q × Σ → Q is the partial transition function between states;
• C = ∪_{q∈Q} Cq is the set of system constraints linking continuous variables;
• (ζ0, q0) ∈ ζ × Q is the initial condition.

Each state q ∈ Q represents a behavioural mode that is characterized by a set of constraints Cq that model the linear continuous dynamics (defined by their representations in the state space as a set of differential and algebraic equations). A behavioural mode can be nominal or faulty (anticipated faults). The unknown mode can be added to model all the non-anticipated faulty situations. The discrete part of the hybrid automaton is given by M = (Q, Σ, T, q0), which is called the underlying discrete event system (DES). Σ is the set of events that correspond to discrete control inputs, autonomous mode changes and fault occurrences. The occurrence of an anticipated fault is modelled by a discrete event fi ∈ Σf ⊆ Σuo, where Σuo ⊆ Σ is the set of unobservable events. Σo ⊆ Σ is the set of observable events. Transitions of T model the instantaneous changes of behavioural modes. The continuous behaviour of the hybrid system is modelled by the so-called underlying multimode system Ξ = (ζ, Q, C, ζ0). The set of directly measured variables is denoted by ζOBS ⊆ ζ.

An example of a hybrid system modeled by a hybrid automaton is shown in Figure 1. Each mode qi is characterized by state matrices Ai, Bi, Ci and Di, with dynamics xi(n+1) = Ai xi(n) + Bi u(n) and Yi(n) = Ci xi(n) + Di u(n).

Figure 1: Example of a hybrid system

3 Overview of the native HyDiag diagnoser
The method developed in [1] for diagnosing faults on-line in hybrid systems can be seen as interlinking a standard diagnosis method for continuous systems, namely the parity space method, and a standard diagnosis method for DES, namely the diagnoser method [8].
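The per-mode discrete-time dynamics xi(n+1) = Ai xi(n) + Bi u(n), Yi(n) = Ci xi(n) + Di u(n) can be sketched as follows. This is a minimal Python illustration, not HyDiag code (HyDiag is a Matlab/Simulink tool): the two modes 'q1', 'q2' and their scalar matrices are invented for the example.

```python
# Hedged sketch: simulation of a two-mode hybrid system with scalar
# per-mode dynamics. All numerical values are assumptions.

modes = {
    'q1': {'A': 0.9, 'B': 1.0, 'C': 1.0, 'D': 0.0},   # hypothetical mode
    'q2': {'A': 0.5, 'B': 0.2, 'C': 1.0, 'D': 0.0},   # hypothetical mode
}

def simulate(schedule, u, x0=0.0):
    """schedule: list of mode names, one per step; u: input per step.
    Returns the output sequence y(0), ..., y(N-1)."""
    x, ys = x0, []
    for q, un in zip(schedule, u):
        m = modes[q]
        ys.append(m['C'] * x + m['D'] * un)   # y(n) = C x(n) + D u(n)
        x = m['A'] * x + m['B'] * un          # x(n+1) = A x(n) + B u(n)
    return ys

ys = simulate(['q1', 'q1', 'q2'], [1.0, 1.0, 1.0])
# state evolves 0 -> 1.0 -> 1.9; each output reads the state before update
```

A mode switch here simply selects another set of matrices at the next step, which mirrors the instantaneous mode changes modelled by the transitions of T.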
3.1 How to use HyDiag?

Step 1: hybrid model edition
HyDiag allows the user to edit the modes of a hybrid automaton S as illustrated in Figure 1. To model the system, the user must first provide in the Graphical User Interface of the HyDiag software the following information: the number of modes, the number of discrete events that can be observable or unobservable, and the sampling period used for the underlying multimode system (defined by the set of state matrices of the state-space representation of each mode). There are optional parameters that help initialize the mode matrices automatically before editing them: the number of inputs of the continuous dynamics, the number of outputs of the continuous dynamics, and the dimensions of each matrix A. The number of inputs (resp. outputs) must be the same for all the modes.

The simulator of the edited model has no restrictions on the number of modes or the order of the continuous dynamics; it is generically designed. Online computations are performed using Matlab/Simulink. Results provided by Matlab can be reused if a special need arises. Figure 2 shows an overview of the software interface.

Figure 2: HyDiag Graphical User Interface

Step 2: building the diagnoser
HyDiag automatically computes the analytical redundancy relations (ARRs) by using the parity space approach [9]. Details of this computation can be found in [10].

The idea of HyDiag is to capture both the continuous dynamics and the discrete dynamics within the same mathematical object. To do so, the discrete part of the hybrid system M = (Q, Σ, T, q0) is enriched with specific observable events that are generated from continuous information. The resulting automaton is called the Behaviour Automaton (BA) of the hybrid system. HyDiag then builds the diagnoser of the Behaviour Automaton (see [8]) by using the DiaDes (http://homepages.laas.fr/ypencole/DiaDes/) software, also developed within the DISCO team at LAAS-CNRS (see an example of diagnoser in Figure 7).

Step 3: system simulation and diagnosis
Given the built hybrid diagnoser, HyDiag then loads a set of timed observations produced by the system and provides at each observation time an update of the diagnosis of the system by triggering the current transition of the hybrid diagnoser that matches the current observation. It is possible to define in HyDiag a simulation scenario for the modeled system with a duration and a time sample defined by the user.

3.2 Software architecture with extensions
The general architecture of HyDiag and its two extensions (see the next sections for their description) is presented in Figure 3. Ellipses represent the objects handled by the software, rectangles with rounded edges depict HyDiag functions and rectangles with straight edges correspond to external DiaDes packages. The behaviour automaton is at the heart of the architecture, as HyDiag and both its extensions rely on it to perform diagnosis, active diagnosis and prognosis.

Figure 3: HyDiag architecture with its extensions HyDiagPro and ActHyDiag.

4 HyDiagPro: an extension for Prognosis
HyDiag has been extended in order to provide a prognosis functionality to the software [4]. The prognosis function computes (1) the fault probability of the system in each behavioural mode, (2) the future fault sequence that will lead to the system failure, and (3) the Remaining Useful Life (RUL) of the system.

In HyDiagPro, the initial hybrid model is enriched by adding for each behavioural mode a set of aging laws: S+ = (ζ, Q, Σ, T, C, F, (q0, ζ0)) where F = {F^q, q ∈ Q} and F^q is a set of aging laws, one for each anticipated fault f ∈ Σf in mode q. The aging modeling framework adopted in HyDiagPro is based on the Weibull probabilistic model [11] (see more details in [4]). The Weibull fault probability density function W(t, β_j^q, η_j^q, γ_j^q) gives at any time the probability that the fault fj occurs in the system mode q. Weibull parameters β_j^q and η_j^q are fixed by the system mode q and characterise the degradation in mode q that leads to the fault fj. Parameter γ_j^q is set at runtime to memorize the overall degradation evolution of the system accumulated in the past modes [11].

The prognoser uses the aging laws in S+ to predict fault occurrences (see Figure 3). The prognoser uses the current diagnosis result to update these aging laws on-line (the parameters γ_j^q) according to the operation time in each behavioural mode. For each new result of diagnosis, the prognosis function computes the most likely sequence of dated faults that leads to the system failure. From this sequence, the system RUL is estimated [4].
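The shape of such a Weibull aging law can be sketched in a few lines. The following Python snippet is purely illustrative (it is not HyDiagPro's code): the density uses the standard shape/scale/location parameterization, and the "predicted fault date" rule (the date at which the cumulative fault probability reaches a threshold p) is our own simplification of the prognosis step.

```python
# Hedged sketch of a Weibull aging law W(t, beta, eta, gamma) and a
# simple fault-date prediction rule. The threshold rule is an assumption,
# not the paper's exact criterion.
import math

def weibull_pdf(t, beta, eta, gamma=0.0):
    """Weibull density with shape beta, scale eta and location gamma
    (gamma shifts the law to account for past accumulated degradation)."""
    if t < gamma:
        return 0.0
    z = (t - gamma) / eta
    return (beta / eta) * z ** (beta - 1) * math.exp(-(z ** beta))

def predicted_fault_date(beta, eta, gamma=0.0, p=0.5):
    """Date at which the cumulative fault probability reaches p
    (inverse of the Weibull CDF)."""
    return gamma + eta * (-math.log(1.0 - p)) ** (1.0 / beta)

# Mode q1 parameters of fault f1 from Table 1: beta = 1.5, eta = 3000 h.
t_now = 1000.0
t_pred = predicted_fault_date(1.5, 3000.0)  # about 2350 h
rul = t_pred - t_now  # remaining useful life if the mode never changed
```

In the actual tool, the mode changes reported by the diagnoser update γ, so the predicted dates move over time, as visible in Figure 6.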
5 ActHyDiag: Active Diagnosis
The second extension of HyDiag provides an active diagnosis functionality to the software (see Figure 3). The inputs are the same as for HyDiag, but an additional file indicates the events of S that are actions, as well as their respective costs. Based on the behaviour automaton, we compute a set of specialised active diagnosers (one per fault): such a diagnoser is able to predict, based on the behaviour automaton, whether a fault can be diagnosed with certainty by applying an action plan from a given ambiguous situation [6]. From these diagnosers, we also extract a planning domain as an AND/OR graph.

At runtime, when HyDiag is diagnosing, the diagnosis might be ambiguous. An active diagnosis session can be launched as soon as a specialised active diagnoser determines that the current faulty situation is discriminable by applying some actions. If the active diagnosis session is launched, an AO* algorithm starts and computes a conditional plan from the AND/OR graph that optimises an action cost criterion. It is important to note that, in the case of a system with continuous dynamics, only discrete actions are contained in the active diagnosis plan issued by ActHyDiag. In particular, it is assumed that if it is necessary to guide the system towards a value of the continuous variables, the synthesis of control laws must be performed elsewhere.

6 HyDiag/HyDiagPro Demonstration

Water tank system model
HyDiagPro has been tested on a water tank system (Figure 4) composed of one tank with two hydraulic pumps (P1, P2). Water flows through a valve at the bottom of the tank depending on the system control. Three sensors (h1, h2, hmax) detect the water level and allow setting the control of the pumps (on/off). It is assumed that the pumps may fail only if they are on. The discrete model of the water tank and the controls of the pumps are given in Figure 5. Discrete events in Σ = {h1, h2s, h2i, hmax, f1, f2} allow the system to switch into different modes. Observable events are Σo = {h1, h2s, h2i, hmax}. Two faults that correspond to the pump failures are anticipated, Σf = {f1, f2}, and are not observable. The Weibull parameter values of the aging models F = {F^qi} are reported in Table 1.

Figure 4: Water tank system

Figure 5: Water tank DES model
mode | Pump 1 | Pump 2
  1  |  ON    |  ON
  2  |  ON    |  OFF
  3  |  OFF   |  OFF
  4  |  Fail  |  ON
  5  |  ON    |  Fail
  6  |  Fail  |  OFF
  7  |  OFF   |  Fail
  8  |  Fail  |  Fail

Table 1: Weibull parameters of aging models
Aging laws        | β   | η    | Aging laws        | β   | η
F^q1: f1^q1       | 1.5 | 3000 | F^q2: f1^q2       | 2   | 3000
      f2^q1       | 1.5 | 4000 |       f2^q2       | 1   | 7000
F^q3: f1^q3       | 1   | 8000 | F^q4: f1^q4       | NaN | NaN
      f2^q3       | 1   | 7000 |       f2^q4       | 2   | 4000
F^q5: f1^q5       | 2   | 3000 | F^q6: f1^q6       | NaN | NaN
      f2^q5       | NaN | NaN  |       f2^q6       | 1   | 7000
F^q7: f1^q7       | 1   | 8000 | F^q8: f1^q8       | NaN | NaN
      f2^q7       | NaN | NaN  |       f2^q8       | NaN | NaN

The underlying continuous behaviour of every discrete mode qi for i ∈ {1..8} is represented by the same state space:

X(k + 1) = AX(k) + BU(k)
Y(k) = CX(k) + DU(k)        (1)

where the state variable X is the water level in the tank, the continuous inputs U are the flows delivered by the pumps P1, P2 and the flow going through the valve, A = (1), B is the row vector whose entries are of the form ei·Te/S, with Te the sample time, S the tank base area and ei = 1 (resp. 0) if the corresponding pump is turned on (resp. turned off), C = (1) and D = (0 0 0).

HyDiag results
Figure 6 presents the set of results obtained by HyDiag and HyDiagPro on the following scenario. The time horizon is fixed at Tsim = 4000h, the sampling period is Ts = 36s and the filter sensitivity for the diagnosis is set to Tfilter = 3min. The residual threshold is 10^−12. The scenario involves a varying use of water (max flow rate = 1200L/h) depending on user needs during 4000h. Pumps are automatically controlled to satisfy the specifications indicated above. The flow rates of P1 and P2 are respectively 750L/h and 500L/h.

The diagnoser computed by HyDiag is given in Figure 7. Each state of the diagnoser indicates the belief state in the model enriched by the abstraction of the continuous part of the system, labelled with the faults that have occurred on the system. This label is empty in case of nominal mode. In the scenario, fault f1 was injected after 3500h and fault f2 was not injected.
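As a rough numerical illustration of equation (1) for the tank (our own sketch, not HyDiag's model file): the level integrates the pump inflows minus the valve outflow, scaled by Te/S. The tank area S, the valve flow rate and the sign convention for the valve are assumptions; the pump flow rates and Te = 36 s come from the scenario above.

```python
# Hedged sketch of the tank level update X(k+1) = X(k) + (Te/S)*net_flow.
Te = 36.0              # sampling period [s], as in the scenario (Ts = 36 s)
S = 1.0                # tank base area [m^2] -- assumed value
Q1, Q2 = 750.0, 500.0  # pump flow rates [L/h] (from the scenario)
QV = 600.0             # valve outflow [L/h] -- assumed value

def to_m3s(liters_per_hour):
    """Convert a flow from L/h to m^3/s."""
    return liters_per_hour / 1000.0 / 3600.0

def step(x, e1, e2):
    """One step of equation (1): ei = 1 when pump i is ON, 0 otherwise;
    the valve flow is counted negatively (sign convention assumed)."""
    net = e1 * to_m3s(Q1) + e2 * to_m3s(Q2) - to_m3s(QV)
    return x + (Te / S) * net

x = 0.5                # initial level [m] -- assumed
for _ in range(10):    # both pumps ON for 10 samples
    x = step(x, 1, 1)
# net inflow 650 L/h -> the level rises by 10 * 36 * (650/3.6e6) = 0.065 m
```

Switching a pump off (ei = 0) or marking it failed simply changes the corresponding B entry, which is why all eight discrete modes share the same state-space structure.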
Figure 6: Scenario: Diagnoser belief state (left), Prognosis results of degradations df1 and df2 (middle), System RUL (right).
Figure 7: Diagnoser state tracker

The left-hand side of Figure 6 shows the diagnoser belief state just before and after the fault f1 occurrence. The results are consistent with the scenario: before 3500h, the belief states of the diagnoser are always tagged with a nominal diagnosis. After 3500h, all the states are tagged with f1.

The middle of Figure 6 illustrates the predicted dates of fault occurrence (df1 and df2). At the beginning of the process, the prognosis result is Π0 = ({f1, 4120}, {f2, 5105}). It can be noted that the predicted dates df1 and df2 of f1 and f2 globally increase. Indeed, the system oscillates between stressful modes and less stressful modes. Put simply, we can consider that in some modes the system does not degrade, so the predicted dates of f1 and f2 are postponed. Before 3500h, the predicted date of f1 is lower than that of f2. After 3500h, the predicted date of f2 is updated, knowing that the system is in a degraded mode. Finally, the prognosis result is Π3501 = ({f2, 5541}). Figure 6 also shows the evolution of the RUL of the system. At t = 3501, as the fault f2 is estimated to occur at t = 5541, the system RUL at t = 3501 is 5541 − 3501 = 2040h.

7 Conclusion
HyDiag is a software developed in Matlab, with Simulink, by the DISCO team at LAAS-CNRS. This tool has been extended into HyDiagPro to simulate, diagnose and prognose hybrid systems using model-based techniques. Some results on an academic example are exposed in the paper. An extension to active diagnosis is also presented. The active diagnosis algorithm is currently being tested on a concrete industrial case. HyDiag and its user manual will soon be available on the LAAS website.

References
[1] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Hybrid systems diagnosis by coupling continuous and discrete event techniques. In IFAC World Congress, 2008.
[2] P. Ribot, Y. Pencolé, and M. Combacau. Diagnosis and prognosis for the maintenance of complex systems. In IEEE International Conference on Systems, Man, and Cybernetics, 2009.
[3] E. Chanthery and P. Ribot. An integrated framework for diagnosis and prognosis of hybrid systems. In 3rd Workshop on Hybrid Autonomous Systems (HAS), 2013.
[4] S. Zabi, P. Ribot, and E. Chanthery. Health monitoring and prognosis of hybrid systems. In Annual Conference of the Prognostics and Health Management Society, 2013.
[5] M. Bayoudh, L. Travé-Massuyès, and X. Olive. Active diagnosis of hybrid systems guided by diagnosability properties. In 7th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, 2009.
[6] E. Chanthery, Y. Pencolé, and N. Bussac. An AO*-like algorithm implementation for active diagnosis. In 10th International Symposium on Artificial Intelligence, Robotics and Automation in Space (i-SAIRAS), 2010.
[7] T. Henzinger. The theory of hybrid automata. In Proceedings of the 11th Annual IEEE Symposium on Logic in Computer Science, pages 278–292, 1996.
[8] M. Sampath, R. Sengupta, S. Lafortune, K. Sinnamohideen, and D. Teneketzis. Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40:1555–1575, 1995.
[9] M. Staroswiecki and G. Comtet-Varga. Analytical redundancy relations for fault detection and isolation in algebraic dynamic systems. Automatica, 37(5):687–699, 2001.
[10] M. Maiga, E. Chanthery, and L. Travé-Massuyès. Hybrid system diagnosis: test of the diagnoser HyDiag on a benchmark of the international diagnostic competition DXC 2011. In 8th IFAC Symposium on Fault Detection, Supervision and Safety of Technical Processes, 2012.
[11] P. Ribot and E. Bensana. A generic adaptative prognostic function for heterogeneous multi-component systems: application to helicopters. In European Safety & Reliability Conference, Troyes, France, September 18–22, 2011.
Author index

A
Abdelkrim Mohamed Naceur 261
Abreu Rui 193, 209
Agudelo Carlos 241
Alonso-Gonzalez Carlos 59
Archer Dave 193
Aubrun Christophe 261

B
Berdjag Denis 159
Biswas Gautam 27, 75, 185
Blesa Joaquim 67
Bobrow Daniel 193
Boussaid Boumedyen 261
Bregon Anibal 59, 201
Bunte Andreas 185
Burke David 193

C
Cabassud Michel 253
Carrera Rolando 145
Cayetano Raúl 145
Chanthery Elodie 281
Chouiref Houda 261
Christopher Cody 119
Corruble Vincent 35
Cossé Ronan 159

D
Dague Philippe 51
Dahhou Boutaïeb 253
Daigle Matthew 201
De Kleer Johan 193, 225
Duvivier David 159

E
Eickmeyer Jens 43, 217
El Fallah Seghrouchni Amal 35
Eldardiry Hoda 193, 209

F
Feldman Alexander 127, 193
Feng Wenquan 27
Filasova Anna 235
Fourlas George 177

G
Ganguli Anurag 225
Gaurel Christian 159
Givehchi Omid 43
Gomez Pablo 11
Grastien Alban 105, 119
Grigoleit Florian 91
Grill Tanja 11
Guan Xiumei 27

H
Hanley John 193
He Zhangming 137
Herpson Cédric 35
Honda Tomonori 193, 209, 225

I
Ibrahim Hassan 51
Iverson Jonathan 209

J
Jannach Dietmar 3
Jauberthie Carine 67, 83
Jimenez Fernando 241
Jung Daniel 75

K
Kalech Meir 113, 247
Karras George 177
Khorasgani Hamed 75, 185
Kinnebrew John 185
Koitz Roxane 167
Krokavec Dusan 235
Kyriakopoulos Kostas 177

L
Le Gall Françoise 67
Le Lann Marie-Véronique 19, 269
Li Peng 43
Li Shuxing 137
Li Ze-Tao 253
Liao Linxia 209
Liscinsky Pavol 235

M
Maier Alexander 217
Matei Ion 225
Mishali Amir 247
Monrousseau Thomas 269
Mühlbacher Clemens 153

N
Niggemann Oliver 43, 185, 217

O
Ocampo-Martinez Carlos 99

P
Pavel Radu 209
Pencolé Yannick 83, 277, 281
Perez Alexandre 193
Pethig Florian 43
Piechowiak Sylvain 159
Pons Renaud 83
Provan Gregory 127
Puig Vicenç 99, 261
Pulido Belarmino 59

R
Ribot Pauline 83, 281
Roux Elisa 19
Roychoudhury Indranil 201

S
Saha Bhaskar 209
Santos Simón Jorge 153
Schmitz Thomas 3
Serbak Vladimir 235
Shchekotykhin Kostyantyn 3
Shinitzky Hilla 113
Simon Laurent 51
Steinbauer Gerald 153
Stern Roni 113, 247
Struss Peter 91
Subias Audine 241

T

V
Vasquez John William 241
Verde Cristina 145
Volgmann Sören 185

W
Wang Jiongqi 137
Wotawa Franz 167

Z
Zhang Mei 253
Zhao Hongbo 27
Travé-Massuyès Louise 19, 67, 83, Zhou Gan 27
241, 269, 281 Zhou Haiyin 137
Traxler Patrick 11