A Bayesian Approach to Learning in Fault Isolation

Hannes Wettig, Helsinki Institute for Information Technology, Finland, wettig@hiit.fi
Anna Pernestål, Dept. of Electrical Engineering, Linköping University, Sweden, annap@isy.liu.se
Tomi Silander, Helsinki Institute for Information Technology, Finland, tsilande@hiit.fi
Mattias Nyberg, Scania CV AB, Södertälje, Sweden, mattias.nyberg@scania.com

Abstract

Fault isolation is the art of localizing faults in a process, given observations from it. To do this, a model describing the relation between faults and observations is needed. In this paper we focus on learning such models both from training data and from prior knowledge. There are several challenges in learning fault isolators. The number of data, as well as the available computing resources, are often limited, and there may be previously unobserved fault patterns. To meet these challenges we take on a Bayesian approach. We compare five different methods for learning in fault isolation, and evaluate their performance on a real fault isolation problem: the diagnosis of an automotive engine.

1 INTRODUCTION

We consider the problem of fault isolation, i.e. the problem of localizing faults that are present in a process given observations from this process. To do this, a model of the relations between observations and faults is needed. In the current work we investigate and compare different methods for learning such models from training data and prior knowledge.

We are motivated by the problem of fault isolation in an automotive engine, and the learning methods are evaluated using experimental training data and evaluation data from real driving situations. In engine fault isolation there may be several hundreds of faults and observations. There will be fault patterns, i.e. co-occurring faults, from which there are no training data. Furthermore, training data is typically experimental, obtained by implementing faults, running the process, and collecting observations. On the other hand, there is often engineering knowledge available about the process. The engineering knowledge can, for example, be used to determine the structure of dependencies between faults and observations. This kind of knowledge is often the only basis in previous algorithms for fault isolation [6, 12, 19].

Because there are fault patterns not represented in the training data, frequentist and purely data-based methods are bound to fail. To meet these challenges we use a Bayesian approach to learning in fault isolation. We consider five different methods of learning a model from training data, all of which are previously present in the literature in different forms. We tailor these methods to incorporate the available background information. The methods we consider are Direct Inference (DI), Logistic Regression (LogR), Linear Regression (LinR), Naive Bayes (NB) and general Bayesian Networks (BN).

The main contributions of the current work are the investigation of Bayesian learning methods and regression models for fault isolation by comparing the five methods mentioned above, the application and evaluation of the methods on real-world data, and the combination of data-driven learning and prior knowledge within these methods. In order to do this investigation, we first discuss the characteristics of the fault isolation problem in terms of probability theory, together with performance measures that are meaningful for fault isolation. We then show how the five methods can be adapted to the isolation problem, and apply them to the task of fault isolation in an automotive diesel engine. Finally, we compare the five methods and discuss their advantages and drawbacks.
Bayesian methods for fault isolation have been studied previously in the literature. In these previous works it is generally assumed that the model is given [26, 15], or that it can be derived from a physical model without using training data [17, 25]. In the current work, on the other hand, we focus on learning the models. Previous works on learning models for fault isolation typically rely on pattern recognition methods, described e.g. in [1, 3]; examples of such methods are presented in [14]. Pattern recognition methods are applicable if there is sufficient training data available. Unfortunately, this is rarely the case in fault isolation. In [20] the problem of learning with missing fault patterns is discussed, and training data is combined with fundamental methods for fault isolation described in [2, 22]. This approach is referred to as Direct Inference in the current work, and compared to the other four methods for learning.

The paper is structured as follows. We introduce notation and formulate the diagnosis problem in Section 2, where we also define relevant performance measures. In Section 3 we briefly describe the five methods used, and in particular how they are applied to the diagnosis problem, before we perform the evaluating experiments and compare the results obtained in Section 4. Finally, in Section 5 we conclude the paper by summarizing our results and discussing future work directions.

2 PROBLEM FORMULATION

Before going into the details of each of the learning methods we introduce some notation and discuss the characteristics of the fault isolation problem. Then we carefully state the problem at hand and define performance measures.

2.1 BAYESIAN FAULT ISOLATION

The fault isolation problem can be formulated as a prediction problem, where the task is to determine the fault(s) present in a system, given a set of observations from the system. Let the faults be represented by the binary variables Y = (Y_1, ..., Y_K), and let the observations from the system be represented by the variables X = (X_1, ..., X_L), where each X_l is discrete or continuous. Generally, we use upper case letters to denote variables and lower case letters to denote their values. Boldface letters denote vectors. We write p(X = x) (or simply p(x)) to denote either probabilities or probability distributions, both in the continuous and in the discrete case; the meaning will be clear from the context.

We are given a set of training data D, consisting of samples (y^n, x^n), n = 1, ..., N_D, pairs of fault and observation variables. The training data is collected by implementing faults and then collecting observations, meaning that training data is experimental. To evaluate the system we use a set E consisting of N_E samples. The evaluation data is collected by running the system, meaning that it is observational. Furthermore, we assume that the fault isolation algorithm is triggered by a fault detector telling us there must be at least one fault present in the process.

The structure of dependencies between the faults and observations has three basic properties, illustrated in the example Bayesian network of Figure 1.

[Figure 1: A Bayesian network describing a typical fault isolation problem. Fault variables Y_1, ..., Y_5 are root nodes with arcs to observation variables X_1, ..., X_8.]

The first property is that faults are assumed to be a priori independent, i.e. that

p(\mathbf{y}) = \prod_{k=1}^{K} p(y_k \mid y_1, \ldots, y_{k-1}) \approx \prod_{k=1}^{K} p(y_k),    (1)

meaning that faults cannot cause other faults to occur. Although not necessary for the methods in the current work, this is a standard assumption in many fault isolation algorithms [6], and it simplifies the reasoning in the following sections.

Second, faults may causally affect one or several of the observation variables, introducing dependencies between faults and observations. A dependency between fault variable Y_k and observation variable X_l means that the fault may be visible in the observation.

The third property is that an observation variable X_l may be dependent on other observation variables. Dependencies between observation variables may arise for several reasons. For example, they can be caused by unobserved factors, such as humidity, driver behavior, and the operation point of the process. These unobserved factors could be modeled using hidden nodes, but since they are numerous and unknown they are here simply modeled with dependencies between observation variables. This is discussed more carefully in [21].
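As a concrete illustration of assumption (1), the following sketch (our own; the helper name is hypothetical, and the prior values are simply borrowed from Table 2 later in the paper) evaluates the factored prior of a fault pattern:

```python
import numpy as np

# Per-fault prior probabilities p(Y_k = 1); the values of Table 2 are
# reused here purely for illustration.
fault_priors = np.array([0.4, 0.13, 0.057, 0.13, 0.057])

def prior_of_pattern(y, priors=fault_priors):
    """Prior probability of a fault pattern y (binary vector) under the
    a priori independence assumption of equation (1)."""
    y = np.asarray(y)
    return float(np.prod(np.where(y == 1, priors, 1.0 - priors)))

# Example: the pattern where only fault y1 is present.
print(prior_of_pattern([1, 0, 0, 0, 0]))  # = 0.4 * 0.87 * 0.943 * 0.87 * 0.943
```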
We take a Bayesian viewpoint on fault isolation. The objective is to find the probability for each fault to be present given the current observation, the training data, and the prior knowledge I, i.e. to compute the probabilities p(y_k|x, D, I), k = 1, ..., K. The probability for each fault can be found by marginalizing over y_{-k} = (y_1, ..., y_{k-1}, y_{k+1}, ..., y_K),

p(y_k \mid \mathbf{x}, D, I) = \sum_{\mathbf{y}_{-k}} p(\mathbf{y}_{-k}, y_k \mid \mathbf{x}, D, I).    (2)

Note that (y_{-k}, y_k) = y, so (2) means that we seek the conditional distribution p(y|x, D, I). To simplify the notation we will omit the background information I in the equations.

Computing the conditional distribution p(y|x, D) is generally difficult. To approximate it we need a model M and a method for determining the parameters of this model.

2.2 PERFORMANCE MEASURES

To evaluate the different models used for Bayesian fault isolation, we use two performance measures: the log-score and the percentage of correct classification.

The log-score is a commonly used measure [1], given by

\mu(E, M) = \frac{1}{N_E} \sum_{j=1}^{N_E} \log p(\mathbf{y}^j \mid \mathbf{x}^j, M).    (3)

The scoring function μ measures two important properties of the fault isolation system: the ability to assign large probability mass to faults that are present, and the ability to assign small probability mass to faults that are not present. Furthermore, the log-score is a proper score. A proper score has the characteristic that it is maximized when the learned probability distribution corresponds to the empirically observed probabilities. In the fault isolation problem the conditional probabilities for faults are often combined with decision theoretic methods for troubleshooting [8], where optimal decision making requires conditional probabilities close to the generating distribution.

The second measure we use is not proper. It is closely related to the 0/1-loss used e.g. in pattern classification [1]. However, in case of multiple faults present it suffices to assign the highest probability to any of them. We define

\nu(E, M) = \#\{j : y^j_{\max}(\mathbf{x}^j, M) = 1\} / N_E,    (4)

where y^j_{\max}(\mathbf{x}^j, M) is the fault assigned highest probability by M given x^j. The ν-score reflects the performance of the fault isolation system combined with the simple troubleshooting strategy "check the most probable fault first".
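As a minimal sketch of how the two scores could be computed (our own illustration; it assumes the model's predictions have already been collected into arrays, and the function names are hypothetical):

```python
import numpy as np

def log_score(p_true_pattern):
    """Equation (3): mean log-probability p(y^j | x^j, M) that the model
    assigns to the true fault pattern of each evaluation sample."""
    return float(np.mean(np.log(p_true_pattern)))

def nu_score(fault_marginals, true_patterns):
    """Equation (4): fraction of samples for which the fault ranked most
    probable is among the faults actually present.
    fault_marginals: (N_E, K) array of p(y_k = 1 | x^j, M)
    true_patterns:   (N_E, K) binary array of true fault patterns"""
    most_probable = np.argmax(fault_marginals, axis=1)
    hits = true_patterns[np.arange(len(most_probable)), most_probable] == 1
    return float(np.mean(hits))
```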
3 MODELLING APPROACHES

In this section we briefly present the inference methods used to tackle the fault isolation problem. We carefully state all assumptions made, and describe the adjustments needed to apply each method to the diagnosis problem. We begin by describing two assumptions that need to be made for all methods except DI.

3.1 MODELLING ASSUMPTIONS

All the methods considered in this paper – with the exception of DI – build separate models for each fault and thus assume independence among the faults. A priori this corresponds to approximation (1). However, when we build separate models for each fault, we also make a stronger assumption, namely that the faults remain independent given the observations,

p(\mathbf{y} \mid \mathbf{x}) = \prod_{k=1}^{K} p(y_k \mid \mathbf{x}, y_1, \ldots, y_{k-1}) \approx \prod_{k=1}^{K} p(y_k \mid \mathbf{x}).    (5)

This approximation is (after applying Bayes' rule and canceling terms) equivalent to

\prod_{k=1}^{K} p(\mathbf{x} \mid y_k) \approx \prod_{k=1}^{K} p(\mathbf{x} \mid y_1, \ldots, y_k),    (6)

meaning that the observation x is dependent on each fault y_k, but this dependency is assumed to be independent of all other faults y_{k'}, k' ≠ k. In other words, we assume no "explaining away" [10]. Looking at Figure 1 we observe that this is indeed a strong assumption, since there are unshielded colliders (V-structures, "bastards": common children of non-connected nodes) of the faults present.

Assumption (5) is primarily made for technical reasons, in order to be able to build separate models for each fault. But often it is also the case (as in the application of Section 4) that there is training data only from single faults. This means we do not have any training data telling us about the joint effect of multiple faults.

Recall that it is known that there is at least one fault present when the fault isolator is employed, see Section 2.1. Therefore, instead of computing p(y|x), we seek

p(\mathbf{y} \mid \mathbf{x}, \textstyle\sum_k y_k > 0) = \frac{p(\mathbf{y} \mid \mathbf{x})}{1 - p(\mathbf{y} \equiv \mathbf{0} \mid \mathbf{x})}.    (7)

Unfortunately,

p(\mathbf{y} \mid \mathbf{x}, \textstyle\sum_k y_k > 0) \neq \prod_k p(y_k \mid \mathbf{x}, \textstyle\sum_k y_k > 0),    (8)

a fact which recouples the single-fault models introduced in (5). This fact is ignored during the learning phase, and the single-fault models are trained individually. We then apply (7) in the evaluation phase.
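The following sketch (our own illustration; it brute-forces all 2^K fault patterns, which is feasible for the five faults of Section 4) shows how per-fault models can be combined under (5), conditioned on at least one fault via (7), and marginalized back to per-fault probabilities via (2). Note that after this renormalization the marginals no longer factor, which is exactly inequality (8):

```python
import itertools
import numpy as np

def fault_marginals_given_detection(q):
    """q[k] = p(y_k = 1 | x) from the k-th single-fault model (5).
    Returns p(y_k = 1 | x, sum_k y_k > 0) via equations (7) and (2)."""
    K = len(q)
    q = np.asarray(q, dtype=float)
    norm = 1.0 - np.prod(1.0 - q)            # 1 - p(y = 0 | x), see (7)
    marginals = np.zeros(K)
    for y in itertools.product([0, 1], repeat=K):
        if sum(y) == 0:
            continue                          # the no-fault pattern is excluded
        p_y = np.prod([q[k] if y[k] else 1.0 - q[k] for k in range(K)])
        marginals[np.array(y) == 1] += p_y / norm
    return marginals
```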
3.2 DIRECT INFERENCE

Several previous fault isolation algorithms rely on prior knowledge about which observations may be affected by each fault [2, 22, 12]. Such information is typically expressed in a so-called Fault Signature Matrix (FSM). An example of an FSM is given in Table 1. In the FSM, a zero in position (k, l) means that fault Y_k can never affect observation X_l.

Table 1: An example of an FSM

        Y1  Y2  Y3
    X1   1   1   0
    X2   1   0   1

The direct inference method aims at combining the information given by the FSM with the available training data. Assume that observations are binary and that the background information I containing the FSM is given. Then, under certain assumptions it can be shown [20] that

p(y \mid \mathbf{x}, D) = \begin{cases} 0 & \mathbf{x} \in \gamma \\ \frac{n_{\mathbf{x}y} + \alpha_{\mathbf{x}y}}{N_y + A_y} \cdot \frac{p(y \mid I)}{\pi_0} & \text{otherwise}, \end{cases}    (9)

where π_0 is a normalization constant, n_{xy} is the count of training samples with fault y and observations x, α_{xy} is a parameter describing the prior belief in the observation x when the fault is y (a Dirichlet prior), N_y = \sum_{\mathbf{x}'} n_{\mathbf{x}'y}, and A_y = \sum_{\mathbf{x}'} \alpha_{\mathbf{x}'y}. The sets γ are determined by the background information, as described in [20].

The direct inference method is developed for sparse sets of training data, particularly when there is only training data from a subset of the fault patterns to isolate.

3.3 BAYESIAN NETWORKS

When using Bayesian networks for prediction, we seek the joint distribution p(y, x|θ), where θ are parameters describing the conditional probability distributions in the network. From the joint distribution, the conditional distribution for y can be computed. We consider two types of Bayesian networks: Naive Bayes and general Bayesian networks.

3.3.1 Naive Bayes

The Naive Bayes classifier assumes that the observations are independent given the fault. Naive Bayes is one of the standard methods for Bayesian prediction and often performs surprisingly well [3, 23]. However, due to the erroneous independence assumptions it is poorly calibrated when there are strong dependencies between the observations. To alleviate this problem, we apply variable selection according to an internal leave-one-out scoring function:

S(V) = \frac{1}{N_D} \sum_{n=1}^{N_D} \log P(y_k^n \mid \mathbf{x}^n, V, D \setminus \{(\mathbf{y}^n, \mathbf{x}^n)\}, \alpha),    (10)

where V ⊂ X is the variable set under consideration and α is the Dirichlet hyper-parameter for the NB model.

3.3.2 General Bayesian Network

Since it is known that the faults causally precede the observations, and since the observations are known to be dependent given the faults, a natural step forward from the Naive Bayes structure is a general Bayesian network. In the network we constrain the fault to be a root node, but otherwise leave the structure unconstrained. One such network was learned for each fault using the BDe score (with an equivalent sample size parameter of 1.0). For small systems (< 30 variables) learning can be performed using the exact algorithm in [27], while for larger systems approximate methods, e.g. [9], can be used.

3.4 REGRESSION

Fault isolation is a discriminative task, where we are to predict the fault vector y given the observations x, i.e. to estimate the conditional likelihood

p(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{p(\mathbf{y}, \mathbf{x} \mid \theta)}{\sum_{\mathbf{y}} p(\mathbf{y}, \mathbf{x} \mid \theta)}.    (11)

It is well known [18, 11] that in such a case it can be of great benefit to employ a discriminative learning method that learns only the probabilities asked for, instead of spending training data on learning the joint data likelihood as in the Bayesian network methods of Section 3.3. Regression models form a family of such methods.

3.4.1 Linear Regression

The most straightforward regression method is linear regression, where each fault variable is assumed to be a linear combination of the observations plus a Gaussian noise term,

y_k = \mathbf{w}_k^T \mathbf{x} + w_{k0} + \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, \sigma).

Here w_k, w_{k0}, and σ are parameters to be determined. This gives the probability distribution

p(y_k \mid \mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{(\mathbf{w}_k^T \mathbf{x} + w_{k0} - y_k)^2}{2\sigma^2}\right),    (12)

where Z is a normalization constant. To determine the parameters we use the standard methods described for example in [1]:

\mathbf{w}^* = \arg\min_{\mathbf{w}} \left(-\sum_{n=1}^{N_D} \log p(y_k^n \mid \mathbf{x}^n, \mathbf{w})\right) = \arg\min_{\mathbf{w}} \sum_{n=1}^{N_D} (\mathbf{w}_k^T \mathbf{x}^n + w_{k0} - y_k^n)^2.

When the parameters w* are known, the parameter σ can also be computed. The normalization constant in (12) is given by Z = \exp(-((\mathbf{w}^*)_k^T \mathbf{x} + w_{k0}^* - 1)^2 / 2\sigma^2) + \exp(-((\mathbf{w}^*)_k^T \mathbf{x} + w_{k0}^* - 0)^2 / 2\sigma^2).
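A sketch of this linear-regression fault model (our own illustration, using ordinary least squares for w* and the two-point normalization of Z given above; the function names are hypothetical):

```python
import numpy as np

def fit_linear_fault_model(X, y):
    """Least-squares estimate of (w_k, w_k0) for one fault, as in the
    arg-min above; X is (N_D, L) continuous data, y the binary fault column."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # append intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma = float(np.std(A @ w - y))               # noise level, given w*
    return w, sigma

def p_fault_given_x(x, w, sigma):
    """Equation (12), with Z summing the Gaussian kernel over y_k in {0, 1}."""
    m = float(np.dot(w[:-1], x) + w[-1])
    g = lambda y: np.exp(-(m - y) ** 2 / (2.0 * sigma ** 2))
    return g(1.0) / (g(0.0) + g(1.0))              # p(y_k = 1 | x)
```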
3.4.2 Logistic Regression

Learning parameters to maximize (11) for a Bayesian network B is known to be equivalent to logistic regression, under the condition that no child of the class can be a "bastard", a common child of two variables that are not interconnected directly. More formal definitions and proofs can be found in [24]. In our case, this implies approximation (5).

To start with, for each fault we learn a logistic regression model corresponding to a discriminative Naive Bayes classifier (other possible choices include tree-augmented Naive Bayes (TAN) [24, 5]).

We name the parameters of the logistic regression model α and β, such that the conditional likelihood is defined as

p(y_k = 1 \mid \mathbf{x}, \alpha, \beta) := \frac{\exp s(\mathbf{x}, \alpha, \beta)}{\exp s(\mathbf{x}, \alpha, \beta) + \exp(-s(\mathbf{x}, \alpha, \beta))},    (13)

where

s(\mathbf{x}, \alpha, \beta) := \alpha + \sum_{l=1}^{L} x_l \beta_l.    (14)

We also include a smoothing term c(α, β) in our objective function, which takes the place of a prior in the corresponding NB classifier. To unify its role across different observations, we first normalize our data by shifting and scaling such that for l = 1, ..., L

\sum_n x_l^n = 0 \quad \text{and} \quad \max_n |x_l^n| = 1.    (15)

Starting out from the uniform prior, we pretend to have seen one vector of each class at node Y_k and two vectors of each class with extreme values ±1 at each node X_l, with all other values zero (i.e. unobserved). This amounts to a smoothing term

c(\alpha, \beta) = c' - 2 \log(\exp(\alpha) + \exp(-\alpha)) - 4 \sum_{l=1}^{L} \log(\exp(\beta_l) + \exp(-\beta_l)),    (16)

where c' is an additive constant. However, we found this smoothing term problematic, since it is flat near zero: we never get any parameters exactly zero. But in logistic regression many small parameters can make a difference, even while they are only weakly supported. We therefore choose to replace log(exp(x) + exp(−x)) by |x|. This is a good approximation away from zero, but it forces unsupported parameters to zero, implicitly performing attribute selection.

For fault Y_k we search parameters so as to maximize

\log p(y_k \mid \mathbf{x}, \alpha, \beta) + c(\alpha, \beta) = \sum_{n=1}^{N_D} \log p(y_k^n \mid \mathbf{x}^n, \alpha, \beta) - 2|\alpha| - 4 \sum_{l=1}^{L} |\beta_l|.    (17)

We do this by simple line search, one parameter at a time (there are much faster optimization techniques, some of which are compared in [16], but for our purposes this did nicely).

Finally, we try a variant of this algorithm which weights the training vectors. We have prior knowledge about the probabilities p(y_k) with which to expect some fault y_k in the real-world setting or, in this case, the evaluation set. These probabilities differ from the relative frequencies observed in the training set. The idea is to weight the training vectors in the objective so as to focus the optimization on areas of the data space more likely to be seen later on. The corresponding objective for fault Y_k becomes

\sum_{n=1}^{N_D} w_k \log p(y_k^n \mid \mathbf{x}^n, \alpha, \beta) + c(\alpha, \beta),    (18)

where the weight w_k is the prior p(y_k) divided by the observed relative frequency \#\{n : y_k^n = y_k\} / N_D.
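To make the smoothed objective concrete, here is a sketch of (17)/(18) with the |·| smoothing term (our own illustration; the grid-based line search is a crude stand-in, since the paper does not spell out the exact search procedure, and the per-sample weight vector w implements the class-dependent weighting of (18)):

```python
import numpy as np

def objective(alpha, beta, X, y, w=1.0):
    """Weighted conditional log-likelihood of model (13)-(14) plus the
    |.|-smoothing term, i.e. equations (17)-(18). X is assumed to be
    normalized as in (15); w may be a scalar or a per-sample weight vector."""
    s = alpha + X @ beta
    log_lik = np.where(y == 1, s, -s) - np.log(np.exp(s) + np.exp(-s))
    return np.sum(w * log_lik) - 2.0 * abs(alpha) - 4.0 * np.sum(np.abs(beta))

def line_search_fit(X, y, w=1.0, sweeps=20, grid=np.linspace(-2.0, 2.0, 81)):
    """Maximize the objective one parameter at a time over a fixed grid."""
    alpha, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(sweeps):
        alpha = max(grid, key=lambda a: objective(a, beta, X, y, w))
        for l in range(len(beta)):
            def score(b, l=l):
                trial = beta.copy()
                trial[l] = b
                return objective(alpha, trial, X, y, w)
            beta[l] = max(grid, key=score)
    return alpha, beta
```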
4 EXPERIMENTS

To evaluate the different methods for learning fault isolation models, we apply them to the diagnosis of the gas flow in a 6-cylinder diesel engine in a Scania truck. In automotive engines, sensor faults are among the most common faults, and here we consider five faults that may appear in different sensors. The faults are listed together with their prior probabilities in Table 2.

Table 2: The faults considered

    Fault  Description           p(y_k)
    y1     exhaust gas pressure  0.4
    y2     intake pressure       0.13
    y3     intake air pressure   0.057
    y4     EGR valve position    0.13
    y5     mass flow             0.057

4.1 EXPERIMENTAL SETUP

For the gas flow of the diesel engine there is a physical model from which a set of 29 diagnostic tests is automatically generated using structural analysis [4, 13]. Each of the observations is constructed to be sensitive to a subset of the faults.

For training and evaluation data we use measurements from real operation of the truck, with faults implemented. The training data consists of 100 samples each from the five single faults. Evaluation data consists of data from the five single faults, but also of data from two multiple faults, y1&y2 and y1&y4. Evaluation data is observational and consists of 1000 samples, distributed roughly according to the prior probabilities in Table 2.

The data we consider is originally continuous, but all methods except the regression algorithms take discrete data as input. The data is therefore discretized in two different ways (a sketch is given at the end of this subsection): binary, with thresholds set such that all fault-free data is known to be contained in the same bin; and discretized using k-means clustering [7] with k = 4. DI is applied to the discrete data. NB and BN are run both on discrete and binary data. The regression methods LinR and LogR are applied to the continuous data.

As described in Section 3, the NB and DI algorithms perform best if not all observations are used. For both DI and NB we perform variable selection such that an internal log-score is maximized. For DI, the best result is obtained by using only six of the observations. In NB, between seven and 18 observations are used for each fault.
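The two discretization schemes described above could be sketched as follows (our own illustration; the fault-free reference data and the use of scikit-learn's k-means are assumptions, not specified by the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_discretize(col, fault_free_col):
    """Binary scheme: thresholds chosen so that all fault-free training
    data for this observation falls into one and the same bin."""
    lo, hi = fault_free_col.min(), fault_free_col.max()
    return ((col >= lo) & (col <= hi)).astype(int)

def kmeans_discretize(col, k=4, seed=0):
    """k-means scheme [7] with k = 4: each value is mapped to the index
    of its nearest cluster centre."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(col.reshape(-1, 1))
```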
4.2 RESULTS

In Table 3 the log-score (μ) and the percentage of correct classification (ν) are presented for the different methods. In addition, we report the number of parameters used by each predictor. This is relevant since, for on-board fault isolation, the computing and storage capacity is often limited. For comparison we also report the default predictor, which is obtained by simply using the prior probabilities given in Table 2.

Table 3: Comparison of the methods

    method        log-score  ν-score  #pars
    DI            -1.088     0.781    106
    NB-bin.       -1.340     0.748    293
    NB-disc.      -1.044     0.843    335
    BN-bin.       -1.297     0.782    287
    BN-disc.      -1.398     0.840    1136
    LinR          -1.839     0.834    150
    LogR          -1.071     0.829    46
    LogR+weights  -0.953     0.829    44
    default       -1.738     0.592    5

We observe that among the four best methods in Table 3, three are discriminative and learn the conditional distribution instead of the joint distribution. Furthermore, LogR with training sample weighting performs best on this data in the log-score sense, while using a small number of parameters. Surprisingly, the weighting trick has made quite a difference: LogR without weights is outperformed by NB-disc. NB performs better when it is fed with discretized observations instead of binary ones, while for BN the effect is reversed. Clearly the discretized data contain more information, but it seems that in more complex Bayesian networks the conditional probability tables easily grow too large. In DI, good results are obtained by exploiting prior knowledge stating that some faults never cause an observation to pass certain thresholds.

Measured by the ν-score, the relative differences between the methods become smaller. We observe that this score favors the regression models and the Bayesian methods using binary data. The reason for the good performance of the methods using binary data is the particular way of thresholding the data, such that all fault-free samples are contained in the same bin.

Table 4 compares the log-scores of the predictions given for the single faults by DI and LogR+weights. Note that because of inequality (8) the columns do not sum to the corresponding entries in Table 3.

Table 4: Comparison of DI and LogR on single faults

    fault  μ DI    μ LogR+w
    y1     -0.346  -0.385
    y2     -0.324  -0.287
    y3     -0.087  -0.008
    y4     -0.334  -0.294
    y5     -0.177  -0.133

Not surprisingly, both methods (as all others) have most trouble with faults y1, y2 and y4, the ones appearing simultaneously in evaluation data but not in training data. This gives evidence for explaining away being important in this problem. Figure 2, in which the probabilities for each fault using LogR+weights are plotted, shows this in more detail. In the figure we have ordered the evaluation data such that the rightmost samples have multiple faults, visualizing that the double faults are most difficult to predict.

[Figure 2: The predicted probabilities p(y_1|x^n), ..., p(y_5|x^n) given by LogR+w, one panel per fault, plotted over the 1000 evaluation samples n. Evaluation data is ordered after their fault patterns; the true fault is marked with a solid line.]

5 CONCLUSIONS

We have considered the problem of fault isolation in an automotive diesel engine, and have discussed the special characteristics of this problem. There is experimental training data available which is distributed differently from what we expect to see in the real-world setting. In particular, evaluation data consists partly of previously unseen fault patterns. In addition, there is prior knowledge available about which faults may affect each observation, as well as the knowledge that at least one fault is present.

We have studied different Bayesian and regression approaches to combining this by nature heterogeneous information into probability distributions for the faults conditioned on given observations. We have compared the performance of the methods using real-world data, and have found the discriminative logistic regression method to perform best. Among the best methods we have also found the naive Bayes classifier and the direct inference method.

One of the clearest implications of this work is that all methods have difficulties in handling unobserved fault patterns. Unfortunately, unobserved patterns are common in fault isolation, so this problem should be tackled in future work. All the methods used, except direct inference, ignore explaining away. However, the explaining-away effect can possibly be helpful when diagnosing unseen patterns. Furthermore, it is crucial to include background information in the learning phase whenever it is available.

In our work to come we will investigate models capable of both explaining away and taking prior knowledge into account, while providing an efficient inference procedure, as on-board computers offer very limited resources. We expect that further improvement of performance is possible.

References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] Johan de Kleer and Brian C. Williams. Diagnosis with behavioral modes. In Readings in Model-based Diagnosis, pages 124–130. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[3] Luc Devroye, Laszlo Györfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

[4] Henrik Einarsson and Gustav Arrhenius. Automatic design of diagnosis systems using consistency based residuals. Master's thesis, Uppsala University, 2004.

[5] Russel Greiner and Wei Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In 13th International Conference on Uncertainty in Artificial Intelligence, 2002.
[6] Walter Hamscher, Luca Console, and Johan de Kleer. Readings in Model-based Diagnosis. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[7] John A. Hartigan. Clustering Algorithms. Wiley, 1975.

[8] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–57, 1995.

[9] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

[10] Finn V. Jensen. Bayesian Networks. Springer-Verlag, New York, 2001.

[11] Petri Kontkanen, Petri Myllymäki, and Henry Tirri. Classifier learning with supervised marginal likelihood. In J. Breese and D. Koller, editors, Proceedings of the 17th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 277–284, 2001.

[12] Jozef Korbicz, Jan M. Koscielny, Zdzislaw Kowalczuk, and Wojciech Cholewa. Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer, Berlin, Germany, 2004.

[13] Mattias Krysander, Jan Åslund, and Mattias Nyberg. An efficient algorithm for finding minimal over-constrained sub-systems for model-based diagnosis. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 38(1):197–206, 2008.

[14] Gareth Lee, Parisa Bahri, Srinivas Shastri, and Anthony Zaknich. A multi-category decision support framework for the Tennessee Eastman problem. In Proceedings of the European Control Conference 2007, Greece, 2007.

[15] Uri Lerner, Ronald Parr, Daphne Koller, and Gautam Biswas. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, pages 531–537, 2000.

[16] Thomas P. Minka. A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research, 2003.

[17] Sriram Narasimhan and Gautam Biswas. Model-based diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 37(3):348–361, 2007.

[18] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14, 2002.

[19] Mattias Nyberg. Model-based diagnosis of an automotive engine using several types of fault models. IEEE Transactions on Control Systems Technology, 10(5):679–689, 2005.

[20] Anna Pernestål and Mattias Nyberg. Diagnosing known and unknown faults from incomplete data. In Proceedings of the European Control Conference, 2007.

[21] Anna Pernestål, Mattias Nyberg, and Bo Wahlberg. A Bayesian approach to fault isolation with application to diesel engine diagnosis. In Proceedings of the 17th International Workshop on Principles of Diagnosis (DX 06), pages 211–218, 2006.

[22] Raymond Reiter. A theory of diagnosis from first principles. In Readings in Model-based Diagnosis, pages 29–48. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[23] Irina Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.

[24] Teemu Roos, Hannes Wettig, Peter Grünwald, Petri Myllymäki, and Henry Tirri. On discriminative Bayesian network classifiers and logistic regression. Machine Learning, pages 267–296, 2005.

[25] Indranil Roychoudhury, Gautam Biswas, and Xenofon Koutsoukos. A Bayesian approach to efficient diagnosis of incipient faults. In Proceedings of the 17th International Workshop on Principles of Diagnosis (DX 06), pages 243–250, 2006.
[26] Matthew Schwall and Christian Gerdes. A probabilistic approach to residual processing for vehicle fault detection. In Proceedings of the 2002 ACC, pages 2552–2557, 2002.

[27] Tomi Silander and Petri Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in AI (UAI), 2006.