A Bayesian Approach to Learning in Fault Isolation

Hannes Wettig, Helsinki Institute for Information Technology, Finland, wettig@hiit.fi
Anna Pernestål, Dept. of Electrical Engineering, Linköping University, Sweden, annap@isy.liu.se
Tomi Silander, Helsinki Institute for Information Technology, Finland, tsilande@hiit.fi
Mattias Nyberg, Scania CV AB, Södertälje, Sweden, mattias.nyberg@scania.com

Abstract

Fault isolation is the art of localizing faults in a process, given observations from it. To do this, a model describing the relation between faults and observations is needed. In this paper we focus on learning such models both from training data and from prior knowledge. There are several challenges in learning fault isolators. The number of data, as well as the available computing resources, are often limited, and there may be previously unobserved fault patterns. To meet these challenges we take on a Bayesian approach. We compare five different methods for learning in fault isolation, and evaluate their performance on a real fault isolation problem: the diagnosis of an automotive engine.

1 INTRODUCTION

We consider the problem of fault isolation, i.e. the problem of localizing faults that are present in a process given observations from this process. To do this, a model of the relations between observations and faults is needed. In the current work we investigate and compare different methods for learning such models from training data and prior knowledge.

We are motivated by the problem of fault isolation in an automotive engine, and the learning methods are evaluated using experimental training data and evaluation data from real driving situations. In engine fault isolation there may be several hundreds of faults and observations. There will be fault patterns, i.e. co-occurring faults, from which there are no training data. Furthermore, training data is typically experimental, obtained by implementing faults, running the process, and collecting observations. On the other hand, there is often engineering knowledge available about the process. The engineering knowledge can, for example, be used to determine the structure of dependencies between faults and observations. This kind of knowledge is often the only basis in previous algorithms for fault isolation [6, 12, 19].

Because there are fault patterns not represented in the training data, frequentist and purely data-based methods are bound to fail. To meet these challenges we use a Bayesian approach to learning in fault isolation. We consider five different methods of learning a model from training data, all of which are previously present in the literature in different forms. We tailor these methods to incorporate the available background information. The methods we consider are Direct Inference (DI), Logistic Regression (LogR), Linear Regression (LinR), Naive Bayes (NB) and general Bayesian Networks (BN).

The main contributions of the current work are the investigation of Bayesian learning methods and regression models for fault isolation by comparing the five methods mentioned above, the application and evaluation of the methods on real-world data, and the combination of data-driven learning and prior knowledge within these methods. In order to do this investigation, we first discuss the characteristics of the fault isolation problem in terms of probability theory, together with performance measures that are meaningful for fault isolation. We then show how the five methods can be adapted to the isolation problem, and apply them to the task of fault isolation in an automotive diesel engine. Finally, we compare the five methods and discuss their advantages and drawbacks.
Bayesian methods for fault isolation have been studied previously in the literature. In these previous works it is generally assumed that the model is given [26, 15], or that it can be derived from a physical model without using training data [17, 25]. In the current work, on the other hand, we focus on learning the models. Previous works on learning models for fault isolation typically rely on pattern recognition methods, described e.g. in [1, 3]; examples of such methods are presented in [14]. Pattern recognition methods are applicable if there is sufficient training data available. Unfortunately, this is rarely the case in fault isolation. In [20] the problem of learning with missing fault patterns is discussed, and training data is combined with fundamental methods for fault isolation described in [2, 22]. This approach is referred to as Direct Inference in the current work, and compared to the other four methods for learning.

The paper is structured as follows. We introduce notation and formulate the diagnosis problem in Section 2, where we also define relevant performance measures. In Section 3 we briefly describe the five methods used, and in particular how they are applied to the diagnosis problem, before we perform the evaluating experiments and compare the results obtained in Section 4. Finally, in Section 5 we conclude the paper by summarizing our results and discussing future work directions.

2 PROBLEM FORMULATION

Before going into the details of each of the learning methods we introduce some notation and discuss the characteristics of the fault isolation problem. Then we carefully state the problem at hand and define performance measures.

2.1 BAYESIAN FAULT ISOLATION

The fault isolation problem can be formulated as a prediction problem, where the task is to determine the fault(s) present in a system, given a set of observations from the system. Let the faults be represented by the binary variables Y = (Y_1, ..., Y_K), and let the observations from the system be represented by the variables X = (X_1, ..., X_L), where each X_l is discrete or continuous. Generally, we use upper case letters to denote variables and lower case letters to denote their values. Boldface letters denote vectors. We write p(X = x) (or simply p(x)) to denote either probabilities or probability distributions, both in the continuous and in the discrete case; the meaning will be clear from the context.

We are given a set of training data D, consisting of samples (y^n, x^n), n = 1, ..., N_D, pairs of fault and observation variables. The training data is collected by implementing faults and then collecting observations, meaning that training data is experimental. To evaluate the system we use a set E consisting of N_E samples. The evaluation data is collected by running the system, meaning that it is observational. Furthermore, we assume that the fault isolation algorithm is triggered by a fault detector telling us there must be at least one fault present in the process.

The structure of dependencies between the faults and observations has three basic properties, illustrated in the example Bayesian network of Figure 1.

[Figure 1: A Bayesian network describing a typical fault isolation problem. Fault variables Y_1, ..., Y_5 are root nodes with arcs to observation variables X_1, ..., X_8.]

The first property is that faults are assumed to be a priori independent, i.e. that

p(\mathbf{y}) = \prod_{k=1}^{K} p(y_k \mid y_1, \ldots, y_{k-1}) \approx \prod_{k=1}^{K} p(y_k),    (1)

meaning that faults cannot cause other faults to occur. Although not necessary for the methods in the current work, this is a standard assumption in many fault isolation algorithms [6], and it simplifies the reasoning in the following sections.

Second, faults may causally affect one or several of the observation variables, introducing dependencies between faults and observations. A dependency between fault variable Y_k and observation variable X_l means that the fault may be visible in the observation.

The third property is that an observation variable X_l may be dependent on other observation variables. Dependencies between observation variables may arise for several reasons. For example, they can be caused by unobserved factors, such as humidity, driver behavior, and the operation point of the process. These unobserved factors could be modeled using hidden nodes, but since they are numerous and unknown they are here simply modeled with dependencies between observation variables. This is discussed more carefully in [21].
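As a concrete illustration of assumption (1), the following sketch (our own; the helper name is hypothetical, and the prior values are simply borrowed from Table 2 later in the paper) evaluates the factored prior of a fault pattern:

```python
import numpy as np

# Per-fault prior probabilities p(Y_k = 1); the values of Table 2 are
# reused here purely for illustration.
fault_priors = np.array([0.4, 0.13, 0.057, 0.13, 0.057])

def prior_of_pattern(y, priors=fault_priors):
    """Prior probability of a fault pattern y (binary vector) under the
    a priori independence assumption of equation (1)."""
    y = np.asarray(y)
    return float(np.prod(np.where(y == 1, priors, 1.0 - priors)))

# Example: the pattern where only fault y1 is present.
print(prior_of_pattern([1, 0, 0, 0, 0]))  # = 0.4 * 0.87 * 0.943 * 0.87 * 0.943
```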
We take a Bayesian viewpoint on fault isolation. The objective is to find the probability for each fault to be present given the current observation, the training data, and the prior knowledge I, i.e. to compute the probabilities p(y_k|x, D, I), k = 1, ..., K. The probability for each fault can be found by marginalizing over y_{-k} = (y_1, ..., y_{k-1}, y_{k+1}, ..., y_K),

p(y_k \mid \mathbf{x}, D, I) = \sum_{\mathbf{y}_{-k}} p(\mathbf{y}_{-k}, y_k \mid \mathbf{x}, D, I).    (2)

Note that (y_{-k}, y_k) = y, so (2) means that we seek the conditional distribution p(y|x, D, I). To simplify the notation we will omit the background information I in the equations.

Computing the conditional distribution p(y|x, D) is generally difficult. To approximate it we need a model M and a method for determining the parameters of this model.

2.2 PERFORMANCE MEASURES

To evaluate the different models used for Bayesian fault isolation, we use two performance measures: the log-score and the percentage of correct classification.

The log-score is a commonly used measure [1], given by

\mu(E, M) = \frac{1}{N_E} \sum_{j=1}^{N_E} \log p(\mathbf{y}^j \mid \mathbf{x}^j, M).    (3)

The scoring function μ measures two important properties of the fault isolation system: the ability to assign large probability mass to faults that are present, and the ability to assign small probability mass to faults that are not present. Furthermore, the log-score is a proper score. A proper score has the characteristic that it is maximized when the learned probability distribution corresponds to the empirically observed probabilities. In the fault isolation problem the conditional probabilities for faults are often combined with decision theoretic methods for troubleshooting [8], where optimal decision making requires conditional probabilities close to the generating distribution.

The second measure we use is not proper. It is closely related to the 0/1-loss used e.g. in pattern classification [1]. However, in case of multiple faults present it suffices to assign the highest probability to any of them. We define

\nu(E, M) = \#\{j : y^j_{\max}(\mathbf{x}^j, M) = 1\} / N_E,    (4)

where y^j_{\max}(\mathbf{x}^j, M) is the fault assigned highest probability by M given x^j. The ν-score reflects the performance of the fault isolation system combined with the simple troubleshooting strategy "check the most probable fault first".
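As a minimal sketch of how the two scores could be computed (our own illustration; it assumes the model's predictions have already been collected into arrays, and the function names are hypothetical):

```python
import numpy as np

def log_score(p_true_pattern):
    """Equation (3): mean log-probability p(y^j | x^j, M) that the model
    assigns to the true fault pattern of each evaluation sample."""
    return float(np.mean(np.log(p_true_pattern)))

def nu_score(fault_marginals, true_patterns):
    """Equation (4): fraction of samples for which the fault ranked most
    probable is among the faults actually present.
    fault_marginals: (N_E, K) array of p(y_k = 1 | x^j, M)
    true_patterns:   (N_E, K) binary array of true fault patterns"""
    most_probable = np.argmax(fault_marginals, axis=1)
    hits = true_patterns[np.arange(len(most_probable)), most_probable] == 1
    return float(np.mean(hits))
```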
3 MODELLING APPROACHES

In this section we briefly present the inference methods used to tackle the fault isolation problem. We carefully state all assumptions made, and describe the adjustments needed to apply each method to the diagnosis problem. We begin by describing two assumptions that need to be made for all methods except DI.

3.1 MODELLING ASSUMPTIONS

All the methods considered in this paper – with the exception of DI – build separate models for each fault and thus assume independence among the faults. A priori this corresponds to approximation (1). However, when we build separate models for each fault, we also make a stronger assumption, namely that the faults remain independent given the observations,

p(\mathbf{y} \mid \mathbf{x}) = \prod_{k=1}^{K} p(y_k \mid \mathbf{x}, y_1, \ldots, y_{k-1}) \approx \prod_{k=1}^{K} p(y_k \mid \mathbf{x}).    (5)

This approximation is (after applying Bayes' rule and canceling terms) equivalent to

\prod_{k=1}^{K} p(\mathbf{x} \mid y_k) \approx \prod_{k=1}^{K} p(\mathbf{x} \mid y_1, \ldots, y_k),    (6)

meaning that the observation x is dependent on each fault y_k, but this dependency is assumed to be independent of all other faults y_{k'}, k' ≠ k. In other words, we assume no "explaining away" [10]. Looking at Figure 1 we observe that this is indeed a strong assumption, since there are unshielded colliders (V-structures, "bastards": common children of non-connected nodes) of the faults present.

Assumption (5) is primarily made for technical reasons, in order to be able to build separate models for each fault. But often it is also the case (as in the application of Section 4) that there is training data only from single faults. This means we do not have any training data telling us about the joint effect of multiple faults.

Recall that it is known that there is at least one fault present when the fault isolator is employed, see Section 2.1. Therefore, instead of computing p(y|x), we seek

p(\mathbf{y} \mid \mathbf{x}, \textstyle\sum_k y_k > 0) = \frac{p(\mathbf{y} \mid \mathbf{x})}{1 - p(\mathbf{y} \equiv \mathbf{0} \mid \mathbf{x})}.    (7)

Unfortunately,

p(\mathbf{y} \mid \mathbf{x}, \textstyle\sum_k y_k > 0) \neq \prod_k p(y_k \mid \mathbf{x}, \textstyle\sum_k y_k > 0),    (8)

a fact which recouples the single-fault models introduced in (5). This fact is ignored during the learning phase, and the single-fault models are trained individually. We then apply (7) in the evaluation phase.
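The following sketch (our own illustration; it brute-forces all 2^K fault patterns, which is feasible for the five faults of Section 4) shows how per-fault models can be combined under (5), conditioned on at least one fault via (7), and marginalized back to per-fault probabilities via (2). Note that after this renormalization the marginals no longer factor, which is exactly inequality (8):

```python
import itertools
import numpy as np

def fault_marginals_given_detection(q):
    """q[k] = p(y_k = 1 | x) from the k-th single-fault model (5).
    Returns p(y_k = 1 | x, sum_k y_k > 0) via equations (7) and (2)."""
    K = len(q)
    q = np.asarray(q, dtype=float)
    norm = 1.0 - np.prod(1.0 - q)            # 1 - p(y = 0 | x), see (7)
    marginals = np.zeros(K)
    for y in itertools.product([0, 1], repeat=K):
        if sum(y) == 0:
            continue                          # the no-fault pattern is excluded
        p_y = np.prod([q[k] if y[k] else 1.0 - q[k] for k in range(K)])
        marginals[np.array(y) == 1] += p_y / norm
    return marginals
```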
3.2 DIRECT INFERENCE

Several previous fault isolation algorithms rely on prior knowledge about which observations may be affected by each fault [2, 22, 12]. Such information is typically expressed in a so-called Fault Signature Matrix (FSM). An example of an FSM is given in Table 1. In the FSM, a zero in position (k, l) means that fault Y_k can never affect observation X_l.

Table 1: An example of an FSM

        Y1  Y2  Y3
    X1   1   1   0
    X2   1   0   1

The direct inference method aims at combining the information given by the FSM with the available training data. Assume that observations are binary and that the background information I containing the FSM is given. Then, under certain assumptions it can be shown [20] that

p(y \mid \mathbf{x}, D) = \begin{cases} 0 & \mathbf{x} \in \gamma \\ \frac{n_{\mathbf{x}y} + \alpha_{\mathbf{x}y}}{N_y + A_y} \cdot \frac{p(y \mid I)}{\pi_0} & \text{otherwise}, \end{cases}    (9)

where π_0 is a normalization constant, n_{xy} is the count of training samples with fault y and observations x, α_{xy} is a parameter describing the prior belief in the observation x when the fault is y (a Dirichlet prior), N_y = \sum_{\mathbf{x}'} n_{\mathbf{x}'y}, and A_y = \sum_{\mathbf{x}'} \alpha_{\mathbf{x}'y}. The sets γ are determined by the background information, as described in [20].

The direct inference method is developed for sparse sets of training data, particularly when there is only training data from a subset of the fault patterns to isolate.

3.3 BAYESIAN NETWORKS

When using Bayesian networks for prediction, we seek the joint distribution p(y, x|θ), where θ are parameters describing the conditional probability distributions in the network. From the joint distribution, the conditional distribution for y can be computed. We consider two types of Bayesian networks: Naive Bayes and general Bayesian networks.

3.3.1 Naive Bayes

The Naive Bayes classifier assumes that the observations are independent given the fault. Naive Bayes is one of the standard methods for Bayesian prediction and often performs surprisingly well [3, 23]. However, due to the erroneous independence assumptions it is poorly calibrated when there are strong dependencies between the observations. To alleviate this problem, we apply variable selection according to an internal leave-one-out scoring function:

S(V) = \frac{1}{N_D} \sum_{n=1}^{N_D} \log P(y_k^n \mid \mathbf{x}^n, V, D \setminus \{(\mathbf{y}^n, \mathbf{x}^n)\}, \alpha),    (10)

where V ⊂ X is the variable set under consideration and α is the Dirichlet hyper-parameter for the NB model.

3.3.2 General Bayesian Network

Since it is known that the faults causally precede the observations, and since the observations are known to be dependent given the faults, a natural step forward from the Naive Bayes structure is a general Bayesian network. In the network we constrain the fault to be a root node, but otherwise leave the structure unconstrained. One such network was learned for each fault using the BDe score (with an equivalent sample size parameter of 1.0). For small systems (< 30 variables) learning can be performed using the exact algorithm in [27], while for larger systems approximate methods, e.g. [9], can be used.

3.4 REGRESSION

Fault isolation is a discriminative task, where we are to predict the fault vector y given the observations x, i.e. to estimate the conditional likelihood

p(\mathbf{y} \mid \mathbf{x}, \theta) = \frac{p(\mathbf{y}, \mathbf{x} \mid \theta)}{\sum_{\mathbf{y}} p(\mathbf{y}, \mathbf{x} \mid \theta)}.    (11)

It is well known [18, 11] that in such a case it can be of great benefit to employ a discriminative learning method that learns only the probabilities asked for, instead of spending training data on learning the joint data likelihood as in the Bayesian network methods of Section 3.3. Regression models form a family of such methods.

3.4.1 Linear Regression

The most straightforward regression method is linear regression, where each fault variable is assumed to be a linear combination of the observations plus a Gaussian noise term,

y_k = \mathbf{w}_k^T \mathbf{x} + w_{k0} + \epsilon_k, \quad \epsilon_k \sim \mathcal{N}(0, \sigma).

Here w_k, w_{k0}, and σ are parameters to be determined. This gives the probability distribution

p(y_k \mid \mathbf{x}) = \frac{1}{Z} \exp\left(-\frac{(\mathbf{w}_k^T \mathbf{x} + w_{k0} - y_k)^2}{2\sigma^2}\right),    (12)

where Z is a normalization constant. To determine the parameters we use the standard methods described for example in [1]:

\mathbf{w}^* = \arg\min_{\mathbf{w}} \left(-\sum_{n=1}^{N_D} \log p(y_k^n \mid \mathbf{x}^n, \mathbf{w})\right) = \arg\min_{\mathbf{w}} \sum_{n=1}^{N_D} (\mathbf{w}_k^T \mathbf{x}^n + w_{k0} - y_k^n)^2.

When the parameters w* are known, the parameter σ can also be computed. The normalization constant in (12) is given by Z = \exp(-((\mathbf{w}^*)_k^T \mathbf{x} + w_{k0}^* - 1)^2 / 2\sigma^2) + \exp(-((\mathbf{w}^*)_k^T \mathbf{x} + w_{k0}^* - 0)^2 / 2\sigma^2).
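A sketch of this linear-regression fault model (our own illustration, using ordinary least squares for w* and the two-point normalization of Z given above; the function names are hypothetical):

```python
import numpy as np

def fit_linear_fault_model(X, y):
    """Least-squares estimate of (w_k, w_k0) for one fault, as in the
    arg-min above; X is (N_D, L) continuous data, y the binary fault column."""
    A = np.hstack([X, np.ones((X.shape[0], 1))])   # append intercept column
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    sigma = float(np.std(A @ w - y))               # noise level, given w*
    return w, sigma

def p_fault_given_x(x, w, sigma):
    """Equation (12), with Z summing the Gaussian kernel over y_k in {0, 1}."""
    m = float(np.dot(w[:-1], x) + w[-1])
    g = lambda y: np.exp(-(m - y) ** 2 / (2.0 * sigma ** 2))
    return g(1.0) / (g(0.0) + g(1.0))              # p(y_k = 1 | x)
```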
3.4.2 Logistic Regression

Learning parameters to maximize (11) for a Bayesian network B is known to be equivalent to logistic regression, under the condition that no child of the class can be a "bastard", a common child of two variables that are not interconnected directly. More formal definitions and proofs can be found in [24]. In our case, this implies approximation (5).

To start with, for each fault we learn a logistic regression model corresponding to a discriminative Naive Bayes classifier (other possible choices include tree-augmented Naive Bayes (TAN) [24, 5]).

We name the parameters of the logistic regression model α and β, such that the conditional likelihood is defined as

p(y_k = 1 \mid \mathbf{x}, \alpha, \beta) := \frac{\exp s(\mathbf{x}, \alpha, \beta)}{\exp s(\mathbf{x}, \alpha, \beta) + \exp(-s(\mathbf{x}, \alpha, \beta))},    (13)

where

s(\mathbf{x}, \alpha, \beta) := \alpha + \sum_{l=1}^{L} x_l \beta_l.    (14)

We also include a smoothing term c(α, β) in our objective function, which takes the place of a prior in the corresponding NB classifier. To unify its role across different observations, we first normalize our data by shifting and scaling such that for l = 1, ..., L

\sum_n x_l^n = 0 \quad \text{and} \quad \max_n |x_l^n| = 1.    (15)

Starting out from the uniform prior, we pretend to have seen one vector of each class at node Y_k and two vectors of each class with extreme values ±1 at each node X_l, with all other values zero (i.e. unobserved). This amounts to a smoothing term

c(\alpha, \beta) = c' - 2 \log(\exp(\alpha) + \exp(-\alpha)) - 4 \sum_{l=1}^{L} \log(\exp(\beta_l) + \exp(-\beta_l)),    (16)

where c' is an additive constant. However, we found this smoothing term problematic, since it is flat near zero: we never get any parameters exactly zero. But in logistic regression many small parameters can make a difference, even while they are only weakly supported. We therefore choose to replace log(exp(x) + exp(−x)) by |x|. This is a good approximation away from zero, but it forces unsupported parameters to zero, implicitly performing attribute selection.

For fault Y_k we search parameters so as to maximize

\log p(y_k \mid \mathbf{x}, \alpha, \beta) + c(\alpha, \beta) = \sum_{n=1}^{N_D} \log p(y_k^n \mid \mathbf{x}^n, \alpha, \beta) - 2|\alpha| - 4 \sum_{l=1}^{L} |\beta_l|.    (17)

We do this by simple line search, one parameter at a time (there are much faster optimization techniques, some of which are compared in [16], but for our purposes this did nicely).

Finally, we try a variant of this algorithm which weights the training vectors. We have prior knowledge about the probabilities p(y_k) with which to expect some fault y_k in the real-world setting or, in this case, the evaluation set. These probabilities differ from the relative frequencies observed in the training set. The idea is to weight the training vectors in the objective so as to focus the optimization on areas of the data space more likely to be seen later on. The corresponding objective for fault Y_k becomes

\sum_{n=1}^{N_D} w_k \log p(y_k^n \mid \mathbf{x}^n, \alpha, \beta) + c(\alpha, \beta),    (18)

where the weight w_k is the prior p(y_k) divided by the observed relative frequency \#\{n : y_k^n = y_k\} / N_D.
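To make the smoothed objective concrete, here is a sketch of (17)/(18) with the |·| smoothing term (our own illustration; the grid-based line search is a crude stand-in, since the paper does not spell out the exact search procedure, and the per-sample weight vector w implements the class-dependent weighting of (18)):

```python
import numpy as np

def objective(alpha, beta, X, y, w=1.0):
    """Weighted conditional log-likelihood of model (13)-(14) plus the
    |.|-smoothing term, i.e. equations (17)-(18). X is assumed to be
    normalized as in (15); w may be a scalar or a per-sample weight vector."""
    s = alpha + X @ beta
    log_lik = np.where(y == 1, s, -s) - np.log(np.exp(s) + np.exp(-s))
    return np.sum(w * log_lik) - 2.0 * abs(alpha) - 4.0 * np.sum(np.abs(beta))

def line_search_fit(X, y, w=1.0, sweeps=20, grid=np.linspace(-2.0, 2.0, 81)):
    """Maximize the objective one parameter at a time over a fixed grid."""
    alpha, beta = 0.0, np.zeros(X.shape[1])
    for _ in range(sweeps):
        alpha = max(grid, key=lambda a: objective(a, beta, X, y, w))
        for l in range(len(beta)):
            def score(b, l=l):
                trial = beta.copy()
                trial[l] = b
                return objective(alpha, trial, X, y, w)
            beta[l] = max(grid, key=score)
    return alpha, beta
```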
4 EXPERIMENTS

To evaluate the different methods for learning fault isolation models, we apply them to the diagnosis of the gas flow in a 6-cylinder diesel engine in a Scania truck. In automotive engines, sensor faults are among the most common faults, and here we consider five faults that may appear in different sensors. The faults are listed together with their prior probabilities in Table 2.

Table 2: The faults considered

    Fault  Description           p(y_k)
    y1     exhaust gas pressure  0.4
    y2     intake pressure       0.13
    y3     intake air pressure   0.057
    y4     EGR valve position    0.13
    y5     mass flow             0.057

4.1 EXPERIMENTAL SETUP

For the gas flow of the diesel engine there is a physical model from which a set of 29 diagnostic tests is automatically generated using structural analysis [4, 13]. Each of the observations is constructed to be sensitive to a subset of the faults.

For training and evaluation data we use measurements from real operation of the truck, with faults implemented. The training data consists of 100 samples each from the five single faults. Evaluation data consists of data from the five single faults, but also of data from two multiple faults, y1&y2 and y1&y4. Evaluation data is observational and consists of 1000 samples, distributed roughly according to the prior probabilities in Table 2.

The data we consider is originally continuous, but all methods except the regression algorithms take discrete data as input. The data is therefore discretized in two different ways (a sketch is given at the end of this subsection): binary, with thresholds set such that all fault-free data is known to be contained in the same bin; and discretized using k-means clustering [7] with k = 4. DI is applied to the discrete data. NB and BN are run both on discrete and binary data. The regression methods LinR and LogR are applied to the continuous data.

As described in Section 3, the NB and DI algorithms perform best if not all observations are used. For both DI and NB we perform variable selection such that an internal log-score is maximized. For DI, the best result is obtained by using only six of the observations. In NB, between seven and 18 observations are used for each fault.
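The two discretization schemes described above could be sketched as follows (our own illustration; the fault-free reference data and the use of scikit-learn's k-means are assumptions, not specified by the paper):

```python
import numpy as np
from sklearn.cluster import KMeans

def binary_discretize(col, fault_free_col):
    """Binary scheme: thresholds chosen so that all fault-free training
    data for this observation falls into one and the same bin."""
    lo, hi = fault_free_col.min(), fault_free_col.max()
    return ((col >= lo) & (col <= hi)).astype(int)

def kmeans_discretize(col, k=4, seed=0):
    """k-means scheme [7] with k = 4: each value is mapped to the index
    of its nearest cluster centre."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    return km.fit_predict(col.reshape(-1, 1))
```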
4.2 RESULTS

In Table 3 the log-score (μ) and the percentage of correct classification (ν) are presented for the different methods. In addition, we report the number of parameters used by each predictor. This is relevant since, for on-board fault isolation, the computing and storage capacity is often limited. For comparison we also report the default predictor, which is obtained by simply using the prior probabilities given in Table 2.

Table 3: Comparison of the methods

    method        log-score  ν-score  #pars
    DI            -1.088     0.781    106
    NB-bin.       -1.340     0.748    293
    NB-disc.      -1.044     0.843    335
    BN-bin.       -1.297     0.782    287
    BN-disc.      -1.398     0.840    1136
    LinR          -1.839     0.834    150
    LogR          -1.071     0.829    46
    LogR+weights  -0.953     0.829    44
    default       -1.738     0.592    5

We observe that among the four best methods in Table 3, three are discriminative and learn the conditional distribution instead of the joint distribution. Furthermore, LogR with training sample weighting performs best on this data in the log-score sense, while using a small number of parameters. Surprisingly, the weighting trick has made quite a difference: LogR without weights is outperformed by NB-disc. NB performs better when it is fed with discretized observations instead of binary ones, while for BN the effect is reversed. Clearly the discretized data contain more information, but it seems that in more complex Bayesian networks the conditional probability tables easily grow too large. In DI, good results are obtained by exploiting prior knowledge stating that some faults never cause an observation to pass certain thresholds.

Measured by the ν-score, the relative differences between the methods become smaller. We observe that this score favors the regression models and the Bayesian methods using binary data. The reason for the good performance of the methods using binary data is the particular way of thresholding the data, such that all fault-free samples are contained in the same bin.

Table 4 compares the log-scores of the predictions given for the single faults by DI and LogR+weights. Note that because of inequality (8) the columns do not sum to the corresponding entries in Table 3.

Table 4: Comparison of DI and LogR on single faults

    fault  μ DI    μ LogR+w
    y1     -0.346  -0.385
    y2     -0.324  -0.287
    y3     -0.087  -0.008
    y4     -0.334  -0.294
    y5     -0.177  -0.133

Not surprisingly, both methods (as all others) have most trouble with faults y1, y2 and y4, the ones appearing simultaneously in evaluation data but not in training data. This gives evidence for explaining away being important in this problem. Figure 2, in which the probabilities for each fault using LogR+weights are plotted, shows this in more detail. In the figure we have ordered the evaluation data such that the rightmost samples have multiple faults, visualizing that the double faults are most difficult to predict.

[Figure 2: The predicted probabilities p(y_1|x^n), ..., p(y_5|x^n) given by LogR+w, one panel per fault, plotted over the 1000 evaluation samples n. Evaluation data is ordered after their fault patterns; the true fault is marked with a solid line.]

5 CONCLUSIONS

We have considered the problem of fault isolation in an automotive diesel engine, and have discussed the special characteristics of this problem. There is experimental training data available which is distributed differently from what we expect to see in the real-world setting. In particular, evaluation data consists partly of previously unseen fault patterns. In addition, there is prior knowledge available about which faults may affect each observation, as well as the knowledge that at least one fault is present.

We have studied different Bayesian and regression approaches to combining this by nature heterogeneous information into probability distributions for the faults conditioned on given observations. We have compared the performance of the methods using real-world data, and have found the discriminative logistic regression method to perform best. Among the best methods we have also found the naive Bayes classifier and the direct inference method.

One of the clearest implications of this work is that all methods have difficulties in handling unobserved fault patterns. Unfortunately, unobserved patterns are common in fault isolation, so this problem should be tackled in future work. All the methods used, except direct inference, ignore explaining away. However, the explaining-away effect can possibly be helpful when diagnosing unseen patterns. Furthermore, it is crucial to include background information in the learning phase whenever it is available.

In our work to come we will investigate models capable of both explaining away and taking prior knowledge into account, while providing an efficient inference procedure, as on-board computers offer very limited resources. We expect that further improvement of performance is possible.

References

[1] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.

[2] Johan de Kleer and Brian C. Williams. Diagnosis with behavioral modes. In Readings in Model-based Diagnosis, pages 124–130. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[3] Luc Devroye, Laszlo Györfi, and Gabor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, New York, 1996.

[4] Henrik Einarsson and Gustav Arrhenius. Automatic design of diagnosis systems using consistency based residuals. Master's thesis, Uppsala University, 2004.

[5] Russel Greiner and Wei Zhou. Structural extension to logistic regression: Discriminative parameter learning of belief net classifiers. In 13th International Conference on Uncertainty in Artificial Intelligence, 2002.
[6] Walter Hamscher, Luca Console, and Johan de Kleer. Readings in Model-based Diagnosis. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[7] John A. Hartigan. Clustering Algorithms. Wiley, 1975.

[8] David Heckerman, John S. Breese, and Koos Rommelse. Decision-theoretic troubleshooting. Communications of the ACM, 38(3):49–57, 1995.

[9] David Heckerman, Dan Geiger, and David M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243, 1995.

[10] Finn V. Jensen. Bayesian Networks. Springer-Verlag, New York, 2001.

[11] Petri Kontkanen, Petri Myllymäki, and Henry Tirri. Classifier learning with supervised marginal likelihood. In J. Breese and D. Koller, editors, Proceedings of the 17th International Conference on Uncertainty in Artificial Intelligence (UAI), pages 277–284, 2001.

[12] Jozef Korbicz, Jan M. Koscielny, Zdzislaw Kowalczuk, and Wojciech Cholewa. Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer, Berlin, Germany, 2004.

[13] Mattias Krysander, Jan Åslund, and Mattias Nyberg. An efficient algorithm for finding minimal over-constrained sub-systems for model-based diagnosis. IEEE Transactions on Systems, Man, and Cybernetics – Part A: Systems and Humans, 38(1):197–206, 2008.

[14] Gareth Lee, Parisa Bahri, Srinivas Shastri, and Anthony Zaknich. A multi-category decision support framework for the Tennessee Eastman problem. In Proceedings of the European Control Conference 2007, Greece, 2007.

[15] Uri Lerner, Ronald Parr, Daphne Koller, and Gautam Biswas. Bayesian fault detection and diagnosis in dynamic systems. In AAAI/IAAI, pages 531–537, 2000.

[16] Thomas P. Minka. A comparison of numerical optimizers for logistic regression. Technical report, Microsoft Research, 2003.

[17] Sriram Narasimhan and Gautam Biswas. Model-based diagnosis of hybrid systems. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 37(3):348–361, 2007.

[18] Andrew Y. Ng and Michael I. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems 14, 2002.

[19] Mattias Nyberg. Model-based diagnosis of an automotive engine using several types of fault models. IEEE Transactions on Control Systems Technology, 10(5):679–689, 2005.

[20] Anna Pernestål and Mattias Nyberg. Diagnosing known and unknown faults from incomplete data. In Proceedings of the European Control Conference, 2007.

[21] Anna Pernestål, Mattias Nyberg, and Bo Wahlberg. A Bayesian approach to fault isolation with application to diesel engine diagnosis. In Proceedings of the 17th International Workshop on Principles of Diagnosis (DX 06), pages 211–218, 2006.

[22] Raymond Reiter. A theory of diagnosis from first principles. In Readings in Model-based Diagnosis, pages 29–48. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992.

[23] Irina Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.

[24] Teemu Roos, Hannes Wettig, Peter Grünwald, Petri Myllymäki, and Henry Tirri. On discriminative Bayesian network classifiers and logistic regression. Machine Learning, pages 267–296, 2005.

[25] Indranil Roychoudhury, Gautam Biswas, and Xenofon Koutsoukos. A Bayesian approach to efficient diagnosis of incipient faults. In Proceedings of the 17th International Workshop on Principles of Diagnosis (DX 06), pages 243–250, 2006.
[26] Matthew Schwall and Christian Gerdes. A probabilistic approach to residual processing for vehicle fault detection. In Proceedings of the 2002 ACC, pages 2552–2557, 2002.

[27] Tomi Silander and Petri Myllymäki. A simple approach for finding the globally optimal Bayesian network structure. In Proceedings of the 22nd Conference on Uncertainty in AI (UAI), 2006.