The Impact of Overconfidence Bias on Practical Accuracy of Bayesian Network Models: An Empirical Study

Marek J. Drużdżel¹,² & Agnieszka Oniśko¹,³

¹ Faculty of Computer Science, Bialystok Technical University, Wiejska 45A, 15-351 Bialystok, Poland
² Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, USA
³ Magee Womens Hospital, University of Pittsburgh Medical Center, Pittsburgh, PA 15260, USA

Abstract

In this paper, we examine the influence of overconfidence in parameter specification on the performance of a Bayesian network model in the context of Hepar II, a sizeable Bayesian network model for diagnosis of liver disorders. We introduce noise into the parameters in such a way that the resulting distributions become biased toward extreme probabilities. We believe that this offers a systematic way of modeling expert overconfidence in probability estimates. It appears that the diagnostic accuracy of Hepar II is less sensitive to overconfidence in probabilities than it is to underconfidence and to random noise, especially when the noise is very large.

1 INTRODUCTION

Decision-analytic methods provide an orderly and coherent framework for modeling and solving decision problems in decision support systems [5]. A popular modeling tool for complex uncertain domains is a Bayesian network [13], an acyclic directed graph quantified by numerical parameters and modeling the structure of a domain and the joint probability distribution over its variables. There exist algorithms for reasoning in Bayesian networks that typically compute the posterior probability distribution over some variables of interest given a set of observations. Because these algorithms are mathematically correct, the ultimate quality of reasoning depends directly on the quality of the underlying models and their parameters. These parameters are rarely precise, as they are often based on subjective estimates. Even when they are based on data, they may not be directly applicable to the decision model at hand or fully trustworthy.

The search for those parameters whose values are critical for the overall quality of decisions is known as sensitivity analysis. Sensitivity analysis studies how much a model output changes as various model parameters vary through the range of their plausible values. It provides insight into the nature of the problem and its formalization, helps in refining the model so that it is simple and elegant (containing only those factors that matter), and checks the need for precision in refining the numbers [8]. It is theoretically possible that small variations in a numerical parameter cause large variations in the posterior probability of interest. Van der Gaag and Renooij [17] found that practical networks may indeed contain such parameters. Because practical networks are often constructed with only rough estimates of probabilities, a question of practical importance is whether overall imprecision in network parameters matters. If not, the effort that goes into polishing network parameters might not be justified, unless it focuses on the small subset of parameters shown to be critical.

There is a popular belief, supported by some anecdotal evidence, that Bayesian network models are overall quite tolerant to imprecision in their numerical parameters. Pradhan et al. [14] tested this on a large medical diagnostic model, the CPCS network [7, 16]. Their key experiment focused on systematically introducing noise into the original parameters (assumed to be the gold standard) and measuring the influence of the magnitude of this noise on the average posterior probability of the true diagnosis. They observed that this average was fairly insensitive to even very large noise. This experiment, while ingenious and thought-provoking, had two weaknesses.
The first of these, pointed out by Coupé and van der Gaag [3], is that the experiment focused on the average posterior rather than on the individual posterior in each diagnostic case and on how it varies with noise, which is of most interest. The second weakness is that the posterior of the correct diagnosis is by itself not a sufficient measure of model robustness. The weaknesses of this experiment were also discussed in [6] and [9].

In our earlier work [9], we replicated the experiment of Pradhan et al. using Hepar II, a sizeable Bayesian network model for diagnosis of liver disorders. We systematically introduced noise into Hepar II's probabilities and tested the diagnostic accuracy of the resulting model. Similarly to Pradhan et al., we assumed that the original set of parameters and the model's performance are ideal. Noise in the original parameters led to deterioration in performance. The main result of our analysis was that noise in the numerical parameters started taking its toll almost from the very beginning and not, as suggested by Pradhan et al., only when it was very large. The region of tolerance to noise, while noticeable, was rather small. That study suggested that Bayesian networks may be more sensitive to the quality of their numerical parameters than popularly believed. Another study that we conducted more recently [4] focused on the influence of progressive rounding of probabilities on model accuracy. Here also, rounding had an effect on the performance of Hepar II, although the main source of performance loss was zero probabilities. When zeros introduced by rounding are replaced by very small non-zero values, imprecision resulting from rounding has minimal impact on Hepar II's performance.

The empirical studies conducted so far that focused on the impact of noise in probabilities on Bayesian network results disagree in their conclusions. Moreover, the noise introduced into the parameters was usually assumed to be random, which may not be a reasonable assumption. Human experts, for example, often tend to be overconfident [8]. This paper describes a follow-up study that probes further the sensitivity of model accuracy to noise in probabilities. We examine whether a bias in the noise that is introduced into the network makes a difference. We introduce noise into the parameters in such a way that the resulting distributions become biased toward extreme probabilities. We believe that this offers a systematic way of modeling expert overconfidence in probability estimates. Our results show again that the diagnostic accuracy of Hepar II is sensitive to imprecision in probabilities. It appears, however, that the diagnostic accuracy of Hepar II is less sensitive to overconfidence in probabilities than it is to random noise. We also test the sensitivity of Hepar II to underconfidence in parameters and show that underconfidence in parameters leads to more error than random noise.

The remainder of this paper is structured as follows. Section 2 introduces the Hepar II model. Section 3 describes how we introduced noise into its probabilities. Section 4 describes the results of our experiments. Finally, Section 5 discusses our results in light of previous work.

2 THE Hepar II MODEL

Our experiments are based on Hepar II [10, 11], a Bayesian network model consisting of over 70 variables that addresses the problem of diagnosis of liver disorders. The model covers 11 different liver diseases and 61 medical findings, such as patient self-reported data, signs, symptoms, and laboratory test results. The structure of the model (i.e., the nodes of the graph along with the arcs among them) was built based on the medical literature and conversations with domain experts; it consists of 121 arcs. Hepar II is a real model whose nodes are a mixture of propositional, graded, and general variables. There are on average 1.73 parents per node and 2.24 states per variable. The numerical parameters of the model (there are 2,139 of these in the most recent version), i.e., the prior and conditional probability distributions, were learned from a database of 699 real patient cases. Readers interested in the Hepar II model can download it from the Decision Systems Laboratory's model repository at http://genie.sis.pitt.edu/.

As our experiments study the influence of the precision of Hepar II's numerical parameters on its accuracy, we owe the reader an explanation of the metric that we used to test the latter. We focused on diagnostic accuracy, which we defined in our earlier publications as the percentage of correct diagnoses on real patient cases. When testing the diagnostic accuracy of Hepar II, we were interested in both (1) whether the most probable diagnosis indicated by the model is indeed the correct diagnosis, and (2) whether the set of w most probable diagnoses contains the correct diagnosis for small values of w (we chose a "window" of w = 1, 2, 3, and 4). The latter focus is of interest in diagnostic settings, where a decision support system only suggests possible diagnoses to a physician. The physician, who is the ultimate decision maker, may want to see several alternative diagnoses before focusing on one.

With diagnostic accuracy defined as above, the most recent version of the Hepar II model reached a diagnostic accuracy of 57%, 69%, 75%, and 79% for window sizes of 1, 2, 3, and 4, respectively [12].
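For concreteness, the window-based accuracy measure can be sketched in a few lines of Python/NumPy. This is our illustration rather than the study's actual code (the experiments used the SMILE engine); the function name window_accuracy and the Dirichlet-sampled stand-in posteriors are hypothetical.

import numpy as np

def window_accuracy(posteriors, true_diagnoses, w):
    """Fraction of cases whose correct diagnosis appears among the
    w most probable diagnoses (the paper's window-w accuracy).

    posteriors:     (n_cases, n_disorders) posterior probabilities
    true_diagnoses: (n_cases,) index of the correct disorder per case
    w:              window size (1, 2, 3, or 4 in the paper)
    """
    # Indices of the w highest-posterior disorders for every case.
    top_w = np.argsort(posteriors, axis=1)[:, -w:]
    hits = [true in row for true, row in zip(true_diagnoses, top_w)]
    return float(np.mean(hits))

# Hypothetical usage with random stand-in data: 100 cases, 11 disorders.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(11), size=100)
truth = rng.integers(0, 11, size=100)
for w in (1, 2, 3, 4):
    print(f"w={w}: accuracy {window_accuracy(post, truth, w):.2f}")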
3 INTRODUCTION OF NOISE INTO Hepar II PARAMETERS

When introducing noise into the parameters, we used essentially the same approach as Pradhan et al. [14]: we transform each original probability into the log-odds space, add noise parametrized by a parameter σ (as we will show, even though σ is proportional to the amount of noise, in our case it cannot be directly interpreted as a standard deviation), and transform the result back into a probability, i.e.,

    p′ = Lo⁻¹[Lo(p) + Noise(0, σ)] ,    (1)

where

    Lo(p) = log₁₀[p / (1 − p)] .    (2)

3.1 Overconfidence bias

We designed the Noise() function as follows. Given a discrete probability distribution Pr, we identify its smallest probability pS. We transform this smallest probability pS into p′S by making it even smaller, according to the following formula:

    p′S = Lo⁻¹[Lo(pS) − |Normal(0, σ)|] .

We then make the largest probability pL in the distribution Pr larger by precisely the amount by which we decreased pS, i.e.,

    p′L = pL + pS − p′S .

This guarantees that the parameters of the transformed probability distribution Pr′ add up to 1.0.

Figure 1 shows the effect of introducing this noise.

Figure 1: Transformed (biased, overconfident) vs. original probabilities for various levels of σ.

As the figure shows, the transformation is such that small probabilities are likely to become smaller and large probabilities are likely to become larger; the distributions become biased toward the extreme probabilities. It is straightforward to prove that the entropy of Pr′ is smaller than the entropy of Pr. The transformed probability distributions thus reflect the overconfidence bias common among human experts.

An alternative way of introducing biased noise, suggested by one of the reviewers, is to build a logistic regression/IRT model (e.g., [1, 2, 15]) for each conditional probability table and, subsequently, to manipulate its slope parameter.

3.2 Underconfidence bias

For underconfidence, we designed the Noise() function as follows. Given a discrete probability distribution Pr, we identify its largest probability pL. We transform this largest probability pL into p′L by making it smaller, according to the following formula:

    p′L = Lo⁻¹[Lo(pL) − |Normal(0, σ)|] .

We then make the smallest probability pS in the distribution Pr larger by precisely the amount by which we decreased pL, i.e.,

    p′S = pS + pL − p′L .

Again, this guarantees that the parameters of the transformed probability distribution Pr′ add up to 1.0.

Figure 2 shows the effect of introducing this noise.

Figure 2: Transformed (biased, underconfident) vs. original probabilities for various levels of σ.

The transformed probability distributions reflect underconfidence bias.
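Equations (1)-(2) and the biased Noise() constructions above translate directly into code. The sketch below is a minimal Python/NumPy rendering of ours, assuming all parameters lie strictly inside (0, 1); the function names are ours, and the last lines illustrate the sum-to-1.0 guarantee and the entropy claim on a toy three-state distribution.

import numpy as np

def lo(p):
    """Log-odds transform, Eq. (2): Lo(p) = log10(p / (1 - p))."""
    return np.log10(p / (1.0 - p))

def lo_inv(x):
    """Inverse of Eq. (2): maps a log-odds value back into (0, 1)."""
    return 10.0 ** x / (1.0 + 10.0 ** x)

def overconfident(dist, sigma, rng):
    """Section 3.1: push the smallest probability toward 0 and give the
    difference to the largest one, so the distribution still sums to 1."""
    dist = np.asarray(dist, dtype=float).copy()
    s, l = np.argmin(dist), np.argmax(dist)
    p_s_new = lo_inv(lo(dist[s]) - abs(rng.normal(0.0, sigma)))
    dist[l] += dist[s] - p_s_new          # p'_L = p_L + p_S - p'_S
    dist[s] = p_s_new
    return dist

def underconfident(dist, sigma, rng):
    """Section 3.2: pull the largest probability down and give the
    difference to the smallest one."""
    dist = np.asarray(dist, dtype=float).copy()
    s, l = np.argmin(dist), np.argmax(dist)
    p_l_new = lo_inv(lo(dist[l]) - abs(rng.normal(0.0, sigma)))
    dist[s] += dist[l] - p_l_new          # p'_S = p_S + p_L - p'_L
    dist[l] = p_l_new
    return dist

def entropy(dist):
    return -np.sum(dist * np.log2(dist))

rng = np.random.default_rng(0)
pr = np.array([0.1, 0.3, 0.6])
pr_over = overconfident(pr, sigma=0.5, rng=rng)
print(pr_over, pr_over.sum())         # perturbed distribution still sums to 1.0
print(entropy(pr), entropy(pr_over))  # entropy drops under overconfident noise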
3.3 Random noise

For illustration purposes, Figure 3 shows the transformation applied in our previous study [9].

Figure 3: Transformed (unbiased) vs. original probabilities for various levels of σ.

For σ > 1 the amount of noise becomes so large that any probability value can be transformed into any other value. This strongly suggests that σ > 1 is not a region of practical interest. The main reason why we look at such high values of σ is that this was the range used in the Pradhan et al. paper.

4 EXPERIMENTAL RESULTS

We performed an experiment investigating the influence of biased noise in Hepar II's probabilities on its diagnostic performance. For the purpose of our experiment, we assumed that the model parameters were perfectly accurate and, effectively, that the diagnostic performance achieved with them was the best possible. Of course, in reality the parameters of the model may not be accurate, and the performance of the model can be improved upon. In the experiment, we studied how this baseline performance degrades under the condition of noise, as described in Section 3.

We tested 30 versions of the network (each for a different value of the noise parameter σ ∈ [0.0, 3.0], in 0.1 increments) on all records of the Hepar data set and computed Hepar II's diagnostic accuracy. We plotted this accuracy in Figures 4 and 5 as a function of σ for different values of the window size w.

Figure 4: The diagnostic accuracy of Hepar II for various window sizes as a function of the amount of biased overconfident noise (expressed by σ).

Figure 5: The diagnostic accuracy of Hepar II for various window sizes as a function of the amount of biased underconfident noise (expressed by σ).

It is clear that Hepar II's diagnostic performance deteriorates with noise. In order to facilitate comparison between biased and unbiased noise and, by this, judgment of the influence of overconfidence bias on the results, we reproduce the experimental result of [9] in Figure 6.

Figure 6: The diagnostic accuracy of Hepar II for various window sizes as a function of the amount of unbiased noise (expressed by σ) [9].

The results are qualitatively similar, although it can be seen that performance under overconfidence bias degrades more slowly with the amount of noise than performance under random noise. Performance under underconfidence bias degrades the fastest of the three. Figure 7 shows the accuracy of Hepar II (w = 1) for biased and unbiased noise on the same plot, where this effect is easier to see.

Figure 7: The diagnostic accuracy of Hepar II as a function of the amount of noise (random, underconfident, and overconfident), window w = 1.

It is interesting to note that for small values of σ, such as σ < 0.2, there is only a minimal effect of noise on performance. This observation may offer some justification for the belief that Bayesian networks are not too sensitive to imprecision in their probability parameters.
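A toy version of this sweep can be reconstructed from the pieces above. The sketch below is purely illustrative: it substitutes a small synthetic naive Bayes "diagnostic model" for Hepar II (whose parameters and patient records are not reproduced here), reuses the overconfident() function from the earlier sketch to perturb each conditional probability, and coarsens the σ grid to 0.5 steps to keep the output short; all names (posterior, w1_accuracy) are ours.

import numpy as np

rng = np.random.default_rng(1)

# Toy model: a prior over 4 disorders and 10 binary findings, plus
# synthetic "patient records" sampled from the model itself.
n_dis, n_find, n_cases = 4, 10, 500
prior = rng.dirichlet(np.ones(n_dis))
cpt = rng.uniform(0.05, 0.95, size=(n_find, n_dis))  # P(finding | disorder)
truth = rng.choice(n_dis, size=n_cases, p=prior)
obs = (rng.random((n_cases, n_find)) < cpt[:, truth].T).astype(int)

def posterior(prior, cpt, findings):
    # Exact naive Bayes posterior over disorders given binary findings.
    like = np.where(findings[:, None] == 1, cpt, 1.0 - cpt).prod(axis=0)
    post = prior * like
    return post / post.sum()

def w1_accuracy(prior, cpt):
    # Window w = 1: the most probable disorder must be the correct one.
    preds = [np.argmax(posterior(prior, cpt, o)) for o in obs]
    return np.mean(np.array(preds) == truth)

for sigma in np.arange(0.0, 3.1, 0.5):
    # Each CPT entry p forms a two-state distribution (p, 1 - p), so
    # overconfident() (defined in the sketch after Section 3.2) pushes
    # it toward its nearer extreme.
    noisy = np.array([[overconfident(np.array([p, 1.0 - p]), sigma, rng)[0]
                       for p in row] for row in cpt])
    print(f"sigma={sigma:.1f}  w=1 accuracy: {w1_accuracy(prior, noisy):.2f}")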
5 SUMMARY

This paper has studied the influence of bias in parameters on model performance in the context of a practical medical diagnostic model, Hepar II. We believe that the study was realistic in the sense of focusing on a real, context-dependent performance measure. Our study has shown that the performance of Hepar II is sensitive to noise in its numerical parameters, i.e., the diagnostic accuracy of the model decreases after noise is introduced into the numerical parameters of the model.

While our result is merely a single data point that sheds light on the hypothesis in question, it appears that overconfidence bias has a smaller negative effect on model performance than random noise, and that underconfidence bias leads to the most serious deterioration of performance. While this is only speculation that begs for further investigation, one might see our results as an explanation of the fact that humans tend to be overconfident rather than underconfident in their probability estimates.
Acknowledgments

This work was supported by the Air Force Office of Scientific Research grant FA9550-06-1-0243, by Intel Research, and by the MNiI (Ministerstwo Nauki i Informatyzacji) grant 3T10C03529. We thank Linda van der Gaag for suggesting extending our earlier work on the sensitivity of Bayesian networks to the precision of their numerical parameters by introducing bias in the noise. Reviewers for The Sixth Bayesian Modelling Applications Workshop provided several useful suggestions that have improved the readability and extended the scope of the paper.

The Hepar II model was created and tested using SMILE, an inference engine, and GeNIe, a development environment for reasoning in graphical probabilistic models, both developed at the Decision Systems Laboratory, University of Pittsburgh, and available at http://genie.sis.pitt.edu/. We used SMILE in our experiments and the data pre-processing module of GeNIe for plotting the scatter plots in Figure 1.

References

[1] Russell G. Almond, Louis V. DiBello, F. Jenkins, R. J. Mislevy, D. Senturk, L. S. Steinberg, and D. Yan. Models for conditional probability tables in educational assessment. In T. Jaakkola and T. Richardson, editors, Artificial Intelligence and Statistics 2001, pages 137–143. Morgan Kaufmann, 2001.

[2] Russell G. Almond, Louis V. DiBello, Brad Moulder, and Juan-Diego Zapata-Rivera. Modeling diagnostic assessment with Bayesian networks. Journal of Educational Measurement, 44(4):341–359, 2007.

[3] Veerle H. M. Coupé and Linda C. van der Gaag. Properties of sensitivity analysis of Bayesian belief networks. Annals of Mathematics and Artificial Intelligence, 36:323–356, 2002.

[4] Marek J. Druzdzel and Agnieszka Oniśko. Are Bayesian networks sensitive to precision of their parameters? In M. Klopotek, S. T. Wierzchoń, and M. Michalewicz, editors, Intelligent Information Systems XVI, Proceedings of the International IIS'08 Conference, pages 35–44, Warsaw, Poland, 2008. Academic Publishing House EXIT.

[5] Max Henrion, John S. Breese, and Eric J. Horvitz. Decision analysis and expert systems. AI Magazine, 12(4):64–91, Winter 1991.

[6] O. Kipersztok and H. Wang. Another look at sensitivity analysis of Bayesian networks to imprecise probabilities. In Proceedings of the Eighth International Workshop on Artificial Intelligence and Statistics (AISTATS-2001), pages 226–232, San Francisco, CA, 2001. Morgan Kaufmann Publishers.

[7] B. Middleton, M. A. Shwe, D. E. Heckerman, M. Henrion, E. J. Horvitz, H. P. Lehmann, and G. F. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base: II. Evaluation of diagnostic performance. Methods of Information in Medicine, 30(4):256–267, 1991.

[8] M. Granger Morgan and Max Henrion. Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. Cambridge University Press, Cambridge, 1990.

[9] Agnieszka Oniśko and Marek J. Druzdzel. Effect of imprecision in probabilities on Bayesian network models: An empirical study. In Working Notes of the European Conference on Artificial Intelligence in Medicine (AIME-03): Qualitative and Model-based Reasoning in Biomedicine, Protaras, Cyprus, October 18–22, 2003.

[10] Agnieszka Oniśko, Marek J. Druzdzel, and Hanna Wasyluk. Extension of the Hepar II model to multiple-disorder diagnosis. In M. Klopotek, S. T. Wierzchoń, and M. Michalewicz, editors, Intelligent Information Systems, Advances in Soft Computing Series, pages 303–313, Heidelberg, 2000. Physica-Verlag.

[11] Agnieszka Oniśko, Marek J. Druzdzel, and Hanna Wasyluk. Learning Bayesian network parameters from small data sets: Application of Noisy-OR gates. International Journal of Approximate Reasoning, 27(2):165–182, 2001.

[12] Agnieszka Oniśko, Marek J. Druzdzel, and Hanna Wasyluk. An experimental comparison of methods for handling incomplete data in learning parameters of Bayesian networks. In M. Klopotek, S. T. Wierzchoń, and M. Michalewicz, editors, Intelligent Information Systems, Advances in Soft Computing Series, pages 351–360, Heidelberg, 2002. Physica-Verlag (A Springer-Verlag Company).

[13] Judea Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1988.

[14] Malcolm Pradhan, Max Henrion, Gregory Provan, Brendan del Favero, and Kurt Huang. The sensitivity of belief networks to imprecise probabilities: An experimental investigation. Artificial Intelligence, 85(1–2):363–397, August 1996.

[15] Frank Rijmen. Bayesian networks with a logistic regression model for the conditional probabilities. International Journal of Approximate Reasoning, in press.
[16] M. A. Shwe, B. Middleton, D. E. Heckerman, M. Henrion, E. J. Horvitz, H. P. Lehmann, and G. F. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base: I. The probabilistic model and inference algorithms. Methods of Information in Medicine, 30(4):241–255, 1991.

[17] Linda C. van der Gaag and Silja Renooij. Analysing sensitivity data from probabilistic networks. In Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001), pages 530–537, San Francisco, CA, 2001. Morgan Kaufmann Publishers.