The design and validation of an intuitive confidence measure

Jasper van der Waa, Jurriaan van Diggelen, Mark Neerincx
TNO, Soesterberg, the Netherlands
jasper.vanderwaa@tno.nl, jurriaan.vandiggelen@tno.nl, mark.neerincx@tno.nl

ABSTRACT
Explainable AI becomes increasingly important as the use of intelligent systems becomes more widespread in high-risk domains. In these domains it is important that the user knows to which degree the system's decisions can be trusted. To facilitate this, we present the Intuitive Confidence Measure (ICM): a lazy learning meta-model that can predict how likely a given decision is to be correct. ICM is intended to be easy to understand, which we validated in an experiment. We compared ICM with two other ways of computing confidence measures: the numerical output of the model and an actively learned meta-model. The validation was performed using a smart assistant for maritime professionals. Results show that ICM is easier to understand, but that each user has unique wishes for explanations. This user study with domain experts shows what users need in their explanations and that personalization is crucial.

ACM Classification Keywords
I.2.M Artificial Intelligence: Miscellaneous; I.2.1 Artificial Intelligence: Applications and Expert Systems

Author Keywords
Explainability, Machine Learning, lazy learning, instance based, ICM, experiment, user, validation, confidence, measure, certainty

INTRODUCTION
The number of intelligent systems is increasing rapidly due to recent developments in Artificial Intelligence (AI) and Machine Learning (ML). The applications of intelligent systems are spreading to high-risk domains, for example medical diagnosis [3], maritime automation [18] and cybersecurity [6]. The need for transparency and explanations towards end users is becoming a necessity [8, 4]. This self-explaining capability allows intelligent systems to become more effective tools whose users can establish an appropriate level of trust. The field of Explainable Artificial Intelligence (XAI) aims to develop and validate methods for this capacity.

The process of explaining something involves a minimum of two actors: the explainer and the explainee [12]. A large number of studies in XAI focus on the system as the explainer and on how it can generate explanations, for example methods that identify feature importance [11, 15], extract a confidence measure [7], search for an informative prototypical feature set [10] or explain action policies in reinforcement learning [9]. Although these are effective approaches to generate explanations, they do not validate their methods with the explainee. A working XAI method needs to incorporate the user's wishes, context and requirements [5, 13, 1]. As XAI tries to make ML models more transparent, a requirement for XAI methods is to be transparent themselves, so that the user can understand where the explanation comes from.

The proposed Intuitive Confidence Measure (ICM) is a case-based machine learning model that can predict how likely a given model output is to be correct in (semi-)supervised learning tasks. ICM is a meta-model that is stacked on top of, and independent of, its underlying ML model. The intuitive idea behind ICM is that it uses past situations, and the correct or incorrect outputs given in those situations, to compute the probability that a given output in some situation is correct. A high confidence is given when the current situation and output are similar to situations in which that output proved to be correct. Since ICM is a case-based or lazy-learning algorithm, each outputted confidence can be traced back to items in a data set or memory [2]. For example, the confidence in some output is low because this output is similar to past outputs that proved to be incorrect in very similar situations. This is opposed to a confidence measure that uses active learning, where a (possibly large) set of parameters describes learned knowledge that is difficult to explain or understand [8].

Other approaches to estimate a confidence value exist. Several machine learning model types can already provide a probabilistic output, such as neural networks with soft-max output layers. However, these confidence estimations can be inaccurate, as such models can learn to be very confident in an incorrect output as a trade-off for general improvement on the overall dataset [14]. Other approaches may not be model agnostic, for example the use of dropout in neural networks [7].

To test whether ICM is indeed easy to understand, we performed an experiment in which we compared ICM, a lazily learned meta-model, to two other types of certainty or confidence measures: the numerical output of the underlying model itself and an actively learned meta-model approach. We claim that ICM is preferable to these two types because 1) the numerical output of the underlying model is not always available, transparent or accurate [14] and 2) an actively learned meta-model has no clear connection between its outputted confidence and the data it used [2]. ICM, on the other hand, is a meta-model and as such independent of the workings of the underlying model except for its outputs, and due to lazy learning ICM's confidence values are directly related to its training set.

The experiment was performed within a maritime use case for computer-controlled propulsion; we refer to our earlier paper for a detailed description [18]. Participants had no knowledge about ML and worked in a high-risk maritime domain with extreme responsibilities. In our experiment we simulated the operator's working environment and presented the participant with classification outputs accompanied by a confidence value. Afterwards we interviewed the operators about their experiences and presented them with the three measure types identified earlier: 1) a numerical model output, 2) an actively learned meta-model and 3) our method, a lazily learned meta-model. We tested the participants' understanding of each of the measures to validate whether ICM, and lazily learned measures in general, are indeed easier to understand and as such preferred over numerical model outputs and actively learned meta-models.

The experiment showed that ICM is indeed easier to understand, but each operator had different wishes about when, and even whether, a confidence value should be presented, and all overestimated their own understanding of complex ML methods. XAI experiments with expert users such as these offer valuable insights into what kind of explanations are required and when.
INTUITIVE CONFIDENCE MEASURE
ICM computes the probability that a given output is correct. It does this by comparing that output with the ground truths of a set of known past data points, weighted by the similarity of the current data point to those past data points. We visualized this in Figure 1 for a simple example in which the Euclidean distance can be used as the similarity measure. The figure illustrates the intuitive idea that when a situation and output are similar to past situations in which different outputs proved to be correct, confidence will be low. The more similar situations there are with a different, correct output, the lower the confidence. If there are no similar situations, the confidence will be unknown or uniform, depending on the choice of presentation to the user. In the following paragraphs we only explain the vital technical details of ICM; we refer to earlier work for a more technical description and a discussion of its advantages and disadvantages [17].

Figure 1: Three examples of how ICM works in a 2D binary classification task (classes A and B), given a current data point with its output (square) and a set of known data points (circles) with their known ground truths.

ICM is based on the following equation, with x an arbitrary data point, M an arbitrary data set, d the used similarity function, σ the standard deviation used for the exponential weighting and M(T = A(x)) the selection of all data points in M with the same ground truth T as the output of model A for x:

C(x \mid \sigma, M) = \frac{\sum_{x_i \in M(T = A(x))} \exp\left(-\frac{d(x|x_i)^2}{2\sigma^2}\right)}{\sum_{x_i \in M} \exp\left(-\frac{d(x|x_i)^2}{2\sigma^2}\right)} \qquad (1)

The memory or data set M is sampled sequentially, from the training set or during actual usage of the system, according to three aspects. This strategy prefers 1) data points with a ground truth that is least common in the memory, 2) data points that lie some time apart, to mitigate temporal dependencies, and 3) data points that are relatively dissimilar to the data points already inside the memory. We refer to the original ICM paper for a detailed description [17]. The memory is restricted to a fixed size, k, to prevent extreme computational costs: the number of required computations grows with each added data point, and storing all data would quickly become infeasible in real-world cases where the model A and ICM may run for an indefinite time.

ICM has several properties in common with other lazy learning techniques such as k-Nearest Neighbours (k-NN). In particular, ICM is very similar to the weighted k-NN algorithm with an exponential weighting scheme in which the normalization guarantees that all weights sum to one. ICM becomes an instance of weighted k-NN for non-linear regression with the model's ground truth as the dependent variable, the memory M to mitigate computation cost and an arbitrary distance function d.
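To make Equation 1 concrete, the sketch below shows one possible implementation of the confidence computation over a fixed-size memory. It is a minimal illustration, not the authors' implementation: the class name IntuitiveConfidenceMeasure, the methods observe() and confidence(), the default distance function, the eviction heuristic and all default parameter values are assumptions made for this example. Only the weighted ratio in confidence() and the fixed memory size k follow the description above; the paper's memory sampling strategy is only loosely approximated.

```python
from collections import Counter
import numpy as np


class IntuitiveConfidenceMeasure:
    """Case-based confidence over a fixed-size memory of past cases (Eq. 1).

    Each stored case is a pair (x_i, ground_truth). The confidence of a new
    prediction is the kernel-weighted fraction of memory items whose ground
    truth equals that prediction.
    """

    def __init__(self, sigma=1.0, max_size=500, distance=None):
        self.sigma = sigma                 # bandwidth of the exponential weighting
        self.max_size = max_size           # fixed memory size k
        self.distance = distance or (lambda a, b: float(np.linalg.norm(a - b)))
        self.memory = []                   # list of (x_i, ground_truth) cases

    def confidence(self, x, prediction):
        """Eq. 1: weighted share of memory cases whose ground truth matches
        the model output `prediction` for the current data point `x`."""
        if not self.memory:
            return None                    # no known cases: confidence unknown
        weights = np.array([
            np.exp(-self.distance(x, xi) ** 2 / (2.0 * self.sigma ** 2))
            for xi, _ in self.memory
        ])
        agrees = np.array([truth == prediction for _, truth in self.memory], dtype=float)
        total = weights.sum()
        return float((weights * agrees).sum() / total) if total > 0 else None

    def observe(self, x, ground_truth):
        """Add a case to the memory while keeping it at the fixed size k.

        The paper's sampling strategy favours 1) rare ground truths,
        2) temporal spacing and 3) dissimilarity to stored cases; this sketch
        only roughly reflects the first criterion and ignores the others.
        """
        if len(self.memory) >= self.max_size:
            counts = Counter(truth for _, truth in self.memory)
            most_common = counts.most_common(1)[0][0]
            # Evict the oldest stored case of the most common ground truth.
            evict = next(i for i, (_, t) in enumerate(self.memory) if t == most_common)
            del self.memory[evict]
        self.memory.append((x, ground_truth))


# Toy usage mirroring the 2D binary task of Figure 1.
if __name__ == "__main__":
    icm = IntuitiveConfidenceMeasure(sigma=0.5, max_size=100)
    icm.observe(np.array([0.0, 0.0]), "A")
    icm.observe(np.array([0.2, 0.1]), "A")
    icm.observe(np.array([1.0, 1.0]), "B")
    print(icm.confidence(np.array([0.1, 0.0]), "A"))  # close to 1: similar cases were class A
    print(icm.confidence(np.array([0.1, 0.0]), "B"))  # close to 0: B was only correct far away
```

In this toy setting the query point lies close to the stored class-A cases, so output A receives a confidence near 1 and output B a confidence near 0, matching the intuition sketched in Figure 1; every value can be traced back to the stored cases and their kernel weights.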
EXPERIMENT
In a small experiment we compared end-users' understanding of three instances of different types of confidence measures: 1) ICM as a lazily learned meta-model, 2) the approach by Park et al. [16] as an actively learned meta-model and 3) the soft-max output as a numerical output of the actual model. The experiment was based on a recent study with a virtual smart assistant that supports an operator on a ship with situation predictions to aid in his/her monitoring task [18]. We simulated the operator's work environment and the virtual smart assistant, and provided realistic scenarios and responses from the assistant, including a confidence value for any prediction made. This simulation was used to introduce the participants to the assistant and the numerical confidence values it could provide.

The simulated work environment was followed by an interview during which the participants received increasingly more information about the three confidence measures. The goal of the interview was to test how well and how quickly the participant understood each of the three confidence measures. The interview went through several stages:

1. First stage
   (a) Brief textual explanations of each measure and an opportunity for the participant to rate his/her understanding of the measure.
   (b) For each measure, a moment for the participant to ask questions, allowing the supervisor to rate how well the participant understands the measure.
   (c) An explanation of each measure by the participant in their own words, to be rated by an ML expert.

2. Second stage
   (a) Three concrete examples, both visual and textual, for each measure to illustrate its mechanisms, where the participant could rate his/her level of understanding for each set of examples.*
   (b) Per set of examples, a moment to ask questions, allowing the supervisor to rate how well the participant understands the examples.
   (c) An explanation of each example by the participant in their own words, to be rated by an ML expert.
   (d) The participant's final preference for one of the three confidence measures and an explanation of a given confidence. An ML expert checked whether this explanation overlaps with one of the three measures.

* The textual descriptions and examples can be requested by e-mail.

Results
The results of the five participants are shown in Figure 2. All were experts and potential end-users in the maritime use case. The two users who saw no use for a confidence measure believed that predictions should always be correct or otherwise not be presented at all. All participants believed that they had a basic to advanced comprehension of each measure and its set of examples; however, the experiment supervisor and the ML expert disagreed with this for both the 'numerical' and the 'active learning meta-model' measures.

Figure 2: This table shows three sets of ratings (min. 1, max. 4): 1) the participant's own belief of understanding (row 'own'), 2) the supervisor's rating and 3) the ML expert's rating of how well the explanations given by the participant match the measures and examples. It also shows whether the participant found a confidence measure useful, their preferred measure and the best match with their explanation of a confidence value.

Both the supervisor and the ML expert concluded that most participants had some degree of understanding of ICM. Only one participant was not able to comprehend the textual explanation, but the understanding of ICM was on average rated higher than that of the 'numerical' and 'active learning meta-model' measures by both the supervisor and the ML expert.

The explanations about the numerical output were lacking because participants had trouble comprehending that a model could learn knowledge and represent it in parameters. They had less difficulty with ICM because its outputs relate clearly to past situations. The explanations regarding the 'meta-model' measure were the most inaccurate: nearly all participants tended to see this measure as a combination of ICM and a probabilistic output. This was also the reason why three out of five participants tended to prefer this measure in the end, even though their own explanations of the confidence values resembled the approach used by ICM.

CONCLUSION
In the introduction we stated that XAI methods should not only be developed but also validated in experiments. We mentioned that XAI methods should be transparent themselves, such that the user can understand where an explanation comes from and why it is given.

The Intuitive Confidence Measure (ICM) was developed as a method to provide a confidence value alongside a machine learning model's output. It uses lazy learning and intuitive ideas to keep the method relatively simple, with clear connections between input and output. We performed a limited usability study with qualitative interviews. These interviews indicated that ICM is relatively simple to explain compared to other confidence measures based on model output (e.g. values from a softmax function) or on meta-models based on active learning.

The results showed that, in a group of similar end-users, there were mixed opinions about the necessity of a confidence measure and how it should be presented. However, most participants thought of ICM as an easy-to-understand measure and could recall the workings of ICM accurately. Most of the participants were even able to identify advantages and disadvantages of ICM in specific situations, showing a deeper understanding. Future work will focus on a larger study to test the intuitiveness of ICM, on technical improvements to ICM to mitigate its disadvantages and on ways to generate confidence explanations.

The development of new XAI methods for high-risk domains is important, but their validation in experiments with domain experts is equally important. Experiments and usability studies with domain experts, like the one presented in this paper, can help shape the field of XAI.

ACKNOWLEDGEMENTS
This study was performed as part of the Early Research Program Adaptive Maritime Automation (ERP AMA) within TNO, an independent research organisation in the Netherlands.
REFERENCES
1. Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
2. Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. 1997. Locally weighted learning for control. In Lazy Learning. Springer, 75–113.
3. Arnaud Belard, Timothy Buchman, Jonathan Forsberg, Benjamin K. Potter, Christopher J. Dente, Allan Kirk, and Eric Elster. 2017. Precision diagnosis: a view of the clinical decision support systems (CDSS) landscape through the lens of critical care. Journal of Clinical Monitoring and Computing 31, 2 (2017), 261–271.
4. Jordi Bieger, Kristinn R. Thórisson, and B. Steunebrink. 2017. Evaluating understanding. In IJCAI Workshop on Evaluating General-Purpose AI.
5. Alan Cooper, Robert Reimann, David Cronin, and Christopher Noessel. 2014. About Face: The Essentials of Interaction Design. John Wiley & Sons.
6. Sumeet Dua and Xian Du. 2016. Data Mining and Machine Learning in Cybersecurity. CRC Press.
7. Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Int. Conf. on Machine Learning. 1050–1059.
8. David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (2017).
9. Bradley Hayes and Julie A. Shah. 2017. Improving robot controller transparency through autonomous policy explanation. In Proc. of the 2017 ACM/IEEE Int. Conf. on HRI. ACM, 303–312.
10. Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730 (2017).
11. Scott Lundberg and Su-In Lee. 2016. An unexpected unity among methods for interpreting model predictions. arXiv preprint arXiv:1611.07478 (Nov. 2016).
12. Tim Miller. 2017. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269 (2017).
13. Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of inmates running the asylum. In IJCAI-17 Workshop on Explainable AI (XAI). 36.
14. Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 427–436.
15. Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. 2017. Feature visualization. Distill 2, 11 (2017), e7.
16. No-Wook Park, Phaedon C. Kyriakidis, and Suk-Young Hong. 2016. Spatial estimation of classification accuracy using indicator kriging with an image-derived ambiguity index. Remote Sensing 8, 4 (2016), 320.
17. Jasper van der Waa, Jurriaan van Diggelen, and Mark Neerincx. 2018. ICM: An intuitive model independent and accurate certainty measure for machine learning. In Proc. of the Int. Conf. on Agents and Artificial Intelligence.
18. Jurriaan van Diggelen, Hans van den Broek, Jan Maarten Schraagen, and Jasper van der Waa. 2017. An intelligent operator support system for dynamic positioning. In Int. Conf. on Applied Human Factors and Ergonomics. Springer, 48–59.