The design and validation of an intuitive confidence measure

Jasper van der Waa, Jurriaan van Diggelen, Mark Neerincx
TNO, Soesterberg, the Netherlands
jasper.vanderwaa@tno.nl, jurriaan.vandiggelen@tno.nl, mark.neerincx@tno.nl

ABSTRACT
Explainable AI becomes increasingly important as the use of intelligent systems becomes more widespread in high-risk domains. In these domains it is important that the user knows to which degree the system's decisions can be trusted. To facilitate this, we present the Intuitive Confidence Measure (ICM): a lazy learning meta-model that can predict how likely a given decision is to be correct. ICM is intended to be easy to understand, which we validated in an experiment. We compared ICM with two other ways of computing confidence measures: the numerical output of the model and an actively learned meta-model. The validation was performed using a smart assistant for maritime professionals. Results show that ICM is easier to understand, but that each user has unique wishes for explanations. This user study with domain experts shows what users need in their explanations and that personalization is crucial.

ACM Classification Keywords
I.2.M Artificial Intelligence: Miscellaneous; I.2.1 Artificial Intelligence: Applications and Expert Systems

Author Keywords
Explainability, Machine Learning, lazy learning, instance based, ICM, experiment, user, validation, confidence, measure, certainty

INTRODUCTION
The number of intelligent systems is increasing rapidly due to recent developments in Artificial Intelligence (AI) and Machine Learning (ML). The applications of intelligent systems are spreading to high-risk domains, for example medical diagnosis [3], maritime automation [18] and cybersecurity [6]. The need for transparency and explanations towards end users is becoming a necessity [8, 4]. This self-explaining capability allows intelligent systems to become more effective tools whose users can establish an appropriate level of trust. The field of Explainable Artificial Intelligence (XAI) aims to develop and validate methods for this capacity.

The process of explaining something involves a minimum of two actors: the explainer and the explainee [12]. A large number of studies in XAI focus on the system as the explainer and on how it can generate explanations, for example methods that identify feature importance [11, 15], extract a confidence measure [7], search for an informative prototypical feature set [10] or explain action policies in reinforcement learning [9]. Although these are effective approaches to generate explanations, they do not validate their methods with the explainee. A working XAI method needs to incorporate the user's wishes, context and requirements [5, 13, 1]. As XAI tries to make ML models more transparent, a requirement for XAI methods is to be transparent themselves, so that the user can understand where the explanation comes from.

The proposed Intuitive Confidence Measure (ICM) is a case-based machine learning model that can predict how likely a given model output is to be correct in (semi-)supervised learning tasks. ICM is a meta-model that is stacked on top of, and independent of, its underlying ML model. The intuitive idea behind ICM is that it uses past situations, and the correct or incorrect outputs given in those situations, to compute the probability that a given output in some situation is correct. A high confidence is given when the current situation and output are similar to situations in which that output proved to be correct. Since ICM is a case-based or lazy-learning algorithm, each outputted confidence can be traced back to items in a data set or memory [2]. For example, the confidence in some output is low because this output is similar to past outputs that proved to be incorrect in very similar situations. This is opposed to a confidence measure that uses active learning, where a (possibly large) set of parameters describes learned knowledge that is difficult to explain or understand [8].

Other approaches to estimate a confidence value exist. Several machine learning model types can already provide a probabilistic output, such as neural networks with soft-max output layers. However, these confidence estimations can be inaccurate, as such models can learn to be very confident in an incorrect output as a trade-off for general improvement on the overall dataset [14]. Other approaches may not be model agnostic, for example the use of dropout in neural networks [7].

To test whether ICM is indeed easy to understand, we performed an experiment in which we compared ICM, a lazily learned meta-model, to two other types of certainty or confidence measures: the numerical output of the underlying model itself and an actively learned meta-model approach. We claim that ICM is preferable to these two types because 1) the numerical output of the underlying model is not always available, transparent or accurate [14] and 2) an actively learned meta-model has no clear connection between its outputted confidence and the data it used [2]. ICM, on the other hand, is a meta-model and as such independent of the workings of the underlying model except for its outputs, and due to lazy learning ICM's confidence values are directly related to its training set.

The experiment was performed within a maritime use case for computer-controlled propulsion; we refer to our earlier paper for a detailed description [18]. Participants had no knowledge about ML and worked in a high-risk maritime domain with extreme responsibilities. In our experiment we simulated the operator's working environment and presented the participant with classification outputs accompanied by a confidence value. Afterwards we interviewed the operators about their experiences and presented them with the three measure types identified earlier: 1) a numerical model output, 2) an actively learned meta-model and 3) our method, a lazily learned meta-model. We tested the participants' understanding of each of the measures to validate whether ICM, and lazily learned measures in general, are indeed easier to understand and as such preferred over numerical model outputs and actively learned meta-models.

The experiment showed that ICM is indeed easier to understand, but each operator had different wishes about when, and even whether, a confidence value should be presented, and all overestimated their own understanding of complex ML methods. XAI experiments with expert users such as these offer valuable insights into what kind of explanations are required and when.
INTUITIVE CONFIDENCE MEASURE
ICM computes the probability that a given output is correct. It does this by comparing that output with the ground truths of a set of known past data points, weighted by the similarity of the current data point to those past data points. We visualized this in Figure 1 for a simple example in which the Euclidean distance can be used as the similarity measure. The figure illustrates the intuitive idea that when a situation and output are similar to past situations in which different outputs proved to be correct, confidence will be low. The more similar situations there are with a different, correct output, the lower the confidence. If there are no similar situations, the confidence will be unknown or uniform, depending on the choice of presentation to the user. In the following paragraphs we only explain the vital technical details of ICM; we refer to earlier work for a more technical description and a discussion of its advantages and disadvantages [17].

Figure 1: Three examples of how ICM works in a 2D binary classification task (classes A and B), given a current data point with its output (square) and a set of known data points (circles) with their known ground truths.

ICM is based on the following equation, with x an arbitrary data point, M an arbitrary data set, d the used similarity function, σ the standard deviation used for the exponential weighting and M(T = A(x)) the selection of all data points in M with the same ground truth T as the output of model A for x:

C(x \mid \sigma, M) = \frac{\sum_{x_i \in M(T = A(x))} \exp\left(-\frac{d(x|x_i)^2}{2\sigma^2}\right)}{\sum_{x_i \in M} \exp\left(-\frac{d(x|x_i)^2}{2\sigma^2}\right)} \qquad (1)

The memory or data set M is sampled sequentially, from the training set or during actual usage of the system, according to three aspects. This strategy prefers 1) data points with a ground truth that is least common in the memory, 2) data points that lie some time apart, to mitigate temporal dependencies, and 3) data points that are relatively dissimilar to the data points already inside the memory. We refer to the original ICM paper for a detailed description [17]. The memory is restricted to a fixed size, k, to prevent extreme computational costs: the number of required computations grows with each added data point, and storing all data would quickly become infeasible in real-world cases where the model A and ICM may run for an indefinite time.

ICM has several properties in common with other lazy learning techniques such as k-Nearest Neighbours (k-NN). In particular, ICM is very similar to the weighted k-NN algorithm with an exponential weighting scheme in which the normalization guarantees that all weights sum to one. ICM becomes an instance of weighted k-NN for non-linear regression with the model's ground truth as the dependent variable, the memory M to mitigate computation cost and an arbitrary distance function d.
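To make Equation 1 concrete, the sketch below shows one possible implementation of the confidence computation over a fixed-size memory. It is a minimal illustration, not the authors' implementation: the class name IntuitiveConfidenceMeasure, the methods observe() and confidence(), the default distance function, the eviction heuristic and all default parameter values are assumptions made for this example. Only the weighted ratio in confidence() and the fixed memory size k follow the description above; the paper's memory sampling strategy is only loosely approximated.

```python
from collections import Counter
import numpy as np


class IntuitiveConfidenceMeasure:
    """Case-based confidence over a fixed-size memory of past cases (Eq. 1).

    Each stored case is a pair (x_i, ground_truth). The confidence of a new
    prediction is the kernel-weighted fraction of memory items whose ground
    truth equals that prediction.
    """

    def __init__(self, sigma=1.0, max_size=500, distance=None):
        self.sigma = sigma                 # bandwidth of the exponential weighting
        self.max_size = max_size           # fixed memory size k
        self.distance = distance or (lambda a, b: float(np.linalg.norm(a - b)))
        self.memory = []                   # list of (x_i, ground_truth) cases

    def confidence(self, x, prediction):
        """Eq. 1: weighted share of memory cases whose ground truth matches
        the model output `prediction` for the current data point `x`."""
        if not self.memory:
            return None                    # no known cases: confidence unknown
        weights = np.array([
            np.exp(-self.distance(x, xi) ** 2 / (2.0 * self.sigma ** 2))
            for xi, _ in self.memory
        ])
        agrees = np.array([truth == prediction for _, truth in self.memory], dtype=float)
        total = weights.sum()
        return float((weights * agrees).sum() / total) if total > 0 else None

    def observe(self, x, ground_truth):
        """Add a case to the memory while keeping it at the fixed size k.

        The paper's sampling strategy favours 1) rare ground truths,
        2) temporal spacing and 3) dissimilarity to stored cases; this sketch
        only roughly reflects the first criterion and ignores the others.
        """
        if len(self.memory) >= self.max_size:
            counts = Counter(truth for _, truth in self.memory)
            most_common = counts.most_common(1)[0][0]
            # Evict the oldest stored case of the most common ground truth.
            evict = next(i for i, (_, t) in enumerate(self.memory) if t == most_common)
            del self.memory[evict]
        self.memory.append((x, ground_truth))


# Toy usage mirroring the 2D binary task of Figure 1.
if __name__ == "__main__":
    icm = IntuitiveConfidenceMeasure(sigma=0.5, max_size=100)
    icm.observe(np.array([0.0, 0.0]), "A")
    icm.observe(np.array([0.2, 0.1]), "A")
    icm.observe(np.array([1.0, 1.0]), "B")
    print(icm.confidence(np.array([0.1, 0.0]), "A"))  # close to 1: similar cases were class A
    print(icm.confidence(np.array([0.1, 0.0]), "B"))  # close to 0: B was only correct far away
```

In this toy setting the query point lies close to the stored class-A cases, so output A receives a confidence near 1 and output B a confidence near 0, matching the intuition sketched in Figure 1; every value can be traced back to the stored cases and their kernel weights.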
EXPERIMENT
In a small experiment we compared end-users' understanding of three instances of different types of confidence measures: 1) ICM as a lazily learned meta-model, 2) the approach by Park et al. [16] as an actively learned meta-model and 3) the soft-max output as a numerical output of the actual model. The experiment was based on a recent study with a virtual smart assistant that supports an operator on a ship with situation predictions to aid in his/her monitoring task [18]. We simulated the operator's work environment and the virtual smart assistant, and provided realistic scenarios and responses from the assistant, including a confidence value for any prediction made. This simulation was used to introduce the participants to the assistant and the numerical confidence values it could provide.

The simulated work environment was followed by an interview during which the participants received increasingly more information about the three confidence measures. The goal of the interview was to test how well and how quickly the participant understood each of the three confidence measures. The interview went through several stages:

1. First stage
   (a) Brief textual explanations of each measure and an opportunity for the participant to rate his/her understanding of the measure.
   (b) For each measure, a moment for the participant to ask questions, allowing the supervisor to rate how well the participant understands the measure.
   (c) An explanation of each measure by the participant in their own words, to be rated by an ML expert.

2. Second stage
   (a) Three concrete examples, both visual and textual, for each measure to illustrate its mechanisms, where the participant could rate his/her level of understanding for each set of examples.*
   (b) Per set of examples, a moment to ask questions, allowing the supervisor to rate how well the participant understands the examples.
   (c) An explanation of each example by the participant in their own words, to be rated by an ML expert.
   (d) The participant's final preference for one of the three confidence measures and an explanation of a given confidence. An ML expert checked whether this explanation overlaps with one of the three measures.

* The textual descriptions and examples can be requested by e-mail.

Results
The results of the five participants are shown in Figure 2. All were experts and potential end-users in the maritime use case. The two users who saw no use for a confidence measure believed that predictions should always be correct or otherwise not be presented at all. All participants believed that they had a basic to advanced comprehension of each measure and its set of examples; however, the experiment supervisor and the ML expert disagreed with this for both the 'numerical' and the 'active learning meta-model' measures.

Figure 2: This table shows three sets of ratings (min. 1, max. 4): 1) the participant's own belief of understanding (row 'own'), 2) the supervisor's rating and 3) the ML expert's rating of how well the explanations given by the participant match the measures and examples. It also shows whether the participant found a confidence measure useful, their preferred measure and the best match with their explanation of a confidence value.

Both the supervisor and the ML expert concluded that most participants had some degree of understanding of ICM. Only one participant was not able to comprehend the textual explanation, but the understanding of ICM was on average rated higher than that of the 'numerical' and 'active learning meta-model' measures by both the supervisor and the ML expert.

The explanations about the numerical output were lacking because participants had trouble comprehending that a model could learn knowledge and represent it in parameters. They had less difficulty with ICM because its outputs relate clearly to past situations. The explanations regarding the 'meta-model' measure were the most inaccurate: nearly all participants tended to see this measure as a combination of ICM and a probabilistic output. This was also the reason why three out of five participants tended to prefer this measure in the end, even though their own explanations of the confidence values resembled the approach used by ICM.

CONCLUSION
In the introduction we stated that XAI methods should not only be developed but also validated in experiments. We mentioned that XAI methods should be transparent themselves, such that the user can understand where an explanation comes from and why it is given.

The Intuitive Confidence Measure (ICM) was developed as a method to provide a confidence value alongside a machine learning model's output. It uses lazy learning and intuitive ideas to keep the method relatively simple, with clear connections between input and output. We performed a limited usability study with qualitative interviews. These interviews indicated that ICM is relatively simple to explain compared to other confidence measures based on model output (e.g. values from a softmax function) or on meta-models based on active learning.

The results showed that, in a group of similar end-users, there were mixed opinions about the necessity of a confidence measure and how it should be presented. However, most participants thought of ICM as an easy-to-understand measure and could recall the workings of ICM accurately. Most of the participants were even able to identify advantages and disadvantages of ICM in specific situations, showing a deeper understanding. Future work will focus on a larger study to test the intuitiveness of ICM, on technical improvements to ICM to mitigate its disadvantages and on ways to generate confidence explanations.

The development of new XAI methods for high-risk domains is important, but their validation in experiments with domain experts is equally important. Experiments and usability studies with domain experts, like the one presented in this paper, can help shape the field of XAI.

ACKNOWLEDGEMENTS
This study was performed as part of the Early Research Program Adaptive Maritime Automation (ERP AMA) within TNO, an independent research organisation in the Netherlands.
REFERENCES
1. Saleema Amershi, Maya Cakmak, William Bradley Knox, and Todd Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
2. Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. 1997. Locally weighted learning for control. In Lazy Learning. Springer, 75–113.
3. Arnaud Belard, Timothy Buchman, Jonathan Forsberg, Benjamin K. Potter, Christopher J. Dente, Allan Kirk, and Eric Elster. 2017. Precision diagnosis: a view of the clinical decision support systems (CDSS) landscape through the lens of critical care. Journal of Clinical Monitoring and Computing 31, 2 (2017), 261–271.
4. Jordi Bieger, Kristinn R. Thórisson, and B. Steunebrink. 2017. Evaluating understanding. In IJCAI Workshop on Evaluating General-Purpose AI.
5. Alan Cooper, Robert Reimann, David Cronin, and Christopher Noessel. 2014. About Face: The Essentials of Interaction Design. John Wiley & Sons.
6. Sumeet Dua and Xian Du. 2016. Data Mining and Machine Learning in Cybersecurity. CRC Press.
7. Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Int. Conf. on Machine Learning. 1050–1059.
8. David Gunning. 2017. Explainable artificial intelligence (XAI). Defense Advanced Research Projects Agency (2017).
9. Bradley Hayes and Julie A. Shah. 2017. Improving robot controller transparency through autonomous policy explanation. In Proc. of the 2017 ACM/IEEE Int. Conf. on HRI. ACM, 303–312.
10. Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730 (2017).
11. Scott Lundberg and Su-In Lee. 2016. An unexpected unity among methods for interpreting model predictions. arXiv preprint arXiv:1611.07478 (Nov. 2016).
12. Tim Miller. 2017. Explanation in artificial intelligence: Insights from the social sciences. arXiv preprint arXiv:1706.07269 (2017).
13. Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of inmates running the asylum. In IJCAI-17 Workshop on Explainable AI (XAI). 36.
14. Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition. 427–436.
15. Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. 2017. Feature visualization. Distill 2, 11 (2017), e7.
16. No-Wook Park, Phaedon C. Kyriakidis, and Suk-Young Hong. 2016. Spatial estimation of classification accuracy using indicator kriging with an image-derived ambiguity index. Remote Sensing 8, 4 (2016), 320.
17. Jasper van der Waa, Jurriaan van Diggelen, and Mark Neerincx. 2018. ICM: An intuitive model independent and accurate certainty measure for machine learning. In Proc. of the Int. Conf. on Agents and Artificial Intelligence.
18. Jurriaan van Diggelen, Hans van den Broek, Jan Maarten Schraagen, and Jasper van der Waa. 2017. An intelligent operator support system for dynamic positioning. In Int. Conf. on Applied Human Factors and Ergonomics. Springer, 48–59.