The Problem of Explanations without User Feedback

Alison Smith, Decisive Analytics Corporation, Arlington, United States, alison.smith@dac.us
James J. Nolan, Decisive Analytics Corporation, Arlington, United States, jim.nolan@dac.us

ABSTRACT
Explanations are necessary for building users' understanding of and trust in machine learning systems. However, users may abandon systems if these explanations reveal consistent errors and the users cannot effect change in the systems' behavior in response. When user feedback is supported, explanations serve not only to promote understanding but also to enable users to help the machine learning system overcome its errors. We suggest an experiment to examine how users react when a system makes explainable mistakes with varied support for user feedback.

Author Keywords
Explanations; user feedback; human-in-the-loop systems; human-machine interfaces

ACM Classification Keywords
H.1.2 User/Machine systems: Human Information Processing

INTRODUCTION
Analysts in domains such as the military, intelligence, finance, and medicine face the ever-growing problem of needing to perform multi-modal analysis of complex data sets. Machine learning techniques show promise for dramatically increasing the speed and effectiveness of analytic workflows for analyzing large amounts of data. However, because an analyst's credibility and reputation may be judged on the basis of automated decisions, analysts hesitate to rely on these techniques if they do not have a full understanding of how the algorithm reached its final decision. To overcome this doubt, and to increase the interpretability and trustworthiness of these systems, it is necessary to provide a transparent way to inspect, interrogate, and understand machine learning results.

While significant work explores this need for more explainable machine learning, whether by making the algorithms themselves more explainable [1,2] or by creating explanation interfaces to explain algorithm output [4], we argue that explanations alone are not always sufficient. Support for user feedback should be treated as an equally important component of an explainable machine learning system, because providing an explanation without a method for feedback may lead to frustrated users and overall system disuse. For example, Kulesza et al. [5] find that users ignore explanations when the benefit of attending to them is unclear or when they are unable to successfully control predictions.

In this position paper, we discuss prior work on explainable systems that do and do not take user feedback into account and outline a study design to examine how users react when a system makes explainable mistakes with varied support for user feedback.

MOTIVATION
Imagine the following example (from a frequently cited explainable machine learning paper [10]): an explainable system, specifically an image classification tool, makes an error, such as incorrectly classifying an image of a husky as a wolf. The user then requests an explanation of why the system produced this incorrect classification. Using attention,¹ the system can explain its mistake by showing the user that the presence of snow in the image led to the wolf misclassification. At this point, the user understands why the system made this initial mistake, but what happens when the system makes the same or a similar mistake again? Here we consider two possible outcomes. One possibility is that the user will become frustrated, or will choose not to use the system, if they know that it errs on certain types of data or problems but they cannot do anything about it. Alternatively, the system may deceive users who believe that it can learn from its mistakes (as a human who admits to making a mistake is expected to), when in fact it will continue to make the same mistake.

¹ In image classification, attention [8] can be used to determine the portion of an image that most affected the system's classification, or the part of the image that the system "attended to" the most when making a classification.
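To make the idea of highlighting the influential region of an image concrete, the following minimal sketch computes an occlusion-based saliency map for a toy classifier. It is an illustration only: the brightness-based stand-in classifier and the synthetic image are our assumptions, not the attention mechanism of [8] or the explanation method used in [10].

import numpy as np

def predict_wolf_proba(image):
    # Toy stand-in classifier: treats overall brightness (a proxy for snow)
    # as evidence for the "wolf" class.
    return float(image.mean())

def occlusion_saliency(image, predict, patch=8):
    # Blank out one patch at a time and record how much the "wolf" score drops.
    # Large drops mark regions the classifier relied on for its prediction.
    base = predict(image)
    h, w = image.shape
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            saliency[i // patch, j // patch] = base - predict(occluded)
    return saliency

# Synthetic 32x32 grayscale image with bright "snow" across the bottom half.
img = np.zeros((32, 32))
img[16:, :] = 1.0
print(np.round(occlusion_saliency(img, predict_wolf_proba), 3))
# The bottom patches score highest, i.e. the "snow" drove the wolf prediction.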
In fact, Ribeiro et al. [10] find that while 10 of 27 participants trust the model that misclassifies a husky as a wolf when it offers no explanation, only three of 27 participants trust the model when it explains the mistake. Thus, without a way to provide user feedback to improve the system, explaining predictions is more likely to be used as a method for knowing when not to trust the system.

Alternatively, in our prior work [13], we developed a system for intelligence analysts that both provides evidence for its decisions and supports analyst feedback to improve the underlying model. The system automatically clusters entity mentions (people, places, and organizations) from large unstructured corpora into overarching entity clusters.² For example, it clusters entity mentions found throughout a large news corpus, such as Mr. Obama, President Obama, and Barack, into one entity cluster, President Barack Obama. The system provides as evidence the entity mention in context as well as the other entity mentions in the cluster. While this evidence may help the analyst understand why certain mentions were incorrectly placed in a cluster or why other mentions are missing from it, simply understanding the system's mistakes is not sufficient for supporting trust and utilization. To this end, the system supports interactive feedback mechanisms, such as accepting and rejecting mentions as well as merging clusters. While no formal user experiment has been performed with this system, we have received positive feedback from analysts regarding the interactive feedback mechanisms.

² This technique utilizes inter- and intra-document entity coreference, meaning it clusters entity mentions within and across documents.
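As a minimal sketch of what such interactive feedback mechanisms might look like, the example below maintains an entity cluster and records accept, reject, and merge operations. The class and method names are illustrative assumptions and do not reflect the implementation of the system in [13].

from dataclasses import dataclass, field

@dataclass
class EntityCluster:
    label: str
    mentions: set = field(default_factory=set)   # surface forms grouped by the model
    accepted: set = field(default_factory=set)   # mentions the analyst confirmed
    rejected: set = field(default_factory=set)   # mentions the analyst removed

    def accept(self, mention):
        # Analyst confirms (or adds) a mention for this entity.
        self.mentions.add(mention)
        self.accepted.add(mention)
        self.rejected.discard(mention)

    def reject(self, mention):
        # Analyst removes a mention that was clustered incorrectly.
        self.mentions.discard(mention)
        self.rejected.add(mention)

    def merge(self, other):
        # Analyst indicates two clusters refer to the same entity.
        self.mentions |= other.mentions
        self.accepted |= other.accepted
        self.rejected |= other.rejected

obama = EntityCluster("President Barack Obama",
                      {"Mr. Obama", "President Obama", "Michelle"})
obama.reject("Michelle")   # incorrectly clustered mention
obama.accept("Barack")     # mention that was missing from the cluster
obama.merge(EntityCluster("Obama", {"Barack Obama"}))
print(sorted(obama.mentions))
# The recorded accept/reject/merge decisions could then be fed back to the model.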
EXPERIMENT DESIGN
We outline an experiment design to examine how users react to an explainable system with varied support for user feedback. Specifically, we suggest two possible study methods: the first aims to explore the user frustration or confusion that occurs when an explainable system (one that does not update with user feedback) continues to make the same or similar mistakes, and the second compares users' reactions to versions of a system that differ in the amount of control given to the user.

Research Questions
The goal of this proposed experiment is to answer the following research questions:

Q1: Do users assume an explainable system learns from mistakes?
We hope to better understand what users expect when utilizing an explainable system. Whether or not users expect the system to continue making the same or similar mistakes impacts how negatively the users will be affected when it does. Furthermore, we would like to understand whether users' expectations change if we vary how explanations are attended to or the form the explanations take. Shneiderman [11] and Lanier [6] argue that systems (intelligent agents, in particular) should not have human-like characteristics, as these lead users to believe that the system may act rationally or take some responsibility for its actions [3]. We therefore hypothesize that conversational (or apologetic) explanations may be more likely to lead users to think the system will learn from a mistake. Similarly, whether users expect a system to improve may vary based on how they interact with explanations: simply clicking 'ok' to dismiss an explanation differs from interactions such as 'accepting' or 'rejecting' classifications, and the latter may lead users to believe they are correcting the system.

Q2: How is trust of and frustration with an explainable system affected by varied support for user feedback?
Prior work implied, albeit without a formal experiment, that users may trust systems less when they explain their mistakes [10]. Similarly, Lim and Dey [7] find that users' impressions of a system are negatively impacted when the system is highly uncertain of its decisions (even when it behaves appropriately). While supporting user feedback, particularly in cases of system error or high uncertainty, could mitigate these issues, the level of control given to the user may have varied effects on trust and frustration. For example, in prior work we discuss whether user feedback should be taken as a command or a suggestion for different types of interactive systems [12].

Method
To support examination of the identified research questions, we outline the following two-part study methodology.

The first part of the study will be performed as an interview study following a think-aloud methodology, followed by a post-task survey. First, users will be shown an explainable system. When the system errs, it will provide an explanation. We will then ask users whether they believe the system will make the same or similar mistakes and measure their frustration and/or surprise when it does continue to do so. Frustration will be measured at the incident level and the overall level following the methodology described by Bessiere et al. [1]. For this part of the study we will vary what explanations look like and how users attend to explanations, as we hypothesize that these factors affect whether users believe the system will learn from mistakes.

The second part of the study will be performed as a crowdsourced survey. In this case, we will incorporate user feedback into an explainable system. We will vary the system only in how it incorporates user feedback, representing the amount of control the user has over the system. We propose three system variants: one that ignores all user feedback, one that takes feedback into account as a suggestion, and one that takes feedback into account as a command. We will then measure how user trust, frustration, and other user reactions differ between these variants. Frustration will again be measured at the incident and overall levels [1]. The users' impressions of the system, and in particular their trust, will be measured by rating responses to relevant survey questions.
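To make the three variants concrete, the sketch below shows one possible way a user's correction could be applied to a model's prediction under each feedback policy. It is a minimal illustration under our own assumptions; in particular, the additive weighting used for the 'suggestion' variant is not a specification of the planned study systems.

def apply_feedback(model_scores, user_label, policy, suggestion_weight=0.5):
    # Combine the model's class scores with a user correction under one of the
    # three policies. model_scores maps class label -> model confidence.
    scores = dict(model_scores)
    if policy == "ignore":
        pass                                  # feedback has no effect on the output
    elif policy == "suggestion":
        # Nudge the user's label upward, but a sufficiently confident model
        # can still overrule the suggestion.
        scores[user_label] = scores.get(user_label, 0.0) + suggestion_weight
    elif policy == "command":
        scores = {user_label: 1.0}            # the user's label overrides the model
    else:
        raise ValueError("unknown policy: " + policy)
    return max(scores, key=scores.get)

confident = {"wolf": 0.95, "husky": 0.05}
uncertain = {"wolf": 0.60, "husky": 0.40}
for name, scores in (("confident", confident), ("uncertain", uncertain)):
    for policy in ("ignore", "suggestion", "command"):
        print(name, policy, apply_feedback(scores, "husky", policy))
# With the confident model, only "command" flips the label to husky; with the
# uncertain model, the "suggestion" variant flips it as well.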
CONCLUSION
In this position paper, we argue that while explainable systems are important, incorporating user feedback into these systems is equally important for supporting trust and continued use. And this goes both ways: systems that support user feedback must also ensure that users understand how they work, such that they can give appropriate feedback. A truthful explanation of the system's black box improves users' understanding, which better prepares them to provide feedback that improves the system. We propose an experiment to provide additional evidence for this argument.

REFERENCES
1. Katie Bessiere, Irina Ceaparu, Jonathan Lazar, John Robinson, and Ben Shneiderman. 2003. Understanding Computer User Frustration: Measuring and Modeling the Disruption from Poor Designs. Technical Reports from UMIACS. Retrieved from http://drum.lib.umd.edu/handle/1903/1233
2. William Brendel and Sinisa Todorovic. 2011. Learning spatiotemporal graphs of human activities. In Proceedings of the IEEE International Conference on Computer Vision, 778–785. https://doi.org/10.1109/ICCV.2011.6126316
3. K. Höök. 2000. Steps to take before intelligent user interfaces become real. Interacting with Computers 12, 4: 409–426. https://doi.org/10.1016/S0953-5438(99)00006-5
4. Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI '15, 126–137. https://doi.org/10.1145/2678025.2701399
5. Todd Kulesza, Simone Stumpf, Margaret Burnett, Sherry Yang, Irwin Kwan, and Weng-Keen Wong. 2013. Too much, too little, or just right? Ways explanations impact end users' mental models. In Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing - VL/HCC, 3–10. https://doi.org/10.1109/VLHCC.2013.6645235
6. Jaron Lanier. 1996. My Problems with Agents. Wired.
7. Brian Y. Lim and Anind K. Dey. 2011. Investigating intelligibility for uncertain context-aware applications. In Proceedings of the 13th International Conference on Ubiquitous Computing - UbiComp '11, 415. https://doi.org/10.1145/2030112.2030168
8. Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, 1–9.
9. Seyoung Park, Xiaohan Nie, and Song-Chun Zhu. 2017. Attribute And-Or Grammar for Joint Parsing of Human Pose, Parts and Attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2017.2731842
10. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. https://doi.org/10.1145/2939672.2939778
11. Ben Shneiderman. 1997. Direct manipulation for comprehensible, predictable and controllable user interfaces. In Proceedings of the 2nd International Conference on Intelligent User Interfaces - IUI '97, 33–39. https://doi.org/10.1145/238218.238281
12. Alison Smith, Varun Kumar, Jordan Boyd-Graber, Kevin Seppi, and Leah Findlater. 2017. Accounting for Input Uncertainty in Human-in-the-loop Systems. In Designing for Uncertainty Workshop at CHI 2017.
13. Kevin Ward and Jack Davenport. 2017. Human-machine interaction to disambiguate entities in unstructured text and structured datasets. In SPIE Conference on Next-Generation Analyst V.