The Problem of Explanations without User Feedback

Alison Smith, Decisive Analytics Corporation, Arlington, United States, alison.smith@dac.us
James J. Nolan, Decisive Analytics Corporation, Arlington, United States, jim.nolan@dac.us

ABSTRACT
Explanations are necessary for building users' understanding of and trust in machine learning systems. However, users may abandon systems if these explanations reveal consistent errors and the users cannot effect change in the systems' behavior in response. When user feedback is supported, explanations serve not only to promote understanding but also to enable users to help the machine learning system overcome its errors. We suggest an experiment to examine how users react when a system makes explainable mistakes with varied support for user feedback.

Author Keywords
Explanations; user feedback; human-in-the-loop systems; human-machine interfaces

ACM Classification Keywords
H.1.2 User/Machine systems: Human Information Processing

INTRODUCTION
Analysts in domains such as the military, intelligence, finance, and medicine face the ever-growing problem of needing to perform multi-modal analysis of complex data sets. Machine learning techniques show promise for dramatically increasing the speed and effectiveness of analytic workflows for analyzing large amounts of data. However, because an analyst's credibility and reputation may be judged on the basis of automated decisions, analysts hesitate to rely on these techniques if they do not have a full understanding of how the algorithm reached its final decision. To overcome this doubt, and to increase the interpretability and trustworthiness of these systems, it is necessary to provide a transparent way to inspect, interrogate, and understand machine learning results.

While significant work explores this need for more explainable machine learning, whether by making the algorithms themselves more explainable [1,2] or by creating explanation interfaces to explain algorithm output [4], we argue that explanations alone are not always sufficient. Support for user feedback should be treated as an equally important component of an explainable machine learning system, because providing an explanation without a method for feedback may lead to frustrated users and overall system disuse. For example, Kulesza et al. [5] find that users ignore explanations when the benefit of attending to them is unclear or when they are unable to successfully control predictions.

In this position paper, we discuss prior work on explainable systems that do and do not take user feedback into account and outline a study design to examine how users react when a system makes explainable mistakes with varied support for user feedback.

MOTIVATION
Imagine the following example (from a frequently cited explainable machine learning paper [10]): an explainable system, specifically an image classification tool, makes an error, such as incorrectly classifying an image of a husky as a wolf. The user then requests an explanation of why the system produced this incorrect classification. Using attention,¹ the system can explain its mistake by showing the user that the presence of snow in the image led to the wolf misclassification. At this point, the user understands why the system made this initial mistake, but what happens when the system makes the same or a similar mistake again? Here we consider two possible outcomes. One possibility is that the user will become frustrated, or will choose not to use the system, if they know that it errs on certain types of data or problems but they cannot do anything about it. Alternatively, the system may deceive users who believe that it can learn from its mistakes (as a human who admits to making a mistake is expected to), when in fact it will continue to make the same mistake.

¹ In image classification, attention [8] can be used to determine the portion of an image that most affected the system's classification, or the part of the image that the system "attended to" the most when making a classification.
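To make the idea of highlighting the influential region of an image concrete, the following minimal sketch computes an occlusion-based saliency map for a toy classifier. It is an illustration only: the brightness-based stand-in classifier and the synthetic image are our assumptions, not the attention mechanism of [8] or the explanation method used in [10].

import numpy as np

def predict_wolf_proba(image):
    # Toy stand-in classifier: treats overall brightness (a proxy for snow)
    # as evidence for the "wolf" class.
    return float(image.mean())

def occlusion_saliency(image, predict, patch=8):
    # Blank out one patch at a time and record how much the "wolf" score drops.
    # Large drops mark regions the classifier relied on for its prediction.
    base = predict(image)
    h, w = image.shape
    saliency = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0
            saliency[i // patch, j // patch] = base - predict(occluded)
    return saliency

# Synthetic 32x32 grayscale image with bright "snow" across the bottom half.
img = np.zeros((32, 32))
img[16:, :] = 1.0
print(np.round(occlusion_saliency(img, predict_wolf_proba), 3))
# The bottom patches score highest, i.e. the "snow" drove the wolf prediction.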
In fact, Ribeiro et al. [10] find that while 10 of 27 participants trust the model that misclassifies a husky as a wolf when it offers no explanation, only three of 27 participants trust the model when it explains the mistake. Thus, without a way to provide user feedback to improve the system, explaining predictions is more likely to be used as a method for knowing when not to trust the system.

Alternatively, in our prior work [13], we developed a system for intelligence analysts that both provides evidence for its decisions and supports analyst feedback to improve the underlying model. The system automatically clusters entity mentions (people, places, and organizations) from large unstructured corpora into overarching entity clusters.² For example, it clusters entity mentions found throughout a large news corpus, such as Mr. Obama, President Obama, and Barack, into one entity cluster, President Barack Obama. The system provides as evidence the entity mention in context as well as the other entity mentions in the cluster. While this evidence may help the analyst understand why certain mentions were incorrectly placed in a cluster or why other mentions are missing from it, simply understanding the system's mistakes is not sufficient for supporting trust and utilization. To this end, the system supports interactive feedback mechanisms, such as accepting and rejecting mentions as well as merging clusters. While no formal user experiment has been performed with this system, we have received positive feedback from analysts regarding the interactive feedback mechanisms.

² This technique utilizes inter- and intra-document entity coreference, meaning it clusters entity mentions within and across documents.
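As a minimal sketch of what such interactive feedback mechanisms might look like, the example below maintains an entity cluster and records accept, reject, and merge operations. The class and method names are illustrative assumptions and do not reflect the implementation of the system in [13].

from dataclasses import dataclass, field

@dataclass
class EntityCluster:
    label: str
    mentions: set = field(default_factory=set)   # surface forms grouped by the model
    accepted: set = field(default_factory=set)   # mentions the analyst confirmed
    rejected: set = field(default_factory=set)   # mentions the analyst removed

    def accept(self, mention):
        # Analyst confirms (or adds) a mention for this entity.
        self.mentions.add(mention)
        self.accepted.add(mention)
        self.rejected.discard(mention)

    def reject(self, mention):
        # Analyst removes a mention that was clustered incorrectly.
        self.mentions.discard(mention)
        self.rejected.add(mention)

    def merge(self, other):
        # Analyst indicates two clusters refer to the same entity.
        self.mentions |= other.mentions
        self.accepted |= other.accepted
        self.rejected |= other.rejected

obama = EntityCluster("President Barack Obama",
                      {"Mr. Obama", "President Obama", "Michelle"})
obama.reject("Michelle")   # incorrectly clustered mention
obama.accept("Barack")     # mention that was missing from the cluster
obama.merge(EntityCluster("Obama", {"Barack Obama"}))
print(sorted(obama.mentions))
# The recorded accept/reject/merge decisions could then be fed back to the model.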
EXPERIMENT DESIGN
We outline an experiment design to examine how users react to an explainable system with varied support for user feedback. Specifically, we suggest two possible study methods: the first aims to explore the user frustration or confusion that occurs when an explainable system (one that does not update with user feedback) continues to make the same or similar mistakes, and the second compares users' reactions to versions of a system that differ in the amount of control given to the user.

Research Questions
The goal of this proposed experiment is to answer the following research questions:

Q1: Do users assume an explainable system learns from mistakes?
We hope to better understand what users expect when utilizing an explainable system. Whether or not users expect the system to continue making the same or similar mistakes impacts how negatively the users will be affected when it does. Furthermore, we would like to understand whether users' expectations change if we vary how explanations are attended to or the form the explanations take. Shneiderman [11] and Lanier [6] argue that systems (intelligent agents, in particular) should not have human-like characteristics, as these lead users to believe that the system may act rationally or take some responsibility for its actions [3]. We therefore hypothesize that conversational (or apologetic) explanations may be more likely to lead users to think the system will learn from a mistake. Similarly, whether users expect a system to improve may vary based on how they interact with explanations: simply clicking 'ok' to dismiss an explanation differs from interactions such as 'accepting' or 'rejecting' classifications, and the latter may lead users to believe they are correcting the system.

Q2: How is trust of and frustration with an explainable system affected by varied support for user feedback?
Prior work implied, albeit without a formal experiment, that users may trust systems less when they explain their mistakes [10]. Similarly, Lim and Dey [7] find that users' impressions of a system are negatively impacted when the system is highly uncertain of its decisions (even when it behaves appropriately). While supporting user feedback, particularly in cases of system error or high uncertainty, could mitigate these issues, the level of control given to the user may have varied effects on trust and frustration. For example, in prior work we discuss whether user feedback should be taken as a command or a suggestion for different types of interactive systems [12].

Method
To support examination of the identified research questions, we outline the following two-part study methodology.

The first part of the study will be performed as an interview study following a think-aloud methodology, followed by a post-task survey. First, users will be shown an explainable system. When the system errs, it will provide an explanation. We will then ask users whether they believe the system will make the same or similar mistakes and measure their frustration and/or surprise when it does continue to do so. Frustration will be measured at the incident level and the overall level following the methodology described by Bessiere et al. [1]. For this part of the study we will vary what explanations look like and how users attend to explanations, as we hypothesize that these factors affect whether users believe the system will learn from mistakes.

The second part of the study will be performed as a crowdsourced survey. In this case, we will incorporate user feedback into an explainable system. We will vary the system only in how it incorporates user feedback, representing the amount of control the user has over the system. We propose three system variants: one that ignores all user feedback, one that takes feedback into account as a suggestion, and one that takes feedback into account as a command. We will then measure how user trust, frustration, and other user reactions differ between these variants. Frustration will again be measured at the incident and overall levels [1]. The users' impressions of the system, and in particular their trust, will be measured by rating responses to relevant survey questions.
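To make the three variants concrete, the sketch below shows one possible way a user's correction could be applied to a model's prediction under each feedback policy. It is a minimal illustration under our own assumptions; in particular, the additive weighting used for the 'suggestion' variant is not a specification of the planned study systems.

def apply_feedback(model_scores, user_label, policy, suggestion_weight=0.5):
    # Combine the model's class scores with a user correction under one of the
    # three policies. model_scores maps class label -> model confidence.
    scores = dict(model_scores)
    if policy == "ignore":
        pass                                  # feedback has no effect on the output
    elif policy == "suggestion":
        # Nudge the user's label upward, but a sufficiently confident model
        # can still overrule the suggestion.
        scores[user_label] = scores.get(user_label, 0.0) + suggestion_weight
    elif policy == "command":
        scores = {user_label: 1.0}            # the user's label overrides the model
    else:
        raise ValueError("unknown policy: " + policy)
    return max(scores, key=scores.get)

confident = {"wolf": 0.95, "husky": 0.05}
uncertain = {"wolf": 0.60, "husky": 0.40}
for name, scores in (("confident", confident), ("uncertain", uncertain)):
    for policy in ("ignore", "suggestion", "command"):
        print(name, policy, apply_feedback(scores, "husky", policy))
# With the confident model, only "command" flips the label to husky; with the
# uncertain model, the "suggestion" variant flips it as well.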
CONCLUSION
In this position paper, we argue that while explainable systems are important, incorporating user feedback into these systems is equally important for supporting trust and continued use. And this goes both ways: systems that support user feedback must also ensure that users understand how they work, such that they can give appropriate feedback. A truthful explanation of the system's black box improves users' understanding, which better prepares them to provide feedback that improves the system. We propose an experiment to provide additional evidence for this argument.

REFERENCES
1. Katie Bessiere, Irina Ceaparu, Jonathan Lazar, John Robinson, and Ben Shneiderman. 2003. Understanding Computer User Frustration: Measuring and Modeling the Disruption from Poor Designs. Technical Reports from UMIACS. Retrieved from http://drum.lib.umd.edu/handle/1903/1233
2. William Brendel and Sinisa Todorovic. 2011. Learning spatiotemporal graphs of human activities. In Proceedings of the IEEE International Conference on Computer Vision, 778–785. https://doi.org/10.1109/ICCV.2011.6126316
3. K. Höök. 2000. Steps to take before intelligent user interfaces become real. Interacting with Computers 12, 4: 409–426. https://doi.org/10.1016/S0953-5438(99)00006-5
4. Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of Explanatory Debugging to Personalize Interactive Machine Learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces - IUI '15, 126–137. https://doi.org/10.1145/2678025.2701399
5. Todd Kulesza, Simone Stumpf, Margaret Burnett, Sherry Yang, Irwin Kwan, and Weng-Keen Wong. 2013. Too much, too little, or just right? Ways explanations impact end users' mental models. In Proceedings of the IEEE Symposium on Visual Languages and Human-Centric Computing - VL/HCC, 3–10. https://doi.org/10.1109/VLHCC.2013.6645235
6. Jaron Lanier. 1996. My Problems with Agents. Wired.
7. Brian Y. Lim and Anind K. Dey. 2011. Investigating intelligibility for uncertain context-aware applications. In Proceedings of the 13th International Conference on Ubiquitous Computing - UbiComp '11, 415. https://doi.org/10.1145/2030112.2030168
8. Volodymyr Mnih, Nicolas Heess, Alex Graves, and Koray Kavukcuoglu. 2014. Recurrent Models of Visual Attention. In Advances in Neural Information Processing Systems 27, 1–9.
9. Seyoung Park, Xiaohan Nie, and Song-Chun Zhu. 2017. Attribute And-Or Grammar for Joint Parsing of Human Pose, Parts and Attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence. https://doi.org/10.1109/TPAMI.2017.2731842
10. Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD '16. https://doi.org/10.1145/2939672.2939778
11. Ben Shneiderman. 1997. Direct manipulation for comprehensible, predictable and controllable user interfaces. In Proceedings of the 2nd International Conference on Intelligent User Interfaces - IUI '97, 33–39. https://doi.org/10.1145/238218.238281
12. Alison Smith, Varun Kumar, Jordan Boyd-Graber, Kevin Seppi, and Leah Findlater. 2017. Accounting for Input Uncertainty in Human-in-the-loop Systems. In Designing for Uncertainty Workshop at CHI 2017.
13. Kevin Ward and Jack Davenport. 2017. Human-machine interaction to disambiguate entities in unstructured text and structured datasets. In SPIE Conference on Next-Generation Analyst V.