Position: We Can Measure XAI Explanations Better with Templates

Jonathan Dodge (dodgej@eecs.oregonstate.edu) and Margaret Burnett (burnett@eecs.oregonstate.edu)
Oregon State University

ABSTRACT
This paper argues that the Explainable AI (XAI) research community needs to think harder about how to compare, measure, and describe the quality of XAI explanations. We conclude that one (or a few) explanations can be reasonably assessed with methods of the "Explanation Satisfaction" type, but that scaling up our ability to evaluate explanations requires more development of "Explanation Goodness" methods.

CCS CONCEPTS
• Human-centered computing → User studies.

KEYWORDS
Explainable AI, Evaluating XAI Explanations, Empirical Studies, Heuristic Evaluations

ACM Reference Format:
Jonathan Dodge and Margaret Burnett. 2020. Position: We Can Measure XAI Explanations Better with Templates. In Proceedings of the IUI workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC'20). Cagliari, Italy, 5 pages.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 INTRODUCTION
As AI systems play an ever-increasing role in our lives, society needs a variety of tools to inspect them. Explanations have emerged to fill that role, but measuring their quality continues to prove challenging. Hoffman et al. [10] offer two terms to describe mechanisms for measuring the quality of an explanation, quoted at length here, with highlighting added to assist this paper's discussion.

"Explanation Goodness: Looking across the scholastic and research literatures on explanation, we find assertions about what makes for a good explanation, from the standpoint of statements as explanations. There is a general consensus on this; factors such as clarity and precision. Thus, one can look at a given explanation and make an a priori (or decontextualized) judgment as to whether or not it is "good." ... In a proper experiment, the researchers who complete the checklist [evaluation of the explanation] with reference to some particular AI-generated explanation, would not be the ones who created the XAI system under study."

"Explanation Satisfaction: While an explanation might be deemed good in the manner described above, it may at the same time not be adequate or satisfying to users-in-context. Explanation Satisfaction is defined as the degree to which users feel that they understand the AI system or process being explained to them. Compared to goodness, satisfaction is a contextualized, a posteriori judgment of explanations."

In this paper, we will use ExpG and ExpS to refer to Hoffman's concepts of Explanation Goodness and Satisfaction, respectively. Now, consider the three highlighted properties in each definition:

• Contextualization: ExpS is defined relative to a task, while ExpG is not.
• Actor: ExpS is measured from the perspective of a user performing a task, while ExpG is measured from the perspective of researchers (ideally dispassionate bystanders, but often the designers themselves).
• Timing: Because ExpS is defined relative to a task, it must be measured after the task is completed, while ExpG can be measured anytime.

The main thesis of this paper is that we, as a research community, need to think harder about how to compare, measure, and describe the ExpG of explanation templates (clarified in Section 5), as a potentially strong complement to ExpS.

To develop the main thesis, we argue several points:

• Background: Most current research has focused on ExpS.
• Tasks: ExpS is easier to operationalize, but incurs a great deal of experimental noise.
• Benefits: ExpS's usefulness is hampered by participants' limited exposure to the system.
• Scope: ExpG affords the opportunity to consider a wider range of behaviors, making ExpG mechanisms particularly well suited to reasoning about explanation templates.

Through these points, we hope to provoke thought about how explanation designers can better validate their design decisions via ExpG for explanation templates. This is of particular importance because a great many design decisions are never evaluated via ExpS mechanisms.
2 BACKGROUND: MOST CURRENT RESEARCH HAS FOCUSED ON EXPS
How have past researchers evaluated XAI design decisions? Most have used ExpS mechanisms, but a few have used ExpG mechanisms. For both types, an important criterion is rigor. The dangers of departing from rigorous processes for validating design decisions can be severe if the process devolves into "I methodology" [24], i.e., designers relying solely on their own views and assumptions about what their users will need and how they will use the functionalities the designers decide to provide.

2.1 Research that uses ExpS mechanisms
Through extensive literature review, Hoffman et al. [10] identified a group of existing ExpS methods for mental model elicitation (their Table 4). Many of them are essentially qualitative and focus on things people say (e.g., Think-Aloud or interview techniques). We felt the "Retrospection Task" [18] and the "Prediction Task" [20] looked to be the most well suited for quantitative study, and chose to use them for Anderson et al.'s empirical studies [4]. Approaching the problem from another angle, Dodge et al. [8] investigated several aspects of perceptions of fairness and explanations in a decision support setting.

[Figure 1: (Source: Anderson et al. [4]) Percentage of participants in four explanation treatments (4 colors) correctly predicting the AI's action at 14 decision points. This image shows: 1. No treatment was a clear winner. 2. Some decisions were easy enough that all participants predicted correctly—even those without explanations. 3. Conversely, others were hard enough that few participants predicted correctly, even with explanations. 4. There is no evident learning effect (decision points are shown sequentially over time).]

Other researchers have used ExpS to understand a wide variety of effects in explanation. Providing explanation has been shown to improve mental models [15, 16]. Of particular importance to moderating the effects of explanation are the explanation's soundness and completeness [17], most easily described with the phrase "the whole truth (completeness) and nothing but the truth (soundness)" about how the system is really working. Note that neither soundness nor completeness is a binary property; each lies on a smooth continuum, and 100% soundness or completeness is not always achievable. Explanation has also been shown to increase satisfaction (here we mean in the colloquial sense, the user's self-reported feeling) [2, 12], and understanding—particularly in low-expertise observers [27]. Several different kinds of explanation have also been shown to improve user acceptance via setting appropriate expectations (e.g., by showing an accuracy gauge) [14]. There are many other researchers studying explanations using ExpS mechanisms, and we refer the reader to Abdul et al. [1] for a recent literature review.

2.2 Research that uses ExpG mechanisms
One XAI tool for ExpG is the checklist proposed in Hoffman et al. [10]'s Appendix A, composed of 8 yes/no questions that researchers and designers ask themselves (e.g., "The explanation of the [software, algorithm, tool] is sufficiently detailed."). Other approaches include Amershi et al. [3]'s guidelines for interactive AI, Kulesza et al. [15]'s design principles for explanatory debugging, and Wang et al. [25]'s guidelines to match human reasoning processes with XAI techniques. However, most XAI research is not explicit about its usage of these or other ExpG mechanisms.
3 TASKS: EXPS IS "EASY" TO OPERATIONALIZE, BUT NOISY
ExpS sings a siren song of sorts; it appears simple to evaluate: one must simply define a task and criteria to measure performance at that task. Easy enough, right? Wrong.

We have been using one of the XAI tasks which we felt would be most well suited for quantitative study, the "Prediction Task" proposed by Muramatsu et al. [20]. However, we have run into a number of challenges using it in our XAI studies, some of which are apparent in Figure 1, taken from Anderson et al. [4].

First, participants' ability to perform the task (predict an AI's next actions) in a domain is moderated by a number of other things, such as their need for cognition, interest in the task, domain experience, etc. To illustrate the effect of variability in explanation consumers, imagine a scientist effectively describing quantum computing to another scientist. Then imagine the same person giving the same explanation to a child.¹ The explanation itself could be high quality; it just was not appropriate for that audience and needed reformulation. Thus, empirically measuring the explanation's quality is entangled with many factors beyond the explanation itself.

¹ https://www.youtube.com/watch?v=OWJCfOvochA conducts a similar exercise, though Dr. Gershon changes the explanation for 5 different audiences.

Second, there is a great deal of variability in the state/action space (as we observed in [4]). This leads some choices to be easy, causing all treatments to have nearly 100% participant prediction accuracy—even those without explanation (e.g., the 5th decision point in Figure 1). In contrast, others are much harder, and all treatments had nearly 0% prediction accuracy (e.g., the 4th decision point in Figure 1). As a result of these floor and ceiling effects, some of the variation between treatments is obscured.

Third, it is difficult to assign "partial credit" for participants' predictions. In Figure 1's case, participants faced a choice of 4 options, so random guessing would be right 25% of the time. However, AI is regularly used in domains with much larger action spaces, so the probability that a participant picks right can be vanishingly small. As a result, it seems natural to think about which answers might be considered better than others. To do so, one might consider similarity in the action space (actions that look similar) or in the value space (actions that produce similar consequences), but either way it is a challenge to design rigorously.
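To make the "partial credit" idea concrete, here is a minimal sketch of a value-space scoring rule. Everything in it (the function names, the use of the agent's action-value estimates, and the linear fall-off) is an illustrative assumption of ours, not a method from any of the cited studies; an action-space variant would substitute a domain-specific distance between actions.

```python
# Sketch: value-space "partial credit" for a participant's prediction, assuming
# the agent exposes an action-value estimate (e.g., a Q-value) for each action.
# All names and the linear fall-off rule are hypothetical illustrations.

def partial_credit(q_values, predicted_action, actual_action):
    """Score a prediction in [0, 1]: exact matches earn 1.0, and other guesses
    earn credit that shrinks with the value gap between the predicted action
    and the action the agent actually took."""
    if predicted_action == actual_action:
        return 1.0
    values = q_values.values()
    value_range = max(values) - min(values)
    if value_range == 0:  # all actions equally valued: any guess is "close"
        return 1.0
    gap = abs(q_values[actual_action] - q_values[predicted_action])
    return max(0.0, 1.0 - gap / value_range)


# A 4-option decision point where two actions have near-identical consequences.
q = {"up": 0.91, "down": 0.12, "left": 0.89, "right": 0.40}
print(partial_credit(q, predicted_action="left", actual_action="up"))   # ~0.97
print(partial_credit(q, predicted_action="down", actual_action="up"))   # 0.0
```

A rule like this rewards near misses in large action spaces, though choosing the fall-off, and whether to score by action or value similarity, is itself one of the design decisions Section 5.3 worries about.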
4 BENEFITS: EXPS'S USEFULNESS IS HAMPERED BY LIMITED EXPOSURE
Consider that in-lab user studies are typically designed to be executed within a 2-hour window for a variety of reasons (e.g., reliability). As a result, the amount of participant exposure to the system is actually quite low. As an example, in Anderson et al.'s study [4] we showed 14 decision points to participants over the available 2 hours. In that paper, we point to this limited exposure as a possible reason that we did not observe any learning effect (evident in Figure 1).

Other challenges also surround exposing participants to the system when performing an ExpS evaluation. In particular, which decision points do we show to participants? Because this agent had been trained for 30,000 episodes, scrutinizing the training data in its entirety would be a daunting task. After training is complete, one could imagine presenting test cases, some of which could be handcrafted. One recent approach to selecting which decision points to show to assessors comes from Huang et al. [11]: their method measures the "criticality" of each state, and chooses the ones where the agent perceived its choice to matter the most.

Note again the large variability in the state/action space, which we commented on in Section 3. When combined with limited exposure, participants are essentially gazing into a vast expanse of behavior through a tiny peephole.

[Figure 2: Left: Two example explanations provided by the system used by Dodge et al. [8]:
"***Prediction: likely to reoffend. The training set contained 8 individuals matching this one. 2 of them reoffended (25%)"
"***Prediction: likely to reoffend. The training set contained 151 individuals matching this one. 93 of them reoffended (61%)"
The black text is static and the added red highlights show text that will be based on calculations about the input, intended to show how explanation templates get filled in. It demonstrates how the ExpS results vary based on the input (e.g., the top explanation is far less convincing). Right: Histogram of the matching percentages underlined in the left panel, for the classifiers trained on raw and processed data. These histograms show how differently the two classifiers behaved, but also show an interesting result—namely how often case-based explanation self-refutes (by providing low percentages), or does not substantiate any claim (by giving near 50%, in a binary classification setting). However, this insight might not have been observable for a user under the ExpS formulation, as users typically only work with a small number of explanations and self-refuting ones are rare.]

5 SCOPE: EXPG CAN CONSIDER A WIDER RANGE OF BEHAVIORS, VIA TEMPLATES
One great advantage of ExpG is that it supports explanation templates.

5.1 What is an explanation template?
The earliest evidence we could find for what we call "explanation templates" is from Khan et al. [13], and we think studying them can help address some of the problems discussed earlier in this paper. Explanation templates operate at a different granularity than an explanation: if an explanation describes or justifies an individual action, the explanation template is like the factory that creates the explanation.

We have built templates inspired by Binns et al. [5], who used a Wizard of Oz methodology to generate multiple types of explanation for a decision support setting. The decision they were trying to explain was an auto insurance quote ([5]'s Figure 2). One of their types, which they term "case-based explanations," is demonstrated in the left side of Figure 2, as we used them in Dodge et al. [8] to explain an AI system's judicial sentencing recommendations. Note that the templates shown in this paper are for textual explanations, but the idea extends naturally to other types of explanations, like the visual explanations in Mai et al.'s Figures 1 and 2 [19].
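To ground the "factory" analogy, here is a minimal sketch of what a case-based template like the one in Figure 2 (left) could look like in code. The wording follows the figure, but the radius-based neighborhood test, the helper names, and the data layout are simplified assumptions of ours, not the actual implementation behind Dodge et al. [8].

```python
import numpy as np

# The "factory": static text plus slots that get filled in per input.
CASE_BASED_TEMPLATE = (
    "***Prediction: {prediction}\n"
    "The training set contained {n_nearby} individuals matching this one.\n"
    "{n_reoffended} of them reoffended ({pct:.0f}%)"
)

def explain_one(x, predicted_label, train_X, train_y, radius=1.0):
    """Instantiate the template for a single input x (a 1-D feature vector).

    "Matching" individuals are training examples within `radius` of x in
    feature space (a stand-in for the feature-space volume discussed in
    Section 5.3); train_y holds 1 for "reoffended" and 0 otherwise.
    """
    nearby = np.linalg.norm(train_X - x, axis=1) <= radius
    n_nearby = int(nearby.sum())
    n_reoffended = int(train_y[nearby].sum())
    pct = 100.0 * n_reoffended / max(n_nearby, 1)
    return CASE_BASED_TEMPLATE.format(
        prediction="likely to reoffend" if predicted_label == 1 else "unlikely to reoffend",
        n_nearby=n_nearby,
        n_reoffended=n_reoffended,
        pct=pct,
    )
```

Every explanation instance in Figure 2 differs only in the numbers such a function computes; the template itself (the static text plus the calculation) is the design artifact an ExpG evaluation would examine.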
5.2 Why consider explanation templates?
An explanation template can be combined with appropriate software infrastructure and a test set to generate a large set of explanation instances. We argue that examining the distribution of thousands of explanations generated in this way can be more illuminating than seeing individual ones (e.g., Figure 2, right).

Although ExpG mechanisms can be used on any explanation template one desires, consider using a case-based explanation template. One way to produce these explanations is by finding training examples "near" the input, then characterizing how well the labels of the resultant set match the label of the input (Figure 2, left). If we run an explanation generator on the whole test set, we can create a histogram from those matching percentages, shown on the right side of Figure 2.
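As a sketch of that workflow (reusing the hypothetical radius-based neighborhood from the sketch at the end of Section 5.1; the bin width and the cut-offs for "self-refuting," "unsubstantiating," and "convincing" are arbitrary illustrative choices, not values from [8]):

```python
import numpy as np

def match_percentage(x, predicted_label, train_X, train_y, radius=1.0):
    """Percentage of nearby training examples whose label matches the prediction;
    this is the number a case-based template surfaces to the user."""
    nearby = np.linalg.norm(train_X - x, axis=1) <= radius
    if not nearby.any():
        return np.nan  # no nearby evidence at all
    return 100.0 * (train_y[nearby] == predicted_label).mean()

def template_profile(test_X, predictions, train_X, train_y):
    """Summarize a template's behavior across a whole test set (cf. Figure 2, right)."""
    pcts = np.array([match_percentage(x, p, train_X, train_y)
                     for x, p in zip(test_X, predictions)])
    valid = pcts[~np.isnan(pcts)]
    counts, edges = np.histogram(valid, bins=10, range=(0, 100))
    return {
        "histogram": list(zip(edges[:-1], counts)),           # (left bin edge, count)
        "self_refuting": float((valid < 25).mean()),           # mostly contradicts the prediction
        "unsubstantiating": float(((valid >= 40) & (valid <= 60)).mean()),
        "convincing": float((valid >= 90).mean()),
    }
```

A histogram like Figure 2 (right) falls out of the first entry, and the three summary fractions give an ExpG evaluator something to compare across candidate templates without running a new user study.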
However, this introduces an issue with explanation soundness. To see why, note that many instances fall near 100%—these are the explanations a user might find "convincing". However, there is also a good chance the explanation finds that around 50% of the nearby training examples matched the input's label—these are the explanations that do not substantiate any claim (note that the classifier is binary). Even worse, there are instances that fall near 0%—these are the self-refuting explanations² along the lines of "This instance was labelled an A because all the nearby training examples were B's."

² The original reason we generated the histogram on the right of Figure 2 was not to see if the explanation would self-refute, but to compare the two classifiers: one trained on raw data (raw) and another trained on the processed data (proc). "Processing" the data refers to the use of a preprocessor by Calmon et al. [6] intended to debias the data, perhaps inducing a classifier people consider more fair. In that effort, we looked to see which classifications were different, compared confidence score histograms, etc.

In this circumstance, the lack of soundness arises from the fact that the explanation uses nearest neighbors while the underlying classifier does not. Note also that the fix for these two problems is different. When the explanation lacks evidence, one should go find more evidence. But when the explanation self-refutes, one must figure out why the contradiction exists.

Note that if we were evaluating these explanations with ExpS, the result would depend strongly on whether the provided explanation was "convincing" or "self-refuting"—but the explanation template is the same in both cases; only the input changed. On the other hand, ExpG allows us to consider the wider scope—that the template is capable of generating explanations which will occasionally refute itself—and decide if that is acceptable.

To continue with the example of case-based self-refutation, suppose a member of the research team proposed an alternative explanation template that avoids refuting itself.³ To do so, we adjust the static text and the calculation that fills in the variable parts, as illustrated in Figure 3. Note that the new proposal only highlights counts of things that match, which has the effect of essentially ignoring nearby training examples that do not match the input's predicted label.

³ Here we use a strawman explanation template that is known to be bad, in order to explore the extent of our ability to characterize how bad it is. Correll performed a similar exercise in the visualization community, proposing Ross-Chernoff glyphs as a strawman, "...as a call to action that we need a better vocabulary and ontology of bad ideas in visualization. That is, we ought to be able to better identify ideas that seem prima facie bad for visualization, and better articulate and defend our judgments." [7].

[Figure 3: Two alternative designs for a case-based explanation template. The top shows one used in Dodge et al. [8]:
"***Prediction: likely to reoffend. The training set contained 80 individuals matching this one. 20 of them reoffended (25%)"
The bottom illustrates a strawman proposal for the discussion in this paper:
"***Prediction: likely to reoffend. The training set contained 20 individuals matching this one who also reoffended."
Note how the alternative fails to acknowledge nearby training examples that did not match the input's predicted label. This will have the effect of decreasing completeness, but the alternative explanation will not self-refute (as the example shows).]

The advantage of this proposal is that this alternative will not refute itself. But the advantage came at a cost: according to known taxonomies, it brings a decrease in completeness, as the explanation is telling less of the "whole truth."

So did the overall ExpG go up or down? We think most would argue down... but we cannot measure how much. This example exposes a critical weakness in the ExpG approach: the vocabulary and calculus currently available to us cannot adequately describe and measure the implications of a single design decision.

5.3 Scalability: Many design decisions are only validated with ExpG
So where does this weakness leave us?

During a full design cycle, XAI designers face many design decisions, of which only a few can be evaluated with ExpS. To illustrate, consider that case-based explanation as originally proposed by Binns et al. [5] would be implemented by showing the single nearest neighbor. That approach could be extended by showing the k nearest neighbors for a number of different k. Or, it could be implemented by showing whatever neighbors lie within some feature space volume—which is the approach used by Dodge et al. [8] and illustrated in the left side of Figure 2. The space of these possible design decisions is large enough that we cannot hope to evaluate them all with ExpS mechanisms.
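To make the size of that space concrete, here is a sketch of just the neighbor-selection decision, written as interchangeable strategies. The Euclidean distance, the particular k and radius values, and all the names are our own illustrative assumptions; each variant still has to be crossed with choices of distance metric, template wording, and so on.

```python
import numpy as np

def single_nearest(x, train_X):
    """The single most similar training example (the Binns et al. [5] style)."""
    d = np.linalg.norm(train_X - x, axis=1)
    return np.array([np.argmin(d)])

def k_nearest(x, train_X, k=5):
    """The k nearest neighbors, for any of several candidate k."""
    d = np.linalg.norm(train_X - x, axis=1)
    return np.argsort(d)[:k]

def within_radius(x, train_X, radius=1.0):
    """Every neighbor inside a feature-space volume (the Dodge et al. [8] style)."""
    d = np.linalg.norm(train_X - x, axis=1)
    return np.where(d <= radius)[0]

# Even this one decision yields a family of candidate designs; an ExpS study
# would need a participant cohort per variant, while ExpG-style inspection of
# each variant's generated explanations would not.
NEIGHBOR_STRATEGIES = {
    "single": single_nearest,
    "knn_k=3": lambda x, X: k_nearest(x, X, k=3),
    "knn_k=10": lambda x, X: k_nearest(x, X, k=10),
    "radius=0.5": lambda x, X: within_radius(x, X, radius=0.5),
    "radius=1.0": lambda x, X: within_radius(x, X, radius=1.0),
}
```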
Given this, we suspect many XAI design decisions are never assessed, unless the XAI designers use ExpG—strictly due to the impracticality of doing ExpS at the scale needed for full coverage. In a sense, explanation is a user interface, and user interface designers have long used a wide variety of techniques relating to ExpG (e.g., design guidelines [21, 22], cognitive dimensions [9], cognitive walkthroughs [26], etc.). Fortunately, there are some promising works that demonstrate the use of these approaches (e.g., [3, 15, 23]).

Hypothesis: Studying one (or a few) explanations is well suited to ExpS-oriented methods, but the template level may require ExpG methods.

This suggests that, to create XAI systems according to rigorous science—especially XAI systems that generate explanations with a template—we must develop improvements in the rigor and measurability of ExpG mechanisms. These will bring outsize benefit to designers and researchers as compared to improvements in ExpS.

6 ACKNOWLEDGMENTS
This work was supported by DARPA #N66001-17-2-4030. We would like to acknowledge all our co-authors on the work cited here, with specific highlights to Andrew Anderson, Alan Fern, Q. Vera Liao, Yunfeng Zhang, Rachel Bellamy, and Casey Dugan—research work does not happen in a vacuum, nor do ideas.

REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y Lim, and Mohan Kankanhalli. 2018. Trends and trajectories for explainable, accountable and intelligible systems: An HCI research agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems. ACM, New York, NY, USA, 582.
[2] S. Amershi, M. Cakmak, W. Knox, and T. Kulesza. 2014. Power to the people: The role of humans in interactive machine learning. AI Magazine 35, 4 (2014), 105–120.
[3] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 3, 13 pages. https://doi.org/10.1145/3290605.3300233
[4] Andrew Anderson, Jonathan Dodge, Amrita Sadarangani, Zoe Juozapaitis, Evan Newman, Jed Irvine, Souti Chattopadhyay, Alan Fern, and Margaret Burnett. 2019. Explaining Reinforcement Learning to Mere Mortals: An Empirical Study. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (Macao, China) (IJCAI'19). AAAI Press, Palo Alto, CA, USA, 1328–1334. http://dl.acm.org/citation.cfm?id=3367032.3367221
[5] Reuben Binns, Max Van Kleek, Michael Veale, Ulrik Lyngs, Jun Zhao, and Nigel Shadbolt. 2018. 'It's Reducing a Human Being to a Percentage': Perceptions of Justice in Algorithmic Decisions. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI '18). ACM, New York, NY, USA, Article 377, 14 pages.
[6] Flavio Calmon, Dennis Wei, Bhanukiran Vinzamuri, Karthikeyan Natesan Ramamurthy, and Kush R Varshney. 2017. Optimized Pre-Processing for Discrimination Prevention. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook, NY, 3992–4001. http://papers.nips.cc/paper/6988-optimized-pre-processing-for-discrimination-prevention.pdf
[7] Michael Correll. 2018. Ross-Chernoff Glyphs Or: How Do We Kill Bad Ideas in Visualization?. In Extended Abstracts of the 2018 CHI Conference on Human Factors in Computing Systems (Montreal, QC, Canada) (CHI EA '18). ACM, New York, NY, USA, Article alt05, 10 pages. https://doi.org/10.1145/3170427.3188398
[8] Jonathan Dodge, Q. Vera Liao, Yunfeng Zhang, Rachel K. E. Bellamy, and Casey Dugan. 2019. Explaining Models: An Empirical Study of How Explanations Impact Fairness Judgment. In Proceedings of the 24th International Conference on Intelligent User Interfaces (Marina del Rey, California) (IUI '19). ACM, New York, NY, USA, 275–285. https://doi.org/10.1145/3301275.3302310
[9] Thomas R. G. Green and Marian Petre. 1996. Usability analysis of visual programming environments: a 'cognitive dimensions' framework. Journal of Visual Languages & Computing 7, 2 (1996), 131–174.
[10] Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2018. Metrics for Explainable AI: Challenges and Prospects. CoRR abs/1812.04608 (2018). arXiv:1812.04608 http://arxiv.org/abs/1812.04608
[11] Sandy H. Huang, Kush Bhatia, Pieter Abbeel, and Anca D. Dragan. 2018. Establishing Appropriate Trust via Critical States. In IROS (Oct 2018). https://doi.org/10.1109/IROS.2018.8593649
[12] Ashish Kapoor, Bongshin Lee, Desney Tan, and Eric Horvitz. 2010. Interactive optimization for steering machine classification. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 1343–1352.
[13] Omar Zia Khan, Pascal Poupart, and James P. Black. 2009. Minimal Sufficient Explanations for Factored Markov Decision Processes. In Proceedings of the Nineteenth International Conference on Automated Planning and Scheduling (Thessaloniki, Greece) (ICAPS'09). AAAI Press, 194–200.
[14] Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI?: Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI '19). ACM, New York, NY, USA, Article 411, 14 pages. https://doi.org/10.1145/3290605.3300641
[15] Todd Kulesza, Margaret Burnett, Weng-Keen Wong, and Simone Stumpf. 2015. Principles of explanatory debugging to personalize interactive machine learning. In Proceedings of the 20th International Conference on Intelligent User Interfaces. ACM, 126–137.
[16] Todd Kulesza, Simone Stumpf, Margaret Burnett, Weng-Keen Wong, Yann Riche, Travis Moore, Ian Oberst, Amber Shinsel, and Kevin McIntosh. 2010. Explanatory debugging: Supporting end-user debugging of machine-learned programs. In Visual Languages and Human-Centric Computing (VL/HCC), 2010 IEEE Symposium on. IEEE, 41–48.
[17] T. Kulesza, S. Stumpf, M. Burnett, S. Yang, I. Kwan, and W. K. Wong. 2013. Too much, too little, or just right? Ways explanations impact end users' mental models. In 2013 IEEE Symposium on Visual Languages and Human Centric Computing. 3–10. https://doi.org/10.1109/VLHCC.2013.6645235
[18] Katherine Lippa, Helen Klein, and Valerie Shalin. 2008. Everyday expertise: cognitive demands in diabetes self-management. Human Factors 50, 1 (2008).
[19] Theresa Mai, Roli Khanna, Jonathan Dodge, Jed Irvine, Kin-Ho Lam, Zhengxian Lin, Nicholas Kiddle, Evan Newman, Sai Raja, Caleb Matthews, Christopher Perdriau, Margaret Burnett, and Alan Fern. 2020. Keeping It "Organized and Logical": After-Action Review for AI (AAR/AI). In 25th International Conference on Intelligent User Interfaces (Cagliari, Italy) (IUI '20). ACM, New York, NY, USA.
[20] Jack Muramatsu and Wanda Pratt. 2001. Transparent Queries: Investigating users' mental models of search engines. In Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval. ACM.
[21] Jakob Nielsen. 2005. Ten usability heuristics. https://www.nngroup.com/articles/ten-usability-heuristics/
[22] Donald A Norman. 1983. Design principles for human-computer interfaces. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. 1–10.
[23] Oluwakemi Ola and Kamran Sedig. 2016. Beyond simple charts: Design of visualizations for big health data. Online Journal of Public Health Informatics 8, 3 (2016). https://doi.org/10.5210/ojphi.v8i3.7100
[24] Nelly Oudshoorn and Trevor Pinch. 2003. How Users Matter: The Co-Construction of Users and Technology (Inside Technology). The MIT Press, Cambridge, MA, USA.
[25] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y Lim. 2019. Designing Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). ACM.
[26] Cathleen Wharton, John Rieman, Clayton Lewis, and Peter Polson. 1994. The cognitive walkthrough method: A practitioner's guide. In Usability Inspection Methods. 105–140.
[27] Robert H Wortham, Andreas Theodorou, and Joanna J Bryson. 2017. Improving robot transparency: real-time visualisation of robot AI substantially improves understanding in naive observers. In IEEE RO-MAN 2017. http://opus.bath.ac.uk/55793/