A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI

Michael Chromik, LMU Munich, Munich, Germany, michael.chromik@ifi.lmu.de
Martin Schuessler, Technische Universität Berlin, Berlin, Germany, schuessler@tu-berlin.de

ABSTRACT
The interdisciplinary field of explainable artificial intelligence (XAI) aims to foster human understanding of black-box machine learning models through explanation methods. However, there is no consensus among the involved disciplines regarding the evaluation of their effectiveness – especially concerning the involvement of human subjects. For our community, such involvement is a prerequisite for rigorous evaluation. To better understand how researchers across the disciplines approach human subject XAI evaluation, we propose developing a taxonomy that is iterated with a systematic literature review. Approaching the topic from an HCI perspective, we analyze which study designs scholars chose for different explanation goals. Based on our preliminary analysis, we present a taxonomy that provides guidance for researchers and practitioners on the design and execution of XAI evaluations. With this position paper, we put our survey approach and preliminary results up for discussion with our fellow researchers.

CCS CONCEPTS
• Human-centered computing → HCI design and evaluation methods.

KEYWORDS
explainable artificial intelligence; explanation; human evaluation; taxonomy

ACM Reference Format:
Michael Chromik and Martin Schuessler. 2020. A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI. In Proceedings of the IUI workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC'20), Cagliari, Italy. 7 pages.

ExSS-ATEC'20, March 2020, Cagliari, Italy. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
We have witnessed the widespread adoption of intelligent systems into many contexts of our lives. Such systems are often built on advanced machine learning (ML) algorithms that enable powerful predictions – often at the expense of interpretability. As these systems are introduced into more sensitive contexts of society, there is a growing acceptance that they need to be capable of explaining their behavior in human-understandable terms. Hence, much research is conducted within the emerging domains of explainable artificial intelligence (XAI) and interpretable machine learning (IML) on developing models, methods, and interfaces that are interpretable to human users – often through some notion of explanation.

However, most works focus on computational problems, while little research effort is reported concerning their user evaluation. Previous surveys identified the need for more rigorous empirical evaluation of explanations [2, 5, 17]. The AI and ML communities often strive for functional evaluation of their approaches with benchmark data to demonstrate generalizability. While this is suitable to demonstrate technical feasibility, it is also problematic since often "there is no formal definition of a correct or best explanation" [24]. Even if a formal foundation exists, it does not necessarily result in practical utility for humans, as the utility of an explanation is highly dependent on the context and capabilities of human users. Without proper human behavior evaluations, it is difficult to assess an explanation method's utility for practical use cases [26]. We argue that functional and behavioral evaluation approaches both have their legitimacy. Yet, since there is no consensus on evaluation methods, the comparison and validation of diverse explanation techniques is an open challenge [2, 4].

In this work, we take an HCI perspective and focus on evaluations with human subjects. We believe that the HCI community should be the driving force for establishing rigorous evaluation procedures that investigate how XAI can benefit users. Our work is guided by three research questions:
• RQ-1: Which evaluation approaches have been proposed and discussed across disciplines in the field of XAI?
• RQ-2: Which study design decisions have researchers made in previous evaluations with human subjects?
• RQ-3: How can the proposed approaches and study designs be integrated into a guiding taxonomy for human-centered XAI evaluation?

The contribution of this workshop paper is two-fold: First, we introduce our methodology for taxonomy development and literature review, guided by RQ-1 and RQ-2. The review aims to provide an overview of how evaluations are currently conducted and help identify suitable best practices. As a second contribution, we present a preliminary taxonomy of human evaluation approaches in XAI and describe its dimensions. Taxonomies have been used in many disciplines to help researchers and practitioners understand and analyze complex domains [23]. Our overarching goal is to synthesize a human subject evaluation guideline for researchers and practitioners of different disciplines in the field of XAI. With this work, we put our review methodology and preliminary taxonomy up for discussion with our fellow researchers.

2 FOUNDATIONS AND RELATED WORK

2.1 Evaluating Explanations in Social Sciences
Miller defines explanation as either a process or a product [16]. On the one hand, an explanation describes the cognitive process of identifying the cause(s) of a particular event. At the same time, it is a social process between an explainer (the sender of an explanation) and an explainee (the receiver of an explanation) with the goal of transferring knowledge about the cognitive process. Lastly, an explanation can describe the product that results from the cognitive process and aims to answer a why-question. In our paper, we refer to explanations from the product perspective. Psychologists and social scientists have investigated how humans evaluate explanations for decades.
Within their disciplines, explanation evaluation refers to the process applied by an explainee to determine whether an explanation is satisfactory [16]. Scholars conducted experiments in which they presented participants with different types of explanations as treatments. These experiments indicate that choosing one explanation over another is often an arbitrary choice heavily influenced by cognitive biases and heuristics [12]. The primary criterion of explainees is whether the explanation helps them to understand the underlying cause [16]. For instance, humans are more likely to accept explanations that are consistent with their prior beliefs. Furthermore, they prefer explanations that are simpler (i.e., with fewer causes) and more generalizable (i.e., that apply to more events). Also, the effectiveness of an explanation depends on the current information needs of the explainee. A suitable explanation for one purpose may be irrelevant for another. Thus, for an explanation to be effective, it is essential to know the intended context of use.

2.2 Explainable Artificial Intelligence (XAI)
Interpretability in machine learning is not a monolithic concept [15]. Instead, it is used to indirectly evaluate whether important desiderata, such as fairness, reliability, causality, or trust, are met in a particular context [4]. Some definitions of interpretability are rather system-centric. Doshi-Velez and Kim [4] describe it as a model's "ability to explain or to present in understandable terms to a human." Miller [16] takes a more human-centered perspective, calling it "the degree to which an observer can understand the cause of a decision". Human understanding can be fostered either by offering means of introspection or through explanations [3]. A large variety of methods exist for both approaches [9]. The term interpretable machine learning (IML) often refers to research on models and algorithms that are considered inherently interpretable, while explainable AI (XAI) often refers to the generation of (post-hoc) explanations or means of introspection for black-box models [27, 33]. A model's black-box behavior may manifest itself in two ways: either from complex architectures, as with deep neural networks, or from proprietary models (that may otherwise be interpretable), as with the COMPAS recidivism model [27]. The lines between IML and XAI are blurry and the terms are often used interchangeably. For instance, DARPA's XAI program subsumes both terms with the objective to "enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners" [10].

2.3 Evaluating Explanations in XAI
Multiple surveys of the ever-growing field of XAI exist. They formalize and ground the concept of XAI [1, 2], relate it to adjacent concepts and disciplines [1, 16], categorize methods [9], or discuss future research directions [1, 2]. All these surveys report a lack of rigorous evaluations. Adadi and Berrada found that only 5% of surveyed papers evaluate XAI methods and quantify their relevance [2]. Similarly, Nunes and Jannach found that 78% of the analyzed papers on explanations in decision support systems lacked structured evaluations that go beyond anecdotal "toy examples" [24].

Some works have addressed the design and conduct of explanation evaluations in XAI. Gilpin et al. survey explainable methods for deep neural networks and describe a categorization of evaluation approaches at different stages of the ML development process [8]. Yang et al. provide a framework consisting of multiple levels of explanation evaluation [33]. Their definition of persuasibility (measuring the degree of human comprehension) focuses on the human and resonates with our notion of human subject evaluation. Our work aims to elaborate on their generic strategy of "employing users for human studies". Nunes and Jannach reviewed 217 publications spanning multiple decades and briefly report findings from applied evaluation approaches [24]. Based on their survey, they derive a comprehensive taxonomy that guides the design of explanations. However, their taxonomy omits aspects of evaluation. Mueller et al. identified 39 XAI papers that reported empirical evaluations and qualitatively described the chosen evaluation approaches along 9 dimensions [20].

While these works offer valuable ideas, they are limited in their scope and, thus, offer little guidance for XAI user evaluations. Of course, "there is no standard design for user studies that evaluate forms of explanations" [24]. However, we believe that a unified taxonomy is needed that integrates the most common ideas related to human subject evaluation and extends them with best practice examples. Such an actionable format can provide great benefit for researchers and practitioners by guiding them through the design and reporting of structured XAI evaluations.
3 METHODOLOGY
In this section, we outline our method of taxonomy development as well as the planned literature review. Our goal is to develop a comprehensive taxonomy for human subject evaluations in XAI. We seek to validate and iterate it through a structured literature review (SLR). Figure 1 illustrates our proposed methodology and the interplay between taxonomy and SLR.

3.1 Taxonomy Development
There are two approaches to constructing a taxonomy. Following the conceptual-to-empirical approach, the researcher proposes a classification based on a theory or model (deductive). In contrast, the empirical-to-conceptual approach derives the taxonomy from empirical cases (inductive). We follow the iterative process for taxonomy development proposed by Nickerson et al. [23]. Their method unifies both approaches in an iterative process under a shared meta-characteristic and defined ending conditions.

In line with RQ-3, we defined our meta-characteristic as the development of a taxonomy for human subject evaluation of black-box explanations that guides researchers and practitioners in the design and reporting of future studies. We start by applying the conceptual-to-empirical approach, for which one needs to propose a classification based on a theory or model. We do this by consolidating proposed categories for XAI evaluation in prior work and connecting them with foundational literature on empirical studies. The resulting taxonomy describes an ideal type, which allows us to examine empirically how much current human subject evaluations deviate from it.

[Figure 1: The proposed methodology for taxonomy development (following Nickerson et al.) with an integrated structured literature review (following Kitchenham and Charters). Exclusion criteria: EC-1: not written in English; EC-2: not related to black-box explanations; EC-3: not reporting a human subject evaluation; EC-4: full text could not be retrieved; EC-5: not a scientific full or short paper; EC-6: is a duplicate. Inclusion criterion IC-1: reports setup and results of a human subject evaluation in the XAI context. Of the 653 publications identified through Scopus and screened based on abstract and full text, 507 were excluded after screening; 146 were analyzed in detail, 13 were excluded after detailed analysis, and 133 met the inclusion criteria. Steps highlighted in green describe the preliminary results presented in this workshop paper.]

3.2 Structured Literature Review
As part of the empirical-to-conceptual iteration, we aim to validate and iterate the taxonomy using a structured literature review (SLR). In line with RQ-2, the review's objective is to capture how researchers currently evaluate XAI methods and systems with human subjects. Through this, we seek to find out how precisely we can describe the field in a structured way using our taxonomy. During this process, we also aim to iterate the taxonomy. The planned SLR follows established guidelines proposed by Kitchenham and Charters [13]. In the following, we outline the proposed search strategy.

Source Selection: An exploratory search for XAI on Google Scholar indicated that relevant work is dispersed across multiple publishers, conferences, and journals. Thus, we use the Scopus database as a source, as it integrates publications from relevant publishers such as ACM, IEEE, and AAAI.

Search Query: Through our exploratory search, we obtained an initial understanding of relevant keywords, synonyms, and related concepts that helped us construct a search query. We found that different terms are used across the disciplines to describe the field of XAI and human subject evaluation approaches. Early research does not explicitly use the expressions XAI or explainable artificial intelligence. Thus, our search queries are composed of groups and terms. Groups refer to a specific aspect of the research question and limit the search scope. Terms within a group have a similar semantic meaning or are often used interchangeably. We are interested in the intersection of three groups that can be phrased using different terms. Table 1 shows the groups and terms we used.

Table 1: Groups and terms used for the search query
Group 1 - Explainable: explainability, explainable, explanation, explanatory, interpretability, interpretable, intelligibility, intelligible, scrutability, scrutable, justification
Group 2 - AI: XAI, AI, artificial intelligence, machine learning, black-box, recommender system, intelligent system, expert system, intelligent agent, decision support system
Group 3 - Human Subject Evaluation: user study, lab study, empirical study, online experiment, human experiment, human evaluation, user evaluation, participant, within-subject, between-subject, probe, crowdsourcing, Mechanical Turk
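For illustration, the following minimal Python sketch assembles a boolean query from the groups in Table 1 by OR-ing the terms within each group and AND-ing the groups. The Scopus field code TITLE-ABS-KEY and the exact quoting are assumptions made for illustration; they are not necessarily the literal query string used in our search.

    # Sketch: assemble a boolean search query from the three term groups in
    # Table 1. Terms within a group are OR-ed; the groups are AND-ed.
    # The field code TITLE-ABS-KEY is an illustrative assumption.

    GROUPS = {
        "explainable": ["explainability", "explainable", "explanation", "explanatory",
                        "interpretability", "interpretable", "intelligibility",
                        "intelligible", "scrutability", "scrutable", "justification"],
        "ai": ["XAI", "AI", "artificial intelligence", "machine learning", "black-box",
               "recommender system", "intelligent system", "expert system",
               "intelligent agent", "decision support system"],
        "human_subject_evaluation": ["user study", "lab study", "empirical study",
               "online experiment", "human experiment", "human evaluation",
               "user evaluation", "participant", "within-subject", "between-subject",
               "probe", "crowdsourcing", "Mechanical Turk"],
    }

    def build_query(groups):
        """OR the terms within each group, then AND the groups together."""
        clauses = ["(" + " OR ".join(f'"{term}"' for term in terms) + ")"
                   for terms in groups.values()]
        return "TITLE-ABS-KEY(" + " AND ".join(clauses) + ")"

    print(build_query(GROUPS))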
Study Selection Criteria: We filtered the search results by six exclusion criteria (EC) and one inclusion criterion (IC). We are interested in primary studies that report the setup and results of human subject evaluations in the XAI context (IC-1). We limit the survey to publications addressing the black-box explanation problem according to Guidotti et al. [9] (EC-2). Furthermore, we exclude publications that do not report human-grounded or application-grounded evaluations according to Doshi-Velez and Kim [4] (EC-3). We applied the exclusion criteria in cascading order, i.e., once a publication was excluded due to one EC, we did not assess any subsequent criteria.
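The cascading screening can be expressed compactly. The Python sketch below applies the six exclusion criteria listed in Figure 1 in order and stops at the first one that matches; the dictionary keys on the publication record are hypothetical placeholders, not fields of our actual screening sheet.

    # Sketch of the cascading screening described above: a publication is
    # excluded by the first exclusion criterion (EC-1..EC-6) that applies;
    # later criteria are not assessed. Remaining publications are checked
    # against IC-1. The keys on `pub` are hypothetical placeholders.

    EXCLUSION_CRITERIA = [
        ("EC-1", lambda pub: pub["language"] != "English"),
        ("EC-2", lambda pub: not pub["addresses_black_box_explanations"]),
        ("EC-3", lambda pub: not pub["reports_human_subject_evaluation"]),
        ("EC-4", lambda pub: not pub["full_text_available"]),
        ("EC-5", lambda pub: not pub["is_full_or_short_paper"]),
        ("EC-6", lambda pub: pub["is_duplicate"]),
    ]

    def screen(pub):
        """Return the first matching EC label, or the IC-1 outcome."""
        for label, excludes in EXCLUSION_CRITERIA:
            if excludes(pub):
                return ("excluded", label)   # cascade stops here
        if pub["reports_setup_and_results"]:
            return ("included", "IC-1")
        return ("excluded", "IC-1 not met")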
Study Analysis: So far, we have conducted the search procedure for Scopus in September 2019, which returned a total of 653 potentially relevant publications. Both authors filtered the returned publications by the inclusion and exclusion criteria to control for inter-rater effects. We discussed differing assessments until we reached consensus. We are currently in the process of analyzing the publications that met the inclusion criterion.

4 TAXONOMY OF HUMAN SUBJECT EVALUATION IN XAI
In the following section, we describe relevant dimensions of black-box explanation evaluation with human subjects. We group the identified characteristics into task-related, participant-related, and study design-related dimensions. The outlined taxonomy is a preliminary result after the first iterations of the conceptual-to-empirical approach based on propositions in prior work. Furthermore, the taxonomy was validated and refined based on a small subset of 34 publications from the structured literature review, following the empirical-to-conceptual approach.

4.1 Task Dimensions
Mohseni and Ragan distinguish two types of human involvement in the evaluation of explanations [18]. In the feedback setting, participants provide feedback on actual explanations, and experimenters determine the quality of the explanations through this feedback. In contrast, in the feed-forward setting no explanations are provided. Instead, humans generate examples of reasonable explanations that serve as a benchmark for algorithmic explanations.

Doshi-Velez and Kim distinguish two types of human subject evaluations that differ in their level of task abstraction [4]: Application-grounded evaluations conduct experiments within a real application context. Typically, this requires a high level of participant expertise. The quality of the explanation is assessed in measures of the application context, typically with a test of performance. Human-grounded evaluations conduct simplified or abstracted experiments that aim to maintain the essence of the target application.

Multiple types of user tasks have been proposed to elicit the quality of explanations [4, 18, 33]. We suggest distinguishing them by the information provided to the participant and the information inquired in return. In verification tasks, participants are provided with input, explanation, and output and are asked for their satisfaction with the explanation. Forced choice tasks extend this setting: participants are asked to choose from multiple competing explanations. In forward simulation tasks, participants are presented with inputs as well as explanations and need to predict the system's output. Counterfactual simulation tasks present participants with an input, an explanation, an output, and an alternative output (the counterfactual). Based on these, they predict what input changes are necessary to obtain the alternative output. In "Clever Hans" detection tasks, participants need to identify and possibly debug flawed models, e.g., a naive or short-sighted predictor [14]. System usage tasks are characterized by participants using the system and its explanations for its primary purpose, e.g., in a decision-making situation; the quality of the explanation is assessed in terms of decision quality. In annotation tasks, participants provide a suitable explanation given the input and output of a model.
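The distinction by information given to and inquired of the participant can be summarized in a small lookup structure. The Python sketch below mirrors the task dimension of Figure 2; the tuple encoding and helper function are our illustrative simplification, not part of the surveyed frameworks.

    # Sketch: task types distinguished by the information shown to the
    # participant and the information asked of them in return
    # (cf. the task dimension in Figure 2). Illustrative encoding only.

    TASK_TYPES = {
        "verification":              (("input", "explanation", "output"), ("satisfaction rating",)),
        "forced choice":             (("input", "competing explanations", "output"), ("preferred explanation",)),
        "forward simulation":        (("input", "explanation"), ("predicted output",)),
        "counterfactual simulation": (("input", "explanation", "output", "alternative output"),
                                      ("required input changes",)),
        "clever hans detection":     (("input", "explanation", "output"), ("model flaws",)),
        "system usage":              (("input", "explanation", "output"), ("decision",)),
        "annotation":                (("input", "output"), ("explanation",)),
    }

    def shows_explanation(task_type):
        """In this encoding, annotation is the only task type without a shown explanation."""
        shown, _asked = TASK_TYPES[task_type]
        return any("explanation" in item for item in shown)

    print([t for t in TASK_TYPES if not shows_explanation(t)])   # ['annotation']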
Explanations are provided to users with very different goals in mind. For their effective evaluation, researchers need to ensure that the intended explanation goal(s) are aligned with their intended evaluation goal(s), and vice versa. Also, calibrating the individual goals of participants with the intended explanation goal(s) might be necessary (e.g., through a briefing before the task) [31]. We distinguish nine common explanation goals, derived from [24, 30, 32]: transparency aims to explain how the system works; scrutability aims to allow users to tell the system it is wrong; trust aims to increase the user's confidence in the system; persuasiveness aims to convince the user to perform an action; satisfaction aims to increase the ease of use or enjoyment; effectiveness aims to help users make good decisions; efficiency aims to make decisions faster; education aims to enable users to generalize and learn; debugging aims to enable users to identify defects in the system. In the case of multiple intended explanation goals, their dependencies may be complementary, contradictory, or even unknown (e.g., the impact of transparency on trust).

Hoffman et al. describe multiple levels of task evaluation to assess a participant's understanding of an XAI system and discuss suitable metrics for each level [11]. Tests of satisfaction measure participants' self-reported satisfaction with an explanation and their perception of system understanding. On this level, researchers can rarely be sure whether participants understand the system to the degree they claim. Tests of comprehension assess the participants' mental models of the system and test their understanding, for example, through prediction tests and generative exercises. Tests of performance measure the resulting human-XAI system performance.
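As a rough illustration of how the three levels translate into measurements, the Python sketch below computes one simple metric per level from hypothetical per-participant records. The field names, scales, and data are assumptions for illustration, not measures prescribed by Hoffman et al.

    # Sketch: one simple metric per evaluation level, computed from
    # hypothetical per-participant records. Field names and scales are
    # illustrative assumptions.
    from statistics import mean

    participants = [
        # satisfaction: 1-7 Likert self-report; predictions: forward-simulation
        # answers scored 1/0; task_success: outcome of the actual decision task
        {"satisfaction": 6, "predictions": [1, 0, 1, 1], "task_success": 1},
        {"satisfaction": 4, "predictions": [1, 1, 0, 0], "task_success": 0},
        {"satisfaction": 5, "predictions": [1, 1, 1, 0], "task_success": 1},
    ]

    test_of_satisfaction = mean(p["satisfaction"] for p in participants)        # self-report level
    test_of_comprehension = mean(mean(p["predictions"]) for p in participants)  # prediction accuracy
    test_of_performance = mean(p["task_success"] for p in participants)         # human-XAI performance

    print(test_of_satisfaction, test_of_comprehension, test_of_performance)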
[Figure 2: Preliminary taxonomy of human subject evaluation in XAI based on the conceptual-to-empirical approach. Task dimensions: intended explanation goal [24, 30, 32], human involvement [18], task type with information given to and inquired of the participant [4, 14, 18, 33], evaluation level [11], and abstraction level [4]. Participant dimensions: participant type [19], level of AI and domain expertise, participant foresight [21], participant recruiting, number of participants, and participant incentivization [25, 28, 29]. Study design dimensions: study approach, treatment assignment, and treatment combination [24].]

4.2 Participant Dimensions
Mohseni et al. distinguish between several participant types: AI novices, who are usually end-users, data experts (including domain experts), and AI experts [19]. This distinction is important as user expertise strongly influences other participant-related dimensions. For example, Doshi-Velez and Kim [4], referencing the work of Neath and Surprenant [22], point out that user expertise determines what kind of cognitive chunks participants apply to a situation. The expertise of participants may determine the recruiting method and the number of participants. Recruiting difficulty is likely to increase with the required level of participant expertise [4]. One can recruit novices in large numbers via crowd-sourcing. In contrast, domain or AI experts are usually harder to identify and recruit. They are often invited to a targeted online study, a lab study, or a field study. According to Narayanan et al., the user study task may have dependencies with the level of participant foresight [21]. In an intrinsic setting, the participant's understanding of the context is solely based on the provided information. Thus, all participants are assumed to have equal knowledge about the context. Such experiments are usually suitable for novices. In an extrinsic setting, participants can additionally draw upon external facts, such as prior experience, that may be relevant for assessing the quality of an explanation, e.g., for spotting model flaws. Such a setting may be more suitable for data experts. However, it also makes controlling for participants' knowledge more difficult.

Incentivization of participants is another relevant dimension. According to Sova and Nielsen, it should be chosen considering study length, task demand, and participant expertise [28]. Stadtmüller and Porst advise using a monetary incentive for participants [29]. However, several non-monetary incentives are known to be effective as well (e.g., gifts for already paid employees) [25, 28]. Porst and von Briel found that participants may take part in a study because of study-related incentives (e.g., curiosity, sympathy, or entertainment), personal incentives (e.g., professional interest or a promise made), or altruistic reasons (e.g., to benefit science, society, or others) [25]. Esser argues that researchers should consider incentives in combination, such that the benefits of participating outweigh the perceived cost [6].
4.3 Study Design Dimensions
The study design of evaluations may follow a qualitative, quantitative, or mixed study approach. In experimental studies, experimenters assign treatments to groups of participants. Applied to the context of explanation evaluations, we can distinguish four common types of treatment combinations in line with Nunes and Jannach [24]: single treatment (i.e., no alternative treatment), with and without explanation (i.e., no explanation is the alternative treatment), alternative explanation (i.e., varying the information provided in explanations between treatments while other aspects of the user interface remain fixed), and alternative explanation interface (i.e., varying user interfaces between treatments). Furthermore, we can distinguish study designs by the treatment assignment: Between-subjects designs study the differences in understanding between groups of participants, each usually assigned to one treatment. In contrast, within-subjects designs study differences within individual participants who are assigned to multiple treatments.
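To make the assignment distinction concrete, the Python sketch below contrasts both schemes for the "with and without explanation" treatment combination. The randomization scheme and names are a generic illustration under our own assumptions, not a recommendation drawn from the surveyed work.

    # Sketch: between-subjects vs. within-subjects assignment for the
    # "with and without explanation" treatment combination.
    import random

    TREATMENTS = ["with_explanation", "without_explanation"]

    def assign_between_subjects(participant_ids, seed=0):
        """Each participant is assigned to exactly one treatment (balanced groups)."""
        rng = random.Random(seed)
        ids = list(participant_ids)
        rng.shuffle(ids)
        half = len(ids) // 2
        return {pid: TREATMENTS[0] if i < half else TREATMENTS[1]
                for i, pid in enumerate(ids)}

    def assign_within_subjects(participant_ids, seed=0):
        """Each participant receives all treatments, in individually randomized order."""
        rng = random.Random(seed)
        return {pid: rng.sample(TREATMENTS, k=len(TREATMENTS))
                for pid in participant_ids}

    print(assign_between_subjects(["P1", "P2", "P3", "P4"]))
    print(assign_within_subjects(["P1", "P2"]))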
5 LIMITATIONS AND FUTURE WORK
Our preliminary taxonomy has limitations. The taxonomy is neither collectively exhaustive nor mutually exclusive. Thus, it does not yet meet the ending conditions of taxonomy development [23]. We aim to refine and iterate the taxonomy with the results from the proposed structured literature review.

Furthermore, human subject evaluations in XAI are typically embedded in a broader context, which may create dependencies and limit the applicable evaluation approaches. Dependencies may arise from the explanation design context, such as the form of an explanation, its contents, or its underlying generation method. Multiple taxonomies have been developed for guiding the design of explanations [7, 24]. Nunes and Jannach proposed an elaborate explanation design taxonomy [24]; however, it omits aspects of evaluation. For now, we have abstained from relating our preliminary human subject evaluation taxonomy to this prior work, but we plan to integrate them in later iterations.

6 CONCLUSION
In this work, we gave a brief overview of recent efforts on explanation evaluation with human subjects in the growing field of XAI. We proposed a methodology for developing a comprehensive taxonomy for human subject evaluation that integrates the knowledge from multiple disciplines involved in XAI. Based on ideas from prior work, we presented a preliminary taxonomy following the conceptual-to-empirical approach. Despite its limitations, we believe our work is a starting point for rigorously evaluating the utility of explanations for human understanding of XAI systems. Researchers and practitioners developing XAI explanation facilities and systems have been asked to "respect the time and effort involved to do such evaluations" [4]. We aim to spark a discussion at the workshop on how to support them along the way.
REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 582, 18 pages. https://doi.org/10.1145/3173574.3174156
[2] A. Adadi and M. Berrada. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
[3] Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), Vol. 8. 1.
[4] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretability. CoRR abs/1702.08608 (2017). arXiv:1702.08608 http://arxiv.org/abs/1702.08608
[5] F. K. Dosilovic, M. Brcic, and N. Hlupic. 2018. Explainable artificial intelligence: A survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 0210–0215. https://doi.org/10.23919/MIPRO.2018.8400040
[6] Hartmut Esser. 1986. Über die Teilnahme an Befragungen. ZUMA Nachrichten 10, 18 (1986), 38–47.
[7] Gerhard Friedrich and Markus Zanker. 2011. A Taxonomy for Generating Explanations in Recommender Systems. AI Magazine 32, 3 (Jun. 2011), 90–98. https://doi.org/10.1609/aimag.v32i3.2365
[8] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. 2018. Explaining Explanations: An Overview of Interpretability of Machine Learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). 80–89. https://doi.org/10.1109/DSAA.2018.00018
[9] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. Comput. Surveys 51, 5 (Aug. 2018). https://doi.org/10.1145/3236009
[10] David Gunning and David Aha. 2019. DARPA's Explainable Artificial Intelligence (XAI) Program. AI Magazine 40, 2 (Jun. 2019), 44–58. https://doi.org/10.1609/aimag.v40i2.2850
[11] Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2018. Metrics for Explainable AI: Challenges and Prospects. CoRR abs/1812.04608 (2018). arXiv:1812.04608 http://arxiv.org/abs/1812.04608
[12] Frank C. Keil. 2006. Explanation and Understanding. Annual Review of Psychology 57, 1 (2006), 227–254. https://doi.org/10.1146/annurev.psych.57.102904.190100
[13] B. Kitchenham and S. Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering. Keele University and University of Durham, Technical Report EBSE-2007-01 (2007).
[14] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications 10, 1 (2019), 1096.
[15] Zachary C. Lipton. 2018. The Mythos of Model Interpretability. Queue 16, 3, Article 30 (June 2018), 27 pages. https://doi.org/10.1145/3236386.3241340
[16] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://doi.org/10.1016/j.artint.2018.07.007
[17] Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of Inmates Running the Asylum. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI). http://people.eng.unimelb.edu.au/tmiller/pubs/explanation-inmates.pdf
[18] Sina Mohseni and Eric D. Ragan. 2018. A Human-Grounded Evaluation Benchmark for Local Explanations of Machine Learning. CoRR abs/1801.05075 (2018). arXiv:1801.05075 http://arxiv.org/abs/1801.05075
[19] Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2018. A Survey of Evaluation Methods and Measures for Interpretable Machine Learning. CoRR abs/1811.11839 (2018). arXiv:1811.11839 http://arxiv.org/abs/1811.11839
[20] Shane T. Mueller, Robert R. Hoffman, William J. Clancey, Abigail Emrey, and Gary Klein. 2019. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI. CoRR abs/1902.01876 (2019). arXiv:1902.01876 http://arxiv.org/abs/1902.01876
[21] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2018. How do Humans Understand Explanations from Machine Learning Systems? An Evaluation of the Human-Interpretability of Explanation. CoRR abs/1802.00682 (2018). arXiv:1802.00682 http://arxiv.org/abs/1802.00682
[22] Ian Neath and Aimee Surprenant. 2002. Human Memory (2nd ed.). Thomson/Wadsworth, Australia; Belmont, CA.
[23] Robert C. Nickerson, Upkar Varshney, and Jan Muntermann. 2013. A method for taxonomy development and its application in information systems. European Journal of Information Systems 22, 3 (2013), 336–359. https://doi.org/10.1057/ejis.2012.26
[24] Ingrid Nunes and Dietmar Jannach. 2017. A Systematic Review and Taxonomy of Explanations in Decision Support and Recommender Systems. User Modeling and User-Adapted Interaction 27, 3-5 (Dec. 2017), 393–444. https://doi.org/10.1007/s11257-017-9195-0
[25] Rolf Porst and Christa von Briel. 1995. Wären Sie vielleicht bereit, sich gegebenenfalls noch einmal befragen zu lassen? Oder: Gründe für die Teilnahme an Panelbefragungen. Vol. 1995/04.
[26] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna M. Wallach. 2018. Manipulating and Measuring Model Interpretability. CoRR abs/1802.07810 (2018). arXiv:1802.07810 http://arxiv.org/abs/1802.07810
[27] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[28] Deborah Hinderer Sova and Jacob Nielsen. 2003. How to Recruit Participants for Usability Studies. https://www.nngroup.com/reports/how-to-recruit-participants-usability-studies/, accessed December 20th, 2019.
[29] Sven Stadtmüller and Rolf Porst. 2005. Zum Einsatz von Incentives bei postalischen Befragungen. Vol. 14.
[30] Nava Tintarev and Judith Masthoff. 2007. A Survey of Explanations in Recommender Systems. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop (ICDEW '07). IEEE Computer Society, Washington, DC, USA, 801–810. https://doi.org/10.1109/ICDEW.2007.4401070
[31] Nadya Vasilyeva, Daniel A. Wilkenfeld, and Tania Lombrozo. 2015. Goals Affect the Perceived Quality of Explanations. In CogSci.
[32] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 601.
[33] Fan Yang, Mengnan Du, and Xia Hu. 2019. Evaluating Explanation Without Ground Truth in Interpretable Machine Learning. CoRR abs/1907.06831 (2019). arXiv:1907.06831 http://arxiv.org/abs/1907.06831