A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI

Michael Chromik, LMU Munich, Munich, Germany, michael.chromik@ifi.lmu.de
Martin Schuessler, Technische Universität Berlin, Berlin, Germany, schuessler@tu-berlin.de

ABSTRACT
The interdisciplinary field of explainable artificial intelligence (XAI) aims to foster human understanding of black-box machine learning models through explanation methods. However, there is no consensus among the involved disciplines regarding the evaluation of their effectiveness – especially concerning the involvement of human subjects. For our community, such involvement is a prerequisite for rigorous evaluation. To better understand how researchers across the disciplines approach human subject XAI evaluation, we propose developing a taxonomy that is iterated with a systematic literature review. Approaching the topic from an HCI perspective, we analyze which study designs scholars chose for different explanation goals. Based on our preliminary analysis, we present a taxonomy that provides guidance for researchers and practitioners on the design and execution of XAI evaluations. With this position paper, we put our survey approach and preliminary results up for discussion with our fellow researchers.

CCS CONCEPTS
• Human-centered computing → HCI design and evaluation methods.

KEYWORDS
explainable artificial intelligence; explanation; human evaluation; taxonomy

ACM Reference Format:
Michael Chromik and Martin Schuessler. 2020. A Taxonomy for Human Subject Evaluation of Black-Box Explanations in XAI. In Proceedings of the IUI workshop on Explainable Smart Systems and Algorithmic Transparency in Emerging Technologies (ExSS-ATEC'20), Cagliari, Italy. 7 pages.

ExSS-ATEC'20, March 2020, Cagliari, Italy. Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
1 INTRODUCTION
We have witnessed the widespread adoption of intelligent systems into many contexts of our lives. Such systems are often built on advanced machine learning (ML) algorithms that enable powerful predictions – often at the expense of interpretability. As these systems are introduced into more sensitive contexts of society, there is a growing acceptance that they need to be capable of explaining their behavior in human-understandable terms. Hence, much research is conducted within the emerging domains of explainable artificial intelligence (XAI) and interpretable machine learning (IML) on developing models, methods, and interfaces that are interpretable to human users – often through some notion of explanation.

However, most works focus on computational problems, while little research effort is reported concerning their user evaluation. Previous surveys identified the need for more rigorous empirical evaluation of explanations [2, 5, 17]. The AI and ML communities often strive for functional evaluation of their approaches with benchmark data to demonstrate generalizability. While this is suitable to demonstrate technical feasibility, it is also problematic since often "there is no formal definition of a correct or best explanation" [24]. Even if a formal foundation exists, it does not necessarily result in practical utility for humans, as the utility of an explanation is highly dependent on the context and capabilities of human users. Without proper human behavior evaluations, it is difficult to assess an explanation method's utility for practical use cases [26]. We argue that functional and behavioral evaluation approaches both have their legitimacy. Yet, since there is no consensus on evaluation methods, the comparison and validation of diverse explanation techniques is an open challenge [2, 4].

In this work, we take an HCI perspective and focus on evaluations with human subjects. We believe that the HCI community should be the driving force for establishing rigorous evaluation procedures that investigate how XAI can benefit users. Our work is guided by three research questions:
• RQ-1: Which evaluation approaches have been proposed and discussed across disciplines in the field of XAI?
• RQ-2: Which study design decisions have researchers made in previous evaluations with human subjects?
• RQ-3: How can the proposed approaches and study designs be integrated into a guiding taxonomy for human-centered XAI evaluation?

The contribution of this workshop paper is two-fold: First, we introduce our methodology for taxonomy development and literature review, guided by RQ-1 and RQ-2. The review aims to provide an overview of how evaluations are currently conducted and help identify suitable best practices. As a second contribution, we present a preliminary taxonomy of human evaluation approaches in XAI and describe its dimensions. Taxonomies have been used in many disciplines to help researchers and practitioners understand and analyze complex domains [23]. Our overarching goal is to synthesize a human subject evaluation guideline for researchers and practitioners of different disciplines in the field of XAI. With this work, we put our review methodology and preliminary taxonomy up for discussion with our fellow researchers.

2 FOUNDATIONS AND RELATED WORK

2.1 Evaluating Explanations in Social Sciences
Miller defines explanation as either a process or a product [16]. On the one hand, an explanation describes the cognitive process of identifying the cause(s) of a particular event. At the same time, it is a social process between an explainer (the sender of an explanation) and an explainee (the receiver of an explanation) with the goal of transferring knowledge about the cognitive process. Lastly, an explanation can describe the product that results from the cognitive process and aims to answer a why-question. In our paper, we refer to explanations from the product perspective. Psychologists and social scientists have investigated how humans evaluate explanations for decades.
Within their disciplines, explanation evaluation refers to the process applied by an explainee to determine whether an explanation is satisfactory [16]. Scholars conducted experiments in which they presented participants with different types of explanations as treatments. These experiments indicate that choosing one explanation over another is often an arbitrary choice heavily influenced by cognitive biases and heuristics [12]. The primary criterion of explainees is whether the explanation helps them to understand the underlying cause [16]. For instance, humans are more likely to accept explanations that are consistent with their prior beliefs. Furthermore, they prefer explanations that are simpler (i.e., with fewer causes) and more generalizable (i.e., that apply to more events). Also, the effectiveness of an explanation depends on the current information needs of the explainee. A suitable explanation for one purpose may be irrelevant for another. Thus, for an explanation to be effective, it is essential to know the intended context of use.

2.2 Explainable Artificial Intelligence (XAI)
Interpretability in machine learning is not a monolithic concept [15]. Instead, it is used to indirectly evaluate whether important desiderata, such as fairness, reliability, causality, or trust, are met in a particular context [4]. Some definitions of interpretability are rather system-centric. Doshi-Velez and Kim [4] describe it as a model's "ability to explain or to present in understandable terms to a human." Miller [16] takes a more human-centered perspective, calling it "the degree to which an observer can understand the cause of a decision". Human understanding can be fostered either by offering means of introspection or through explanations [3]. A large variety of methods exist for both approaches [9]. The term interpretable machine learning (IML) often refers to research on models and algorithms that are considered inherently interpretable, while explainable AI (XAI) often refers to the generation of (post-hoc) explanations or means of introspection for black-box models [27, 33]. A model's black-box behavior may manifest itself in two ways: either from complex architectures, as with deep neural networks, or from proprietary models (that may otherwise be interpretable), as with the COMPAS recidivism model [27]. The lines between IML and XAI are blurry and the terms are often used interchangeably. For instance, DARPA's XAI program subsumes both terms with the objective to "enable human users to understand, appropriately trust, and effectively manage the emerging generation of artificially intelligent partners" [10].

2.3 Evaluating Explanations in XAI
Multiple surveys of the ever-growing field of XAI exist. They formalize and ground the concept of XAI [1, 2], relate it to adjacent concepts and disciplines [1, 16], categorize methods [9], or discuss future research directions [1, 2]. All these surveys report a lack of rigorous evaluations. Adadi and Berrada found that only 5% of surveyed papers evaluate XAI methods and quantify their relevance [2]. Similarly, Nunes and Jannach found that 78% of the analyzed papers on explanations in decision support systems lacked structured evaluations that go beyond anecdotal "toy examples" [24].

Some works have addressed the design and conduct of explanation evaluations in XAI. Gilpin et al. survey explainable methods for deep neural networks and describe a categorization of evaluation approaches at different stages of the ML development process [8]. Yang et al. provide a framework consisting of multiple levels of explanation evaluation [33]. Their definition of persuasibility (measuring the degree of human comprehension) focuses on the human and resonates with our notion of human subject evaluation. Our work aims to elaborate on their generic strategy of "employing users for human studies". Nunes and Jannach reviewed 217 publications spanning multiple decades and briefly report findings from applied evaluation approaches [24]. Based on their survey, they derive a comprehensive taxonomy that guides the design of explanations. However, their taxonomy omits aspects of evaluation. Mueller et al. identified 39 XAI papers that reported empirical evaluations and qualitatively described the chosen evaluation approaches along 9 dimensions [20].

While these works offer valuable ideas, they are limited in their scope and, thus, offer little guidance for XAI user evaluations. Of course, "there is no standard design for user studies that evaluate forms of explanations" [24]. However, we believe that a unified taxonomy is needed that integrates the most common ideas related to human subject evaluation and extends them with best practice examples. Such an actionable format can provide great benefit for researchers and practitioners by guiding them through the design and reporting of structured XAI evaluations.
3 METHODOLOGY
In this section, we outline our method of taxonomy development as well as the planned literature review. Our goal is to develop a comprehensive taxonomy for human subject evaluations in XAI. We seek to validate and iterate it through a structured literature review (SLR). Figure 1 illustrates our proposed methodology and the interplay between taxonomy and SLR.

3.1 Taxonomy Development
There are two approaches to constructing a taxonomy. Following the conceptual-to-empirical approach, the researcher proposes a classification based on a theory or model (deductive). In contrast, the empirical-to-conceptual approach derives the taxonomy from empirical cases (inductive). We follow the iterative process for taxonomy development proposed by Nickerson et al. [23]. Their method unifies both approaches in an iterative process under a shared meta-characteristic and defined ending conditions.

In line with RQ-3, we defined our meta-characteristic as the development of a taxonomy for human subject evaluation of black-box explanations that guides researchers and practitioners in the design and reporting of future studies. We start by applying the conceptual-to-empirical approach, for which one needs to propose a classification based on a theory or model. We do this by consolidating proposed categories for XAI evaluation in prior work and connecting them with foundational literature on empirical studies. The resulting taxonomy describes an ideal type, which allows us to examine empirically how much current human subject evaluations deviate from it.

[Figure 1: The proposed methodology for taxonomy development (following Nickerson et al.) with an integrated structured literature review (following Kitchenham and Charters). Exclusion criteria: EC-1: not written in English; EC-2: not related to black-box explanations; EC-3: not reporting a human subject evaluation; EC-4: full text could not be retrieved; EC-5: not a scientific full or short paper; EC-6: is a duplicate. Inclusion criterion IC-1: reports setup and results of a human subject evaluation in the XAI context. Of the 653 publications identified through Scopus and screened based on abstract and full text, 507 were excluded after screening; 146 were analyzed in detail, 13 were excluded after detailed analysis, and 133 met the inclusion criteria. Steps highlighted in green describe the preliminary results presented in this workshop paper.]

3.2 Structured Literature Review
As part of the empirical-to-conceptual iteration, we aim to validate and iterate the taxonomy using a structured literature review (SLR). In line with RQ-2, the review's objective is to capture how researchers currently evaluate XAI methods and systems with human subjects. Through this, we seek to find out how precisely we can describe the field in a structured way using our taxonomy. During this process, we also aim to iterate the taxonomy. The planned SLR follows established guidelines proposed by Kitchenham and Charters [13]. In the following, we outline the proposed search strategy.

Source Selection: An exploratory search for XAI on Google Scholar indicated that relevant work is dispersed across multiple publishers, conferences, and journals. Thus, we use the Scopus database as a source, as it integrates publications from relevant publishers such as ACM, IEEE, and AAAI.

Search Query: Through our exploratory search, we obtained an initial understanding of relevant keywords, synonyms, and related concepts that helped us construct a search query. We found that different terms are used across the disciplines to describe the field of XAI and human subject evaluation approaches. Early research does not explicitly use the expressions XAI or explainable artificial intelligence. Thus, our search queries are composed of groups and terms. Groups refer to a specific aspect of the research question and limit the search scope. Terms within a group have a similar semantic meaning or are often used interchangeably. We are interested in the intersection of three groups that can be phrased using different terms. Table 1 shows the groups and terms we used.

Table 1: Groups and terms used for the search query
Group 1 - Explainable: explainability, explainable, explanation, explanatory, interpretability, interpretable, intelligibility, intelligible, scrutability, scrutable, justification
Group 2 - AI: XAI, AI, artificial intelligence, machine learning, black-box, recommender system, intelligent system, expert system, intelligent agent, decision support system
Group 3 - Human Subject Evaluation: user study, lab study, empirical study, online experiment, human experiment, human evaluation, user evaluation, participant, within-subject, between-subject, probe, crowdsourcing, Mechanical Turk
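For illustration, the following minimal Python sketch assembles a boolean query from the groups in Table 1 by OR-ing the terms within each group and AND-ing the groups. The Scopus field code TITLE-ABS-KEY and the exact quoting are assumptions made for illustration; they are not necessarily the literal query string used in our search.

    # Sketch: assemble a boolean search query from the three term groups in
    # Table 1. Terms within a group are OR-ed; the groups are AND-ed.
    # The field code TITLE-ABS-KEY is an illustrative assumption.

    GROUPS = {
        "explainable": ["explainability", "explainable", "explanation", "explanatory",
                        "interpretability", "interpretable", "intelligibility",
                        "intelligible", "scrutability", "scrutable", "justification"],
        "ai": ["XAI", "AI", "artificial intelligence", "machine learning", "black-box",
               "recommender system", "intelligent system", "expert system",
               "intelligent agent", "decision support system"],
        "human_subject_evaluation": ["user study", "lab study", "empirical study",
               "online experiment", "human experiment", "human evaluation",
               "user evaluation", "participant", "within-subject", "between-subject",
               "probe", "crowdsourcing", "Mechanical Turk"],
    }

    def build_query(groups):
        """OR the terms within each group, then AND the groups together."""
        clauses = ["(" + " OR ".join(f'"{term}"' for term in terms) + ")"
                   for terms in groups.values()]
        return "TITLE-ABS-KEY(" + " AND ".join(clauses) + ")"

    print(build_query(GROUPS))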
Study Selection Criteria: We filtered the search results by six exclusion criteria (EC) and one inclusion criterion (IC). We are interested in primary studies that report the setup and results of human subject evaluations in the XAI context (IC-1). We limit the survey to publications addressing the black-box explanation problem according to Guidotti et al. [9] (EC-2). Furthermore, we exclude publications that do not report human-grounded or application-grounded evaluations according to Doshi-Velez and Kim [4] (EC-3). We applied the exclusion criteria in cascading order, i.e., once a publication was excluded due to one EC, we did not assess any subsequent criteria.
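The cascading screening can be expressed compactly. The Python sketch below applies the six exclusion criteria listed in Figure 1 in order and stops at the first one that matches; the dictionary keys on the publication record are hypothetical placeholders, not fields of our actual screening sheet.

    # Sketch of the cascading screening described above: a publication is
    # excluded by the first exclusion criterion (EC-1..EC-6) that applies;
    # later criteria are not assessed. Remaining publications are checked
    # against IC-1. The keys on `pub` are hypothetical placeholders.

    EXCLUSION_CRITERIA = [
        ("EC-1", lambda pub: pub["language"] != "English"),
        ("EC-2", lambda pub: not pub["addresses_black_box_explanations"]),
        ("EC-3", lambda pub: not pub["reports_human_subject_evaluation"]),
        ("EC-4", lambda pub: not pub["full_text_available"]),
        ("EC-5", lambda pub: not pub["is_full_or_short_paper"]),
        ("EC-6", lambda pub: pub["is_duplicate"]),
    ]

    def screen(pub):
        """Return the first matching EC label, or the IC-1 outcome."""
        for label, excludes in EXCLUSION_CRITERIA:
            if excludes(pub):
                return ("excluded", label)   # cascade stops here
        if pub["reports_setup_and_results"]:
            return ("included", "IC-1")
        return ("excluded", "IC-1 not met")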
Study Analysis: So far, we have conducted the search procedure for Scopus in September 2019, which returned a total of 653 potentially relevant publications. Both authors filtered the returned publications by the inclusion and exclusion criteria to control for inter-rater effects. We discussed differing assessments until we reached consensus. We are currently in the process of analyzing the publications that met the inclusion criterion.

4 TAXONOMY OF HUMAN SUBJECT EVALUATION IN XAI
In the following section, we describe relevant dimensions of black-box explanation evaluation with human subjects. We group the identified characteristics into task-related, participant-related, and study design-related dimensions. The outlined taxonomy is a preliminary result after the first iterations of the conceptual-to-empirical approach based on propositions in prior work. Furthermore, the taxonomy was validated and refined based on a small subset of 34 publications from the structured literature review, following the empirical-to-conceptual approach.

4.1 Task Dimensions
Mohseni and Ragan distinguish two types of human involvement in the evaluation of explanations [18]. In the feedback setting, participants provide feedback on actual explanations, and experimenters determine the quality of the explanations through this feedback. In contrast, in the feed-forward setting no explanations are provided. Instead, humans generate examples of reasonable explanations that serve as a benchmark for algorithmic explanations.

Doshi-Velez and Kim distinguish two types of human subject evaluations that differ in their level of task abstraction [4]: Application-grounded evaluations conduct experiments within a real application context. Typically, this requires a high level of participant expertise. The quality of the explanation is assessed in measures of the application context, typically with a test of performance. Human-grounded evaluations conduct simplified or abstracted experiments that aim to maintain the essence of the target application.

Multiple types of user tasks have been proposed to elicit the quality of explanations [4, 18, 33]. We suggest distinguishing them by the information provided to the participant and the information inquired in return. In verification tasks, participants are provided with input, explanation, and output and are asked for their satisfaction with the explanation. Forced choice tasks extend this setting: participants are asked to choose from multiple competing explanations. In forward simulation tasks, participants are presented with inputs as well as explanations and need to predict the system's output. Counterfactual simulation tasks present participants with an input, an explanation, an output, and an alternative output (the counterfactual). Based on these, they predict what input changes are necessary to obtain the alternative output. In "Clever Hans" detection tasks, participants need to identify and possibly debug flawed models, e.g., a naive or short-sighted predictor [14]. System usage tasks are characterized by participants using the system and its explanations for its primary purpose, e.g., in a decision-making situation; the quality of the explanation is assessed in terms of decision quality. In annotation tasks, participants provide a suitable explanation given the input and output of a model.
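The distinction by information given to and inquired of the participant can be summarized in a small lookup structure. The Python sketch below mirrors the task dimension of Figure 2; the tuple encoding and helper function are our illustrative simplification, not part of the surveyed frameworks.

    # Sketch: task types distinguished by the information shown to the
    # participant and the information asked of them in return
    # (cf. the task dimension in Figure 2). Illustrative encoding only.

    TASK_TYPES = {
        "verification":              (("input", "explanation", "output"), ("satisfaction rating",)),
        "forced choice":             (("input", "competing explanations", "output"), ("preferred explanation",)),
        "forward simulation":        (("input", "explanation"), ("predicted output",)),
        "counterfactual simulation": (("input", "explanation", "output", "alternative output"),
                                      ("required input changes",)),
        "clever hans detection":     (("input", "explanation", "output"), ("model flaws",)),
        "system usage":              (("input", "explanation", "output"), ("decision",)),
        "annotation":                (("input", "output"), ("explanation",)),
    }

    def shows_explanation(task_type):
        """In this encoding, annotation is the only task type without a shown explanation."""
        shown, _asked = TASK_TYPES[task_type]
        return any("explanation" in item for item in shown)

    print([t for t in TASK_TYPES if not shows_explanation(t)])   # ['annotation']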
Explanations are provided to users with very different goals in mind. For their effective evaluation, researchers need to ensure that the intended explanation goal(s) are aligned with their intended evaluation goal(s), and vice versa. Also, calibrating the individual goals of participants with the intended explanation goal(s) might be necessary (e.g., through a briefing before the task) [31]. We distinguish nine common explanation goals, derived from [24, 30, 32]: transparency aims to explain how the system works; scrutability aims to allow users to tell the system it is wrong; trust aims to increase the user's confidence in the system; persuasiveness aims to convince the user to perform an action; satisfaction aims to increase the ease of use or enjoyment; effectiveness aims to help users make good decisions; efficiency aims to make decisions faster; education aims to enable users to generalize and learn; debugging aims to enable users to identify defects in the system. In the case of multiple intended explanation goals, their dependencies may be complementary, contradictory, or even unknown (e.g., the impact of transparency on trust).

Hoffman et al. describe multiple levels of task evaluation to assess a participant's understanding of an XAI system and discuss suitable metrics for each level [11]. Tests of satisfaction measure participants' self-reported satisfaction with an explanation and their perception of system understanding. On this level, researchers can rarely be sure whether participants understand the system to the degree they claim. Tests of comprehension assess the participants' mental models of the system and test their understanding, for example, through prediction tests and generative exercises. Tests of performance measure the resulting human-XAI system performance.
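As a rough illustration of how the three levels translate into measurements, the Python sketch below computes one simple metric per level from hypothetical per-participant records. The field names, scales, and data are assumptions for illustration, not measures prescribed by Hoffman et al.

    # Sketch: one simple metric per evaluation level, computed from
    # hypothetical per-participant records. Field names and scales are
    # illustrative assumptions.
    from statistics import mean

    participants = [
        # satisfaction: 1-7 Likert self-report; predictions: forward-simulation
        # answers scored 1/0; task_success: outcome of the actual decision task
        {"satisfaction": 6, "predictions": [1, 0, 1, 1], "task_success": 1},
        {"satisfaction": 4, "predictions": [1, 1, 0, 0], "task_success": 0},
        {"satisfaction": 5, "predictions": [1, 1, 1, 0], "task_success": 1},
    ]

    test_of_satisfaction = mean(p["satisfaction"] for p in participants)        # self-report level
    test_of_comprehension = mean(mean(p["predictions"]) for p in participants)  # prediction accuracy
    test_of_performance = mean(p["task_success"] for p in participants)         # human-XAI performance

    print(test_of_satisfaction, test_of_comprehension, test_of_performance)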
[Figure 2: Preliminary taxonomy of human subject evaluation in XAI based on the conceptual-to-empirical approach. Task dimensions: intended explanation goal [24, 30, 32], human involvement [18], task type with information given to and inquired of the participant [4, 14, 18, 33], evaluation level [11], and abstraction level [4]. Participant dimensions: participant type [19], level of AI and domain expertise, participant foresight [21], participant recruiting, number of participants, and participant incentivization [25, 28, 29]. Study design dimensions: study approach, treatment assignment, and treatment combination [24].]

4.2 Participant Dimensions
Mohseni et al. distinguish between several participant types: AI novices, who are usually end-users, data experts (including domain experts), and AI experts [19]. This distinction is important as user expertise strongly influences other participant-related dimensions. For example, Doshi-Velez and Kim [4], referencing the work of Neath and Surprenant [22], point out that user expertise determines what kind of cognitive chunks participants apply to a situation. The expertise of participants may determine the recruiting method and the number of participants. Recruiting difficulty is likely to increase with the required level of participant expertise [4]. One can recruit novices in large numbers via crowd-sourcing. In contrast, domain or AI experts are usually harder to identify and recruit. They are often invited to a targeted online study, a lab study, or a field study. According to Narayanan et al., the user study task may have dependencies with the level of participant foresight [21]. In an intrinsic setting, the participant's understanding of the context is solely based on the provided information. Thus, all participants are assumed to have equal knowledge about the context. Such experiments are usually suitable for novices. In an extrinsic setting, participants can additionally draw upon external facts, such as prior experience, that may be relevant for assessing the quality of an explanation, e.g., for spotting model flaws. Such a setting may be more suitable for data experts. However, it also makes controlling for participants' knowledge more difficult.

Incentivization of participants is another relevant dimension. According to Sova and Nielsen, it should be chosen considering study length, task demand, and participant expertise [28]. Stadtmüller and Porst advise using a monetary incentive for participants [29]. However, several non-monetary incentives are known to be effective as well (e.g., gifts for already paid employees) [25, 28]. Porst and von Briel found that participants may take part in a study because of study-related incentives (e.g., curiosity, sympathy, or entertainment), personal incentives (e.g., professional interest or a promise made), or altruistic reasons (e.g., to benefit science, society, or others) [25]. Esser argues that researchers should consider incentives in combination, such that the benefits of participating outweigh the perceived cost [6].
4.3 Study Design Dimensions
The study design of evaluations may follow a qualitative, quantitative, or mixed study approach. In experimental studies, experimenters assign treatments to groups of participants. Applied to the context of explanation evaluations, we can distinguish four common types of treatment combinations in line with Nunes and Jannach [24]: single treatment (i.e., no alternative treatment), with and without explanation (i.e., no explanation is the alternative treatment), alternative explanation (i.e., varying the information provided in explanations between treatments while other aspects of the user interface remain fixed), and alternative explanation interface (i.e., varying user interfaces between treatments). Furthermore, we can distinguish study designs by the treatment assignment: Between-subjects designs study the differences in understanding between groups of participants, each usually assigned to one treatment. In contrast, within-subjects designs study differences within individual participants who are assigned to multiple treatments.
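To make the assignment distinction concrete, the Python sketch below contrasts both schemes for the "with and without explanation" treatment combination. The randomization scheme and names are a generic illustration under our own assumptions, not a recommendation drawn from the surveyed work.

    # Sketch: between-subjects vs. within-subjects assignment for the
    # "with and without explanation" treatment combination.
    import random

    TREATMENTS = ["with_explanation", "without_explanation"]

    def assign_between_subjects(participant_ids, seed=0):
        """Each participant is assigned to exactly one treatment (balanced groups)."""
        rng = random.Random(seed)
        ids = list(participant_ids)
        rng.shuffle(ids)
        half = len(ids) // 2
        return {pid: TREATMENTS[0] if i < half else TREATMENTS[1]
                for i, pid in enumerate(ids)}

    def assign_within_subjects(participant_ids, seed=0):
        """Each participant receives all treatments, in individually randomized order."""
        rng = random.Random(seed)
        return {pid: rng.sample(TREATMENTS, k=len(TREATMENTS))
                for pid in participant_ids}

    print(assign_between_subjects(["P1", "P2", "P3", "P4"]))
    print(assign_within_subjects(["P1", "P2"]))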
5 LIMITATIONS AND FUTURE WORK
Our preliminary taxonomy has limitations. The taxonomy is neither collectively exhaustive nor mutually exclusive. Thus, it does not yet meet the ending conditions of taxonomy development [23]. We aim to refine and iterate the taxonomy with the results from the proposed structured literature review.

Furthermore, human subject evaluations in XAI are typically embedded in a broader context, which may create dependencies and limit the applicable evaluation approaches. Dependencies may arise from the explanation design context, such as the form of an explanation, its contents, or its underlying generation method. Multiple taxonomies have been developed for guiding the design of explanations [7, 24]. Nunes and Jannach proposed an elaborate explanation design taxonomy [24]; however, it omits aspects of evaluation. For now, we have abstained from relating our preliminary human subject evaluation taxonomy to this prior work, but we plan to integrate them in later iterations.

6 CONCLUSION
In this work, we gave a brief overview of recent efforts on explanation evaluation with human subjects in the growing field of XAI. We proposed a methodology for developing a comprehensive taxonomy for human subject evaluation that integrates the knowledge from multiple disciplines involved in XAI. Based on ideas from prior work, we presented a preliminary taxonomy following the conceptual-to-empirical approach. Despite its limitations, we believe our work is a starting point for rigorously evaluating the utility of explanations for human understanding of XAI systems. Researchers and practitioners developing XAI explanation facilities and systems have been asked to "respect the time and effort involved to do such evaluations" [4]. We aim to spark a discussion at the workshop on how to support them along the way.
REFERENCES
[1] Ashraf Abdul, Jo Vermeulen, Danding Wang, Brian Y. Lim, and Mohan Kankanhalli. 2018. Trends and Trajectories for Explainable, Accountable and Intelligible Systems: An HCI Research Agenda. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI '18). ACM, New York, NY, USA, Article 582, 18 pages. https://doi.org/10.1145/3173574.3174156
[2] A. Adadi and M. Berrada. 2018. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 6 (2018), 52138–52160. https://doi.org/10.1109/ACCESS.2018.2870052
[3] Or Biran and Courtenay Cotton. 2017. Explanation and justification in machine learning: A survey. In IJCAI-17 Workshop on Explainable AI (XAI), Vol. 8. 1.
[4] Finale Doshi-Velez and Been Kim. 2017. Towards A Rigorous Science of Interpretability. CoRR abs/1702.08608 (2017). arXiv:1702.08608 http://arxiv.org/abs/1702.08608
[5] F. K. Dosilovic, M. Brcic, and N. Hlupic. 2018. Explainable artificial intelligence: A survey. In 2018 41st International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO). 0210–0215. https://doi.org/10.23919/MIPRO.2018.8400040
[6] Hartmut Esser. 1986. Über die Teilnahme an Befragungen. ZUMA Nachrichten 10, 18 (1986), 38–47.
[7] Gerhard Friedrich and Markus Zanker. 2011. A Taxonomy for Generating Explanations in Recommender Systems. AI Magazine 32, 3 (Jun. 2011), 90–98. https://doi.org/10.1609/aimag.v32i3.2365
[8] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal. 2018. Explaining Explanations: An Overview of Interpretability of Machine Learning. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA). 80–89. https://doi.org/10.1109/DSAA.2018.00018
[9] Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi. 2018. A survey of methods for explaining black box models. Comput. Surveys 51, 5 (Aug. 2018). https://doi.org/10.1145/3236009
[10] David Gunning and David Aha. 2019. DARPA's Explainable Artificial Intelligence (XAI) Program. AI Magazine 40, 2 (Jun. 2019), 44–58. https://doi.org/10.1609/aimag.v40i2.2850
[11] Robert R. Hoffman, Shane T. Mueller, Gary Klein, and Jordan Litman. 2018. Metrics for Explainable AI: Challenges and Prospects. CoRR abs/1812.04608 (2018). arXiv:1812.04608 http://arxiv.org/abs/1812.04608
[12] Frank C. Keil. 2006. Explanation and Understanding. Annual Review of Psychology 57, 1 (2006), 227–254. https://doi.org/10.1146/annurev.psych.57.102904.190100
[13] B. Kitchenham and S. Charters. 2007. Guidelines for performing Systematic Literature Reviews in Software Engineering. Keele University and University of Durham, Technical Report EBSE-2007-01 (2007).
[14] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller. 2019. Unmasking Clever Hans predictors and assessing what machines really learn. Nature Communications 10, 1 (2019), 1096.
[15] Zachary C. Lipton. 2018. The Mythos of Model Interpretability. Queue 16, 3, Article 30 (June 2018), 27 pages. https://doi.org/10.1145/3236386.3241340
[16] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://doi.org/10.1016/j.artint.2018.07.007
[17] Tim Miller, Piers Howe, and Liz Sonenberg. 2017. Explainable AI: Beware of Inmates Running the Asylum. In IJCAI 2017 Workshop on Explainable Artificial Intelligence (XAI). http://people.eng.unimelb.edu.au/tmiller/pubs/explanation-inmates.pdf
[18] Sina Mohseni and Eric D. Ragan. 2018. A Human-Grounded Evaluation Benchmark for Local Explanations of Machine Learning. CoRR abs/1801.05075 (2018). arXiv:1801.05075 http://arxiv.org/abs/1801.05075
[19] Sina Mohseni, Niloofar Zarei, and Eric D. Ragan. 2018. A Survey of Evaluation Methods and Measures for Interpretable Machine Learning. CoRR abs/1811.11839 (2018). arXiv:1811.11839 http://arxiv.org/abs/1811.11839
[20] Shane T. Mueller, Robert R. Hoffman, William J. Clancey, Abigail Emrey, and Gary Klein. 2019. Explanation in Human-AI Systems: A Literature Meta-Review, Synopsis of Key Ideas and Publications, and Bibliography for Explainable AI. CoRR abs/1902.01876 (2019). arXiv:1902.01876 http://arxiv.org/abs/1902.01876
[21] Menaka Narayanan, Emily Chen, Jeffrey He, Been Kim, Sam Gershman, and Finale Doshi-Velez. 2018. How do Humans Understand Explanations from Machine Learning Systems? An Evaluation of the Human-Interpretability of Explanation. CoRR abs/1802.00682 (2018). arXiv:1802.00682 http://arxiv.org/abs/1802.00682
[22] Ian Neath and Aimee Surprenant. 2002. Human Memory (2nd ed.). Thomson/Wadsworth, Australia; Belmont, CA.
[23] Robert C. Nickerson, Upkar Varshney, and Jan Muntermann. 2013. A method for taxonomy development and its application in information systems. European Journal of Information Systems 22, 3 (2013), 336–359. https://doi.org/10.1057/ejis.2012.26
[24] Ingrid Nunes and Dietmar Jannach. 2017. A Systematic Review and Taxonomy of Explanations in Decision Support and Recommender Systems. User Modeling and User-Adapted Interaction 27, 3-5 (Dec. 2017), 393–444. https://doi.org/10.1007/s11257-017-9195-0
[25] Rolf Porst and Christa von Briel. 1995. Wären Sie vielleicht bereit, sich gegebenenfalls noch einmal befragen zu lassen? Oder: Gründe für die Teilnahme an Panelbefragungen. Vol. 1995/04.
[26] Forough Poursabzi-Sangdeh, Daniel G. Goldstein, Jake M. Hofman, Jennifer Wortman Vaughan, and Hanna M. Wallach. 2018. Manipulating and Measuring Model Interpretability. CoRR abs/1802.07810 (2018). arXiv:1802.07810 http://arxiv.org/abs/1802.07810
[27] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215.
[28] Deborah Hinderer Sova and Jacob Nielsen. 2003. How to Recruit Participants for Usability Studies. https://www.nngroup.com/reports/how-to-recruit-participants-usability-studies/, accessed December 20th, 2019.
[29] Sven Stadtmüller and Rolf Porst. 2005. Zum Einsatz von Incentives bei postalischen Befragungen. Vol. 14.
[30] Nava Tintarev and Judith Masthoff. 2007. A Survey of Explanations in Recommender Systems. In Proceedings of the 2007 IEEE 23rd International Conference on Data Engineering Workshop (ICDEW '07). IEEE Computer Society, Washington, DC, USA, 801–810. https://doi.org/10.1109/ICDEW.2007.4401070
[31] Nadya Vasilyeva, Daniel A. Wilkenfeld, and Tania Lombrozo. 2015. Goals Affect the Perceived Quality of Explanations. In CogSci.
[32] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing Theory-Driven User-Centric Explainable AI. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 601.
[33] Fan Yang, Mengnan Du, and Xia Hu. 2019. Evaluating Explanation Without Ground Truth in Interpretable Machine Learning. CoRR abs/1907.06831 (2019). arXiv:1907.06831 http://arxiv.org/abs/1907.06831