=Paper=
{{Paper
|id=Vol-3277/paper3
|storemode=property
|title=Evaluating the Practicality of Counterfactual Explanations
|pdfUrl=https://ceur-ws.org/Vol-3277/paper3.pdf
|volume=Vol-3277
|authors=Nina Spreitzer,Hinda Haned,Ilse van der Linden
|dblpUrl=https://dblp.org/rec/conf/aiia/SpreitzerHL22
}}
==Evaluating the Practicality of Counterfactual Explanations==
Nina Spreitzer1,∗, Hinda Haned1,2 and Ilse van der Linden1,2,∗
1 University of Amsterdam, Amsterdam, The Netherlands
2 Civic AI Lab, Amsterdam, The Netherlands
Abstract
Machine learning models are increasingly used for decisions that directly affect people’s lives. These models are often opaque, meaning that the people affected cannot understand how or why the decision was made. However, according to the General Data Protection Regulation, decision subjects have the right to an explanation. Counterfactual explanations are a way to make machine learning models more transparent by showing how attributes need to be changed to get a different outcome. This type of explanation is considered easy to understand and human-friendly. To be used in real life, explanations must be practical, which means they must go beyond a purely theoretical framework. Research has focused on defining several objective functions to compute practical counterfactuals. However, it has not yet been tested whether people perceive the explanations as such in practice. To address this, we contribute by identifying properties that explanations must satisfy to be practical for human subjects. The properties are then used to evaluate the practicality of two counterfactual explanation methods (CARE and WachterCF) by conducting a user study. The results show that human subjects consider the explanations by CARE (a multi-objective approach) to be more practical than the WachterCF (baseline) explanations. We also show that the perception of explanations differs depending on the classification task by exploring multiple datasets.

Keywords
explainable AI, counterfactual explanations, practicality, human-friendly explanations, user study
1. Introduction
Machine learning (ML) models are increasingly used for automated decision-making impacting
people’s lives [1]. Some typical applications for ML model decisions are approving a requested
loan [2], hiring an applicant [3], or setting the price rates for insurance contracts [4]. ML
models are often opaque, meaning users cannot trace back how the decision is made [5]. In light
of this automated decision-making, the European Union put forward a General Data Protection
Regulation (GDPR) [6]. The GDPR includes a “right to explanation” [7], meaning affected
people are entitled to request an explanation for a decision that has been made about them.
To serve this right the research field of explainability for ML models is continuously growing
[8]. Explainability aims to make the functioning of a model clear and easy to understand for
XAI.it 2022 - Italian Workshop on Explainable Artificial Intelligence
∗ Corresponding authors.
nina.c.spreitzer@gmail.com (N. Spreitzer); h.haned@uva.nl (H. Haned); i.w.c.vanderlinden@uva.nl (I. v. d. Linden)
https://github.com/ninaspreitzer (N. Spreitzer)
a given audience [9]. However, an ongoing debate in legal and ML communities discusses
what this right should entail and what specific requirements must be met [10]. One of the
challenges is that the audience to whom the explanation is directed may vary. Arrieta et al. [9]
define several groups of people requiring explainability including system developers, users
with domain expertise, and users affected by the model decisions. Since the audience does
not necessarily have technical skills or domain knowledge, explainability methods must also
be suitable for non-technical users without expertise. Researchers in this area [11][9][12]
commonly distinguish between two families of explainable approaches, namely ante-hoc and
post-hoc approaches. Ante-hoc approaches focus on making models inherently interpretable
[13], whereas post-hoc methods use comprehensible representations to produce useful
approximations of the model’s decision-making process [14]. This is done while treating the
model as a black box without trying to reveal any knowledge about the model’s functioning [15].
We focus on a particular method of post-hoc explanations, called counterfactual expla-
nations [16]. A counterfactual explanation proposes minimal changes to the input data that
lead to a different model outcome. It answers the natural question: “Would changing a certain
factor have changed the outcome?” [17]. They can be seen as recommendations for what to
change to achieve a desired model outcome [18]. We focus on two methods for computing
counterfactual explanations, the original approach proposed by Wachter et al. [16], which we
refer to as WachterCF, and a framework proposed by Rasouli et al. [19], called CARE. Section
4 elaborates on how the methods compute a counterfactual instance and on the differences
between them. Counterfactual explanations are considered easy to understand [16] and
human-friendly [8]. Wachter et al. [16] state that counterfactual explanations are “practically
useful for understanding the reasons for a decision”. The Oxford Dictionary1 defines practical
as “concerned with the actual doing or use of something rather than with theory and ideas”.
Consequently, explanations must go beyond a purely theoretical concept to serve as practical
explanations. However, there is limited work that attempts to evaluate the perception of
counterfactual explanations in practice, and previous work has offered criticism on their
applicability in real-life settings (see Section 2). Our contribution lies in first defining a set of
properties that counterfactual explanations should satisfy in order to be considered practical.
These properties are then used to define questions for a user study. The user study tests how
users perceive counterfactual explanations computed by CARE and WachterCF in two different
contexts. The research questions we explore in this work are as follows.
RQ: How practical are counterfactual explanations for human subjects?
     Q1 What properties must counterfactual explanations serve to be practical for human sub-
        jects?
     Q2 How do human subjects perceive counterfactual instances proposed by CARE compared
        to WachterCF?
     Q3 How does the perception of counterfactual explanations differ depending on the classifi-
        cation task?
1
    https://languages.oup.com/google-dictionary-en/
The remainder of this paper is structured as follows. In Section 2, we present related work,
including an overview of proposed methods to compute counterfactual explanations and con-
ducted user studies. Section 3 outlines the methodology we follow to explore our research
questions. In Section 4 we specify desired properties for practical explanations for human
subjects. Section 5 details the experimental setup of our user study, and Section 6 highlights the
most important findings. Finally, we conclude the paper with a discussion by reflecting on the
results and limitations, as well as outlining future work.
2. Related Work
In this section, we first outline proposed methods for computing counterfactual explanations.
Secondly, we elaborate on limitations of the use of counterfactual explanation methods in
practice. This is followed by a discussion of related work addressing these limitations by
introducing new objectives to achieve practicality. Finally, we discuss related user studies
conducted to evaluate counterfactual explanations.
Figure 1: Overview of counterfactual explanation methods, categorized into optimization-based and
model-based approaches.
2.1. Methods for generating counterfactual explanations
In recent years, many methods for generating counterfactual explanations have been proposed
[20]. A complete literature review of the proposed methods exceeds the scope of this paper;
Figure 1 provides an overview. We distinguish between approaches that are optimization-based
[21][22][19][1][16] and ones that are model-based [23][24][25][20][26]. Since we compare
two optimization-based methods, we further elaborate on the different approaches of those
methods. The first counterfactual explanation method, WachterCF [16], defines a counterfactual
instance as the closest point to the input data that results in a different prediction. The closest
point is found by minimizing a distance function between the original and the counterfactual
feature vector using stochastic gradient descent. The method is restricted in that the black-box model needs to be differentiable and the proposed distance function only works with
continuous features. Ustun et al. [1] solve the optimization problem using a linear program,
focusing on instances that suggest altering features only in a way that is actionable for the end-
user. Similar to WachterCF, Diverse set of Counterfactual Explanations (DiCE) [22] is based on
gradient optimization but provides the user with multiple diverse counterfactual instances. DiCE
also introduces additional terms to the objective function to include further constraints. The
Multi-Objective Counterfactual explanations (MOC) [21] is the first framework that formalizes
counterfactual explanations as a multi-objective optimization problem. Additionally, Rasouli
et al. [19] propose a framework to generate Coherent Actionable Recourse based on sound
counterfactual Explanations (CARE), solving a multi-objective problem based on a hierarchical
objective set. The overview in Figure 1 is based on a categorization of methods following
Redelmeier et al. [20].
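For reference, the objective minimized by WachterCF can be stated compactly. The following rendering follows Wachter et al. [16], where f denotes the (differentiable) black-box model, x the original input, x′ the counterfactual instance, y′ the desired outcome, λ a balancing parameter, and MAD_k the median absolute deviation of feature k over the training data:

\[
\arg\min_{x'} \max_{\lambda} \; \lambda \bigl(f(x') - y'\bigr)^2 + d(x, x'),
\qquad
d(x, x') = \sum_{k} \frac{|x_k - x'_k|}{\mathrm{MAD}_k}
\]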
2.2. Limitations on counterfactual explanations in practice
In their work, Barocas et al. [27] show that the computation of counterfactual explanations often
relies on widely overlooked assumptions that are necessary for counterfactual explanations to
be accepted in real life. It is assumed that the highlighted feature changes always correspond to
actual actions. The main problem with this is that the features are not independent, as the actions
are likely to affect other features simultaneously. The authors also state that most computed
counterfactual explanations only rely on the distribution of training data and overlook that not
everything that can be considered equivalent in the training data is automatically equivalent
for individual data subjects. In other words, specific suggestions can be practical for some
people but may not be practical for others. Laugel et al. [15] claim that such assumptions make counterfactual explanations unreliable in many contexts of use. They argue that post-hoc explanations may not be faithful to the original data because they are prone to robustness and complexity problems, such as overfitting or excessive generalization, which leads to unsatisfactory interpretability. The authors outline that a counterfactual instance needs to provide
a set of plausible changes a human can act upon in practice and also needs to result from given
data points.
2.3. Taxonomy of desiderata for counterfactual explanations
Researchers have responded to this criticism by developing counterfactual frameworks with
different objective functions to satisfy specific properties [1][19][22][21][28][29]. Redelmeier
et al. [20] outline some of these attributes that have been proposed in previous work. These
include counterfactuals that aim to be close (decrease distance to original input data), sparse
(reduce number of feature changes), feasible (increase density of area around counterfactual
instance), actionable (restrict the features that can change with constraints), and diverse (increase
distance between suggested counterfactual instances for same input). WachterCF focuses on
being close by minimizing the distance between a counterfactual instance and the original input
data. Thus, it does not explicitly aim to satisfy any other objectives. In contrast, CARE solves a
multi-objective problem covering four desirable properties: proximity (being a neighbor of the ground-truth data [15]), connectedness (relationship between the counterfactual instance and the training data), coherency (keeping the consistency of (un)changed features), and actionability (including user preferences, e.g. restrictions or immutability of features). CARE’s objectives are set
up as a hierarchy with four modules. Validity is at the bottom as a foundation, followed
by Soundness, Coherency, and at the top Actionability. The modules are independent of each
other, and all relate to an individual objective function. The low-level modules, validity and
soundness, deal with fundamental and statistical properties of counterfactual explanations.
A valid counterfactual is the foundation, meaning the counterfactual instance needs to alter
the outcome of the ML model. Soundness includes proximity and connectedness. Hence, a
sound counterfactual instance originates from similar observed data and connects to existing
knowledge. The high-level modules manage coherency between features and user preferences.
The coherency module constructs a correlation model and uses this model to guarantee that all
connected features alter in accordance with a particular feature change. Furthermore, CARE
ensures actionability by enabling users to set preferences. Those preferences can include
defining (im)mutable features and also setting value ranges of specific attributes.
2.4. User studies evaluating counterfactual explanations
Warren et al. [30] have examined how well users can predict model outcomes after seeing counterfactual explanations, as well as users’ self-reported satisfaction and trust in the explanations. Our
user study adds to their findings by assessing user perceptions of practicality properties grounded
in social sciences [31]. Förster et al. [28] examined the coherency of counterfactual explanations
by testing if users consider an instance as real (taken from the training data) or fake (computed
counterfactual instance). Additionally, the users were asked if they thought the instance was
suitable to explain the outcome. The classification task explored in their study was housing
price prediction. The user study design that we present examines classification tasks that affect
human decision subjects. Our work contributes to a better understanding of user perception of other desiderata and of the extent to which it depends on the classification task.
3. Methodology
To answer the research questions, we first formalize our evaluation of practicality by defining
a set of properties. In the user study, we evaluate the practicality properties by assessing
how human subjects perceive counterfactual explanations generated by CARE compared to
explanations generated by WachterCF. We explore this for two classification tasks using different
datasets.
3.1. Formalizing Practicality
To determine what properties counterfactual explanations must satisfy to be practical, we review
existing literature and examine the requirements of ML explanations. Based on that, we define
a set of properties that make counterfactual explanations practical for humans. We focus on
properties that affect how human subjects perceive counterfactual explanations. Thus, we
examine the relation between explanation and user perception rather than the relation between
explanation and model behavior (i.e. robustness). The defined properties serve as a benchmark
for the practicality of generated counterfactuals.
Figure 2: Example screen of user study: Counterfactual explanations computed with CARE presented
in a visually appealing manner. The counterfactual instance computed with WachterCF followed the
same design.
3.2. User Study
The remaining sub-questions are answered by performing a user study with human subjects.
The study aims to compare the perception of explanations provided by CARE with explanations
provided by WachterCF. We use the Wilcoxon signed-rank test [32] to determine if the results are significantly different. The test is non-parametric, meaning the answers do not need to follow a normal distribution, and it tests for differences between paired samples.
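As an illustration of this analysis step, the sketch below shows how such a paired test could be run with SciPy; the two response arrays are made-up placeholders, not our actual study data.

```python
# Minimal sketch of the paired significance test (hypothetical data, not the study responses).
from scipy.stats import wilcoxon

# Likert-scale answers (1-7) to the same question, one pair per participant:
# first for the CARE explanation, then for the WachterCF explanation.
care_responses    = [6, 5, 7, 4, 6, 5, 6, 7, 5, 6]
wachter_responses = [4, 4, 5, 3, 5, 4, 4, 5, 3, 4]

# Wilcoxon signed-rank test: non-parametric, suitable for paired (matched) samples.
statistic, p_value = wilcoxon(care_responses, wachter_responses)
print(f"W = {statistic:.1f}, p = {p_value:.4f}")
```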
3.2.1. Define Practicality Evaluation
To measure the practicality of counterfactual instances, we formulate questions mapped to prac-
ticality properties. We compare the methods indirectly, meaning we first evaluate explanations
computed by CARE and then ask the same questions for explanations following WachterCF. The
answers are given without direct comparison. Only one question at the end asks directly which
method is preferred. This question helps us check whether the participants’ stated preference matches our findings.
3.2.2. Define Target Group & Sample Size
Before conducting a user study, the target group to be studied must be defined. Since any
individual could be affected by automated decision-making, we use convenience sampling [33]
by collecting information from a conveniently available pool of respondents. This type of
sampling is prone to bias, which will be discussed in Section 7. One distinction we make in the
analysis is whether participants are familiar with machine learning methods because we believe
there may be a potential difference in the perception of explanations depending on the prior
knowledge of participants. Our objective was to gather enough responses to test for statistical
significance. For this reason, we aimed for 100 respondents, preferably evenly distributed in
terms of machine learning literacy.
3.2.3. Classification Tasks and Data
To examine different classification tasks, we use two separate datasets. In the user study, the two
scenarios were evenly distributed among the participants. Both datasets used can be accessed
through the UCI machine learning repository [34]. The Adult Income dataset [35] is used to
classify whether an individual is likely to have an income of more than 50k per year. The
Student Performance dataset [36] is used for predicting what final grade an individual is likely
to have for a school subject. For the user study, we simplify the classification task to predict if a
student will pass or fail the subject. Both decisions are based on personal information about the
individuals.
3.2.4. Preprocessing & Classification Model
The preparation of the Adult dataset follows Zhu [37] by dropping columns that are not needed
and grouping categorical values to end up with a more concise set of possible values. For the
student dataset, we dropped columns to make the number of possible feature changes similar to
the adult dataset to ensure a fair comparison. An overview of the final datasets can be found in
the Appendix. For the remaining steps until the computation of the counterfactual explanation,
we follow Rasouli et al. [19]. The numerical features of both datasets are standardized, and categorical features are first converted to an ordinal encoding and then one-hot encoded. The datasets are split into an 80% training and a 20% test set, and only the training set is used for creating
the classification model. We follow Rasouli et al. [19] in the choice of model: a multi-layer
neural network with two hidden layers, each including 50 neurons. The resulting model for the
Adult Income classification task has an F1-score of 0.81. The model for the Student Performance
dataset has an F1-score of 0.71.
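The sketch below illustrates the preprocessing and modeling setup described above with scikit-learn. Column names, the file path, and the random seed are placeholders, and the ordinal-then-one-hot encoding step is collapsed into a single one-hot encoding; it is not the exact pipeline used in the study.

```python
# Illustrative sketch of the preprocessing and classification setup (hypothetical column names).
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

data = pd.read_csv("adult_preprocessed.csv")   # placeholder path; target assumed to be binary 0/1
numeric_cols = ["Age", "Working Hours"]
categorical_cols = [c for c in data.columns if c not in numeric_cols + ["income"]]

X, y = data.drop(columns=["income"]), data["income"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), numeric_cols),                            # standardize numerical features
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),  # one-hot encode categorical features
])

# Multi-layer neural network with two hidden layers of 50 neurons each.
model = Pipeline([
    ("prep", preprocess),
    ("clf", MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500, random_state=0)),
])

# 80/20 train/test split; only the training set is used to fit the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)
print("F1-score:", f1_score(y_test, model.predict(X_test)))
```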
3.2.5. Generating Counterfactual Explanation
With this setup, we create counterfactual instances using WachterCF and CARE. CARE provides
the user with the possibility of defining constraints for actionability. We pre-define these
constraints following Rasouli et al. [19] by setting gender and race as fix (immutable value) and
age as ge (can only be greater than or equal to the current value). Additionally, the CARE framework allows the end-user to set how many counterfactual explanations should be generated for a single instance. We set this number to five, as we believe that five different suggestions are both comprehensible and noticeably different from providing only one counterfactual suggestion.
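To make these settings concrete, the snippet below spells out the constraint configuration as plain data. The dictionary layout and variable names are only an illustration of the settings described above, not CARE's actual API; only the values ('fix', 'ge') and the number of counterfactuals follow our setup.

```python
# Illustration of the actionability settings described above (not the actual CARE API).
actionability_constraints = {
    "gender": "fix",  # immutable value
    "race":   "fix",  # immutable value
    "age":    "ge",   # may only be greater than or equal to the current value
}
n_counterfactuals = 5  # number of diverse suggestions generated per explained instance
```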
4. Practicality
Considering that the recipient of the explanation is a human subject, it is crucial to make the
explanations human-friendly. Beyond satisfying other objective functions, the explanations must be presented in a way that people understand [38], and the goodness of an explanation
depends on its audience [9]. Thus, human perception is an essential factor for explanation
methods [13]. We recognize that there is a possible trade-off between making explanations
human-friendly while retaining high accuracy. Nevertheless, we highlight that this trade-off is
not relevant for our current work since the aim is not to improve the human-friendliness of
explanations. This work assesses how humans perceive the explanation methods that claim to
be practical for human subjects.
Miller [31] summarizes human-friendly characteristics. Depending on the context, some
features may be more important than others [38]. Concerning counterfactual explanations, we
have defined the following set of properties that lead to practical explanations. This set is then
used as a benchmark to evaluate how humans perceive counterfactual explanations in practice.
    • Contrastiveness: Humans are not interested in why an event happened but rather in
      why that event happened instead of another. In the context of counterfactual explanations,
      we measure contrastiveness by how well the user understands what needs to change to
      get the opposite outcome.
    • Selectivity: Generally, humans do not expect a complete cause of an event. Humans
      are used to selecting a smaller set of causes and treating it as a complete explanation.
      Therefore, counterfactual explanations can provide selectivity by changing only a subset
      of features as well as providing different suggestions.
    • Social: The explanation is part of an interaction between the end-user and the explaining system. As a result, the social environment, the target audience, and the use case need to be considered. For counterfactual explanations, this means that the proposed changes should be realistically achievable in the given use case for the affected person.
    • Truthful: A human-friendly explanation needs to make sense. In other words, the user
      must perceive the counterfactual suggestions to reach the other result as plausible.
    • Consistent with prior beliefs: As described by the confirmation bias [39], people tend
      to ignore information that is inconsistent with their prior beliefs. Applied to counterfactual
      explanations, this means that end-users are more likely to consider explanations that
      suggest changes that are expected in advance.
5. Experimental Setup
The following section describes how the defined practicality properties are measured and how
the user study is designed.
5.1. Practicality Measurements
Table 1 provides an overview of our user study questions and the possible responses. We specify
questions to measure how well the counterfactual instances satisfy the practicality properties.
We map the questions shown to the properties as we believe they represent expectations of coun-
terfactual explanations with respect to those properties. Question 1 is asked at the beginning, Questions 2-8 are asked once for each counterfactual method, and Question 9 finally asks which explanation is preferred. To make the scenario and questions more understandable, we
call the person represented by the input data Charlie.
Table 1
User Study Questions: The following questions are formulated based on the property set to evaluate practicality.

1. What attribute(s) would you expect to change for Charlie to instead get the outcome of “earning above 50k” / “passing the course”?
   Measurement: Multiple choice (list of features). Property: Consistency with prior beliefs.
2. How surprised are you with the suggested changes in attributes to get the outcome of “earning above 50k” / “passing the course”?
   Measurement: Likert scale (1 = Not at all, 7 = Very surprised). Property: Consistency with prior beliefs.
3. How well does the method explain to you what Charlie needs to change to get “earning above 50k” / “passing the course”?
   Measurement: Likert scale (1 = Not at all, 7 = Very well). Property: Contrastiveness.
4. Based on the explanation, what attribute(s) would you consider as most important to change the model outcome?
   Measurement: Multiple choice (list of features). Property: Consistency with prior beliefs.
5. In your opinion, the amount of five different suggestions / one suggestion is ___ to explain the model outcome.
   Measurement: Single choice (Too little / Enough / Too many). Property: Selectivity.
6. In your opinion, the variation of attributes in the suggestion(s) is ___ to explain the model outcome.
   Measurement: Single choice (Too little / Enough / Too much). Property: Selectivity.
7. Do you think Charlie could realistically act upon the suggestions to change the model outcome to “earning above 50k” / “passing the course”?
   Measurement: Likert scale (1 = Not at all, 7 = Fully). Property: Social.
8. Do you think the suggestions make sense in order to retrieve the model outcome of “earning above 50k” / “passing the course”?
   Measurement: Likert scale (1 = Not at all, 7 = A lot). Property: Truthful.
9. Which method would you prefer as an explanation for the outcome of the ML model?
   Measurement: Single choice (Method A / Method B).
5.2. User Study Setup
The questionnaire is designed using the survey tool Qualtrics2 and sent out digitally. The study starts by introducing ML models, how they are used for automated decision-making, and how counterfactual instances help to provide explanations. Next, the scenario and the datasets are explained, including an illustrative counterfactual instance similar to the one in Figure 2. Then, the
participants are asked what properties they expect to change to retrieve the other outcome.
This is followed by showing the first counterfactual explanation, which is computed by CARE.
The participants are asked to answer questions 2-8 concerning the CARE explanation. After
answering the questions, they are shown the second explanation computed by WachterCF and
are asked the same questions. Finally, the last question asks which method is preferred and gives
an opportunity to leave a comment. Figure 2 illustrates how the first explanation calculated by
CARE is shown to the participants. We considered several options for the order of showing
2
    https://www.qualtrics.com
explanations from the two different methods. The final user study shows each participant
the CARE explanations first followed by the WachterCF explanation. Another option would
be to randomize the order of explanation methods shown between participants. The second
option has the advantage of reducing bias introduced by the order. We chose the fixed order
(CARE followed by WachterCF) with the reasoning that our focus is on evaluating CARE (a
method intended for practicality) against the baseline WachterCF method. By showing CARE
first we ensure that CARE is evaluated a priori without the influence of another counterfactual
explanation. If CARE were not perceived to be more practical, the WachterCF explanation shown second should obtain similar responses.
5.3. Randomized Instances
The instances shown to participants refer to individuals who initially received a negative prediction (earning less than 50k per year or failing the course) and are given a counterfactual
instance as an explanation. We randomly select twenty instances from the Adult Income test data
and ten instances from the Student Performance test data. The number differs because the Adult
Income dataset contains more instances than the Student Performance dataset. The instances
are evenly distributed among users and randomly assigned. Furthermore, counterfactual
explanations are presented in a visually appealing manner, since we want participants to
focus on the content of the explanation.
6. Results
We have collected 135 responses. 69 participants received the Adult Income dataset, and 66
received the Student Performance dataset. Additionally, 70 responses indicate that they are
familiar with ML models, while the remaining 65 indicate that they are not. We analyze the results as follows. First, we compare the responses to explanations computed by CARE with those computed by WachterCF. These results are used primarily to answer the second sub-question. Secondly, we examine the responses for both datasets separately to answer our last sub-question of whether the perception differs between classification tasks. Finally, we investigate whether there is a difference in responses among the
respondents that are or are not familiar with machine learning models.
6.1. Overall Comparison: CARE vs. WachterCF
We start by analyzing all questions that include Likert scales (Questions 2, 3, 7 & 8) and single-
choice answers (Questions 5 & 6). Then we examine questions that ask for selecting features
(Questions 1 & 4), followed by looking into the participants’ preference of the two methods
(Question 9).
6.1.1. Results to Likert Scale & Single Choice Questions
Figure 3 shows a stacked bar chart for Questions 2, 3, 5, 6, 7, and 8, comparing the responses
of WachterCF and CARE. Red colors represent negative responses, yellow colors represent
Figure 3: Responses to questions with quantitative responses (Likert scale and single choice). Red-colored answers (bottom) indicate negative responses, yellow (middle) neutral, and green (top) positive. For each question, the responses for WachterCF are compared with CARE. Based on the P values of a Wilcoxon signed-rank test, the responses differ significantly between the methods. The P values for the questions shown are, in order: 0.001, 0.002, 9.3e-09, 1.6e-06, 1.6e-07, 3.1e-06.
neutral responses, and green colors represent positive responses. Looking at the graphs, it
is noticeable that the counterfactual explanations calculated with WachterCF received more
negative responses than those calculated with CARE. We use a Wilcoxon signed-rank test to evaluate the significance of this result. This test is chosen because we assess the difference in answers between matched samples. The null hypothesis states that there is no difference between the paired responses. We aim to reject this hypothesis by showing that the responses differ significantly. The resulting P values are all smaller than 0.01 and thus confirm that the responses for WachterCF are significantly different from the responses for CARE. From this, we can conclude that the
counterfactual instances computed by CARE are considered more practical. Taking those results
into account, we map back to the defined set of practicality properties.
    • Consistent with prior beliefs: Question 2 assesses if the explanations are consistent
      with the participants’ prior beliefs by asking if the explanation is surprising to them. The
      results show that CARE provides less surprising explanations than WachterCF.
    • Contrastiveness: Question 3 shows that suggestions by CARE provide a better explana-
      tion to the user of what needs to be changed to get the different model outcome, which
      serves as an estimation for being contrasting.
    • Selectivity: Questions 5 & 6 aim to evaluate the selectivity by asking participants to
      assess the number of suggestions and variation of features. CARE is rated to select a
      better subset of features than WachterCF.
    • Social: Question 7 evaluates a social perspective, showing that CARE provides more realistic
      suggestions considering the specific use case.
    • Truthful: Question 8 shows that suggestions by CARE are perceived to make more sense
      than suggestions by WachterCF, which maps to being truthful.
Figure 4: Responses to questions with quantitative responses (Likert scale and single choice), exclusively for the classification task of the Adult Income dataset. Red-colored answers (bottom) indicate negative responses, yellow (middle) neutral, and green (top) positive. For each question, the responses for WachterCF are compared with CARE. Based on the P values of a Wilcoxon signed-rank test, only Questions 5 & 6 are significantly different. The P values for the questions shown are, in order: 0.95, 0.2, 0.001, 0.01, 0.5, 0.8.
6.1.2. Results of Questions with Feature Selection
Question 1 indicates the prior beliefs of the participants by asking what features they expect to
change to retrieve the desired outcome before seeing any explanation. Question 4 elaborates on
what features are considered important after seeing counterfactual explanations. We analyze
the responses by computing a percentage of agreement, which shows how many of the features
considered as most important after seeing the explanation were also selected in Question 1. For
example, if Age and Working Hours are considered as the most important features, but only
Age is chosen to be expected in Question 1, the percentage of agreement is 50%. By comparing
the overall percentage of agreement of WachterCF and CARE, we can see that explanations provided by CARE seem to show a higher percentage of agreement than those by WachterCF. However, according to the Wilcoxon signed-rank test, we do not have enough evidence to show a significant difference for this comparison.
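As a concrete illustration of this measure, the percentage of agreement can be computed as follows; the feature sets are the made-up example from the text, not our analysis code.

```python
# Minimal sketch of the percentage-of-agreement computation (illustrative feature sets).
def percentage_of_agreement(expected_before, important_after):
    """Share of the features rated most important after seeing the explanation (Question 4)
    that were already expected to change beforehand (Question 1)."""
    if not important_after:
        return 0.0
    overlap = set(important_after) & set(expected_before)
    return 100.0 * len(overlap) / len(important_after)

# Example from the text: Age and Working Hours are rated most important,
# but only Age was expected beforehand -> 50% agreement.
print(percentage_of_agreement({"Age"}, {"Age", "Working Hours"}))  # 50.0
```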
6.1.3. Subjective Comparison of the two methods according to participants
Question 9 directly asks the participants what method is preferred as an explanation. Out of
the 135 answers, 113 selected CARE over WachterCF, and only 22 chose WachterCF as the preferred method. Therefore, we can conclude that participants subjectively prefer CARE over
WachterCF.
6.2. Difference in Classification Task
The responses show that the perception of WachterCF and CARE differs depending on the
classification task. To further assess whether this difference influences the obtained results, we
perform the same analysis as in Section 6.1 but split the responses according to the classification task assessed by the users. To test for significance between the responses to the different methods, we again use a Wilcoxon signed-rank test. The Student Performance data shows
a similar pattern as the overall results shown in Figure 3. CARE is seen as more practical
compared to WachterCF and all differences are statistically significant. On the contrary, the
responses to the Adult Income dataset do not show the same results. Figure 4 shows the
different responses to the two counterfactual explanations for instances from the Adult Income
dataset. Only Questions 5 & 6, which ask participants to indicate their perception regarding the practicality property Selectivity, show a significant difference in answers. All other questions do not differ significantly. Question 7 even shows a slight tendency in favor of WachterCF.
We propose three reasons for this. First, when looking at the actual explanations, WachterCF often only changes the attribute Age in both datasets: Age decreases for the Student Performance dataset but increases for the Adult Income dataset. For the Adult Income task, this reflects that a higher age leads to the outcome of earning more than 50k per year. Increasing one’s age is not an action a person can take, but it happens naturally without any active change, so users may still find such an explanation satisfying. For the Student Performance dataset, on the other hand, WachterCF suggests decreasing age, since it does not include any preference settings, which makes the explanation impractical. A second reason might be that the Student Performance dataset contains more attributes that can be modified than the Adult Income dataset; more attributes may lead to more complexity in computing practical counterfactual explanations. A third reason is that the decision in the Student Performance dataset is about whether a student is likely to pass or fail a class, which is a high-impact decision. In contrast, the purpose of the Adult Income task is to decide whether a person is likely to earn more or less than 50k per year, which is a less critical decision. Therefore, participants may be more demanding regarding the practicality of the explanations when students’ lives are directly affected.
6.3. Diversity in users’ technical literacy
Lastly, we examine whether there is a difference in responses when comparing participants
who have indicated to be familiar with machine learning models to those who are not. To test
this, we divide the responses for each question into two groups indicating whether the user is
familiar with machine learning models or not. To test if the differences are significant, we use the Mann-Whitney U test because the answers come from independent samples. Of the 15 questions (seven quantitative questions for each method and one comparison question), only Question 7 for the WachterCF explanation differs significantly (p = 0.002); this question asks whether
the suggestion can be realistically implemented. Participants with a technical background
responded more positively than those with a non-technical background. One possible reason is
that Question 7 offers a relatively large leeway for interpretation compared to other questions.
Participants with a technical background might better understand what we are trying to address
with this question. In contrast, people with a non-technical background might interpret the
question differently.
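For completeness, this group comparison can be run analogously to the earlier paired test, now with SciPy's Mann-Whitney U test for independent samples; the response arrays below are hypothetical, not our study data.

```python
# Minimal sketch of the independent-samples test (hypothetical data, not the study responses).
from scipy.stats import mannwhitneyu

# Likert answers to one question, split by self-reported ML familiarity.
familiar_with_ml     = [5, 6, 4, 6, 5, 7, 5, 6]
not_familiar_with_ml = [3, 4, 4, 2, 5, 3, 4, 3]

# Mann-Whitney U test: non-parametric comparison of two independent groups.
statistic, p_value = mannwhitneyu(familiar_with_ml, not_familiar_with_ml, alternative="two-sided")
print(f"U = {statistic:.1f}, p = {p_value:.4f}")
```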
7. Discussion and Future Work
In this section, we discuss the limitations and weaknesses of the methodological design and
reflect on the results of the user study transitioning to possible future work.
7.1. Limitations and Weaknesses
One limitation is that the user study participants were selected through convenience sampling [33]. This type of sampling is prone to bias: one possible bias in the target group is that participants are most likely to be highly educated. This could lead to judgments that may not represent other social groups. Furthermore, the user study design does not represent a realistic scenario. Participants judge cases of people they do not know and to whom they have no emotional attachment. If the participants or people close to them were affected by the model outcomes, their expectations of the explanations might be higher. Another aspect of
this is that the datasets have been preprocessed before conducting the user study. This may not
be the case in a real-life scenario; therefore, the quality of counterfactual explanations could
suffer.
7.2. Impact of the scenario of an explanation
This paper shows that the responses to the majority of the questions differ depending on the
scenario the counterfactual instance is used for. This could indicate that the perception of
explanations differs according to the context of the ML decision. An explanation might be
considered practical for a specific use case but impractical for a different one by the same user.
This may indicate that the practicality of counterfactual explanations depends not only on the
counterfactual method, but also on the type of ML task (e.g. the decision), the data used to
decide, and the complexity of this decision (e.g. how many features are used to decide). Further
research is necessary to determine what conclusions can be drawn about this.
7.3. Future Work
Our findings show that counterfactual explanations generated with the more recent method are overall perceived as more practical than those generated with the original method, but also that responses differ depending on the use case of the decision. Future research should examine how the perception differs depending on the type of decision and the number of features used. In addition, it would be important to see how the results change when users themselves are affected by the decision rather than evaluating an explanation for a stranger. Another area to explore
further is how user perceptions of different explanation methods compare, such as feature
attribution methods (like LIME [40]) or causal explanations [30].
8. Conclusions
People have a right to explanation when affected by an ML decision. This helps them to understand
better why and how the model made this particular decision. Counterfactual explanations
are a way to provide transparency by showing which attributes must change in what way
to achieve a different outcome. Research has focused on developing various frameworks for
computing counterfactuals. However, counterfactual explanations must be practical to be
used by human subjects in practice. To answer our research question about the practicality of
two counterfactual methods, we first define the following properties to measure practicality:
Contrastiveness, Selectivity, Social, Truthful, and Consistent with prior beliefs of the user. To
test how people perceive the explanations, we conduct a user study to compare CARE (a more
recent multi-objective approach) against the original method WachterCF (as a baseline) for two
different scenarios. The overall responses show that people perceive explanations computed
with CARE as significantly more practical than those computed with WachterCF. Furthermore,
we find that our results differ between the Adult Income dataset and the Student Performance
dataset, which indicates that the perception might differ depending on the use case of the
explanation.
Acknowledgments
We are grateful to the participants of the user study, and we thank Emma Beauxis-Aussalet (Vrije Universiteit Amsterdam) for the thoughtful feedback that improved the findings of this paper.
Reproducibility
We provide a repository containing the code and data that was used for this research. It also includes further details on the user study design. The repository is available at https://github.com/ninaspreitzer/practicality-counterfactual-explanations.
References
 [1] B. Ustun, A. Spangher, Y. Liu, Actionable recourse in linear classification, in: Proceedings
     of the conference on fairness, accountability, and transparency, 2019, pp. 10–19.
 [2] A. E. Khandani, A. J. Kim, A. W. Lo, Consumer credit-risk models via machine-learning
     algorithms, Journal of Banking & Finance 34 (2010) 2767–2787.
 [3] I. Ajunwa, S. Friedler, C. E. Scheidegger, S. Venkatasubramanian, Hiring by algorithm:
     predicting and preventing disparate impact, Available at SSRN (2016).
 [4] L. Scism, New york insurers can evaluate your social media use-if they can prove why it’s
     needed, The Wall Street Journal (2019).
 [5] S. Herse, J. Vitale, M. Tonkin, D. Ebrahimian, S. Ojha, B. Johnston, W. Judge, M.-A. Williams,
     Do you trust me, blindly? factors influencing trust towards a robot recommender system, in:
     2018 27th IEEE international symposium on robot and human interactive communication
     (RO-MAN), IEEE, 2018, pp. 7–14.
 [6] 2018 reform of EU data protection rules. URL: https://ec.europa.eu/commission/sites/beta-political/files/data-protection-factsheet-changes_en.pdf.
 [7] B. Goodman, S. Flaxman, European union regulations on algorithmic decision-making
     and a “right to explanation”, AI magazine 38 (2017) 50–57.
 [8] C. Molnar, Interpretable machine learning, Lulu. com, 2020.
 [9] A. B. Arrieta, N. Díaz-Rodríguez, J. Del Ser, A. Bennetot, S. Tabik, A. Barbado, S. García,
     S. Gil-López, D. Molina, R. Benjamins, et al., Explainable artificial intelligence (xai):
     Concepts, taxonomies, opportunities and challenges toward responsible ai, Information
     fusion 58 (2020) 82–115.
[10] S. Wachter, B. Mittelstadt, L. Floridi, Why a right to explanation of automated decision-
     making does not exist in the general data protection regulation, International Data Privacy
     Law 7 (2017) 76–99.
[11] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160.
[12] K. Sokol, P. Flach, Explainability fact sheets: a framework for systematic assessment of
     explainable approaches, in: Proceedings of the 2020 Conference on Fairness, Accountability,
     and Transparency, 2020, pp. 56–67.
[13] M. Langer, D. Oster, T. Speith, H. Hermanns, L. Kästner, E. Schmidt, A. Sesing, K. Baum,
     What do we want from explainable artificial intelligence (xai)?–a stakeholder perspective
     on xai and a conceptual model guiding interdisciplinary xai research, Artificial Intelligence
     296 (2021) 103473.
[14] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of
     methods for explaining black box models, ACM computing surveys (CSUR) 51 (2018) 1–42.
[15] T. Laugel, M.-J. Lesot, C. Marsala, M. Detyniecki, Issues with post-hoc counterfactual
     explanations: a discussion, arXiv preprint arXiv:1906.04774 (2019).
[16] S. Wachter, B. Mittelstadt, C. Russell, Counterfactual explanations without opening the
     black box: Automated decisions and the gdpr, Harv. JL & Tech. 31 (2017) 841.
[17] F. Doshi-Velez, M. Kortz, R. Budish, C. Bavitz, S. Gershman, D. O’Brien, K. Scott, S. Schieber,
     J. Waldo, D. Weinberger, et al., Accountability of ai under the law: The role of explanation,
     arXiv preprint arXiv:1711.01134 (2017).
[18] A. Artelt, B. Hammer, On the computation of counterfactual explanations–a survey, arXiv
     preprint arXiv:1911.07749 (2019).
[19] P. Rasouli, I. C. Yu, Care: Coherent actionable recourse based on sound counterfactual
     explanations, arXiv preprint arXiv:2108.08197 (2021).
[20] A. Redelmeier, M. Jullum, K. Aas, A. Løland, Mcce: Monte carlo sampling of realistic
     counterfactual explanations, arXiv preprint arXiv:2111.09790 (2021).
[21] S. Dandl, C. Molnar, M. Binder, B. Bischl, Multi-objective counterfactual explanations,
     in: International Conference on Parallel Problem Solving from Nature, Springer, 2020, pp.
     448–469.
[22] R. K. Mothilal, A. Sharma, C. Tan, Explaining machine learning classifiers through di-
     verse counterfactual explanations, in: Proceedings of the 2020 Conference on Fairness,
     Accountability, and Transparency, 2020, pp. 607–617.
[23] M. Downs, J. L. Chu, Y. Yacoby, F. Doshi-Velez, W. Pan, Cruds: Counterfactual recourse
     using disentangled subspaces, ICML WHI 2020 (2020) 1–23.
[24] S. Joshi, O. Koyejo, W. Vijitbenjaronk, B. Kim, J. Ghosh, Towards realistic individual
     recourse and actionable explanations in black-box decision making systems, arXiv preprint
     arXiv:1907.09615 (2019).
[25] M. Pawelczyk, K. Broelemann, G. Kasneci, Learning model-agnostic counterfactual expla-
     nations for tabular data, in: Proceedings of The Web Conference 2020, 2020, pp. 3126–3132.
[26] F. Yang, S. S. Alva, J. Chen, X. Hu, Model-based counterfactual synthesizer for interpreta-
     tion, in: Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery &
     Data Mining, 2021, pp. 1964–1974.
[27] S. Barocas, A. D. Selbst, M. Raghavan, The hidden assumptions behind counterfactual
     explanations and principal reasons, in: Proceedings of the 2020 Conference on Fairness,
     Accountability, and Transparency, 2020, pp. 80–89.
[28] M. Förster, P. Hühn, M. Klier, K. Kluge, Capturing users’ reality: A novel approach
     to generate coherent counterfactual explanations, in: Proceedings of the 54th Hawaii
     International Conference on System Sciences, 2021, p. 1274.
[29] K. Kanamori, T. Takagi, K. Kobayashi, H. Arimura, Dace: Distribution-aware counterfactual
     explanation by mixed-integer linear optimization., in: IJCAI, 2020, pp. 2855–2862.
[30] G. Warren, M. T. Keane, R. M. Byrne, Features of explainability: How users understand
     counterfactual and causal explanations for categorical and continuous features in xai,
     arXiv preprint arXiv:2204.10152 (2022).
[31] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artificial
     intelligence 267 (2019) 1–38.
[32] R. F. Woolson, Wilcoxon signed-rank test, Wiley encyclopedia of clinical trials (2007) 1–3.
[33] I. Etikan, S. A. Musa, R. S. Alkassim, et al., Comparison of convenience sampling and
     purposive sampling, American journal of theoretical and applied statistics 5 (2016) 1–4.
[34] D. Dua, C. Graff, UCI machine learning repository, 2017. URL: http://archive.ics.uci.edu/ml.
[35] A. Frank, A. Asuncion, UCI machine learning repository, University of California, Irvine, School of Information and Computer Science, 2010. URL: http://archive.ics.uci.edu/ml.
[36] P. Cortez, A. M. G. Silva, Using data mining to predict secondary school student perfor-
     mance (2008).
[37] H. Zhu, Predicting earning potential using the adult dataset, 2016. Retrieved December 5, 2016.
[38] D. V. Carvalho, E. M. Pereira, J. S. Cardoso, Machine learning interpretability: A survey
     on methods and metrics, Electronics 8 (2019) 832.
[39] R. S. Nickerson, Confirmation bias: A ubiquitous phenomenon in many guises, Review of
     general psychology 2 (1998) 175–220.
[40] M. T. Ribeiro, S. Singh, C. Guestrin, “Why should I trust you?”: Explaining the predictions
     of any classifier, in: Proceedings of the 22nd ACM SIGKDD international conference on
     knowledge discovery and data mining, 2016, pp. 1135–1144.
A. Overview of preprocessed Adult Income dataset
Age (Continuous): 17 - 90
Working Hours (Continuous): 2 - 99
Gender (Discrete): Female, Male
Race (Discrete): White, Other3
Education Level (Discrete): Less than High School, High School Graduate, Some College, Associate’s Degree, Bachelor’s Degree, Master’s Degree, Doctoral Degree, Professional Degree
Marital Status (Discrete): Single, Married, Separated, Divorced, Widowed
Occupation (Discrete): Blue-Collar, White-Collar, Professional, Sales, Service, Other/Unknown
Industry Type (Discrete): Government, Private, Self-Employed, Other/Unknown
 3. Please note that this racial distinction is taken from previous research and was chosen for simplicity.
 We are aware that it does not properly reflect all ethnicities.
B. Overview of preprocessed Student Performance dataset
Age (Continuous): 15 - 19
Absences (Continuous): 0 - 30
Gender (Discrete): Female, Male
Extra educational support (Discrete): Yes, No
Family educational support (Discrete): Yes, No
Paid tutor classes (Discrete): Yes, No
Study Time (Discrete): Very low, Low, Medium, High, Very high
Freetime (Discrete): Very low, Low, Medium, High, Very high
Going out (Discrete): Very low, Low, Medium, High, Very high