<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3375627.3375850</article-id>
      <title-group>
        <article-title>Using Longitudinal Data for Plausible Counterfactual Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Asemota</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giles Hooker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California Berkeley</institution>
          ,
          <addr-line>Berkeley, California</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia, Pennsylvania</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>3</volume>
      <issue>2016</issue>
      <fpage>623</fpage>
      <lpage>631</lpage>
      <abstract>
        <p>Counterfactual explanations are a common approach to providing recourse to data subjects. However, current methodology can produce counterfactuals that cannot be achieved by the subject, making the use of counterfactuals for recourse difficult to justify in practice. Though there is agreement that plausibility is an important quality when using counterfactuals for algorithmic recourse, ground truth plausibility continues to be difficult to quantify. In this paper, we propose using longitudinal data to assess and improve plausibility in counterfactuals. In particular, we develop a metric that compares longitudinal differences to counterfactual differences, allowing us to evaluate how similar a counterfactual is to prior observed changes. Furthermore, we use this metric to generate plausible counterfactuals. Finally, we discuss some of the inherent difficulties of using counterfactuals for recourse.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the past two decades, machine learning and artificial intelligence have become intertwined
with broad swaths of society, from education to criminal justice to consumer finance.
Throughout this transition away from human decision-makers and towards algorithmic decision-makers,
researchers, practitioners, and advocates have emphasized the need for explainability and
transparency. Approaches to explainability have varied widely, from creating novel ’glass-box’ model
architectures to developing post-hoc local explainability techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [2]. Of particular interest
in the past five years are counterfactual explanations, which explain an individual prediction by
finding a, in some sense, small change that achieves a desired prediction [3]. In contrast to most
explainability techniques, counterfactual explanations seek to explain algorithmic decisions to
data subjects.
      </p>
      <p>Although there has been substantial work in the domain of machine learning explainability,
significant gaps exist regarding the utility of explanations to data subjects. Unlike concepts
such as accuracy or sparsity, subject utility has neither a simple nor agreed upon mathematical
definition. Consequently, counterfactual explanation methods optimize subject utility using
disparate approaches [4] [5]. Terms such as plausibility, validity, and actionability are used to
describe different aspects of the utility of counterfactuals. Plausibility, the main focus of this
paper, requires that a counterfactual is a possible state of being [6]. Nonetheless, generating
plausible counterfactuals is not a simple task. Substantial effort has been devoted to developing
methods for plausible counterfactuals, but there are no agreed upon approaches or even metrics
for plausibility. Additional effort has gone into using counterfactuals to provide recourse to
data subjects [7]. Recourse is a stricter goal, requiring that a counterfactual be useful to a data
subject in pursuing a desired decision. Therefore, plausibility is necessary for counterfactuals
to be used for recourse.</p>
      <p>This paper proposes a novel approach to evaluating plausibility using longitudinal data. We
begin by briefly reviewing approaches to improving plausibility in counterfactuals, discussing in
particular persistent pitfalls. We then introduce a longitudinal distance metric for counterfactual
explanations. In introducing our metric, we bring forth the benefits of using longitudinal data
as a proxy for plausibility and mention some limitations. Next, we perform experiments with
our metric to evaluate the use of longitudinal data for plausibility. We also explore some of the
consequences of requiring plausibility. Finally, we discuss the implications of our results in the
broader context of providing recourse through counterfactual explanations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Offering Recourse Through Counterfactuals</title>
      <p>A common motivation for counterfactual explanations is to provide data subjects with a path to
recourse. Counterfactuals are unique in their ability to not only explain algorithmic decisions
to a lay audience, but also explain how someone could receive a desired decision. That is, a
counterfactual informs a subject not only why they received a decision, but what to change
and how much to change. Therefore, counterfactuals have the potential to greatly increase
transparency and accountability in algorithmic decision-making.</p>
      <p>However, persistent gaps exist between the ideal scenario and counterfactuals in practice.
Centrally, current methodology fails to consistently produce plausible or achievable explanations.
Here, we use the terms plausible and achievable to refer to objective and subjective perspectives
of the difficulty of pursuing a given counterfactual. A counterfactual is plausible if it respects
constraints on reality, for example, not changing ethnicity or decreasing age. On the other hand,
a counterfactual is achievable if the relevant subject can achieve it. It is generally plausible for
someone to increase their level of education, but it may not be achievable for a given individual.
These definitions themselves elucidate the difficulty of offering recourse through counterfactual
explanations; how do we know if a data subject can act on a particular recommendation?</p>
      <p>Existing counterfactual explanation methods use proxies for plausibility and achievability in
an attempt to avoid implausible recommendations. Two proxies are most common: relying on
user constraints and leveraging structure in data [5][8][4][9][10][7]. Users (i.e. the person
using counterfactuals to explain an algorithm) often have domain expertise on how data subjects
can change. However, relying solely on users risks introducing social bias to explanations.
Data offer some opportunity to craft constraints objectively, but existing methods to enforce
plausibility using data are insufficient. Current data-based plausibility constraints either assume
individuals are interchangeable or require causal modeling. The former approach does not
reflect the complexities of recommending changes to people, and the latter requires significant
(and often unavailable) knowledge on the part of the user.</p>
      <p>Ultimately, proxies are needed to produce plausible or achievable counterfactuals at scale. We
may not know what an individual can or cannot achieve, or we may have incomplete information
of relationships between the features in our model. However, existing methodologies often use
proxies that insufficiently penalize implausible explanations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Longitudinal Data as a Proxy for Plausibility</title>
      <p>As discussed in Section 2, counterfactual explanations are often motivated by the goal of offering
recourse to data subjects, though there are persistent issues that prevent most methods from
providing recourse. If we view counterfactual explanations as potential paths to recourse, then
we can conceptualize them as recommendations for algorithmic subjects. Specifically, we can
view counterfactuals as recommendations for changes that a data subject can make to receive
a desired decision at some point in the future. Conceptualizing counterfactuals as potential
states of being forward in time naturally leads to considering longitudinal likelihoods. That is,
when making recommendations for the future, we should consider prior observed changes over
time. This perspective leads us to the primary goal of this paper: leveraging longitudinal data
to assess and improve plausibility in counterfactual explanations.</p>
      <p>We introduce a distance metric that compares prior observed changes to proposed changes
in the form of counterfactual explanations. Let X_1, X_2 ∈ ℝ^{n×p} be n observations of p features
at two different points in time. Subsequently, let D = X_2 − X_1, that is, the change in the
observed features over time. Now we define our distance metric
d(x, c; D, k) = min_{ℐ : |ℐ| = k} (1/k) Σ_{i ∈ ℐ} ‖(c − x) − D_i‖
(1)
where ℐ is an index set for prior observed changes, k is the desired size of the index set, x is the
example to be explained, and c is the counterfactual explanation. In summary, we compare a
proposed difference to the k closest observed differences and average them. By increasing k, we require
that the proposed difference be similar to a larger number of observed differences.</p>
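      <p>As a concrete illustration, the metric in Equation (1) can be computed in a few lines of numpy. This is our own sketch with the L1 norm, not a released implementation; the names x, c, D, and k follow the notation above.</p>
      <preformat>
```python
import numpy as np

def longitudinal_distance(x, c, D, k):
    """Equation (1): average L1 distance between the proposed change
    (c - x) and its k nearest previously observed changes D_i."""
    delta = c - x                             # proposed change for this subject
    dists = np.abs(delta - D).sum(axis=1)     # L1 distance to each observed change
    nearest = np.partition(dists, k - 1)[:k]  # k smallest, without a full sort
    return nearest.mean()
```
      </preformat>
      <p>A counterfactual whose implied change closely resembles at least k changes already seen in the longitudinal data receives a small distance, and hence ranks as more plausible.</p>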
      <p>We can justify and augment our approach in the following ways:
• Since there is likely a large variety in observed trajectories, we average the k most
similar. This allows room for heterogeneity across trajectories without allowing a single
rare trajectory to dominate our metric.
• In the likely chance that our data contain heterogeneous features, we can normalize
our distance metric across features. Here, we consider dividing features by a metric for
the dispersion of their observed differences. Our experiments use the median absolute
deviation (MAD) or average absolute deviation (AAD), but other approaches can be
implemented.
• For categorical features, several options can be exercised. For binary features, the average
absolute deviation can be appropriate. For multi-class features, we normalize by the rate
at which changes are observed in the longitudinal data.
• Normalization can empower discovery of implausible counterfactual explanations. If a
feature has a high normalization value (e.g. 1 over the MAD), then changes to that feature
are rarely seen.</p>
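      <p>The AAD-based normalization described above can be sketched as follows; the function name is ours, and the 10⁻⁵ tolerance mirrors the one used in the experiment of Section 4.</p>
      <preformat>
```python
import numpy as np

def aad_weights(D, tol=1e-5):
    """Per-feature normalization: 1 over the average absolute deviation
    (AAD) of the observed changes D. The tolerance guards against
    division by zero for features never observed to change; such
    features receive a very large weight, flagging changes to them
    as implausible."""
    aad = np.abs(D - D.mean(axis=0)).mean(axis=0)
    return 1.0 / (aad + tol)
```
      </preformat>
      <p>Multiplying each per-feature term |(c − x) − D_i| by these weights before summing yields the normalized longitudinal distance.</p>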
      <p>Our proposed metric is flexible in its use; we can use it both during and after generating
counterfactuals. Post-generation, the longitudinal distance metric can be used to evaluate
and rank the plausibility of explanations. During generation, the distance metric can be used
to further constrain the search space. Notice that in comparing proposed changes to those
observed in longitudinal data, we not only re-weight distances on a per-feature basis, but we also
incorporate dependencies between feature changes; ruling out a requirement to, for example,
both change profession and increase the length of tenure in your current job.</p>
      <p>A first approach to incorporating our longitudinal metric is to re-score a proposed collection
of counterfactuals. Stochastic search algorithms used in [5] and [9] return a set of possible
counterfactual explanations, usually incorporating a geometric distance metric; these can then
be examined or prioritized by longitudinal distance. We generate counterfactuals using the
methods in [5], and then score them by plausibility. This two-step approach allows us to use the
more regular geometric distance for optimization, providing a more efficient search of feature
space. In this paper, it also allows us to examine the baseline plausibility of counterfactuals
generated with a geometric distance. In the appendix, we offer one simple way to generate
counterfactuals using our longitudinal metric.</p>
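      <p>Post-generation, the two-step approach amounts to scoring and sorting a candidate set. A minimal sketch follows (function names are ours; any generator returning an array of candidate counterfactuals, such as DiCE's output set, would do):</p>
      <preformat>
```python
import numpy as np

def longitudinal_distance(x, c, D, k):
    """Equation (1) with the L1 norm, repeated here for completeness."""
    dists = np.abs((c - x) - D).sum(axis=1)
    return np.partition(dists, k - 1)[:k].mean()

def rank_by_plausibility(x, candidates, D, k=1):
    """Order generated counterfactuals by longitudinal distance,
    most plausible (smallest distance) first."""
    scores = np.array([longitudinal_distance(x, c, D, k) for c in candidates])
    order = np.argsort(scores)
    return [candidates[i] for i in order], scores[order]
```
      </preformat>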
    </sec>
    <sec id="sec-4">
      <title>4. Ranking Explanations from DiCE and MIMIC-III</title>
      <p>To ground our approach, we consider an example from healthcare. Suppose a predictive model
is being used to assess patient risk for a disease. Counterfactuals may be useful to doctors
by explaining individual predictions and showing potential paths to decreased risk. In this
experiment, we use MIMIC-III, an electronic health records (EHR) dataset, to predict acute
respiratory failure (ARF) within four hours of admission [11] [12]. Specifically, we use a version
of MIMIC-III that has been preprocessed by FIDDLE, an EHR data processing pipeline [13]
[14]. Our longitudinal data consists of measurements when the patient is admitted and repeat
measurements four hours after admission. We train a random forests model to predict ARF
using only the first time step of data, and use DiCE [5] to generate ten counterfactuals for
individuals in the test set who are predicted to have ARF. Our dataset contains 1350 features to
train our model, twenty of which are derivations from vital signs.</p>
      <p>To rank counterfactuals, we use the longitudinal differences between the first and fourth
hour of data. We normalize our metric using the AAD of the longitudinal differences, and we
add a small tolerance (10⁻⁵) to prevent division by zero. Finally, we conduct this experiment in
two different settings: allowing all features to be changed (ALL) and allowing only vital signs to
be changed (VITAL). These two settings should show us how our longitudinal distance metric
can help assess plausibility, both when we know what features can be changed and when we
may be unsure how to constrain the counterfactual search space. With this experiment, we seek
to assess the discriminatory ability of our distance metric and subsequently evaluate plausibility
for counterfactual explanations.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>We begin by looking at the relationship between the geometric distance and longitudinal
distance. Figure 1 B and C plot the L1 distance compared to the longitudinal distance for
explanations using only vital signs and all features respectively. Overall, our metric is loosely
correlated with the L1 distance, but there are significant deviations based on which features are
changed. In the case of explanations that can only change vital signs, there is a noticeable linear
relationship. Vital signs in our dataset generally change at a similar rate, so the cost of changing
one or the other is similar. Therefore, the L1 distance is closely related to the longitudinal
distance.</p>
        <p>Looking at explanations that consider all features, the relationship is much more tenuous.
Some features rarely change, leading to significant jumps in our distance metric even when only
one feature is changed. Additionally, about half of the explanations change some immutable
feature, such as Hospital Ward or Religion. Some explanations also change features that are
mutable in theory, but not observed to have changed. Since the AAD is zero for some features,
changing those features results in a large distance value. The resulting distance is above 10⁵;
that is, 1/(0 + 10⁻⁵) = 10⁵. Moreover, changing features without observed changes is not a
rare occurrence. When we allowed any feature to be changed, 74 percent of the counterfactuals
generated had a longitudinal distance value above 10⁵. This is in stark contrast to VITAL, where
there are no explanations with a longitudinal distance value above 20.</p>
        <p>Next, we consider plausibility at the individual level. Each individual receives ten
counterfactuals, but our metric will help us see how many plausible counterfactuals an individual receives
on average. Figure 1 A shows the proportion of explanations below a threshold, and the vertical
line represents the distance value of a change that has occurred once in our train set. For the
purposes of this experiment, we consider counterfactuals below that threshold to be plausible.
We can see that not only are explanations less plausible on average in ALL, the proportion of
individuals with plausible explanations is significantly lower. Consequently, the vast majority
of individuals in our test set do not receive a plausible counterfactual if we consider changing
all features. Constraining to vital signs, however, leads to only plausible counterfactuals.</p>
        <p>Though we have seen that constraints can improve plausibility, it is important to consider
the effect constraints have on validity. We consider a counterfactual 'valid' if it has the desired
prediction (i.e. not at risk of ARF). Notably, constraints may prevent the generation of valid
counterfactuals by not allowing changes to predictive features. While we were able to generate
counterfactuals for all 229 individuals in the test set with ALL, we only generated counterfactuals
for 120 individuals with VITAL. This disparity raises concerns around the tension between
plausibility and validity; we can improve plausibility by constraining our search space, but we
may constrain counterfactuals in a way that degrades validity.</p>
        <p>In summary, our experiment elucidates the following:
• Longitudinal data allows us to detect and penalize unobserved changes in counterfactuals
• Constraints can improve plausibility
• Plausibility and validity are in tension with each other</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Supplementary Results</title>
          <p>In the supplement to this paper, we provide further results, specifically regarding the use of our
metric to generate counterfactuals.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>In this paper, we only consider a longitudinal distance metric. Future work should explore
modeling longitudinal data to further improve plausibility constraints. Future work should also
consider implementing intermediate steps across time, and modeling plausibility in terms of
the subject’s current features. Additionally, our approach is computationally complex due to
row-wise comparison of matrices. Further work can investigate decreasing this complexity,
potentially using prototypes or other clustering methods.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Genetic Longitudinal Counterfactuals</title>
      <p>In addition to using our metric after generating counterfactuals, we present a method that
leverages longitudinal data for generating counterfactuals using a genetic algorithm. In the genetic
algorithm, largely borrowed from [5] and [8], we begin by generating a random population of
the desired class. Then, we assess the fitness of the population relative to our input and rank the
population by fitness. The top half of the population is then mated (i.e. features are randomly
chosen between two individuals). The next generation in the algorithm is made up of the top
half of the current generation and their offspring. We repeat this cycle until the best fitness
does not change substantially.</p>
      <p>Algorithm 1 provides pseudocode for the above description. This algorithm is flexible enough
to allow for a variety of fitness metrics, and in this paper we use our longitudinal metric to
generate counterfactuals constrained by longitudinal distances. Equations (2) and (3) show two metrics that
are used in [5] to generate counterfactuals.</p>
      <p>Algorithm 1 Genetic Counterfactuals</p>
      <p>Input: (subject input x, desired outcome y, model f)</p>
      <p>POP ← InitialPopulation(x, y)
currentBest ← ∞
repeat
  prevBest ← currentBest
  mostFit ← SelectFittest(POP, x, f)
  currentBest ← BestFitness(mostFit)
  POP ← mostFit ∪ Mate(mostFit)
until currentBest ≈ prevBest
return POP</p>
      <p>d_prox(x, c) = Σ_{i ∈ continuous} (1/MAD_i) |x_i − c_i| + Σ_{j ∈ categorical} 𝟙(x_j ≠ c_j)
(2)</p>
      <p>d_sparse(x, c) = Σ_i 𝟙(x_i ≠ c_i)
(3)</p>
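      <p>A runnable sketch of Algorithm 1 with the longitudinal metric as the fitness function is given below. It is a simplification assuming numeric features only; the helper names, penalty scheme, and random initialization are our own illustrative choices, not the exact implementation.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_counterfactuals(x, model, desired, D, pop_size=50,
                            k=5, n_iter=200, tol=1e-6):
    """Genetic search for counterfactuals (Algorithm 1), using the
    longitudinal distance as fitness (lower is fitter). `model` maps
    a batch of rows to predicted labels; candidates with the wrong
    predicted class are assigned infinite fitness."""
    k = min(k, len(D))

    def fitness(pop):
        f = np.empty(len(pop))
        for i, c in enumerate(pop):
            dists = np.abs((c - x) - D).sum(axis=1)
            f[i] = np.partition(dists, k - 1)[:k].mean()
        f[model(pop) != desired] = np.inf  # invalid candidates lose
        return f

    def mate(parents):
        # offspring take each feature at random from two random parents
        a = parents[rng.integers(len(parents), size=len(parents))]
        b = parents[rng.integers(len(parents), size=len(parents))]
        return np.where(rng.random(a.shape) < 0.5, a, b)

    pop = x + rng.normal(scale=2.0, size=(pop_size, x.shape[0]))
    current_best = np.inf
    for _ in range(n_iter):
        f = fitness(pop)
        order = np.argsort(f)
        most_fit = pop[order[: pop_size // 2]]
        prev_best, current_best = current_best, f[order[0]]
        if abs(current_best - prev_best) < tol:
            break
        pop = np.vstack([most_fit, mate(most_fit)])
    return pop[np.argsort(fitness(pop))]
```
      </preformat>
      <p>Elitism (carrying the top half forward) makes the best fitness non-increasing, so the loop terminates once it stops improving.</p>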
      <sec id="sec-6-1">
        <title>A.1. Experiment with Adult-Income Dataset</title>
        <p>We performed an experiment to compare counterfactuals generated with and without our
longitudinal metric. In the ’Default’ algorithm, we optimize sparsity and proximity, and in the
’Longitudinal’ algorithm, we optimize proximity and longitudinal distance.</p>
        <p>We use Adult-Income, a common dataset used in fairness and explainability research [15]
[16]. The task is to use an individual’s demographic and economic information to predict if
their income is above 50k. To augment our experiment, we also consider a threshold of 30k.
Having two different thresholds should help us understand how plausibility interacts with
the rarity of a desired decision. In Adult-Income, 24 percent of individuals have an income
above 50k, compared to 44 percent who have an income above 30k. We expect that the lower
threshold will lead to more plausible counterfactuals and higher validity for both the ’Default’
and ’Longitudinal’ methods.</p>
        <p>Though the dataset does not contain any longitudinal data, it is simple enough to reason about
what data subjects might look like over time. Therefore, we conduct a simple simulation to
generate longitudinal data: we randomly allow some individuals to swap careers with someone
else in their education class. We also allow some individuals to increase their level of education
before moving to a new career. When swapping careers, all non-demographic variables are
swapped (hours-per-week, occupation, and capital loss/gain). Finally, the simulation increases
age randomly between one and ten years. This simulation shows some of the ways people can
change their economic conditions without allowing any changes on immutable features, such
as race or nationality. However, some features we do not include, such as marital status, can
change in practice. We focus on allowing changes to features that could potentially be included
in a recommendation.</p>
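        <p>For concreteness, the simulation above can be sketched in pandas as follows. The column names follow the standard Adult-Income encoding, and the swap and upgrade rates are illustrative parameters rather than the exact values used in our experiment.</p>
        <preformat>
```python
import numpy as np
import pandas as pd

def simulate_longitudinal(df, swap_frac=0.3, edu_frac=0.1, seed=0):
    """Simulate a later observation of each individual: some swap
    careers (all non-demographic economic features) with another
    person in the same education class, a subset of those first gain
    one education level, and everyone ages between 1 and 10 years.
    Immutable features such as race are never changed."""
    rng = np.random.default_rng(seed)
    future = df.copy()
    econ = ["occupation", "hours-per-week", "capital-gain", "capital-loss"]

    movers = rng.random(len(df)) < swap_frac
    upgrade = movers & (rng.random(len(df)) < edu_frac)
    future.loc[upgrade, "education-num"] += 1  # education rises first

    # swap economic features within each (possibly new) education class
    for _, idx in future[movers].groupby("education-num").groups.items():
        partners = rng.permutation(np.asarray(idx))
        future.loc[idx, econ] = df.loc[partners, econ].to_numpy()

    future["age"] += rng.integers(1, 11, size=len(df))  # age 1-10 years
    return future
```
        </preformat>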
        <p>Metrics for validity are presented in Table 1. Example counterfactuals are presented in Table
2. Overall, we find that the 'Longitudinal' algorithm produces fewer valid counterfactuals, but
also fewer counterfactuals with impossible changes.</p>
        <sec id="sec-6-1-1">
          <title>Metric</title>
          <p>[Table 1 was garbled in extraction. It reports, for each algorithm ('Default' and 'Longitudinal') and income threshold, the metrics Mean Validity, % validity=0, % validity=1, and % immutable; the individual cell values are not recoverable.]</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>Accurate intelligible models with pairwise interactions</article-title>
          ,
          <source>in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '13, Association for Computing Machinery, 2013.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>