<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="doi">10.1145/3375627.3375850</article-id>
      <title-group>
        <article-title>Using Longitudinal Data for Plausible Counterfactual Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Asemota</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giles Hooker</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of California Berkeley</institution>
          ,
          <addr-line>Berkeley, California</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Pennsylvania</institution>
          ,
          <addr-line>Philadelphia, Pennsylvania</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>3</volume>
      <issue>2016</issue>
      <fpage>623</fpage>
      <lpage>631</lpage>
      <abstract>
        <p>Counterfactual explanations are a common approach to providing recourse to data subjects. However, current methodology can produce counterfactuals that cannot be achieved by the subject, making the use of counterfactuals for recourse difficult to justify in practice. Though there is agreement that plausibility is an important quality when using counterfactuals for algorithmic recourse, ground truth plausibility continues to be difficult to quantify. In this paper, we propose using longitudinal data to assess and improve plausibility in counterfactuals. In particular, we develop a metric that compares longitudinal differences to counterfactual differences, allowing us to evaluate how similar a counterfactual is to prior observed changes. Furthermore, we use this metric to generate plausible counterfactuals. Finally, we discuss some of the inherent difficulties of using counterfactuals for recourse.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the past two decades, machine learning and artificial intelligence have become intertwined
with broad swaths of society, from education to criminal justice to consumer finance.
Throughout this transition away from human decision-makers and towards algorithmic decision-makers,
researchers, practitioners, and advocates have emphasized the need for explainability and
transparency. Approaches to explainability have varied widely, from creating novel ’glass-box’ model
architectures to developing post-hoc local explainability techniques [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] [2]. Of particular interest
in the past five years are counterfactual explanations, which explain an individual prediction by
finding a, in some sense, small change that achieves a desired prediction [3]. In contrast to most
explainability techniques, counterfactual explanations seek to explain algorithmic decisions to
data subjects.
      </p>
      <p>Although there has been substantial work in the domain of machine learning explainability,
significant gaps exist regarding the utility of explanations to data subjects. Unlike concepts
such as accuracy or sparsity, subject utility has neither a simple nor agreed upon mathematical
definition. Consequently, counterfactual explanation methods optimize subject utility using
disparate approaches [4] [5]. Terms such as plausibility, validity, and actionability are used to
describe different aspects of the utility of counterfactuals. Plausibility, the main focus of this
paper, requires that a counterfactual is a possible state of being [6]. Nonetheless, generating
plausible counterfactuals is not a simple task. Substantial effort has been devoted to developing
methods for plausible counterfactuals, but there are no agreed upon approaches or even metrics
for plausibility. Additional effort has gone into using counterfactuals to provide recourse to
data subjects [7]. Recourse is a stricter goal, requiring that a counterfactual be useful to a data
subject in pursuing a desired decision. Therefore, plausibility is necessary for counterfactuals
to be used for recourse.</p>
      <p>This paper proposes a novel approach to evaluating plausibility using longitudinal data. We
begin by briefly reviewing approaches to improving plausibility in counterfactuals, discussing in
particular persistent pitfalls. We then introduce a longitudinal distance metric for counterfactual
explanations. In introducing our metric, we bring forth the benefits of using longitudinal data
as a proxy for plausibility and mention some limitations. Next, we perform experiments with
our metric to evaluate the use of longitudinal data for plausibility. We also explore some of the
consequences of requiring plausibility. Finally, we discuss the implications of our results in the
broader context of providing recourse through counterfactual explanations.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Offering Recourse Through Counterfactuals</title>
      <p>A common motivation for counterfactual explanations is to provide data subjects with a path to
recourse. Counterfactuals are unique in their ability to not only explain algorithmic decisions
to a lay audience, but also explain how someone could receive a desired decision. That is, a
counterfactual informs a subject not only why they received a decision, but what to change
and how much to change. Therefore, counterfactuals have the potential to greatly increase
transparency and accountability in algorithmic decision-making.</p>
      <p>However, persistent gaps exist between the ideal scenario and counterfactuals in practice.
Centrally, current methodology fails to consistently produce plausible or achievable explanations.
Here, we use the terms plausible and achievable to refer to objective and subjective perspectives
of the difficulty of pursuing a given counterfactual. A counterfactual is plausible if it respects
constraints on reality, for example, not changing ethnicity or decreasing age. On the other hand,
a counterfactual is achievable if the relevant subject can achieve it. It is generally plausible for
someone to increase their level of education, but it may not be achievable for a given individual.
These definitions themselves elucidate the difficulty of offering recourse through counterfactual
explanations; how do we know if a data subject can act on a particular recommendation?</p>
      <p>Existing counterfactual explanation methods use proxies for plausibility and achievability in
an attempt to avoid implausible recommendations. Two proxies are most common: relying on
user constraints and leveraging structure in data [5][8][4][9][10][7]. Users (i.e. the person
using counterfactuals to explain an algorithm) often have domain expertise on how data subjects
can change. However, relying solely on users risks introducing social bias to explanations.
Data offer some opportunity to craft constraints objectively, but existing methods to enforce
plausibility using data are insufficient. Current data-based plausibility constraints either assume
individuals are interchangeable or require causal modeling. The former approach does not
reflect the complexities of recommending changes to people, and the latter requires significant
(and often unavailable) knowledge on the part of the user.</p>
      <p>Ultimately, proxies are needed to produce plausible or achievable counterfactuals at scale. We
may not know what an individual can or cannot achieve, or we may have incomplete information
of relationships between the features in our model. However, existing methodologies often use
proxies that insufficiently penalize implausible explanations.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Longitudinal Data as a Proxy for Plausibility</title>
      <p>As discussed in Section 2, counterfactual explanations are often motivated by the goal of offering
recourse to data subjects, though there are persistent issues that prevent most methods from
providing recourse. If we view counterfactual explanations as potential paths to recourse, then
we can conceptualize them as recommendations for algorithmic subjects. Specifically, we can
view counterfactuals as recommendations for changes that a data subject can make to receive
a desired decision at some point in the future. Conceptualizing counterfactuals as potential
states of being forward in time naturally leads to considering longitudinal likelihoods. That is,
when making recommendations for the future, we should consider prior observed changes over
time. This perspective leads us to the primary goal of this paper: leveraging longitudinal data
to assess and improve plausibility in counterfactual explanations.</p>
      <p>We introduce a distance metric that compares prior observed changes to proposed changes
in the form of counterfactual explanations. Let X_1, X_2 ∈ ℝ^{n×p} be n observations of p features
at two different points in time. Subsequently, let D = X_2 − X_1, that is, the change in the
observed features over time. Now we define our distance metric
d(x, c; D, k) = min_{ℐ : |ℐ| = k} (1/k) Σ_{i ∈ ℐ} ‖(c − x) − D_i‖
(1)
where ℐ is an index set for prior observed changes, k is the desired size of the index set, x is the
example to be explained, and c is the counterfactual explanation. In summary, we compare a
proposed difference to the k closest observed differences and average them. By increasing k, we require
that the proposed difference be similar to a larger number of observed differences.</p>
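      <p>As a concrete illustration, the metric in Equation (1) can be computed in a few lines of numpy. This is our own sketch with the L1 norm, not a released implementation; the names x, c, D, and k follow the notation above.</p>
      <preformat>
```python
import numpy as np

def longitudinal_distance(x, c, D, k):
    """Equation (1): average L1 distance between the proposed change
    (c - x) and its k nearest previously observed changes D_i."""
    delta = c - x                             # proposed change for this subject
    dists = np.abs(delta - D).sum(axis=1)     # L1 distance to each observed change
    nearest = np.partition(dists, k - 1)[:k]  # k smallest, without a full sort
    return nearest.mean()
```
      </preformat>
      <p>A counterfactual whose implied change closely resembles at least k changes already seen in the longitudinal data receives a small distance, and hence ranks as more plausible.</p>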
      <p>We can justify and augment our approach in the following ways:
• Since there is likely a large variety in observed trajectories, we average the k most
similar. This allows room for heterogeneity across trajectories without allowing a single
rare trajectory to dominate our metric.
• In the likely chance that our data contain heterogeneous features, we can normalize
our distance metric across features. Here, we consider dividing features by a metric for
the dispersion of their observed differences. Our experiments use the median absolute
deviation (MAD) or average absolute deviation (AAD), but other approaches can be
implemented.
• For categorical features, several options can be exercised. For binary features, the average
absolute deviation can be appropriate. For multi-class features, we normalize by the rate
at which changes are observed in the longitudinal data.
• Normalization can empower discovery of implausible counterfactual explanations. If a
feature has a high normalization value (e.g. 1 over the MAD), then changes to that feature
are rarely seen.</p>
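      <p>The AAD-based normalization described above can be sketched as follows; the function name is ours, and the 10⁻⁵ tolerance mirrors the one used in the experiment of Section 4.</p>
      <preformat>
```python
import numpy as np

def aad_weights(D, tol=1e-5):
    """Per-feature normalization: 1 over the average absolute deviation
    (AAD) of the observed changes D. The tolerance guards against
    division by zero for features never observed to change; such
    features receive a very large weight, flagging changes to them
    as implausible."""
    aad = np.abs(D - D.mean(axis=0)).mean(axis=0)
    return 1.0 / (aad + tol)
```
      </preformat>
      <p>Multiplying each per-feature term |(c − x) − D_i| by these weights before summing yields the normalized longitudinal distance.</p>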
      <p>Our proposed metric is flexible in its use; we can use it both during and after generating
counterfactuals. Post-generation, the longitudinal distance metric can be used to evaluate
and rank the plausibility of explanations. During generation, the distance metric can be used
to further constrain the search space. Notice that in comparing proposed changes to those
observed in longitudinal data, we not only re-weight distances on a per-feature basis, but we also
incorporate dependencies between feature changes; ruling out a requirement to, for example,
both change profession and increase the length of tenure in your current job.</p>
      <p>A first approach to incorporating our longitudinal metric is to re-score a proposed collection
of counterfactuals. Stochastic search algorithms used in [5] and [9] return a set of possible
counterfactual explanations, usually incorporating a geometric distance metric; these can then
be examined or prioritized by longitudinal distance. We generate counterfactuals using the
methods in [5], and then score them by plausibility. This two-step approach allows us to use the
more regular geometric distance for optimization, providing a more efficient search of feature
space. In this paper, it also allows us to examine the baseline plausibility of counterfactuals
generated with a geometric distance. In the appendix, we offer one simple way to generate
counterfactuals using our longitudinal metric.</p>
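      <p>Post-generation, the two-step approach amounts to scoring and sorting a candidate set. A minimal sketch follows (function names are ours; any generator returning an array of candidate counterfactuals, such as DiCE's output set, would do):</p>
      <preformat>
```python
import numpy as np

def longitudinal_distance(x, c, D, k):
    """Equation (1) with the L1 norm, repeated here for completeness."""
    dists = np.abs((c - x) - D).sum(axis=1)
    return np.partition(dists, k - 1)[:k].mean()

def rank_by_plausibility(x, candidates, D, k=1):
    """Order generated counterfactuals by longitudinal distance,
    most plausible (smallest distance) first."""
    scores = np.array([longitudinal_distance(x, c, D, k) for c in candidates])
    order = np.argsort(scores)
    return [candidates[i] for i in order], scores[order]
```
      </preformat>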
    </sec>
    <sec id="sec-4">
      <title>4. Ranking Explanations from DiCE and MIMIC-III</title>
      <p>To ground our approach, we consider an example from healthcare. Suppose a predictive model
is being used to assess patient risk for a disease. Counterfactuals may be useful to doctors
by explaining individual predictions and showing potential paths to decreased risk. In this
experiment, we use MIMIC-III, an electronic health records (EHR) dataset, to predict acute
respiratory failure (ARF) within four hours of admission [11] [12]. Specifically, we use a version
of MIMIC-III that has been preprocessed by FIDDLE, an EHR data processing pipeline [13]
[14]. Our longitudinal data consists of measurements when the patient is admitted and repeat
measurements four hours after admission. We train a random forests model to predict ARF
using only the first time step of data, and use DiCE [5] to generate ten counterfactuals for
individuals in the test set who are predicted to have ARF. Our dataset contains 1350 features to
train our model, twenty of which are derivations from vital signs.</p>
      <p>To rank counterfactuals, we use the longitudinal differences between the first and fourth
hour of data. We normalize our metric using the AAD of the longitudinal differences, and we
add a small tolerance (10⁻⁵) to prevent division by zero. Finally, we conduct this experiment in
two different settings: allowing all features to be changed (ALL) and allowing only vital signs to
be changed (VITAL). These two settings should show us how our longitudinal distance metric
can help assess plausibility, both when we know what features can be changed and when we
may be unsure how to constrain the counterfactual search space. With this experiment, we seek
to assess the discriminatory ability of our distance metric and subsequently evaluate plausibility
for counterfactual explanations.</p>
      <sec id="sec-4-1">
        <title>4.1. Results</title>
        <p>We begin by looking at the relationship between the geometric distance and longitudinal
distance. Figure 1 B and C plot the L1 distance compared to the longitudinal distance for
explanations using only vital signs and all features respectively. Overall, our metric is loosely
correlated with the L1 distance, but there are significant deviations based on which features are
changed. In the case of explanations that can only change vital signs, there is a noticeable linear
relationship. Vital signs in our dataset generally change at a similar rate, so the cost of changing
one or the other is similar. Therefore, the L1 distance is closely related to the longitudinal
distance.</p>
        <p>Looking at explanations that consider all features, the relationship is much more tenuous.
Some features rarely change, leading to significant jumps in our distance metric even when only
one feature is changed. Additionally, about half of the explanations change some immutable
feature, such as Hospital Ward or Religion. Some explanations also change features that are
mutable in theory, but not observed to have changed. Since the AAD is zero for some features,
changing those features results in a large distance value. The resulting distance is above 10⁵;
that is, 1/(0 + 10⁻⁵) = 10⁵. Moreover, changing features without observed changes is not a
rare occurrence. When we allowed any feature to be changed, 74 percent of the counterfactuals
generated had a longitudinal distance value above 10⁵. This is in stark contrast to VITAL, where
there are no explanations with a longitudinal distance value above 20.</p>
        <p>Next, we consider plausibility at the individual level. Each individual receives ten
counterfactuals, but our metric will help us see how many plausible counterfactuals an individual receives
on average. Figure 1 A shows the proportion of explanations below a threshold, and the vertical
line represents the distance value of a change that has occurred once in our train set. For the
purposes of this experiment, we consider counterfactuals below that threshold to be plausible.
We can see that not only are explanations less plausible on average in ALL, the proportion of
individuals with plausible explanations is significantly lower. Consequently, the vast majority
of individuals in our test set do not receive a plausible counterfactual if we consider changing
all features. Constraining to vital signs, however, leads to only plausible counterfactuals.</p>
        <p>Though we have seen that constraints can improve plausibility, it is important to consider
the effect constraints have on validity. We consider a counterfactual 'valid' if it has the desired
prediction (i.e. not at risk of ARF). Notably, constraints may prevent the generation of valid
counterfactuals by not allowing changes to predictive features. While we were able to generate
counterfactuals for all 229 individuals in the test set with ALL, we only generated counterfactuals
for 120 individuals with VITAL. This disparity raises concerns around the tension between
plausibility and validity; we can improve plausibility by constraining our search space, but we
may constrain counterfactuals in a way that degrades validity.</p>
        <p>In summary, our experiment elucidates the following:
• Longitudinal data allows us to detect and penalize unobserved changes in counterfactuals
• Constraints can improve plausibility
• Plausibility and validity are in tension with each other</p>
        <sec id="sec-4-1-1">
          <title>4.1.1. Supplementary Results</title>
          <p>In the supplement to this paper, we provide further results, specifically regarding the use of our
metric to generate counterfactuals.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Future Work</title>
      <p>In this paper, we only consider a longitudinal distance metric. Future work should explore
modeling longitudinal data to further improve plausibility constraints. Future work should also
consider implementing intermediate steps across time, and modeling plausibility in terms of
the subject’s current features. Additionally, our approach is computationally complex due to
row-wise comparison of matrices. Further work can investigate decreasing this complexity,
potentially using prototypes or other clustering methods.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Genetic Longitudinal Counterfactuals</title>
      <p>In addition to using our metric after generating counterfactuals, we present a method that
leverages longitudinal data for generating counterfactuals using a genetic algorithm. In the genetic
algorithm, largely borrowed from [5] and [8], we begin by generating a random population of
the desired class. Then, we assess the fitness of the population relative to our input and rank the
population by fitness. The top half of the population is then mated (i.e. features are randomly
chosen between two individuals). The next generation in the algorithm is made up of the top
half of the current generation and their offspring. We repeat this cycle until the best fitness
does not change substantially.</p>
      <p>Algorithm 1 provides pseudocode for the above description. This algorithm is flexible enough
to allow for a variety of fitness metrics, and in this paper we use our longitudinal metric to
generate counterfactuals constrained by longitudinal distances. Equations (2) and (3) show two metrics that
are used in [5] to generate counterfactuals.</p>
      <p>Algorithm 1 Genetic Counterfactuals</p>
      <p>Input: (subject input x, desired outcome y, model f)</p>
      <p>POP ← InitialPopulation(x, y)
currentBest ← ∞
repeat
  prevBest ← currentBest
  mostFit ← SelectFittest(POP, x, f)
  currentBest ← BestFitness(mostFit)
  POP ← mostFit ∪ Mate(mostFit)
until currentBest ≈ prevBest
return POP</p>
      <p>d_prox(x, c) = Σ_{i ∈ continuous} (1/MAD_i) |x_i − c_i| + Σ_{j ∈ categorical} 𝟙(x_j ≠ c_j)
(2)</p>
      <p>d_sparse(x, c) = Σ_i 𝟙(x_i ≠ c_i)
(3)</p>
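      <p>A runnable sketch of Algorithm 1 with the longitudinal metric as the fitness function is given below. It is a simplification assuming numeric features only; the helper names, penalty scheme, and random initialization are our own illustrative choices, not the exact implementation.</p>
      <preformat>
```python
import numpy as np

rng = np.random.default_rng(0)

def genetic_counterfactuals(x, model, desired, D, pop_size=50,
                            k=5, n_iter=200, tol=1e-6):
    """Genetic search for counterfactuals (Algorithm 1), using the
    longitudinal distance as fitness (lower is fitter). `model` maps
    a batch of rows to predicted labels; candidates with the wrong
    predicted class are assigned infinite fitness."""
    k = min(k, len(D))

    def fitness(pop):
        f = np.empty(len(pop))
        for i, c in enumerate(pop):
            dists = np.abs((c - x) - D).sum(axis=1)
            f[i] = np.partition(dists, k - 1)[:k].mean()
        f[model(pop) != desired] = np.inf  # invalid candidates lose
        return f

    def mate(parents):
        # offspring take each feature at random from two random parents
        a = parents[rng.integers(len(parents), size=len(parents))]
        b = parents[rng.integers(len(parents), size=len(parents))]
        return np.where(rng.random(a.shape) < 0.5, a, b)

    pop = x + rng.normal(scale=2.0, size=(pop_size, x.shape[0]))
    current_best = np.inf
    for _ in range(n_iter):
        f = fitness(pop)
        order = np.argsort(f)
        most_fit = pop[order[: pop_size // 2]]
        prev_best, current_best = current_best, f[order[0]]
        if abs(current_best - prev_best) < tol:
            break
        pop = np.vstack([most_fit, mate(most_fit)])
    return pop[np.argsort(fitness(pop))]
```
      </preformat>
      <p>Elitism (carrying the top half forward) makes the best fitness non-increasing, so the loop terminates once it stops improving.</p>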
      <sec id="sec-6-1">
        <title>A.1. Experiment with Adult-Income Dataset</title>
        <p>We performed an experiment to compare counterfactuals generated with and without our
longitudinal metric. In the ’Default’ algorithm, we optimize sparsity and proximity, and in the
’Longitudinal’ algorithm, we optimize proximity and longitudinal distance.</p>
        <p>We use Adult-Income, a common dataset used in fairness and explainability research [15]
[16]. The task is to use an individual’s demographic and economic information to predict if
their income is above 50k. To augment our experiment, we also consider a threshold of 30k.
Having two different thresholds should help us understand how plausibility interacts with
the rarity of a desired decision. In Adult-Income, 24 percent of individuals have an income
above 50k, compared to 44 percent who have an income above 30k. We expect that the lower
threshold will lead to more plausible counterfactuals and higher validity for both the ’Default’
and ’Longitudinal’ methods.</p>
        <p>Though the dataset does not contain any longitudinal data, it is simple enough to reason about
what data subjects might look like over time. Therefore, we conduct a simple simulation to
generate longitudinal data: we randomly allow some individuals to swap careers with someone
else in their education class. We also allow some individuals to increase their level of education
before moving to a new career. When swapping careers, all non-demographic variables are
swapped (hours-per-week, occupation, and capital loss/gain). Finally, the simulation increases
age randomly between one and ten years. This simulation shows some of the ways people can
change their economic conditions without allowing any changes on immutable features, such
as race or nationality. However, some features we do not include, such as marital status, can
change in practice. We focus on allowing changes to features that could potentially be included
in a recommendation.</p>
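        <p>For concreteness, the simulation above can be sketched in pandas as follows. The column names follow the standard Adult-Income encoding, and the swap and upgrade rates are illustrative parameters rather than the exact values used in our experiment.</p>
        <preformat>
```python
import numpy as np
import pandas as pd

def simulate_longitudinal(df, swap_frac=0.3, edu_frac=0.1, seed=0):
    """Simulate a later observation of each individual: some swap
    careers (all non-demographic economic features) with another
    person in the same education class, a subset of those first gain
    one education level, and everyone ages between 1 and 10 years.
    Immutable features such as race are never changed."""
    rng = np.random.default_rng(seed)
    future = df.copy()
    econ = ["occupation", "hours-per-week", "capital-gain", "capital-loss"]

    movers = rng.random(len(df)) < swap_frac
    upgrade = movers & (rng.random(len(df)) < edu_frac)
    future.loc[upgrade, "education-num"] += 1  # education rises first

    # swap economic features within each (possibly new) education class
    for _, idx in future[movers].groupby("education-num").groups.items():
        partners = rng.permutation(np.asarray(idx))
        future.loc[idx, econ] = df.loc[partners, econ].to_numpy()

    future["age"] += rng.integers(1, 11, size=len(df))  # age 1-10 years
    return future
```
        </preformat>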
        <p>Metrics for validity are presented in Table 1. Example counterfactuals are presented in Table
2. Overall, we find that the 'Longitudinal' algorithm produces fewer valid counterfactuals, but
also fewer counterfactuals with impossible changes.</p>
        <sec id="sec-6-1-1">
          <title>Metric</title>
          <p>[Table 1 was garbled in extraction. It reports, for each algorithm ('Default' and 'Longitudinal') and income threshold, the metrics Mean Validity, % validity=0, % validity=1, and % immutable; the individual cell values are not recoverable.]</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caruana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>Accurate intelligible models with pairwise interactions</article-title>
          ,
          <source>in: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          , KDD '13, Association for Computing Machinery, 2013.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>