<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Explaining Learned Reward Functions with Counterfactual Trajectories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jan Wehner</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frans Oliehoek</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luciano Cavalcante Siebert</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CISPA Helmholtz Center for Information Security</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Delft University of Technology</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>Learning rewards from human behavior or feedback is a promising approach to aligning AI systems with human values but fails to consistently extract correct reward functions. Interpretability tools could enable users to understand and evaluate possible flaws in learned reward functions. We propose Counterfactual Trajectory Explanations (CTEs) to interpret reward functions in Reinforcement Learning by contrasting an original and a counterfactual trajectory and the rewards they each receive. We derive six quality criteria for CTEs and propose a novel Monte-Carlo-based algorithm for generating CTEs that optimizes these quality criteria. To evaluate how informative the generated explanations are to a proxy-human model, we train it to predict rewards from CTEs. CTEs are demonstrably informative for the proxy-human model, increasing the similarity between its predictions and the reward function on unseen trajectories. Further, it learns to accurately judge differences in rewards between trajectories and generalizes to out-of-distribution examples. Although CTEs do not lead to a perfect prediction of the reward, our method, and more generally the adaptation of XAI methods, are presented as a fruitful approach for interpreting learned reward functions and thus enabling users to evaluate them.</p>
      </abstract>
      <kwd-group>
        <kwd>Value Alignment</kwd>
        <kwd>Reward Learning</kwd>
        <kwd>Explainable AI</kwd>
        <kwd>Counterfactual Explanations</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As Reinforcement Learning (RL) models grow in their capabilities and adoption in real-world applications
[
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ], we must ensure that they are safe and aligned with human values. A core difficulty of achieving
trustworthy and controllable AI [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ] is to accurately capture human intentions and preferences in
the reward function on which the RL agent is trained since the reward function will shape the agent’s
objectives and behaviour. For many tasks, it is hard to manually specify a reward function that accurately
represents the intentions, preferences, or values of designers, users or society at large [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. Reward
Learning is a set of techniques that circumvents this problem by instead learning the reward function
from data. For example, Preference-based RL [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] derives a reward function from preference judgments
queried from a human and has recently been applied to control the behaviour of Large Language
Models [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Similarly, Inverse RL [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which is commonly used in autonomous driving and robotics,
aims to retrieve the reward function of an expert from the demonstrations they generate. Reward
learning is a promising approach for aligning the reward functions of AI systems with the intentions
of humans [
        <xref ref-type="bibr" rid="ref11 ref5">5, 11</xref>
        ]. It has significant advantages over behavioral cloning, which learns a policy by
using supervised learning on observation-action pairs, since reward functions are considered the most
succinct, robust, and transferable definition of a task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. However, these techniques suffer from a
multitude of theoretical [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ] and practical problems [15] that make them unable to reliably learn
human values, which are diverse [16], dynamic [17] and context-dependent [18].
      </p>
      <p>We aim to develop interpretability tools that help humans understand learned reward functions
so that they can detect misalignments with their own values. This is in line with the “Transparent
Value Alignment” framework, in which Sanneman and Shah [19] suggest leveraging techniques from
eXplainable AI (XAI) to provide explanations about the reward function. The process of explaining
reward functions can be useful for both the understanding and explaining phases of the XAI pipeline [20],
by enabling both developers and users to inspect reward functions. This is a relevant task for the XAI
community, as it contributes to the goal of enabling human users to understand, appropriately trust,
and produce more explainable models [20, 19]. However, there have been few attempts to interpret
reward functions, and only Michaud et al. [21] attempt this for deep, learned reward functions. Our
work makes a novel connection between XAI and reward learning by providing, to the best of our
knowledge, the first principled application of counterfactual explanations to reward functions.</p>
      <p>Counterfactual explanations are a popular XAI tool that has not yet, to the best of our knowledge,
been applied to explain reward functions. They help humans to understand the predictions of ML models
by posing hypothetical “what-if” scenarios. Humans commonly use counterfactuals for decision-making,
learning from past experiences, and emotional regulation [22, 23, 24]. Thus, users can intuitively reason
about and learn from counterfactual explanations, which makes this an effective and user-friendly mode
of explanation [25, 26, 27].</p>
      <p>We propose Counterfactual Trajectory Explanations (CTEs) that serve as informative
explanations about deep reward functions. CTEs can be employed in a sequential decision-making
setting by contrasting an original with a counterfactual partial trajectory along with the rewards
assigned to them. This enables the user to draw inferences about what behaviours cause the reward
function to assign high or low rewards. For instance, consider the domain of autonomous driving
illustrated in Figure 1. While a given driving trajectory by itself might not provide much insight, adding
a counterfactual trajectory along with its reward allows a user to hypothesise that the reward function
negatively rewards the driving agent for swerving and getting close to the other lane.</p>
      <p>To generate CTEs, we identify and adapt six quality criteria for counterfactual explanations
from XAI and psychology and introduce two algorithms for generating CTEs that optimise for these
quality criteria. To evaluate how effective the generated CTEs are, we introduce a novel measure of
informativeness in which a proxy-human model learns from the provided explanations. Implementation
details, ablations and further experiments can be found in the technical appendix.<sup>1</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. Counterfactual Trajectory Explanations (CTEs)</title>
      <p>This study focuses on adapting counterfactual explanations to interpret a learned reward function.
Counterfactual explanations alter the inputs to a given system, which causes a change in the outputs
[26]. When explaining reward functions, the inputs could either be single states or (partial) trajectories.
Correspondingly, the outputs to be targeted can either be seen as rewards assigned to single states or
as the average reward assigned to the states in a (partial) trajectory. If we only altered individual
states, multi-step plans could be overlooked and infeasible counterfactuals that cannot occur through
any sequence of actions might be created. By generating trajectories and showing their average rewards,
we can provide the user with insights about which multi-step behaviours are incentivized by the reward
function, while also guaranteeing that counterfactuals are feasible. While it would be possible to
generate multiple counterfactuals per original, we only show the user one counterfactual to be able to
cover more original trajectories.</p>
      <p><sup>1</sup> The full code for the project is available at:
https://github.com/janweh/Counterfactual-Trajectory-Explanations-for-LearnedReward-Functions</p>
      <p>We operate in Markov Decision Processes consisting of states $S$, actions $A$, transition probabilities $P$
and a reward function $R$. Further, we denote a learned reward function as $R_\theta : S \times A \to \mathbb{R}$, a policy
trained for $R_\theta$ as $\pi_\theta$, full trajectories generated by a full play-through of the environment as $\tau$ and
partial trajectories as $t \subseteq \tau$. Counterfactual Trajectory Explanations (CTEs) can now be defined as:</p>
      <p>Definition 1. CTEs $\{(t_{org}, r_{org}), (t_{cf}, r_{cf})\}$ consist of an original and a counterfactual partial trajectory
and their average rewards assigned by a reward function $R_\theta$. Both start in the same state $s_0$ but then follow a
different sequence of actions, resulting in different average rewards.</p>
      <p>The difference in rewards can be causally explained by the difference in actions. If the agent had
chosen actions $(a^{cf}_1, \dots, a^{cf}_k)$ instead of $(a^{org}_1, \dots, a^{org}_k)$, resulting in $t_{cf}$ instead of $t_{org}$, the reward
function $R_\theta$ would have assigned an average reward $r_{cf}$ instead of $r_{org}$.<sup>2</sup></p>
      <p>We propose a method to address the following problem: Given a learned reward function $R_\theta$, a policy
$\pi_\theta$ trained on $R_\theta$ and a full original trajectory $\tau_{org}$ generated by $\pi_\theta$, the task is to select a part of that
trajectory $t_{org} \subseteq \tau_{org}$ and generate a counterfactual $t_{cf}$ to it that starts in the same state $s_0$, so that the
resulting CTE is informative for an explainee to understand $R_\theta$.</p>
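      <p>As a minimal sketch of Definition 1, a CTE can be stored as a small record that pairs the two partial trajectories with their average rewards. Everything named here (the CTE class, average_reward, make_cte, toy_reward and the representation of a trajectory as a list of (state, action) pairs) is a hypothetical illustration and is not taken from the authors' code base.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

State = Tuple[int, int]      # assumed: a grid position stands in for a full state
Step = Tuple[State, int]     # (state, action) pair


@dataclass
class CTE:
    t_org: List[Step]        # original partial trajectory
    t_cf: List[Step]         # counterfactual partial trajectory
    r_org: float             # average reward of t_org
    r_cf: float              # average reward of t_cf


def average_reward(reward_fn: Callable[[State, int], float], traj: List[Step]) -> float:
    """Mean reward over the steps of a partial trajectory."""
    return sum(reward_fn(s, a) for s, a in traj) / len(traj)


def make_cte(reward_fn: Callable[[State, int], float],
             t_org: List[Step], t_cf: List[Step]) -> CTE:
    """Build a CTE; both partial trajectories must share their first state."""
    if t_org[0][0] != t_cf[0][0]:
        raise ValueError("original and counterfactual must start in the same state")
    return CTE(t_org, t_cf,
               average_reward(reward_fn, t_org),
               average_reward(reward_fn, t_cf))


def toy_reward(state: State, action: int) -> float:
    """Toy stand-in for a learned reward function: action 0 is penalised."""
    return -1.0 if action == 0 else 0.5


t_org = [((0, 0), 1), ((0, 1), 1)]
t_cf = [((0, 0), 0), ((1, 0), 1)]
cte = make_cte(toy_reward, t_org, t_cf)
print(cte.r_org, cte.r_cf)
```
]]></preformat>
      <p>In this toy case the counterfactual receives a lower average reward than the original, which is the kind of contrast an explainee would reason about.</p>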
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>This Section presents the method used to generate CTEs. First, quality criteria that measure the quality
of an explanation are derived from the literature and combined into a scalar quality value. Then two
algorithms are introduced which generate CTEs by optimising for the quality value.</p>
      <sec id="sec-3-1">
        <title>3.1. Determining the quality of CTEs</title>
        <p>Counterfactual explanations are usually generated by optimising them for a loss function that determines
how good a counterfactual is [29]. This loss function combines multiple aspects, which we call “quality
criteria”.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Quality Criteria</title>
          <p>By reviewing XAI literature we were able to identify 9 quality criteria that are used for counterfactual
explanations. These criteria are designed to make counterfactuals more informative to a human. Out of
these Causality, Resource and Actionability [30, 31, 32] are automatically achieved by our methods. We
are left with six quality criteria to optimise for which we adapt to judge the quality of CTEs.</p>
          <p>1. Validity: Counterfactuals should lead to the desired difference in the output of the model [31, 32].
This difference in outputs makes it possible to causally reason about the changes in the inputs. We
maximise Validity as $|R_\theta(t_{org}) - R_\theta(t_{cf})|$.</p>
          <p>2. Proximity: The counterfactual should be similar to the original [30, 33, 32]. Thus we minimize
a measure based on the Modified Hausdorff distance [34] that finds the closest match between the
state-action pairs in the two trajectories. The distance between state-action pairs is calculated as a weighted
sum of the Manhattan distance of the player positions, whether the same action was taken and the edit
distance between non-player objects in the environment.</p>
          <p><sup>2</sup> Examples of CTEs in the Emergency Environment [28] can be found in:
https://drive.google.com/drive/folders/1JMjwQM24BbDwL8vRnG3pST5hlvpzRfZM?usp=sharing</p>
          <p>3. Diversity: Explanations should cover the space of possible variables as well as possible [35, 36].
Consequently, each new CTE should establish novel information rather than repeating previously
shown CTEs. Thus we maximize the Diversity of a new CTE compared to previous CTEs. This is calculated
as the sum of the average difference between the new length of the trajectory and previous lengths, the
average difference between the new starting time in the environment and previous starting times, and the
fraction of previous trajectories that are of the same counterfactual direction. The counterfactual direction
is an upward or a downward comparison [37], depending on whether the reward of the counterfactual is higher or lower
than the original’s reward.</p>
          <p>4. State importance: Counterfactual explanations should focus on important states that have a
significant impact on the trajectory outcome [36]. We aim to start counterfactual trajectories in critical
states, where the policy strongly favors some actions over others. We maximize the importance of a
starting state, which is calculated as the policy’s negative entropy $-H(\pi_\theta(\cdot \mid s_0)) = \sum_{a \in A} \pi_\theta(a \mid s_0) \log \pi_\theta(a \mid s_0)$ [36, 38].</p>
          <p>5. Realisticness: The constellation of variables in a counterfactual should be likely to happen
[30, 32, 31]. In our setting, we want counterfactual trajectories that are likely to be generated by a
policy trained on the given reward function. Such a trajectory would likely score high on the reward
function. Thus we maximize $R_\theta(t_{cf}) - R_\theta(t_{org})$.</p>
          <p>6. Sparsity: Counterfactuals should only change a few features compared to the original to make
it cognitively easier for a human to process the differences [30, 31, 32, 33]. Instead of meticulously
restricting the number of features that differ between states, we lighten the cognitive load by incentivizing
CTEs to be short by minimizing $\mathrm{len}(t_{org}) + \mathrm{len}(t_{cf})$.</p>
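          <p>The criteria above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: trajectories are lists of ((x, y), action) pairs, the per-step distance used for Proximity omits the edit distance between non-player objects, and the weights w_pos and w_act as well as all function names are hypothetical. Diversity is left out because it depends on the history of previously shown CTEs.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def validity(r_org, r_cf):
    """Absolute difference of the two average rewards (to be maximised)."""
    return abs(r_org - r_cf)


def realisticness(r_org, r_cf):
    """Counterfactual reward minus original reward (to be maximised)."""
    return r_cf - r_org


def sparsity_cost(t_org, t_cf):
    """Combined length of both partial trajectories (to be minimised)."""
    return len(t_org) + len(t_cf)


def state_importance(action_probs):
    """Negative entropy of the policy in the start state (to be maximised)."""
    return sum(p * math.log(p) for p in action_probs if p > 0)


def step_distance(step_a, step_b, w_pos=1.0, w_act=1.0):
    """Simplified distance between two (state, action) pairs (assumed weights)."""
    (xa, ya), act_a = step_a
    (xb, yb), act_b = step_b
    manhattan = abs(xa - xb) + abs(ya - yb)
    return w_pos * manhattan + w_act * float(act_a != act_b)


def modified_hausdorff(t_a, t_b):
    """Modified Hausdorff distance: larger of the two mean closest-match distances."""
    def directed(src, dst):
        return sum(min(step_distance(p, q) for q in dst) for p in src) / len(src)
    return max(directed(t_a, t_b), directed(t_b, t_a))


t_org = [((0, 0), 1), ((0, 1), 1)]
t_cf = [((0, 0), 0), ((1, 0), 1)]
proximity_cost = modified_hausdorff(t_org, t_cf)
```
]]></preformat>
          <p>A start state in which the policy is nearly deterministic scores higher on state_importance than one in which all actions are equally likely, which matches the intent of starting counterfactuals in critical states.</p>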
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Combining quality criteria into a scalar quality value</title>
          <p>
            After measuring the six quality criteria, we scalarise them into one quality value $q$ to be assigned to
a CTE. This is done by normalising the criteria and combining them into a weighted sum. Criteria
are normalised to $[0, 1]$ by iteratively generating new CTEs with random weights and adapting the
minimum and maximum value the criteria take on.
          </p>
          <p>The weights $\omega$ assigned to the quality criteria correspond to their relative importance. However,
this raises the question of how one should weigh the different quality criteria to generate the most
informative explanations for a certain user. To find the optimal set of weights, we suggest a calibration
phase in which $N$ different sets of weights $\omega = \{\omega^i_{Validity}, \dots, \omega^i_{Sparsity}\}_{i=1}^{N}$ are uniformly sampled,
$\omega^i \sim U(0, 1)$, and used to create CTEs. The CTEs’ informativeness is tested and the set of weights that
produces the most informative CTEs for a specific user is chosen for further use.</p>
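          <p>The normalisation and scalarisation described in this subsection could look roughly as follows. This is a sketch under assumptions: raw criterion values are taken to be oriented so that higher is better (criteria that are minimised would be negated beforehand), values outside the range seen so far are clamped, and the names RunningMinMax, quality_value and sample_weights are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random

CRITERIA = ("validity", "proximity", "diversity",
            "state_importance", "realisticness", "sparsity")


class RunningMinMax:
    """Tracks the smallest and largest raw value seen so far for each criterion."""

    def __init__(self):
        self.lo = {c: float("inf") for c in CRITERIA}
        self.hi = {c: float("-inf") for c in CRITERIA}

    def update(self, raw):
        for c in CRITERIA:
            self.lo[c] = min(self.lo[c], raw[c])
            self.hi[c] = max(self.hi[c], raw[c])

    def normalise(self, raw):
        """Min-max scale each criterion to [0, 1], clamping unseen extremes."""
        out = {}
        for c in CRITERIA:
            span = self.hi[c] - self.lo[c]
            scaled = (raw[c] - self.lo[c]) / span if span > 0 else 0.0
            out[c] = min(1.0, max(0.0, scaled))
        return out


def quality_value(normalised, weights):
    """Weighted sum of the normalised criteria."""
    return sum(weights[c] * normalised[c] for c in CRITERIA)


def sample_weights(rng):
    """One weight per criterion, drawn uniformly from [0, 1]."""
    return {c: rng.uniform(0.0, 1.0) for c in CRITERIA}


bounds = RunningMinMax()
bounds.update({c: 0.0 for c in CRITERIA})
bounds.update({c: 2.0 for c in CRITERIA})
normalised = bounds.normalise({c: 1.0 for c in CRITERIA})
weights = sample_weights(random.Random(0))
q = quality_value(normalised, weights)
```
]]></preformat>
          <p>In a calibration phase, sample_weights would be called once per candidate weight set, and the set whose CTEs turn out most informative would be kept.</p>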
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Generation algorithms for CTEs</title>
        <p>In order to generate CTEs we propose two algorithms that optimise for the aforementioned quality
value (see Section 3.1) along with a random baseline algorithm.</p>
        <p>Algorithm 1 - Monte Carlo-based Trajectory Optimization (MCTO):
MCTO adapts Monte Carlo Tree Search (MCTS) to the task of generating CTEs. MCTS is a heuristic
search algorithm that has been applied to RL by modelling the problem as a game tree, where states and
actions are nodes and branches [39, 40]. It uses random sampling and simulations to balance exploration
and exploitation in estimating the Q-values of states and actions.</p>
        <p>In contrast to MCTS, MCTO operates on partial trajectories instead of states, optimises for quality
values instead of rewards from the environment, adds a termination action which ends the trajectory
and applies domain-specific heuristics. Pseudocode 1 showcases the algorithm.</p>
        <p>In MCTO, nodes represent partial trajectories $t$, branches are actions $a$, and child nodes result from
parents by following the action in the connecting branch. Leaf nodes are terminated trajectories, which
can occur from entering a terminal state in the environment or by selecting an additional terminal
action that is always available. MCTO optimises for the quality value $q$ of a CTE, which is
measured at the leaf nodes. A CTE is derived by taking the partial trajectory in the leaf node as the
counterfactual $t_{cf}$ and the subtrajectory of $\tau_{org}$ from starting state $s$ with the same length as $t_{cf}$ as
the original $t_{org}$.</p>
        <preformat>Algorithm 1 Monte Carlo Trajectory Optimization
Input: full trajectory τ_org, environment env, actions A
CTEs = []                                  % store candidate CTEs
for s in τ_org do
    t_cf = []
    Q = []                                 % Q-values of trajectories
    repeat
        for i to n_iterations do
            t ← SELECTION(t_cf)
            t ← EXPANSION(t)
            q ← SIMULATION(t)
            Q ← BACK-PROPAGATION(q, t)
        end for
        a* = argmax_{a ∈ A} Q((t_cf, a))
        s ← env.step(s, a*)
        APPEND(t_cf, (s, a*))
    until s is terminal
    t_org = SUBSET(τ_org, s, |t_cf|)       % subtrajectory from τ_org with same length as t_cf
    APPEND(CTEs, (t_org, t_cf))
end for
Return: argmax_{c ∈ CTEs} q(c)</preformat>
        <p>Each state $s \in \tau_{org}$ in the original trajectory is used as a potential starting point of the CTE by
setting it as the root of the tree and running MCTO. Out of these, the CTE with the highest quality value
is chosen. For a given state, we choose the next action by repeating these four steps a set number of
times ($n_{\mathit{iterations}}$) before choosing the action $a^*$ with the highest Q-value:</p>
        <p>1. SELECTION: A node in the tree that still has unexplored branches is chosen. The choice
is made according to the Upper Confidence Bounds for Trees algorithm based on the estimated
Q-value of the branches and the number of times the nodes and branches have already been
visited.</p>
        <p>2. EXPANSION: After selecting a node, we choose a branch and create the resulting child node.</p>
        <p>3. SIMULATION: One full playout is completed by sampling actions uniformly until the environment
terminates the trajectory or the terminating action is chosen. At each step, the terminal action
is chosen with a probability of $p(\mathit{end})$. The resulting CTE’s quality value $q$ is evaluated
according to the quality criteria.</p>
        <p>4. BACK-PROPAGATION: $q$ is back-propagated up the tree to adjust the Q-values of previous nodes:
$\Delta Q(t) = \frac{1}{N(t)} (q - Q(t))$, where $N(t)$ denotes the visit count of node $t$.</p>
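        <p>The selection and back-propagation steps can be illustrated with a generic tree-search skeleton. The sketch below uses the standard UCT score and an incremental-mean update; the Node class, the exploration constant c and the function names are assumptions made for illustration and do not reproduce the authors' implementation (expansion, simulation, the terminal action and pruning are omitted).</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


class Node:
    """A tree node standing for a partial trajectory."""

    def __init__(self, parent=None):
        self.parent = parent
        self.children = {}   # action -> Node
        self.visits = 0      # N(t)
        self.q = 0.0         # running mean of back-propagated quality values


def uct_score(parent, child, c=math.sqrt(2)):
    """Upper confidence bound; unvisited children are tried first."""
    if child.visits == 0:
        return float("inf")
    return child.q + c * math.sqrt(math.log(parent.visits) / child.visits)


def select_child(parent):
    """Return the (action, child) pair with the highest UCT score."""
    return max(parent.children.items(), key=lambda kv: uct_score(parent, kv[1]))


def backpropagate(leaf, quality):
    """Incremental mean update Q <- Q + (q - Q) / N on the path to the root."""
    node = leaf
    while node is not None:
        node.visits += 1
        node.q += (quality - node.q) / node.visits
        node = node.parent


root = Node()
child_a = Node(parent=root)
child_b = Node(parent=root)
root.children["a"] = child_a
root.children["b"] = child_b

backpropagate(child_a, 1.0)
backpropagate(child_a, 0.0)
action, _ = select_child(root)
```
]]></preformat>
        <p>Because no discount factor is applied, every node on the path receives the same quality value, so each Q-value is simply the mean quality of the playouts that passed through it.</p>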
        <p>As an efficiency-increasing heuristic, we prune off branches of actions that have a likelihood $\pi_\theta(a \mid s) \le \mathit{threshold}$
of being chosen by the policy. Furthermore, we choose not to employ a discount factor ($\gamma = 1$)
when back-propagating $q$, since this would incentivize shorter CTEs, which is already done by the
Sparsity criterion. Ablations showed that other heuristics, such as choosing actions in the simulation
based on the policy $\pi_\theta$ or basing the decisions for expansion on an early estimate of $q$, did not
improve performance.</p>
        <p>Algorithm 2 - Deviate and Continue (DaC):
The Deviate and Continue (DaC) algorithm creates a counterfactual trajectory $t_{cf}$ by deviating from the
original trajectory $\tau_{org}$ before continuing by choosing actions according to policy $\pi_\theta$. Starting in a state
$s \in \tau_{org}$, the deviation is performed by sampling an action from the policy $\pi_\theta$ that leads to a different
state than in the original trajectory. After $n_{\mathit{deviations}}$ such deviations, $t_{cf}$ is continued by following $\pi_\theta$.
During the continuation, there is a $p(\mathit{end})$ chance per step of ending both $t_{org}$ and $t_{cf}$. This process
is repeated for every state $s \in \tau_{org}$ and the resulting CTE with the highest quality value is chosen.</p>
        <p>[Figure 2: Evaluation pipeline. (1) Reward learning yields the learned reward function $R_\theta$ and a policy. (2) The explanation method generates CTEs from trajectories $\tau_{org}$ by rating potential CTEs with the quality criteria. (3) Features and labels $F(t_{org}), r_{org}; F(t_{cf}), r_{cf}$ are extracted. (4) The proxy-human model $M_\theta$ is trained with supervised learning. (5) Similarity is measured.]</p>
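        <p>A rough sketch of the deviate-then-continue idea for a single start state is given below. It assumes a deterministic transition function step(state, action), a policy that returns a dictionary of action probabilities, and a toy one-dimensional world; all names, the max_tries cap on resampling and the way the original is cut to the counterfactual's length are illustrative assumptions rather than the authors' implementation.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_action(policy, state, rng):
    """Draw one action from the policy's action distribution in `state`."""
    actions, probs = zip(*policy(state).items())
    return rng.choices(actions, weights=probs, k=1)[0]


def deviate_and_continue(tau_org, start, policy, step,
                         n_deviations, p_end, rng, max_tries=50):
    """Build (t_org, t_cf) starting at index `start` of the full trajectory.

    tau_org is a list of (state, action) pairs. During the first
    n_deviations steps, actions are resampled (at most max_tries times)
    until the next state differs from the original's next state. After
    that the policy is followed and each step ends the trajectory with
    probability p_end.
    """
    state = tau_org[start][0]
    t_cf = []
    horizon = len(tau_org) - start            # steps left in the original
    for k in range(horizon):
        action = sample_action(policy, state, rng)
        if k < n_deviations and start + k + 1 < len(tau_org):
            org_next = tau_org[start + k + 1][0]
            for _ in range(max_tries):
                if step(state, action) != org_next:
                    break                     # found a deviating action
                action = sample_action(policy, state, rng)
        t_cf.append((state, action))
        state = step(state, action)
        if k >= n_deviations and rng.random() < p_end:
            break                             # end both trajectories here
    t_org = tau_org[start:start + len(t_cf)]  # same length as the counterfactual
    return t_org, t_cf


def step(state, action):
    """Toy deterministic dynamics on a line."""
    return state + action


def uniform_policy(state):
    """Toy policy: move left or right with equal probability."""
    return {-1: 0.5, 1: 0.5}


tau_org = [(s, 1) for s in range(6)]          # original always moves right
t_org, t_cf = deviate_and_continue(tau_org, start=1, policy=uniform_policy,
                                   step=step, n_deviations=1, p_end=0.5,
                                   rng=random.Random(0))
```
]]></preformat>
        <p>Running this for every start index and keeping the pair with the highest quality value would mirror the outer loop described above.</p>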
        <p>Baseline Algorithm - Random: As a weak baseline, we compare our algorithms to randomly
generated CTEs. A start state $s$ of the counterfactual is uniformly chosen from the original trajectory
$\tau_{org}$. From there, actions are uniformly sampled, while the trajectories have a $p(\mathit{end})$ chance of
being ended in each timestep.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <p>This section details the experimental approach we take to evaluate the informativeness of CTEs.
We want to automatically measure how well an explainee can understand a reward function from
explanations, whereas similar works perform user studies or do not offer quantitative evaluations. Since
previous methods for interpreting reward functions are not applicable to our evaluation setup, we can
only compare our proposed methods with a baseline and the quality criteria with each other. Our evaluation
approach includes learning a reward function, generating CTEs about it and measuring how informative
the CTEs are for a proxy-human model (see Figure 2).</p>
      <sec id="sec-4-1">
        <title>4.1. Generating reward functions and CTEs</title>
        <p>To learn a reward function (1), we first generate expert demonstrations. A policy $\pi^*$ is trained on a
ground-truth reward $R^*$ via Proximal Policy Optimization (PPO) [41]. This policy is used to generate
1000 expert trajectories $\tau_{exp} = \{\tau_{exp}^i\}_{i=0}^{1000}$. Secondly, we use Adversarial IRL [42], which derives a
robust reward function $R_\theta$ and policy $\pi_\theta$ from the demonstrations by posing the IRL problem as a
two-player adversarial game between a reward function and a policy optimizer.</p>
        <p>We use the Emergency environment [28], a Gridworld environment that represents a burning building
where a player needs to rescue humans and reduce the fire. The environment contains 7 humans that need
to be rescued, a fire extinguisher that can lessen the fire, and obstacles that block the agent from
walking through. In each timestep, the player can walk or interact in one of the four directions. This
environment is computationally cheap and simple to investigate. However, it is still interesting to study,
since the random initialisations require the reward function to generalise while taking into account
multiple sources of reward.</p>
        <p>To make CTEs about $R_\theta$ (2), we first generate a set of full trajectories $\tau_{org} = \{\tau_{org}^i\}_{i=0}^{1000}$ using the
policy $\pi_\theta$. Lastly, we use the algorithms described in Section 3.2 to optimise for the quality criteria
in Section 3.1 to produce one CTE per full trajectory, $\mathit{CTEs} = \{(t_{org}^i, t_{cf}^i)\}_{i=0}^{1000}$. We conducted a
grid search of hyperparameters for each of the generation algorithms. Based on that, we choose
$p(\mathit{end}) = 0.35$, $\mathit{threshold} = 0.003$ and $n_{\mathit{iterations}} = 10$ for MCTO, $p(\mathit{end}) = 0.55$ and
$n_{\mathit{deviations}} = 3$ for DaC, and $p(\mathit{end}) = 0.15$ for Random.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evaluating the informativeness of CTEs</title>
        <p>We argue that informative explanations allow the explainee to better understand the learned reward
function, which we formalize as the explainee’s ability to assign similar average rewards to unseen
trajectories as the reward function.</p>
        <p>To evaluate the informativeness of CTEs, we employ a Neural Network (NN) as a proxy-human
model to learn from the explanations and to predict the average reward assigned by $R_\theta$ for a trajectory.
While humans learn differently from data than an NN, this evaluation setup still gives us important
insights into the functioning and effectiveness of CTEs.</p>
        <p>Notably, this measure only serves to evaluate the generation method and would not be used when
showing CTEs to humans. It allows us to test whether extracting generalisable knowledge about the
reward function from the provided CTEs is possible by measuring how well the proxy-human model can
predict unseen CTEs. Furthermore, it allows us to compare different algorithms and quality criteria by
measuring and contrasting the informativeness of the CTEs they generate.</p>
        <p>The evaluation procedure consists of three steps, as presented in Figure 2: (3) features and labels are
extracted from the CTEs to form a dataset to train on, (4) a proxy-human model is trained to predict
the rewards of trajectories from these features, and, lastly, (5) the similarity between the predictions of
the proxy-human model and the rewards assigned by $R_\theta$ is measured to indicate how informative the
CTEs were to the model.</p>
        <p>Extracting features and labels (3)</p>
        <p>We extract 46 handcrafted features $F(t) = \{f_0, \dots, f_{45}\}$ about the partial trajectories. These features
represent concepts that the reward function might consider in its decision-making, for example of
the form “time spent using item X” or “average distance from object Y”. We opted against methods
for automatic feature extraction [43] to avoid introducing more moving parts in the evaluation. The
average reward for the states in a partial trajectory serves as the label for the proxy-human model:
$r = \frac{1}{|t|} \sum_{s \in t} R_\theta(s)$. By averaging the reward, we avoid biasing the learning to the length of partial
trajectories.</p>
        <p>Learning a proxy-human model (4)</p>
        <p>A proxy-human regression model $M_\theta$ is trained to predict the average reward $r$ given to the partial
trajectory $t$ by $R_\theta$ from the extracted features $F(t)$. Humans learn from counterfactual explanations in
a contrastive manner by looking at the difference in outputs to causally reason about the effect of the
inputs [33] but also learn from the individual data points. Since we aim to make $M_\theta$ learn in a similar
way to a human, we train $M_\theta$ on two tasks. In the single task, it is trained to separately predict the
average reward for the original and the counterfactual. Giving rewards to unseen trajectories shows
how similar the judgements of $M_\theta$ and $R_\theta$ are for trajectories. The loss on one CTE for this task is the
sum: $L_{single}(t_{org}, t_{cf}) = (M_\theta(t_{org}) - R_\theta(t_{org}))^2 + (M_\theta(t_{cf}) - R_\theta(t_{cf}))^2$.</p>
        <p>In the contrastive task, $M_\theta$ is trained to predict the difference between the average original and
counterfactual reward. By doing this, we train $M_\theta$ to reason about how the difference in inputs
causes the difference in outputs instead of only learning from data points independently:
$L_{contrastive}(t_{org}, t_{cf}) = [(M_\theta(t_{org}) - M_\theta(t_{cf})) - (R_\theta(t_{org}) - R_\theta(t_{cf}))]^2$.</p>
        <p>$M_\theta$ is defined as a 4-layer NN that receives the features extracted from both the original and the
counterfactual as a concatenated input and is trained in a multi-task fashion on the single and contrastive
tasks. The body of the NN is shared between both tasks and feeds into two separate last layers that
perform the two tasks separately. The losses of both tasks are used separately to update their respective
last layer and are added into a weighted sum to update the shared body of the network.</p>
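        <p>To make the two objectives concrete, the per-CTE losses can be written as plain functions of the model's predictions and the reward function's average rewards. This is a sketch: the function names and the mixing weights w_single and w_contrastive are hypothetical, and no actual network is trained here.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def single_loss(m_org, m_cf, r_org, r_cf):
    """Squared error of each prediction against its own average reward."""
    return (m_org - r_org) ** 2 + (m_cf - r_cf) ** 2


def contrastive_loss(m_org, m_cf, r_org, r_cf):
    """Squared error between the predicted and the true reward difference."""
    return ((m_org - m_cf) - (r_org - r_cf)) ** 2


def shared_body_loss(m_org, m_cf, r_org, r_cf, w_single=0.5, w_contrastive=0.5):
    """Weighted sum of both task losses (the weights are assumptions)."""
    return (w_single * single_loss(m_org, m_cf, r_org, r_cf)
            + w_contrastive * contrastive_loss(m_org, m_cf, r_org, r_cf))


# A model whose predictions are off by a constant gets the difference right.
r_org, r_cf = 0.5, -0.25
m_org, m_cf = r_org + 3.0, r_cf + 3.0

loss_single = single_loss(m_org, m_cf, r_org, r_cf)
loss_contrastive = contrastive_loss(m_org, m_cf, r_org, r_cf)
loss_body = shared_body_loss(m_org, m_cf, r_org, r_cf)
```
]]></preformat>
        <p>The example shows why the two tasks are complementary: a constant offset in the predictions is invisible to the contrastive loss but is penalised by the single loss.</p>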
        <p>We train the NN on 800 samples with the Adam optimiser and weight decay, and results are averaged
over 30 random initialisations. We perform hyperparameter tuning using 5-fold cross-validation for the
learning rate, regularisation values, number of training epochs and dimensionality of hidden layers.</p>
        <p>[Figure 3: vertical axis “Informativeness”.]</p>
        <p>Measuring similarity to the reward function (5)</p>
        <p>To measure how similar the proxy-human model’s predictions are to the reward function’s outputs, we
measure the Pearson Correlation between them on unseen CTEs. Reward functions are invariant under
multiplication by positive constants and addition of constants [44]. This is well captured by the Pearson Correlation
because it is insensitive to constant additions and positive multiplications. To ensure a fair comparison between
different settings, we test how well a model trained on CTEs from one setting generalises to a combined
test set that contains CTEs from all settings.</p>
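        <p>The invariance argument can be checked numerically. The sketch below implements the Pearson correlation in plain Python and compares a set of predictions against rewards before and after a positive affine transformation; the numbers are made up for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


predictions = [0.2, -0.1, 0.7, 0.5]          # made-up model outputs
rewards = [0.1, -0.4, 0.9, 0.3]              # made-up average rewards
rescaled = [3.0 * r + 5.0 for r in rewards]  # positive scaling plus a shift
flipped = [-r for r in rewards]              # negative scaling

rho = pearson(predictions, rewards)
rho_rescaled = pearson(predictions, rescaled)
rho_flipped = pearson(predictions, flipped)
```
]]></preformat>
        <p>The correlation is unchanged by the positive rescaling, while multiplying the rewards by a negative number flips its sign, which is why only positive multiplications leave the measure intact.</p>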
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>This section describes the results of three experiments that test the overall informativeness of CTEs,
compare the generation algorithms and evaluate the quality criteria.</p>
      <sec>
        <title>5.1. Experiment 1: Informativeness of Explanations for proxy-human model</title>
        <p>Experimental Setup: We want to determine the success of our methods in generating informative
explanations for a proxy-human model $M_\theta$, while also comparing the generation algorithms on the
downstream task. As described in Section 4.2, each generation algorithm produced 800 CTEs on which
we trained 10 $M_\theta$s each, before testing the Pearson Correlation between their predictions and the
average rewards on a combined test set of 600 CTEs. We use the weights from Table 2 for the quality
criteria.</p>
        <p>Results: Figure 3a shows that $M_\theta$s trained on CTEs from MCTO achieved on average higher
correlation values. $M_\theta$s trained on DaC’s CTEs were significantly ($p &lt; 0.001$) worse, while the models
trained on randomly generated CTEs achieved a much lower correlation on both tasks.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.2. Experiment 2: Quality of Generation Algorithms</title>
        <p>Experimental Setup: This experiment tests how good the generation algorithms are at optimising
for the quality value. Each generation algorithm produced 1000 CTEs and their quality value $q$ was
measured. To make this test independent of the weights for quality criteria, each CTE is optimised for a
different uniformly sampled set of weights: $\omega = \{\omega^i_{Validity}, \dots, \omega^i_{Sparsity}\}_{i=0}^{1000}$, where $\omega^i \sim U(0, 1)$.</p>
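        <p>Table 1: Metrics recorded for each generation algorithm: average quality value $q$ (↑), standard deviation of the quality value $q$,
efficiency in s/CTE (↓)<sup>3</sup>, length (# steps) and starting point (# first step).</p>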
        <p>Furthermore, the efficiency of the algorithms (seconds per generated CTE) and the length and starting time of
CTEs were recorded.</p>
        <p>Results: From Table 1, we see that MCTO achieved a higher average quality value than DaC, which
in turn outperformed the random baseline (differences are significant with $p &lt; 10^{-7}$). However, the
higher performance came at a computational cost, since MCTO was slower, while Random was very
efficient. On average, the trajectories of Random were the longest and those of MCTO the shortest.
Lastly, both MCTO and DaC tended to choose starting times earlier in the environment (20.96 and 20.45
out of 75 timesteps).</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Experiment 3: Informativeness of quality criteria</title>
        <p>Experimental Setup: Finally, we wanted to determine the influence of a quality criterion on
informativeness. For this, we analyzed the Spearman correlation between the weight assigned to the criterion
during the generation of a set of CTEs and the informativeness of this set of CTEs. Simultaneously we
carried out the calibration phase to determine the set of weights which leads to the most informative
CTEs for an explainee and generation algorithm.</p>
        <p>Thirty sets of weights ω were each used to generate one set of 1000 CTEs with MCTO. From each set, 800 CTEs were
used to train 10 proxy-human models as described in Section 4.2. The performances of the resulting 30 sets of proxy-human models were
evaluated on a test set that combines the remaining 200 samples from each of the 30 sets of CTEs. This
indicates the informativeness of the CTEs they were trained on. By measuring the Spearman correlation
between the weights assigned to a criterion and the informativeness of the resulting CTEs for the proxy-human model, we
can infer the importance of that criterion for making CTEs informative. Furthermore, we record the set
of weights that leads to the most informative CTEs for each generation algorithm except Random,
which is independent of weights.</p>
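        <p>The correlation analysis described above can be sketched as follows. This is an illustrative, standard-library-only implementation of the Spearman rank correlation (Pearson correlation on average ranks), not the authors' code; the helper names and the toy numbers are hypothetical and are not results from the paper. The last line mirrors the calibration step by picking the weight set with the highest informativeness.</p>

```python
import math


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i != len(order):
        j = i
        # Extend j to the end of the run of values tied with values[order[i]].
        while j + 1 != len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical toy data: one entry per weight set (the paper uses 30 sets).
validity_weights = [0.1, 0.4, 0.5, 0.8, 0.9]       # weight given to one criterion
informativeness = [0.20, 0.35, 0.30, 0.50, 0.55]  # test performance of the models

rho = spearman(validity_weights, informativeness)
best_index = max(range(len(informativeness)), key=lambda i: informativeness[i])
print(round(rho, 3), best_index)
```

        <p>Repeating the call to spearman once per criterion, each time with that criterion's weights across all weight sets, gives one correlation value per criterion and task.</p>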
        <p>Results: Figure 3b shows that for both contrastive and single learning, the weight of Validity
correlated most strongly with the informativeness for the proxy-human model. This is followed by Proximity,
Realisticness, Diversity and State importance, which all show a moderate correlation with the
informativeness, while Sparsity was barely or even negatively correlated with informativeness. While there
are differences between the importance of criteria for the two tasks, they end up with similar results.</p>
        <p>Furthermore, we find that the same set of weights leads to the most informative CTEs for both MCTO
and DaC. It assigns very high weights to Validity and Proximity, while Realisticness is weighted low.
Contrary to Figure 3b, Sparsity is highly weighted.</p>
        <p>Footnote 3: Efficiency differs depending on the hardware used.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.4. Discussion</title>
        <p>CTEs are informative for the proxy-human model. Experiment 1 shows that an NN-based model
trained on CTEs is much better than random guessing at predicting rewards or judging the difference in
rewards between unseen CTEs. It also shows a capability to generalise to out-of-distribution examples
when making predictions for CTEs generated by other algorithms. This indicates that CTEs enable an explainee
to learn some aspects of the reward function which hold generally across different distributions of
trajectories.</p>
        <p>However, the fact that the correlations of the proxy-human model’s predictions with the true labels are ≤ 0.60 clearly
shows that there are aspects of the reward function that the model did not pick up on. This could be
explained by a lack of training samples, a loss of information during the feature extraction or insufficient
coverage of different situations in the environment. Furthermore, the studied reward function is noisy,
often outputting different rewards for apparently similar situations, and is thus hard to understand.</p>
        <p>MCTO generated the most informative CTEs, while the CTEs from Random were less informative.
Similarly, we find that MCTO is the most effective generation algorithm for optimising the
quality value, while DaC outperforms Random. The fact that the algorithms which achieved higher
quality values in Experiment 2 also produced more informative CTEs in Experiment 1 indicates that
optimising well for the quality value is generally useful for making more informative CTEs. Table 1
shows a trade-off between the performance and efficiency of the generation algorithms, which likely
appears because a more exhaustive search finds higher-scoring CTEs. Furthermore, MCTO and DaC
selected CTEs with earlier starting times. This is because the environment had higher fluctuations in
rewards early on, which benefits Validity and State importance. This shows that they are able to select
CTEs in more interesting parts of the environment. They also tend to choose shorter trajectories, which
score higher on Sparsity.</p>
        <p>Among the criteria, Validity is the most important for generating informative CTEs,
as shown in Experiment 3. High weights for Validity lead to larger differences in rewards and thus to a
larger range of labels for contrastive predictions. Possibly, an NN can learn more information from these
larger differences and is thus better informed by CTEs that are high in Validity. Proximity, Realisticness,
Diversity and State importance are also beneficial for having the proxy-human model learn from CTEs,
but we are less certain about why they are beneficial. Although prioritising Sparsity does not correlate
with informativeness, the most informative set of weights does give it a high weight. However, this
high weight might be a fluke since we only tried 30 sets of weights. In any case, we should not conclude
that humans would not benefit from sparse explanations. While NNs can easily compute gradients over
many different features simultaneously, humans can only draw inferences about a few features at once
[45]. This suggests that the prioritisation of quality criteria will likely differ for a human.</p>
        <p>The fact that the two tasks largely agreed on the importance of quality criteria indicates that they
complement each other. This might be because the two tasks are similar and thus benefit from developing
similar representations in the shared body of the network. Furthermore, because the same set of weights
out of 30 options led to the most informative CTEs when using MCTO and DaC, we can speculate that
the relative importance of quality criteria for an explainee is similar, independent of the generation
algorithm used.</p>
        <p>Limitations: Since we do not measure the informativeness of CTEs for a human user, our experiments
do not prove that CTEs are informative for humans or show how important the criteria would be to
a user. Furthermore, we only conduct experiments on a single learned reward function in a single
environment, making it unclear how our findings will generalise to other settings. The method might
especially struggle with large and complex environments where it is difficult to achieve high coverage of
the environment with CTEs. Further, the method depends on the ability to reset the environment to previous states,
which is not possible in some environments. Lastly, our evaluation measure depends on hand-crafted
features, which limits its applicability.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Related Work</title>
      <p>This section covers previous work on the interpretability of reward functions and counterfactual
explanations for AI.</p>
      <sec id="sec-6-1">
        <title>6.1. Interpretability of Learned Reward Functions</title>
        <p>Reward functions can be made intrinsically more interpretable by learning them as decision trees
[46, 47, 48] or in logical domains [49, 50]. Attempts have been made to make deep reward functions
more interpretable by simplifying them through equivalence transformation [51] or by imitating a
Neural Network with a decision tree [52]. However, such interpretable representations can negatively
impact the performance of the method.</p>
        <p>To avoid this drawback, we interpret learned reward functions via post-hoc explanations. Post-hoc
methods are applied after the model has been trained to explain the model’s decision-making process.
Sanneman and Shah [53, 54] test the effectiveness and required cognitive workload of simple explanation
techniques about linear reward functions. While their work requires linear reward functions, our method
is applicable to any representation of a reward function.</p>
        <p>The closest work to ours comes from Michaud et al. [21] who apply gradient salience and occlusion
maps to identify flaws in a learned reward function and employ handcrafted counterfactual inputs to
validate their findings. Our work focuses on counterfactuals and automatically generates them to be of
high quality.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Counterfactual Explanations</title>
        <p>Despite a large body of work on generating counterfactual explanations about ML models in supervised
learning problems [55, 29, 56, 57] and their relation to human psychology [30, 58], this approach has
only recently been adapted to explain RL policies. Counterfactuals consist of a change in certain input
variables which cause a change in outputs [26]. In the RL setting, counterfactual explanations can
be changes in Features, Goals, Objectives, Events, or Expectations that cause the agent to change its
pursued Actions, Plans, or Policies [32]. This can improve users’ understanding of out-of-distribution
behaviour [36], provide them with more informative demonstrations [59] or showcase how an agent’s
environmental beliefs influence its planning [60]. Instead of explaining a policy, this paper presents
the first principled attempt to use counterfactuals to explain a reward function.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>While reward learning presents a promising approach for aligning AI systems with human values, there
is a lack of methods to interpret the resulting reward functions. To address this we formulate the notion
of Counterfactual Trajectory Explanations (CTEs) and propose algorithms to generate them. Our results
show that CTEs are informative for an explainee, but do not lead to a perfect understanding of the
reward function. Further, they show that our MCTO algorithm is effective at generating CTEs and
imply that the difference in outcomes between an original and counterfactual trajectory is especially
important to achieve informative explanations. This research demonstrates that it is fruitful to apply
techniques from XAI to interpret learned reward functions.</p>
      <p>Future work should carry out a user study to test the informativeness of CTEs for humans.
Furthermore, the method should be evaluated in more complex environments and on a range of reward
functions produced by different reward learning algorithms. Ultimately, we hope that CTEs will be
used in practice to allow users to understand the misalignments between their values and a reward
function, thus enabling them to improve the reward function with new demonstrations or feedback.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The project on which this report is based was funded by the Federal Ministry of Education and Research
under the funding code 16KIS2012. The responsibility for the content of this publication lies with the
author. Further, this research was partially supported by TAILOR, a project funded by the EU Horizon
2020 research and innovation programme under GA No 952215.</p>
      <p>
AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative
Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advances
in Artificial Intelligence, EAAI 2023, Washington, DC, USA, February 7-14, 2023, AAAI Press, 2023,
pp. 15136–15143. URL: https://doi.org/10.1609/aaai.v37i12.26766. doi:10.1609/AAAI.V37I12.
26766.
[15] S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, et al., Open problems and fundamental
limitations of reinforcement learning from human feedback, arXiv preprint arXiv:2307.15217
(2023). arXiv:2307.15217.
[16] R. Lera-Leri, F. Bistaffa, M. Serramia, M. López-Sánchez, J. A. Rodríguez-Aguilar, Towards pluralistic
value alignment: Aggregating value systems through lp-regression, in: 21st International
Conference on Autonomous Agents and Multiagent Systems, AAMAS 2022, Auckland, New Zealand,
May 9-13, 2022, International Foundation for Autonomous Agents and Multiagent Systems
(IFAAMAS), 2022, pp. 780–788. URL: https://www.ifaamas.org/Proceedings/aamas2022/pdfs/p780.pdf.
doi:10.5555/3535850.3535938.
[17] I. van de Poel, Understanding value change, Prometheus 38 (2022) 7–24.
[18] E. Liscio, M. van der Meer, L. C. Siebert, C. M. Jonker, P. K. Murukannaiah, What values should
an agent align with?, Auton. Agents Multi Agent Syst. 36 (2022) 23. URL: https://doi.org/10.1007/
s10458-022-09550-0. doi:10.1007/S10458-022-09550-0.
[19] L. Sanneman, J. Shah, Transparent value alignment, in: Companion of the 2023 ACM/IEEE
International Conference on Human-Robot Interaction, HRI 2023, Stockholm, Sweden, March
13-16, 2023, ACM, 2023, pp. 557–560. URL: https://doi.org/10.1145/3568294.3580147. doi:10.1145/
3568294.3580147.
[20] R. Dwivedi, D. Dave, H. Naik, S. Singhal, O. F. Rana, P. Patel, B. Qian, Z. Wen, T. Shah, G. Morgan,
R. Ranjan, Explainable AI (XAI): core ideas, techniques, and solutions, ACM Comput. Surv. 55
(2023) 194:1–194:33. URL: https://doi.org/10.1145/3561048. doi:10.1145/3561048.
[21] E. J. Michaud, A. Gleave, S. Russell, Understanding learned reward functions, CoRR abs/2012.05862
(2020). URL: https://arxiv.org/abs/2012.05862. arXiv:2012.05862.
[22] R. M. Byrne, Counterfactual thought, Annual review of psychology 67 (2016) 135–157.
[23] D. Kahneman, D. T. Miller, Norm theory: Comparing reality to its alternatives., Psychological
review 93 (1986) 136.
[24] N. J. Roese, J. M. Olson, What might have been: The social psychology of counterfactual thinking,
Psychology Press, 2014.
[25] B. D. Mittelstadt, C. Russell, S. Wachter, Explaining explanations in AI, in: Proceedings of the
Conference on Fairness, Accountability, and Transparency, FAT* 2019, Atlanta, GA, USA, January
29-31, 2019, ACM, 2019, pp. 279–288. URL: https://doi.org/10.1145/3287560.3287574. doi:10.1145/
3287560.3287574.
[26] S. Wachter, B. D. Mittelstadt, C. Russell, Counterfactual explanations without opening the black
box: Automated decisions and the GDPR, CoRR abs/1711.00399 (2017). URL: http://arxiv.org/abs/
1711.00399. arXiv:1711.00399.
[27] D. R. Mandel, Of causal and counterfactual explanation, in: Understanding counterfactuals,
understanding causation: Issues in philosophy and psychology, Oxford University Press, 2011, p.
147.
[28] M. Peschl, A. Zgonnikov, F. A. Oliehoek, L. C. Siebert, MORAL: aligning AI with human
norms through multi-objective reinforced active learning, in: 21st International Conference
on Autonomous Agents and Multiagent Systems, AAMAS 2022, Auckland, New Zealand, May
9-13, 2022, International Foundation for Autonomous Agents and Multiagent Systems
(IFAAMAS), 2022, pp. 1038–1046. URL: https://www.ifaamas.org/Proceedings/aamas2022/pdfs/p1038.pdf.
doi:10.5555/3535850.3535966.
[29] A. Artelt, B. Hammer, On the computation of counterfactual explanations - A survey, CoRR
abs/1911.07749 (2019). URL: http://arxiv.org/abs/1911.07749. arXiv:1911.07749.
[30] M. T. Keane, E. M. Kenny, E. Delaney, B. Smyth, If only we had better counterfactual explanations:
Five key deficits to rectify in the evaluation of counterfactual XAI techniques, in: Proceedings of
the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event /
Montreal, Canada, 19-27 August 2021, ijcai.org, 2021, pp. 4466–4474. URL: https://doi.org/10.24963/
ijcai.2021/609. doi:10.24963/IJCAI.2021/609.
[31] A. Verma, V. Murali, R. Singh, P. Kohli, S. Chaudhuri, Programmatically interpretable reinforcement
learning, in: Proceedings of the 35th International Conference on Machine Learning, ICML 2018,
Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine
Learning Research, PMLR, 2018, pp. 5052–5061. URL: http://proceedings.mlr.press/v80/verma18a.
html.
[32] J. Gajcin, I. Dusparic, Redefining counterfactual explanations for reinforcement learning: Overview,
challenges and opportunities, ACM Comput. Surv. 56 (2024) 219:1–219:33. URL: https://doi.org/10.
1145/3648472. doi:10.1145/3648472.
[33] T. Miller, Explanation in artificial intelligence: Insights from the social sciences, Artif. Intell. 267
(2019) 1–38. URL: https://doi.org/10.1016/j.artint.2018.07.007. doi:10.1016/J.ARTINT.2018.07.
007.
[34] M. Dubuisson, A. K. Jain, A modified Hausdorff distance for object matching, in: 12th IAPR
International Conference on Pattern Recognition, Conference A: Computer Vision &amp; Image
Processing, ICPR 1994, Jerusalem, Israel, 9-13 October, 1994, Volume 1, IEEE, 1994, pp. 566–568. URL:
https://doi.org/10.1109/ICPR.1994.576361. doi:10.1109/ICPR.1994.576361.
[35] S. H. Huang, D. Held, P. Abbeel, A. D. Dragan, Enabling robots to communicate their objectives,
Auton. Robots 43 (2019) 309–326. URL: https://doi.org/10.1007/s10514-018-9771-0. doi:10.1007/
S10514-018-9771-0.
[36] J. Frost, O. Watkins, E. Weiner, P. Abbeel, T. Darrell, B. A. Plummer, K. Saenko, Explaining
reinforcement learning policies through counterfactual trajectories, CoRR abs/2201.12462 (2022).
URL: https://arxiv.org/abs/2201.12462. arXiv:2201.12462.
[37] N. J. Roese, The functional basis of counterfactual thinking., Journal of personality and Social
Psychology 66 (1994) 805.
[38] S. H. Huang, K. Bhatia, P. Abbeel, A. D. Dragan, Establishing appropriate trust via critical
states, in: 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2018,
Madrid, Spain, October 1-5, 2018, IEEE, 2018, pp. 3929–3936. URL: https://doi.org/10.1109/IROS.
2018.8593649. doi:10.1109/IROS.2018.8593649.
[39] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser,
I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalchbrenner,
I. Sutskever, T. P. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis, Mastering the
game of go with deep neural networks and tree search, Nat. 529 (2016) 484–489. URL: https:
//doi.org/10.1038/nature16961. doi:10.1038/NATURE16961.
[40] T. Vodopivec, S. Samothrakis, B. Ster, On monte carlo tree search and reinforcement learning, J.
Artif. Intell. Res. 60 (2017) 881–936. URL: https://doi.org/10.1613/jair.5507. doi:10.1613/JAIR.5507.
[41] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms,
CoRR abs/1707.06347 (2017). URL: http://arxiv.org/abs/1707.06347. arXiv:1707.06347.
[42] J. Fu, K. Luo, S. Levine, Learning robust rewards with adversarial inverse reinforcement learning,
CoRR abs/1710.11248 (2017). URL: http://arxiv.org/abs/1710.11248. arXiv:1710.11248.
[43] A. O. Salau, S. Jain, Feature extraction: A survey of the types, techniques, applications, in: 2019
International Conference on Signal Processing and Communication (ICSC), 2019, pp. 158–164.
doi:10.1109/ICSC45622.2019.8938371.
[44] A. Y. Ng, D. Harada, S. Russell, Policy invariance under reward transformations: Theory and
application to reward shaping, in: Proceedings of the Sixteenth International Conference on
Machine Learning (ICML 1999), Bled, Slovenia, June 27 - 30, 1999, Morgan Kaufmann, 1999, pp.
278–287.
[45] G. A. Miller, The magical number seven, plus or minus two: Some limits on our capacity for
processing information., Psychological review 63 (1956) 81.
[46] T. Bewley, F. Lécué, Interpretable preference-based reinforcement learning with tree-structured
reward functions, in: 21st International Conference on Autonomous Agents and Multiagent
Systems, AAMAS 2022, Auckland, New Zealand, May 9-13, 2022, International Foundation for
Autonomous Agents and Multiagent Systems (IFAAMAS), 2022, pp. 118–126. URL: https://www.
ifaamas.org/Proceedings/aamas2022/pdfs/p118.pdf. doi:10.5555/3535850.3535865.
[47] A. Kalra, D. S. Brown, Interpretable reward learning via differentiable decision trees, in: NeurIPS
ML Safety Workshop, 2022.
[48] S. Srinivasan, F. Doshi-Velez, Interpretable batch irl to extract clinician goals in icu hypotension
management, AMIA Summits on Translational Science Proceedings 2020 (2020) 636.
[49] D. Kasenberg, M. Scheutz, Interpretable apprenticeship learning with temporal logic specifications,
in: 56th IEEE Annual Conference on Decision and Control, CDC 2017, Melbourne, Australia,
December 12-15, 2017, IEEE, 2017, pp. 4914–4921. URL: https://doi.org/10.1109/CDC.2017.8264386.
doi:10.1109/CDC.2017.8264386.
[50] T. Munzer, B. Piot, M. Geist, O. Pietquin, M. Lopes, Inverse reinforcement learning in relational
domains, in: Proceedings of the Twenty-Fourth International Joint Conference on Artificial
Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, AAAI Press, 2015, pp. 3735–
3741. URL: http://ijcai.org/Abstract/15/525.
[51] E. Jenner, A. Gleave, Preprocessing reward functions for interpretability, CoRR abs/2203.13553
(2022). URL: https://doi.org/10.48550/arXiv.2203.13553. doi:10.48550/ARXIV.2203.13553.
arXiv:2203.13553.
[52] J. Russell, E. Santos, Explaining reward functions in markov decision processes, in: Proceedings
of the Thirty-Second International Florida Artificial Intelligence Research Society Conference,
Sarasota, Florida, USA, May 19-22 2019, AAAI Press, 2019, pp. 56–61. URL: https://aaai.org/ocs/
index.php/FLAIRS/FLAIRS19/paper/view/18275.
[53] L. Sanneman, J. Shah, Explaining reward functions to humans for better human-robot collaboration,
CoRR abs/2110.04192 (2021). URL: https://arxiv.org/abs/2110.04192. arXiv:2110.04192.
[54] L. Sanneman, J. A. Shah, An empirical study of reward explanations with human-robot interaction
applications, IEEE Robotics Autom. Lett. 7 (2022) 8956–8963. URL: https://doi.org/10.1109/LRA.
2022.3189441. doi:10.1109/LRA.2022.3189441.
[55] S. Verma, V. Boonsanong, M. Hoang, K. E. Hines, J. P. Dickerson, C. Shah, Counterfactual
explanations and algorithmic recourses for machine learning: A review, arXiv preprint arXiv:2010.10596
(2020).
[56] R. Guidotti, Counterfactual explanations and how to find them: literature review and benchmarking,
Data Min. Knowl. Discov. 38 (2024) 2770–2824. URL: https://doi.org/10.1007/s10618-022-00831-6.
doi:10.1007/S10618-022-00831-6.
[57] I. Stepin, J. M. Alonso, A. Catalá, M. Pereira-Fariña, A survey of contrastive and counterfactual
explanation generation methods for explainable artificial intelligence, IEEE Access 9 (2021)
11974–12001. URL: https://doi.org/10.1109/ACCESS.2021.3051315. doi:10.1109/ACCESS.2021.
3051315.
[58] R. M. J. Byrne, Counterfactuals in explainable artificial intelligence (XAI): evidence from human
reasoning, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial
Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, ijcai.org, 2019, pp. 6276–6282. URL:
https://doi.org/10.24963/ijcai.2019/876. doi:10.24963/IJCAI.2019/876.
[59] M. S. Lee, H. Admoni, R. G. Simmons, Reasoning about counterfactuals to improve human
inverse reinforcement learning, in: IEEE/RSJ International Conference on Intelligent Robots
and Systems, IROS 2022, Kyoto, Japan, October 23-27, 2022, IEEE, 2022, pp. 9140–9147. URL:
https://doi.org/10.1109/IROS47612.2022.9982062. doi:10.1109/IROS47612.2022.9982062.
[60] G. J. Stein, Generating high-quality explanations for navigation in partially-revealed
environments, in: Advances in Neural Information Processing Systems 34: Annual
Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14,
2021, virtual, 2021, pp. 17493–17506. URL: https://proceedings.neurips.cc/paper/2021/hash/
926ec030f29f83ce5318754fdb631a33-Abstract.html.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nemati</surname>
          </string-name>
          , G. Yin,
          <article-title>Reinforcement learning in healthcare: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>5:1</fpage>
          -
          <lpage>5:36</lpage>
          . URL: https://doi.org/10.1145/3477600. doi:10.1145/3477600.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. R.</given-names>
            <surname>Kiran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sobh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Talpaert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mannion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A. A.</given-names>
            <surname>Sallab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K.</given-names>
            <surname>Yogamani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pérez</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning for autonomous driving: A survey</article-title>
          ,
          <source>IEEE Trans. Intell. Transp. Syst</source>
          .
          <volume>23</volume>
          (
          <year>2022</year>
          )
          <fpage>4909</fpage>
          -
          <lpage>4926</lpage>
          . URL: https://doi.org/10.1109/TITS.2021.3054625. doi:10.1109/TITS.2021.3054625.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Afsar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Crump</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. H.</given-names>
            <surname>Far</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning based recommender systems: A survey</article-title>
          ,
          <source>ACM Comput. Surv</source>
          .
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>145:1</fpage>
          -
          <lpage>145:38</lpage>
          . URL: https://doi.org/10.1145/3543846. doi:10.1145/3543846.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Siebert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Lupetti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Aizenberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Beckers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zgonnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Veluwenkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Abbink</surname>
          </string-name>
          , E. Giaccardi, G. Houben,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Jonker</surname>
          </string-name>
          , J. van den Hoven, D. Forster,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Lagendijk</surname>
          </string-name>
          ,
          <article-title>Meaningful human control: actionable properties for AI system development</article-title>
          ,
          <source>AI Ethics</source>
          <volume>3</volume>
          (
          <year>2023</year>
          )
          <fpage>241</fpage>
          -
          <lpage>255</lpage>
          . URL: https://doi.org/10.1007/s43681-022-00167-3. doi:10.1007/S43681-022-00167-3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>S.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Human compatible: Artificial intelligence and the problem of control</article-title>
          ,
          <source>Penguin</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bhatia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <article-title>The effects of reward misspecification: Mapping and mitigating misaligned models</article-title>
          ,
          in:
          <source>The Tenth International Conference on Learning Representations, ICLR 2022</source>
          , Virtual Event, April 25-29, 2022, OpenReview.net,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=JYtwGwIL7ye.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mané</surname>
          </string-name>
          ,
          <article-title>Concrete problems in AI safety</article-title>
          ,
          <source>CoRR</source>
          abs/1606.06565 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1606.06565. arXiv:1606.06565.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>P. F.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Deep reinforcement learning from human preferences</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017</source>
          , December 4-9, 2017, Long Beach, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>4299</fpage>
          -
          <lpage>4307</lpage>
          . URL: https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ndousse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          , et al.,
          <article-title>Training a helpful and harmless assistant with reinforcement learning from human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2204.05862</source>
          (
          <year>2022</year>
          ). arXiv:2204.05862.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <article-title>Algorithms for inverse reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the Seventeenth International Conference on Machine Learning</source>
          ,
          <year>2000</year>
          , pp.
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Everitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Martic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Maini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Legg</surname>
          </string-name>
          ,
          <article-title>Scalable agent alignment via reward modeling: a research direction</article-title>
          ,
          <source>CoRR abs/1811.07871</source>
          (
          <year>2018</year>
          ). URL: http://arxiv.org/abs/1811.07871. arXiv:1811.07871.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          ,
          <source>in: Machine Learning, Proceedings of the Twenty-first International Conference (ICML 2004)</source>
          , Banff, Alberta, Canada, July 4-8, 2004, volume
          <volume>69</volume>
          of ACM International Conference Proceeding Series, ACM,
          <year>2004</year>
          . URL: https://doi.org/10.1145/1015330.1015430. doi:10.1145/1015330.1015430.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Armstrong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mindermann</surname>
          </string-name>
          ,
          <article-title>Occam's razor is insufficient to infer the preferences of irrational agents</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018</source>
          , December 3-8, 2018, Montréal, Canada,
          <year>2018</year>
          , pp.
          <fpage>5603</fpage>
          -
          <lpage>5614</lpage>
          . URL: https://proceedings.neurips.cc/paper/2018/hash/d89a66c7c80a29b1bdbab0f2a1a94af8-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>J.</given-names>
            <surname>Skalse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abate</surname>
          </string-name>
          ,
          <article-title>Misspecification in inverse reinforcement learning</article-title>
          , in: Thirty-Seventh
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>