<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Verification and Learning Using Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saaduddin Mahmud</string-name>
          <email>smahmud@umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandhya Saisubramanian</string-name>
          <email>sandhya.sai@oregonstate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shlomo Zilberstein</string-name>
          <email>shlomo@umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Reward Alignment, Explanation Generation, Inverse Reinforcement Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oregon State University</institution>
          ,
          <addr-line>Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Massachusetts Amherst</institution>
          ,
          <addr-line>Massachusetts</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>When a human expert demonstrates the desired behavior, there often exist multiple reward functions consistent with the observed demonstrations. As a result, agents often learn a proxy reward function to encode their observations. Operating based on proxy rewards may be unsafe. Furthermore, black-box representations make it difficult for the demonstrator to verify the learned reward function and prevent harmful behavior. We investigate the efficiency of using explanations to update and verify a learned reward function, to ensure that it aligns with the demonstrator's intent. The problem is formulated as inverse reinforcement learning from ranked expert demonstrations, with verification tests to validate the alignment of the learned reward. The agent explains its reward function and the human signals whether the explanation passes the verification test. When the explanation is rejected, the agent presents additional alternative explanations to acquire feedback, such as a preference ordering over explanations, which helps it learn the intended reward. We analyze the efficiency of our approach in learning reward functions from different types of explanations and present empirical results on five domains. Our results demonstrate the effectiveness of our approach in learning and generalizing human-aligned rewards.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With dramatic recent advances in artificial intelligence,
autonomous agents are being increasingly deployed in real-world settings.
A predominant way to train such agents in the absence of
a reward function is learning from demonstration (LfD) [2].</p>
      <sec id="sec-1-1">
        <title>Inverse reinforcement learning (IRL) is a form of LfD de</title>
        <p>signed to retrieve a reward function that captures the
demonstrator’s behavior [3], allowing agents to learn and
generalize the observed behavior to unseen situations.</p>
        <p>Despite the success of IRL in many research settings,
two key limitations may lead to unsafe behavior of the
deployed system: (1) the demonstrations may cover only
a subset of states, providing no direct information about
acceptable behavior in other states; and (2) a large space
of candidate reward functions may be consistent with the
demonstrations, each producing slightly diferent
behavConsequently, an agent may learn a proxy reward that
leads to unpredictable, unsafe behavior when
encountering novel situations.
CEUR
htp:/ceur-ws.org
ISN1613-073
© 2023 Copyright for this paper by its authors. Use permitted under Creative</p>
        <p>CEUR
a human is crossing the street with dogs. Since dogs
are often accompanied by humans, the rare case of
encountering dogs alone might be missing from the dataset.</p>
        <p>Consider four diferent reward functions consistent with
ative reward for not stopping in this case.  1 does not
account for dogs,  2 rewards stopping for pedestrians
with dogs,  3 rewards stopping for pedestrians or dogs,
and  4 rewards stopping for all objects, including leaves
or a plastic bag on the road. In the absence of additional
information, the AV may randomly learn one of these
reward functions (say  2), however,  3 represents the true
intent of the demonstrator. When operating based on  2,
the AV may not stop for dogs unaccompanied by humans.</p>
        <p>This example illustrates the inherent reward ambiguity
in IRL and the consequences of learning a proxy reward.</p>
        <p>Existing IRL methods aim to resolve reward ambiguity
by either introducing heuristics such as Max Margin [4]
mation such as a trajectory ranking [6] or a preference
diferentiator [ 7]. However, these approaches are not
guaranteed to avoid reward ambiguity and they do not
verify the learned reward. Recently, Brown et al. [8]
icy, but it does not amend the reward if it is misaligned.</p>
        <p>Further, these methods are not interpretable and may
require additional knowledge, such as the value function.</p>
      <p>To address this issue, we introduce a general framework that
utilizes explanations to learn a reward function that is aligned with the
demonstrator's intent. Our framework for reward verification and learning
using explanations (REVEALE) consists of a reward learning phase and a
verification phase. In the reward learning phase, the agent learns a
reward function based on the demonstrations. In the verification phase,
the demonstrator verifies the reward alignment through verification tests
in the form of queries to the agent. The agent responds by explaining its
reward, and the demonstrator signals whether the model passes the
verification test. When it fails, the agent queries the demonstrator by
presenting additional explanations from alternative candidate reward
models. The demonstrator provides feedback by selecting the explanation
that matches most closely their intended reward. This is followed by the
reward learning phase, in which the agent updates its prior over
candidate reward functions based on the feedback. Thus, REVEALE can
identify and fix inconsistencies in the learned reward by alternating
between learning and verification until the verification test is
passed.</p>
      <p>We use verification tests in the form of a query: "explain the
reward at state s", and explanations in the form of feature attribution.
We use feature attribution-based explanations due to their simplicity,
but the framework is general and can work with any form of explanation
that can help the demonstrator interpret a reward function. An example of
a verification test for the scenario described in Figure 1 is "explain
the reward when the AV encounters a dog accompanied by humans," to which
the agent would respond with its reward value and feature attributions
indicating a low weight for the 'dogs' feature. This reveals a potential
weakness of the model in the counterfactual scenario in which the dog is
not accompanied by a human (missing from the dataset). When this fails
the verification test, the agent explains another candidate reward
function (for example, R2 and R3). The demonstrator then selects an
explanation that is closer to their intended reward (R3 in this case),
indicating that R3 is preferred over R2 and the desired behavior is to
stop for pedestrians or dogs.</p>
      <p>Our key contributions are: (1) introducing a general framework for
verifying and learning human-aligned reward from demonstrations using
explanations; (2) presenting an algorithm to generate explanations as
feature attributions for reward functions and verify the learned reward
through human feedback; (3) analyzing the reduction in reward ambiguity
for linear rewards; and (4) demonstrating empirically the effectiveness
of our approach on five domains.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>Reward learning. Most IRL algorithms learn a reward function using
expert trajectories [<xref ref-type="bibr" rid="ref1">4, 9, 10</xref>].
Recent algorithms utilize additional information to improve reward
learning, such as preferences over trajectories [11, 12], a prior over
reward functions [<xref ref-type="bibr" rid="ref20">13</xref>], or
feature queries [7]. A key obstacle to the safe deployment of an
autonomous agent is the long tail of novel situations
[<xref ref-type="bibr" rid="ref31">14</xref>] that cannot be predicted by
the demonstrations a priori. In fact, our experiments show that adding
additional demonstrations or preferences over trajectories does not
guarantee improvement in the learned reward. Further, unlike Basu et al.
[7], which uses human feedback to identify a feature that affects
trajectory preferences, we use feedback to identify which automatically
generated explanation aligns best with the intended reward, thereby
reducing reward ambiguity. While the former approach is limited to linear
rewards, our approach generalizes to nonlinear cases. Finally, none of
the existing IRL approaches perform reward verification. Our approach is
complementary to many of the existing reward learning methods, as the
reward verification and explanation phases can be used in tandem with any
type of reward learning.</p>
      <p>Value alignment. Value alignment focuses on ensuring that an
agent's behavior is aligned with its user's intentions. Unlike the
inverse reward design approach [15], which aims to retrieve the intended
reward by treating the specified reward function as a proxy, we learn the
true reward function using human feedback on automatically generated
explanations of potential reward models. While some recent work focused
on value alignment verification (VAV) with a minimum number of queries
[8], our work differs in that: (1) we use human feedback in the form of
preferences over reward explanations to verify the reward, while VAV uses
reward weights, value weights, or trajectory preferences for
verification; and (2) VAV can detect but cannot amend misaligned rewards,
while our approach verifies and rectifies incorrect rewards. Further, VAV
can check the consistency of the value function only in situations that
occur during training, and cannot verify the performance in novel
situations that the agent may encounter after deployment.</p>
      <p>Explainable AI. For autonomous systems to be widely adopted, user
trust in the systems' capability must be built [16], and it is widely
accepted that explanations can induce trust [17]. Much of the existing
work on explainable AI uses feature attributions as explanations to help
understand the relationship between input features and the output of a
learned model. Some of the widely used techniques are LIME [18],
meanRESP [19], SHAP [20], gradient as explanation (GaE)
[<xref ref-type="bibr" rid="ref25">21</xref>], and saliency maps [22].
Besides feature attribution, there are other broad classes of automated
explanation generation methods, such as model reconciliation [23] and
policy summarization [24]. While these existing approaches typically use
explanations to improve interpretability, we use them to verify and
improve the reward model. Relevant to our work is [25], which uses a
policy summarization technique to explain a reward function to humans in
order to induce trust. Another related line of work uses a model
reconciliation method to improve humans' understanding of a reward
function for better collaboration [26]. Unlike these approaches, where
the focus is to induce trust or improve collaboration, our framework uses
explanations to simultaneously learn and verify human-aligned reward.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Background</title>
      <sec id="sec-2-1">
        <title>Markov decision process</title>
        <p>A Markov decision process (MDP) M is represented by the tuple
M = (S, A, T, R, d_0, γ), where S is a finite set of states, A is a
finite set of actions, T: S × A × S → [0, 1] is the transition function,
R: S × A → ℝ is the reward function, d_0 is the initial state
distribution, and γ ∈ [0, 1) is the discount factor. A policy
π: S × A → [0, 1] is a mapping from states to a distribution over
actions. The state and state-action values of a policy π are
V^π(s) = E[∑_{t=0}^∞ γ^t R(s_t) | s_0 = s, π] and
Q^π(s, a) = E[∑_{t=0}^∞ γ^t R(s_t) | s_0 = s, a_0 = a, π], for all s ∈ S
and a ∈ A. The optimal values are denoted by V*(s) = max_π V^π(s) and
Q*(s, a) = max_π Q^π(s, a).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Bayesian IRL</title>
        <p>Expert demonstrations are given in the form of preferences,
τ_i ≻ τ_j, indicating that τ_i is preferred over τ_j. The demonstration
data is denoted by D = {(τ_11 ≻ τ_12), (τ_21 ≻ τ_22), …}, where the
trajectory τ_k1 is preferred over trajectory τ_k2. Hence, such data are
called preferential datasets.</p>
        <p>A Bayesian framework for IRL defines a probability distribution
over reward functions given a demonstration dataset D using Bayes rule,
P(R | D) ∝ P(D | R) P(R). Various algorithms define P(D | R)
differently. We use the definition from B-REX [11], as it is scalable.
Let τ_i and τ_j denote two different trajectories. B-REX defines
P(D | R) as:
P(D | R) = ∏_{(τ_i ≻ τ_j) ∈ D} e^{R(τ_i)} / (e^{R(τ_i)} + e^{R(τ_j)}),
where R(τ) denotes the cumulative reward of trajectory τ under R.</p>
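        <p>For concreteness, the following is a minimal sketch, not taken
from the paper, of this preferential likelihood for a linear reward
R(s) = wᵀφ(s), where each trajectory is an array of per-state feature
vectors; the function names and the use of jax.numpy are illustrative
assumptions.</p>
        <preformat>
# Sketch of the B-REX preferential likelihood for a linear reward R(s) = w . phi(s).
# Each trajectory is an array of per-state feature vectors; names are illustrative.
import jax.numpy as jnp

def trajectory_return(w, traj_features):
    """Cumulative reward R(tau) = sum over states of w . phi(s)."""
    return jnp.sum(traj_features @ w)

def log_likelihood(w, preferences):
    """log P(D | R): sum over (tau_i preferred to tau_j) of
    log( e^{R(tau_i)} / (e^{R(tau_i)} + e^{R(tau_j)}) )."""
    total = 0.0
    for preferred, dispreferred in preferences:
        r_i = trajectory_return(w, preferred)
        r_j = trajectory_return(w, dispreferred)
        # log-sum-exp form of the Bradley-Terry preference probability
        total += r_i - jnp.logaddexp(r_i, r_j)
    return total
        </preformat>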
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. The REVEALE Framework</title>
      <p>Consider an agent operating in an environment modeled as a Markov
decision process (MDP), M = (S, A, T, R, d_0, γ), where the reward
function R is initially unknown to the agent. The agent aims to learn R
using expert demonstration data D. We consider a factored state
representation, and the reward depends only on the state; hence,
R(s_i, a_j) = R(s_i), ∀ s_i, a_j.</p>
      <p>We assume that the agent has access to a limited number of expert
demonstrations to learn R. While the general framework can leverage
different types of demonstrations, explanation generation, and
verification techniques, we target settings where the demonstrations are
in the form of trajectory preferences, explanations are generated as
reward value and feature attribution, and verification tests describe the
reward function at certain critical states identified by the human. In
Section 5, we briefly discuss how to handle other forms of explanations,
demonstrations, and IRL algorithms.</p>
      <sec id="sec-3-1">
        <title>Demonstration data</title>
        <p>Similar to [11], we use expert
demonstrations in the form of pairwise preference over
trajectories,  = {(</p>
        <p>11 ≻  12), … , (  1 ≻   2)}, called a
preferential dataset, due to its computational eficiency.</p>
      <p>Based on [9, 27], the space of reward functions consistent with a
policy and with preferential demonstrations is defined as follows.</p>
      <p>Definition 2. Given an MDP M, a policy-consistent reward set,
denoted Δ(π), is the set of reward functions under which π is optimal,
Δ(π) = {R ∈ ℛ | V^π_R(s_0) = V^*_R(s_0)}, where s_0 is the initial
state.</p>
      <p>Definition 3. Given an MDP M and a preferential dataset D, a
demonstration-consistent reward set, denoted Δ(D), is the set of reward
functions under which the preferred trajectories have a higher reward
than the less preferred trajectories,
Δ(D) = {R ∈ ℛ | R(τ_1) &gt; R(τ_2), ∀(τ_1 ≻ τ_2) ∈ D}.</p>
      <p>In practice, D may not cover all states, and there may be
mismatches in the training and testing environments [15, 28]. Such
situations lead to reward ambiguity, and the agent may end up learning a
proxy reward. REVEALE overcomes these drawbacks by generating
explanations of the reward functions that are consistent with the
expert's policy and demonstrations, which are then verified by the
demonstrator.</p>
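      <p>As an illustration of Definition 3, the following sketch, with
names of our own choosing, checks whether a candidate linear reward is
demonstration-consistent by comparing trajectory returns for every
preference pair.</p>
      <preformat>
# Sketch of a demonstration-consistency check (Definition 3): a reward is consistent
# with the preferential dataset D if every preferred trajectory scores higher.
import jax.numpy as jnp

def trajectory_return(w, traj_features):
    return jnp.sum(traj_features @ w)

def is_demonstration_consistent(w, preferences):
    """preferences: list of (preferred_features, dispreferred_features) array pairs."""
    return all(trajectory_return(w, t1) > trajectory_return(w, t2)
               for t1, t2 in preferences)
      </preformat>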
      <p>Explanation generation function (E). The agent's explanations
involve two components: (1) the reward at the verification test state
s_v, denoted by R(s_v), following the most likely reward function; and
(2) a feature attribution indicating the influence of state features on
the reward function. Attribution-based explanations are commonly used in
interpretable machine learning [29].</p>
      <p>A feature attribution is a scoring function denoting the
contribution of each state feature to the output, E: S × ℛ → ℝ^{|φ|}.
This is a form of local explanation, as it explains the reward function
at each state in isolation. We use established local explanation
techniques mentioned earlier: gradient as explanation (GaE), LIME, and
saliency maps.</p>
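      <p>As a sketch of how such attributions can be computed for a
differentiable reward model, the snippet below, which is our own
illustration rather than the paper's implementation, derives GaE and SM
attributions from the gradient of the reward with respect to the state
features; for a linear reward it recovers the weight vector exactly.</p>
      <preformat>
# Illustrative feature attributions for a differentiable reward model R(phi(s)).
# GaE is the gradient of the reward w.r.t. the state features; SM is its absolute value.
import jax
import jax.numpy as jnp

def gae_attribution(reward_fn, state_features):
    """Gradient-as-explanation: dR / d phi(s), one score per feature."""
    return jax.grad(reward_fn)(state_features)

def saliency_map(reward_fn, state_features):
    """Saliency map: absolute value of the gradient."""
    return jnp.abs(jax.grad(reward_fn)(state_features))

# Example with a linear reward R(s) = w . phi(s): GaE recovers w exactly.
w = jnp.array([1.0, -2.0, 0.5])
linear_reward = lambda phi: jnp.dot(w, phi)
print(gae_attribution(linear_reward, jnp.array([0.3, 0.7, 1.0])))  # [ 1.  -2.   0.5]
      </preformat>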
      <p>Verification and feedback. We use verification tests of the form
"Explain the reward at state s_v," with s_v ∈ S_v, where the
verification states S_v are selected by the demonstrator. However, S_v
can also be generated automatically using techniques such as policy
summarization [30]. The agent responds by automatically generating
explanations, consisting of the reward value at s_v and the feature
attribution describing the reward function.</p>
      <p>Approval: A binary signal indicating whether the demonstrator
approves (ℱ(E(s_v)) = 1) or disapproves (ℱ(E(s_v)) = 0) the explanation,
denoting the outcome of the verification test.</p>
      <p>Explanation feedback: When the verification test fails, the
demonstrator provides feedback on explanations generated by the agent in
one of the following two forms, used by the agent to update R:
(1) Oracle explanations, typically provided by the human, in the form of
exact feature attributions corresponding to their intended reward,
ℱ(E(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, where E_H(s_v) denotes the exact
feature attribution generated by the Oracle. Though this is an ideal
setting, as it provides features that are critical to learning the
intended reward, this type of feedback can be harder to collect in
practice, except for simpler domains.
(2) Pairwise preferences over feature attributions generated by the
agent, ℱ(E_1(s_v), E_2(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, where
E_H(s_v) ∈ {E_1(s_v), E_2(s_v)}, and E_1(s_v) and E_2(s_v) denote
explanations of two different reward models that are consistent with D.
This is a more realistic form of feedback that identifies the explanation
that better captures the intended reward.</p>
      <p>Definition 4. Given verification states S_v, an
explanation-consistent reward set, denoted Δ(E(S_v)), is the set of
reward functions whose corresponding explanations are approved by the
demonstrator, Δ(E(S_v)) = {R ∈ ℛ | ℱ(E_R(s_v)) = 1, ∀ s_v ∈ S_v}.</p>
      <p>Definition 5. Reward ambiguity is a measure proportional to the
size of the consistent reward set Δ, such as |Δ| when Δ is finite and
discrete, and the volume of Δ, vol(Δ), when Δ is continuous, as in a
simplex in ℝ^d.</p>
      <p>REVEALE aims to eliminate reward ambiguity by reducing the size of
the consistent reward set Δ(D) ∩ Δ(E(S_v)), using verification tests and
feedback on explanations.</p>
      <p>Definition 6. A solution of a REVEALE instance is a reward
function, R ∈ Δ(D) ∩ Δ(E(S_v)), that is better aligned with the
demonstrator's intent.</p>
      <p>An optimal solution eliminates reward ambiguity and identifies a
reward function that is aligned with the demonstrator's intended reward,
when feasible. The following section presents an algorithm that produces
a solution.</p>
      <p>Algorithm 1: ILV. Input: demonstration data D and verification
states S_v. Output: final reward R_b and test score u_b. The algorithm
initializes F = ∅, E_c^{s_v} = ∅ for all s_v ∈ S_v, and R_b drawn
randomly from ℛ.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Solution Approach</title>
      <sec id="sec-4-1">
        <title>Our algorithm, iterative learning and verification of re</title>
        <p>ward (ILV), is outlined in Algorithm 1. The input is a set
of demonstrations,  , and verification test states,  . We

assume that the final reward is the MAP of the
distribuThe algorithm first initializes an empty set of feedback


 , an empty list of candidate explanations   for each

∈   , best reward   randomly drawn from ℛ, and
best test score   . The test scores consist of two numbers:
the first number indicates</p>
        <p>how likely the demonstrations
are under the reward model, and the second number
indicates how good the model is in explaining the reward.</p>
        <p>In each iteration, the current best reward model,
denoted by   , is calculated as the MAP of  (| ,  )
7). The corresponding test score of that model, denoted
by   is initialized to a 2-D array that consists of the
(Line
likelihood of data</p>
        <p>under   , and zero to indicate that
the correctness of the model has not been evaluated (Line
8). For each verification test state   , the reward value
  (  ) and the corresponding explanation   
shown to the demonstrator for approval (Line 10).
(  ) are
If approved, then</p>
        <p>(  ) is added to  (Lines 11-12).</p>
        <p>If disapproved, then the explanation is added to the
candidate explanation set   and additional feedback from

tions from    that have not been queried so far (Lines
14-15). The agent can also add additional explanations
1–10
to</p>
        <p>ditional models from  (| ,  )
either by modifying existing ones or sampling
adand generating their
corresponding explanations for query selection. The
demonstrator can provide no feedback (
unchanged),
generate exact feature attribution, or provide pairwise
preferences over explanations from    .</p>
        <p>A test score for</p>
        <p>(  ), ∀  ∈   is added to   using
the score function (Line 16). The score reflects the
similarity of</p>
        <p>(  ) to human-generated explanations or their
preferred explanations. Finally,   and   are updated
based on the score and the algorithm ends by returning
the best reward and best score, when all verification tests
have been approved,  = 0 (Lines 17-20).</p>
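      <p>The following is a compact sketch of this control flow; the helper
routines (map_estimate, explain, query_demonstrator, score) are
assumptions standing in for components the paper does not specify at the
code level, so the sketch only illustrates how Algorithm 1 alternates
between learning and verification.</p>
      <preformat>
# Sketch of the ILV control flow (Algorithm 1); helper functions are assumed, not the paper's API.
def ilv(demos, verification_states, map_estimate, explain, query_demonstrator, score):
    feedback = []                                       # F: collected explanation feedback
    candidates = {s: [] for s in verification_states}   # E_c^{s_v}: candidate explanations per state
    best_reward, best_score = None, None
    while True:
        reward = map_estimate(demos, feedback)           # MAP of P(R | D, F)
        failed = 0
        for s in verification_states:
            explanation = explain(reward, s)             # reward value at s and feature attribution
            approved, new_feedback = query_demonstrator(s, explanation, candidates[s])
            if approved:
                feedback.append(new_feedback)            # approved explanation constrains the prior
            else:
                failed += 1
                candidates[s].append(explanation)        # kept for later pairwise preference queries
                if new_feedback is not None:             # oracle attribution or a pairwise preference
                    feedback.append(new_feedback)
        current_score = score(reward, demos, feedback)
        if best_score is None or current_score > best_score:
            best_reward, best_score = reward, current_score
        if failed == 0:                                  # all verification tests approved
            return best_reward, best_score
      </preformat>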
      <p>MAP estimation. The maximum a posteriori probability (MAP) is
estimated as P(R | D, F) = P(D | R) P_F(R), where P_F(R) is a prior over
R defined by the feedback on explanations. The feedback on explanations
is represented as a distribution because, semantically, it is a set of
constraints over the model parameters. This also allows generalization to
other IRL algorithms that use the Bayesian framework, as most algorithms
differ only in their likelihood function.</p>
      <sec id="sec-4-1">
        <title>5.1. Learning Linear Reward Using Explanations</title>
        <p>This section analyzes our proposed method using linear reward
models. A linear reward is described by a linear weighted combination
over the d-size vector of features describing the state,
R(s) = wᵀφ(s), w ∈ ℝ^d. The corresponding explanations generated using
GaE and LIME (LM) will produce the same output; therefore, we only
present results using GaE and saliency maps (SM). For a linear reward,
GaE(R(s)) = ∇_{φ(s)} R(s) = ∇_{φ(s)} wᵀφ(s) = w, and SM is |w|. Using
Definition 5, we now show that feedback on GaE-based and SM-based
explanations reduces reward ambiguity. We assume that the agent and the
demonstrator share the same similarity measure, and discuss results with
cosine similarity.</p>
        <p>Proposition 1. The complexity of removing reward ambiguity with
Oracle-generated GaE explanations as feedback,
ℱ(E(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, is O(1).</p>
        <p>Proof sketch. ∀ s ∈ S, the explanation given by the oracle is
GaE(w*ᵀφ(s)) = w*. Thus, one oracle-generated GaE explanation is
sufficient to reduce |Δ(E_H(S_v))| to one.</p>
        <p>Though E_H is often difficult to obtain, it shows the best-case
scenario for REVEALE to eliminate reward ambiguity. The significance of
Proposition 1 is that it establishes a direct bridge between feature
attribution methods and reward learning.</p>
      <sec id="sec-4-4">
        <title>Proposition 2. To reduce reward function ambiguity by</title>
        <p>% in expectation, it sufices to have preference feedback
over a set of  = log2(1/(1 − /100)) randomly generated
Proof Sketch. Consider a set of pairwise preference GaE
feedback denoted by  

2 )} where, ( 1
By Definition
 ,  2) are candidate explanations for  .

4, the explanation-consistent reward set,
Δ( 
  (  )), is described by the half-space constraints:
  (  ) = {( 11 ≻  12), … , ( 
  (  ) enforces that the cosine similarity of w
1 should be larger than the cosine similarity of
w with  2. We want to find the bound  on the size

of  
  (  ), |</p>
        <p>(  )|, that is suficient for reducing
cording to [27] |Δ( 
| 
  (  )| =  . Therefore,  = (1 −
2
explanation pairs.
of ℛ is removed in expectation using feedback over a
set of  = log2(1/(1 − /100)) randomly generated GaE
1
2
 ) ∗ 100% volume
  (  ))| =   in expectation where</p>
      </sec>
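        <p>For example, by this bound, two randomly generated explanation
pairs remove 75% of the reward hypothesis volume in expectation, and four
pairs remove 93.75%.</p>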
      <sec id="sec-4-5">
        <title>Proposition 3. It is suficient to have O(1) Oracle</title>
        <p>= dim(w).
generated SM Feedback to reduce |Δ( 
Proof Sketch. ∀ ∈  , the explanation given by the oracle
w′ such that |w| = |w∗| by taking +, − sign combination
w∗()) = |w∗|. Then we can construct at most 2 ,
of each element of |w∗|. Therefore, |Δ(</p>
        <p>Propositions 1 and 3 show that GaE can be more
efective than SM in reducing reward ambiguity. Our empiri- tion,  (| ,  )
cal results show a similar trend for non-linear rewards.
tion  and feedback ℱ is computed as:
Prior Definition</p>
        <p>The prior   () for explanation
func  () ∝ ℐ (  ℱ(  )),
(1)
by  
 ( |)
where ℐ (.)is 1 if  satisfies all the constraints imposed
ℱ(  ) and 0 otherwise. Now, MAP of  (| ,  ) =</p>
        <p>() can be estimated using an of-the-shelf
Markov Chain Monte Carlo (MCMC) solver.</p>
      <sec id="sec-4-2">
        <title>5.2. Deep REVEALE</title>
        <p>Using MCMC to estimate the MAP of
P(R | D, F) = P(D | R) P_F(R) as in Equation 1 is inefficient in problems
with large dimensions and feedback constraints, because it can take a
large number of steps to get a good estimate of the MAP. In addition,
explanation feedback cannot be represented as linear constraints when
reward functions are represented using neural networks. Hence, we present
a method for calculating the priors as soft versions of the constraints
discussed earlier. Note that when calculating the priors, the agent uses
the previously collected feedback (F) and the explanations generated. For
oracle-generated explanations, E_H(s_v),
P_F(R) ∝ (∏_{E_H(s_v) ∈ F} g(E_R(s_v), E_H(s_v)))^λ,   (2)
where E_R(s_v) is the agent's explanation, E_H(s_v) is the
oracle-generated explanation given as feedback, g(·) is a measurement of
similarity, and λ ∈ [0, ∞). Similarly, for pairwise preferences,
P_F(R) ∝ (∏_{(x_1 ≻ x_2) ∈ F} g(x_R, x_1) / (g(x_R, x_1) + g(x_R, x_2)))^λ,   (3)
where x_R = E_R(s_v) is the agent's explanation, x_1 ≻ x_2 denotes the
human's preference, and λ ∈ [0, ∞).</p>
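        <p>As an illustration of Equations 2 and 3, the sketch below, which
uses cosine similarity as g(·) and names of our own choosing, computes
the (unnormalized) log-priors from collected oracle attributions and
pairwise preferences.</p>
        <preformat>
# Sketch of the soft explanation priors (Eqs. 2 and 3) with cosine similarity as g(.).
# Function and variable names are illustrative, not the paper's implementation.
import jax.numpy as jnp

def cosine_similarity(a, b):
    return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b) + 1e-8)

def log_prior_oracle(agent_attrs, oracle_attrs, lam=1.0):
    """log of Eq. 2: lam * sum_i log g(E_R(s_i), E_H(s_i)), with g clamped to stay positive."""
    sims = jnp.array([cosine_similarity(a, h) for a, h in zip(agent_attrs, oracle_attrs)])
    return lam * jnp.sum(jnp.log(jnp.maximum(sims, 1e-8)))

def log_prior_preference(agent_attrs, preferred, dispreferred, lam=1.0):
    """log of Eq. 3: lam * sum_i log[ g(x_R, x_1) / (g(x_R, x_1) + g(x_R, x_2)) ]."""
    total = 0.0
    for x_r, x1, x2 in zip(agent_attrs, preferred, dispreferred):
        g1 = jnp.maximum(cosine_similarity(x_r, x1), 1e-8)
        g2 = jnp.maximum(cosine_similarity(x_r, x2), 1e-8)
        total = total + jnp.log(g1) - jnp.log(g1 + g2)
    return lam * total
        </preformat>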
        <p>For the above priors, gradient-based MCMC optimization methods
[31] work well in high dimensions and can be used to optimize Bayesian
neural networks. In addition, P(R | D, F) can be approximated using a
gradient-based method with −log P(R | D, F) as the loss function. This
loss function decomposes into two parts, one for the likelihood function
and the other for the prior. The parameter λ can be adjusted to optimize
these two parts simultaneously. Also notice that when λ is set to zero,
this becomes standard B-REX [11] and T-REX [6]. Optimizing this loss
function requires calculating the gradient of the explanation function E
with respect to the state features. This can be computed automatically
using auto-diff libraries such as JAX [32].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Experimental Setup</title>
      <p>We evaluate the effectiveness of learning aligned linear and
non-linear rewards with REVEALE using three explanation generation
techniques: gradient as explanation (GaE), LIME, and saliency map (SM).
Reward alignment is measured by (1) the accuracy of predicting the user's
trajectory preferences in test environments, and (2) the average reward
collected by executing, in the test environments, the policy trained
using the learned reward. The results of our approach are compared with
those of the policy with the true reward function (Optimal) and two
recent IRL algorithms, B-REX [11] (for linear rewards) and T-REX [6] (for
non-linear rewards). Hence, we use REX to denote B-REX for domains with
linear reward and T-REX for non-linear reward.</p>
      <p>We report results on five proof-of-concept domains,
including three domains from the AI safety literature.</p>
      <p>Many of the domains we consider suffer from spurious feature
correlation, where two or more state features always co-occur in the
demonstrations and this correlation affects the learning process. Since
such correlations are spurious, the agent encounters novel states during
test time. We use g(·) as the test score for REVEALE. Verification states
S_v are selected from D. Explanations are randomly selected from
E_c^{s_v} for querying feedback. Notice that neither D nor S_v contains
the novel states that the agent encounters during evaluation. As a
result, evaluation performance is a good indicator of the
generalizability of the methods. All algorithms were implemented by us
and tested on a machine with 32 GB RAM and a 12 GB GPU. Values are
averaged over 60 different random seeds. Experiments with non-linear
rewards use a four-layer neural network with ReLU activation.</p>
      <p>Figure 2: Preference prediction accuracy.</p>
      <p>LavaLand. This domain, introduced by Hadfield-Menell et al. [15],
contains a 'lava' feature that never appears in the demonstrations. As a
result, the agent may not learn to avoid it when it navigates to a goal
location, potentially resulting in unsafe behavior when deployed.</p>
      <p>DogWalk. This is the AV domain illustrated in Figure 1, where the
AV must learn to stop for both pedestrians and dogs. Each state is
represented by ⟨location, human, dog, bag⟩. This environment is an
example of spurious feature correlation, as humans and dogs occur
together in the demonstrations.</p>
      <p>WaterWorld. This domain, based on [33], tests how the agent
responds to a distribution shift. There are two types of surfaces in the
problem: 'water' and 'ground'. We consider a linear reward, with a
negative reward for stepping into the water. The demonstrations and the
training environment have fixed water locations, but the test
environments have scattered water locations. Reward ambiguity arises as
the agent may not be able to distinguish whether the negative reward is
associated with the surface type or the grid location.</p>
      <p>Navigation (AVNav). This domain, designed by us, describes a safe
route planning problem, where the demonstrations are preferences over
different routes. Each state represents a road segment and is denoted by
the tuple ⟨road segment length, average speed, #potholes, mobile network
quality, and accident history in the segment⟩. The non-linear reward
incentivizes the AV to navigate on safe (low pothole and low accident
history) and comfortable (good mobile network) routes while reducing the
time to the destination. However, the demonstration data contains
spurious feature correlation, as most roads that have good mobile
networks also have bad accident history, leading to reward ambiguity.</p>
      <p>CoinRush. This domain is similar to the CoinRun environment
described in [31]. The cells in the grid have coins, gold, or an enemy.
The target is to gather as much gold and coins as possible while avoiding
the enemy. However, in the demonstration data, the enemy and the
coin/gold always have fixed colors (green and yellow, respectively),
resulting in spurious feature correlation. In the test environment, they
can have any color.</p>
      <p>In WaterWorld and CoinRush, the ambiguity is about which feature
should get attribution. In the other domains, the ambiguity is about
which feature to attribute and whether its attribution should be positive
or negative. AVNav and CoinRush have a non-linear reward structure, while
the other problems use linear rewards.</p>
      <p>Figure 3: Dataset size vs. accuracy for REX.</p>
      <p>Table 1: Average and worst-case reward in the test environments for
REX, GaE, SM, and LIME (each with E_H and E_P feedback), and Optimal.</p>
      <sec id="sec-5-1">
        <title>6.1. Results and Discussion</title>
        <p>Prediction accuracy. Figure 2 shows the average prediction
accuracy tested on 2000 pairs of trajectories. For training, we use 256
demonstrations and 64 preference feedbacks over pairs of explanations for
domains with linear reward, and 1024 demonstrations and 256 preference
feedbacks over pairs of explanations for non-linear reward. The red star
over each bar represents the accuracy of the corresponding explanation
method when exact Oracle explanations were given instead of preference
feedback.</p>
        <p>In every domain except CoinRush, REVEALE with GaE explanations
achieves the highest accuracy and matches the accuracy of prediction
based on human-generated explanations. In LavaLand and DogWalk, SM
identified that 'lava' and 'dogs' are important features, respectively,
but could not identify whether they should be positively attributed,
because it uses the absolute value of the gradient. B-REX also suffers
from this drawback. In domains where the ambiguity is about which feature
should be attributed, such as location or surface type in WaterWorld, SM
performs comparably to the other approaches. However, B-REX often
associated the reward with location instead of surface type. Overall, our
results indicate that REVEALE with any explanation method performs better
than REX.</p>
        <p>In all five environments, all the approaches, including REX,
achieve near-optimal prediction accuracy on the demonstration dataset
used to learn the reward. The performance degrades in the test scenarios
because the agents encounter novel states that did not occur in the
demonstration data. As evident from Figure 2, REVEALE improves the
prediction performance significantly in such cases. In the absence of
prior knowledge about novel situations, it might not be possible to
predict how the agent will perform just by assessing the agent's reward,
policy, or value function in states that appear in the training
environment. However, examining the consistency of the agent's
explanations in states that appear in the training data allows the
demonstrator to infer its behavior in many novel situations. This allows
the demonstrator to provide valuable feedback, which the agent uses to
reduce reward ambiguity in novel situations.</p>
        <p>Effect of number of demonstrations. We also test the effect of
the number of demonstrations on the prediction accuracy of REX
(Figure 3), with the size of D ranging between 2 and 2048. We observe no
improvement in the accuracy of B-REX beyond 128 demonstrations and of
T-REX beyond 1024, which indicates that the approach is unable to
eliminate reward ambiguity even if the number of demonstrations
increases. This is because the additional trajectories do not encode any
information about novel situations the agent may encounter when deployed.
Therefore, the performance does not improve in the test cases.</p>
        <p>Average and worst-case reward. Table 1 shows the average and
worst reward obtained with different approaches in the test environments.
We report the worst-case reward since it provides insights into the
degree of unsafe behavior that may arise when the reward is not
well-aligned. We evaluate the effectiveness of each explanation method
using both types of feedback: Oracle-generated E_H and pairwise
preferences E_P. We also report the average reward obtained with the true
reward function in each setting, denoted by Optimal. Our results show
that REVEALE with GaE using E_H feedback performs better on most domains.
SM outperforms the other approaches only when the ambiguity was about
whether the reward should be associated with a feature or a location.
That is, SM often identifies the magnitude of the correlation but
struggles to refine whether it is positive or negative, often associating
it incorrectly. LIME performs similarly to GaE when the feedback is E_H,
but performs relatively poorly when the feedback is E_P. This is because
LIME works with a large set of states in the neighborhood of the input
states, unlike GaE and SM, which only work with a single state.
Therefore, when exact inputs are given, LIME works very well. With E_P
feedback, the error can propagate to many states, causing worse
performance than GaE. Overall, our results show that REVEALE can learn
and generalize reward that is better aligned than the existing
approaches.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Summary and Future Work</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Ziebart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Maas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bagnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Maximum entropy inverse reinforcement learning, This paper presents a general interpretable reward learn</article-title>
          - in
          <source>: Proceedings of the 23rd AAAI Conference on ing and verification framework to ensure that the learned Artificial Intelligence</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1433</fpage>
          -
          <lpage>1438</lpage>
          .
          <article-title>reward is aligned with that of the demonstrator's intent</article-title>
          . [11]
          <string-name>
            <surname>D. S. Brown</surname>
          </string-name>
          , R. Coleman,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Niekum,</surname>
          </string-name>
          <article-title>The results demonstrate the benefits of our approach in Safe imitation learning via fast Bayesian reward learning the intended reward, thereby supporting the inference from preferences, in: Proceedings of the safe deployment of RL agents in the real world</article-title>
          .
          <source>In the 37th International Conference on Machine Learnfuture</source>
          , we aim to develop techniques to automatically ing,
          <year>2020</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1177</lpage>
          .
          <article-title>identify critical states for verification, and</article-title>
          integrate ac- [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Palan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Landolfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shevchuk</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Sadigh, tive learning methods [34] to optimize queries. Learning reward functions by integrating human</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>demonstrations and preferences</article-title>
          ,
          <source>in: Proceedings Acknowledgments of Robotics: Science and Systems XV</source>
          ,
          <year>2019</year>
          . [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          , E. Amir,
          <article-title>Bayesian inverse reThis work was supported in part by the National Science inforcement learning</article-title>
          ,
          <source>in: Proceedings of the 20th Foundation Grants IIS-1954782, IIS-2205153, NSF and International Joint Conference on Artificial intelliUSDA-NIFA award number 2021-67021-35344</source>
          . gence,
          <year>2007</year>
          , pp.
          <fpage>2586</fpage>
          -
          <lpage>2591</lpage>
          . [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guneshka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zollner</surname>
          </string-name>
          , One on-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>tology to rule them all: Corner case scenarios for References autonomous driving</article-title>
          ,
          <source>ArXiv abs/2209</source>
          .00342 (
          <year>2022</year>
          ). [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          , [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zilberstein</surname>
          </string-name>
          ,
          <article-title>Building strong semi-autonomous A. Dragan, Inverse reward design</article-title>
          , in: Advances in
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>systems, in: Proceedings of the 29th AAAI Confer- Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ence on Artificial Intelligence</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>4088</fpage>
          -
          <lpage>4092</lpage>
          . [16]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Linegang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Stoner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. D.</surname>
          </string-name>
          [2]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Argall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chernova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          , B. Brown- Seppelt,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. B.</given-names>
            <surname>Crittendon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>57</volume>
          (
          <year>2009</year>
          )
          <article-title>sion planning: A challenge requiring an ecologi-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          469-
          <fpage>483</fpage>
          . cal approach,
          <source>Proceedings of the Human Factors</source>
          [3]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: and Ergonomics Society Annual Meeting</source>
          <volume>50</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>An</surname>
            <given-names>Introduction</given-names>
          </string-name>
          , MIT press Cambridge,
          <year>1998</year>
          .
          <fpage>2482</fpage>
          -
          <lpage>2486</lpage>
          . [4]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          , Algorithms for inverse rein- [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Shah</surname>
          </string-name>
          , Improving robot controller
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>forcement learning</article-title>
          .,
          <source>in: Proceedings of the 17th In- transparency through autonomous policy expla-</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ternational Conference on Machine Learning</source>
          ,
          <year>2000</year>
          , nation,
          <source>in: Proceedings of the ACM/IEEE</source>
          Inter-
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          pp.
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          . national Conference on
          <string-name>
            <surname>Human-Robot</surname>
            <given-names>Interaction</given-names>
          </string-name>
          , [5]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Ziebart</surname>
          </string-name>
          ,
          <source>Modeling Purposeful Adaptive Behav- 2017</source>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>ior with the Principle of Maximum Causal Entropy</article-title>
          , [18]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , “Why should I
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          Carnegie Mellon University,
          <year>2010</year>
          .
          <article-title>trust you?”: Explaining the predictions of any clas</article-title>
          [6]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Brown</surname>
          </string-name>
          , W. Goo,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prabhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niekum</surname>
          </string-name>
          , Ex- sifier, in
          <source>: Proceedings of the 22nd ACM SIGKDD</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>inverse reinforcement learning from observations</article-title>
          ,
          <source>and Data Mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>in: Proceedings of the 36th International</source>
          Confer- [19]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Nashed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Goldman</surname>
          </string-name>
          , S. Zil-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>ence on Machine Learning</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>783</fpage>
          -
          <lpage>792</lpage>
          . berstein,
          <article-title>A unifying framework for causal ex</article-title>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Dragan</surname>
          </string-name>
          ,
          <article-title>Learning from planation of sequential decision making, ArXiv</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>richer human guidance: Augmenting comparison-based learning with feature queries</article-title>, in: Proceedings of the 13th International Conference on Human-Robot Interaction, 2018, pp. 132-140.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <label>8</label>
        <mixed-citation>D. S. Brown, J. J. Schneider, S. Niekum, Value alignment verification, in: Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 1105-1115.</mixed-citation>
      </ref>
      <ref id="ref20">
        <label>9</label>
        <mixed-citation>P. Abbeel, A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, in: Proceedings of the 21st International Conference on Machine Learning, 2004.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>CoRR abs/2205.15462 (2022).</mixed-citation>
      </ref>
      <ref id="ref22">
        <label>20</label>
        <mixed-citation>S. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, 2017. arXiv:1705.07874.</mixed-citation>
      </ref>
      <ref id="ref23">
        <label>21</label>
        <mixed-citation>J. Tayyub, M. Sarmad, N. Schönborn, Explaining deep neural networks for point clouds using gradient-based visualisations, arXiv preprint arXiv:2207.12984 (2022).</mixed-citation>
      </ref>
      <ref id="ref24">
        <label>22</label>
        <mixed-citation>K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR abs/1312.6034 (2014).</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>23</label>
        <mixed-citation>T. Chakraborti, S. Sreedharan, Y. Zhang, S. Kambhampati, Plan explanations as model reconciliation: Moving beyond explanation as soliloquy, arXiv preprint arXiv:1701.08317 (2017).</mixed-citation>
      </ref>
      <ref id="ref26">
        <label>24</label>
        <mixed-citation>O. Amir, F. Doshi-Velez, D. Sarne, Summarizing agent strategies, Autonomous Agents and Multi-Agent Systems 33 (2019) 628-644.</mixed-citation>
      </ref>
      <ref id="ref27">
        <label>25</label>
        <mixed-citation>S. H. Huang, K. Bhatia, P. Abbeel, A. D. Dragan, Establishing appropriate trust via critical states, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 3929-3936.</mixed-citation>
      </ref>
      <ref id="ref28">
        <label>26</label>
        <mixed-citation>A. Tabrez, S. Agrawal, B. Hayes, Explanation-based reward coaching to improve human performance via reinforcement learning, in: Proceedings of the 14th ACM/IEEE International Conference on Human-Robot Interaction, 2019, pp. 249-257.</mixed-citation>
      </ref>
      <ref id="ref29">
        <label>27</label>
        <mixed-citation>D. S. Brown, W. Goo, S. Niekum, Better-than-demonstrator imitation learning via automatically-ranked demonstrations, in: Proceedings of the 3rd Annual Conference on Robot Learning, 2019, pp. 330-359.</mixed-citation>
      </ref>
      <ref id="ref30">
        <label>28</label>
        <mixed-citation>R. Ramakrishnan, E. Kamar, D. Dey, J. Shah, E. Horvitz, Discovering blind spots in reinforcement learning, in: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 1017-1025.</mixed-citation>
      </ref>
      <ref id="ref31">
        <label>29</label>
        <mixed-citation>C. Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, Lulu.com, 2022.</mixed-citation>
      </ref>
      <ref id="ref32">
        <label>30</label>
        <mixed-citation>I. Lage, D. Lifschitz, F. Doshi-Velez, O. Amir, Exploring computational user models for agent policy summarization, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 1401-1407.</mixed-citation>
      </ref>
      <ref id="ref33">
        <label>31</label>
        <mixed-citation>L. Langosco, J. Koch, L. Sharkey, J. Pfau, D. Krueger, Goal misgeneralization in deep reinforcement learning, in: Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 12004-12019.</mixed-citation>
      </ref>
      <ref id="ref34">
        <label>32</label>
        <mixed-citation>J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, et al., JAX: composable transformations of Python+NumPy programs, http://github.com/google/jax, 2018.</mixed-citation>
      </ref>
      <ref id="ref35">
        <label>33</label>
        <mixed-citation>J. Leike, M. Martic, V. Krakovna, P. A. Ortega, et al., AI safety gridworlds, ArXiv abs/1711.09883 (2017).</mixed-citation>
      </ref>
      <ref id="ref36">
        <label>34</label>
        <mixed-citation>B. Settles, Active learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (2012) 1-114.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>