<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Verification and Learning Using Explanations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Saaduddin Mahmud</string-name>
          <email>smahmud@umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandhya Saisubramanian</string-name>
          <email>sandhya.sai@oregonstate.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shlomo Zilberstein</string-name>
          <email>shlomo@umass.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Reward Alignment, Explanation Generation, Inverse Reinforcement Learning</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Oregon State University</institution>
          ,
          <addr-line>Oregon</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Massachusetts Amherst</institution>
          ,
          <addr-line>Massachusetts</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Workshop Proce dings</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>When a human expert demonstrates the desired behavior, there often exist multiple reward functions consistent with the observed demonstrations. As a result, agents often learn a proxy reward function to encode their observations. Operating based on proxy rewards may be unsafe. Furthermore, black-box representations make it difficult for the demonstrator to verify the learned reward function and prevent harmful behavior. We investigate the efficiency of using explanations to update and verify a learned reward function, to ensure that it aligns with the demonstrator's intent. The problem is formulated as inverse reinforcement learning from ranked expert demonstrations, with verification tests to validate the alignment of the learned reward. The agent explains its reward function and the human signals whether the explanation passes the verification test. When the explanation is rejected, the agent presents additional alternative explanations to acquire feedback, such as a preference ordering over explanations, which helps it learn the intended reward. We analyze the efficiency of our approach in learning reward functions from different types of explanations and present empirical results on five domains. Our results demonstrate the effectiveness of our approach in learning and generalizing human-aligned rewards.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>With dramatic recent advances in artificial intelligence,
autonomous agents are being increasingly deployed in real-world settings.
A predominant way to train such agents in the absence of
a reward function is learning from demonstration (LfD) [2].</p>
      <sec id="sec-1-1">
        <title>Inverse reinforcement learning (IRL) is a form of LfD de</title>
        <p>signed to retrieve a reward function that captures the
demonstrator’s behavior [3], allowing agents to learn and
generalize the observed behavior to unseen situations.</p>
        <p>Despite the success of IRL in many research settings,
two key limitations may lead to unsafe behavior of the
deployed system: (1) the demonstrations may cover only
a subset of states, providing no direct information about
acceptable behavior in other states; and (2) a large space
of candidate reward functions may be consistent with the
demonstrations, each producing slightly diferent
behavConsequently, an agent may learn a proxy reward that
leads to unpredictable, unsafe behavior when
encountering novel situations.
CEUR
htp:/ceur-ws.org
ISN1613-073
© 2023 Copyright for this paper by its authors. Use permitted under Creative</p>
        <p>CEUR
a human is crossing the street with dogs. Since dogs
are often accompanied by humans, the rare case of
encountering dogs alone might be missing from the dataset.</p>
        <p>Consider four diferent reward functions consistent with
ative reward for not stopping in this case.  1 does not
account for dogs,  2 rewards stopping for pedestrians
with dogs,  3 rewards stopping for pedestrians or dogs,
and  4 rewards stopping for all objects, including leaves
or a plastic bag on the road. In the absence of additional
information, the AV may randomly learn one of these
reward functions (say  2), however,  3 represents the true
intent of the demonstrator. When operating based on  2,
the AV may not stop for dogs unaccompanied by humans.</p>
        <p>This example illustrates the inherent reward ambiguity
in IRL and the consequences of learning a proxy reward.</p>
        <p>Existing IRL methods aim to resolve reward ambiguity
by either introducing heuristics such as Max Margin [4]
mation such as a trajectory ranking [6] or a preference
diferentiator [ 7]. However, these approaches are not
guaranteed to avoid reward ambiguity and they do not
verify the learned reward. Recently, Brown et al. [8]
icy, but it does not amend the reward if it is misaligned.</p>
        <p>Further, these methods are not interpretable and may
require additional knowledge, such as the value function.</p>
      <p>To address this issue, we introduce a general framework that
utilizes explanations to learn a reward function that is aligned with the
demonstrator's intent. Our framework for reward verification and learning
using explanations (REVEALE) consists of a reward learning phase and a
verification phase. In the reward learning phase, the agent learns a
reward function based on the demonstrations. In the verification phase,
the demonstrator verifies the reward alignment through verification tests
in the form of queries to the agent. The agent responds by explaining its
reward, and the demonstrator signals whether the model passes the
verification test. When it fails, the agent queries the demonstrator by
presenting additional explanations from alternative candidate reward
models. The demonstrator provides feedback by selecting the explanation
that matches most closely their intended reward. This is followed by the
reward learning phase, in which the agent updates its prior over
candidate reward functions based on the feedback. Thus, REVEALE can
identify and fix inconsistencies in the learned reward by alternating
between learning and verification until the verification test is
passed.</p>
      <p>We use verification tests in the form of a query: "explain the
reward at state s", and explanations in the form of feature attribution.
We use feature attribution-based explanations due to their simplicity,
but the framework is general and can work with any form of explanation
that can help the demonstrator interpret a reward function. An example of
a verification test for the scenario described in Figure 1 is "explain
the reward when the AV encounters a dog accompanied by humans," to which
the agent would respond with its reward value and feature attributions
indicating a low weight for the 'dogs' feature. This reveals a potential
weakness of the model in the counterfactual scenario in which the dog is
not accompanied by a human (missing from the dataset). When this fails
the verification test, the agent explains another candidate reward
function (for example, R2 and R3). The demonstrator then selects an
explanation that is closer to their intended reward (R3 in this case),
indicating that R3 is preferred over R2 and the desired behavior is to
stop for pedestrians or dogs.</p>
      <p>Our key contributions are: (1) introducing a general framework for
verifying and learning human-aligned reward from demonstrations using
explanations; (2) presenting an algorithm to generate explanations as
feature attributions for reward functions and verify the learned reward
through human feedback; (3) analyzing the reduction in reward ambiguity
for linear rewards; and (4) demonstrating empirically the effectiveness
of our approach on five domains.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Related Work</title>
      <p>Reward learning. Most IRL algorithms learn a reward function using
expert trajectories [<xref ref-type="bibr" rid="ref1">4, 9, 10</xref>].
Recent algorithms utilize additional information to improve reward
learning, such as preferences over trajectories [11, 12], a prior over
reward functions [<xref ref-type="bibr" rid="ref20">13</xref>], or
feature queries [7]. A key obstacle to the safe deployment of an
autonomous agent is the long tail of novel situations
[<xref ref-type="bibr" rid="ref31">14</xref>] that cannot be predicted by
the demonstrations a priori. In fact, our experiments show that adding
additional demonstrations or preferences over trajectories does not
guarantee improvement in the learned reward. Further, unlike Basu et al.
[7], which uses human feedback to identify a feature that affects
trajectory preferences, we use feedback to identify which automatically
generated explanation aligns best with the intended reward, thereby
reducing reward ambiguity. While the former approach is limited to linear
rewards, our approach generalizes to nonlinear cases. Finally, none of
the existing IRL approaches perform reward verification. Our approach is
complementary to many of the existing reward learning methods, as the
reward verification and explanation phases can be used in tandem with any
type of reward learning.</p>
      <p>Value alignment. Value alignment focuses on ensuring that an
agent's behavior is aligned with its user's intentions. Unlike the
inverse reward design approach [15], which aims to retrieve the intended
reward by treating the specified reward function as a proxy, we learn the
true reward function using human feedback on automatically generated
explanations of potential reward models. While some recent work focused
on value alignment verification (VAV) with a minimum number of queries
[8], our work differs in that: (1) we use human feedback in the form of
preferences over reward explanations to verify the reward, while VAV uses
reward weights, value weights, or trajectory preferences for
verification; and (2) VAV can detect but cannot amend misaligned rewards,
while our approach verifies and rectifies incorrect rewards. Further, VAV
can check the consistency of the value function only in situations that
occur during training, and cannot verify the performance in novel
situations that the agent may encounter after deployment.</p>
      <p>Explainable AI. For autonomous systems to be widely adopted, user
trust in the systems' capability must be built [16], and it is widely
accepted that explanations can induce trust [17]. Much of the existing
work on explainable AI uses feature attributions as explanations to help
understand the relationship between input features and the output of a
learned model. Some of the widely used techniques are LIME [18],
meanRESP [19], SHAP [20], gradient as explanation (GaE)
[<xref ref-type="bibr" rid="ref25">21</xref>], and saliency maps [22].
Besides feature attribution, there are other broad classes of automated
explanation generation methods, such as model reconciliation [23] and
policy summarization [24]. While these existing approaches typically use
explanations to improve interpretability, we use them to verify and
improve the reward model. Relevant to our work is [25], which uses a
policy summarization technique to explain a reward function to humans in
order to induce trust. Another related line of work uses a model
reconciliation method to improve humans' understanding of a reward
function for better collaboration [26]. Unlike these approaches, where
the focus is to induce trust or improve collaboration, our framework uses
explanations to simultaneously learn and verify human-aligned reward.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Background</title>
      <sec id="sec-2-1">
        <title>Markov decision process</title>
        <p>A Markov decision process (MDP) M is represented by the tuple
M = (S, A, T, R, d_0, γ), where S is a finite set of states, A is a
finite set of actions, T: S × A × S → [0, 1] is the transition function,
R: S × A → ℝ is the reward function, d_0 is the initial state
distribution, and γ ∈ [0, 1) is the discount factor. A policy
π: S × A → [0, 1] is a mapping from states to a distribution over
actions. The state and state-action values of a policy π are
V^π(s) = E[∑_{t=0}^∞ γ^t R(s_t) | s_0 = s, π] and
Q^π(s, a) = E[∑_{t=0}^∞ γ^t R(s_t) | s_0 = s, a_0 = a, π], for all s ∈ S
and a ∈ A. The optimal values are denoted by V*(s) = max_π V^π(s) and
Q*(s, a) = max_π Q^π(s, a).</p>
      </sec>
      <sec id="sec-2-2">
        <title>Bayesian IRL</title>
        <p>Expert demonstrations are given in the form of preferences,
τ_i ≻ τ_j, indicating that τ_i is preferred over τ_j. The demonstration
data is denoted by D = {(τ_11 ≻ τ_12), (τ_21 ≻ τ_22), …}, where the
trajectory τ_k1 is preferred over trajectory τ_k2. Hence, such data are
called preferential datasets.</p>
        <p>A Bayesian framework for IRL defines a probability distribution
over reward functions given a demonstration dataset D using Bayes rule,
P(R | D) ∝ P(D | R) P(R). Various algorithms define P(D | R)
differently. We use the definition from B-REX [11], as it is scalable.
Let τ_i and τ_j denote two different trajectories. B-REX defines
P(D | R) as:
P(D | R) = ∏_{(τ_i ≻ τ_j) ∈ D} e^{R(τ_i)} / (e^{R(τ_i)} + e^{R(τ_j)}),
where R(τ) denotes the cumulative reward of trajectory τ under R.</p>
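        <p>For concreteness, the following is a minimal sketch, not taken
from the paper, of this preferential likelihood for a linear reward
R(s) = wᵀφ(s), where each trajectory is an array of per-state feature
vectors; the function names and the use of jax.numpy are illustrative
assumptions.</p>
        <preformat>
# Sketch of the B-REX preferential likelihood for a linear reward R(s) = w . phi(s).
# Each trajectory is an array of per-state feature vectors; names are illustrative.
import jax.numpy as jnp

def trajectory_return(w, traj_features):
    """Cumulative reward R(tau) = sum over states of w . phi(s)."""
    return jnp.sum(traj_features @ w)

def log_likelihood(w, preferences):
    """log P(D | R): sum over (tau_i preferred to tau_j) of
    log( e^{R(tau_i)} / (e^{R(tau_i)} + e^{R(tau_j)}) )."""
    total = 0.0
    for preferred, dispreferred in preferences:
        r_i = trajectory_return(w, preferred)
        r_j = trajectory_return(w, dispreferred)
        # log-sum-exp form of the Bradley-Terry preference probability
        total += r_i - jnp.logaddexp(r_i, r_j)
    return total
        </preformat>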
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. The REVEALE Framework</title>
      <p>Consider an agent operating in an environment modeled as a Markov
decision process (MDP), M = (S, A, T, R, d_0, γ), where the reward
function R is initially unknown to the agent. The agent aims to learn R
using expert demonstration data D. We consider a factored state
representation, and the reward depends only on the state; hence,
R(s_i, a_j) = R(s_i), ∀ s_i, a_j.</p>
      <p>We assume that the agent has access to a limited number of expert
demonstrations to learn R. While the general framework can leverage
different types of demonstrations, explanation generation, and
verification techniques, we target settings where the demonstrations are
in the form of trajectory preferences, explanations are generated as
reward value and feature attribution, and verification tests describe the
reward function at certain critical states identified by the human. In
Section 5, we briefly discuss how to handle other forms of explanations,
demonstrations, and IRL algorithms.</p>
      <sec id="sec-3-1">
        <title>Demonstration data</title>
        <p>Similar to [11], we use expert
demonstrations in the form of pairwise preference over
trajectories,  = {(</p>
        <p>11 ≻  12), … , (  1 ≻   2)}, called a
preferential dataset, due to its computational eficiency.</p>
      <p>Based on [9, 27], the space of reward functions consistent with a
policy and with preferential demonstrations is defined as follows.</p>
      <p>Definition 2. Given an MDP M, a policy-consistent reward set,
denoted Δ(π), is the set of reward functions under which π is optimal,
Δ(π) = {R ∈ ℛ | V^π_R(s_0) = V^*_R(s_0)}, where s_0 is the initial
state.</p>
      <p>Definition 3. Given an MDP M and a preferential dataset D, a
demonstration-consistent reward set, denoted Δ(D), is the set of reward
functions under which the preferred trajectories have a higher reward
than the less preferred trajectories,
Δ(D) = {R ∈ ℛ | R(τ_1) &gt; R(τ_2), ∀(τ_1 ≻ τ_2) ∈ D}.</p>
      <p>In practice, D may not cover all states, and there may be
mismatches in the training and testing environments [15, 28]. Such
situations lead to reward ambiguity, and the agent may end up learning a
proxy reward. REVEALE overcomes these drawbacks by generating
explanations of the reward functions that are consistent with the
expert's policy and demonstrations, which are then verified by the
demonstrator.</p>
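      <p>As an illustration of Definition 3, the following sketch, with
names of our own choosing, checks whether a candidate linear reward is
demonstration-consistent by comparing trajectory returns for every
preference pair.</p>
      <preformat>
# Sketch of a demonstration-consistency check (Definition 3): a reward is consistent
# with the preferential dataset D if every preferred trajectory scores higher.
import jax.numpy as jnp

def trajectory_return(w, traj_features):
    return jnp.sum(traj_features @ w)

def is_demonstration_consistent(w, preferences):
    """preferences: list of (preferred_features, dispreferred_features) array pairs."""
    return all(trajectory_return(w, t1) > trajectory_return(w, t2)
               for t1, t2 in preferences)
      </preformat>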
      <p>Explanation generation function (E). The agent's explanations
involve two components: (1) the reward at the verification test state
s_v, denoted by R(s_v), following the most likely reward function; and
(2) a feature attribution indicating the influence of state features on
the reward function. Attribution-based explanations are commonly used in
interpretable machine learning [29].</p>
      <p>A feature attribution is a scoring function denoting the
contribution of each state feature to the output, E: S × ℛ → ℝ^{|φ|}.
This is a form of local explanation, as it explains the reward function
at each state in isolation. We use established local explanation
techniques mentioned earlier: gradient as explanation (GaE), LIME, and
saliency maps.</p>
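      <p>As a sketch of how such attributions can be computed for a
differentiable reward model, the snippet below, which is our own
illustration rather than the paper's implementation, derives GaE and SM
attributions from the gradient of the reward with respect to the state
features; for a linear reward it recovers the weight vector exactly.</p>
      <preformat>
# Illustrative feature attributions for a differentiable reward model R(phi(s)).
# GaE is the gradient of the reward w.r.t. the state features; SM is its absolute value.
import jax
import jax.numpy as jnp

def gae_attribution(reward_fn, state_features):
    """Gradient-as-explanation: dR / d phi(s), one score per feature."""
    return jax.grad(reward_fn)(state_features)

def saliency_map(reward_fn, state_features):
    """Saliency map: absolute value of the gradient."""
    return jnp.abs(jax.grad(reward_fn)(state_features))

# Example with a linear reward R(s) = w . phi(s): GaE recovers w exactly.
w = jnp.array([1.0, -2.0, 0.5])
linear_reward = lambda phi: jnp.dot(w, phi)
print(gae_attribution(linear_reward, jnp.array([0.3, 0.7, 1.0])))  # [ 1.  -2.   0.5]
      </preformat>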
      <p>Verification and feedback. We use verification tests of the form
"Explain the reward at state s_v," with s_v ∈ S_v, where the
verification states S_v are selected by the demonstrator. However, S_v
can also be generated automatically using techniques such as policy
summarization [30]. The agent responds by automatically generating
explanations, consisting of the reward value at s_v and the feature
attribution describing the reward function.</p>
      <p>Approval: A binary signal indicating whether the demonstrator
approves (ℱ(E(s_v)) = 1) or disapproves (ℱ(E(s_v)) = 0) the explanation,
denoting the outcome of the verification test.</p>
      <p>Explanation feedback: When the verification test fails, the
demonstrator provides feedback on explanations generated by the agent in
one of the following two forms, used by the agent to update R:
(1) Oracle explanations, typically provided by the human, in the form of
exact feature attributions corresponding to their intended reward,
ℱ(E(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, where E_H(s_v) denotes the exact
feature attribution generated by the Oracle. Though this is an ideal
setting, as it provides features that are critical to learning the
intended reward, this type of feedback can be harder to collect in
practice, except for simpler domains.
(2) Pairwise preferences over feature attributions generated by the
agent, ℱ(E_1(s_v), E_2(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, where
E_H(s_v) ∈ {E_1(s_v), E_2(s_v)}, and E_1(s_v) and E_2(s_v) denote
explanations of two different reward models that are consistent with D.
This is a more realistic form of feedback that identifies the explanation
that better captures the intended reward.</p>
      <p>Definition 4. Given verification states S_v, an
explanation-consistent reward set, denoted Δ(E(S_v)), is the set of
reward functions whose corresponding explanations are approved by the
demonstrator, Δ(E(S_v)) = {R ∈ ℛ | ℱ(E_R(s_v)) = 1, ∀ s_v ∈ S_v}.</p>
      <p>Definition 5. Reward ambiguity is a measure proportional to the
size of the consistent reward set Δ, such as |Δ| when Δ is finite and
discrete, and the volume of Δ, vol(Δ), when Δ is continuous, as in a
simplex in ℝ^d.</p>
      <p>REVEALE aims to eliminate reward ambiguity by reducing the size of
the consistent reward set Δ(D) ∩ Δ(E(S_v)), using verification tests and
feedback on explanations.</p>
      <p>Definition 6. A solution of a REVEALE instance is a reward
function, R ∈ Δ(D) ∩ Δ(E(S_v)), that is better aligned with the
demonstrator's intent.</p>
      <p>An optimal solution eliminates reward ambiguity and identifies a
reward function that is aligned with the demonstrator's intended reward,
when feasible. The following section presents an algorithm that produces
a solution.</p>
      <p>Algorithm 1: ILV. Input: demonstration data D and verification
states S_v. Output: final reward R_b and test score u_b. The algorithm
initializes F = ∅, E_c^{s_v} = ∅ for all s_v ∈ S_v, and R_b drawn
randomly from ℛ.</p>
    </sec>
    <sec id="sec-4">
      <title>5. Solution Approach</title>
      <sec id="sec-4-1">
        <title>Our algorithm, iterative learning and verification of re</title>
        <p>ward (ILV), is outlined in Algorithm 1. The input is a set
of demonstrations,  , and verification test states,  . We

assume that the final reward is the MAP of the
distribuThe algorithm first initializes an empty set of feedback


 , an empty list of candidate explanations   for each

∈   , best reward   randomly drawn from ℛ, and
best test score   . The test scores consist of two numbers:
the first number indicates</p>
        <p>how likely the demonstrations
are under the reward model, and the second number
indicates how good the model is in explaining the reward.</p>
        <p>In each iteration, the current best reward model,
denoted by   , is calculated as the MAP of  (| ,  )
7). The corresponding test score of that model, denoted
by   is initialized to a 2-D array that consists of the
(Line
likelihood of data</p>
        <p>under   , and zero to indicate that
the correctness of the model has not been evaluated (Line
8). For each verification test state   , the reward value
  (  ) and the corresponding explanation   
shown to the demonstrator for approval (Line 10).
(  ) are
If approved, then</p>
        <p>(  ) is added to  (Lines 11-12).</p>
        <p>If disapproved, then the explanation is added to the
candidate explanation set   and additional feedback from

tions from    that have not been queried so far (Lines
14-15). The agent can also add additional explanations
1–10
to</p>
        <p>ditional models from  (| ,  )
either by modifying existing ones or sampling
adand generating their
corresponding explanations for query selection. The
demonstrator can provide no feedback (
unchanged),
generate exact feature attribution, or provide pairwise
preferences over explanations from    .</p>
        <p>A test score for</p>
        <p>(  ), ∀  ∈   is added to   using
the score function (Line 16). The score reflects the
similarity of</p>
        <p>(  ) to human-generated explanations or their
preferred explanations. Finally,   and   are updated
based on the score and the algorithm ends by returning
the best reward and best score, when all verification tests
have been approved,  = 0 (Lines 17-20).</p>
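      <p>The following is a compact sketch of this control flow; the helper
routines (map_estimate, explain, query_demonstrator, score) are
assumptions standing in for components the paper does not specify at the
code level, so the sketch only illustrates how Algorithm 1 alternates
between learning and verification.</p>
      <preformat>
# Sketch of the ILV control flow (Algorithm 1); helper functions are assumed, not the paper's API.
def ilv(demos, verification_states, map_estimate, explain, query_demonstrator, score):
    feedback = []                                       # F: collected explanation feedback
    candidates = {s: [] for s in verification_states}   # E_c^{s_v}: candidate explanations per state
    best_reward, best_score = None, None
    while True:
        reward = map_estimate(demos, feedback)           # MAP of P(R | D, F)
        failed = 0
        for s in verification_states:
            explanation = explain(reward, s)             # reward value at s and feature attribution
            approved, new_feedback = query_demonstrator(s, explanation, candidates[s])
            if approved:
                feedback.append(new_feedback)            # approved explanation constrains the prior
            else:
                failed += 1
                candidates[s].append(explanation)        # kept for later pairwise preference queries
                if new_feedback is not None:             # oracle attribution or a pairwise preference
                    feedback.append(new_feedback)
        current_score = score(reward, demos, feedback)
        if best_score is None or current_score > best_score:
            best_reward, best_score = reward, current_score
        if failed == 0:                                  # all verification tests approved
            return best_reward, best_score
      </preformat>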
      <p>MAP estimation. The maximum a posteriori probability (MAP) is
estimated as P(R | D, F) = P(D | R) P_F(R), where P_F(R) is a prior over
R defined by the feedback on explanations. The feedback on explanations
is represented as a distribution because, semantically, it is a set of
constraints over the model parameters. This also allows generalization to
other IRL algorithms that use the Bayesian framework, as most algorithms
differ only in their likelihood function.</p>
      <sec id="sec-4-1">
        <title>5.1. Learning Linear Reward Using Explanations</title>
        <p>This section analyzes our proposed method using linear reward
models. A linear reward is described by a linear weighted combination
over the d-size vector of features describing the state,
R(s) = wᵀφ(s), w ∈ ℝ^d. The corresponding explanations generated using
GaE and LIME (LM) will produce the same output; therefore, we only
present results using GaE and saliency maps (SM). For a linear reward,
GaE(R(s)) = ∇_{φ(s)} R(s) = ∇_{φ(s)} wᵀφ(s) = w, and SM is |w|. Using
Definition 5, we now show that feedback on GaE-based and SM-based
explanations reduces reward ambiguity. We assume that the agent and the
demonstrator share the same similarity measure, and discuss results with
cosine similarity.</p>
        <p>Proposition 1. The complexity of removing reward ambiguity with
Oracle-generated GaE explanations as feedback,
ℱ(E(s_v)) = E_H(s_v), ∀ s_v ∈ S_v, is O(1).</p>
        <p>Proof sketch. ∀ s ∈ S, the explanation given by the oracle is
GaE(w*ᵀφ(s)) = w*. Thus, one oracle-generated GaE explanation is
sufficient to reduce |Δ(E_H(S_v))| to one.</p>
        <p>Though E_H is often difficult to obtain, it shows the best-case
scenario for REVEALE to eliminate reward ambiguity. The significance of
Proposition 1 is that it establishes a direct bridge between feature
attribution methods and reward learning.</p>
      <sec id="sec-4-4">
        <title>Proposition 2. To reduce reward function ambiguity by</title>
        <p>% in expectation, it sufices to have preference feedback
over a set of  = log2(1/(1 − /100)) randomly generated
Proof Sketch. Consider a set of pairwise preference GaE
feedback denoted by  

2 )} where, ( 1
By Definition
 ,  2) are candidate explanations for  .

4, the explanation-consistent reward set,
Δ( 
  (  )), is described by the half-space constraints:
  (  ) = {( 11 ≻  12), … , ( 
  (  ) enforces that the cosine similarity of w
1 should be larger than the cosine similarity of
w with  2. We want to find the bound  on the size

of  
  (  ), |</p>
        <p>(  )|, that is suficient for reducing
cording to [27] |Δ( 
| 
  (  )| =  . Therefore,  = (1 −
2
explanation pairs.
of ℛ is removed in expectation using feedback over a
set of  = log2(1/(1 − /100)) randomly generated GaE
1
2
 ) ∗ 100% volume
  (  ))| =   in expectation where</p>
      </sec>
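        <p>For example, by this bound, two randomly generated explanation
pairs remove 75% of the reward hypothesis volume in expectation, and four
pairs remove 93.75%.</p>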
      <sec id="sec-4-5">
        <title>Proposition 3. It is suficient to have O(1) Oracle</title>
        <p>= dim(w).
generated SM Feedback to reduce |Δ( 
Proof Sketch. ∀ ∈  , the explanation given by the oracle
w′ such that |w| = |w∗| by taking +, − sign combination
w∗()) = |w∗|. Then we can construct at most 2 ,
of each element of |w∗|. Therefore, |Δ(</p>
        <p>Propositions 1 and 3 show that GaE can be more
efective than SM in reducing reward ambiguity. Our empiri- tion,  (| ,  )
cal results show a similar trend for non-linear rewards.
tion  and feedback ℱ is computed as:
Prior Definition</p>
        <p>The prior   () for explanation
func  () ∝ ℐ (  ℱ(  )),
(1)
by  
 ( |)
where ℐ (.)is 1 if  satisfies all the constraints imposed
ℱ(  ) and 0 otherwise. Now, MAP of  (| ,  ) =</p>
        <p>() can be estimated using an of-the-shelf
Markov Chain Monte Carlo (MCMC) solver.</p>
      <sec id="sec-4-2">
        <title>5.2. Deep REVEALE</title>
        <p>Using MCMC to estimate the MAP of
P(R | D, F) = P(D | R) P_F(R) as in Equation 1 is inefficient in problems
with large dimensions and feedback constraints, because it can take a
large number of steps to get a good estimate of the MAP. In addition,
explanation feedback cannot be represented as linear constraints when
reward functions are represented using neural networks. Hence, we present
a method for calculating the priors as soft versions of the constraints
discussed earlier. Note that when calculating the priors, the agent uses
the previously collected feedback (F) and the explanations generated. For
oracle-generated explanations, E_H(s_v),
P_F(R) ∝ (∏_{E_H(s_v) ∈ F} g(E_R(s_v), E_H(s_v)))^λ,   (2)
where E_R(s_v) is the agent's explanation, E_H(s_v) is the
oracle-generated explanation given as feedback, g(·) is a measurement of
similarity, and λ ∈ [0, ∞). Similarly, for pairwise preferences,
P_F(R) ∝ (∏_{(x_1 ≻ x_2) ∈ F} g(x_R, x_1) / (g(x_R, x_1) + g(x_R, x_2)))^λ,   (3)
where x_R = E_R(s_v) is the agent's explanation, x_1 ≻ x_2 denotes the
human's preference, and λ ∈ [0, ∞).</p>
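        <p>As an illustration of Equations 2 and 3, the sketch below, which
uses cosine similarity as g(·) and names of our own choosing, computes
the (unnormalized) log-priors from collected oracle attributions and
pairwise preferences.</p>
        <preformat>
# Sketch of the soft explanation priors (Eqs. 2 and 3) with cosine similarity as g(.).
# Function and variable names are illustrative, not the paper's implementation.
import jax.numpy as jnp

def cosine_similarity(a, b):
    return jnp.dot(a, b) / (jnp.linalg.norm(a) * jnp.linalg.norm(b) + 1e-8)

def log_prior_oracle(agent_attrs, oracle_attrs, lam=1.0):
    """log of Eq. 2: lam * sum_i log g(E_R(s_i), E_H(s_i)), with g clamped to stay positive."""
    sims = jnp.array([cosine_similarity(a, h) for a, h in zip(agent_attrs, oracle_attrs)])
    return lam * jnp.sum(jnp.log(jnp.maximum(sims, 1e-8)))

def log_prior_preference(agent_attrs, preferred, dispreferred, lam=1.0):
    """log of Eq. 3: lam * sum_i log[ g(x_R, x_1) / (g(x_R, x_1) + g(x_R, x_2)) ]."""
    total = 0.0
    for x_r, x1, x2 in zip(agent_attrs, preferred, dispreferred):
        g1 = jnp.maximum(cosine_similarity(x_r, x1), 1e-8)
        g2 = jnp.maximum(cosine_similarity(x_r, x2), 1e-8)
        total = total + jnp.log(g1) - jnp.log(g1 + g2)
    return lam * total
        </preformat>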
        <p>For the above priors, gradient-based MCMC optimization methods
[31] work well in high dimensions and can be used to optimize Bayesian
neural networks. In addition, P(R | D, F) can be approximated using a
gradient-based method with −log P(R | D, F) as the loss function. This
loss function decomposes into two parts, one for the likelihood function
and the other for the prior. The parameter λ can be adjusted to optimize
these two parts simultaneously. Also notice that when λ is set to zero,
this becomes standard B-REX [11] and T-REX [6]. Optimizing this loss
function requires calculating the gradient of the explanation function E
with respect to the state features. This can be computed automatically
using auto-diff libraries such as JAX [32].</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Experimental Setup</title>
      <p>We evaluate the effectiveness of learning aligned linear and
non-linear rewards with REVEALE using three explanation generation
techniques: gradient as explanation (GaE), LIME, and saliency map (SM).
Reward alignment is measured by (1) the accuracy of predicting the user's
trajectory preferences in test environments, and (2) the average reward
collected by executing, in the test environments, the policy trained
using the learned reward. The results of our approach are compared with
those of the policy with the true reward function (Optimal) and two
recent IRL algorithms, B-REX [11] (for linear rewards) and T-REX [6] (for
non-linear rewards). Hence, we use REX to denote B-REX for domains with
linear reward and T-REX for non-linear reward.</p>
      <p>We report results on five proof-of-concept domains,
including three domains from the AI safety literature.</p>
      <p>Many of the domains we consider suffer from spurious feature
correlation, where two or more state features always co-occur in the
demonstrations and this correlation affects the learning process. Since
such correlations are spurious, the agent encounters novel states during
test time. We use g(·) as the test score for REVEALE. Verification states
S_v are selected from D. Explanations are randomly selected from
E_c^{s_v} for querying feedback. Notice that neither D nor S_v contains
the novel states that the agent encounters during evaluation. As a
result, evaluation performance is a good indicator of the
generalizability of the methods. All algorithms were implemented by us
and tested on a machine with 32 GB RAM and a 12 GB GPU. Values are
averaged over 60 different random seeds. Experiments with non-linear
rewards use a four-layer neural network with ReLU activation.</p>
      <p>Figure 2: Preference prediction accuracy.</p>
      <p>LavaLand. This domain, introduced by Hadfield-Menell et al. [15],
contains a 'lava' feature that never appears in the demonstrations. As a
result, the agent may not learn to avoid it when it navigates to a goal
location, potentially resulting in unsafe behavior when deployed.</p>
      <p>DogWalk. This is the AV domain illustrated in Figure 1, where the
AV must learn to stop for both pedestrians and dogs. Each state is
represented by ⟨location, human, dog, bag⟩. This environment is an
example of spurious feature correlation, as humans and dogs occur
together in the demonstrations.</p>
      <p>WaterWorld. This domain, based on [33], tests how the agent
responds to a distribution shift. There are two types of surfaces in the
problem: 'water' and 'ground'. We consider a linear reward, with a
negative reward for stepping into the water. The demonstrations and the
training environment have fixed water locations, but the test
environments have scattered water locations. Reward ambiguity arises as
the agent may not be able to distinguish whether the negative reward is
associated with the surface type or the grid location.</p>
      <p>Navigation (AVNav). This domain, designed by us, describes a safe
route planning problem, where the demonstrations are preferences over
different routes. Each state represents a road segment and is denoted by
the tuple ⟨road segment length, average speed, #potholes, mobile network
quality, and accident history in the segment⟩. The non-linear reward
incentivizes the AV to navigate on safe (low pothole and low accident
history) and comfortable (good mobile network) routes while reducing the
time to the destination. However, the demonstration data contains
spurious feature correlation, as most roads that have good mobile
networks also have bad accident history, leading to reward ambiguity.</p>
      <p>CoinRush. This domain is similar to the CoinRun environment
described in [31]. The cells in the grid have coins, gold, or an enemy.
The target is to gather as much gold and coins as possible while avoiding
the enemy. However, in the demonstration data, the enemy and the
coin/gold always have fixed colors (green and yellow, respectively),
resulting in spurious feature correlation. In the test environment, they
can have any color.</p>
      <p>In WaterWorld and CoinRush, the ambiguity is about which feature
should get attribution. In the other domains, the ambiguity is about
which feature to attribute and whether its attribution should be positive
or negative. AVNav and CoinRush have a non-linear reward structure, while
the other problems use linear rewards.</p>
      <p>Figure 3: Dataset size vs. accuracy for REX.</p>
      <p>Table 1: Average and worst-case reward in the test environments for
REX, GaE, SM, and LIME (each with E_H and E_P feedback), and Optimal.</p>
      <sec id="sec-5-1">
        <title>6.1. Results and Discussion</title>
        <p>Prediction accuracy. Figure 2 shows the average prediction
accuracy tested on 2000 pairs of trajectories. For training, we use 256
demonstrations and 64 preference feedbacks over pairs of explanations for
domains with linear reward, and 1024 demonstrations and 256 preference
feedbacks over pairs of explanations for non-linear reward. The red star
over each bar represents the accuracy of the corresponding explanation
method when exact Oracle explanations were given instead of preference
feedback.</p>
        <p>In every domain except CoinRush, REVEALE with GaE explanations
achieves the highest accuracy and matches the accuracy of prediction
based on human-generated explanations. In LavaLand and DogWalk, SM
identified that 'lava' and 'dogs' are important features, respectively,
but could not identify whether they should be positively attributed,
because it uses the absolute value of the gradient. B-REX also suffers
from this drawback. In domains where the ambiguity is about which feature
should be attributed, such as location or surface type in WaterWorld, SM
performs comparably to the other approaches. However, B-REX often
associated the reward with location instead of surface type. Overall, our
results indicate that REVEALE with any explanation method performs better
than REX.</p>
        <p>In all five environments, all the approaches, including REX,
achieve near-optimal prediction accuracy on the demonstration dataset
used to learn the reward. The performance degrades in the test scenarios
because the agents encounter novel states that did not occur in the
demonstration data. As evident from Figure 2, REVEALE improves the
prediction performance significantly in such cases. In the absence of
prior knowledge about novel situations, it might not be possible to
predict how the agent will perform just by assessing the agent's reward,
policy, or value function in states that appear in the training
environment. However, examining the consistency of the agent's
explanations in states that appear in the training data allows the
demonstrator to infer its behavior in many novel situations. This allows
the demonstrator to provide valuable feedback, which the agent uses to
reduce reward ambiguity in novel situations.</p>
        <p>Effect of number of demonstrations. We also test the effect of
the number of demonstrations on the prediction accuracy of REX
(Figure 3), with the size of D ranging between 2 and 2048. We observe no
improvement in the accuracy of B-REX beyond 128 demonstrations and of
T-REX beyond 1024, which indicates that the approach is unable to
eliminate reward ambiguity even if the number of demonstrations
increases. This is because the additional trajectories do not encode any
information about novel situations the agent may encounter when deployed.
Therefore, the performance does not improve in the test cases.</p>
        <p>Average and worst-case reward. Table 1 shows the average and
worst reward obtained with different approaches in the test environments.
We report the worst-case reward since it provides insights into the
degree of unsafe behavior that may arise when the reward is not
well-aligned. We evaluate the effectiveness of each explanation method
using both types of feedback: Oracle-generated E_H and pairwise
preferences E_P. We also report the average reward obtained with the true
reward function in each setting, denoted by Optimal. Our results show
that REVEALE with GaE using E_H feedback performs better on most domains.
SM outperforms the other approaches only when the ambiguity was about
whether the reward should be associated with a feature or a location.
That is, SM often identifies the magnitude of the correlation but
struggles to refine whether it is positive or negative, often associating
it incorrectly. LIME performs similarly to GaE when the feedback is E_H,
but performs relatively poorly when the feedback is E_P. This is because
LIME works with a large set of states in the neighborhood of the input
states, unlike GaE and SM, which only work with a single state.
Therefore, when exact inputs are given, LIME works very well. With E_P
feedback, the error can propagate to many states, causing worse
performance than GaE. Overall, our results show that REVEALE can learn
and generalize reward that is better aligned than the existing
approaches.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>7. Summary and Future Work</title>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Ziebart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Maas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Bagnell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <article-title>Maximum entropy inverse reinforcement learning, This paper presents a general interpretable reward learn</article-title>
          - in
          <source>: Proceedings of the 23rd AAAI Conference on ing and verification framework to ensure that the learned Artificial Intelligence</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>1433</fpage>
          -
          <lpage>1438</lpage>
          .
          <article-title>reward is aligned with that of the demonstrator's intent</article-title>
          . [11]
          <string-name>
            <surname>D. S. Brown</surname>
          </string-name>
          , R. Coleman,
          <string-name>
            <given-names>R.</given-names>
            <surname>Srinivasan</surname>
          </string-name>
          ,
          <string-name>
            <surname>S. Niekum,</surname>
          </string-name>
          <article-title>The results demonstrate the benefits of our approach in Safe imitation learning via fast Bayesian reward learning the intended reward, thereby supporting the inference from preferences, in: Proceedings of the safe deployment of RL agents in the real world</article-title>
          .
          <source>In the 37th International Conference on Machine Learnfuture</source>
          , we aim to develop techniques to automatically ing,
          <year>2020</year>
          , pp.
          <fpage>1165</fpage>
          -
          <lpage>1177</lpage>
          .
          <article-title>identify critical states for verification, and</article-title>
          integrate ac- [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Palan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. C.</given-names>
            <surname>Landolfi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Shevchuk</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <article-title>Sadigh, tive learning methods [34] to optimize queries. Learning reward functions by integrating human</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <article-title>demonstrations and preferences</article-title>
          ,
          <source>in: Proceedings Acknowledgments of Robotics: Science and Systems XV</source>
          ,
          <year>2019</year>
          . [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ramachandran</surname>
          </string-name>
          , E. Amir,
          <article-title>Bayesian inverse reThis work was supported in part by the National Science inforcement learning</article-title>
          ,
          <source>in: Proceedings of the 20th Foundation Grants IIS-1954782, IIS-2205153, NSF and International Joint Conference on Artificial intelliUSDA-NIFA award number 2021-67021-35344</source>
          . gence,
          <year>2007</year>
          , pp.
          <fpage>2586</fpage>
          -
          <lpage>2591</lpage>
          . [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bogdoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Guneshka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Zollner</surname>
          </string-name>
          , One on-
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>tology to rule them all: Corner case scenarios for References autonomous driving</article-title>
          ,
          <source>ArXiv abs/2209</source>
          .00342 (
          <year>2022</year>
          ). [15]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Milli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          , [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zilberstein</surname>
          </string-name>
          ,
          <article-title>Building strong semi-autonomous A. Dragan, Inverse reward design</article-title>
          , in: Advances in
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>systems, in: Proceedings of the 29th AAAI Confer- Neural Information Processing Systems</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>ence on Artificial Intelligence</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>4088</fpage>
          -
          <lpage>4092</lpage>
          . [16]
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Linegang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Stoner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Patterson</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. D.</surname>
          </string-name>
          [2]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Argall</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chernova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Veloso</surname>
          </string-name>
          , B. Brown- Seppelt,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Hofman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. B.</given-names>
            <surname>Crittendon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>tion</surname>
          </string-name>
          ,
          <source>Robotics and Autonomous Systems</source>
          <volume>57</volume>
          (
          <year>2009</year>
          )
          <article-title>sion planning: A challenge requiring an ecologi-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          469-
          <fpage>483</fpage>
          . cal approach,
          <source>Proceedings of the Human Factors</source>
          [3]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: and Ergonomics Society Annual Meeting</source>
          <volume>50</volume>
          (
          <year>2006</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>An</surname>
            <given-names>Introduction</given-names>
          </string-name>
          , MIT press Cambridge,
          <year>1998</year>
          .
          <fpage>2482</fpage>
          -
          <lpage>2486</lpage>
          . [4]
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Russell</surname>
          </string-name>
          , Algorithms for inverse rein- [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Shah</surname>
          </string-name>
          , Improving robot controller
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <article-title>forcement learning</article-title>
          .,
          <source>in: Proceedings of the 17th In- transparency through autonomous policy expla-</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <source>ternational Conference on Machine Learning</source>
          ,
          <year>2000</year>
          , nation,
          <source>in: Proceedings of the ACM/IEEE</source>
          Inter-
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          pp.
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          . national Conference on
          <string-name>
            <surname>Human-Robot</surname>
            <given-names>Interaction</given-names>
          </string-name>
          , [5]
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Ziebart</surname>
          </string-name>
          ,
          <source>Modeling Purposeful Adaptive Behav- 2017</source>
          , pp.
          <fpage>303</fpage>
          -
          <lpage>312</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <article-title>ior with the Principle of Maximum Causal Entropy</article-title>
          , [18]
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Guestrin</surname>
          </string-name>
          , “Why should I
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          Carnegie Mellon University,
          <year>2010</year>
          .
          <article-title>trust you?”: Explaining the predictions of any clas</article-title>
          [6]
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Brown</surname>
          </string-name>
          , W. Goo,
          <string-name>
            <given-names>N.</given-names>
            <surname>Prabhat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Niekum</surname>
          </string-name>
          , Ex- sifier, in
          <source>: Proceedings of the 22nd ACM SIGKDD</source>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <article-title>inverse reinforcement learning from observations</article-title>
          ,
          <source>and Data Mining</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>1135</fpage>
          -
          <lpage>1144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <source>in: Proceedings of the 36th International</source>
          Confer- [19]
          <string-name>
            <given-names>S. B.</given-names>
            <surname>Nashed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mahmud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Goldman</surname>
          </string-name>
          , S. Zil-
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>ence on Machine Learning</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>783</fpage>
          -
          <lpage>792</lpage>
          . berstein,
          <article-title>A unifying framework for causal ex</article-title>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. D.</given-names>
            <surname>Dragan</surname>
          </string-name>
          ,
          <article-title>Learning from planation of sequential decision making, ArXiv</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <article-title>richer human guidance: Augmenting comparison-based learning with feature queries</article-title>, in: Proceedings of the 13th International Conference on Human-Robot Interaction, 2018, pp. 132-140.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <label>8</label>
        <mixed-citation>D. S. Brown, J. J. Schneider, S. Niekum, Value alignment verification, in: Proceedings of the 38th International Conference on Machine Learning, 2021, pp. 1105-1115.</mixed-citation>
      </ref>
      <ref id="ref20">
        <label>9</label>
        <mixed-citation>P. Abbeel, A. Y. Ng, Apprenticeship learning via inverse reinforcement learning, in: Proceedings of the 21st International Conference on Machine Learning, 2004.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>CoRR abs/2205.15462 (2022).</mixed-citation>
      </ref>
      <ref id="ref22">
        <label>20</label>
        <mixed-citation>S. Lundberg, S.-I. Lee, A unified approach to interpreting model predictions, 2017. arXiv:1705.07874.</mixed-citation>
      </ref>
      <ref id="ref23">
        <label>21</label>
        <mixed-citation>J. Tayyub, M. Sarmad, N. Schönborn, Explaining deep neural networks for point clouds using gradient-based visualisations, arXiv preprint arXiv:2207.12984 (2022).</mixed-citation>
      </ref>
      <ref id="ref24">
        <label>22</label>
        <mixed-citation>K. Simonyan, A. Vedaldi, A. Zisserman, Deep inside convolutional networks: Visualising image classification models and saliency maps, CoRR abs/1312.6034 (2014).</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>23</label>
        <mixed-citation>T. Chakraborti, S. Sreedharan, Y. Zhang, S. Kambhampati, Plan explanations as model reconciliation: Moving beyond explanation as soliloquy, arXiv preprint arXiv:1701.08317 (2017).</mixed-citation>
      </ref>
      <ref id="ref26">
        <label>24</label>
        <mixed-citation>O. Amir, F. Doshi-Velez, D. Sarne, Summarizing agent strategies, Autonomous Agents and Multi-Agent Systems 33 (2019) 628-644.</mixed-citation>
      </ref>
      <ref id="ref27">
        <label>25</label>
        <mixed-citation>S. H. Huang, K. Bhatia, P. Abbeel, A. D. Dragan, Establishing appropriate trust via critical states, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, 2018, pp. 3929-3936.</mixed-citation>
      </ref>
      <ref id="ref28">
        <label>26</label>
        <mixed-citation>A. Tabrez, S. Agrawal, B. Hayes, Explanation-based reward coaching to improve human performance via reinforcement learning, in: Proceedings of the 14th ACM/IEEE International Conference on Human-Robot Interaction, 2019, pp. 249-257.</mixed-citation>
      </ref>
      <ref id="ref29">
        <label>27</label>
        <mixed-citation>D. S. Brown, W. Goo, S. Niekum, Better-than-demonstrator imitation learning via automatically-ranked demonstrations, in: Proceedings of the 3rd Annual Conference on Robot Learning, 2019, pp. 330-359.</mixed-citation>
      </ref>
      <ref id="ref30">
        <label>28</label>
        <mixed-citation>R. Ramakrishnan, E. Kamar, D. Dey, J. Shah, E. Horvitz, Discovering blind spots in reinforcement learning, in: Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems, 2018, pp. 1017-1025.</mixed-citation>
      </ref>
      <ref id="ref31">
        <label>29</label>
        <mixed-citation>C. Molnar, Interpretable Machine Learning: A Guide for Making Black Box Models Explainable, Lulu.com, 2022.</mixed-citation>
      </ref>
      <ref id="ref32">
        <label>30</label>
        <mixed-citation>I. Lage, D. Lifschitz, F. Doshi-Velez, O. Amir, Exploring computational user models for agent policy summarization, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019, pp. 1401-1407.</mixed-citation>
      </ref>
      <ref id="ref33">
        <label>31</label>
        <mixed-citation>L. Langosco, J. Koch, L. Sharkey, J. Pfau, D. Krueger, Goal misgeneralization in deep reinforcement learning, in: Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 12004-12019.</mixed-citation>
      </ref>
      <ref id="ref34">
        <label>32</label>
        <mixed-citation>J. Bradbury, R. Frostig, P. Hawkins, M. J. Johnson, et al., JAX: composable transformations of Python+NumPy programs, http://github.com/google/jax, 2018.</mixed-citation>
      </ref>
      <ref id="ref35">
        <label>33</label>
        <mixed-citation>J. Leike, M. Martic, V. Krakovna, P. A. Ortega, et al., AI safety gridworlds, ArXiv abs/1711.09883 (2017).</mixed-citation>
      </ref>
      <ref id="ref36">
        <label>34</label>
        <mixed-citation>B. Settles, Active learning, Synthesis Lectures on Artificial Intelligence and Machine Learning 6 (2012) 1-114.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>