=Paper=
{{Paper
|id=Vol-2419/paper28
|storemode=property
|title=Detecting Spiky Corruption in Markov Decision Processes
|pdfUrl=https://ceur-ws.org/Vol-2419/paper_28.pdf
|volume=Vol-2419
|authors=Jason Mancuso,Tomasz Kisielewski,David Lindner,Alok Singh
|dblpUrl=https://dblp.org/rec/conf/ijcai/MancusoKLS19
}}
==Detecting Spiky Corruption in Markov Decision Processes==
Jason Mancuso1∗, Tomasz Kisielewski2∗, David Lindner3∗ and Alok Singh4∗

1 Dropout Labs, 2 Independent Researcher, 3 ETH Zürich, 4 Terrafuse

jason@manc.us, tymorl@gmail.com, lindnerd@ethz.ch, alok.singh@berkeley.edu

∗ Equal contribution.
Abstract

Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes reward different from what it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that our algorithm can detect the corrupt states and learn the optimal policy despite the corruption.

1 Introduction

The reward function distinguishes reinforcement learning (RL) from other forms of learning. If the reward function is misspecified, the agent can learn undesired behavior [Everitt et al., 2017]. It is an open problem in RL to detect or avoid misspecified rewards [Amodei et al., 2016].

[Everitt et al., 2017] formalize this problem by introducing Corrupt Reward Markov Decision Processes (CRMDPs). Informally, a CRMDP is an MDP in which the agent receives a corrupted reward signal. A No Free Lunch Theorem for CRMDPs states that they are unlearnable in general; however, additional assumptions on the problem can lead to learnable subclasses. In particular, [Everitt et al., 2017] introduce a set of strong assumptions that allow a quantilizing agent to achieve sublinear regret. Other assumptions, however, may lead to distinct learnable subclasses that are also useful in practice.

In this work, we propose a set of assumptions to create such a subclass. Intuitively, our assumptions capture the notion of the corruption being "spiky" with respect to a distance measure on the state space.

1.1 Problem motivation

CRMDPs naturally capture different notions of reward misspecification such as wireheading, side effects, avoiding supervision, and sensory malfunction [Everitt et al., 2017; Amodei et al., 2016]. This makes them a useful framework for developing RL agents robust to reward corruption.

Additionally, different approaches to learning reward functions can be interpreted in the CRMDP framework, such as semi-supervised RL [Finn et al., 2016], cooperative inverse RL [Hadfield-Menell et al., 2016], learning from human preferences [Christiano et al., 2017], and learning values from stories [Riedl and Harrison, 2016]. Approaches that learn a reward function from expert input, such as inverse reinforcement learning (IRL) [Ng et al., 2000], can yield corrupt reward functions when learning from an expert of bounded rationality or from sub-optimal demonstrations. CRMDPs may be able to provide new theoretical guarantees and insights for IRL methods, which are often limited by the assumption that the expert acts nearly optimally.

1.2 Solution motivation

Our approach is inspired by connections to work on robustness to noise and fairness in supervised learning. The connection between supervised learning and RL has been discussed before (e.g. see [Barto and Dietterich, 2004]); here, we only use it in an informal way to motivate our approach.

In particular, one way to view supervised learning is as a special case of RL. In this interpretation, the RL policy corresponds to the supervised learning model, the actions are to pick a specific label for an input, and the reward is an indicator function that is 1 if the true label matches our pick, and 0 otherwise. The reward is provided by an oracle that can provide the true labels for a fixed training set of samples.

In noisy supervised learning, this oracle can be fallible, meaning the true label of an instance may not match the label the oracle provides. In this setting, the goal is to learn the true reward function despite only having access to a corrupt reward function. It has been observed that deep neural networks can learn in the presence of certain kinds of noise [Rolnick et al., 2017; Drory et al., 2018], which suggests that some classes of CRMDPs beyond those investigated in [Everitt et al., 2017] can be solved.

For further inspiration, we turn to the field of fairness in supervised classification. [Dwork et al., 2012] provide a natural definition of individual fairness using distance metrics on the input and output spaces and a corresponding Lipschitz condition. Intuitively, a classifier is considered fair if it provides similar labels for similar input samples. Our approach to solving CRMDPs is similar. However, we apply Lipschitz conditions to the reward function rather than the classifier. A simple derivation shows that these interpretations are equivalent when the likelihood of the classifier is used to define the reward function.

1.3 Related Work

We aim to detect reward corruption by assuming the true reward to be "smooth" and the corruption to be "spiky". Smoothness of the reward function with respect to a distance in state space is a classic notion in machine learning [Santamaría et al., 1997]. Assumptions about the smoothness of the reward function have also been used to ensure safety properties for RL algorithms, for example by [Turchetta et al., 2016] or [García and Fernández, 2012].

Both define smoothness with respect to a distance metric in the state space, which is similar to our approach. However, they tackle the problem of safely exploring an MDP, i.e. without visiting dangerous states, and do not consider reward corruption. Another key difference is the assumptions on the distance functions, which are a subset of metrics connected to the (known) transition function in [Turchetta et al., 2016], or are just the Euclidean distance in [García and Fernández, 2012]. In contrast, we allow any metric.

There also exist approaches for automatically learning distance functions on the state space [Taylor et al., 2011; Globerson and T. Roweis, 2005; Davis et al., 2007; Jain et al., 2009]. Such methods might be used in future work to remove the need for explicitly providing a distance function.

2 Problem statement

Let us recall the definition of CRMDPs from [Everitt et al., 2017].

Definition 1 (CRMDP). A Corrupt Reward MDP (CRMDP) is a finite-state MDP ⟨S, A, T, R⟩ with an additional corrupt reward function C : S → R. We call ⟨S, A, T, R⟩ the underlying MDP, Sn = {x ∈ S | R(x) = C(x)} the set of non-corrupt states, and its complement, Sc = S \ Sn, the set of corrupt states.

Note that C represents the reward observed by an agent and may be equal to the real reward R in some, or even all, of the states. Our aim is to identify the corrupt states in Sc and learn an optimal policy with respect to R while only observing C. In general, this is impossible according to the CRMDP No Free Lunch Theorem. However, special classes of CRMDPs may not have this limitation [Everitt et al., 2017]. Therefore, we consider CRMDPs with a specific form of R and C.

Definition 2 (Spiky CRMDP). Let M be a CRMDP with two additional functions, d and LV. d : S × S → R+ is a metric on the state space, and LV : Powerset(S) × S → R+ is non-decreasing with respect to set inclusion. We call M a Spiky CRMDP if the following assumptions are satisfied:

1. Sn is nonempty,
2. ∀x, y ∈ S: |R(x) − R(y)| ≤ d(x, y),
3. ∀x ∈ Sc: LV_Sn(x) > sup_{y∈Sn} LV_S(y).

We call d the distance between the states, and LV the Lipschitz violation measure.

Intuitively, the distance d should capture some notion of smoothness of the true reward function in each state. The goal is to construct this distance such that the reward in corrupt states is much less smooth (hence, "spiky"). Assumptions 2 and 3 formalize this intuition.

Note the strong relationship between the distance and reward functions. For a given Spiky CRMDP, one cannot be modified independently of the other without breaking the assumptions. In particular, any linear transformation applied to the reward, such as scaling, has to also be applied to the distance function.

The LV function is meant to be a measure of the Lipschitz violations of a state – "how much" a given state violates assumption 2 with respect to some set of states Sn when one substitutes C for R. We propose two ways of measuring this, which further refine the class of Spiky CRMDPs. Unless otherwise noted, all our examples satisfy the conditions in definition 2 with either of these functions used as LV.

Definition 3 (NLV). The Number of Lipschitz Violations of x ∈ S with respect to A ⊆ S is

NLV_A(x) := μ({y ∈ A | |C(x) − C(y)| > d(x, y)}),

where μ is a measure on S.

Definition 4 (TLV). The Total Lipschitz Violation of x ∈ S with respect to A ⊆ S is

TLV_A(x) := ∫_{y∈A} max{0, |C(x) − C(y)| − d(x, y)} dμ(y),

where μ is a measure on S.

Note that both variants require a measure on the state space. In general there might not be a natural choice, but for finite state spaces, which we will consider, the counting measure is often reasonable. In this case, NLV counts the number of states which violate the Lipschitz condition with the given state, while TLV sums up the magnitudes of the violations.
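For a finite state space with the counting measure, both violation measures reduce to simple sums. The sketch below is our own illustration (not code from the paper's repository); it assumes the observed reward C and the distance d are available as Python callables and that A is any finite collection of states:

<pre>
# Minimal sketch of NLV and TLV for a finite state space with the counting
# measure. `C` is the observed (possibly corrupt) reward and `d` the distance
# function from definition 2; both are assumed to be given callables.

def nlv(x, A, C, d):
    """Number of Lipschitz Violations of state x with respect to the set A."""
    return sum(1 for y in A if abs(C(x) - C(y)) > d(x, y))

def tlv(x, A, C, d):
    """Total Lipschitz Violation of state x with respect to the set A."""
    return sum(max(0.0, abs(C(x) - C(y)) - d(x, y)) for y in A)
</pre>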
3 Theoretical Results

3.1 Corruption identification

With this setup in place, we can now introduce an algorithm to detect corrupt states in finite Spiky CRMDPs, which is shown in algorithm 1. The core idea is simple: we maintain a set of corrupt states, initially empty, and sort all states descending by their Lipschitz violation with respect to all states. Then for each state we check its Lipschitz violation with respect to all states that we have not identified as corrupt yet. If it is positive, we mark the state as corrupt. As soon as we encounter a state with zero Lipschitz violation, we are done.

This algorithm makes use of assumptions 2 and 3 in definition 2. Assumption 3 makes sure that by sorting the states in descending order we consider all corrupt states first. Assumption 2 then provides us with a simple stopping condition, namely no further violations of the Lipschitz condition.

Algorithm 1: An algorithm for identifying corrupt states in Spiky CRMDPs.

<pre>
function IdentifyCorruptStates(S, LV) → Ŝc
    Ŝc ← ∅
    Sort x ∈ S by LV_S(x) decreasing
    for x ∈ S do
        if LV_{S\Ŝc}(x) = 0 then
            return Ŝc
        Add x to Ŝc
</pre>

Algorithm 2: An example of using algorithm 1 online when 3' is satisfied.

<pre>
function LearnOnline(π, LV)
    Ŝc ← ∅
    for τ ∼ π do
        X ← IdentifyCorruptStates(τ, LV)
        Ŝc ← Ŝc ∪ X
        Ŝn ← Ŝn ∪ (τ \ Ŝc)
        Train RL agent using reward signal lb_Ŝn
</pre>
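A direct Python transcription of algorithm 1, using NLV as the Lipschitz violation measure, might look as follows. This is a sketch under the stated assumptions, not the reference implementation from the linked repository:

<pre>
# Sketch of algorithm 1 (IdentifyCorruptStates) for a finite Spiky CRMDP.
# `states` is an iterable of (hashable) states, `C` the observed reward and
# `d` the distance function.

def nlv(x, A, C, d):
    # Number of Lipschitz Violations of x with respect to A (counting measure).
    return sum(1 for y in A if abs(C(x) - C(y)) > d(x, y))

def identify_corrupt_states(states, C, d):
    states = list(states)
    corrupt = set()
    # Assumption 3 guarantees that corrupt states have the largest violations,
    # so sorting in decreasing order processes them first.
    for x in sorted(states, key=lambda s: nlv(s, states, C, d), reverse=True):
        remaining = [y for y in states if y not in corrupt]
        if nlv(x, remaining, C, d) == 0:
            # Assumption 2: no remaining violations means no corrupt states left.
            return corrupt
        corrupt.add(x)
    return corrupt
</pre>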
Proposition 1. Let M be a spiky CRMDP. Then algorithm 1 returns Sc when given S and LV as input.

This proposition simply states that the detection algorithm is able to correctly detect all corrupt states. The proof of this and all following statements is included in the appendix.

3.2 A posteriori bounds on regret

We now turn to the problem of learning an optimal policy despite the corruption in a CRMDP. Our first approach is to learn from an optimistic estimate of the true reward based on the Lipschitz condition on the rewards. To this end we first define such upper and lower bounds on the rewards.

Definition 5. Let M = ⟨S, A, T, R, C, d, LV⟩ be a spiky CRMDP. Then for a state x we define the reward lower Lipschitz bound to be

lb(x) = max_{y∈Sn} (R(y) − d(x, y)).

We call any state that attains this bound and is closest to x the lower Lipschitz bounding state. Symmetrically, we define the reward upper Lipschitz bound and the upper bounding states:

ub(x) = min_{y∈Sn} (R(y) + d(x, y)).

For a non-corrupt state, both bounds just equal the true reward, so the state is its own bounding state. We also get lb(x) ≤ R(x) and ub(x) ≥ R(x), because the distance function is positive definite.

Note that the reward lower (or upper) Lipschitz bound and bounding state can be computed by the agent after identifying the corrupt states, because this only requires access to the distance function and to the real reward function for non-corrupt states, which is equal to the corrupt reward there.

After computing these bounds, the agent can compute upper bounds on the regret it is experiencing. Finding a policy with respect to this optimistic estimate of the true reward function of corrupt states gives us a way to bound the expected regret using the Lipschitz bounds.

Proposition 2. The expected regret with respect to R of a policy π′ optimal with respect to ub (the reward upper Lipschitz bound) is bounded from above by

sup_{π ub-optimal} E_{τ∼π} Σ_{x∈τ} (ub(x) − lb(x)).    (1)

This bound is not particularly useful as a theoretical result, because it is fairly easy to contrive CRMDPs where it becomes as large as the difference between the least and greatest possible cumulative rewards.

However, it might be useful to increase sample efficiency in an active reward learning setting. Say we have a supervisor that we can ask to provide us with the real reward of a given state, but such a question is expensive. We therefore want to maximize the information we get from a single question. The way we compute the bound (1) allows us to pick a question such that the upper bound on regret improves the most, which is a good criterion for question quality. We do not investigate this further, but suggest it as useful future work.
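In practice, an agent could estimate the inner expectation in bound (1) from its own rollouts by averaging the per-trajectory gap Σ_{x∈τ} (ub(x) − lb(x)). The following sketch is our own illustration; `ub` and `lb` are assumed to be computable as in definition 5, and trajectories are lists of states:

<pre>
# Monte Carlo estimate of the a posteriori regret bound (1) for one policy,
# given sampled trajectories and the Lipschitz bounds ub and lb.

def regret_bound_estimate(trajectories, ub, lb):
    """Average over trajectories of the per-trajectory gap sum(ub - lb)."""
    gaps = [sum(ub(x) - lb(x) for x in tau) for tau in trajectories]
    return sum(gaps) / len(gaps)
</pre>

The same per-state gaps ub(x) − lb(x) could also be used to rank candidate questions to a supervisor in the active reward learning setting sketched above.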
3.3 Optimality with corruption avoidance

To be able to guarantee an optimal policy despite corruption, we have to make an additional assumption about the environment. In particular, we will assume that the underlying MDP has at least one optimal policy which avoids all corrupt states. This essentially means that identifying and then avoiding the corrupt states is enough to solve the environment.

Proposition 3. Let M = ⟨S, A, T, R, C, d, LV⟩ be a spiky reward CRMDP and Ṁ = ⟨S, A, T, R⟩ its underlying MDP. If there exists an Ṁ-optimal policy π∗ generating a trajectory τ ∼ π∗ that does not contain any corrupt states, then any policy optimal with respect to M using lb as a reward function will also be Ṁ-optimal.

The assumption of an optimal policy that always avoids corrupt states is very strong, especially in stochastic environments. However, this is to be expected since our result allows for solutions without any regret.

It is also worth noting that the assumption might be slightly weakened in practice, as we discuss in our experimental results.

4 Practical considerations

Algorithm 1 sorts over an entire state space, which requires (1) complete knowledge of the state space and (2) computational resources to perform a sorting operation.

4.1 Modification for Online Learning

We would like our algorithm to work online, i.e. considering at most a small batch of trajectories at once. Our current assumptions are not enough for such an algorithm to be correct. However, it becomes possible by strengthening assumption 3 in definition 2:

3'. ∀π, ∀τ ∼ π, ∀x ∈ τc: LV_τn(x) > sup_{y∈τn} LV_τ(y),

where τ ∼ π is a trajectory sampled from policy π. We treat the trajectory τ as a sequence of states, τc are the corrupt states in this sequence, and τn = τ \ τc.

This is the same condition as before, except that we require it to hold over all possible trajectories through the MDP. Functionally speaking, this allows us to be sure that we will be able to iteratively perform corruption identification without misidentifying any states. This assumption is much stronger than the previous version and restricts the class of environments satisfying it considerably. In practice, we believe that many interesting environments will still satisfy it, and many of those that do not may still be learnable in a similar manner.

With assumption 3' satisfied, we can use algorithm 1 for online reinforcement learning, as shown in algorithm 2. To do this, we sample trajectories from the current policy of the agent, apply algorithm 1 to the individual trajectories, and then update the policy using the reward lower Lipschitz bound.
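The reward substitution used in this online scheme is straightforward to sketch: maintain the sets Ŝc and Ŝn as trajectories come in, and replace the observed reward of identified corrupt states by the lower Lipschitz bound computed over Ŝn. The sketch below is our own illustration of this loop; it reuses `identify_corrupt_states` from the earlier sketch, while `sample_trajectory` and `train_step` are hypothetical stand-ins for the rollout and RL update (e.g. a PPO step) of whatever agent is being trained:

<pre>
# Sketch of the online scheme (algorithm 2): identify corrupt states in each
# trajectory and train on the lower-Lipschitz-bound reward signal lb_Ŝn.

def lb_hat(x, known_noncorrupt, C, d):
    # lb_Ŝn(x) = max over identified non-corrupt y of C(y) - d(x, y);
    # on non-corrupt states C equals the true reward R.
    if not known_noncorrupt:
        return C(x)  # fall back until a non-corrupt state has been seen
    return max(C(y) - d(x, y) for y in known_noncorrupt)

def learn_online(sample_trajectory, train_step, C, d, num_iterations=1000):
    corrupt, noncorrupt = set(), set()
    for _ in range(num_iterations):
        tau = sample_trajectory()                      # states from current policy
        corrupt |= identify_corrupt_states(tau, C, d)  # algorithm 1 on the trajectory
        noncorrupt |= {x for x in tau if x not in corrupt}
        # Corrupt states get the pessimistic bound; non-corrupt states keep C = R.
        rewards = [lb_hat(x, noncorrupt, C, d) if x in corrupt else C(x)
                   for x in tau]
        train_step(tau, rewards)
</pre>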
Note that lb_Ŝn(x) = max_{y∈Ŝn} (R(y) − d(x, y)). A straightforward application of proposition 1 shows that Ŝc → Sc from below and Ŝn → Sn from above when using algorithm 2.

In order to actually use the identified states, we also need the ability to compute the lb and ub functions. This once again would normally require access to the whole state space with corrupt states identified. However, we can approximate lb and ub by slowly building up the known state space and corrupt state space. This is what is reflected in algorithm 2, specifically when using lb_Ŝn ≈ lb as a reward signal. This approximation is pessimistic for corrupt states but converges as Ŝn → Sn.

4.2 Memory complexity

Memory complexity is an additional challenge for our algorithm. Keeping the whole state space in memory is usually not feasible, and even keeping only the encountered states can quickly result in performance problems. We implemented some optimizations to reduce the memory consumption of our approach. While they had no effect in the small toy environments we consider in this paper, for completeness and future reference we include a description of these optimizations in appendix B.

5 Experiments

We ran experiments on gridworld environments, detailed below. On each environment, we trained three different agents using an implementation of PPO [Schulman et al., 2017].

The first agent just uses PPO with access to the corrupt reward, without any consideration for corruption.

The second agent uses PPO with access to the hidden reward, with the corruption removed. These are our two baselines – how well an algorithm performs on the corrupt reward and how well it could perform on the environment if it was not corrupted.

The third agent uses algorithm 2 during the rollout phase of the PPO algorithm. In particular, it identifies corrupt states in the rollouts and replaces their rewards by the Lipschitz bounds. The full code used, logs and commands to reproduce the experiments can be found at: https://github.com/jvmancuso/safe-grid-agents.

To evaluate the quality of our algorithm, for each experiment we calculate the average corrupt reward, the average hidden reward, the sample complexity needed to achieve this result, and the ratio of this complexity to the complexity needed in the non-corrupt baseline. The results are summarized in Table 1 and we proceed by discussing them in detail.

5.1 Toy Spiky-CRMDP

Similar to [Everitt et al., 2017], we construct a toy example under which our learnability guarantee from proposition 3 is satisfied. We also slightly tweak this toy example to break the requirements for proposition 3, with the hope of demonstrating that the theorem's requirements can be relaxed to some extent without harming learnability. Figure 1 shows the toy example and its modification.

[Figure 1: Toy Spiky CRMDP environments Corners on the left and OnTheWay on the right. The blue cell in the lower right hand corner is the starting position of the agent and the green cell in the upper left hand corner the goal. The true reward collected in each state is determined by the max-distance to the goal and shown by the numbers in each cell. The reward in the red cells is corrupted and the underlying true reward is shown in parentheses.]

The gridworlds are shown in Figure 1. The agent starts on the blue field and, in typical gridworld fashion, can move up/down/left/right as its action. The red cells are corrupted states and give the agent an unusually high reward, thereby satisfying our conditions about the "spikyness" of the corruption. We use the Manhattan metric as our distance measure.

Note that in the environment on the right, the optimal policy must encounter corrupt states, violating the assumption of proposition 3. However, because the optimal policy remains optimal after substituting lb for the reward, we should still expect good performance from our algorithm.

5.2 Results

The baseline results for both environments are as expected. PPO with access to the corrupt reward very quickly learns the corrupt-optimal policy, going straight to the bottom left or top right corner. PPO with access to the hidden reward needs significantly more data to learn the optimal policy, because the problem is more complicated and cannot be solved with a constant policy.
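For reference, the Corners environment described in section 5.1 can be written down in a few lines. The sketch below is our own reconstruction from Figure 1 and the surrounding text (grid size, goal value of 10 and corrupt value of 11 are read off the figure), not code from the linked repository:

<pre>
# Reconstruction of the Corners gridworld from Figure 1: a 5x5 grid with the
# goal in the upper left corner, the agent start in the lower right corner,
# and two corrupted corner cells.

GOAL = (0, 0)                 # upper left corner (green cell)
START = (4, 4)                # lower right corner (blue cell)
CORRUPT = {(0, 4), (4, 0)}    # red cells: upper right and lower left corners

def true_reward(cell):
    # True reward is determined by the max-distance (Chebyshev) to the goal.
    r, c = cell
    return 10 - max(abs(r - GOAL[0]), abs(c - GOAL[1]))

def corrupt_reward(cell):
    # Corrupt cells report an unusually high ("spiky") reward of 11 instead of 6.
    return 11 if cell in CORRUPT else true_reward(cell)

def manhattan(a, b):
    # Distance measure d used for the Lipschitz condition.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])
</pre>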
Environment | Reward    | Agent    | Avg. Corrupt Reward | Avg. True Reward | Sample Complexity | SC ratio
Corners     | Corrupt   | Baseline | 73                  | 48               | 5421              | 0.17
Corners     | Uncorrupt | Baseline | 64                  | 64               | 31410             | 1.00
Corners     | Uncorrupt | CRMDP    | 64                  | 64               | 32250             | 1.03
Corners     | Corrupt   | CRMDP    | 64                  | 64               | 57000             | 1.81
OnTheWay    | Corrupt   | Baseline | 73                  | 48               | 4194              | 0.09
OnTheWay    | Uncorrupt | Baseline | 64                  | 64               | 45310             | 1.00
OnTheWay    | Uncorrupt | CRMDP    | 64                  | 64               | 51750             | 1.14
OnTheWay    | Corrupt   | CRMDP    | 64                  | 64               | 94380             | 2.08

Table 1: Results of the gridworld experiments described in section 5. In addition to the final corrupted and hidden true reward we report the sample complexity of each, which is defined as the number of episodes required for a moving average (momentum=0.9) of observed return to reach its optimal value. The SC ratio is the ratio of the sample complexity compared to the baseline model that has access to the hidden true reward function. Note that sample complexity measures are generally susceptible to noise, and should be interpreted with caution.
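The sample complexity in Table 1 can be computed from per-episode returns with a simple exponential moving average. The snippet below is our own illustration of the definition in the caption, interpreting "reach its optimal value" as the moving average hitting the best achievable return:

<pre>
# Sample complexity as defined in the caption of Table 1: the first episode at
# which a moving average (momentum 0.9) of observed return reaches its optimum.

def sample_complexity(episode_returns, optimal_return, momentum=0.9):
    avg = None
    for episode, ret in enumerate(episode_returns, start=1):
        avg = ret if avg is None else momentum * avg + (1 - momentum) * ret
        if avg >= optimal_return:
            return episode
    return None  # the moving average never reached the optimal value
</pre>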
As a sanity check we ran an additional baseline test for these toy environments – our algorithm with access to the hidden reward. As it does not encounter any states it could perceive as corrupt, it performs comparably to the baseline.

Finally, we ran our algorithm without access to the hidden reward. In both cases it learned the optimal policy, requiring about two times as much data as the agent with access to the hidden reward. It is worth pointing out that because of the way the environments are constructed, this additional data is most likely not used for learning the bounds on the true reward. These bounds will almost always be as good as they can be as soon as the agent identifies the corruption. Rather, the increased difficulty probably stems from the lower differences between optimal and sub-optimal policy payoffs.

6 Conclusion

The class of Spiky CRMDPs resolves several limitations of the class of previously known, solvable CRMDPs. In particular, this class of MDPs need not have finite diameter (state spaces symmetric in time), and we demonstrate that they can be solved in the usual MDP formalism without recourse to the decoupled RL of [Everitt et al., 2017].

Despite the experimental support for our algorithm's success in toy gridworld environments, there are several limitations of our solution. Even though we can minimize the regret, it required quite a few assumptions to do so. Our results for the OnTheWay environment suggest that these assumptions can be weakened further, and this could be useful future work. However, we believe the regret bound is most intriguing, as it can be used for accelerating exploration in decoupled RL schemes. This is most apparent for semi-supervised RL, but also applies to other settings in which reward information can be inferred from external channels or actors. For this bound to become practically useful, future work should prioritize learning the Spiky CRMDP distance metric d from trajectory data in an online or active reward learning setting.

More generally, Lipschitz reward functions and spiky corruption can be seen as a particularly strong prior in the Bayesian RL setting. Our theorems and experiments demonstrate that this can be used to encode useful inductive biases in relevant environments. While we demonstrate a single instance of its usefulness in learning within misspecified environments, these priors can be very useful in a wide variety of practical settings in which Bayesian RL has traditionally fallen short.

7 Acknowledgements

Thanks to Victoria Krakovna and Tom Everitt for generous feedback and insightful discussions about this work and to Rohin Shah, Jan Leike, and Geoffrey Irving for providing valuable feedback on early proposals of our approach.

Furthermore, thanks to the organizers of the 2nd AI Safety Camp during which a significant portion of this work was completed.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

[Barto and Dietterich, 2004] A. G. Barto and T. G. Dietterich. Reinforcement learning and its relationship to supervised learning. John Wiley and Sons, Inc, 2004.

[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.

[Davis et al., 2007] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 209–216, New York, NY, USA, 2007. ACM.

[Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410, 2018.

[Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, 2012.

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Conference on Artificial Intelligence, 2017.
[Finn et al., 2016] Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429, 2016.

[García and Fernández, 2012] Javier García and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res., 45(1):515–564, September 2012.

[Globerson and T. Roweis, 2005] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, volume 18, 2005.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137, 2016.

[Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 761–768. Curran Associates, Inc., 2009.

[Ng et al., 2000] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.

[Riedl and Harrison, 2016] Mark O. Riedl and Brent Harrison. Using stories to teach human values to artificial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI Workshop, Phoenix, Arizona, USA, February 13, 2016, 2016.

[Rolnick et al., 2017] David Rolnick, Andreas Veit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017.

[Santamaría et al., 1997] Juan Carlos Santamaría, Richard S. Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217, 1997.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS, 2011.

[Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Neural Information Processing Systems (NIPS), pages 4305–4313, December 2016.

A Proofs of theoretical results

Proof of proposition 1:

Proof. First note that

∀x ∈ Sc: LV_S(x) ≥ LV_Sn(x) > sup_{y∈Sn} LV_S(y),

because in general LV is non-decreasing with respect to inclusion, that is LV_A(x) ≤ LV_B(x) if A ⊆ B. This immediately tells us that all corrupt states x ∈ Sc are processed by Algorithm 1 before all non-corrupt states y ∈ Sn.

We will show that all states added to the set Ŝc are actually corrupt. For a state x ∈ S to be added to that set it has to satisfy

LV_{S\Ŝc}(x) > 0.

This condition can be satisfied only if Sc \ Ŝc ≠ ∅, because otherwise S \ Ŝc would contain no corrupt states and no non-corrupt state can violate the Lipschitz condition. But Algorithm 1 processes all corrupt states before any non-corrupt states, so x has to be among the corrupt ones.

The algorithm returns as soon as it finds a state x satisfying LV_{S\Ŝc}(x) = 0. Because of the processing order it is sufficient to prove that such a state is a non-corrupt state. Note that

LV_{S\Ŝc}(x) ≥ LV_Sn(x),

because, as we have already shown, Ŝc ⊆ Sc. Since the left hand side of this expression is zero and the right one is non-negative by definition, it also has to be zero. But zero cannot be strictly greater than a nonempty supremum of nonnegative values, so x cannot be corrupt.

Proof of proposition 2:

Proof. First note that any policy π has average cumulative reward with respect to ub greater than or equal to its average cumulative reward with respect to R,

E_{τ∼π} Σ_{x∈τ} R(x) ≤ E_{τ∼π} Σ_{x∈τ} ub(x),    (2)

because in general ub(x) ≥ R(x), as the only difference between these functions is a substitution of upper bounds for some, possibly lower, values of R. In particular, since this is also true for R-optimal policies, any ub-optimal policy π′ will have average cumulative reward with respect to ub greater than or equal to what the R-optimal policy π∗ has with respect to R:

E_{τ∼π∗} Σ_{x∈τ} R(x) ≤ E_{τ∼π∗} Σ_{x∈τ} ub(x) ≤ E_{τ∼π′} Σ_{x∈τ} ub(x).    (3)

Now, by (2), we get another bound on the average cumulative reward of any policy π:

E_{τ∼π} Σ_{x∈τ} [ub(x) + (lb(x) − ub(x))] = E_{τ∼π} Σ_{x∈τ} lb(x) ≤ E_{τ∼π} Σ_{x∈τ} R(x).

By moving the part in parentheses to the right side of the inequality we get

E_{τ∼π} Σ_{x∈τ} ub(x) ≤ E_{τ∼π} Σ_{x∈τ} [R(x) + (ub(x) − lb(x))].    (4)

Using (3) and (4) for ub-optimal policies π′ we get

E_{τ∼π∗} Σ_{x∈τ} R(x) ≤ E_{τ∼π′} Σ_{x∈τ} ub(x) ≤ E_{τ∼π′} Σ_{x∈τ} [R(x) + (ub(x) − lb(x))].
We get the final bound by moving the reward to the left hand side and taking a supremum over ub-optimal policies of both sides:

sup_{π′ ub-optimal} [ E_{τ∼π∗} Σ_{x∈τ} R(x) − E_{τ∼π′} Σ_{x∈τ} R(x) ] ≤ sup_{π′ ub-optimal} E_{τ∼π′} Σ_{x∈τ} (ub(x) − lb(x)).
Proof of proposition 3:

Proof. Recall the inequality derived in the proof of proposition 2,

E_{τ∼π} Σ_{x∈τ} lb(x) ≤ E_{τ∼π} Σ_{x∈τ} R(x).

This means that π∗ is lb-optimal: since its trajectory avoids corrupt states, where lb equals R, its average cumulative reward does not change, so it is still the greatest possible. Any other lb-optimal policy has to get at least as much average cumulative reward as π∗ with respect to lb. Since it would get at least as much reward with respect to R, it is also R-optimal.
B Reducing memory consumption
We cannot avoid storing all the corrupt states, as we need to
substitute our approximations for the rewards received when
encountering them. However, keeping all the non-corrupt
states in memory is not strictly necessary, because they are
only used to improve our approximation of lb. To reduce
memory consumption, we can instead keep only a small set
of them and use it to update the cached lb of all known corrupt
states. We add newly encountered non-corrupt states to this
set, but if it gets too large, we remove some states at random.
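A sketch of this bookkeeping, in our own (hypothetical) naming: keep a bounded random sample of non-corrupt states and use each newly seen non-corrupt state to refine the cached lb of every known corrupt state.

<pre>
import random

# Sketch of the memory optimization: a bounded set of non-corrupt states with
# random eviction, used only to tighten the cached lb of known corrupt states.

class LipschitzBoundCache:
    def __init__(self, C, d, max_noncorrupt=1000):
        self.C, self.d = C, d              # observed reward and distance function
        self.max_noncorrupt = max_noncorrupt
        self.noncorrupt = []               # small sample of non-corrupt states
        self.lb = {}                       # cached lower bounds for corrupt states

    def add_corrupt(self, x):
        # Corrupt states must all be kept; initialize lb from the current sample.
        self.lb[x] = max((self.C(y) - self.d(x, y) for y in self.noncorrupt),
                         default=float("-inf"))

    def add_noncorrupt(self, y):
        # Every new non-corrupt state can only improve (raise) the cached bounds.
        for x in self.lb:
            self.lb[x] = max(self.lb[x], self.C(y) - self.d(x, y))
        self.noncorrupt.append(y)
        if len(self.noncorrupt) > self.max_noncorrupt:
            # Evict a random state to keep memory use bounded.
            self.noncorrupt.pop(random.randrange(len(self.noncorrupt)))
</pre>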
Since we always update our approximation of lb using
newly encountered non-corrupt states, it converges to the cor-
rect value as long as a state giving the best bound is not
encountered earlier than the corrupt state. Because of the
stronger version of assumption 3, we can even expect the state
giving the best bound to be in the very trajectory identifying
the corrupt state. Because of this we do not expect any prac-
tical problems with this optimization. All of our experiments
use this modification of algorithm 2; however, in the toy envi-
ronments presented the size of the cached set was bigger than
the state space, thus there was no practical effect.
Future work could further reduce memory consumption
by keeping the information about the state space in different
ways, for example using neural networks to approximate the
required functions.