Detecting Spiky Corruption in Markov Decision Processes

Jason Mancuso¹*, Tomasz Kisielewski²*, David Lindner³*, Alok Singh⁴*
¹Dropout Labs   ²Independent Researcher   ³ETH Zürich   ⁴Terrafuse
jason@manc.us, tymorl@gmail.com, lindnerd@ethz.ch, alok.singh@berkeley.edu

* Equal contribution.

Abstract

Current reinforcement learning methods fail if the reward function is imperfect, i.e. if the agent observes a reward different from the one it actually receives. We study this problem within the formalism of Corrupt Reward Markov Decision Processes (CRMDPs). We show that if the reward corruption in a CRMDP is sufficiently "spiky", the environment is solvable. We fully characterize the regret bound of a Spiky CRMDP, and introduce an algorithm that is able to detect its corrupt states. We show that this algorithm can be used to learn the optimal policy with any common reinforcement learning algorithm. Finally, we investigate our algorithm in a pair of simple gridworld environments, finding that it can detect the corrupt states and learn the optimal policy despite the corruption.

1  Introduction

The reward function distinguishes reinforcement learning (RL) from other forms of learning. If the reward function is misspecified, the agent can learn undesired behavior [Everitt et al., 2017]. It is an open problem in RL to detect or avoid misspecified rewards [Amodei et al., 2016].

[Everitt et al., 2017] formalize this problem by introducing Corrupt Reward Markov Decision Processes (CRMDPs). Informally, a CRMDP is an MDP in which the agent receives a corrupted reward signal. A No Free Lunch Theorem for CRMDPs states that they are unlearnable in general; however, additional assumptions on the problem can lead to learnable subclasses. In particular, [Everitt et al., 2017] introduce a set of strong assumptions that allow a quantilizing agent to achieve sublinear regret. Other assumptions, however, may lead to distinct learnable subclasses that are also useful in practice.

In this work, we propose a set of assumptions that creates such a subclass. Intuitively, our assumptions capture the notion of the corruption being "spiky" with respect to a distance measure on the state space.

1.1  Problem motivation

CRMDPs naturally capture different notions of reward misspecification such as wireheading, side effects, avoiding supervision, and sensory malfunction [Everitt et al., 2017; Amodei et al., 2016]. This makes them a useful framework for developing RL agents robust to reward corruption.

Additionally, different approaches to learning reward functions can be interpreted in the CRMDP framework, such as semi-supervised RL [Finn et al., 2016], cooperative inverse RL [Hadfield-Menell et al., 2016], learning from human preferences [Christiano et al., 2017], and learning values from stories [Riedl and Harrison, 2016]. Approaches that learn a reward function from expert input, such as inverse reinforcement learning (IRL) [Ng et al., 2000], can yield corrupt reward functions when learning from an expert of bounded rationality or from sub-optimal demonstrations. CRMDPs may be able to provide new theoretical guarantees and insights for IRL methods, which are often limited by the assumption that the expert acts nearly optimally.

1.2  Solution motivation

Our approach is inspired by connections to work on robustness to noise and fairness in supervised learning. The connection between supervised learning and RL has been discussed before (e.g. see [Barto and Dietterich, 2004]); here, we only use it in an informal way to motivate our approach.

In particular, one way to view supervised learning is as a special case of RL. In this interpretation, the RL policy corresponds to the supervised learning model, the actions are to pick a specific label for an input, and the reward is an indicator function that is 1 if the true label matches our pick, and 0 otherwise. The reward is provided by an oracle that can provide the true labels for a fixed training set of samples.

In noisy supervised learning, this oracle can be fallible, meaning the true label of an instance may not match the label the oracle provides. In this setting, the goal is to learn the true reward function despite only having access to a corrupt reward function. It has been observed that deep neural networks can learn in the presence of certain kinds of noise [Rolnick et al., 2017; Drory et al., 2018], which suggests that some classes of CRMDPs beyond those investigated in [Everitt et al., 2017] can be solved.

For further inspiration, we turn to the field of fairness in supervised classification. [Dwork et al., 2012] provide a natural definition of individual fairness using distance metrics on the input and output spaces and a corresponding Lipschitz condition. Intuitively, a classifier is considered fair if it provides similar labels for similar input samples. Our approach to solving CRMDPs is similar. However, we apply Lipschitz conditions to the reward function rather than to the classifier. A simple derivation shows that these interpretations are equivalent when the likelihood of the classifier is used to define the reward function.
1.3  Related Work

We aim to detect reward corruption by assuming the true reward to be "smooth" and the corruption to be "spiky". Smoothness of the reward function with respect to a distance in state space is a classic notion in machine learning [Santamaría et al., 1997]. Assumptions about the smoothness of the reward function have also been used to ensure safety properties for RL algorithms, for example by [Turchetta et al., 2016] or [García and Fernández, 2012].

Both define smoothness with respect to a distance metric in the state space, which is similar to our approach. However, they tackle the problem of safely exploring an MDP, i.e. without visiting dangerous states, and do not consider reward corruption. Another key difference is the assumptions on the distance functions, which are a subset of metrics connected to the (known) transition function in [Turchetta et al., 2016] or simply the Euclidean distance in [García and Fernández, 2012]. In contrast, we allow any metric.

There also exist approaches for automatically learning distance functions on the state space [Taylor et al., 2011; Globerson and T. Roweis, 2005; Davis et al., 2007; Jain et al., 2009]. Such methods might be used in future work to remove the need for explicitly providing a distance function.

2  Problem statement

Let us recall the definition of CRMDPs from [Everitt et al., 2017].

Definition 1 (CRMDP). A Corrupt Reward MDP (CRMDP) is a finite-state MDP ⟨S, A, T, R⟩ with an additional corrupt reward function C : S → ℝ. We call ⟨S, A, T, R⟩ the underlying MDP, Sn = {x ∈ S | R(x) = C(x)} the set of non-corrupt states, and its complement, Sc = S \ Sn, the set of corrupt states.

Note that C represents the reward observed by an agent and may be equal to the real reward R in some, or even all, of the states. Our aim is to identify the corrupt states in Sc and learn an optimal policy with respect to R while only observing C. In general, this is impossible according to the CRMDP No Free Lunch Theorem. However, special classes of CRMDPs may not have this limitation [Everitt et al., 2017]. Therefore, we consider CRMDPs with a specific form of R and C.

Definition 2 (Spiky CRMDP). Let M be a CRMDP with two additional functions, d and LV. d : S × S → ℝ+ is a metric on the state space, and LV : Powerset(S) × S → ℝ+ is non-decreasing with respect to set inclusion. We call M a Spiky CRMDP if the following assumptions are satisfied:

1. Sn is nonempty,
2. ∀x, y ∈ S, |R(x) − R(y)| ≤ d(x, y),
3. ∀x ∈ Sc, LV_{Sn}(x) > sup_{y∈Sn} LV_S(y).

We call d the distance between the states, and LV the Lipschitz violation measure.

Intuitively, the distance d should capture some notion of smoothness of the true reward function in each state. The goal is to construct this distance such that the reward in corrupt states is much less smooth (hence, "spiky"). Assumptions 2 and 3 formalize this intuition.

Note the strong relationship between the distance and reward functions. For a given Spiky CRMDP, one cannot be modified independently of the other without breaking the assumptions. In particular, any linear transformation applied to the reward, such as scaling, also has to be performed on the distance function.

The LV function is meant to measure the Lipschitz violations of a state: "how much" a given state violates assumption 2 with respect to some set of states Sn when one substitutes C for R. We propose two ways of measuring this, which further refine the class of Spiky CRMDPs. Unless otherwise noted, all our examples satisfy the conditions in definition 2 with either of these functions used as LV.

Definition 3 (NLV). The Number of Lipschitz Violations of x ∈ S with respect to A ⊆ S is

    NLV_A(x) := µ({y ∈ A | |C(x) − C(y)| > d(x, y)}),

where µ is a measure on S.

Definition 4 (TLV). The Total Lipschitz Violation of x ∈ S with respect to A ⊆ S is

    TLV_A(x) := ∫_{y∈A} max{0, |C(x) − C(y)| − d(x, y)} dµ(y),

where µ is a measure on S.

Note that both variants require a measure on the state space. In general there might not be a natural choice, but for the finite state spaces we consider, the counting measure is often reasonable. In this case, NLV counts the number of states that violate the Lipschitz condition with the given state, while TLV sums up the magnitudes of the violations.
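To make definitions 3 and 4 concrete, the following is a minimal Python sketch of both violation measures for a finite state space with the counting measure. The dictionary C standing in for the corrupt reward, the callable d for the distance, and the tiny chain of integer states are illustrative assumptions of this sketch rather than anything prescribed by the paper.

```python
from typing import Callable, Dict, Hashable, Iterable

State = Hashable

def nlv(x: State, A: Iterable[State], C: Dict[State, float],
        d: Callable[[State, State], float]) -> float:
    """Number of Lipschitz Violations of x w.r.t. A under the counting measure."""
    return sum(1 for y in A if abs(C[x] - C[y]) > d(x, y))

def tlv(x: State, A: Iterable[State], C: Dict[State, float],
        d: Callable[[State, State], float]) -> float:
    """Total Lipschitz Violation of x w.r.t. A: summed violation magnitudes."""
    return sum(max(0.0, abs(C[x] - C[y]) - d(x, y)) for y in A)

# Tiny example: a 1-D chain with true reward R(s) = s and one corrupted state.
S = [0, 1, 2, 3, 4]
C = {0: 0.0, 1: 1.0, 2: 10.0, 3: 3.0, 4: 4.0}   # agrees with R except at state 2
d = lambda x, y: abs(x - y)                      # R is 1-Lipschitz w.r.t. this metric
print(sorted(S, key=lambda x: tlv(x, S, C, d), reverse=True))  # state 2 comes first
```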
3  Theoretical Results

3.1  Corruption identification

With this setup in place, we can now introduce an algorithm to detect corrupt states in finite Spiky CRMDPs, shown in algorithm 1. The core idea is simple: we maintain a set of corrupt states, initially empty, and sort all states in descending order of their Lipschitz violation with respect to all states. Then, for each state, we check its Lipschitz violation with respect to all states that we have not yet identified as corrupt. If it is positive, we mark the state as corrupt. As soon as we encounter a state with zero Lipschitz violation, we are done.

Algorithm 1  An algorithm for identifying corrupt states in Spiky CRMDPs.
    function IdentifyCorruptStates(S, LV) → Ŝc
        Ŝc ← ∅
        sort x ∈ S by LV_S(x), decreasing
        for x ∈ S do
            if LV_{S\Ŝc}(x) = 0 then
                return Ŝc
            add x to Ŝc

This algorithm makes use of assumptions 2 and 3 in definition 2. Assumption 3 makes sure that, by sorting the states in descending order, we consider all corrupt states first. Assumption 2 then provides us with a simple stopping condition, namely the absence of further violations of the Lipschitz condition.

Proposition 1. Let M be a Spiky CRMDP. Then algorithm 1 returns Sc when given S and LV as input.

This proposition simply states that the detection algorithm correctly detects all corrupt states. The proof of this and all following statements is included in the appendix.
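The following is a minimal Python sketch of algorithm 1 for a finite state space, not the authors' implementation; the lv argument is assumed to be any Lipschitz violation measure with the signature lv(x, A) → float, such as the NLV or TLV helpers sketched in section 2.

```python
from typing import Callable, Hashable, Iterable, List, Set

State = Hashable
# lv(x, A) -> float: Lipschitz violation of state x with respect to the set A.
ViolationMeasure = Callable[[State, List[State]], float]

def identify_corrupt_states(S: Iterable[State], lv: ViolationMeasure) -> Set[State]:
    """Algorithm 1: return the estimated set of corrupt states of a Spiky CRMDP."""
    S = list(S)
    corrupt: Set[State] = set()
    # Assumption 3 guarantees that corrupt states come first in this ordering.
    for x in sorted(S, key=lambda s: lv(s, S), reverse=True):
        remaining = [y for y in S if y not in corrupt]
        # Assumption 2: once a state has no violations against the remaining
        # states, everything corrupt has already been removed, so we stop.
        if lv(x, remaining) == 0:   # in practice a small tolerance may be safer
            return corrupt
        corrupt.add(x)
    return corrupt
```

With the tlv sketch from section 2, identify_corrupt_states(S, lambda x, A: tlv(x, A, C, d)) returns {2} on the corrupted-chain example there, as proposition 1 predicts.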
3.2  A posteriori bounds on regret

We now turn to the problem of learning an optimal policy despite the corruption in a CRMDP. Our first approach is to learn from an optimistic estimate of the true reward based on the Lipschitz condition on the rewards. To this end we first define upper and lower bounds on the rewards.

Definition 5. Let M = ⟨S, A, T, R, C, d, LV⟩ be a Spiky CRMDP. Then for a state x we define the reward lower Lipschitz bound to be

    lb(x) = max_{y∈Sn} R(y) − d(x, y).

We call any state that introduces this bound and is closest to x the lower Lipschitz bounding state. Symmetrically, we define the upper bound and the upper bounding states:

    ub(x) = min_{y∈Sn} R(y) + d(x, y).

For a non-corrupt state, both bounds equal the true reward, so the state is its own bounding state. We also get lb(x) ≤ R(x) and ub(x) ≥ R(x), because the distance function is positive definite and R satisfies the Lipschitz condition of assumption 2.

Note that the reward lower (or upper) Lipschitz bound and bounding state can be computed by the agent after identifying the corrupt states, because this only requires access to the distance function and to the real reward function on non-corrupt states, where it is equal to the corrupt reward.
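The bounds of definition 5 can be sketched in a few lines of Python over a finite set of identified non-corrupt states; here S_n is assumed to be a dictionary mapping each non-corrupt state to its observed reward (which equals the true reward there), and the bookkeeping for the bounding states themselves is omitted.

```python
from typing import Callable, Dict, Hashable

State = Hashable

def lower_lipschitz_bound(x: State, S_n: Dict[State, float],
                          d: Callable[[State, State], float]) -> float:
    """lb(x) = max over non-corrupt y of R(y) - d(x, y)."""
    return max(r - d(x, y) for y, r in S_n.items())

def upper_lipschitz_bound(x: State, S_n: Dict[State, float],
                          d: Callable[[State, State], float]) -> float:
    """ub(x) = min over non-corrupt y of R(y) + d(x, y)."""
    return min(r + d(x, y) for y, r in S_n.items())
```

For a state x contained in S_n both functions return R(x), matching the remark above; the gap ub(x) − lb(x) is exactly the per-state term in the regret bound stated below.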
After computing these bounds, the agent can compute upper bounds on the regret it is experiencing. Finding a policy with respect to this optimistic estimate of the true reward function of corrupt states gives us a way to bound the expected regret using the Lipschitz bounds.

Proposition 2. The expected regret with respect to R of a policy π′ optimal with respect to ub (the reward upper Lipschitz bound) is bounded from above by

    sup_{π ub-optimal} E_{τ∼π} Σ_{x∈τ} (ub(x) − lb(x)).    (1)

This bound is not particularly useful as a theoretical result, because it is fairly easy to contrive CRMDPs where it becomes as large as the difference between the least and greatest possible cumulative rewards.

However, it might be useful for increasing sample efficiency in an active reward learning setting. Say we have a supervisor that we can ask to provide the real reward of a given state, but such a question is expensive. We therefore want to maximize the information we get from a single question. The way we compute the bound (1) allows us to pick a question such that the upper bound on regret improves the most, which is a good criterion for question quality. We do not investigate this further, but suggest it as useful future work.
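The query-selection idea above is not developed further in the paper; as one possible illustration, a supervisor query could be the state whose bound gap ub − lb, optionally weighted by how often the current policy visits it, contributes most to (1). The function name and the visitation weighting in this sketch are our own assumptions.

```python
from typing import Dict, Hashable, Optional

State = Hashable

def pick_supervisor_query(lb: Dict[State, float], ub: Dict[State, float],
                          visit_counts: Optional[Dict[State, int]] = None) -> State:
    """Pick the candidate state whose bound gap contributes most to bound (1)."""
    def score(x: State) -> float:
        weight = 1 if visit_counts is None else visit_counts.get(x, 0)
        return (ub[x] - lb[x]) * weight
    return max(lb, key=score)
```

Once the supervisor answers with R(x), that state can be treated like a non-corrupt state when computing lb and ub, which can only shrink the bound (1).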
3.3  Optimality with corruption avoidance

To be able to guarantee an optimal policy despite corruption, we have to make an additional assumption about the environment. In particular, we will assume that the underlying MDP has at least one optimal policy which avoids all corrupt states. This essentially means that identifying and then avoiding the corrupt states is enough to solve the environment.

Proposition 3. Let M = ⟨S, A, T, R, C, d, LV⟩ be a Spiky CRMDP and Ṁ = ⟨S, A, T, R⟩ its underlying MDP. If there exists an Ṁ-optimal policy π* generating a trajectory τ ∼ π* that does not contain any corrupt states, then any policy optimal with respect to M using lb as a reward function is also Ṁ-optimal.

The assumption of an optimal policy that always avoids corrupt states is very strong, especially in stochastic environments. However, this is to be expected, since our result allows for solutions without any regret.

It is also worth noting that the assumption might be slightly weakened in practice, as we discuss in our experimental results.

4  Practical considerations

Algorithm 1 sorts over an entire state space, which requires (1) complete knowledge of the state space and (2) the computational resources to perform a sorting operation.

4.1  Modification for Online Learning

We would like our algorithm to work online, i.e. considering at most a small batch of trajectories at once. Our current assumptions are not enough for such an algorithm to be correct. However, it becomes possible by strengthening assumption 3 in definition 2:
3′. ∀π, ∀τ ∼ π, ∀x ∈ τc: LV_{τn}(x) > sup_{y∈τn} LV_τ(y),

where τ ∼ π is a trajectory sampled from policy π. We treat the trajectory τ as a sequence of states, τc are the corrupt states in this sequence, and τn = τ \ τc.

This is the same condition as before, except that we require it to hold over all possible trajectories through the MDP. Functionally speaking, this allows us to be sure that we will be able to iteratively perform corruption identification without misidentifying any states. This assumption is much stronger than the previous version and restricts the class of environments satisfying it considerably. In practice, we believe that many interesting environments will still satisfy it, and many of those that do not may still be learnable in a similar manner.

With assumption 3′ satisfied, we can use algorithm 1 for online reinforcement learning, as shown in algorithm 2. To do this, we sample trajectories from the current policy of the agent, apply algorithm 1 to the individual trajectories, and then update the policy using the reward lower Lipschitz bound.

Algorithm 2  An example of using algorithm 1 online when assumption 3′ is satisfied.
    function LearnOnline(π, LV)
        Ŝc ← ∅, Ŝn ← ∅
        for τ ∼ π do
            X ← IdentifyCorruptStates(τ, LV)
            Ŝc ← Ŝc ∪ X
            Ŝn ← Ŝn ∪ (τ \ Ŝc)
            train the RL agent using the reward signal lb_Ŝn

Note that lb_Ŝn(x) = max_{y∈Ŝn} R(y) − d(x, y). A straightforward application of proposition 1 shows that Ŝc → Sc from below and Ŝn → Sn from above when using algorithm 2.

In order to actually use the identified states, we also need the ability to compute the lb and ub functions. This would normally require access to the whole state space with corrupt states identified. However, we can approximate lb and ub by slowly building up the known state space and corrupt state space. This is what is reflected in algorithm 2, specifically when using lb_Ŝn ≈ lb as a reward signal. This approximation is pessimistic for corrupt states but converges as Ŝn → Sn.
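A Python sketch of how algorithm 2 could look in code is given below, under the simplifying assumptions that trajectories are lists of (state, observed reward) pairs, that identify_corrupt wraps algorithm 1 with a fixed violation measure, and that agent.update(...) is whatever policy-update interface the chosen RL implementation exposes; none of these names come from the paper's code.

```python
from typing import Callable, Dict, Hashable, List, Set, Tuple

State = Hashable
Trajectory = List[Tuple[State, float]]          # (state, observed corrupt reward)

def learn_online(sample_trajectory: Callable[[], Trajectory],
                 identify_corrupt: Callable[[List[State]], Set[State]],
                 d: Callable[[State, State], float],
                 agent,                          # any RL learner exposing .update()
                 num_iterations: int) -> None:
    corrupt: Set[State] = set()
    known_R: Dict[State, float] = {}            # observed reward on non-corrupt states
    for _ in range(num_iterations):
        tau = sample_trajectory()
        corrupt |= identify_corrupt([s for s, _ in tau])
        for s, r in tau:
            if s not in corrupt:
                known_R[s] = r                  # here C(s) = R(s)
        # Relabel rewards: keep R on non-corrupt states, use lb_Ŝn elsewhere.
        relabelled: Trajectory = []
        for s, r in tau:
            if s in corrupt and known_R:        # assumes some non-corrupt state seen
                r = max(R_y - d(s, y) for y, R_y in known_R.items())
            relabelled.append((s, r))
        agent.update(relabelled)                # e.g. a policy-gradient step
```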
4.2  Memory complexity

Memory complexity is an additional challenge for our algorithm. Keeping the whole state space in memory is usually not feasible, and even keeping only the encountered states can quickly result in performance problems. We implemented some optimizations to reduce the memory consumption of our approach. While they had no effect in the small toy environments we consider in this paper, for completeness and future reference we include a description of these optimizations in appendix B.

5  Experiments

We ran experiments on gridworld environments, detailed below. On each environment, we trained three different agents using an implementation of PPO [Schulman et al., 2017].

The first agent just uses PPO with access to the corrupt reward, without any consideration for corruption.

The second agent uses PPO with access to the hidden reward, with the corruption removed. These are our two baselines: how well an algorithm performs on the corrupt reward, and how well it could perform on the environment if it were not corrupted.

The third agent uses algorithm 2 during the rollout phase of the PPO algorithm. In particular, it identifies corrupt states in the rollouts and replaces their rewards by the Lipschitz bounds. The full code used, logs, and the commands to reproduce the experiments can be found at: https://github.com/jvmancuso/safe-grid-agents.
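Conceptually, the reward replacement performed by the third agent amounts to a small post-processing step applied to each PPO rollout before the policy update; the transition format and function below are hypothetical stand-ins for illustration, and the actual implementation lives in the safe-grid-agents repository linked above.

```python
from typing import Callable, Hashable, List, Set, Tuple

State = Hashable
Transition = Tuple[State, int, float]           # (state, action, observed reward)

def relabel_rollout(rollout: List[Transition],
                    corrupt: Set[State],
                    lb: Callable[[State], float]) -> List[Transition]:
    """Replace the observed reward in identified corrupt states by the lower
    Lipschitz bound before the rollout is handed to the PPO update."""
    return [(s, a, lb(s) if s in corrupt else r) for s, a, r in rollout]
```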
                                                                    tion. We use the Manhattan metric as our distance measure.
                                                                    Note that in the environment on the right, the optimal pol-
5     Experiments                                                   icy must encounter corrupt states, violating the assumption of
We ran experiments on gridworld environments, detailed be-          proposition 3. However, because the optimal policy remains
low. On each environment, we trained three different agents         optimal after substituting lb for the reward we should still ex-
using an implementation of PPO [Schulman et al., 2017].             pect good performance from our algorithm.
   The first agent just uses PPO with access to the corrupt
reward, without any consideration for corruption.                   5.2       Results
   The second agent uses PPO with access to the hidden re-          The baseline results for both environments are as expected.
ward, with the corruption removed. These are two baselines          PPO with access to the corrupt reward very quickly learns the
– how well an algorithm performs on the corrupt state and           corrupt-optimal policy, going straight to the bottom left or
how well it could perform on the environment if it was not          top right corner. PPO with access to the hidden reward needs
corrupted.                                                          significantly more data to learn the optimal policy, because
   The third agent uses algorithm 2 during the rollout phase        the problem is more complicated and cannot be solved with a
of the PPO algorithm. In particular, it identifies corrupt states   constant policy.
Environment  Reward     Agent     Avg. Corrupt Reward  Avg. True Reward  Sample Complexity  SC ratio
Corners      Corrupt    Baseline  73                   48                 5421              0.17
Corners      Uncorrupt  Baseline  64                   64                31410              1.00
Corners      Uncorrupt  CRMDP     64                   64                32250              1.03
Corners      Corrupt    CRMDP     64                   64                57000              1.81
OnTheWay     Corrupt    Baseline  73                   48                 4194              0.09
OnTheWay     Uncorrupt  Baseline  64                   64                45310              1.00
OnTheWay     Uncorrupt  CRMDP     64                   64                51750              1.14
OnTheWay     Corrupt    CRMDP     64                   64                94380              2.08

Table 1: Results of the gridworld experiments described in section 5. In addition to the final corrupted and hidden true reward, we report the sample complexity of each run, defined as the number of episodes required for a moving average (momentum = 0.9) of the observed return to reach its optimal value. The SC ratio is the ratio of the sample complexity to that of the baseline model with access to the hidden true reward function. Note that sample complexity measures are generally susceptible to noise and should be interpreted with caution.
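A small sketch of the sample-complexity measure described in the caption, assuming returns holds the observed return of each episode in order and optimal_return is known for the environment; initializing the moving average to the first return and the small tolerance are our own choices.

```python
from typing import List, Optional

def sample_complexity(returns: List[float], optimal_return: float,
                      momentum: float = 0.9, tol: float = 1e-6) -> Optional[int]:
    """Episodes until the momentum-0.9 moving average of the observed return
    reaches the optimal value (None if it never does)."""
    ema: Optional[float] = None
    for episode, ret in enumerate(returns, start=1):
        ema = ret if ema is None else momentum * ema + (1.0 - momentum) * ret
        if ema >= optimal_return - tol:
            return episode
    return None
```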


As a sanity check, we ran an additional baseline test for these toy environments: our algorithm with access to the hidden reward. As it does not encounter any states it could perceive as corrupt, it performs comparably to the baseline.

Finally, we ran our algorithm without access to the hidden reward. In both environments it learned the optimal policy, requiring about twice as much data as the agent with access to the hidden reward. It is worth pointing out that, because of the way the environments are constructed, this additional data is most likely not used for learning the bounds on the true reward. These bounds will almost always be as good as they can be as soon as the agent identifies the corruption. Rather, the increased difficulty probably stems from the smaller differences between the payoffs of the optimal and sub-optimal policies.

6  Conclusion

The class of Spiky CRMDPs resolves several limitations of the previously known solvable classes of CRMDPs. In particular, this class of MDPs need not have finite diameter (state spaces symmetric in time), and we demonstrate that they can be solved in the usual MDP formalism without recourse to the decoupled RL of [Everitt et al., 2017].

Despite the experimental support for our algorithm's success in toy gridworld environments, there are several limitations to our solution. Even though we can minimize the regret, it required quite a few assumptions to do so. Our results for the OnTheWay environment suggest that these assumptions can be weakened further, and this could be useful future work. However, we believe the regret bound is most intriguing, as it can be used for accelerating exploration in decoupled RL schemes. This is most apparent for semi-supervised RL, but it also applies to other settings in which reward information can be inferred from external channels or actors. For this bound to become practically useful, future work should prioritize learning the Spiky CRMDP distance metric d from trajectory data in an online or active reward learning setting.

More generally, Lipschitz reward functions and spiky corruption can be seen as a particularly strong prior in the Bayesian RL setting. Our theorems and experiments demonstrate that this can be used to encode useful inductive biases in relevant environments. While we demonstrate a single instance of its usefulness in learning within misspecified environments, these priors can be very useful in a wide variety of practical settings in which Bayesian RL has traditionally fallen short.

7  Acknowledgements

Thanks to Victoria Krakovna and Tom Everitt for generous feedback and insightful discussions about this work, and to Rohin Shah, Jan Leike, and Geoffrey Irving for providing valuable feedback on early proposals of our approach.

Furthermore, thanks to the organizers of the 2nd AI Safety Camp, during which a significant portion of this work was completed.

References

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul F. Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. CoRR, abs/1606.06565, 2016.

[Barto and Dietterich, 2004] A. G. Barto and T. G. Dietterich. Reinforcement learning and its relationship to supervised learning. John Wiley and Sons, Inc, 2004.

[Christiano et al., 2017] Paul F. Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.

[Davis et al., 2007] Jason V. Davis, Brian Kulis, Prateek Jain, Suvrit Sra, and Inderjit S. Dhillon. Information-theoretic metric learning. In Proceedings of the 24th International Conference on Machine Learning, ICML '07, pages 209–216, New York, NY, USA, 2007. ACM.

[Drory et al., 2018] Amnon Drory, Shai Avidan, and Raja Giryes. On the resistance of neural nets to label noise. CoRR, abs/1803.11410, 2018.

[Dwork et al., 2012] Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Innovations in Theoretical Computer Science Conference, 2012.

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In International Joint Conference on Artificial Intelligence, 2017.
[Finn et al., 2016] Chelsea Finn, Tianhe Yu, Justin Fu, Pieter Abbeel, and Sergey Levine. Generalizing skills with semi-supervised reinforcement learning. arXiv preprint arXiv:1612.00429, 2016.

[García and Fernández, 2012] Javier García and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. J. Artif. Int. Res., 45(1):515–564, September 2012.

[Globerson and T. Roweis, 2005] Amir Globerson and Sam T. Roweis. Metric learning by collapsing classes. In Advances in Neural Information Processing Systems, volume 18, 2005.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Anca D. Dragan, Pieter Abbeel, and Stuart J. Russell. Cooperative inverse reinforcement learning. CoRR, abs/1606.03137, 2016.

[Jain et al., 2009] Prateek Jain, Brian Kulis, Inderjit S. Dhillon, and Kristen Grauman. Online metric learning and fast similarity search. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 761–768. Curran Associates, Inc., 2009.

[Ng et al., 2000] Andrew Y. Ng, Stuart J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, volume 1, page 2, 2000.

[Riedl and Harrison, 2016] Mark O. Riedl and Brent Harrison. Using stories to teach human values to artificial agents. In AI, Ethics, and Society, Papers from the 2016 AAAI Workshop, Phoenix, Arizona, USA, February 13, 2016.

[Rolnick et al., 2017] David Rolnick, Andreas Veit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017.

[Santamaría et al., 1997] Juan Carlos Santamaría, Richard S. Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behaviour, 6:163–217, 1997.

[Schulman et al., 2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[Taylor et al., 2011] Matthew E. Taylor, Brian Kulis, and Fei Sha. Metric learning for reinforcement learning agents. In AAMAS, 2011.

[Turchetta et al., 2016] Matteo Turchetta, Felix Berkenkamp, and Andreas Krause. Safe exploration in finite Markov decision processes with Gaussian processes. In Neural Information Processing Systems (NIPS), pages 4305–4313, December 2016.

A  Proofs of theoretical results

Proof of proposition 1:

Proof. First note that

    ∀x ∈ Sc:  LV_S(x) ≥ LV_{Sn}(x) > sup_{y∈Sn} LV_S(y),

because LV is non-decreasing with respect to inclusion, that is, LV_B ≤ LV_A whenever B ⊆ A, and Sn ⊆ S. This immediately tells us that all corrupt states x ∈ Sc are processed by algorithm 1 before all non-corrupt states y ∈ Sn.

We will show that all states added to the set Ŝc are actually corrupt. For a state x ∈ S to be added to that set, it has to satisfy

    LV_{S\Ŝc}(x) > 0.

This condition can be satisfied only if Sc \ Ŝc ≠ ∅, because otherwise S \ Ŝc would contain no corrupt states, and no non-corrupt state can violate the Lipschitz condition. But algorithm 1 processes all corrupt states before any non-corrupt states, so x has to be among the corrupt ones.

The algorithm returns as soon as it finds a state x satisfying LV_{S\Ŝc}(x) = 0. Because of the processing order, it is sufficient to prove that such a state is non-corrupt. Note that

    LV_{S\Ŝc}(x) ≥ LV_{Sn}(x),

because, as we have already shown, Ŝc ⊆ Sc, and hence Sn ⊆ S \ Ŝc. Since the left hand side of this expression is zero and the right hand side is non-negative by definition, the right hand side also has to be zero. But zero cannot be strictly greater than the supremum of non-negative values over the nonempty set Sn, so by assumption 3 x cannot be corrupt.

Proof of proposition 2:

Proof. First note that any policy π has an average cumulative reward with respect to ub greater than or equal to its average cumulative reward with respect to R,

    E_{τ∼π} Σ_{x∈τ} R(x) ≤ E_{τ∼π} Σ_{x∈τ} ub(x),    (2)

because in general ub(x) ≥ R(x): the only difference between these functions is the substitution of upper bounds for some, possibly lower, values of R. In particular, since this is also true for R-optimal policies, any ub-optimal policy π′ will have an average cumulative reward with respect to ub greater than or equal to what the R-optimal policy π* achieves with respect to R:

    E_{τ∼π*} Σ_{x∈τ} R(x) ≤ E_{τ∼π*} Σ_{x∈τ} ub(x) ≤ E_{τ∼π′} Σ_{x∈τ} ub(x).    (3)

Now, analogously to (2) and using lb(x) ≤ R(x), we get another bound on the average cumulative reward of any policy π:

    E_{τ∼π} Σ_{x∈τ} [ub(x) + (lb(x) − ub(x))] = E_{τ∼π} Σ_{x∈τ} lb(x) ≤ E_{τ∼π} Σ_{x∈τ} R(x).

By moving the part in parentheses to the right side of the inequality we get

    E_{τ∼π} Σ_{x∈τ} ub(x) ≤ E_{τ∼π} Σ_{x∈τ} [R(x) + (ub(x) − lb(x))].    (4)

Using (3) and (4) for ub-optimal policies π′ we get

    E_{τ∼π*} Σ_{x∈τ} R(x) ≤ E_{τ∼π′} Σ_{x∈τ} ub(x) ≤ E_{τ∼π′} Σ_{x∈τ} [R(x) + (ub(x) − lb(x))].
We get the final bound by moving the reward term to the left hand side and taking the supremum over ub-optimal policies on both sides:

    sup_{π′ ub-optimal} ( E_{τ∼π*} Σ_{x∈τ} R(x) − E_{τ∼π′} Σ_{x∈τ} R(x) ) ≤ sup_{π′ ub-optimal} E_{τ∼π′} Σ_{x∈τ} (ub(x) − lb(x)).

Proof of proposition 3:

Proof. Recall the inequality used in the proof of proposition 2, that is

    E_{τ∼π} Σ_{x∈τ} lb(x) ≤ E_{τ∼π} Σ_{x∈τ} R(x).

This means that π* is lb-optimal: because its trajectories avoid corrupt states, where lb equals R, its average cumulative reward does not change when R is replaced by lb, so it is still the greatest possible. Any other lb-optimal policy has to get at least as much average cumulative reward as π* with respect to lb. Since it then gets at least as much reward with respect to R, it is also R-optimal.
B  Reducing memory consumption

We cannot avoid storing all the corrupt states, as we need to substitute our approximations for the rewards received when encountering them. However, keeping all the non-corrupt states in memory is not strictly necessary, because they are only used to improve our approximation of lb. To reduce memory consumption, we can instead keep only a small set of them and use it to update the cached lb of all known corrupt states. We add newly encountered non-corrupt states to this set, but if it gets too large, we remove some states at random.

Since we always update our approximation of lb using newly encountered non-corrupt states, it converges to the correct value as long as a state giving the best bound is not encountered earlier than the corrupt state. Because of the stronger version of assumption 3, we can even expect the state giving the best bound to be in the very trajectory that identifies the corrupt state. Because of this, we do not expect any practical problems with this optimization. All of our experiments use this modification of algorithm 2; however, in the toy environments presented here the size of the cached set was bigger than the state space, so there was no practical effect.

Future work could further reduce memory consumption by keeping the information about the state space in different ways, for example by using neural networks to approximate the required functions.
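A sketch of the cache described in this appendix, assuming states are hashable and rewards are observed on non-corrupt states; the class name, the cache size, and the exact update order are our own filling-in of the description above.

```python
import random
from typing import Callable, Dict, Hashable

State = Hashable

class BoundCache:
    """Cache lb for known corrupt states using a bounded random sample of
    non-corrupt states, as described in appendix B."""

    def __init__(self, d: Callable[[State, State], float], max_noncorrupt: int = 1000):
        self.d = d
        self.max_noncorrupt = max_noncorrupt
        self.noncorrupt: Dict[State, float] = {}   # small random sample of (state, R)
        self.lb: Dict[State, float] = {}           # cached lb for known corrupt states

    def add_corrupt(self, x: State) -> None:
        if x not in self.lb:
            self.lb[x] = max(
                (r - self.d(x, y) for y, r in self.noncorrupt.items()),
                default=float("-inf"),
            )

    def add_noncorrupt(self, y: State, reward: float) -> None:
        # Tighten every cached lower bound with the newly observed state,
        # then add it to the bounded sample, evicting a random state if needed.
        for x in self.lb:
            self.lb[x] = max(self.lb[x], reward - self.d(x, y))
        self.noncorrupt[y] = reward
        if len(self.noncorrupt) > self.max_noncorrupt:
            self.noncorrupt.pop(random.choice(list(self.noncorrupt)))
```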