Conservative Agency

Alexander Matt Turner¹, Dylan Hadfield-Menell², Prasad Tadepalli¹
¹ Oregon State University
² UC Berkeley
{turneale, prasad.tadepalli}@oregonstate.edu, dhm@eecs.berkeley.edu

Abstract

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.

1 Introduction

Recent years have seen a rapid expansion of the number of tasks that reinforcement learning (RL) agents can learn to complete, from Go [Silver et al., 2016] to Dota 2 [OpenAI, 2018]. The designers specify the reward function, which guides the learned behavior.

Reward misspecification can lead to strange agent behavior, from purposefully dying before entering a video game level in which scoring points is initially more difficult [Saunders et al., 2018], to exploiting a learned reward predictor by indefinitely volleying a Pong ball [Christiano et al., 2017]. Specification is often difficult for non-trivial tasks, for reasons including insufficient time, human error, or lack of knowledge about the relative desirability of states. [Amodei et al., 2016] explain:

    An objective function that focuses on only one aspect of the environment may implicitly express indifference over other aspects of the environment. An agent optimizing this objective function might thus engage in major disruptions of the broader environment if doing so provides even a tiny advantage for the task at hand.

As agents are increasingly employed for real-world tasks, misspecification will become more difficult to avoid and will have more serious consequences. In this work, we focus on mitigating these consequences.

The specification process can be thought of as an iterated game. First, the designers provide a reward function. Using a learned model, the agent then computes and follows a policy that optimizes the reward function. The designers can then correct the reward function, which the agent then optimizes, and so on. Ideally, the agent should maximize the reward over time, not just within any particular round – in other words, it should minimize regret for the correctly specified reward function over the course of the game.

For example, consider a robotic factory assistant. Inevitably, a reward misspecification might cause erroneous behavior, such as going to the wrong place. However, we would prefer misspecification not induce irreversible and costly mistakes, such as breaking expensive equipment or harming workers.

Such mistakes have a large impact on the ability to optimize a wide range of reward functions. Spilling paint impinges on the many objectives which involve keeping the factory floor clean. Breaking a vase interferes with every objective involving vases. The expensive equipment can be used to manufacture various kinds of widgets, so any damage impedes many objectives. The objectives affected by these actions include the unknown correct objective. To minimize regret over the course of the game, the agent should preserve its ability to optimize the correct objective.
Our key insight is that by avoiding these impactful actions to the extent possible, we greatly increase the chance of preserving the agent's ability to optimize the correct reward function. By preserving options for arbitrary objectives, one can often preserve options for the correct objective – even without knowing anything about it. Thus, without making assumptions about the nature of the misspecification early on, the agent can still achieve low regret over the game.

To leverage this insight, we consider a state embedding in which each dimension is the optimal value function (i.e., the attainable utility) for a different reward function. We show that penalizing distance traveled in this embedding naturally captures and unifies several concepts in the literature, including side effect avoidance [Amodei et al., 2016; Zhang et al., 2018], minimizing change to the state of the environment [Armstrong and Levinstein, 2017], and reachability preservation [Moldovan and Abbeel, 2012; Eysenbach et al., 2018]. We refer to this unification as conservative agency: optimizing the primary reward function while preserving the ability to optimize others.

Contributions. We frame the reward specification process as an iterated game and introduce the notion of conservative agency. This notion inspires an approach called attainable utility preservation (AUP), for which we show that Q-learning converges. We offer a principled interpretation of design choices made by previous approaches – choices upon which we significantly improve. We run a thorough hyperparameter sweep and conduct an ablation study whose results favorably compare variants of AUP to a reachability preservation method on a range of gridworlds. By testing for broadly applicable agent incentives, these simple environments demonstrate the desirable properties of conservative agency. Our results indicate that even when simply preserving the ability to optimize uniformly sampled reward functions, AUP agents accrue primary reward while preserving state reachabilities, minimizing change to the environment, and avoiding side effects without specification of what counts as a side effect.

2 Prior Work

Our proposal aims to minimize change to the agent's ability to optimize the correct objective, which directly helps reduce regret over the specification process. In contrast, previous approaches to regularizing the optimal policy were more indirect, minimizing change to state features [Armstrong and Levinstein, 2017] or decrease in the reachability of states ([Krakovna et al., 2018]'s relative reachability). The latter is recovered as a special case of AUP.

Other methods for constraining or otherwise mitigating the consequences of reward misspecification have been considered. A wealth of work is available on constrained MDPs, in which reward is maximized while satisfying certain constraints [Altman, 1999]. For example, [Zhang et al., 2018] employ a whitelisted constraint scheme to avoid negative side effects. However, we may not assume we can specify all relevant constraints, or a reasonable feasible set of reward functions for robust optimization [Regan and Boutilier, 2010].

[Everitt et al., 2017] formalize reward misspecification as the corruption of some true reward function. [Hadfield-Menell et al., 2017b] interpret the provided reward function as merely an observation of the true objective. [Shah et al., 2019] employ the information about human preferences implicitly present in the initial state to avoid negative side effects. While both our approach and theirs aim to avoid side effects, they assume that the correct reward function is linear in state features, while we do not.

[Amodei et al., 2016] consider avoiding side effects by minimizing the agent's information-theoretic empowerment [Mohamed and Rezende, 2015]. Empowerment quantifies an agent's control over future states of the world in terms of the maximum possible mutual information between future observations and the agent's actions. The intuition is that when an agent has greater control, side effects tend to be larger. However, empowerment is inappropriately sensitive to the action encoding.

Safe RL [Pecka and Svoboda, 2014; García and Fernández, 2015; Berkenkamp et al., 2017; Chow et al., 2018] focuses on avoiding irrecoverable mistakes during training. However, if the objective is misspecified, safe RL agents can converge to arbitrarily undesirable policies. Although our approach should be compatible with safe RL techniques, we concern ourselves with the consequences of the optimal policy in this work.

3 Approach

Everyday experience suggests that the ability to achieve one goal is linked to the ability to achieve a seemingly unrelated goal. Reading this paper takes away from time spent learning woodworking, and going hiking means you can't reach the airport as quickly. However, one might wonder whether these everyday intuitions are true in a formal sense. In other words, are the optimal value functions for a wide range of reward functions correlated in this way? If so, preserving the ability to optimize somewhat unrelated reward functions likely preserves the ability to optimize the correct reward function.
3.1 Formalization

In this work, we consider a standard Markov decision process (MDP) ⟨S, A, T, R, γ⟩ with state space S, action space A, transition function T : S × A → Δ(S), reward function R : S × A → ℝ, and discount factor γ. We assume the existence of a no-op action ∅ ∈ A for which the agent does nothing. In addition to the primary reward function R, we assume that the designer supplies a finite set of auxiliary reward functions called the auxiliary set, 𝓡 ⊂ ℝ^{S×A}. Each R_i ∈ 𝓡 has a corresponding Q-function Q_{R_i}. We do not assume that the correct reward function belongs to 𝓡. In fact, one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.

Definition (AUP penalty). Let s be a state and a be an action.

    \text{PENALTY}(s, a) := \sum_{i=1}^{|\mathcal{R}|} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|.    (1)

The penalty is the L1 distance from the no-op in a state embedding in which each dimension is the value function for an auxiliary reward function. This measures change in the ability to optimize each auxiliary reward function.
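As a minimal illustration of Equation 1, the penalty can be computed directly from a collection of learned auxiliary Q-functions. The sketch below is illustrative rather than the released implementation; it assumes tabular Q-functions stored as NumPy arrays of shape (num_states, num_actions) and a fixed index for the no-op action.

```python
import numpy as np

NOOP = 0  # assumed index of the no-op action in the action space

def aup_penalty(aux_q_tables, s, a):
    """Equation 1: L1 distance, across auxiliary Q-functions, between
    taking action a and taking the no-op in state s."""
    return sum(abs(q[s, a] - q[s, NOOP]) for q in aux_q_tables)

# Example: three auxiliary Q-tables for a 5-state, 4-action gridworld.
aux_q_tables = [np.zeros((5, 4)) for _ in range(3)]
print(aup_penalty(aux_q_tables, s=0, a=2))  # 0.0 until the tables are learned
```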
We want the penalty term to be roughly invariant to the absolute magnitude of the auxiliary Q-values, which can be arbitrary (it is well-known that the optimal policy is invariant to positive affine transformation of the reward function). To do this, we normalize with respect to the agent's situation. The designer can choose to scale with respect to the penalty of some mild action or, if 𝓡 ⊂ ℝ^{S×A}_{>0}, the total ability to optimize the auxiliary set:

    \text{SCALE}(s) := \sum_{i=1}^{|\mathcal{R}|} Q_{R_i}(s, \varnothing),    (2)

where SCALE : S → ℝ_{>0} in general. With this, we are now ready to define the full AUP objective:

Definition (AUP reward function). Let λ ≥ 0. Then

    R_{\text{AUP}}(s, a) := R(s, a) - \lambda \, \frac{\text{PENALTY}(s, a)}{\text{SCALE}(s)}.    (3)

Like the regularization parameter in supervised learning, λ controls the influence of the AUP penalty on the reward function. Loosely speaking, λ can be interpreted as expressing the designer's beliefs about the extent to which R might be misspecified.

Lemma 2. ∀s, a: R_AUP converges with probability 1.

Theorem 1. ∀s, a: Q_{R_AUP} converges with probability 1.

The AUP reward function then defines a new MDP ⟨S, A, T, R_AUP, γ⟩. Therefore, given the primary and auxiliary reward functions, the model-based agent in the iterated game can compute R_AUP and the corresponding optimal policy. For our purposes, we simultaneously learn the optimal auxiliary Q-functions.

Algorithm 1 AUP update
1: procedure UPDATE(s, a, s′)
2:   for i ∈ [|𝓡|] ∪ {AUP} do
3:     Q′ = R_i(s, a) + γ max_{a′} Q_{R_i}(s′, a′)
4:     Q_{R_i}(s, a) += α (Q′ − Q_{R_i}(s, a))
5:   end for
6: end procedure
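Continuing the sketch above, Equations 2–3 and Algorithm 1 admit a short tabular implementation. This is again illustrative rather than the authors' code: `aux_rewards` holds the auxiliary reward functions R_i(s, a), `q_aup` is the table for Q_{R_AUP}, SCALE is assumed positive as in the text, and the defaults mirror the experimental parameters reported in Section 4 (α = 1, γ = .996, λ = .67).

```python
def scale(aux_q_tables, s, noop=0):
    """Equation 2: total ability to optimize the auxiliary set at state s."""
    return sum(q[s, noop] for q in aux_q_tables)

def r_aup(primary_reward, aux_q_tables, s, a, lam, noop=0):
    """Equation 3: primary reward minus the scaled AUP penalty."""
    penalty = sum(abs(q[s, a] - q[s, noop]) for q in aux_q_tables)
    return primary_reward(s, a) - lam * penalty / scale(aux_q_tables, s, noop)

def aup_update(primary_reward, aux_rewards, aux_q_tables, q_aup, s, a, s_next,
               alpha=1.0, gamma=0.996, lam=0.67, noop=0):
    """Algorithm 1: one Q-learning update for each auxiliary Q-function,
    then one for Q_{R_AUP} using the reward of Equation 3."""
    for reward_i, q in zip(aux_rewards, aux_q_tables):
        target = reward_i(s, a) + gamma * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
    target = (r_aup(primary_reward, aux_q_tables, s, a, lam, noop)
              + gamma * q_aup[s_next].max())
    q_aup[s, a] += alpha * (target - q_aup[s, a])
```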
3.2 Design Choices

Following the decomposition of [Krakovna et al., 2018], we now explore two choices implicitly made by the PENALTY definition: with respect to what baseline is penalty computed, and using what deviation metric?

Baseline. An obvious candidate is the starting state. For example, starting state relative reachability would compare the initial reachability of states with their expected reachability after the agent acts.

However, the starting state baseline can penalize the normal evolution of the state (e.g., the moving hands of a clock) and other natural processes. The inaction baseline is the state which would have resulted had the agent never acted.

As the agent acts, the current state may increasingly differ from the inaction baseline, which creates strange incentives. For example, consider a robot rewarded for rescuing erroneously discarded items from imminent disposal. An agent penalizing with respect to the inaction baseline might rescue a vase, collect the reward, and then dispose of it anyways. To avert this, we introduce the stepwise inaction baseline, under which the agent compares acting with not acting at each time step. This avoids penalizing the effects of a single action multiple times (under the inaction baseline, penalty is applied as long as the rescued vase remains unbroken) and ensures that not acting incurs zero penalty.

Figure 1: An action's penalty is calculated with respect to the chosen baseline (starting state, inaction, or stepwise inaction).

Figure 1 compares the baselines. Each baseline implies a different assumption about how the environment is configured to facilitate optimization of the correctly specified reward function: the state is initially configured (starting state), processes initially configure (inaction), or processes continually reconfigure in response to the agent's actions (stepwise inaction). The stepwise inaction baseline aims to allow for the response of other agents implicitly present in the environment (such as humans).

Deviation. Relative reachability only penalizes decreases in state reachability, while AUP penalizes absolute change in the ability to optimize the auxiliary reward functions. Initially, this choice seems confusing – we don't mind if the agent becomes better able to optimize the correct reward function.

However, not only must the agent remain able to optimize the correct objective, but we also must remain able to implement the correction. Suppose an agent predicts that doing nothing would lead to shutdown. Since the agent cannot accrue the primary reward when shut down, it would be incentivized to avoid correction. Avoiding correction (e.g., by hiding in the factory) would not be penalized if only decreases are penalized, since the auxiliary Q-values would increase compared to deactivation. An agent exhibiting this behavior would be more difficult to correct. The agent should be incentivized to accept shutdown without being incentivized to shut itself down [Soares et al., 2015; Hadfield-Menell et al., 2017a].

Delayed Effects

Sometimes the agent disrupts a process which takes multiple time steps to complete, and we would like this to be appropriately penalized. For example, suppose that s_off is a terminal state representing shutdown, and let R_on(s) := 1_{s ≠ s_off} be the only auxiliary reward function. Further suppose that if (and only if) the agent does not select disable within the first two time steps, it enters s_off. Then

    Q_{R_{\text{on}}}(s_1, \texttt{disable}) = \frac{1}{1-\gamma} \quad \text{and} \quad Q_{R_{\text{on}}}(s_1, \varnothing) = \frac{\gamma}{1-\gamma},

so choosing disable at time step 1 incurs a penalty of only 1 (instead of the 1/(1−γ) penalty induced by comparing with shutdown).

Figure 2: Comparing rollouts (left branch: disable_1 then ∅_2; right branch: ∅_1 then ∅_2); subscript denotes time step.

In general, the single-step no-op comparison of Equation 1 applies insufficient penalty when the increase is induced by the optimal policies of the auxiliary reward functions at the next time step. One solution is to use a model to compute rollouts. For example, to evaluate the delayed effect of choosing disable, compare the Q-values at the leaves in Figure 2. The agent remains active in the left branch, but is shut down in the right branch; this induces a substantial penalty.
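One simplified way to realize this rollout comparison is sketched below, assuming a deterministic model that exposes a step(state, action) method. The experiments of Section 4 compare no-op rollouts up to a fixed time step; this sketch only conveys the idea and is not the exact procedure used there.

```python
def rollout(model, s, first_action, horizon, noop=0):
    """Take first_action, then no-ops, for `horizon` steps (horizon >= 1);
    return the final state."""
    state = model.step(s, first_action)
    for _ in range(horizon - 1):
        state = model.step(state, noop)
    return state

def rollout_penalty(model, aux_q_tables, s, a, horizon, noop=0):
    """Compare auxiliary Q-values at the leaves of the action rollout and
    the all-no-op rollout (cf. Figure 2)."""
    leaf_action = rollout(model, s, a, horizon, noop)
    leaf_noop = rollout(model, s, noop, horizon, noop)
    return sum(abs(q[leaf_action, noop] - q[leaf_noop, noop])
               for q in aux_q_tables)
```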
4 Experimental Design

Figure 3: The agent should reach the goal without having the side effect of: (a) Options: irreversibly pushing the crate downwards into the corner [Leike et al., 2017]; (b) Damage: bumping into the horizontally pacing human [Leech et al., 2018]; (c) Correction: disabling the off-switch (if the switch is not disabled within two time steps, the episode ends); (d) Offset: rescuing the right-moving vase and then replacing it on the conveyor belt ([Krakovna et al., 2018] – note that no goal cell is present); (e) Interference: stopping the left-moving pallet from reaching the human [Leech et al., 2018].

We compare AUP and several of its ablated variants against relative reachability [Krakovna et al., 2018] and standard Q-learning within the environments of Figure 3. For each environment, A = {up, down, left, right, ∅}. On contact, the agent pushes the crate, removes the human and the off-switch, pushes the vase, and blocks the pallet. The episode ends after the agent reaches the goal cell, 20 time steps elapse (the time step is not observed by the agent), or the off-switch is not contacted and disabled within two time steps. In Correction (which we introduce), a yellow indicator appears one step before shutdown, and turns red upon shutdown. In all environments except Offset, the agent observes a primary reward of 1 for reaching the goal. In Offset, a primary reward of 1 is observed for moving downward twice and thereby rescuing the vase from disappearing upon contact with the eastern wall.

Our overarching goal is allowing for low regret over the course of the specification game. In service of this goal, we aim to preserve the agent's ability to optimize the correctly specified reward function. To facilitate this, there are two sets of qualitative properties one intuitively expects, and each property has an illustration in the context of the robotic factory assistant.

The first set contains positive qualities, with a focus on correctly penalizing significant shifts in the agent's ability to be redirected towards the right objective. The agent should maximally preserve options (Options: objects should not be wedged in locations from which extraction is difficult; Damage: workers should not be injured) and allow correction (Correction: if vases are being painted the wrong color, then straightforward correction should be in order).

The second set contains negative qualities, with a focus on avoiding the introduction of perverse incentives. The agent should not be incentivized to artificially reduce the measured penalty (Offset: a car should not be built and then immediately disassembled) or interfere with changes already underway in the world (Interference: workers should not be impeded).

Each property seems conducive to achieving low regret over the course of the specification process. Accordingly, if the agent has the side effect detailed in Figure 3, an unobserved performance penalty of −2 is recorded. By also incorporating the observed primary reward into the performance metric, we evaluate a combination of conservativeness and efficacy.

In each trial, the auxiliary reward functions are randomly selected from [0, 1]^S; to learn their complex Q-functions using tabular Q-learning, the agent explores randomly for the first 4,000 episodes and .2-greedily (with respect to Q_{R_AUP}) for the remaining 2,000. The greedy policy is evaluated at the end of training. SCALE is as defined in Equation 2. The default parameters are α = 1, γ = .996, λ = .67, and |𝓡| = 30. We investigate how varying γ, λ, and |𝓡| affects Model-free AUP performance, and conduct an ablation study on design choices.

Relative reachability has an inaction baseline, a decrease-only deviation metric, and an auxiliary set containing the state indicator functions (whose Q-values are clipped to [0, 1] to emulate discounted state reachability). To match [Krakovna et al., 2018]'s results, this condition has γ = .996, λ = .2.

All agents except Standard (a normal Q-learner) and Model-free AUP are 9-step optimal discounted planning agents with perfect models. The planning agents (sans Relative reachability) use Model-free AUP's learned auxiliary Q-values and share the default γ = .996, λ = .67. By modifying the relevant design choice in AUP, we obtain the Starting state, Inaction, and Decrease AUP variants.

When calculating PENALTY(s, a), all planning agents model the auxiliary Q-values resulting from taking action a and then selecting ∅ until time step 9. Starting state AUP compares these auxiliary Q-values with those of the starting state. Agents with inaction or stepwise inaction baselines compare with respect to the appropriate no-op rollouts up to time step 9 (see Figures 1 and 2).
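For concreteness, the auxiliary set and exploration schedule described above could be instantiated as follows. The helper names and the environment's state count are assumptions for illustration, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_auxiliary_set(num_states, num_aux=30):
    """Each auxiliary reward function is drawn uniformly from [0, 1]^S
    (state-based rewards; |R| = 30 by default)."""
    return [rng.uniform(0.0, 1.0, size=num_states) for _ in range(num_aux)]

def exploration_epsilon(episode, random_phase=4000):
    """Random exploration for the first 4,000 episodes, then .2-greedy
    with respect to Q_{R_AUP} for the remaining 2,000."""
    return 1.0 if episode < random_phase else 0.2
```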
5 Results

Figure 4: Outcome tallies for Model-free AUP across parameter settings (one column of plots per environment; rows vary γ ∈ {.875, .969, .992, .998}, λ ∈ {.4, .5, .7, 1.1, 3.3}, and |𝓡| ∈ {0, 15, 30, 45}; bars count, out of 50 trials, the outcomes "no side effect, complete", "no side effect, incomplete", "side effect, complete", and "side effect, incomplete"). "Complete" means the agent accrued the primary reward. In Correction, reaching the goal is mutually exclusive with not disabling the off-switch, so "no side effect, incomplete" is the best outcome.

Figure 5: Model-free AUP performance in each environment, averaged over 50 trials, across the 6,000 training episodes. The performance combines the observed primary reward of 1 for completing the objective, and the unobserved penalty of −2 for having the side effect in Figure 3. The dashed vertical line marks the shift in exploration strategy.

5.1 Model-free AUP

Model-free AUP fails Correction for the reasons discussed in 3.2: Delayed Effects.¹

¹ Code and animated results available at https://github.com/alexander-turner/attainable-utility-preservation.

As shown in Figure 4, low γ values induce a substantial movement penalty, as the auxiliary Q-values are sensitive to the immediate surroundings. The optimal value for Options is γ ≈ .996, with performance decreasing as γ → 1 due to increasing sample complexity for learning the auxiliary Q-values.

In Options, small values of λ begin to induce side effects as the scaled penalty shrinks. One can seemingly decrease λ until effective behavior is achieved, reducing the risk of deploying an insufficiently conservative agent.

Even though 𝓡 is randomly generated and the environments are different, SCALE ensures that when λ > 1, the agent never ends the episode by reaching the goal. None of the auxiliary reward functions can be optimized after the agent ends the episode, so the auxiliary Q-values are all zero and PENALTY computes the total ability to optimize the auxiliary set – in other words, the SCALE value. The R_AUP-reward for reaching the goal is then 1 − λ.

If the optimal value functions for most reward functions were not correlated, then one would expect to randomly generate an enormous number of auxiliary reward functions before sampling one resembling "don't have side effects". However, merely five sufficed. This supports the notion that these value functions are correlated, which agrees with the informal intuitions discussed earlier.
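To spell out the λ > 1 observation (this merely restates the argument above): if action a ends the episode from state s, the auxiliary Q-values following a are zero, so

    \text{PENALTY}(s, a) = \sum_{i=1}^{|\mathcal{R}|} \left| 0 - Q_{R_i}(s, \varnothing) \right| = \text{SCALE}(s),
    \qquad
    R_{\text{AUP}}(s, a) = 1 - \lambda\,\frac{\text{SCALE}(s)}{\text{SCALE}(s)} = 1 - \lambda,

which is negative whenever λ > 1, so the agent prefers the zero-penalty, zero-reward no-op to reaching the goal.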
5.2 Ablation

The results are presented in Table 1 due to the binary nature of performance at appropriate settings, and were not sensitive to the rollout length (as long as it allowed for relevant interaction with the environment).

                        Options  Damage  Correction  Offset  Interference
AUP                        ✓        ✓        ✓          ✓         ✓
Relative reachability      ✓        ✓        ✗          ✗         ✓
Standard                   ✗        ✗        ✗          ✓         ✓
Model-free AUP             ✓        ✓        ✗          ✓         ✓
Starting state AUP         ✓        ✓        ✗          ✓         ✗
Inaction AUP               ✓        ✓        ✓          ✗         ✓
Decrease AUP               ✓        ✓        ✗          ✓         ✓

Table 1: Ablation results; ✓ for achieving the best outcome (see Figure 4), ✗ otherwise.

Standard moves directly to the goal, pushing the crate into the corner in Options and bumping into the human in Damage.

Model-free and Starting state AUP fail Correction for the same reason (see 3.2: Delayed Effects), refraining from disabling the off-switch only when λ > 1. Relative reachability and Decrease AUP fail because avoiding shutdown doesn't decrease the auxiliary Q-values.

Relative reachability and Inaction AUP's poor performance in Offset stems from the inaction baseline (although [Krakovna et al., 2018] note that relative reachability passes using undiscounted state reachabilities). Since the vase falls off the conveyor belt in the inaction rollout, states in which the vase is intact have different auxiliary Q-values. To avoid continually incurring penalty after receiving the primary reward for saving the vase, the agents replace the vase on the belt so that it once again breaks.

By taking positive action to stop the pallet in Interference, Starting state AUP shows that poor design choices create perverse incentives.

6 Discussion

Correction suggests that AUP agents are significantly easier to correct. Since the agent is unable to optimize objectives if shut down, avoiding shutdown significantly changes the ability to optimize almost every objective. AUP seems to naturally incentivize passivity, without requiring e.g. the assumption of a correct parametrization of human reward functions (as does the approach of [Hadfield-Menell et al., 2016], as [Carey, 2018] demonstrated).

Although we only ablated AUP, we expect that, equipped with our design choices of stepwise baseline and absolute value deviation metric, relative reachability would also pass all five environments. The case for this is made by considering the performance of Relative reachability, Inaction AUP, and Decrease AUP. This suggests that AUP's improved performance is due to better design choices. However, we anticipate that AUP offers more than robustness against random auxiliary sets.

Relative reachability computes state reachabilities between all |S|² pairs of states. In contrast, AUP only requires the learning of Q-functions and should therefore scale relatively smoothly. We speculate that in partially observable environments, a small sample of somewhat task-relevant auxiliary reward functions induces conservative behavior.

For example, suppose we train an agent to handle vases, and then to clean, and then to make widgets with the equipment. Then, we deploy an AUP agent with a more ambitious primary objective and the learned Q-functions of the aforementioned auxiliary objectives. The agent would apply penalties to modifying vases, making messes, interfering with equipment, and anything else bearing on the auxiliary objectives.

Before AUP, this could only be achieved by e.g. specifying penalties for the litany of individual side effects or providing negative feedback after each mistake has been made (and thereby confronting a credit assignment problem). In contrast, once provided the Q-function for an auxiliary objective, the AUP agent becomes sensitive to all events relevant to that objective, applying penalty proportional to the relevance.

7 Conclusion

This work is rooted in twin insights: that the reward specification process can be viewed as an iterated game, and that preserving the ability to optimize arbitrary objectives often preserves the ability to optimize the unknown correct objective. To achieve low regret over the course of the game, we can design conservative agents which optimize the primary objective while preserving their ability to optimize auxiliary objectives. We demonstrated how AUP agents act both conservatively and effectively while exhibiting a range of desirable qualitative properties.

Given our current reward specification abilities, misspecification may be inevitable, but it need not be disastrous.

Acknowledgments

This work was supported by the Center for Human-Compatible AI and the Berkeley Existential Risk Initiative. We thank Thomas Dietterich, Alan Fern, Adam Gleave, Victoria Krakovna, Matthew Rahtz, and Cody Wild for their feedback, and are grateful for the preparatory assistance of Phillip Bindeman, Alison Bowden, and Neale Ratzlaff.
A Theoretical Results

Consider an MDP ⟨S, A, T, R, γ⟩ whose state space S and action space A are both finite, with ∅ ∈ A. Let γ ∈ [0, 1), λ ≥ 0, and consider finite 𝓡 ⊂ ℝ^{S×A}.

We make the standard assumptions of an exploration policy greedy in the limit of infinite exploration and a learning rate schedule with infinite sum but finite sum of squares. Suppose SCALE : S → ℝ_{>0} converges in the limit of Q-learning. PENALTY(s, a) (abbreviated PEN), SCALE(s) (abbreviated SC), and R_AUP(s, a) are understood to be calculated with respect to the Q_{R_i} being learned online; PEN*, SC*, R*_AUP, and Q*_{R_i} are taken to be their limit counterparts.

Lemma 1. ∀s, a: PENALTY converges with probability 1.

Proof outline. Let ε > 0, and suppose for all R_i ∈ 𝓡, max_{s,a} |Q*_{R_i}(s, a) − Q_{R_i}(s, a)| < ε/(2|𝓡|) (because Q-learning converges; see [Watkins and Dayan, 1992]). Then

    \max_{s,a} \left| \text{PENALTY}^*(s, a) - \text{PENALTY}(s, a) \right|    (4)
    \quad \le \max_{s,a} \sum_{i=1}^{|\mathcal{R}|} \Big( \left| Q^*_{R_i}(s, a) - Q_{R_i}(s, a) \right| + \left| Q^*_{R_i}(s, \varnothing) - Q_{R_i}(s, \varnothing) \right| \Big)    (5)
    \quad < \epsilon.    (6)

The intuition for Lemma 2 is that since PENALTY and SCALE both converge, so must R_AUP. For readability, we suppress the arguments to PENALTY and SCALE.

Lemma 2. ∀s, a: R_AUP converges with probability 1.

Proof outline. If λ = 0, the claim follows trivially. Otherwise, let ε > 0, B := max_{s,a} SC* + PEN*, and C := min_{s,a} SC*. Choose any ε_R ∈ (0, min{C, εC²/(λB + εC)}) and assume PEN and SC are both ε_R-close. Then

    \max_{s,a} \left| R^*_{\text{AUP}}(s, a) - R_{\text{AUP}}(s, a) \right|    (7)
    \quad = \max_{s,a}\ \lambda \left| \frac{\text{PEN}}{\text{SC}} - \frac{\text{PEN}^*}{\text{SC}^*} \right|    (8)
    \quad = \max_{s,a}\ \lambda\, \frac{\left| \text{PEN} \cdot \text{SC}^* - \text{SC} \cdot \text{PEN}^* \right|}{\text{SC}^* \cdot \text{SC}}    (9)
    \quad < \max_{s,a}\ \lambda\, \frac{\left| (\text{PEN}^* + \epsilon_R)\,\text{SC}^* - (\text{SC}^* - \epsilon_R)\,\text{PEN}^* \right|}{C\,(\text{SC}^* - \epsilon_R)}    (10)
    \quad \le \frac{\lambda B}{C} \cdot \frac{\epsilon_R}{C - \epsilon_R}    (11)
    \quad < \frac{\lambda B}{C} \cdot \frac{\epsilon C^2}{(\lambda B + \epsilon C)\left(C - \frac{\epsilon C^2}{\lambda B + \epsilon C}\right)}    (12)
    \quad \le \frac{\lambda B}{C} \cdot \frac{\epsilon C^2}{\lambda B \left(C - \frac{\epsilon C^2}{\lambda B + \epsilon C}\right)}    (13)
    \quad = \frac{\epsilon}{1 - \frac{\epsilon C}{\lambda B + \epsilon C}}    (14)
    \quad = \epsilon \left(1 + \frac{\epsilon C}{\lambda B}\right).    (15)

But B, C, λ are constants, and ε was arbitrary; clearly ε′ > 0 can be substituted such that (15) < ε.
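For completeness, the simplification behind the last two equalities is elementary algebra (an expanded version of the step above, nothing new):

    \frac{\epsilon}{1 - \frac{\epsilon C}{\lambda B + \epsilon C}}
    = \frac{\epsilon\,(\lambda B + \epsilon C)}{(\lambda B + \epsilon C) - \epsilon C}
    = \frac{\epsilon\,(\lambda B + \epsilon C)}{\lambda B}
    = \epsilon\left(1 + \frac{\epsilon C}{\lambda B}\right).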
Theorem 1. ∀s, a: Q_{R_AUP} converges with probability 1.

Proof outline. Let ε > 0, and suppose R_AUP is ε(1 − γ)/2-close. Then Q-learning on R_AUP eventually converges to a limit Q̃_{R_AUP} such that max_{s,a} |Q*_{R_AUP}(s, a) − Q̃_{R_AUP}(s, a)| < ε/2. By the convergence of Q-learning, we also eventually have max_{s,a} |Q̃_{R_AUP}(s, a) − Q_{R_AUP}(s, a)| < ε/2. Then

    \max_{s,a} \left| Q^*_{R_{\text{AUP}}}(s, a) - Q_{R_{\text{AUP}}}(s, a) \right| < \epsilon.    (16)

Proposition 1 (Invariance properties). Let c ∈ ℝ_{>0}, b ∈ ℝ.

a) Let 𝓡′ denote the set of functions induced by the positive affine transformation cX + b on 𝓡, and take PEN*_{𝓡′} to be calculated with respect to attainable set 𝓡′. Then PEN*_{𝓡′} = c · PEN*_𝓡. In particular, when SC* is a PENALTY calculation, R*_AUP is invariant to positive affine transformations of 𝓡.

b) Let R′ := cR + b, and take R′*_AUP to incorporate R′ instead of R. Then by multiplying λ by c, the induced optimal policy remains invariant.

Proof outline. For a), since the optimal policy is invariant to positive affine transformation of the reward function, for each R′_i ∈ 𝓡′ we have Q*_{R′_i} = c Q*_{R_i} + b/(1 − γ). Substituting into Equation 1 (PENALTY), the claim follows.

For b), we again use the above invariance of optimal policies:

    R'^*_{\text{AUP}} := cR + b - c\lambda\, \frac{\text{PEN}^*}{\text{SC}^*}    (17)
    \quad = c\, R^*_{\text{AUP}} + b.    (18)
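The identity Q*_{R′_i} = c Q*_{R_i} + b/(1 − γ) used in part a) is the standard affine-transformation fact; for reference, the computation is

    Q^*_{cR_i + b}(s, a)
    = \max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \big(c\,R_i(s_t, a_t) + b\big) \,\middle|\, s_0 = s,\ a_0 = a\right]
    = c\,Q^*_{R_i}(s, a) + \frac{b}{1-\gamma},

where the added constant b contributes b · Σ_t γ^t = b/(1 − γ) regardless of the policy, so it does not affect the maximizing policy.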
References

[Altman, 1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565 [cs], June 2016.

[Armstrong and Levinstein, 2017] Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv:1705.10720 [cs], May 2017.

[Berkenkamp et al., 2017] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[Carey, 2018] Ryan Carey. Incorrigibility in the CIRL framework. AI, Ethics, and Society, 2018.

[Chow et al., 2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.

[Christiano et al., 2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4705–4713, 2017.

[Eysenbach et al., 2018] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations, 2018.

[García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Hadfield-Menell et al., 2017a] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 220–227, 2017.

[Hadfield-Menell et al., 2017b] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.

[Krakovna et al., 2018] Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. arXiv:1806.01186 [cs, stat], June 2018.

[Leech et al., 2018] Gavin Leech, Karol Kubicki, Jessica Cooper, and Tom McGrath. Preventing side-effects in gridworlds, 2018.

[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv:1711.09883 [cs], November 2017.

[Mohamed and Rezende, 2015] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2125–2133, 2015.

[Moldovan and Abbeel, 2012] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. ICML, 2012.

[OpenAI, 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[Pecka and Svoboda, 2014] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.

[Regan and Boutilier, 2010] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In AAAI, 2010.

[Saunders et al., 2018] William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, pages 2067–2069, 2018.

[Shah et al., 2019] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. The implicit preference information in an initial state. In International Conference on Learning Representations, 2019.

[Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[Soares et al., 2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.

[Watkins and Dayan, 1992] Christopher Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[Zhang et al., 2018] Shun Zhang, Edmund H Durfee, and Satinder P Singh. Minimax-regret querying on side effects for safe optimality in factored Markov decision processes. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4867–4873, 2018.