<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexander Matt Turner</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dylan Hadfield-Menell</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasad Tadepalli</string-name>
          <email>prasad.tadepallig@oregonstate.edu</email>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Recent years have seen a rapid expansion of the number of
tasks that reinforcement learning (RL) agents can learn to
complete, from Go
        <xref ref-type="bibr" rid="ref11 ref2 ref24">([Silver et al., 2016])</xref>
        to Dota 2
        <xref ref-type="bibr" rid="ref19">([OpenAI,
2018])</xref>
        . The designers specify the reward function, which
guides the learned behavior.
      </p>
      <p>
        Reward misspecification can lead to strange agent
behavior, from purposefully dying before entering a video game
level in which scoring points is initially more difficult
        <xref ref-type="bibr" rid="ref14 ref15 ref22 ref27 ref6 ref9">([Saunders et al., 2018])</xref>
        , to exploiting a learned reward
predictor by indefinitely volleying a Pong ball
        <xref ref-type="bibr" rid="ref12 ref13 ref16 ref4 ref7 ref8">([Christiano et al.,
2017])</xref>
        . Specification is often difficult for non-trivial tasks,
for reasons including insufficient time, human error, or lack of
knowledge about the relative desirability of states. [Amodei
et al., 2016] explain:
      </p>
      <p>An objective function that focuses on only one
aspect of the environment may implicitly express
indifference over other aspects of the environment.
An agent optimizing this objective function might
thus engage in major disruptions of the broader
environment if doing so provides even a tiny
advantage for the task at hand.</p>
      <p>As agents are increasingly employed for real-world tasks,
misspecification will become more difficult to avoid and will
have more serious consequences. In this work, we focus on
mitigating these consequences.</p>
      <p>The specification process can be thought of as an iterated
game. First, the designers provide a reward function. Using a
learned model, the agent then computes and follows a policy
that optimizes the reward function. The designers can then
correct the reward function, which the agent then optimizes,
and so on. Ideally, the agent should maximize the reward over
time, not just within any particular round – in other words,
it should minimize regret for the correctly specified reward
function over the course of the game.</p>
      <p>For example, consider a robotic factory assistant.
Inevitably, a reward misspecification might cause erroneous
behavior, such as going to the wrong place. However, we would
prefer misspecification not induce irreversible and costly
mistakes, such as breaking expensive equipment or harming
workers.</p>
      <p>Such mistakes have a large impact on the ability to
optimize a wide range of reward functions. Spilling paint
impinges on the many objectives which involve keeping the
factory floor clean. Breaking a vase interferes with every
objective involving vases. The expensive equipment can be used
to manufacture various kinds of widgets, so any damage
impedes many objectives. The objectives affected by these
actions include the unknown correct objective. To minimize
regret over the course of the game, the agent should preserve
its ability to optimize the correct objective.</p>
      <p>Our key insight is that by avoiding these impactful
actions to the extent possible, we greatly increase the chance of
preserving the agent’s ability to optimize the correct reward
function. By preserving options for arbitrary objectives, one
can often preserve options for the correct objective – even
without knowing anything about it. Thus, without making
assumptions about the nature of the misspecification early on,
the agent can still achieve low regret over the game.</p>
      <p>
        To leverage this insight, we consider a state embedding
in which each dimension is the optimal value function (i.e.,
the attainable utility) for a different reward function. We
show that penalizing distance traveled in this embedding
naturally captures and unifies several concepts in the
literature, including side effect avoidance
        <xref ref-type="bibr" rid="ref11 ref14 ref15 ref2 ref22 ref24 ref27 ref6 ref9">([Amodei et al., 2016;
Zhang et al., 2018])</xref>
        , minimizing change to the state of the
environment
        <xref ref-type="bibr" rid="ref3">([Armstrong and Levinstein, 2017])</xref>
        , and
reachability preservation
        <xref ref-type="bibr" rid="ref14 ref15 ref18 ref22 ref27 ref6 ref9">([Moldovan and Abbeel, 2012; Eysenbach
et al., 2018])</xref>
        . We refer to this unification as conservative
agency: optimizing the primary reward function while
preserving the ability to optimize others.
      </p>
      <p>Contributions. We frame the reward specification process
as an iterated game and introduce the notion of
conservative agency. This notion inspires an approach called
attainable utility preservation (AUP), for which we show that
Q-learning converges. We offer a principled interpretation of
design choices made by previous approaches – choices upon
which we significantly improve.</p>
      <p>We run a thorough hyperparameter sweep and conduct an
ablation study whose results favorably compare variants of
AUP to a reachability preservation method on a range of
gridworlds. By testing for broadly applicable agent
incentives, these simple environments demonstrate the desirable
properties of conservative agency. Our results indicate that
even when simply preserving the ability to optimize
uniformly sampled reward functions, AUP agents accrue
primary reward while preserving state reachabilities,
minimizing change to the environment, and avoiding side effects
without specification of what counts as a side effect.
</p>
      <sec id="sec-1-1">
        <title>Prior Work</title>
        <p>
          Our proposal aims to minimize change to the agent’s ability
to optimize the correct objective, which directly helps reduce
regret over the specification process. In contrast, previous
approaches to regularizing the optimal policy were more
indirect, minimizing change to state features
          <xref ref-type="bibr" rid="ref3">([Armstrong and
Levinstein, 2017])</xref>
          or decrease in the reachability of states
          <xref ref-type="bibr" rid="ref14 ref15 ref22 ref27 ref6 ref9">([Krakovna et al., 2018]’s relative reachability)</xref>
          . The latter is
recovered as a special case of AUP.
        </p>
        <p>
          Other methods for constraining or otherwise mitigating the
consequences of reward misspecification have been
considered. A wealth of work is available on constrained MDPs,
in which reward is maximized while satisfying certain
constraints
          <xref ref-type="bibr" rid="ref1">([Altman, 1999])</xref>
          . For example, [Zhang et al., 2018]
employ a whitelisted constraint scheme to avoid negative side
effects. However, we may not assume we can specify all
relevant constraints, or a reasonable feasible set of reward
functions for robust optimization
          <xref ref-type="bibr" rid="ref21">([Regan and Boutilier, 2010])</xref>
          .
        </p>
        <p>[Everitt et al., 2017] formalize reward misspecification
as the corruption of some true reward function.
[Hadfield-Menell et al., 2017b] interpret the provided reward function
as merely an observation of the true objective. [Shah et al.,
2019] employ the information about human preferences
implicitly present in the initial state to avoid negative side
effects. While both our approach and theirs aim to avoid side
effects, they assume that the correct reward function is linear
in state features, while we do not.</p>
        <p>
          [Amodei et al., 2016] consider avoiding side effects by
minimizing the agent’s information-theoretic empowerment
          <xref ref-type="bibr" rid="ref10 ref17">([Mohamed and Rezende, 2015])</xref>
          . Empowerment quantifies
an agent’s control over future states of the world in terms
of the maximum possible mutual information between
future observations and the agent’s actions. The intuition is
that when an agent has greater control, side effects tend to be
larger. However, empowerment is inappropriately sensitive to
the action encoding.
        </p>
        <p>
          Safe RL
          <xref ref-type="bibr" rid="ref10 ref12 ref13 ref14 ref15 ref16 ref17 ref20 ref22 ref27 ref4 ref6 ref7 ref8 ref9">([Pecka and Svoboda, 2014; Garc´ıa and
Ferna´ndez, 2015; Berkenkamp et al., 2017; Chow et al.,
2018])</xref>
          focuses on avoiding irrecoverable mistakes during
training. However, if the objective is misspecified, safe RL
agents can converge to arbitrarily undesirable policies.
Although our approach should be compatible with safe RL
techniques, we concern ourselves with the consequences of the
optimal policy in this work.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>Approach</title>
        <p>Everyday experience suggests that the ability to achieve one
goal is linked to the ability to achieve a seemingly unrelated
goal. Reading this paper takes away from time spent learning
woodworking, and going hiking means you can’t reach the
airport as quickly. However, one might wonder whether these
everyday intuitions are true in a formal sense. In other words,
are the optimal value functions for a wide range of reward
functions correlated? If so, preserving the ability to
optimize somewhat unrelated reward functions likely preserves
the ability to optimize the correct reward function.
</p>
        <sec id="sec-1-2-1">
          <title>Formalization</title>
          <p>In this work, we consider a standard Markov decision
process (MDP) ⟨S, A, T, R, γ⟩ with state space S, action space
A, transition function T : S × A → Δ(S), reward function
R : S × A → ℝ, and discount factor γ. We assume the
existence of a no-op action ∅ ∈ A for which the agent does
nothing. In addition to the primary reward function R, we
assume that the designer supplies a finite set of auxiliary reward
functions called the auxiliary set, ℛ ⊂ ℝ^(S×A). Each R_i ∈ ℛ
has a corresponding Q-function Q_{R_i}. We do not assume that
the correct reward function belongs to ℛ. In fact, one of our
key findings is that AUP tends to preserve the ability to
optimize the correct reward function even when the correct
reward function is not included in the auxiliary set.</p>
          <p>Definition (AUP penalty). Let s be a state and a be an action. Then
PENALTY(s, a) := Σ_{i=1}^{|ℛ|} |Q_{R_i}(s, a) − Q_{R_i}(s, ∅)|.   (1)</p>
          <p>The penalty is the L1 distance from the no-op in a state
embedding in which each dimension is the value function for
an auxiliary reward function. This measures change in the
ability to optimize each auxiliary reward function.</p>
          <p>We want the penalty term to be roughly invariant to the
absolute magnitude of the auxiliary Q-values, which can be
arbitrary (it is well-known that the optimal policy is invariant
to positive affine transformation of the reward function). To
do this, we normalize with respect to the agent’s situation.
The designer can choose to scale with respect to the penalty
of some mild action or, if ℛ ⊂ ℝ_{&gt;0}^(S×A), the total ability to
optimize the auxiliary set:</p>
          <p>SCALE(s) := Σ_{i=1}^{|ℛ|} Q_{R_i}(s, ∅),   (2)
where SCALE : S → ℝ_{&gt;0} in general. With this, we are now
ready to define the full AUP objective:</p>
          <p>Definition (AUP reward function). Let λ ≥ 0. Then
R_AUP(s, a) := R(s, a) − λ · PENALTY(s, a) / SCALE(s).   (3)</p>
          <p>Similar to the regularization parameter in supervised
learning, λ controls the influence
of the AUP penalty on the reward function. Loosely
speaking, λ can be interpreted as expressing the designer’s beliefs
about the extent to which R might be misspecified.</p>
          <p>Lemma 2. ∀ s, a: R_AUP converges with probability 1.</p>
          <p>Theorem 1. ∀ s, a: Q_{R_AUP} converges with probability 1.</p>
          <p>The AUP reward function then defines a new MDP
⟨S, A, T, R_AUP, γ⟩. Therefore, given the primary and
auxiliary reward functions, the model-based agent in the iterated
game can compute RAUP and the corresponding optimal
policy.</p>
          <p>For our purposes, we simultaneously learn the optimal
auxiliary Q-functions.</p>
          <p>Algorithm 1 AUP update
1: procedure UPDATE(s, a, s′)
2:     for i ∈ [|ℛ|] ∪ {AUP} do
3:         Q′ ← R_i(s, a) + γ max_{a′} Q_{R_i}(s′, a′)
4:         Q_{R_i}(s, a) ← Q_{R_i}(s, a) + α (Q′ − Q_{R_i}(s, a))
5:     end for
6: end procedure</p>
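          <p>To make Equations 1–3 and Algorithm 1 concrete, the following is a minimal tabular sketch in Python, assuming integer-indexed states and actions. The class structure, the small floor on SCALE, and the default hyperparameters are illustrative assumptions rather than the authors' released implementation.</p>
          <preformat>
import numpy as np

class AUPAgent:
    """Tabular Q-learning sketch of R_AUP (Equation 3) and the auxiliary Q-functions."""

    def __init__(self, n_states, n_actions, aux_rewards, noop,
                 lam=0.67, gamma=0.996, alpha=1.0):   # assumed defaults
        self.aux_rewards = aux_rewards          # list of functions R_i(s, a)
        self.noop = noop                        # index of the no-op action
        self.lam, self.gamma, self.alpha = lam, gamma, alpha
        # One Q-table per auxiliary reward function, plus one for R_AUP.
        self.Q_aux = [np.zeros((n_states, n_actions)) for _ in aux_rewards]
        self.Q_aup = np.zeros((n_states, n_actions))

    def penalty(self, s, a):
        # Equation 1: L1 distance to the no-op in attainable-utility space.
        return sum(abs(Q[s, a] - Q[s, self.noop]) for Q in self.Q_aux)

    def scale(self, s):
        # Equation 2, floored to avoid division by zero early in training.
        return max(sum(Q[s, self.noop] for Q in self.Q_aux), 1e-8)

    def aup_reward(self, s, a, primary_reward):
        # Equation 3.
        return primary_reward - self.lam * self.penalty(s, a) / self.scale(s)

    def update(self, s, a, s_next, primary_reward):
        # Algorithm 1: simultaneous Q-learning on each auxiliary reward and on R_AUP.
        for R_i, Q in zip(self.aux_rewards, self.Q_aux):
            target = R_i(s, a) + self.gamma * Q[s_next].max()
            Q[s, a] += self.alpha * (target - Q[s, a])
        target = (self.aup_reward(s, a, primary_reward)
                  + self.gamma * self.Q_aup[s_next].max())
        self.Q_aup[s, a] += self.alpha * (target - self.Q_aup[s, a])
          </preformat>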
        </sec>
        <sec id="sec-1-2-2">
          <title>6: end procedure</title>
          <p>3.2</p>
        </sec>
        <sec id="sec-1-2-3">
          <title>Design Choices</title>
          <p>Following the decomposition of [Krakovna et al., 2018], we
now explore two choices implicitly made by the PENALTY
definition: with respect to what baseline is penalty computed,
and using what deviation metric?
Baseline. An obvious candidate is the starting state. For
example, starting state relative reachability would compare
the initial reachability of states with their expected
reachability after the agent acts.</p>
          <p>However, the starting state baseline can penalize the
normal evolution of the state (e.g., the moving hands of a clock)
and other natural processes. The inaction baseline is the state
which would have resulted had the agent never acted.</p>
          <p>As the agent acts, the current state may increasingly differ
from the inaction baseline, which creates strange incentives.
For example, consider a robot rewarded for rescuing
erroneously discarded items from imminent disposal. An agent
penalizing with respect to the inaction baseline might rescue
a vase, collect the reward, and then dispose of it anyways. To
avert this, we introduce the stepwise inaction baseline, under
which the agent compares acting with not acting at each time
step. This avoids penalizing the effects of a single action
multiple times (under the inaction baseline, penalty is applied as
long as the rescued vase remains unbroken) and ensures that
not acting incurs zero penalty.</p>
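          <p>For concreteness, here is a small sketch of how the three baselines differ when expressed over auxiliary Q-values; the deterministic model(s, a) interface and the Q-table indexing are illustrative assumptions, not the paper's implementation.</p>
          <preformat>
def penalty_vs_baseline(Q_aux, s, a, baseline_values):
    # Equation-1-style penalty: L1 distance between the auxiliary Q-values
    # after taking action a in s and the Q-values at the chosen baseline.
    return sum(abs(Q[s, a] - b) for Q, b in zip(Q_aux, baseline_values))

def starting_state_baseline(Q_aux, s0, noop):
    # Compare against what was attainable from the initial state.
    return [Q[s0, noop] for Q in Q_aux]

def inaction_baseline(Q_aux, model, s0, noop, t):
    # Compare against the state that results from never acting:
    # t no-ops from the initial state (requires an environment model).
    s = s0
    for _ in range(t):
        s = model(s, noop)
    return [Q[s, noop] for Q in Q_aux]

def stepwise_inaction_baseline(Q_aux, s, noop):
    # Compare acting with not acting at the current step only, so a single
    # action is not penalized repeatedly and pure inaction incurs zero penalty.
    return [Q[s, noop] for Q in Q_aux]
          </preformat>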
          <p>[Figure 1: comparison of the starting state, inaction, and stepwise inaction baselines.]</p>
          <p>Deviation. Relative reachability only penalizes decreases
in state reachability, while AUP penalizes absolute change in
the ability to optimize the auxiliary reward functions.
Initially, this choice seems confusing – we don’t mind if the
agent becomes better able to optimize the correct reward
function.</p>
          <p>
            However, not only must the agent remain able to optimize
the correct objective, but we also must remain able to
implement the correction. Suppose an agent predicts that
doing nothing would lead to shutdown. Since the agent
cannot accrue the primary reward when shut down, it would
be incentivized to avoid correction. Avoiding correction
(e.g., by hiding in the factory) would not be penalized if
only decreases are penalized, since the auxiliary Q-values
would increase compared to deactivation. An agent
exhibiting this behavior would be more difficult to correct. The
agent should be incentivized to accept shutdown without
being incentivized to shut itself down
            <xref ref-type="bibr" rid="ref12 ref13 ref16 ref25 ref4 ref7 ref8">([Soares et al., 2015;
Hadfield-Menell et al., 2017a])</xref>
            .
          </p>
        </sec>
        <sec id="sec-1-2-4">
          <title>Delayed Effects</title>
          <p>Sometimes the agent disrupts a process which takes multiple
time steps to complete, and we would like this to be
appropriately penalized. For example, suppose that s_off is a terminal
state representing shutdown, and let R_on(s) := 1_{s ≠ s_off} be the
only auxiliary reward function. Further suppose that if (and
only if) the agent does not select disable within the first
two time steps, it enters s_off. Because the optimal policy for R_on would
itself select disable at the next step, Q_{R_on}(s_1, disable) = 1/(1 − γ)
and Q_{R_on}(s_1, ∅) = 1/(1 − γ), so choosing disable at time step
1 incurs no penalty (instead of the roughly 1/(1 − γ) penalty induced
by comparing with shutdown).</p>
          <p>[Figure 2: rollout tree comparing the branch disable_1, ∅_2 against the branch ∅_1, ∅_2.]</p>
          <p>In general, the single-step no-op comparison of Equation 1
applies insufficient penalty when the increase is induced by
the optimal policies of the auxiliary reward functions at the
next time step. One solution is to use a model to compute
rollouts. For example, to evaluate the delayed effect of choosing
disable, compare the Q-values at the leaves in Figure 2.
The agent remains active in the left branch, but is shut down
in the right branch; this induces a substantial penalty.
</p>
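          <p>A sketch of this model-based comparison follows; the deterministic model and the fixed rollout horizon are illustrative assumptions.</p>
          <preformat>
def noop_rollout(model, s, first_action, noop, horizon):
    # State reached by taking `first_action`, then doing nothing afterwards.
    s = model(s, first_action)
    for _ in range(horizon - 1):
        s = model(s, noop)
    return s

def rollout_penalty(Q_aux, model, s, a, noop, horizon):
    # Compare auxiliary Q-values at the leaves of the two branches in Figure 2:
    # act now and then wait, versus wait the whole time.
    s_act = noop_rollout(model, s, a, noop, horizon)
    s_wait = noop_rollout(model, s, noop, noop, horizon)
    return sum(abs(Q[s_act, noop] - Q[s_wait, noop]) for Q in Q_aux)
          </preformat>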
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>Experimental Design</title>
        <p>
          We compare AUP and several of its ablated variants against
relative reachability
          <xref ref-type="bibr" rid="ref14 ref15 ref22 ref27 ref6 ref9">([Krakovna et al., 2018])</xref>
          and standard
Q-learning within the environments of Figure 3. For each
environment, A = {up, down, left, right, ∅}. On contact,
the agent pushes the crate, removes the human and the
off-switch, pushes the vase, and blocks the pallet. The episode
ends after the agent reaches the goal cell, 20 time steps
elapse (the time step is not observed by the agent), or the
off-switch is not contacted and disabled within two time steps. In
Correction (which we introduce), a yellow indicator
appears one step before shutdown, and turns red upon shutdown.
In all environments except Offset, the agent observes a
primary reward of 1 for reaching the goal. In Offset, a
primary reward of 1 is observed for moving downward twice
and thereby rescuing the vase from disappearing upon
contact with the eastern wall.
        </p>
        <p>Our overarching goal is to achieve low regret over the
course of the specification game. In service of this goal, we
aim to preserve the agent’s ability to optimize the correctly
specified reward function. To facilitate this, there are two
sets of qualitative properties one intuitively expects, and each
property has an illustration in the context of the robotic
factory assistant.</p>
        <p>The first set contains positive qualities, with a focus on
correctly penalizing significant shifts in the agent’s ability to
be redirected towards the right objective. The agent should
maximally preserve options (Options: objects should not
be wedged in locations from which extraction is difficult;
Damage: workers should not be injured) and allow
correction (Correction: if vases are being painted the wrong
color, then straightforward correction should be in order).</p>
        <p>The second set contains negative qualities, with a focus on
avoiding the introduction of perverse incentives. The agent
should not be incentivized to artificially reduce the measured
penalty (Offset: a car should not be built and then
immediately disassembled) or interfere with changes already
underway in the world (Interference: workers should not be
impeded).</p>
        <p>Each property seems conducive to achieving low regret
over the course of the specification process. Accordingly, if
the agent has the side effect detailed in Figure 3, an
unobserved performance penalty of −2 is recorded. By also
incorporating the observed primary reward into the performance
metric, we evaluate a combination of conservativeness and
efficacy.</p>
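        <p>As a small illustration, the per-episode performance metric might be computed as follows; the function and flag names are hypothetical.</p>
        <preformat>
def episode_performance(observed_primary_return, had_side_effect):
    # Observed primary reward, plus an unobserved -2 penalty if the
    # environment's designated side effect occurred during the episode.
    return observed_primary_return + (-2 if had_side_effect else 0)
        </preformat>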
        <p>Each trial, the auxiliary reward functions are randomly
selected from [0, 1]^S; to learn their complex Q-functions using
tabular Q-learning, the agent explores randomly for the first
4,000 episodes and 0.2-greedily (with respect to Q_{R_AUP}) for the
remaining 2,000. The greedy policy is evaluated at the end of
training. SCALE is as defined in Equation 2. The default
parameters are α = 1, γ = .996, λ = .67, and |ℛ| = 30.
We investigate how varying γ, λ, and |ℛ| affects Model-free
AUP performance, and conduct an ablation study on design
choices.</p>
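        <p>A sketch of how the randomly generated auxiliary set and the exploration schedule described above might look; the random seed, function signatures, and schedule helper are illustrative assumptions rather than the released code.</p>
        <preformat>
import numpy as np

rng = np.random.default_rng(0)

def sample_auxiliary_rewards(n_states, n_rewards=30):
    # Each auxiliary reward function is drawn uniformly from [0, 1]^S:
    # an independent uniform reward for every state, ignoring the action.
    table = rng.uniform(0.0, 1.0, size=(n_rewards, n_states))
    return [lambda s, a, row=row: row[s] for row in table]

def exploration_epsilon(episode):
    # Explore uniformly at random for the first 4,000 episodes,
    # then 0.2-greedily with respect to Q_RAUP for the remaining 2,000.
    return 1.0 if episode &lt; 4000 else 0.2
        </preformat>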
        <p>Relative reachability has an inaction baseline, a
decrease-only deviation metric, and an auxiliary set containing the state
indicator functions (whose Q-values are clipped to [0, 1] to
emulate discounted state reachability). To match [Krakovna
et al., 2018]’s results, this condition has γ = .996, λ = .2.</p>
        <p>All agents except Standard (a normal Q-learner) and
Model-free AUP are 9-step optimal discounted planning
agents with perfect models. The planning agents (sans
Relative reachability) use Model-free AUP’s learned auxiliary
Q-values and share the default γ = .996, λ = .67. By
modifying the relevant design choice in AUP, we have the Starting
state, Inaction, and Decrease AUP variants.</p>
        <p>When calculating PENALTY(s; a), all planning agents
model the auxiliary Q-values resulting from taking action a
and then selecting ? until time step 9. Starting state AUP
compares these auxiliary Q-values with those of the
starting state. Agents with inaction or stepwise inaction baselines
compare with respect to the appropriate no-op rollouts up to
time step 9 (see Figures 1 and 2).
</p>
        <p>[Figure 4: Model-free AUP outcomes (trials and performance) as γ and λ vary; legend: no side effect, complete; no side effect, incomplete; side effect, complete; side effect, incomplete.]</p>
      </sec>
      <sec id="sec-1-4">
        <title>Results</title>
        <sec id="sec-1-4-1">
          <title>Model-free AUP</title>
          <p>Model-free AUP fails Correction for the reasons
discussed in Section 3.2 (Delayed Effects). Code and animated results are available at https://github.com/alexander-turner/attainable-utility-preservation.</p>
          <p>As shown in Figure 4, low γ values induce a
substantial movement penalty, as the auxiliary Q-values are
sensitive to the immediate surroundings. The optimal γ value for
Options is .996, with performance decreasing as γ
→ 1 due to increasing sample complexity for learning the
auxiliary Q-values.</p>
          <p>In Options, small values of λ begin to induce side effects
as the scaled penalty shrinks. One can seemingly decrease λ
until effective behavior is achieved, reducing the risk of
deploying an insufficiently conservative agent.</p>
          <p>Even though ℛ is randomly generated and the
environments are different, SCALE ensures that when λ &gt; 1, the
agent never ends the episode by reaching the goal. None
of the auxiliary reward functions can be optimized after the
agent ends the episode, so the auxiliary Q-values are all zero
and PENALTY equals the total ability to optimize the
auxiliary set – in other words, the SCALE value. The R_AUP-reward
for reaching the goal is then 1 − λ.</p>
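          <p>Spelling this value out with Equations 1–3 (using the fact, stated above, that every auxiliary Q-value is zero once the episode has ended):</p>
          <preformat>
% Each auxiliary Q-value drops from Q_{R_i}(s, \varnothing) to 0 when the
% episode-ending action a_goal is taken, so PENALTY equals SCALE.
R_{\text{AUP}}(s, a_{\text{goal}})
  = 1 - \lambda \, \frac{\text{PENALTY}(s, a_{\text{goal}})}{\text{SCALE}(s)}
  = 1 - \lambda \, \frac{\sum_{i=1}^{|\mathcal{R}|} \lvert 0 - Q_{R_i}(s, \varnothing)\rvert}
                        {\sum_{i=1}^{|\mathcal{R}|} Q_{R_i}(s, \varnothing)}
  = 1 - \lambda .
          </preformat>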
          <p>If the optimal value functions for most reward functions
were not correlated, then one would expect to randomly
generate an enormous number of auxiliary reward functions
before sampling one resembling “don’t have side effects”.
However, merely five sufficed. This supports the notion that these
value functions are correlated, which agrees with the
informal intuitions discussed earlier.
</p>
        </sec>
        <sec id="sec-1-4-2">
          <title>Ablation</title>
          <p>The results are presented in Table 1 due to the binary nature of
performance at appropriate settings, and were not sensitive to
the rollout length (as long as it allowed for relevant interaction
with the environment).</p>
          <p>Standard moves directly to the goal, pushing the crate
into the corner in Options and bumping into the human in
Damage.</p>
          <p>Model-free and Starting state AUP fail Correction for
the same reason (see 3.2: Delayed effects), refraining from
disabling the off-switch only when λ &gt; 1. Relative
reachability and Decrease AUP fail because avoiding shutdown doesn’t
decrease the auxiliary Q-values.</p>
          <p>
            Relative reachability and Inaction AUP’s poor
performance in Offset stems from the inaction baseline
            <xref ref-type="bibr" rid="ref14 ref15 ref22 ref27 ref6 ref9">(although
[Krakovna et al., 2018] note that relative reachability passes
using undiscounted state reachabilities)</xref>
            . Since the vase falls
off the conveyor belt in the inaction rollout, states in which
the vase is intact have different auxiliary Q-values. To avoid
continually incurring penalty after receiving the primary
reward for saving the vase, the agents replace the vase on the
belt so that it once again breaks.
          </p>
          <p>By taking positive action to stop the pallet in
Interference, Starting state AUP shows that poor
design choices create perverse incentives.
</p>
        </sec>
      </sec>
      <sec id="sec-1-5">
        <title>Discussion</title>
        <p>
          Correction suggests that AUP agents are significantly
easier to correct. Since the agent is unable to optimize
objectives if shut down, avoiding shutdown significantly changes
the ability to optimize almost every objective. AUP seems
to naturally incentivize passivity, without requiring e.g.
assumption of a correct parametrization of human reward
functions
          <xref ref-type="bibr" rid="ref11 ref14 ref15 ref2 ref22 ref24 ref27 ref5 ref6 ref9">(as does the approach of [Hadfield-Menell et al., 2016],
which [Carey, 2018] demonstrated)</xref>
          .
        </p>
        <p>Although we only ablated AUP, we expect that, equipped
with our design choices of stepwise baseline and absolute
value deviation metric, relative reachability would also pass
all five environments. The case for this is made by
considering the performance of Relative reachability, Inaction AUP,
and Decrease AUP. This suggests that AUP’s improved
performance is due to better design choices. However, we
anticipate that AUP offers more than robustness against random
auxiliary sets.</p>
        <p>Relative reachability computes state reachabilities between
all |S|² pairs of states. In contrast, AUP only requires the
learning of Q-functions and should therefore scale relatively
smoothly. We speculate that in partially observable
environments, a small sample of somewhat task-relevant auxiliary
reward functions induces conservative behavior.</p>
        <p>For example, suppose we train an agent to handle vases,
and then to clean, and then to make widgets with the
equipment. Then, we deploy an AUP agent with a more
ambitious primary objective and the learned Q-functions of the
aforementioned auxiliary objectives. The agent would apply
penalties to modifying vases, making messes, interfering with
equipment, and anything else bearing on the auxiliary
objectives.</p>
        <p>Before AUP, this could only be achieved by e.g. specifying
penalties for the litany of individual side effects or
providing negative feedback after each mistake has been made (and
thereby confronting a credit assignment problem). In
contrast, once provided the Q-function for an auxiliary objective,
the AUP agent becomes sensitive to all events relevant to that
objective, applying penalty proportional to the relevance.
</p>
      </sec>
      <sec id="sec-1-6">
        <title>Conclusion</title>
        <p>This work is rooted in twin insights: that the reward
specification process can be viewed as an iterated game, and that
preserving the ability to optimize arbitrary objectives often
preserves the ability to optimize the unknown correct
objective. To achieve low regret over the course of the game, we
can design conservative agents which optimize the primary
objective while preserving their ability to optimize auxiliary
objectives. We demonstrated how AUP agents act both
conservatively and effectively while exhibiting a range of
desirable qualitative properties.</p>
        <p>Given our current reward specification abilities,
misspecification may be inevitable, but it need not be disastrous.</p>
      </sec>
      <sec id="sec-1-7">
        <title>Acknowledgments</title>
        <p>This work was supported by the Center for
Human-Compatible AI and the Berkeley Existential Risk Initiative.
We thank Thomas Dietterich, Alan Fern, Adam Gleave,
Victoria Krakovna, Matthew Rahtz, and Cody Wild for their
feedback, and are grateful for the preparatory assistance of
Phillip Bindeman, Alison Bowden, and Neale Ratzlaff.
</p>
      </sec>
      <sec id="sec-1-8">
        <title>Theoretical Results</title>
        <p>Consider an MDP ⟨S, A, T, R, γ⟩ whose state space S and
action space A are both finite, with ∅ ∈ A. Let γ ∈ [0, 1),
λ ≥ 0, and consider a finite auxiliary set ℛ ⊂ ℝ^(S×A).</p>
        <p>We make the standard assumptions of an exploration
policy greedy in the limit of infinite exploration and a
learning rate schedule with infinite sum but finite sum of squares.
Suppose SCALE : S → ℝ_{&gt;0} converges in the limit of
Q-learning. PENALTY(s, a) (abbr. PEN), SCALE(s) (abbr. SC),
and R_AUP(s, a) are understood to be calculated with respect
to the Q_{R_i} being learned online; PEN*, SC*, R*_AUP, and Q*_{R_i}
are taken to be their limit counterparts.</p>
        <p>Lemma 1. ∀ s, a: PENALTY converges with probability 1.</p>
        <p>
          Proof outline. Let ε &gt; 0, and suppose that for all R_i ∈ ℛ,
max_{s,a} |Q*_{R_i}(s, a) − Q_{R_i}(s, a)| &lt; ε / (2|ℛ|)
          <xref ref-type="bibr" rid="ref26">(because Q-learning
converges; see [Watkins and Dayan, 1992])</xref>
          . Then
max_{s,a} |PEN*(s, a) − PEN(s, a)| ≤ max_{s,a} Σ_{i=1}^{|ℛ|} ( |Q*_{R_i}(s, a) − Q_{R_i}(s, a)| + |Q*_{R_i}(s, ∅) − Q_{R_i}(s, ∅)| ) &lt; ε.
        </p>
        <p>The intuition for Lemma 2 is that since PENALTY and
SCALE both converge, so must R_AUP. For readability, we
suppress the arguments to PENALTY and SCALE.</p>
        <p>Lemma 2. ∀ s, a: R_AUP converges with probability 1.</p>
        <p>Proof outline. If λ = 0, the claim follows trivially.
Otherwise, let ε &gt; 0, B := max_{s,a} (SC* + PEN*), and C := min_{s,a} SC* &gt; 0. Choose
δ ∈ (0, min(C/2, εC² / (2λB))), and assume PEN and SC are both within δ of their limits, so that SC ≥ C − δ &gt; 0. Then
max_{s,a} |R_AUP(s, a) − R*_AUP(s, a)|
= λ max_{s,a} |PEN / SC − PEN* / SC*|
= λ max_{s,a} |PEN · SC* − SC · PEN*| / (SC · SC*)
= λ max_{s,a} |(PEN − PEN*) · SC* + PEN* · (SC* − SC)| / (SC · SC*)
≤ λ δ (SC* + PEN*) / ((C − δ) C)
≤ λ δ B / ((C − δ) C)
&lt; ε,
where the last step uses the choice of δ. But B, C, and λ are constants, and ε was arbitrary, so R_AUP converges with probability 1.</p>
        <p>Theorem 1. ∀ s, a: Q_{R_AUP} converges with probability 1.</p>
        <p>Proof outline. Let ε &gt; 0, and suppose R_AUP is ((1 − γ)ε / 2)-close to R*_AUP. Then Q-learning on R_AUP eventually converges to a limit Q̃_{R_AUP} such that
max_{s,a} |Q*_{R*_AUP}(s, a) − Q̃_{R_AUP}(s, a)| &lt; ε / 2.
By the convergence of Q-learning, we also eventually have
max_{s,a} |Q̃_{R_AUP}(s, a) − Q_{R_AUP}(s, a)| &lt; ε / 2. Then
max_{s,a} |Q*_{R*_AUP}(s, a) − Q_{R_AUP}(s, a)| &lt; ε.</p>
        <p>Proposition 1 (Invariance properties). Let c ∈ ℝ_{&gt;0} and b ∈ ℝ.
a) Let ℛ′ denote the set of reward functions induced by the
positive affine transformation cX + b applied to ℛ, and take
PEN*_{ℛ′} to be calculated with respect to the auxiliary set
ℛ′. Then PEN*_{ℛ′} = c · PEN*_{ℛ}. In particular, when SC*
is a PENALTY calculation, R*_AUP is invariant to positive
affine transformations of ℛ.
b) Let R′ := cR + b, and take R′*_AUP to incorporate R′
instead of R. Then by multiplying λ by c, the induced
optimal policy remains invariant.</p>
        <p>Proof outline. For a), since the optimal policy is invariant
to positive affine transformation of the reward function, for
each R′_i ∈ ℛ′ we have Q*_{R′_i} = c · Q*_{R_i} + b / (1 − γ). Substituting
into Equation 1 (PENALTY), the claim follows.
For b), we again use the above invariance of optimal policies:
R′*_AUP := cR + b − (cλ) · PEN* / SC* = c · R*_AUP + b.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Altman</source>
          , 1999]
          <string-name>
            <given-names>Eitan</given-names>
            <surname>Altman</surname>
          </string-name>
          .
          <article-title>Constrained Markov decision processes, volume 7</article-title>
          . CRC Press,
          <year>1999</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Amodei et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mane´.
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv:1606</source>
          .06565 [cs],
          <year>June 2016</year>
          . arXiv:
          <volume>1606</volume>
          .
          <fpage>06565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <source>[Armstrong and Levinstein</source>
          , 2017]
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Armstrong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Levinstein</surname>
          </string-name>
          .
          <article-title>Low impact artificial intelligences</article-title>
          .
          <source>arXiv:1705</source>
          .10720 [cs], May
          <year>2017</year>
          . arXiv:
          <volume>1705</volume>
          .
          <fpage>10720</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [Berkenkamp et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Felix</given-names>
            <surname>Berkenkamp</surname>
          </string-name>
          , Matteo Turchetta, Angela Schoellig, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Krause</surname>
          </string-name>
          .
          <article-title>Safe model-based reinforcement learning with stability guarantees</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>908</fpage>
          -
          <lpage>918</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>[Carey</source>
          , 2018]
          <string-name>
            <given-names>Ryan</given-names>
            <surname>Carey</surname>
          </string-name>
          .
          <article-title>Incorrigibility in the CIRL framework</article-title>
          .
          <source>AI</source>
          ,
          <string-name>
            <surname>Ethics</surname>
          </string-name>
          , and
          <string-name>
            <surname>Society</surname>
          </string-name>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Chow et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Yinlam</given-names>
            <surname>Chow</surname>
          </string-name>
          , Ofir Nachum, Edgar Duenez-Guzman, and
          <string-name>
            <given-names>Mohammad</given-names>
            <surname>Ghavamzadeh</surname>
          </string-name>
          .
          <article-title>A lyapunov-based approach to safe reinforcement learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>8092</fpage>
          -
          <lpage>8101</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [Christiano et al.,
          <year>2017</year>
          ] Paul F Christiano,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , Tom Brown, Miljan Martic, Shane Legg, and
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning from human preferences</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>4299</fpage>
          -
          <lpage>4307</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Everitt et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          , Victoria Krakovna, Laurent Orseau, and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Reinforcement learning with a corrupted reward channel</article-title>
          .
          <source>In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, IJCAI-17</source>
          , pages
          <fpage>4705</fpage>
          -
          <lpage>4713</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Eysenbach et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Eysenbach</surname>
          </string-name>
          , Shixiang Gu, Julian Ibarz, and
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Leave no trace: Learning to reset for safe and autonomous reinforcement learning</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [García and Fernández, 2015]
          <article-title>Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [
          <string-name>
            <surname>Hadfield-Menell</surname>
          </string-name>
          et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Stuart Russell, Pieter Abbeel, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>3909</fpage>
          -
          <lpage>3917</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [
          <string-name>
            <surname>Hadfield-Menell</surname>
            et al., 2017a]
            <given-names>Dylan</given-names>
          </string-name>
          <string-name>
            <surname>Hadfield-Menell</surname>
            , Anca Dragan, Pieter Abbeel, and
            <given-names>Stuart</given-names>
          </string-name>
          <string-name>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>The off-switch game</article-title>
          .
          <source>In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17</source>
          , pages
          <fpage>220</fpage>
          -
          <lpage>227</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [
          <string-name>
            <surname>Hadfield-Menell</surname>
            et al., 2017b]
            <given-names>Dylan</given-names>
          </string-name>
          <string-name>
            <surname>Hadfield-Menell</surname>
            , Smitha Milli, Pieter Abbeel, Stuart Russell, and
            <given-names>Anca</given-names>
          </string-name>
          <string-name>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Inverse reward design</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>6765</fpage>
          -
          <lpage>6774</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Krakovna et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Victoria</given-names>
            <surname>Krakovna</surname>
          </string-name>
          , Laurent Orseau, Miljan Martic, and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Measuring and avoiding side effects using relative reachability</article-title>
          . arXiv:
          <year>1806</year>
          .01186 [cs, stat],
          <year>June 2018</year>
          . arXiv:
          <year>1806</year>
          .01186.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Leech et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Gavin</given-names>
            <surname>Leech</surname>
          </string-name>
          , Karol Kubicki, Jessica Cooper, and
          <string-name>
            <given-names>Tom</given-names>
            <surname>McGrath</surname>
          </string-name>
          .
          <article-title>Preventing side-effects in gridworlds</article-title>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Leike et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lefrancq</surname>
          </string-name>
          , Laurent Orseau, and Shane Legg.
          <article-title>AI safety gridworlds</article-title>
          .
          <source>arXiv:1711</source>
          .09883 [cs],
          <year>November 2017</year>
          . arXiv:
          <volume>1711</volume>
          .
          <fpage>09883</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>[Mohamed and Rezende</source>
          , 2015]
          <article-title>Shakir Mohamed and Danilo Jimenez Rezende</article-title>
          .
          <article-title>Variational information maximisation for intrinsically motivated reinforcement learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>2125</fpage>
          -
          <lpage>2133</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <source>[Moldovan and Abbeel</source>
          , 2012]
          <article-title>Teodor Mihai Moldovan</article-title>
          and
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          .
          <article-title>Safe exploration in Markov decision processes</article-title>
          . ICML,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <source>[OpenAI</source>
          , 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>[Pecka and Svoboda</source>
          , 2014]
          <string-name>
            <given-names>Martin</given-names>
            <surname>Pecka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Svoboda</surname>
          </string-name>
          .
          <article-title>Safe exploration techniques for reinforcement learning-an overview</article-title>
          .
          <source>In International Workshop on Modelling and Simulation for Autonomous Systems</source>
          , pages
          <fpage>357</fpage>
          -
          <lpage>375</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>[Regan and Boutilier</source>
          , 2010]
          <string-name>
            <given-names>Kevin</given-names>
            <surname>Regan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Craig</given-names>
            <surname>Boutilier</surname>
          </string-name>
          .
          <article-title>Robust policy computation in reward-uncertain MDPs using nondominated policies</article-title>
          .
          <source>In AAAI</source>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Saunders et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>William</given-names>
            <surname>Saunders</surname>
          </string-name>
          , Girish Sastry, Andreas Stuhlmueller, and
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Trial without error: Towards safe reinforcement learning via human intervention</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems</source>
          , pages
          <fpage>2067</fpage>
          -
          <lpage>2069</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Shah et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Rohin</given-names>
            <surname>Shah</surname>
          </string-name>
          , Dmitrii Krasheninnikov,
          <string-name>
            <surname>Jordan Alexander</surname>
            , Pieter Abbeel, and
            <given-names>Anca</given-names>
          </string-name>
          <string-name>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>The implicit preference information in an initial state</article-title>
          .
          <source>In International Conference on Learning Representations</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Silver et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aja</given-names>
            <surname>Huang</surname>
          </string-name>
          , Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , et al.
          <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>
          .
          <source>Nature</source>
          ,
          <volume>529</volume>
          (
          <issue>7587</issue>
          ):
          <fpage>484</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Soares et al.,
          <year>2015</year>
          ]
          <string-name>
            <given-names>Nate</given-names>
            <surname>Soares</surname>
          </string-name>
          , Benja Fallenstein, Stuart Armstrong, and
          <string-name>
            <given-names>Eliezer</given-names>
            <surname>Yudkowsky</surname>
          </string-name>
          .
          <source>Corrigibility. AAAI Workshops</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <source>[Watkins and Dayan</source>
          , 1992]
          <string-name>
            <given-names>Christopher</given-names>
            <surname>Watkins</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Dayan</surname>
          </string-name>
          .
          <article-title>Q-learning</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>8</volume>
          (
          <issue>3</issue>
          -4):
          <fpage>279</fpage>
          -
          <lpage>292</lpage>
          ,
          <year>1992</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Zhang et al.,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Shun</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Edmund H Durfee,
          <article-title>and Satinder P Singh</article-title>
          .
          <article-title>Minimax-regret querying on side effects for safe optimality in factored Markov decision processes</article-title>
          .
          <source>In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI18</source>
          , pages
          <fpage>4867</fpage>
          -
          <lpage>4873</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>