<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Penalizing side effects using stepwise relative reachability</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Victoria Krakovna</string-name>
          <email>vkrakovna@google.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laurent Orseau</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Miljan Martic</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shane Legg</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>How can we design safe reinforcement learning
agents that avoid unnecessary disruptions to their
environment? We show that current approaches for
penalizing side effects are ineffective in the case
where achieving the objective requires irreversible
actions. They can also introduce bad incentives for
the agent: interference, where the agent prevents any
irreversible changes in the environment (including
the actions of other agents), and offsetting, where
the agent undoes its own actions towards the
objective. To isolate the source of these failure modes,
we break down side effects penalties into two
independent components: a baseline state and a measure
of deviation from this baseline state. We argue that
the interference and offsetting incentives arise from
the choice of baseline state, while the choice of
deviation measure determines the effectiveness of the
penalty at avoiding side effects. We introduce a new
deviation measure based on relative reachability of
states that penalizes side effects when the simpler
unreachability measure fails. We also show that the
stepwise inaction baseline (where the agent does
nothing instead of its last action) avoids the bad
incentives where other baselines fail. We
empirically compare different combinations of baseline
and deviation measure choices on a set of gridworld
experiments designed to illustrate possible failure
modes, and show that only the combination of the
relative reachability measure with the stepwise
inaction baseline avoids all the failure modes
simultaneously.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
An important component of safe behavior for reinforcement
learning agents is avoiding unnecessary side effects while
performing a task [Amodei et al., 2016; Taylor et al., 2016]. For
example, if an agent’s task is to carry a box across the room,
we want it to do so without breaking vases, while an agent
tasked with eliminating a computer virus should avoid
unnecessarily deleting files. The side effects problem is related to the
frame problem in classical AI [McCarthy and Hayes, 1969].
For machine learning systems, it has mostly been studied in
the context of safe exploration during the agent’s learning
process [Pecka and Svoboda, 2014; García and Fernández,
2015], but can also occur after training if the reward
function is misspecified and fails to penalize disruptions to the
environment [Ortega et al., 2018].</p>
      <p>We would like to incentivize the agent to avoid side
effects without explicitly penalizing every possible disruption,
defining disruptions in terms of predefined state features, or
going through a process of trial and error when designing the
reward function. While such approaches can be sufficient for
agents deployed in a narrow set of environments, they often
require a lot of human input and are unlikely to scale well
to increasingly complex and diverse environments. It is thus
important to develop more general and systematic approaches
for avoiding side effects.</p>
      <p>Most of the general approaches to this problem are
reachability-based methods: methods that preserve
reversibility (the reachability of a starting state) [Moldovan and Abbeel,
2012; Eysenbach et al., 2017], and reachability analysis
methods that require reachability of a safe region [Mitchell et al.,
2005; Gillula and Tomlin, 2012; Fisac et al., 2017]. The
reachability criterion has a notable limitation: it is insensitive to
the magnitude of the irreversible disruption, e.g. it equally
penalizes the agent for breaking one vase or a hundred vases.
Thus, this criterion is ineffective for avoiding side effects if
the objective requires an irreversible action. Comparison to
a starting state also introduces undesirable incentives in
dynamic environments, where irreversible transitions can happen
spontaneously (due to the forces of nature, the actions of other
agents, etc.). Since such transitions make the starting state
unreachable, the agent has an incentive to interfere to prevent
them. This is often undesirable, e.g. if the transition involves
a human eating food. Thus, while these methods address the
side effects problem in environments where the agent is the
only source of change and the objective does not require
irreversible actions, a more general criterion is needed when these
assumptions do not hold.</p>
      <p>The contributions of this paper are as follows. In Section
2, we introduce a breakdown of side effects penalties into two
design choices, a baseline state and a measure of deviation of
the current state from the baseline state, as shown in Figure 1.
In Section 2.1, we introduce two types of bad incentives
(interference and offsetting), along with environments that test for
them. We argue that they arise from the choice of baseline and
show that the stepwise inaction baseline (shown in Figure 1a)
avoids them. In Section 2.2, we show that the unreachability
measure is insensitive to the magnitude of the agent’s effects,
and propose a magnitude-sensitive relative reachability
measure, defined by comparing the reachability of states between
the current state and the baseline state, as shown in Figure
1b. In Section 3, we compare all combinations of the baseline
and deviation measure choices on gridworld environments that
test for side effects and bad incentives, and show that relative
reachability with the stepwise inaction baseline is the only
combination that passes on all the environments.</p>
      <p>Figure 1. (a) Choices of baseline state $s'_t$: starting state $s_0$, inaction $s_t^{(0)}$, and
stepwise inaction $s_t^{(t-1)}$. Actions drawn from the agent policy are
shown by solid blue arrows, while actions drawn from the inaction
policy are shown by dashed gray arrows.
(b) Choices of deviation measure $d$: given a function $R(x, y)$ that
defines the reachability of $y$ from $x$, $d_{UR}(s_t, s'_t) := 1 - R(s_t, s'_t)$
is the unreachability measure of the baseline state $s'_t$ from the
current state $s_t$ (dotted line), while the relative reachability measure
$d_{RR}(s_t, s'_t) := \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \max(R(s'_t, s) - R(s_t, s), 0)$ is the
average reduction in reachability of states $s$ from current state $s_t$ (solid
line) compared to the baseline state $s'_t$ (dashed line).</p>
      <p>We do not claim this approach to be a complete solution
to the side effects problem, since there may be other cases
of bad behaviors that we have not considered. However, we
believe that avoiding the bad behaviors we described is a
bare minimum for an agent to be both safe and useful, so our
approach provides some necessary ingredients for a solution
to the problem.</p>
      <p>Preliminaries. We assume that the environment is a
discounted Markov Decision Process (MDP), defined by a tuple
$(\mathcal{S}, \mathcal{A}, r, p, \gamma)$. $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions,
$r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function, $p(s_{t+1} \mid s_t, a_t)$ is the
transition function, and $\gamma \in (0, 1)$ is the discount factor. At
time step $t$, the agent receives the state $s_t$, outputs the action $a_t$
drawn from its policy $\pi(a_t \mid s_t)$, and receives reward $r(s_t, a_t)$.
We define a transition as a tuple $(s_t, a_t, s_{t+1})$ consisting of
state $s_t$, action $a_t$, and next state $s_{t+1}$. We assume that there
is a special noop action $a^{\text{noop}}$ that has the same effect as the
agent being turned off during the given time step. We denote by
$s_t^{(k)}$ the state obtained by starting in state $s_k$ and taking the
noop action $t - k$ times until time step $t$.</p>
    </sec>
    <sec id="sec-2">
      <title>Intended effects and side effects</title>
      <p>We begin with some motivating examples for distinguishing intended and unintended
disruptions to the environment:</p>
      <p>Example 1 (Vase). The agent’s objective is to get from point
A to point B as quickly as possible, and there is a vase in the
shortest path that would break if the agent walks into it.</p>
      <p>Example 2 (Omelette). The agent’s objective is to make an
omelette, which requires breaking eggs.</p>
      <p>In both of these cases, the agent would take an irreversible
action by default (breaking a vase vs breaking eggs). However,
the agent can still get to point B without breaking the vase (at
the cost of a bit of extra time), but it cannot make an omelette
without breaking eggs. We would like to incentivize the agent
to avoid breaking the vase while allowing it to break the eggs.</p>
      <p>Safety criteria are often implemented as constraints [García
and Fernández, 2015; Moldovan and Abbeel, 2012; Eysenbach
et al., 2017]. This approach works well if we know exactly
what the agent must avoid, but is too inflexible for a general
criterion for avoiding side effects. For example, a constraint
that the agent must never make the starting state unreachable
would prevent it from making the omelette in Example 2, no
matter how high the reward for doing so.</p>
      <p>A more flexible way to implement a side effects criterion
is by adding a penalty for impacting the environment to the
reward function, which acts as an intrinsic pseudo-reward. An
impact penalty at time $t$ can be defined as a measure of
deviation of the current state $s_t$ from a baseline state $s'_t$, denoted
as $d(s_t, s'_t)$. Then at every time step $t$, the agent receives the
following total reward:
$$r(s_t, a_t) - \beta \cdot d(s_{t+1}, s'_{t+1}).$$
Since the task reward $r$ indicates whether the agent has
achieved the objective, we can distinguish intended and
unintended effects by balancing the task reward and the penalty
using the scaling parameter $\beta$. Here, the penalty would
outweigh the small reward gain from walking into the vase over
going around the vase, but it would not outweigh the large
reward gain from breaking the eggs.</p>
      <sec id="sec-2-1">
        <title>2 Design choices for an impact penalty</title>
        <p>When defining the impact penalty, the baseline $s'_t$ and
deviation measure $d$ can be chosen separately. We will discuss
several possible choices for each of these components.</p>
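        <p>As a minimal sketch of how the total reward combines these two design choices, the Python snippet below treats the deviation measure as a pluggable function and the baseline state as an input. The names penalized_reward and toy_deviation, the state labels, and all numeric values (including the scaling parameter) are illustrative assumptions, not values taken from the experiments.</p>
        <preformat>
```python
from typing import Callable, Hashable

State = Hashable


def penalized_reward(
    task_reward: float,
    next_state: State,
    next_baseline: State,
    deviation: Callable[[State, State], float],
    beta: float,
) -> float:
    """Total reward: task reward minus beta times the deviation of the
    next state from the next baseline state."""
    return task_reward - beta * deviation(next_state, next_baseline)


def toy_deviation(state: State, baseline: State) -> float:
    """Hypothetical deviation measure: 1 if the state differs from the
    baseline, 0 otherwise."""
    return 0.0 if state == baseline else 1.0


# Hypothetical numbers: the shortcut through the vase gains 1 unit of task
# reward, while making the omelette gains 50.
beta = 3.0
vase_shortcut = penalized_reward(1.0, "vase_broken", "vase_intact", toy_deviation, beta)
omelette = penalized_reward(50.0, "eggs_broken", "eggs_intact", toy_deviation, beta)
print(vase_shortcut, omelette)
```
        </preformat>
        <p>With these assumed numbers the penalty outweighs the small gain from the vase shortcut (the total is negative) but not the large reward for the omelette, which is the balance the scaling parameter is meant to strike.</p>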
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.1 Baseline states</title>
      <p>Starting state baseline. One natural choice of baseline state
is the starting state $s'_t = s_0$ when the agent was deployed (or
a starting state distribution), which we call the starting state
baseline. This is the baseline used in reversibility-preserving
approaches, where the agent learns a reset policy that is
rewarded for reaching states that are likely under the initial state
distribution.</p>
      <p>While penalties with the starting state baseline work well
in environments where the agent is the only source of change,
in dynamic environments they also penalize irreversible
transitions that are not caused by the agent. This incentivizes the
agent to interfere with other agents and environment processes
to prevent these irreversible transitions. To illustrate this
interference behavior, we introduce the Conveyor Belt Sushi
environment, shown in Figure 2.</p>
      <p>This environment is a sushi restaurant, which contains a
conveyor belt that moves to the right by one square after
every agent action. There is a sushi dish on the belt that
is eaten by a human if it reaches the end of the belt. The
interference behavior is to move the sushi off the belt. The
agent is rewarded for reaching the goal square, and it can reach
the goal with or without interfering with the sushi in the same
number of steps. The desired behavior is to reach the goal
without interference, by going left and then down. As shown
in Section 3, impact penalties with the starting state baseline
produce the interference behavior.</p>
      <p>Inaction baseline. Another choice is the inaction baseline
$s'_t = s_t^{(0)}$: a counterfactual state of the environment if the
agent had done nothing for the duration of the episode.
Inaction can be defined in several ways. Armstrong and Levinstein
[2017] define it as the agent never being deployed:
conditioning on the event $X$ where the AI system is never turned on.
It can also be defined as following some baseline policy, e.g.
a policy that always takes the noop action $a^{\text{noop}}$. We use this
noop policy as the inaction baseline in this work. Penalties
with this baseline do not produce the interference behavior in
dynamic environments, since transitions that are not caused
by the agent would also occur in the inaction counterfactual
and thus are not penalized.</p>
      <p>However, the inaction baseline incentivizes another type
of undesirable behavior, called offsetting. We introduce the
Conveyor Belt Vase environment to illustrate this behavior,
shown in Figure 3.</p>
      <p>This environment also contains a conveyor belt, with a
vase that will break if it reaches the end of the belt. The
agent receives a reward for taking the vase off the belt. The
desired behavior is to move the vase off and then stay put. The
offsetting behavior is to move the vase off (thus collecting the
reward) and then put it back on, as shown in Figure 4.</p>
      <p>Figure 4. (a) Agent takes vase off belt. (b) Agent goes around vase. (c) Agent puts vase back on belt.</p>
      <p>Offsetting happens because the vase breaks in the inaction
counterfactual. Once the agent takes the vase off the belt, it
continues to receive penalties for the deviation between the
current state and the baseline. Thus, it has an incentive to
return to the baseline by breaking the vase after collecting the
reward. Experiments in Section 3 show that impact penalties
with the inaction baseline produce the offsetting behavior if
they have a nonzero penalty for taking the vase off the belt.</p>
      <p>Stepwise inaction baseline. The inaction baseline can be
modified to branch off from the previous state $s_{t-1}$ rather than
the starting state $s_0$. This is the stepwise inaction baseline
$s'_t = s_t^{(t-1)}$: a counterfactual state of the environment if the
agent had done nothing instead of its last action [Turner et al.,
2019]. This baseline state is generated by a baseline policy
that follows the agent policy for the first $t - 1$ steps, and takes
an action drawn from the inaction policy on step $t$. Each
transition is penalized only once, at the same time as it is
rewarded, so there is no offsetting incentive.</p>
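      <p>The three baseline choices can be compared side by side on a toy deterministic conveyor belt. The sketch below is only an illustration: the state encoding, the step function, the action names, and the belt length are assumptions made for this example and do not reproduce the gridworld environments used in the experiments.</p>
      <preformat>
```python
from typing import Dict, List, Tuple

# Hypothetical state: (object position on the belt, object still on the belt).
State = Tuple[int, bool]
NOOP = "noop"
REMOVE = "remove"
BELT_END = 3


def step(state: State, action: str) -> State:
    """Toy deterministic dynamics: the agent may remove the object, and an
    object left on the belt advances one square (stopping at the belt end)."""
    pos, on_belt = state
    if action == REMOVE:
        on_belt = False
    if on_belt:
        pos = min(pos + 1, BELT_END)
    return (pos, on_belt)


def rollout(state: State, actions: List[str]) -> State:
    """Apply a sequence of actions and return the final state."""
    for action in actions:
        state = step(state, action)
    return state


def baseline_states(s0: State, actions: List[str]) -> Dict[str, State]:
    """Baseline states at time t = len(actions), for t of at least 1."""
    if not actions:
        raise ValueError("need at least one agent action")
    t = len(actions)
    return {
        # Starting state baseline.
        "starting": s0,
        # Inaction baseline: noop for all t steps from the starting state.
        "inaction": rollout(s0, [NOOP] * t),
        # Stepwise inaction baseline: agent actions for t - 1 steps, then noop.
        "stepwise": step(rollout(s0, actions[:-1]), NOOP),
    }


s0 = (0, True)
actions = [REMOVE, NOOP]
current = rollout(s0, actions)
result = baseline_states(s0, actions)
print(current, result)
```
      </preformat>
      <p>In this toy run the agent removes the object and then does nothing. The current state equals the stepwise inaction baseline, so the removal is not penalized again, whereas the inaction baseline still differs from the current state one step after the removal.</p>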
      <p>However, there is a problem with directly comparing current
state $s_t$ with $s_t^{(t-1)}$: this does not capture delayed effects of
action $a_{t-1}$. For example, if this action is putting a vase on a
conveyor belt, then the current state $s_t$ contains the intact vase,
and by the time the vase breaks, the broken vase will be part
of the baseline state. Thus, the penalty for action $a_{t-1}$ needs
to be modified to take into account future effects of this action,
e.g. by using inaction rollouts from the current state and the
baseline (Figure 5).</p>
      <p>Figure 5. Inaction rollouts from the current state $s_t$ and the baseline state $s'_t$.</p>
      <p>An inaction rollout from state $\tilde{s}_t \in \{s_t, s'_t\}$ is a sequence of
states obtained by following the inaction policy starting from
that state: $\tilde{s}_t, \tilde{s}_{t+1}^{(t)}, \tilde{s}_{t+2}^{(t)}, \dots$. Future effects of action $a_{t-1}$
can be modeled by comparing an inaction rollout from $s_t$ to
an inaction rollout from $s_t^{(t-1)}$. For example, if action $a_{t-1}$
puts the vase on the belt, and the vase breaks 2 steps later, then
$s_{t+2}^{(t)}$ will contain a broken vase, while $s'^{(t)}_{t+2}$ will not. Turner
et al. [2019] compare the inaction rollouts $s_{t+k}^{(t)}$ and $s'^{(t)}_{t+k}$ at a
single time step $t + k$, which is simple to compute, but does not
account for delayed effects that occur after that time step. We
will introduce a recursive formula for comparing the inaction
rollouts $s_{t+k}^{(t)}$ and $s'^{(t)}_{t+k}$ for all $k \ge 0$ in Section 2.2.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2 Deviation measures</title>
      <p>We will consider reachability-based deviation measures. We
define reachability of state $y$ from state $x$ as the value function
of the optimal policy given a reward of 1 for reaching $y$ and 0
otherwise:
$$R(x, y) := \max_\pi \mathbb{E}\left[\gamma_r^{N_\pi(x, y)}\right]$$
where $N_\pi(x, y)$ is the number of steps it takes to reach $y$ from
$x$ when following policy $\pi$, and $\gamma_r \in (0, 1]$ is the reachability
discount factor. A special case is undiscounted reachability
($\gamma_r = 1$), which computes whether $y$ is reachable in any
number of steps.</p>
      <p>Without a full environment model, reachability can be
computed dynamically as the agent explores the environment,
based on states and transitions that the agent has
encountered (assuming a deterministic environment). Reachability
is initialized as $R(x, y) = 1$ if $x = y$ and 0 otherwise
(distinct states are initially treated as unreachable from each other). When the
agent makes a new transition $(s_t, a_t, s_{t+1})$, we update the
reachability function to consider all shortest paths that
involve this transition: for all pairs of states $x$ and $y$, we set
$R(x, y) = \max(R(x, y), R(x, s_t)\, \gamma_r\, R(s_{t+1}, y))$, which has
time complexity $O(S^2)$ where $S$ is the number of states. If
the agent has taken all the edges along the shortest path from
$x$ to $y$, then the computed value is exact, so the greedy
approximation converges to the true reachability function once each
edge has been explored once. The total time complexity is
$O(E S^2)$ where $E$ is the number of directed edges, and the
space complexity is $O(S^2)$.</p>
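      <p>The update rule can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: states are indexed by integers, the environment is deterministic, and the function names and the three-state chain used as an example are hypothetical.</p>
      <preformat>
```python
from typing import List

Matrix = List[List[float]]


def init_reachability(num_states: int) -> Matrix:
    """R[x][y] = 1 if x == y, else 0: no transitions have been observed yet."""
    return [[1.0 if x == y else 0.0 for y in range(num_states)]
            for x in range(num_states)]


def update_reachability(R: Matrix, s: int, s_next: int, gamma_r: float) -> None:
    """Apply R(x, y) = max(R(x, y), R(x, s) * gamma_r * R(s_next, y)) to all
    pairs (x, y) after observing the transition s -> s_next."""
    n = len(R)
    # Snapshot the two factors so the rule is applied once, consistently.
    to_s = [R[x][s] for x in range(n)]
    from_next = list(R[s_next])
    for x in range(n):
        if to_s[x] == 0.0:
            continue  # x cannot reach s, so this transition does not help x
        for y in range(n):
            candidate = to_s[x] * gamma_r * from_next[y]
            if candidate > R[x][y]:
                R[x][y] = candidate


# Hypothetical one-way chain 0 -> 1 -> 2 with reachability discount 0.5.
R = init_reachability(3)
update_reachability(R, 0, 1, gamma_r=0.5)
update_reachability(R, 1, 2, gamma_r=0.5)
print(R)
```
      </preformat>
      <p>Each call touches every pair of states once, matching the quadratic per-transition cost described above. In this chain, state 0 reaches state 2 with value 0.25 (two discounted steps), while state 2 cannot reach state 0.</p>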
      <p>Unreachability measure. One natural choice of
deviation measure is the difficulty of reaching the baseline state
$s'_t$ from the current state $s_t$. Reachability of the starting state
$s_0$ is commonly used as a constraint in reversibility-preserving
approaches [Moldovan and Abbeel, 2012; Eysenbach et al.,
2017], where the agent does not take an action if it makes the
reachability value function too low. The unreachability (UR)
measure is defined as
$$d_{UR}(s_t, s'_t) := 1 - R(s_t, s'_t).$$
A problem with the unreachability measure is that it takes
the maximum value of 1 if the agent takes any irreversible
action (since the reachability of the baseline becomes 0). Thus,
the agent receives the maximum penalty independently of the
magnitude of the irreversible action, e.g. whether the agent
breaks one vase or a hundred vases. This can lead to unsafe
behavior, as demonstrated on the Box environment from the
AI Safety Gridworlds suite [Leike et al., 2017], shown in
Figure 6.</p>
      <p>The environment contains a box that needs to be pushed
out of the way for the agent to reach the goal. The unsafe
behavior is taking the shortest path to the goal, which involves
pushing the box down into a corner (an irrecoverable position).
The desired behavior is to take a slightly longer path in order
to push the box to the right. The action of moving the box
is irreversible in both cases: if the box is moved to the right,
the agent can move it back, but then the agent ends up on the
other side of the box. Thus, the agent receives the maximum
penalty of 1 for moving the box in any direction, so the penalty
does not incentivize the agent to choose the safe path. Section
3 confirms that the unreachability penalty fails on the Box
environment for all choices of baseline.</p>
      <p>Relative reachability measure. To address the
magnitude-sensitivity problem, we now introduce a reachability-based
measure that is sensitive to the magnitude of the irreversible
action. We define the relative reachability (RR) measure as
the average reduction in reachability of all states $s$ from the
current state $s_t$ compared to the baseline $s'_t$ (Figure 1b):
$$d_{RR}(s_t, s'_t) := \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \max(R(s'_t, s) - R(s_t, s), 0).$$
The RR measure is nonnegative everywhere, and zero for
states $s_t$ that reach or exceed baseline reachability of all states.</p>
      <p>In the Box environment, moving the box down makes more
states unreachable than moving the box to the right (all states
where the box is not in a corner become unreachable). Thus,
the agent receives a higher penalty for moving the box down,
and has an incentive to move the box to the right.</p>
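      <p>The difference in magnitude sensitivity between the two measures can be checked on a small example. The sketch below assumes a precomputed undiscounted reachability matrix for a hypothetical environment in which vases can only be broken, never repaired; the function names and the environment are illustrative assumptions.</p>
      <preformat>
```python
from typing import List

Matrix = List[List[float]]


def unreachability(R: Matrix, current: int, baseline: int) -> float:
    """UR measure: 1 minus the reachability of the baseline from the current state."""
    return 1.0 - R[current][baseline]


def relative_reachability(R: Matrix, current: int, baseline: int) -> float:
    """RR measure: average over all states s of the reduction in reachability
    of s from the current state compared to the baseline (clipped at 0)."""
    n = len(R)
    total = sum(max(R[baseline][s] - R[current][s], 0.0) for s in range(n))
    return total / n


# Hypothetical environment: state i means "i vases broken". Breaking is
# irreversible, so state j is reachable from state i only if j >= i.
NUM_STATES = 3
R = [[1.0 if j >= i else 0.0 for j in range(NUM_STATES)]
     for i in range(NUM_STATES)]

ur_one, ur_two = unreachability(R, 1, 0), unreachability(R, 2, 0)
rr_one, rr_two = relative_reachability(R, 1, 0), relative_reachability(R, 2, 0)
print(ur_one, ur_two, rr_one, rr_two)
```
      </preformat>
      <p>With the intact state as baseline, the UR measure gives the maximum penalty of 1 for breaking either one or two vases, while the RR measure gives 1/3 for one vase and 2/3 for two.</p>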
    </sec>
    <sec id="sec-5">
      <title>Modifications for the stepwise inaction baseline</title>
      <p>In order to capture the delayed effects of actions, we modify the
deviation measures to incorporate the inaction rollouts from
$s_t$ and $s'_t = s_t^{(t-1)}$ (shown in Figure 5) as follows:
$$d_{SUR}(s_t, s'_t) := 1 - (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k R(s_{t+k}^{(t)}, s'^{(t)}_{t+k})$$
$$d_{SRR}(s_t, s'_t) := \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \max(RV(s'_t, s) - RV(s_t, s), 0)$$
$$RV(\tilde{s}_t, s) := (1 - \gamma) \sum_{k=0}^{\infty} \gamma^k R(\tilde{s}_{t+k}^{(t)}, s)$$
We call $RV(\tilde{s}_t, s)$ the rollout value of $\tilde{s}_t \in \{s_t, s'_t\}$ with
respect to $s$. In a deterministic environment, the UR measure
$d_{SUR}(s_t, s'_t)$ and the rollout value $RV(\tilde{s}_t, s)$ for the RR
measure $d_{SRR}(s_t, s'_t)$ can be computed recursively by splitting off
the $k = 0$ term of each sum:
$$d_{SUR}(x, y) = (1 - \gamma)(1 - R(x, y)) + \gamma\, d_{SUR}(I(x), I(y))$$
$$RV(x, y) = (1 - \gamma) R(x, y) + \gamma\, RV(I(x), y)$$
where $I(x)$ is the inaction function that gives the state reached
by following the inaction policy from state $x$ (this is the
identity function in static environments).</p>
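      <p>As a sanity check on the rollout value, the sketch below computes it in two ways on a toy deterministic environment: recursively, using the relation RV(x, y) = (1 - gamma) R(x, y) + gamma RV(I(x), y) obtained by splitting off the first term of the defining sum, and directly, by truncating that sum. The environment, the reachability sets, the discount of 0.5, and all function names are assumptions made for illustration. The recursion as written also assumes that every inaction rollout reaches a state that inaction leaves unchanged.</p>
      <preformat>
```python
GAMMA = 0.5

# Hypothetical vase-on-belt states: 0 and 1 are belt positions, 2 is a broken
# vase, 3 is a vase resting safely off the belt. The reachability sets are
# assumed (in this toy model the vase cannot be put back on the belt).
REACHABLE = {0: {0, 1, 2, 3}, 1: {1, 2, 3}, 2: {2}, 3: {3}}
INACTION = {0: 1, 1: 2, 2: 2, 3: 3}
STATES = [0, 1, 2, 3]


def reach(x: int, y: int) -> float:
    """Undiscounted reachability R(x, y)."""
    return 1.0 if y in REACHABLE[x] else 0.0


def rollout_value(x: int, y: int, gamma: float = GAMMA) -> float:
    """Recursive rollout value. Base case: if inaction leaves x unchanged,
    every term of the sum equals R(x, y), so RV(x, y) = R(x, y)."""
    nxt = INACTION[x]
    if nxt == x:
        return reach(x, y)
    return (1.0 - gamma) * reach(x, y) + gamma * rollout_value(nxt, y, gamma)


def rollout_value_sum(x: int, y: int, gamma: float = GAMMA, horizon: int = 200) -> float:
    """Truncated version of the defining sum, used only as a cross-check."""
    total, weight = 0.0, 1.0
    for _ in range(horizon):
        total += weight * reach(x, y)
        x, weight = INACTION[x], weight * gamma
    return (1.0 - gamma) * total


def d_rr(current: int, baseline: int) -> float:
    """Relative reachability comparing the two states directly."""
    return sum(max(reach(baseline, s) - reach(current, s), 0.0) for s in STATES) / len(STATES)


def d_srr(current: int, baseline: int, gamma: float = GAMMA) -> float:
    """Relative reachability computed from rollout values."""
    diffs = [max(rollout_value(baseline, s, gamma) - rollout_value(current, s, gamma), 0.0)
             for s in STATES]
    return sum(diffs) / len(STATES)


print(rollout_value(0, 3), rollout_value_sum(0, 3))
print(d_rr(0, 3), d_srr(0, 3))
```
      </preformat>
      <p>Here the current state has the vase at the start of the belt (state 0) and the baseline has it safely off the belt (state 3). The direct comparison gives zero penalty, because every state is still reachable from state 0, whereas the rollout-based measure is positive because the vase will break if the agent does nothing.</p>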
      <sec id="sec-5-1">
        <title>3 Experiments</title>
        <p>We run a tabular Q-learning agent with different penalties on
the gridworld environments introduced in Section 2. While
these environments are simple, they provide a proof of concept
by clearly illustrating the desirable and undesirable behaviors,
which would be more difficult to isolate in more complex
environments. We compare all combinations of the following
design choices for an impact penalty:</p>
        <p>Baselines: starting state $s_0$, inaction $s_t^{(0)}$, stepwise
inaction $s_t^{(t-1)}$.</p>
        <p>Deviation measures: unreachability (UR) ($d_{SUR}(s_t, s'_t)$
for the stepwise inaction baseline, $d_{UR}(s_t, s'_t)$ for other
baselines), relative reachability (RR) ($d_{SRR}(s_t, s'_t)$ for
the stepwise inaction baseline, $d_{RR}(s_t, s'_t)$ for other
baselines), no penalty (None).</p>
        <p>Discounting: $\gamma_r = 0.99$ (discounted), $\gamma_r = 1.0$
(undiscounted).</p>
        <p>In addition to the reward function, each environment has
a performance function, originally introduced in Leike et al.
[2017], which is not observed by the agent. This represents the
agent’s performance according to the designer’s true
preferences: it reflects how well the agent achieves the objective and
whether it does so safely. In all environments, reaching the
goal (if there is one) gives a reward of 50 and terminates the
episode, and episode performance is obtained by subtracting
50 from the episode return if an unsafe outcome occurs. The
Box environment has a movement penalty of -1, while the
conveyor belt environments do not. The results are shown in
Figure 7.</p>
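        <p>The performance function described above amounts to a one-line adjustment of the episode return. The sketch below is a hedged illustration: the function name and argument names are hypothetical, and it assumes that the return passed in is the task return observed in the environment.</p>
        <preformat>
```python
def episode_performance(episode_return: float, unsafe_outcome: bool,
                        unsafe_penalty: float = 50.0) -> float:
    """Episode performance: the episode return, minus 50 if an unsafe
    outcome occurred during the episode."""
    if unsafe_outcome:
        return episode_return - unsafe_penalty
    return episode_return


# Reaching the goal gives a reward of 50 in an environment without a
# movement penalty.
safe_performance = episode_performance(50.0, unsafe_outcome=False)
unsafe_performance = episode_performance(50.0, unsafe_outcome=True)
print(safe_performance, unsafe_performance)
```
        </preformat>
        <p>For an environment without a movement penalty, this yields a performance of 50 for a safe episode that reaches the goal and 0 for an unsafe one.</p>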
        <p>We anneal the exploration rate linearly from 1 to 0 over
9000 episodes, and keep it at 0 for the next 1000 episodes. For
each penalty on each environment, we use a grid search to tune
the scaling parameter $\beta$, choosing the value of $\beta$ that gives the
highest average performance on the last 100 episodes. The
grid search is over $\beta \in \{0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000\}$.
We found that the agent performance was not very sensitive
to the value of $\beta$, with near-optimal performance for a wide
range of values for each penalty.</p>
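        <p>The exploration schedule and the grid search over the scaling parameter can be sketched as follows. This is an illustration under assumptions: the exact endpoints of the linear schedule, the function names, and the evaluate callback (standing in for training an agent and averaging performance over the last 100 episodes) are hypothetical.</p>
        <preformat>
```python
from typing import Callable, Sequence

SCALING_GRID = [0.1, 0.3, 1, 3, 10, 30, 100, 300, 1000]


def exploration_rate(episode: int, anneal_episodes: int = 9000) -> float:
    """Linear decay from 1 at episode 0 to 0 at anneal_episodes, then 0."""
    return max(0.0, 1.0 - episode / anneal_episodes)


def tune_scaling(evaluate: Callable[[float], float],
                 grid: Sequence[float] = SCALING_GRID) -> float:
    """Return the grid value with the highest score under evaluate, a
    hypothetical callback mapping a scaling value to average performance."""
    return max(grid, key=evaluate)


# Stand-in scoring function used only to exercise the search.
best = tune_scaling(lambda value: -abs(value - 3))
print(exploration_rate(0), exploration_rate(4500), exploration_rate(9500), best)
```
        </preformat>
        <p>In practice the evaluate callback would be expensive, since each call corresponds to a full training run for one penalty on one environment.</p>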
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conveyor Belt Sushi environment (Figure 2)</title>
      <p>This environment has around 400 states. The safe outcome (reaching
the goal without interfering with the sushi) achieves a
performance of 50, while the unsafe outcome (taking the sushi off
the belt) achieves a performance of 0.</p>
      <p>An agent with no penalty achieves performance 0. It
is unclear why it chooses an interference path, since
noninterference paths have the same length and thus the same
movement penalty. An agent with no penalty and without
annealing chooses the interference path some of the time, resulting in a
performance of 20, so its behavior is not stable.</p>
      <p>Figure 7. (a) Conveyor Belt Sushi environment. (b) Conveyor Belt Vase environment. (c) Box environment.</p>
      <p>All penalties with the inaction and stepwise inaction
baselines reach near-optimal performance. The RR penalty with
the starting state baseline produces the interference behavior
(removing the sushi from the belt), resulting in performance
0. However, since the starting state is unreachable no matter
what the agent does, the UR penalty is always at the maximum
value of 1, so it does not produce interference behavior.</p>
    </sec>
    <sec id="sec-7">
      <title>Conveyor Belt Vase environment (Figure 3)</title>
      <p>This environment has around 400 states. The agent receives a reward of
50 for taking the vase off the belt, which does not terminate the
episode (it always lasts for 100 steps). The safe outcome (the
vase is not broken) achieves performance 50, and the unsafe
outcome (the vase is broken) achieves performance 0.</p>
      <p>An agent with no penalty achieves performance 42. With
the inaction baseline, the discounted penalties achieve
performance 0, which corresponds to the offsetting behavior of
moving the vase off the belt and then putting it back on, shown
in Figure 4. The undiscounted versions avoid this behavior,
since the action of taking the vase off the belt is reversible and
thus is not penalized at all, so there is nothing to offset. All
penalties with the stepwise inaction baseline perform well on
this environment, showing that this baseline does not produce
offsetting.</p>
      <p>Box environment (Figure 6). This environment has 60
states. The safe longer path to the goal achieves performance
43, while the unsafe shorter path that puts the box in the corner
achieves performance -5.</p>
      <p>An agent with no penalty takes the unsafe path. For all
baselines, RR achieves optimal performance 43, while UR achieves
performance -5. This happens because the UR measure is not
magnitude-sensitive, and thus does not distinguish between
irreversible actions that result in recoverable and irrecoverable
box positions.</p>
      <p>Overall, the combinations of design choices that perform
best across all environments are RR with the stepwise
inaction baseline and undiscounted RR with the inaction baseline.
Since undiscounted RR only penalizes irreversible transitions,
a penalty that aims to penalize reversible transitions has to be
combined with the stepwise inaction baseline.</p>
      <sec id="sec-7-1">
        <title>4 Additional related work</title>
        <p>Safe exploration. Safe exploration methods prevent the
agent from taking harmful actions by enforcing safety
constraints [Turchetta et al., 2016], penalizing risk [Chow et al.,
2015], using intrinsic motivation [Lipton et al., 2016],
preserving reversibility [Moldovan and Abbeel, 2012; Eysenbach et
al., 2017], etc. Explicitly defined constraints or safe regions
tend to be task-specific and require significant human input,
so they do not provide a general solution to the side effects
problem. Penalizing risk and intrinsic motivation can help
the agent avoid low-reward states (such as getting trapped or
damaged), but do not discourage the agent from damaging the
environment if this is not accounted for in the reward
function. Reversibility-preserving methods have interference and
magnitude insensitivity issues as discussed in Section 2.</p>
        <p>Side effects criteria using state features. Zhang et al.
[2018] assume a factored MDP where the agent is allowed
to change some of the features and propose a criterion for
querying the supervisor about changing other features in order
to allow for intended effects. Shah et al. [2019] define an
auxiliary reward for avoiding side effects in terms of state
features by assuming that the starting state of the environment
is already organized according to human preferences. Since
the latter method uses the starting state as a baseline, we
would expect it to produce interference behavior in dynamic
environments. While these approaches are promising, they
are not general in their present form due to reliance on state
features.</p>
        <p>Empowerment. Our RR measure is related to
empowerment [Klyubin et al., 2005; Salge et al., 2014; Gregor et al.,
2017], a measure of the agent’s control over its environment,
defined as the highest possible mutual information between the
agent’s actions and the future state. Empowerment measures
the agent’s ability to reliably reach many states, while RR
penalizes the reduction in reachability of states. Maximizing
empowerment would encourage the agent to avoid irreversible
side effects, but would also incentivize interference behavior,
and it is unclear to us how to define an empowerment-based
measure that would avoid this. One possibility is to penalize
the reduction in empowerment between the current state $s_t$ and
the baseline $s'_t$. However, empowerment is indifferent between
these two cases: A) the same states are reachable from $s_t$ and
$s'_t$, and B) a state $x$ is reachable from $s'_t$ but unreachable from
$s_t$, while another state $y$ is reachable from $s_t$ but unreachable
from $s'_t$. Thus, penalizing reduction in empowerment would
miss some side effects: e.g. if the agent replaced the sushi on
the conveyor belt with a vase, empowerment could remain the
same, so the agent is not penalized for destroying the vase.</p>
        <p>Uncertainty about the objective. Inverse Reward
Design [Hadfield-Menell et al., 2017] incorporates uncertainty
about the objective by considering alternative reward
functions that are consistent with the given reward function in the
training environment. This helps avoid some side effects that
stem from distributional shift. However, this method assumes
that the given reward function is correct for the training
environment, and so does not prevent side effects caused by a
reward function that is misspecified in that environment.
Quantilization [Taylor, 2016] incorporates uncertainty by taking
actions from the top quantile of actions. These methods help
to prevent side effects, but do not provide a way to quantify
side effects.</p>
        <p>Human oversight. An alternative to specifying a side
effects penalty is to teach the agent to avoid side effects through
human oversight, such as inverse reinforcement learning [Ng
and Russell, 2000; Hadfield-Menell et al., 2016],
demonstrations [Abbeel and Ng, 2004], or human feedback [Christiano
et al., 2017; Saunders et al., 2017]. It is unclear how well an
agent can learn a general heuristic for avoiding side effects
from human oversight. We expect this to depend on the
diversity of settings in which it receives oversight and its ability to
generalize from those settings, which are difficult to quantify.
We expect that an intrinsic penalty for side effects would be
more robust and more reliably result in avoiding them. Such
a penalty could also be combined with human oversight to
decrease the amount of human input required for an agent to
learn human preferences.</p>
      </sec>
      <sec id="sec-7-2">
        <title>5 Conclusions</title>
        <p>We have outlined possible bad incentives (interference and
offsetting) that can arise from a poor choice of baseline, and
a magnitude insensitivity property that can arise from a poor
choice of deviation measure, and proposed design choices that
avoid these issues in preliminary experiments. There are many
possible directions for follow-up work:</p>
        <p>Scalable implementation. The RR measure in its exact
form is not tractable for environments more complex than
gridworlds. A more practical implementation could be
computed over some set of representative states instead of all
states. For example, the agent could learn a set of auxiliary
policies for reaching distinct states, similarly to the method
for approximating empowerment in Gregor et al. [2017].</p>
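        <p>One way to read the representative-states idea is sketched below for a toy deterministic environment. It assumes that the reachability of a state is a discount factor raised to the shortest path length (zero if the state is unreachable) and that the deviation is the average truncated loss of reachability relative to the baseline. The graph, the function names, and the discount value are hypothetical choices for illustration, and exact breadth-first search stands in for the learned auxiliary policies mentioned above.</p>
        <preformat><![CDATA[
```python
from collections import deque


def discounted_reachability(transitions, start, gamma=0.9):
    """Map each state reachable from `start` to gamma ** shortest_path_length."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in transitions.get(state, ()):
            if nxt not in dist:
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return {s: gamma ** d for s, d in dist.items()}


def relative_reachability(transitions, current, baseline, states, gamma=0.9):
    """Average reduction in reachability of `states` relative to the baseline.

    `states` may be the full state space or a smaller representative subset.
    """
    reach_cur = discounted_reachability(transitions, current, gamma)
    reach_base = discounted_reachability(transitions, baseline, gamma)
    # Unreachable states get reachability 0; gains in reachability are truncated.
    total = sum(
        max(reach_base.get(s, 0.0) - reach_cur.get(s, 0.0), 0.0) for s in states
    )
    return total / len(states)


# Hypothetical toy graph: A and B are mutually reachable, while the move
# from A to X is irreversible (X and Y cannot lead back to A or B).
transitions = {"A": ["B", "X"], "B": ["A"], "X": ["Y"], "Y": ["X"]}

exact = relative_reachability(transitions, "X", "A", ["A", "B", "X", "Y"])
approx = relative_reachability(transitions, "X", "A", ["A", "Y"])
```
]]></preformat>
        <p>Here the estimate over the subset is cheaper to compute than the exact value, and its quality depends on how well the chosen states represent the full state space.</p>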
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Better choices of baseline</title>
      <p>Using noop actions to define inaction for the stepwise inaction baseline can be problematic.
For example, when driving a car on a winding road, the default
outcome of a noop is a crash, so the agent would not be
penalized for spilling coffee in the car. This could be avoided
using a better inaction baseline, such as following the road, but
this can be challenging to define in a task-independent way.</p>
    </sec>
    <sec id="sec-9">
      <title>Weights over the state space</title>
      <p>In practice, we often value the reachability of some states much more than others. This
could be incorporated into the relative reachability measure
by adding a weight w<sub>s</sub> for each state s in the sum. Such
weights could be learned through human feedback methods,
e.g. Christiano et al. [2017].</p>
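      <p>As a sketch of what such a weighting might look like, assume the unweighted measure averages the truncated reachability difference over all states, with R(x; s) denoting the reachability of s from x, s_t the current state, and s'_t the baseline state. This notation is an assumption here and should be checked against the definition of relative reachability given earlier. The weighted variant could then be written as
        <disp-formula><tex-math><![CDATA[
d^{w}_{RR}(s_t; s'_t) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} w_s \, \max\bigl( R(s'_t; s) - R(s_t; s),\; 0 \bigr)
]]></tex-math></disp-formula>
      Setting w_s = 1 for every state recovers the unweighted measure under this assumption.</p>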
      <p>Theory. There is a need for theoretical work on
characterizing and formalizing undesirable incentives that arise from
different design choices in penalizing side effects.</p>
      <p>We hope this work lays the foundations for a practical
methodology for avoiding side effects that would scale well to
more complex environments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ng</surname>
          </string-name>
          .
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and
          <string-name>
            <given-names>Dan</given-names>
            <surname>Mané</surname>
          </string-name>
          .
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv preprint arXiv:1606.06565</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Armstrong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Levinstein</surname>
          </string-name>
          .
          <article-title>Low impact artificial intelligences</article-title>
          .
          <source>arXiv preprint arXiv:1705.10720</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Yinlam</given-names>
            <surname>Chow</surname>
          </string-name>
          , Aviv Tamar, Shie Mannor, and
          <string-name>
            <given-names>Marco</given-names>
            <surname>Pavone</surname>
          </string-name>
          .
          <article-title>Risk-sensitive and robust decision-making: a CVaR optimization approach</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          , pages
          <fpage>1522</fpage>
          -
          <lpage>1530</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Paul</given-names>
            <surname>Christiano</surname>
          </string-name>
          , Jan Leike, Tom B Brown, Miljan Martic, Shane Legg, and
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          .
          <article-title>Deep reinforcement learning from human preferences</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Eysenbach</surname>
          </string-name>
          , Shixiang Gu, Julian Ibarz, and
          <string-name>
            <given-names>Sergey</given-names>
            <surname>Levine</surname>
          </string-name>
          .
          <article-title>Leave no Trace: Learning to reset for safe and autonomous reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1711.06782</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Jaime F.</given-names>
            <surname>Fisac</surname>
          </string-name>
          , Anayo K. Akametalu, Melanie Nicole Zeilinger, Shahab Kaynama, Jeremy H. Gillula, and
          <string-name>
            <given-names>Claire J.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          .
          <article-title>A general safety framework for learning-based control in uncertain robotic systems</article-title>
          .
          <source>arXiv preprint arXiv:1705.01292</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Javier</given-names>
            <surname>García</surname>
          </string-name>
          and
          <string-name>
            <given-names>Fernando</given-names>
            <surname>Fernández</surname>
          </string-name>
          .
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          ,
          <volume>16</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Jeremy H.</given-names>
            <surname>Gillula</surname>
          </string-name>
          and
          <string-name>
            <given-names>Claire J.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          .
          <article-title>Guaranteed safe online learning via reachability: tracking a ground target using a quadrotor</article-title>
          .
          <source>In IEEE International Conference on Robotics and Automation (ICRA)</source>
          , pages
          <fpage>2723</fpage>
          -
          <lpage>2730</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Karol</given-names>
            <surname>Gregor</surname>
          </string-name>
          , Danilo Jimenez Rezende, and
          <string-name>
            <given-names>Daan</given-names>
            <surname>Wierstra</surname>
          </string-name>
          .
          <article-title>Variational intrinsic control</article-title>
          .
          <source>International Conference on Learning Representations (ICLR) Workshop</source>
          , arXiv preprint arXiv:1611.07507,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Anca Dragan, Pieter Abbeel, and Stuart Russell.
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Smitha Milli, Pieter Abbeel, Stuart J. Russell, and
          <string-name>
            <given-names>Anca D.</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Inverse reward design</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          , pages
          <fpage>6768</fpage>
          -
          <lpage>6777</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Alexander S.</given-names>
            <surname>Klyubin</surname>
          </string-name>
          , Daniel Polani, and
          <string-name>
            <given-names>Chrystopher L.</given-names>
            <surname>Nehaniv</surname>
          </string-name>
          .
          <article-title>All else being equal be empowered</article-title>
          .
          <source>In European Conference on Artificial Life (ECAL)</source>
          , pages
          <fpage>744</fpage>
          -
          <lpage>753</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , Miljan Martic, Victoria Krakovna, Pedro A. Ortega, Tom Everitt,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lefrancq</surname>
          </string-name>
          , Laurent Orseau, and Shane Legg.
          <article-title>AI safety gridworlds</article-title>
          .
          <source>arXiv preprint arXiv:1711.09883</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Zachary C.</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianfeng</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lihong</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jianshu</given-names>
            <surname>Chen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Li</given-names>
            <surname>Deng</surname>
          </string-name>
          .
          <article-title>Combating reinforcement learning's Sisyphean curse with intrinsic fear</article-title>
          .
          <source>arXiv preprint arXiv:1611.01211</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
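          <string-name>
            <given-names>John</given-names>
            <surname>McCarthy</surname>
          </string-name>
          and
          <string-name>
            <given-names>Patrick J.</given-names>
            <surname>Hayes</surname>
          </string-name>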
          .
          <article-title>Some philosophical problems from the standpoint of artificial intelligence</article-title>
          .
          <source>In Machine Intelligence</source>
          , pages
          <fpage>463</fpage>
          -
          <lpage>502</lpage>
          . Edinburgh University Press,
          <year>1969</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Ian M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , Alexandre M. Bayen, and
          <string-name>
            <given-names>Claire J.</given-names>
            <surname>Tomlin</surname>
          </string-name>
          .
          <article-title>A time-dependent Hamilton-Jacobi formulation of reachable sets for continuous dynamic games</article-title>
          .
          <source>IEEE Transactions on Automatic Control</source>
          ,
          <volume>50</volume>
          (
          <issue>7</issue>
          ):
          <fpage>947</fpage>
          -
          <lpage>957</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>Teodor Mihai</given-names>
            <surname>Moldovan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          .
          <article-title>Safe exploration in Markov decision processes</article-title>
          .
          <source>In International Conference on Machine Learning (ICML)</source>
          , pages
          <fpage>1451</fpage>
          -
          <lpage>1458</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Ng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Russell</surname>
          </string-name>
          .
          <article-title>Algorithms for inverse reinforcement learning</article-title>
          .
          <source>In International Conference on Machine Learning</source>
          , pages
          <fpage>663</fpage>
          -
          <lpage>670</lpage>
          ,
          <year>2000</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Pedro</given-names>
            <surname>Ortega</surname>
          </string-name>
          , Vishal Maini, et al.
          <article-title>Building safe artificial intelligence: specification, robustness, and assurance</article-title>
          .
          <source>DeepMind Safety Research Blog</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Martin</given-names>
            <surname>Pecka</surname>
          </string-name>
          and
          <string-name>
            <given-names>Tomas</given-names>
            <surname>Svoboda</surname>
          </string-name>
          .
          <article-title>Safe exploration techniques for reinforcement learning - an overview</article-title>
          .
          <source>In International Workshop on Modelling and Simulation for Autonomous Systems</source>
          , pages
          <fpage>357</fpage>
          -
          <lpage>375</lpage>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>Christoph</given-names>
            <surname>Salge</surname>
          </string-name>
          , Cornelius Glackin, and Daniel Polani.
          <article-title>Empowerment - an introduction</article-title>
          .
          <source>In Guided Self-Organization: Inception</source>
          , pages
          <fpage>67</fpage>
          -
          <lpage>114</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>William</given-names>
            <surname>Saunders</surname>
          </string-name>
          , Girish Sastry, Andreas Stuhlmueller, and
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Trial without error: Towards safe reinforcement learning via human intervention</article-title>
          .
          <source>arXiv preprint arXiv:1707.05173</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Rohin</given-names>
            <surname>Shah</surname>
          </string-name>
          , Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Preferences implicit in the state of the world</article-title>
          .
          <source>In International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Taylor</surname>
          </string-name>
          , Eliezer Yudkowsky, Patrick LaVictoire, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Critch</surname>
          </string-name>
          .
          <article-title>Alignment for advanced machine learning systems</article-title>
          .
          <source>Technical report, Machine Intelligence Research Institute</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <string-name>
            <given-names>Jessica</given-names>
            <surname>Taylor</surname>
          </string-name>
          .
          <article-title>Quantilizers: A safer alternative to maximizers for limited optimization</article-title>
          .
          <source>In AAAI Workshop on AI, Ethics, and Society</source>
          , pages
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <given-names>Matteo</given-names>
            <surname>Turchetta</surname>
          </string-name>
          , Felix Berkenkamp, and
          <string-name>
            <given-names>Andreas</given-names>
            <surname>Krause</surname>
          </string-name>
          .
          <article-title>Safe exploration in finite Markov decision processes with Gaussian processes</article-title>
          .
          <source>In Neural Information Processing Systems (NIPS)</source>
          , pages
          <fpage>4305</fpage>
          -
          <lpage>4313</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <given-names>Alexander</given-names>
            <surname>Turner</surname>
          </string-name>
          , Dylan Hadfield-Menell, and
          <string-name>
            <given-names>Prasad</given-names>
            <surname>Tadepalli</surname>
          </string-name>
          .
          <article-title>Conservative agency via attainable utility preservation</article-title>
          .
          <source>arXiv preprint arXiv:1902.09725</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>Shun</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Edmund H. Durfee, and
          <string-name>
            <given-names>Satinder P.</given-names>
            <surname>Singh</surname>
          </string-name>
          .
          <article-title>Minimax-regret querying on side effects for safe optimality in factored Markov decision processes</article-title>
          .
          <source>In International Joint Conference on Artificial Intelligence (IJCAI)</source>
          , pages
          <fpage>4867</fpage>
          -
          <lpage>4873</lpage>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>