        Challenges for Using Impact Regularizers to Avoid Negative Side Effects
David Lindner,1* Kyle Matoba,2* Alexander Meulemans3*
1 Department of Computer Science, ETH Zurich
2 Idiap and EPFL
3 Institute of Neuroinformatics, University of Zurich and ETH Zurich
                               david.lindner@inf.ethz.ch, kyle.matoba@epfl.ch, ameulema@ethz.ch



* The authors contributed equally.
Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges remain. In this paper, we examine the main current challenges of impact regularizers and relate them to fundamental design decisions. We discuss in detail which challenges recent approaches address and which remain unsolved. Finally, we explore promising directions to overcome the unsolved challenges in preventing negative side effects with impact regularizers.

1 Introduction

Specifying a reward function in reinforcement learning (RL) that completely aligns with the designer's intent is a difficult task. Besides specifying what is important to solve the task at hand, the designer also needs to specify how the AI system should behave in the environment in general, which is hard to cover fully. For example, RL agents playing video games often learn to achieve a high score without solving the desired task by exploiting the game (e.g. Saunders et al. 2018). Side effects occur when the behavior of the AI system diverges from the designer's intent because of considerations that were not anticipated beforehand, such as the possibility of exploiting a game. In this work, we focus on side effects that are tied to the reward function, which we define as side effects that would still occur if we had access to an oracle that finds an optimal policy for a given reward function. We explicitly do not consider side effects resulting from the RL algorithm used, which are often discussed under the term safe exploration (García and Fernández 2015).

In practice, the designer typically goes through several iterations of reward specification to optimize the agent's performance and minimize side effects. This is often a tedious process, and there is no guarantee that the agent will not exhibit side effects when it encounters new situations. In fact, such problems with misspecified reward functions have been observed in various practical applications of RL (Krakovna et al. 2020b).

In most situations, it is useful to decompose the reward R(s) into a task-related component Rtask(s) and an environment-related component Renv(s), where the latter specifies how the agent should behave in the environment, regardless of the task.¹ As Shah et al. (2019) observe, Renv is related to the frame problem in classical AI (McCarthy and Hayes 1969): we not only have to make a prediction about what is supposed to change, but also about what is supposed to remain unchanged. Renv is more prone to misspecification, because it needs to specify everything beyond the task that can result in undesired outcomes. Because the designer builds an RL agent to solve a specific problem, it is relatively easy to anticipate considerations directly related to solving the task in Rtask. Shah et al. (2019) point out that environments are generally already optimized for humans; hence, defining Renv primarily requires specifying which features of the environment the AI system should not disturb. Therefore, penalizing large changes in the current state of the world can be thought of as a coarse approximation of Renv.

¹ We write the reward function only as a function of states for simplicity, as the state space can be formally extended to include the last action.

Impact regularization (IR) has emerged as a tractable and effective way to approximate Renv (Armstrong and Levinstein 2017; Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). The main idea behind IR is to approximate Renv through a measure of "impact on the environment", which avoids negative side effects and reduces the burden on the reward designer.

In this paper, we discuss IR of the form

    R(st) = Rspec(st) − λ · d(st, b(s0, st−1, t))    (1)

where st denotes the state at time step t, Rspec denotes the reward function specified by the designer,² and:

• the baseline b(s0, st−1, t) provides a state obtained by following a "default" or "safe" policy at timestep t, and uses either the initial state and the current time (s0, t) or the previous state st−1 to compute it,
• d measures the deviation of the realized state from the baseline state, and
• λ ≥ 0 gives a global scale at which to trade off the specified reward and the regularization.

² Rspec contains the specified parts of both Rtask and Renv.

Composing these three terms gives a general formulation of regularization that encompasses most proposals found in the literature, but permits separate analysis (Krakovna et al. 2019).
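To make eq. (1) concrete, the following minimal Python sketch composes the three components. All names (regularized_reward, r_spec, deviation, baseline) are illustrative placeholders rather than part of any published implementation, and the toy instantiation assumes a one-dimensional state for readability.

def regularized_reward(r_spec, deviation, baseline, lam, s0, s_prev, s_t, t):
    """R(s_t) = R_spec(s_t) - lambda * d(s_t, b(s0, s_prev, t)), cf. eq. (1)."""
    b_t = baseline(s0, s_prev, t)       # "default"/"safe" comparison state
    return r_spec(s_t) - lam * deviation(s_t, b_t)

# Illustrative instantiation on a toy environment whose state is a single number.
r_spec = lambda s: float(s > 0)         # task reward: reach a positive state
deviation = lambda s, b: abs(s - b)     # naive deviation: distance to the baseline state
baseline = lambda s0, s_prev, t: s0     # inaction-style baseline that ignores the dynamics

print(regularized_reward(r_spec, deviation, baseline, lam=0.1,
                         s0=0.0, s_prev=0.5, s_t=1.0, t=3))   # 1.0 - 0.1 * 1.0 = 0.9

Sections 3 to 5 discuss how each of these placeholders can be instantiated and which challenges arise for each choice.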
We start by giving an overview of the related work on IR (Section 2), before we discuss the three main design decisions for IR. First, we discuss how to choose a baseline (Section 3), emphasizing considerations of environment dynamics and a tendency for agents to offset their actions. Second, we discuss how to quantify deviations from the baseline (Section 4), especially the distinction between negative, neutral, and positive side effects. Third, we discuss how to choose the scale λ (Section 5). Finally, we propose some directions to improve the effectiveness of IR (Section 6).

The main contribution of this work is to discuss in detail the main current challenges of IR, building upon previous work, and to suggest possible ways forward to overcome these challenges.

2 Related Work

Amodei et al. (2016) reviewed negative side effects as one of several problems in AI safety, and discussed using impact regularization (IR) to avoid negative side effects. Since then, several concrete approaches to IR have been proposed, of which eq. (1) gives the underlying structure. Armstrong and Levinstein (2017) proposed to measure the impact of the agent compared to the inaction baseline, starting from the initial state s0. The inaction baseline assumes the agent does nothing, which can be formalized by assuming a non-action exists.³ Armstrong and Levinstein (2017) emphasized the importance of a semantically meaningful state representation for the environment when measuring distances from the inaction baseline. While Armstrong and Levinstein (2017) discussed the problem of measuring the impact of an agent abstractly, Krakovna et al. (2019) proposed a concrete deviation measure called Relative Reachability (RR). RR measures the average reduction in the number of states reachable from the current state, compared to a baseline state. This captures the intuition that irreversible changes to the environment should be penalized more, but has advantages over directly using irreversibility as a measure of impact (as, e.g., in Eysenbach et al. (2018)), such as allowing one to quantify the magnitude of different irreversible changes.

³ Armstrong and Levinstein (2017) define this baseline as the state the environment would be in if the agent had never been deployed. This is slightly different from the definition of the inaction baseline we give here and that later work used, as the mere presence of the agent can influence the environment.

Turner, Hadfield-Menell, and Tadepalli (2020) and Krakovna et al. (2019) generalized the concept of RR towards Attainable Utility (AU) and Value Difference (VD) measures respectively, which both share the same structural form for the deviation measure:

    dVD(st, s′t) = Σ_{x=1}^{X} wx f(Vx(s′t) − Vx(st)),    (2)

where x ranges over some sources of value, Vx(st) is the value of state st according to x, wx is its weight in the sum, and f is a function characterizing the deviation between the values. AU is a special case of this with wx = 1/X for all x and the absolute value operator as f. This formulation captures the same intuition as RR, but allows measuring the impact of the agent in terms of different value functions, instead of just counting states. Concretely, AU aims to measure the agent's ability to achieve high utility on a range of different goals in the environment, and penalizes any change that reduces this ability. Turner, Hadfield-Menell, and Tadepalli (2020) also introduced the stepwise inaction baseline to mitigate offsetting behavior (c.f. Section 3.2). This baseline follows an inaction policy starting from the previous state st−1 rather than the starting state s0. Follow-up work scaled AU towards more complex environments (Turner, Ratzlaff, and Tadepalli 2020).
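As an illustration of eq. (2), the sketch below computes a value-difference deviation for a list of auxiliary value functions and recovers AU as the special case with uniform weights and the absolute value as f. The toy value functions are invented stand-ins; in the cited methods they would correspond to learned or auxiliary-reward value estimates.

def value_difference(s_t, s_baseline, value_fns, weights, f):
    """d_VD from eq. (2): weighted deviations between auxiliary values."""
    return sum(w * f(V(s_baseline) - V(s_t))
               for V, w in zip(value_fns, weights))

def attainable_utility(s_t, s_baseline, value_fns):
    """AU as a special case: uniform weights and f = absolute value."""
    X = len(value_fns)
    return value_difference(s_t, s_baseline, value_fns,
                            weights=[1.0 / X] * X, f=abs)

# Toy example: states are 2-D tuples and the auxiliary "value functions" are
# simple hand-written features standing in for learned value estimates.
value_fns = [lambda s: s[0], lambda s: -s[1], lambda s: s[0] + s[1]]
print(attainable_utility(s_t=(1.0, 2.0), s_baseline=(0.0, 0.0), value_fns=value_fns))  # 2.0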
Krakovna et al. (2020a) built upon the VD measure and introduced an auxiliary loss representing how well the agent could solve future tasks in the same environment, given its current state. This can be seen as a deviation measure in eq. (1) that rewards similarity with a baseline instead of penalizing deviation from it. The approach of Eysenbach et al. (2018) to penalize irreversibility can be seen as a special case of Krakovna et al. (2020a).

Aside from IR, Rahaman et al. (2019) proposed to learn an arrow of time, representing a directed measure of reachability, using the intuition that irreversible actions tend to leave the environment in a more disorderly state, making it possible to define an arrow of time with methods inspired by thermodynamics. As another alternative to IR, Zhang, Durfee, and Singh (2018, 2020) proposed to learn which environmental features an AI system is allowed to change by querying a human overseer. They provided an active querying approach that makes maximally informative queries. Shah et al. (2019) developed a method for learning which parts of the environment a human cares about by assuming that the world is optimized to suit humans. Saisubramanian, Kamar, and Zilberstein (2020) formulated the side effects problem as a multi-objective Markov Decision Process, where they learn a separate reward function penalizing negative side effects and optimize this secondary objective while staying close to the optimal policy of the task objective. Saisubramanian, Zilberstein, and Kamar (2020) provide a broad overview of the various existing approaches for mitigating negative side effects, while we zoom in on one class of approaches, IR, and discuss the corresponding challenges in detail.

3 Choosing a Baseline

Recent work mainly uses two types of baselines in impact regularization (IR): (i) the inaction baseline b(s0, st, t) = T(st | s0, πinaction) and (ii) the stepwise inaction baseline b(s0, st, t) = T(st | st−1, πinaction), where T is the distribution over states st when starting at state s0 or st−1, respectively, and following the inaction policy πinaction that always takes an action anop that does nothing.
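The difference between the two baselines can be illustrated with a few lines of Python, assuming access to a deterministic simulator step transition(s, a) and a no-op action; with stochastic dynamics, the single rollouts below would be replaced by the distribution T described above. All names are illustrative.

# Sketch of the two baselines for a deterministic simulator.
# transition(state, action) -> next state; a_noop is the "do nothing" action.

def inaction_baseline(transition, s0, t, a_noop):
    """Follow the inaction policy for t steps starting from the initial state s0."""
    s = s0
    for _ in range(t):
        s = transition(s, a_noop)
    return s

def stepwise_inaction_baseline(transition, s_prev, a_noop):
    """Branch off from the previous state s_{t-1} and take one no-op step."""
    return transition(s_prev, a_noop)

# Toy dynamics: the state drifts by +1 each step regardless of the action,
# and the chosen action is added on top (a_noop = 0).
transition = lambda s, a: s + 1 + a

print(inaction_baseline(transition, s0=0, t=3, a_noop=0))          # 3
print(stepwise_inaction_baseline(transition, s_prev=7, a_noop=0))  # 8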
Unfortunately, the inaction baseline can lead to undesirable offsetting behavior, where the agent tries to undo the outcomes of its task after collecting the reward, moving back closer to the initial baseline (Turner, Hadfield-Menell, and Tadepalli 2020). The stepwise inaction baseline removes the offsetting incentive of the agent by branching off from the previous state instead of the starting state (Turner, Hadfield-Menell, and Tadepalli 2020). However, Krakovna et al. (2020a) argued that offsetting behavior is desirable in many cases. In Section 3.2, we contribute to this discussion by breaking down in detail when offsetting behavior is desirable or undesirable, whereas in Section 3.3, we argue that the inaction baseline and stepwise inaction baseline can lead to inaction incentives in nonlinear dynamical environments. We start, however, with the fundamental observation that the inaction baseline and stepwise inaction baseline do not always represent safe policies in Section 3.1.

3.1 Inaction Baselines are not Always Safe

The baseline used in IR should represent a safe policy where the AI system does not harm its environment or itself. In many cases, taking no actions would be a safe policy for the agent, e.g. for a cleaning robot. However, if the AI system is responsible for a task requiring continuous control, inaction of the AI system can be disastrous. For example, if the agent is responsible for driving a car on a highway, doing nothing likely results in a crash. This is particularly problematic for the stepwise inaction baseline, which follows an inaction policy starting from the previous state. The inaction policy starting from the initial state can also be unsafe, for example, if an agent takes over control of the car from a human, and therefore the initial state s0 already has the car driving.

For this reason, designing a safe baseline for a task or environment that requires continuous control is a hard problem. One possible approach is to design a policy that is known to be safe based on expert knowledge. However, this can be a time-consuming process, and it is not always feasible. Designing safe baselines for tasks and environments that require continuous control is an open problem that has to be solved before IR can be used in these applications.

3.2 Offsetting

An agent engages in offsetting behavior when it tries to undo the outcomes of previous actions, i.e. when it "covers up its tracks". Offsetting behavior can be desirable or undesirable, depending on which outcomes the agent counteracts.

Undesirable Offsetting. Using IRs with an inaction baseline starting from the initial state can lead to undesirable offsetting behavior where the agent counteracts the outcomes of its task (Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). For example, Krakovna et al. (2019) consider a vase on a conveyor belt. The agent is rewarded for taking the vase off the belt, hence preventing it from falling off the belt. The desired behavior is to take the vase and stay put. The offsetting behavior is to take the vase off the belt, collect the reward, and afterwards put the vase back on the conveyor belt to reduce deviation from the baseline. To understand this offsetting behavior, recall the decomposition of the true reward into a task-related and an environment-related component from Section 1. A designer usually specifies a task reward Rspec^task that rewards states signaling task completion (e.g. taking the vase off the belt). However, each task has consequences for the environment, which often are the reason why the task should be completed in the first place (e.g. the vase being not broken). In all but simple tasks, assigning a reward to every task consequence is impossible, and so, by omission, they have a zero reward. When IR penalizes consequences of completing the task, because they differ from the baseline, this results in undesirable offsetting behavior. The stepwise inaction baseline (Turner, Ratzlaff, and Tadepalli 2020) successfully removes all offsetting incentives. However, in other situations offsetting might be desired.

Desirable Offsetting. In many cases, offsetting behavior is desired, because it can prevent unnecessary side effects. Krakovna et al. (2020a) provide an example of an agent which is asked to go shopping, and needs to open the front door of the house to go to the shop. If the agent leaves the door open, wind from outside can knock over a vase inside, which the agent can prevent by closing the door after leaving the house. When using the stepwise inaction baseline (with rollouts, c.f. Section 4.2), the agent gets penalized once when opening the door for knocking over the vase in the future, independent of whether it closes the door afterwards (and thus prevents the vase from breaking) or not. Hence, for this example, the offsetting behavior (closing the door) is desirable. The reasoning behind this example can be generalized to all cases where the offsetting behavior concerns states that are instrumental towards achieving the task (e.g. opening the door) and not a consequence of completing the task (e.g. the vase being not broken).

A Crucial Need for a New Baseline. The recently proposed baselines either remove offsetting incentives altogether or allow for both undesirable and desirable offsetting to occur, which are both unsatisfactory solutions. Krakovna et al. (2020a) proposed resolving this issue by allowing all offsetting (e.g. by using the inaction baseline) and rewarding all states where the task is completed in the specified reward function. However, we see three important downsides to this approach. First, states that occur after task completion can still have negative side effects. If the reward associated with these states is high enough to prevent offsetting, it might also be high enough to encourage the agent to pursue these states and ignore their negative side effects. Second, not all tasks have a distinct goal state that indicates the completion of a task; some instead accumulate task-related rewards at various time steps during an episode. Third, this approach creates a new incentive for the agent to prevent shutdown, as it continues to get rewards after the task is completed (Hadfield-Menell et al. 2017).

We conclude that offsetting is still an unsolved problem, highlighting the need for a new baseline that prevents undesirable offsetting behavior but allows for desirable offsetting.

3.3 Environment Dynamics and Inaction Incentives

In dynamic environments that are highly sensitive to the agent's actions, the agent will be susceptible to inaction incentives. Either the agent does not act at all (for all but small magnitudes of λ) or it will be insufficiently regularized and possibly cause undesired side effects (for small λ).

Sensitivity to Typical Actions. Many real-world environments exhibit chaotic behavior, in which the state of the environment is highly sensitive to small perturbations. In such environments, the environment state where the agent has performed an action will be fundamentally different from the environment state for the inaction baseline (Armstrong and Levinstein 2017). Furthermore, for the stepwise inaction baseline, the same argument holds for the non-action compared to the planned action of the agent. Hence, when using these baselines for IR, all actions of the agent will be strongly regularized, creating the inaction incentive. When λ is lowered to allow the agent to take actions, the agent can cause negative side effects when the IR cannot differentiate between negative side effects and chaotic changes in the environment. Here, it is useful to distinguish between typical and atypical actions. We say (informally) that an action is typical if it is commonly used for solving a wide variety of tasks (e.g. moving). When the environment is highly sensitive to typical actions, IRs with the current baselines will prevent the agent from engaging in normal operations. However, it is not always a problem if the environment is highly sensitive to atypical actions of the agent (e.g. discharging onboard weaponry), as preventing atypical actions interferes less with the normal operation of the agent.

Capability of the Agent. The inaction incentive will become more apparent for agents that are highly capable of predicting the detailed consequences of their actions, for example by using a powerful physics engine. As the ability to predict the consequences of an action is fundamental to minimizing side effects, limiting the prediction capabilities of an agent to prevent the inaction incentive is not desired. Rather, for agents that can very accurately predict the implications of their actions, it is necessary to have an accompanying intelligent impact regularizer.

State Features. Armstrong and Levinstein (2017) point out that for IR one should not represent states with overly fine-grained features, as presenting an agent with too much information exposes it to basing decisions on irrelevancies. For example, it would be counterproductive for an agent attempting to forecast demand in an online sales situation to model each potential customer separately, when broader aggregates would suffice. However, there remain two issues with this approach to mitigating the inaction incentive. First, the intrinsic dynamics of the environment remain unchanged, so it is still highly sensitive to small perturbations, of which the results can be visible in the coarser features (e.g. the specific weather conditions). Second, for advanced AI systems, it might be beneficial to change their feature representation to become more capable of predicting the consequences of their actions. In this case, one would have no control over the granularity of the features.

Deviation Measures. At the core of the inaction problem is that some negative side effects are worse than others. Usually, it does not matter if the agent changes the weather conditions by moving around; however, it would matter if the agent causes a serious negative side effect, for example a hurricane. While both outcomes can be a result of complex and chaotic dynamics of the environment, we care less about the former and more about the latter. Differentiating between negative, neutral, and positive side effects is a task of the deviation measure used in the IR, which is discussed in the next section.

4 Choosing a Deviation Measure

A baseline defines a "safe" counterfactual to the agent's actions. The deviation measure determines how much a deviation from this baseline by the agent should be penalized or rewarded. Currently, the main approaches to a deviation measure are the relative reachability (RR) measure (Krakovna et al. 2019), the attainable utility (AU) measure (Turner, Hadfield-Menell, and Tadepalli 2020), and the future task (FT) reward (Krakovna et al. 2020a). In their practical implementations, AU and FT use reachability tasks, which can be considered a sub-sampling of RR. In this section, we argue that the current deviation measures should be augmented with a notion of the value of the impact to avoid unsatisfactory performance of the agent, and that new rollout policies should be designed that allow for a proper incorporation of delayed effects into the deviation measure.

4.1 Which Side Effects are Negative?

The goal of IRs is to approximate Renv for all states in a tractable manner. They do this by penalizing impact on the environment, built upon the assumption that the environment is already optimized for human preferences (Shah et al. 2019). The IR aims to penalize impact proportionally to its magnitude, which corresponds to the magnitude of the side effect (Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). However, not all impact is negative; it can also be neutral or even positive. Renv does not only consider the magnitude of the impact on the environment, but also the degree to which this impact is negative, neutral, or positive. Neglecting the associated value of impact can lead to suboptimal agent behavior, as highlighted in the example below.

Example: The Chemical Production Plant. Consider an AI system controlling a plant producing a chemical product for which various unknown reactions exist, each producing a different combination of waste products. The task of the AI system is to optimize the production rate of the plant, i.e. it gets a reward proportional to the production rate. To minimize the impact of the plant on the environment, the reward function of the agent is augmented with an impact regularizer, which penalizes the mass of waste products released into the environment, compared to an inaction baseline (where the plant is not operational). Some waste products are harmless (e.g. O2), whereas others can be toxic. When the deviation measure of the impact regularizer does not differentiate between negative, neutral, or positive impact, the AI system is incentivized to use a reaction mechanism that maximizes production while minimizing waste. However, this reaction might output mostly toxic waste products, whereas another reaction outputs only harmless waste products and hence has no negative side effects. Tuning the regularizer magnitude λ does not provide a satisfactory solution in this case, as either the plant is not operational (for high λ), or the plant is at risk of releasing toxic waste products into the environment.
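To make this concrete, the sketch below contrasts a purely magnitude-based deviation (total released mass) with a hypothetical value-aware variant that weights each substance by how much the designer cares about it. The substances and harm weights are invented for illustration and are not part of any proposed method.

# States are dicts mapping waste products to released mass (kg).

def magnitude_deviation(state, baseline):
    # Penalizes total released mass, regardless of how harmful it is.
    return sum(abs(state.get(k, 0.0) - baseline.get(k, 0.0))
               for k in set(state) | set(baseline))

def value_aware_deviation(state, baseline, harm_weights):
    # Weighs each change by how much the designer cares about that substance.
    return sum(harm_weights.get(k, 1.0) * abs(state.get(k, 0.0) - baseline.get(k, 0.0))
               for k in set(state) | set(baseline))

baseline = {}                                  # plant not operational: no waste released
reaction_a = {"toxic_sludge": 1.0}             # little waste, but all of it toxic
reaction_b = {"oxygen": 5.0, "water": 3.0}     # more waste, all of it harmless
harm_weights = {"toxic_sludge": 100.0, "oxygen": 0.0, "water": 0.0}

for name, s in [("A", reaction_a), ("B", reaction_b)]:
    print(name, magnitude_deviation(s, baseline),
          value_aware_deviation(s, baseline, harm_weights))
# Magnitude alone prefers reaction A (1.0 < 8.0); the value-aware measure prefers B (100.0 > 0.0).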
Positive Side Effects. The distinction between positive, neutral, and negative impact is not only needed to allow for satisfactory performance of the agent in many environments, it is also desirable for encouraging unanticipated positive side effects. Expanding upon the example in Section 4.1: if the agent discovered a way to costlessly sequester carbon dioxide alongside its other tasks, it should do so, whilst an IR would encourage the agent not to interfere. While very positive unexpected outcomes might be unlikely, this possibility should not be neglected in the analysis of impact regularizers.

Value Differences. To distinguish between positive, neutral, and negative side effects, we need an approximation of Renv that goes beyond measuring impact as a sole source of information. The value difference framework (Turner, Hadfield-Menell, and Tadepalli 2020) allows for differentiating between positive and negative impact by defining the deviation measure as a sum of differences in value between a baseline and the agent's state-action pair for various value functions. Hence, it is possible to reflect how much the designer values different kinds of side effects in these value functions. However, the challenge remains to design value functions that approximate Renv to a sufficient degree on the complete state space, which is again prone to reward misspecification. So although the value difference framework allows for specifying values for side effects, how to specify this notion of value is still an open problem.

4.2 Rollout Policies

Often, the actions of an agent cause delayed effects, i.e. effects that are not visible immediately after taking the action. The stepwise inaction baseline (Turner, Hadfield-Menell, and Tadepalli 2020) ignores all actions that took place before t − 1; hence, to correctly penalize delayed effects, the deviation measure needs to incorporate future effects. This can be done by collecting rollouts of future trajectories using a simulator or model of the environment. These rollouts depend on which rollout policy is followed by the agent in the simulation. For the baseline states, the inaction policy is the logical choice. For the rollout of the future effects of the agent's action, it is less clear which rollout policy should be used. Turner, Hadfield-Menell, and Tadepalli (2020) use the inaction policy in this case. Hence, this IR considers a rollout where the agent takes its current action, after which it cannot take any further actions. This approach has significant downsides, because the IR does not allow the agent to consider a series of actions when determining the impact penalty (e.g. the agent can take an action to jump, but cannot plan for its landing accordingly in the rollout). Therefore, we argue that future work should develop rollout policies different from the inaction policy, such as the current policy of the agent.
future work should develop rollout policies different from         regardless of the task.
   Goal-based Tasks. When the task consists of reaching a           in a sequence of environments that increasingly resemble
certain goal state, the consequences of performing a task can       the production environment. At each iteration, the designer
be formalized in a causal framework (Pearl 2009). When a            identifies weaknesses and corrects them, such that the cri-
causal graph of the environment-agent-interaction is avail-         terion being optimized becomes increasingly true to the de-
able, the states that are a consequence of the task can be          signer’s intent. For example, an AI with the goal to trade
obtained from the graph as the causal children nodes of the         financial assets may be run against historical data (“back-
goal state. Hence, a baseline that allows for desired offset-       tested”) in order to understand how it might have reacted
ting behavior but prevents undesired offsetting behavior pre-       in the past, and presented with deliberately extreme inputs
vents the agent from interfering with the children nodes of         (“stress-tested”) in order to understand likely behavior in
the goal states, while allowing for offsetting on other states.     “out of sample” situations. To design a reward function and
   General Tasks. Not all tasks have a distinct goal state          a regularizer, it is crucial for the designer to be able to un-
which indicates the completion of a task, but accumulate in-        derstand how the system would react in novel situations and
stead task-related rewards at various time steps during an          how to fix it in case it exhibits undesired behavior. Further
episode. Extending this argument to general tasks remains           research aiming to increase the designer’s ability to under-
an open issue, for which causal influence diagrams (Everitt         stand how a system will react, will substantially help the de-
et al. 2019) can provide a mathematical framework.                  signer to communicate their intent more effectively. Recent
                                                                    work in this direction concerning interpretability (Gilpin
6.2   Probabilities Instead of Counterfactuals as                   et al. 2018), verification (e.g. Huang et al. 2017) of machine
      Baseline                                                      learning models is particularly promising.
Armstrong and Levinstein (2017) made the interesting argu-             Actively Learning from Humans. Considering the prob-
ment that probabilities are better suited than counterfactuals      lem from the perspective of the AI system, the goal is to
for measuring the impact of actions. Current implementa-            improve its ability to understand the designer’s intent, espe-
tions of IRs use a counterfactual as baseline (e.g. the inaction    cially in novel, unanticipated, scenarios. Instead of the de-
baseline or stepwise inaction baseline). Because this base-         signer telling the system their intent, this problem can be
line is one specific trajectory, it will differ considerably from   addressed by the system asking the designer about their in-
the actual trajectory of the agent in environments that ex-         tent. To decide what to ask the designer, the system may be
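The naive tuning procedure described above can be written down directly; the sketch below illustrates, on invented evaluation functions, how a fixed decrement can jump over the narrow range of useful λ values. The functions train_agent, solves_task, and is_safe are placeholders for the designer's (typically expensive) training and evaluation loop.

def tune_lambda(train_agent, solves_task, is_safe, lam_init=10.0, step=1.0, lam_min=0.0):
    """Start with a large lambda and decrease it until the agent solves the task."""
    lam = lam_init
    while lam >= lam_min:
        agent = train_agent(lam)
        if solves_task(agent):
            return lam, is_safe(agent)
        lam -= step
    return None, False

# Toy illustration: the "agent" is represented by its lambda, the task is only
# solved for lambda < 0.7, and the behavior is only safe for lambda > 0.4,
# so the useful range is the narrow interval (0.4, 0.7).
train_agent = lambda lam: lam
solves_task = lambda agent: agent < 0.7
is_safe = lambda agent: agent > 0.4

print(tune_lambda(train_agent, solves_task, is_safe, lam_init=10.0, step=1.0))
# (0.0, False): a step of 1.0 jumps over the useful range and returns an unsafe lambda.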
6 Ways Forward

In this section, we put forward promising future research directions to overcome the challenges discussed in the previous sections.

6.1 A Causal Framing of Offsetting

In Section 3.2, we highlighted that some offsetting behavior is desired and some undesired. To design an IR that allows for desired offsetting but prevents undesired offsetting, one first needs a mechanism that can predict and differentiate between these two types of offsetting. Undesired offsetting concerns the environment states that are a consequence of the task. The difficulty lies in determining which states are a causal consequence of the task being completed and differentiating them from states that could have occurred regardless of the task.

Goal-based Tasks. When the task consists of reaching a certain goal state, the consequences of performing a task can be formalized in a causal framework (Pearl 2009). When a causal graph of the environment-agent interaction is available, the states that are a consequence of the task can be obtained from the graph as the causal children nodes of the goal state. Hence, a baseline that allows for desired offsetting behavior but prevents undesired offsetting behavior prevents the agent from interfering with the children nodes of the goal states, while allowing for offsetting on other states.

General Tasks. Not all tasks have a distinct goal state which indicates the completion of a task; some instead accumulate task-related rewards at various time steps during an episode. Extending this argument to general tasks remains an open issue, for which causal influence diagrams (Everitt et al. 2019) can provide a mathematical framework.
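For the goal-based case, the set of protected states could in principle be read off such a causal graph, assuming one is available. The sketch below uses networkx on a hypothetical graph of the conveyor-belt example from Section 3.2; the node names and edges are invented for illustration, and graph descendants are used as a stand-in for the causal children discussed above.

import networkx as nx

# Hypothetical causal graph of the environment-agent interaction.
g = nx.DiGraph()
g.add_edges_from([
    ("agent_at_belt", "vase_off_belt"),    # instrumental state -> goal state
    ("vase_off_belt", "vase_intact"),      # consequence of completing the task
    ("wind", "door_open"),                 # unrelated part of the environment
])

goal = "vase_off_belt"
task_consequences = nx.descendants(g, goal)                 # offsetting these is undesirable
other_states = set(g.nodes) - task_consequences - {goal}    # offsetting here may be acceptable

print(task_consequences)   # {'vase_intact'}
print(other_states)        # the remaining nodes, e.g. 'agent_at_belt', 'wind', 'door_open'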
6.2 Probabilities Instead of Counterfactuals as Baseline

Armstrong and Levinstein (2017) made the interesting argument that probabilities are better suited than counterfactuals for measuring the impact of actions. Current implementations of IRs use a counterfactual as baseline (e.g. the inaction baseline or stepwise inaction baseline). Because this baseline is one specific trajectory, it will differ considerably from the actual trajectory of the agent in environments that exhibit chaotic dynamics. However, chaotic environments will also be highly sensitive to perturbations that do not originate from the agent's actions. One possible way forward towards a more robust measure of the agent's impact on the environment is hence to compare probabilities that marginalize over all external perturbations, instead of comparing specific trajectories. Define p(st | A) as the probability of having state st given the trajectory of actions A the agent took, and p(st | B) as the probability of st given actions prescribed by the baseline. All influences of perturbations that did not arise from the agent are marginalized out in these probabilities. Hence, a divergence measure between these two probabilities can give a more robust measure of the potential impact of the agent, without being susceptible to unnecessary inaction incentives. To the best of our knowledge, this idea has not yet been implemented as a concrete IR method and would hence be a promising direction for future research.
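A minimal sketch of this idea on a toy stochastic environment: the marginals p(st | A) and p(st | B) are estimated by Monte Carlo rollouts that sample the external perturbations, and a divergence between them serves as the impact measure. The additive dynamics, the number of samples, and the choice of KL divergence are illustrative assumptions, not part of a proposed method.

import random
from collections import Counter
from math import log

def state_distribution(actions, n_samples=10000, seed=0):
    """Monte Carlo estimate of p(s_t | actions), marginalizing over perturbations."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        s = 0
        for a in actions:
            s += a + rng.choice([-1, 0, 1])  # external perturbation, not caused by the agent
        counts[s] += 1
    return {s: c / n_samples for s, c in counts.items()}

def kl(p, q, eps=1e-9):
    """KL divergence between two empirical distributions over discrete states."""
    return sum(pi * log(pi / q.get(s, eps)) for s, pi in p.items())

p_baseline = state_distribution(actions=[0, 0, 0], seed=0)   # B: inaction baseline
p_baseline2 = state_distribution(actions=[0, 0, 0], seed=1)  # same actions, different perturbations
p_agent = state_distribution(actions=[1, 0, 0], seed=2)      # A: the agent acts once, then waits

print(kl(p_baseline, p_baseline2))  # close to 0: the perturbations are marginalized out
print(kl(p_agent, p_baseline))      # clearly positive: only the agent's own effect remains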
6.3 Improved Human-Computer Interaction

Side effects occur if there is a difference between the outcome an AI system achieves and the intent of its (human) designer. Thus, improving how well the designer can communicate their intent to the AI system is an important aspect of eliminating side effects (Leike et al. 2018). This emphasis on the human component of learning to avoid negative side effects connects it closely to the problem of scalable oversight proposed by Amodei et al. (2016).

Improved Tools for Reward Designers. Commonly, a designer will aim to iteratively improve the AI system and its reward function. Similarly, when choosing an impact regularizer, a designer will iterate on the choice of baseline, deviation measure, and regularization strength, and test them in a sequence of environments that increasingly resemble the production environment. At each iteration, the designer identifies weaknesses and corrects them, such that the criterion being optimized becomes increasingly true to the designer's intent. For example, an AI with the goal to trade financial assets may be run against historical data ("backtested") in order to understand how it might have reacted in the past, and presented with deliberately extreme inputs ("stress-tested") in order to understand likely behavior in "out of sample" situations. To design a reward function and a regularizer, it is crucial for the designer to be able to understand how the system would react in novel situations and how to fix it in case it exhibits undesired behavior. Further research aiming to increase the designer's ability to understand how a system will react will substantially help the designer communicate their intent more effectively. Recent work in this direction on interpretability (Gilpin et al. 2018) and verification (e.g. Huang et al. 2017) of machine learning models is particularly promising.

Actively Learning from Humans. Considering the problem from the perspective of the AI system, the goal is to improve its ability to understand the designer's intent, especially in novel, unanticipated scenarios. Instead of the designer telling the system their intent, this problem can be addressed by the system asking the designer about their intent. To decide what to ask the designer, the system may be able to determine which states it is highly uncertain about, even if it is not able to accurately ascribe values to some of them. Recent work shows that such an approach can be effectively used to learn from the human about the task at hand (Christiano et al. 2017), but it may also be used to learn something about the constraints of the environment and which side effects are desired or undesired (Zhang, Durfee, and Singh 2018). Active learning could also provide a different perspective on impact regularizers: instead of directly penalizing impact on the environment, a high value of the regularization term could be understood as indicating that the designer should give feedback. In particular, this approach could help to resolve situations in which a positive task reward conflicts with the regularization term.

7 Conclusion

Avoiding negative side effects in systems that have the capacity to cause harm is necessary to fully realize the promise of artificial intelligence. In this paper, we discussed a popular approach to reduce negative side effects in RL: impact regularization (IR). We discussed the practical difficulty of choosing each of the three components: a baseline, a deviation measure, and a regularization strength. Furthermore, we pointed to fundamental problems that are currently not addressed by state-of-the-art methods, and presented several new future research directions to address these. While our discussion showed that current approaches still leave significant opportunities for future work, IRs are a promising idea for building the next generation of safe AI systems, and we hope that our discussion is valuable for researchers trying to build new IRs.

Acknowledgments

We thank Andreas Krause, François Fleuret and Benjamin Grewe for their valuable comments and suggestions. Kyle Matoba was supported by the Swiss National Science Foundation under grant number FNS-188758 "CORTI".

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv:1606.06565.

Armstrong, S.; and Levinstein, B. 2017. Low impact artificial intelligences. arXiv:1705.10720.

Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems.

Everitt, T.; Ortega, P. A.; Barnes, E.; and Legg, S. 2019. Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. arXiv:1902.09980.

Eysenbach, B.; Gu, S.; Ibarz, J.; and Levine, S. 2018. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations (ICLR).

García, J.; and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1): 1437-1480.

Gilpin, L. H.; Bau, D.; Yuan, B. Z.; Bajwa, A.; Specter, M.; and Kagal, L. 2018. Explaining explanations: An overview of interpretability of machine learning. In IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 80-89.

Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2017. The off-switch game. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Huang, X.; Kwiatkowska, M.; Wang, S.; and Wu, M. 2017. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, 3-29. Springer.

Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; and Legg, S. 2019. Penalizing side effects using stepwise relative reachability. In Workshop on Artificial Intelligence Safety at IJCAI.

Krakovna, V.; Orseau, L.; Ngo, R.; Martic, M.; and Legg, S. 2020a. Avoiding Side Effects By Considering Future Tasks. In Advances in Neural Information Processing Systems.

Krakovna, V.; Uesato, J.; Mikulik, V.; Rahtz, M.; Everitt, T.; Kumar, R.; Kenton, Z.; Leike, J.; and Legg, S. 2020b. Specification gaming: the flip side of AI ingenuity. URL https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.

Leike, J.; Krueger, D.; Everitt, T.; Martic, M.; Maini, V.; and Legg, S. 2018. Scalable agent alignment via reward modeling: a research direction. arXiv:1811.07871.

McCarthy, J.; and Hayes, P. 1969. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence (Meltzer, B. and Michie, D., eds.), vol. 4.

Pearl, J. 2009. Causality. Cambridge University Press.

Rahaman, N.; Wolf, S.; Goyal, A.; Remme, R.; and Bengio, Y. 2019. Learning the Arrow of Time for Problems in Reinforcement Learning. In International Conference on Learning Representations (ICLR).

Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmarking safe exploration in deep reinforcement learning. arXiv:1910.01708.

Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020. Avoiding negative side effects due to incomplete knowledge of AI systems. arXiv:2008.12146.

Saunders, W.; Sastry, G.; Stuhlmüller, A.; and Evans, O. 2018. Trial without Error: Towards Safe Reinforcement Learning via Human Intervention. In Proceedings of International Conference on Autonomous Agents and MultiAgent Systems.

Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and Dragan, A. 2019. Preferences Implicit in the State of the World. In International Conference on Learning Representations (ICLR).

Turner, A. M.; Hadfield-Menell, D.; and Tadepalli, P. 2020. Conservative agency via attainable utility preservation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society.

Turner, A. M.; Ratzlaff, N.; and Tadepalli, P. 2020. Avoiding Side Effects in Complex Environments. In Advances in Neural Information Processing Systems.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Zhang, S.; Durfee, E.; and Singh, S. 2020. Querying to Find a Safe Policy under Uncertain Safety Constraints in Markov Decision Processes. In Proceedings of the AAAI Conference on Artificial Intelligence.