<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Benchmark-
ing safe exploration in deep reinforcement learning. arXiv
preprint arXiv:</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Negative Side Effects and AI Agent Indicators: Experiments in SafeLife</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for the Study of Existential Risk</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Leverhulme Centre for the Future of Intelligence</institution>
          ,
          <addr-line>Cambridge</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universitat Polite`cnica de Vale`ncia</institution>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1910</year>
      </pub-date>
      <volume>01708</volume>
      <abstract>
        <p>The widespread adoption and ubiquity of AI systems will require them to be safe. The safety issues that can arise from AI are broad and varied. In this paper we consider the safety issue of negative side effects and the consequences they can have on an environment. In the safety benchmarking domain SafeLife, we discuss the way that side effects are measured, as well as presenting results showing the relation between the magnitude of side effects and other metrics for three agent types: Deep Q-Networks, Proximal Policy Optimisation, and a Uniform Random Agent. We observe that different metrics and agent types lead to both monotonic and non-monotonic interactions, with the finding that the size and complexity of the environment versus the capability of the agent plays a major role in negative side effects, sometimes in intricate ways.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        As advances within Artificial Intelligence (AI) continue, AI
systems are becoming increasingly ubiquitous. Further
advances offer the potential for great benefits: future AI
systems may use techniques such as reinforcement learning
(RL) to allow for agents with the ability to operate with
greater autonomy, making use of a greater range of actions
in more varied environments, in pursuit of more complex
goals. To realise these benefits, the performance of these
future AI systems must be robustly safe and predictable. Safety
issues can take many forms and even the most exhaustive
surveys
        <xref ref-type="bibr" rid="ref1 ref5">(Amodei et al. 2016; Critch and Krueger 2020)</xref>
        cannot identify and classify every conceivable risk. Broad areas
of concern within the AI Safety literature include
corrigibility (Soares et al. 2015), Safe Exploration
        <xref ref-type="bibr" rid="ref7">(Garcia and
Fernandez 2012)</xref>
        , Value alignment (Russell 2019), Side Effects
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        , among many others.
      </p>
      <p>
        Benchmarking domains have become a popular method
for the evaluation of safety properties for AI systems. There
are many such benchmarking suites available, such as AI
Safety Gridworlds (Leike et al. 2017), Safety Gym
        <xref ref-type="bibr" rid="ref11">(Ray,
Achiam, and Amodei 2019)</xref>
        , SafeLife
        <xref ref-type="bibr" rid="ref11">(Wainwright and
Eckersley 2019)</xref>
        and the Real World Reinforcement Learning
Challenge Framework
        <xref ref-type="bibr" rid="ref6">(Dulac-Arnold et al. 2020)</xref>
        . A key
advantage of this type of black-box testing is that it enables
quantification of an AI system’s performance with respect
to the appropriate safety properties. This allows direct
comparisons between separate algorithms and posited solutions
to safety problems.
      </p>
      <p>Our goal with this paper is to study how the performance
and safety, in terms of side effects, evolve as more resources
are given to the learning agent. This analysis is enriched by
a more detailed analysis of several indicators of the agent’s
behaviour and the side effect metric itself, in a benchmark,
SafeLife, where safety depends on finding trade-offs
between achieving maximum reward and affecting the whole
environment through careless exploration. This analysis can
provide key insights into the types of behaviour we may
expect from learning agents and how they may affect the world
around them.</p>
    </sec>
    <sec id="sec-2">
      <title>Background and Related Work</title>
      <p>
        Negative side effects (NSE) are a key issue in AI safety
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        . NSEs typically stem from a
misspecified reward or objective function — undesirable
behaviour was not sufficiently penalised. A large part of the
difficulty with accurately specifying a reward or objective
function to avoid NSEs is the sheer amount of things we
want our AI systems to not do. Ideally the AI systems we
build will not do undesirable and unrelated things such as
injuring humans, destroying useful equipment or
irrevocably exterminating all life on the planet in the course of
performing its task. Encoding the vast amount of undesirable
actions directly into the reward/objective function is clearly
intractable. This motivates research into identifying clearly
what constitutes an NSE as well as how AI systems can learn
to avoid performing them.
      </p>
      <p>We give an overview of the formulation of NSEs and
related notions from the AI Safety community.</p>
      <p>
        Low-impact AI systems
        <xref ref-type="bibr" rid="ref3">(Armstrong and Levinstein 2017)</xref>
        are proposed as a method for minimising NSEs. Here, the
authors claim that it is desirable for the overall impact of
an AI system to be low in order to prevent undesirable
situations. Impact is defined by the difference in the worlds
where the AI exists and the counterfactual worlds where the
AI does not exist. A sufficiently coarse measurement of the
possible world’s properties is to be used to ensure that an AI
cannot make large, sweeping changes to the world without
constituting a large impact. At the same time, coarse world
properties would reduce the size of the world representation
and make the calculation more tractable. However, the
properties need to be selected such that they are important for
us — humanity — to care about. Additionally, with this
formulation, positive large impacts are also minimised but the
authors claim this is easier to handle since we often have
a clearer idea of what these positive impacts would be. A
similar approach is proposed in
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        where
impact is regularised, but posits that due to the similarity of
many side effects such a regularisation could be learned in
one domain and transferred over to another.
      </p>
      <p>As part of an AI Safety Gridworlds environment (Leike
et al. 2017) irreversible side effects are penalised. These
actions are not inherently negative in themselves but their
irreversibility implies that they cannot be undone if later on
the action was deemed undesirable. This approach has the
downside that it may allow for reversible NSEs to be
repeated indefinitely.</p>
      <p>
        Stepwise inaction is introduced in
        <xref ref-type="bibr" rid="ref5 ref8">(Turner,
HadfieldMenell, and Tadepalli 2020)</xref>
        , where the effect of action
against inaction is calculated after each time-step over
auxiliary Q-functions. Here the intuition is that the agent is only
penalised for very impactful actions once, when they occur.
In order to account for actions that take many time-steps
for the undesirable side effect to occur, baseline rollouts are
used, where the state of the system after many steps of
inaction is used for the comparison.
      </p>
      <p>Krakovna et al. (2019) provide a comprehensive
comparison of the above formulations over a range of gridworld
domains, showing where each of them succeed and fail;
additionally, the concept of relative reachability is introduced
— this is a measure of the average reduction of reachability
for every state compared to a baseline.</p>
      <p>The recurring theme for defining and identifying NSEs is
the idea of comparing the effect of the agent against some
scenario in which the agent would have acted in a different
way. Both the exact baseline and comparison function often
differ and, as shown in (Krakovna et al. 2019), have relative
advantages and disadvantages for different environments.</p>
      <sec id="sec-2-1">
        <title>Types of Side Effects</title>
        <p>The above formulations capture many aspects of measuring
NSEs and deciding which counterfactual worlds to compare
against. A notion that deserves additional detail, however,
is that of the different properties of NSEs and how they
affect the world. Distinguishing between NSEs by the
properties they have can help us to further understand their
consequences as well as to identify and implement robust
solutions to them. We consider the application of an autonomous
agricultural drone which can survey and spray crop fields
as an illustrative example, giving an intuitive description of
each outlined NSE property.</p>
        <p>We refer to a first type of NSE property as “blocking”.
By this we mean an action by an AI system that prevents
it from performing its task successfully, or substantially and
irrevocably curtails the reward it can achieve. Examples of
this NSE include an agent taking an action which locks away
an object it needs to complete its task. In our agricultural
drone example, this would be an action that causes the drone
to be unable to fly or spray properly. While not part of the
specification, these side effects are detectable as the goal or
associated reward are no longer achievable.</p>
        <p>A second property of interest is that of “irreversible”
actions. Here we refer to the same type of actions as described
in (Leike et al. 2017). These are particularly negative — and
difficult to detect — in the cases where an action’s
consequences are not clear for many time-steps. This type of
effect would correspond to the drone unintentionally spraying
entities such as people with potentially harmful chemicals.
This can clearly not be undone.</p>
        <p>“Effect indifference” is another relevant NSE property.
Here we consider a scenario in which there is some
behaviour which we do not want the agent to do and where
the AI system’s preferences are indifferent to this behaviour.
This may also include an indifferent reward function. This
NSE type is very much intertwined with that of reward
mis-specification and often occurs due to the system
having unanticipated affordances or incentives. Similarly to the
last example, this effect type could correspond to spraying
unintended entities if this is not properly encoded into the
reward function.</p>
        <p>
          The final NSE property we consider here is that of the
“reward trade-off”. This effect typically arises when resources
are limited and shared with other entities (humans included),
and these resources are instrumentally useful to the agent.
The danger here is not only that the maximisation of the
reward could imply that all resources should be exhausted
          <xref ref-type="bibr" rid="ref4">(e.g., the famous instrumental convergence problems, such
as Minsky’s Riemann hypothesis or Bostrom’s paperclip
problem, Bostrom 2003)</xref>
          , but even in situations where the
reward is capped, the agent may take more than it needs for
its task and perform an inefficient and unsustainable use of
the shared resources. In our agricultural drone example, this
would correspond to the drone revisiting already treated
areas or making detours, leading to high energy consumption
or shorter operational life.
        </p>
        <p>Different actions can have several of the properties
outlined above; this may compound the risk posed by a side
effect.</p>
        <p>
          AI Safety benchmarking suites can evaluate some or all
of the above. In the AI Safety Gridworlds (Leike et al. 2017)
benchmarking suite, the irreversible side effects
environment captures what we refer to as an irreversible NSE. The
agent here is tasked with reaching a goal space, but must
move a box that is blocking its path. Moving the box
adjacent to a side or a corner can hinder or prevent future
movement of the box. The agent is not directly penalised for
performing these actions, but it is undesirable from a safety
perspective. The Safety Gym
          <xref ref-type="bibr" rid="ref11">(Ray, Achiam, and Amodei 2019)</xref>
          generally requires an agent to perform a task such as
reaching a goal or moving a box to a specific position while
avoiding numerous hazard types. The side effects possible in these
environments generally utilise the reward trade-off property;
the agents are trying to maximise their rewards whilst
trying to minimise constraint violation. The Real World
Reinforcement Learning Challenge Framework
          <xref ref-type="bibr" rid="ref6">(Dulac-Arnold
et al. 2020)</xref>
          does not contain any environments that directly
measure the side effects of an agent, though some tasks
require the agent not to violate certain safety constraints. On
the other hand, SafeLife
          <xref ref-type="bibr" rid="ref11">(Wainwright and Eckersley 2019)</xref>
          ,
described in the following section, presents side effects that
can consist of each of the NSE properties seen above.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>SafeLife Environment</title>
      <p>
        SafeLife
        <xref ref-type="bibr" rid="ref11">(Wainwright and Eckersley 2019)</xref>
        is a safety
benchmarking suite for RL agents. SafeLife takes the complex
mechanics from Conway’s Game of Life (Gardener 1970) and
utilises them to create single-agent RL domains. The
resulting environments have a rich level of complexity and allow
for formulations of many different tasks. A distinguishing
feature of SafeLife is its focus on measuring and avoiding
side effects.
      </p>
      <p>Each environment consists of an n m grid of cells which
can each take a variety of properties (see an example in Fig.
1). We refer to this grid as a board. The most notable
property is whether the cell is “dead” or “alive”. Following the
Game of Life rules the cells update as follows: a dead cell
will become alive if it has exactly three neighbouring alive
cells, otherwise it will remain dead. On the other hand, an
alive cell will die if it has fewer than two alive neighbours or
more than three alive neighbours, otherwise it will survive.</p>
      <p>SafeLife extends this basic framework into a single-agent
domain by including an agent that traverses the grid. At each
time step, the agent perceives a 25 25 grid overview of
its surroundings, centred on itself. For each position in this
overview, the cell type present is perceived.</p>
      <p>Instances in SafeLife come with goals for the agent to
complete, usually revolving around creating or destroying
particular life structures before moving to a predefined goal
cell. Additionally, the other effects the agent has on
preexisting life cells are measured — these correspond to
unintended side effects.</p>
      <p>In SafeLife the agent is depicted by an arrow and may
move in any cardinal direction, as well as creating or
destroying life cells in adjacent spaces. The agent’s movement
can be impeded by walls which are shown as a grey square
with a black hole . Additionally, there are boxes which
are shown as a grey square without a black hole that the
agent can push and pull. Pre-existing green life cells are
depicted using green circles and life cells created by the
agent are shown as grey circles . These agent-created life
cells are to be placed in the designated areas shown in blue.
Both types of life cells that follow the rules of the Game of
Life. The “tree” entities are depicted by a green square with
radiating lines and are fixed living cells that cannot be
destroyed. Finally, the goal is depicted with a grey arch ,
which turns red when it can be activated. All cells that have
not been explicitly noted as living are “dead”.</p>
      <p>Altogether this allows for a very rich and complex set of
environments which can be generated procedurally.</p>
      <sec id="sec-3-1">
        <title>Measuring Side Effects</title>
        <p>We now look at the method that the authors of SafeLife
propose for measuring the magnitude of side effects. Within
SafeLife the destruction, addition or changing of cell types
caused by agent actions represent side effects. The types of
side effects that we are concerned with are those made to
the pre-existing life cells, but SafeLife provides the means to
look at the effects on any cell type. The primary reason for
only considering the effects the agent has on the green life
cells is the fact that the agent is not penalised or rewarded
for disrupting the structures they form; thus they may be
destroyed as an ‘effect indifference’ NSE. The green cells can
obstruct the agent and prevent it from moving, so it may be
instrumentally useful for the agent to destroy them, making
them relevant to the ‘trade-off’ NSE property. Further, the
destruction can be permanent, and thus may represent
irreversible NSEs. Finally, because these life cells follow the
rules of the game of life, they can have very interesting and
complex behaviours.</p>
        <p>Due to the dynamic nature of SafeLife, comparing a
single final state against a single baseline state may not be
sufficient to capture the eventual side effects of an agent. Instead,
two distributions are created that represent the future states
of the board for both the case where the agent acted and
affected the environment as well as the counterfactual case in
which the agent never existed. To create the action case
distribution, Da from the final state left by the agent, a further
1000 steps are simulated and each resulting state is added
to Da. The inaction case distribution Di is created similarly,
except that 1000 + n steps are run from the initial state of
the environment, where n is the number of steps taken by
the agent during the episode. This ensures that both
distributions reflect the states of the board 1000 + n steps in the
future from the initial state for both the factual and
counterfactual cases.</p>
        <p>From these distributions, the estimated
proportion of time steps that board position
x = hx1; x2i; x 2 grid = f1; :::; ng f1; :::; mg
contains cell type c can be computed:
cD(x) =
1</p>
        <p>X[1(s(x) = c)]
jDj s2D
where D is the relevant distribution, s is the state drawn from
D and s(x) is the cell type present in s at location x.</p>
        <p>A ground distance function is also defined between board
positions as g(x; y) = tanh( 15 kx yk1) where kx yk1 is
the Manhattan distance between grid locations x and y.</p>
        <p>The calculated side effect for Cell Type c is then:
dc(Da; Di) = EM D( cDa ; cDi ; g)
where Da and Di are the action and inaction distributions
respectively, and EM D is the earth mover distance function.
Side effects in SafeLife are based on this distance metric,
which aims to capture the distance between two
probability distributions based on the amount of work it would take
to transform one distribution to the other by moving around
“distribution mass” (Rubner, Tomasi, and Guibas 1998).</p>
        <p>This side effect score dc(Da; Di) is then “normalised”
against the inaction baseline to give a final side effect score:
c(Da; Di) = Px2grid cDi (x)
dc(Da; Di)
This normalisation allows us to compare agent behaviours
in larger or more densely populated boards.</p>
        <p>While a side effect score can be calculated for every cell
type, SafeLife focuses on the side effects for the green,
preexisting life cells.</p>
        <p>
          This definition of a side effect echoes that of
          <xref ref-type="bibr" rid="ref3">(Armstrong
and Levinstein 2017)</xref>
          and
          <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
          , where the
agent’s actions are compared against a world in which it
never acted at all. A key difference however is the use of
future distributions to account for the possible long-term or
unstable effects of actions.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Examples and Implications</title>
        <p>Figure 2 shows a before and after comparison to show the
effects of an agent’s actions. In the “before” environment,
the agent places a grey life cell directly in front of it, adjacent
to the stable green life structure. This disrupts the structure
and after a few time steps it is destroyed. At the end of the
episode, assuming the agent does not affect any more green
life cells, we can calculate the side effect score as follows:
(a) Before
(b) After</p>
        <p>Since both the “before” and “after” snapshots are stable
(the structures if left alone will not change),
green(x) = 1(sbefore(x) = green)
Da
green(x) = 1(safter(x) = green)</p>
        <p>Di
As there are seven green cells in total, it then follows that
green(Da; Di) =</p>
        <p>EM D( gDraeen; green; g)</p>
        <p>Di
P green(x)
x2grid Di
=
4
7
The use of the normalised earth mover distance has some
interesting consequences for the side effect score.
Regardless of how large or densely packed the grid is with green
life cells, not disturbing any of them will yield a score of 0,
destroying them all will give a score of 1. However, in the
case that more green cells are created, by the agent adding
grey life cells to certain patterns which lead to an increase
in the number of green life cells, the side effect score can
increase to above 1, due to quirks with the calculation of the
earth mover distance. Whether or not causing more green
life cells to spawn is worse than the complete destruction of
all green life cells will depend on what these life cells
represent as well as personal moral philosophy. At the very least
this does capture the notion of representing the total impact
the agent has on the green life cells.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>SafeLife has many tasks available out-of-the-box, and
vastly more can be created by the user. We focus on the
append-still task, where the agent must create life
structures on the designated areas and then move to the goal
which ends the episode. Before the goal is accessible to the
agent, more than half of the designated (blue) positions must
be filled with life cells. At each step the agent receives a
reward equal to the number of life cells created on the
designated positions minus the number of life cells destroyed that
were on designated positions. Finally the agent receives an
additional 1 reward for reaching the goal after it has opened.
After 1000 steps by the agent, the episode will end,
regardless of the agent’s progress. If the agent has managed to fill
at least half of the designated positions for its life cells and
reached the goal to end the episode before the 1000 steps
have elapsed, we say the agent has ‘passed’ the task instance,
otherwise it has failed.</p>
      <p>In this task there are also pre-existing green life cell
structures which we do not want the agent to disturb. Importantly,
the reward function does not encourage or penalise any agent
interaction with these green life cells and the agent is thus
indifferent to them. Any interaction the agent has on the green
life cells therefore captures the essence of side effects caused
by reward mis-specification — we want the pre-existing life
cell structures to survive, but this is not encoded into the
reward function.</p>
      <p>SafeLife can utilise procedural generation to create
append-still tasks. The parameters used for the
procedural generation can greatly affect the difficulty of the task
instance, and thus the agent’s ability to safely complete the
task. For our experiments we utilised a wide range of
parameter settings to capture a broad overview of a given agent’s
capabilities.</p>
      <p>Table 1 gives the parameter list for the procedural
generation process. Grid Size corresponds to the width and
height of the board to be generated. SideEffectMin is
the minimum proportion of cells in the task to be green life
cells. GoalMin and GoalMax are the respective minimum
and maximum proportions of the board to be areas for the
agent to construct its own life cells. Finally Temperature
corresponds to a variable which controls the complexity of
the stable life patterns that can be generated for both the life
cells (both the pre-existing green cells and the spaces for the
agent to place its own cells).</p>
      <p>We train each agent for a predetermined number of steps,
with the instances generated using parameters drawn
uniformly at random from the set of difficulties. The agents are
then evaluated on 1000 episodes of each difficulty type (for
a total of 13000 episodes) and each score is aggregated.</p>
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
      <p>Here we describe our results from the experiments outlined.
The two main algorithms we assess are Deep Q-Networks
(DQN) (Mnih et al. 2013) and Proximal Policy Optimisation
(PPO) (Schulman et al. 2017) (both trained for 5, 10, and 30
million training steps), We also examine an agent that selects
actions uniformly at random (Uniform). For DQN and PPO
we use the implementation provided within SafeLife.</p>
      <p>Since the task instances may be of different sizes, and the
procedural generation may produce more green life cells or
designated positions for the agent to place life cells, the total
reward received by an agent is normalised against the
maximum possible reward for that instance. Similarly the side
effect score is normalised.</p>
      <p>Table 2 contains the scores for various metrics for each
evaluated agent. As expected, more training steps yields
better pass rates and average rewards. This improvement does
not carry over to the side effect score, however, where for
each agent type the side effect score slightly increases with
more training steps for DQN but slightly decreases for PPO,
although the trend is not clear. This is closely related to the
results for the exploration rate for each agent, which vary in
a similar way according to the increase of training steps from
5M to 30M. Recall that this is the evaluation stage, therefore
this reduction in exploration is ideal. The agent can see the
whole board, so it is more beneficial for the agent to attempt
its task without wasting time exploring.</p>
      <p>As should be expected, both agent types show safer
behaviour than the agent that selects random actions, and the
low rewards serve as a baseline for comparison (e.g.,
interestingly, the pass rate for DQN 5M is worse than the uniform
random agent). DQN and PPO do not overlap in their reward
scores, but the side effect score for PPO 30M is not very far
from the best side effect score (DQN 5M). This suggests that
there is no clear association between rewards and scores, at
least if we look at the aggregated results.</p>
      <p>Among the pass rate, average reward and side effect score
we also see large standard deviations suggesting that none
of the agents learn to perform consistently. The large range
of procedural generation parameters is likely the cause of
this, preventing agents from overfitting to one particular set
of generation patterns. The exceptions to this are DQN 5M
and 10M, which have significantly lower standard deviations
for the pass rate by virtue of having a significantly lower
pass rate that is bounded below by 0. This may also conceal
some associations between reward and side effects, which
may appear if we disaggregate the results.</p>
      <p>Looking in more detail at the performance of individual
algorithms can give more insight into how an agent’s
behaviour affects the side effects. In Figures 3, 4 and 5 we plot
the mean side effect score against the reward attained,
exploration rate and length of an episode, using 100 bins in the
x-axis. In Figure 6 we plot the mean maximum side effect
available against exploration rate. In these figures we can
see some relationships emerge when we look at the reward
attained by an agent. For PPO there is a clear increase in
the side effects caused from 0 to 0:4 followed by a slower
decrease from 0:4 to 1. This change in side effect score
appears to be consistent across each PPO variant, though the
magnitude of the increase varies. For episodes with very low
reward, more training appears to prevent dangerous actions,
though it is not clear exactly what causes this change in
behaviour. For DQN and the Uniform agent, there is less of
a relation with the reward received; instead, the side effect
score is more constant, although more training reduces the
dispersion of the scatter plots for DQN. Nevertheless, for the
three kinds of agents we see that maximum reward (1.0)
corresponds with very low side effects and, for PPO and DQN,
minimum reward (0.0) corresponds with very low side
effects. What happens in between is more pronouncedly a bell
shape for better agents (e.g., PPO with higher training steps).</p>
      <p>In Figure 4 we compare episode length against the side
effect score, and see positive correlations. This intuitively
makes sense, as the longer an agent is acting, the more
opportunities it has to disrupt cell structures. Again we note
that for longer episodes the variance becomes larger. Very
short episodes represent situations where the solution is
found easily, and roughly correspond to cases with
maximum reward in Figure 3.</p>
      <p>On the other hand, we see a tighter relationship when
we compare exploration rate against the side effect score
in Figure 5. For all the PPO variants, as the agent explores
more of the domain the side effect score increases. Also,
for larger exploration rates, the variance of scores becomes
much larger, suggesting a much less consistent relationship
for these values of the exploration rate. For DQN we only
see this relationship for the most highly-trained agent —
both DQN 5M and DQN 10M peak in their side effect score
after exploration rates of 0:1 and 0:25 respectively, before
dropping to 0. Similarly, when we look at the Uniform agent
and its exploration rate, there is a very tight relationship
between the two, with the side effect score increasing with the
exploration rate from 0 to 0:4, before dropping from 0:4 to
0:6 and remaining at 0 hereafter.</p>
      <p>These relationships can be somewhat explained by Figure
6 where we see the mean number of green cells in the
inaction distribution compared against the exploration rate. Here
we see similar patterns to the comparison of exploration rate
and side effects. It appears that the number of green cells
an environment contains will affect the exploration rate, and
that this will in turn affect the proportional amount of
damage certain agents can do. Why is it so? The rationale has to
be found in the way the environments are generated, as
introducing green cells makes full exploration of the whole
environment more difficult. Basically, explorations above 0.5
correspond to very special environments, such as those that
are very small and do not contain green cells and are easy
to explore fully even for a random agent. This means that
we have to be careful to interpret the rightmost part of the
curves in these plots. In a “reward trade-off” scenario about
side effects, the more limited and concentrated the resources
(targeted and non-targeted) are the higher the chance to have
an impact. Large environments with plenty of resources are
easier to handle, such as the one in Figure 2. We may need to
consider two ‘difficulties’: one about the task rewards
themselves and another one about the hardness of avoiding side
effects. The use of a random agent can help elucidate these
two cases, but a more detailed analysis of the generation
parameters such as those in Table 1 may be needed too.</p>
      <p>Overall, the general picture is that DQN behaves between
the uniform agent and PPO, and the range depends on the
number of training steps. As the agent is better in terms of
rewards (from DQN 5M to PPO 30M), the positive relation
between exploration rate and side effect is clearer. The
detailed view of Figure 3, and the way it bends down for PPO
illustrates that for those environments where the agent seems
more determined and manages to get high rewards, the
exploration is lower and so is the side effect. It is mostly for
those situations where the reward is in the range of 0.2 to 0.8
that the side effects are strongest. We hypothesise that these
are the cases where the agent only knows partially what to
do. Finally, those cases with very low rewards may be caused
by several situations, such as the agent being stuck in a loop,
reducing exploration and side effects.</p>
      <p>This suggests that the analysis of the confidence of the
agent, or some other metacognitive proxies could be used to
distinguish cases where the agent can complete the task with
high reward and low effect, and those cases where
uncertainty is higher and the agent should be more conservative.</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and Future Work</title>
      <p>Despite additional training time yielding much more capable
agents, we do not see a similar increase in safety; in fact
the safest agent overall was DQN after 5M training steps,
while being the second worst performing agent in terms of
reward. However, when we analyse the detailed behaviour
using several indicators we find a region of high rewards
and low effect, which is only achieved by PPO. This shows
a non-monotonic relationship, suggesting that the area with
medium rewards is the most dangerous for proficient agents.</p>
      <p>This paper highlights the differences in behaviour
between two commonly used RL algorithms when trained on
the same task. As the results show, these behavioural
differences can have a significant effect on our ability to predict
the safety and other factors of the agent, such as the
confidence or its level of competency compared to elements of the
environment, such as the proportion of green cells. Further
analysis and exploration of the difficulty of task instances
may help to elucidate the cause of these relationships
between side effects and other metrics, as well as to help us
to understand how environment difficulty and agent
uncertainty can be used to improve policies and make them safer.
(b) DQN
(b) DQN
(c) Uniform Agent
(b) DQN
(b) DQN
(c) Uniform Agent</p>
      <p>Gardener, M. 1970. Mathematical games: the fantastic
combinations of John Conway’s new solitaire game “life”.
Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; and Legg,
S. 2019. Penalizing side effects using stepwise relative
reachability.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steinhardt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Christiano,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Schulman</surname>
          </string-name>
          , J.; and Mane´,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Concrete problems in AI safety</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <source>arXiv preprint arXiv:1606</source>
          .
          <fpage>06565</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Armstrong</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Levinstein</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Low impact artificial intelligences</article-title>
          .
          <source>arXiv preprint arXiv:1705</source>
          .
          <fpage>10720</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Bostrom</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <year>2003</year>
          .
          <article-title>Ethical issues in advanced artificial intelligence. Science fiction and philosophy: from time travel to superintelligence 277-284</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Critch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Krueger</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2020</year>
          .
          <article-title>AI Research Considerations for Human Existential Safety (ARCHES).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Dulac-Arnold</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Levine,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Mankowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            ;
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Paduraru,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; and Hester,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2020</year>
          .
          <article-title>An empirical investigation of the challenges of real-world reinforcement learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Garcia</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Fernandez</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <year>2012</year>
          .
          <article-title>Safe Exploration of State and Action Spaces in Reinforcement Learning</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>45</volume>
          :
          <fpage>515</fpage>
          -
          <lpage>564</lpage>
          . ISSN 1076-
          <fpage>9757</fpage>
          . doi:
          <volume>10</volume>
          .1613/jair.3761. URL http://dx.doi.org/ 10.1613/jair.3761.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Turner</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hadfield-Menell</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and Tadepalli,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <source>Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society doi:10.1145/3375627</source>
          .3375851. URL http://dx.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>doi.org/10.1145/3375627.3375851.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Wainwright</surname>
            ,
            <given-names>C. L.</given-names>
          </string-name>
          ; and Eckersley,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>Safelife 1.0: Exploring side effects in complex environments</article-title>
          . arXiv preprint arXiv:
          <year>1912</year>
          .01217 .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>