=Paper=
{{Paper
|id=Vol-2808/Paper_24
|storemode=property
|title=Negative Side Effects and AI Agent Indicators: Experiments in SafeLife
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_24.pdf
|volume=Vol-2808
|authors=John Burden,Jose Hernandez-Orallo,Sean O'Heigeartaigh
|dblpUrl=https://dblp.org/rec/conf/aaai/BurdenHh21
}}
==Negative Side Effects and AI Agent Indicators: Experiments in SafeLife==
<pdf width="1500px">https://ceur-ws.org/Vol-2808/Paper_24.pdf</pdf>
<pre>
         Negative Side Effects and AI Agent Indicators: Experiments in SafeLife

                     John Burden,1 3 * José Hernández-Orallo, 1 2 Seán Ó hÉigeartaigh 1 3
                                 1
                                     Leverhulme Centre for the Future of Intelligence, Cambridge, UK
                                               2
                                                 Universitat Politècnica de València, Spain
                                        3
                                          Centre for the Study of Existential Risk, Cambridge, UK
                                                           *
                                                             jjb205@cam.ac.uk


                            Abstract                                   quantification of an AI system’s performance with respect
                                                                       to the appropriate safety properties. This allows direct com-
  The widespread adoption and ubiquity of AI systems will re-
  quire them to be safe. The safety issues that can arise from
                                                                       parisons between separate algorithms and posited solutions
  AI are broad and varied. In this paper we consider the safety        to safety problems.
  issue of negative side effects and the consequences they can            Our goal with this paper is to study how the performance
  have on an environment. In the safety benchmarking domain            and safety, in terms of side effects, evolve as more resources
  SafeLife, we discuss the way that side effects are measured,         are given to the learning agent. This analysis is enriched by
  as well as presenting results showing the relation between the       a more detailed analysis of several indicators of the agent’s
  magnitude of side effects and other metrics for three agent          behaviour and the side effect metric itself, in a benchmark,
  types: Deep Q-Networks, Proximal Policy Optimisation, and            SafeLife, where safety depends on finding trade-offs be-
  a Uniform Random Agent. We observe that different metrics            tween achieving maximum reward and affecting the whole
  and agent types lead to both monotonic and non-monotonic             environment through careless exploration. This analysis can
  interactions, with the finding that the size and complexity of
  the environment versus the capability of the agent plays a ma-
                                                                       provide key insights into the types of behaviour we may ex-
  jor role in negative side effects, sometimes in intricate ways.      pect from learning agents and how they may affect the world
                                                                       around them.

                        Introduction                                             Background and Related Work
As advances within Artificial Intelligence (AI) continue, AI           Negative side effects (NSE) are a key issue in AI safety
systems are becoming increasingly ubiquitous. Further ad-              (Amodei et al. 2016). NSEs typically stem from a mis-
vances offer the potential for great benefits: future AI sys-          specified reward or objective function — undesirable be-
tems may use techniques such as reinforcement learning                 haviour was not sufficiently penalised. A large part of the
(RL) to allow for agents with the ability to operate with              difficulty with accurately specifying a reward or objective
greater autonomy, making use of a greater range of actions             function to avoid NSEs is the sheer amount of things we
in more varied environments, in pursuit of more complex                want our AI systems to not do. Ideally the AI systems we
goals. To realise these benefits, the performance of these fu-         build will not do undesirable and unrelated things such as
ture AI systems must be robustly safe and predictable. Safety          injuring humans, destroying useful equipment or irrevoca-
issues can take many forms and even the most exhaustive                bly exterminating all life on the planet in the course of per-
surveys (Amodei et al. 2016; Critch and Krueger 2020) can-             forming its task. Encoding the vast amount of undesirable
not identify and classify every conceivable risk. Broad areas          actions directly into the reward/objective function is clearly
of concern within the AI Safety literature include corrigibil-         intractable. This motivates research into identifying clearly
ity (Soares et al. 2015), Safe Exploration (Garcia and Fer-            what constitutes an NSE as well as how AI systems can learn
nandez 2012), Value alignment (Russell 2019), Side Effects             to avoid performing them.
(Amodei et al. 2016), among many others.                                  We give an overview of the formulation of NSEs and re-
   Benchmarking domains have become a popular method                   lated notions from the AI Safety community.
for the evaluation of safety properties for AI systems. There             Low-impact AI systems (Armstrong and Levinstein 2017)
are many such benchmarking suites available, such as AI                are proposed as a method for minimising NSEs. Here, the
Safety Gridworlds (Leike et al. 2017), Safety Gym (Ray,                authors claim that it is desirable for the overall impact of
Achiam, and Amodei 2019), SafeLife (Wainwright and Eck-                an AI system to be low in order to prevent undesirable sit-
ersley 2019) and the Real World Reinforcement Learning                 uations. Impact is defined by the difference in the worlds
Challenge Framework (Dulac-Arnold et al. 2020). A key ad-              where the AI exists and the counterfactual worlds where the
vantage of this type of black-box testing is that it enables           AI does not exist. A sufficiently coarse measurement of the
Copyright © 2021, for this paper by its authors. Use permitted un-     possible world’s properties is to be used to ensure that an AI
der Creative Commons License Attribution 4.0 International (CC         cannot make large, sweeping changes to the world without
BY 4.0).                                                               constituting a large impact. At the same time, coarse world
properties would reduce the size of the world representation       to be unable to fly or spray properly. While not part of the
and make the calculation more tractable. However, the prop-        specification, these side effects are detectable as the goal or
erties need to be selected such that they are important for        associated reward are no longer achievable.
us — humanity — to care about. Additionally, with this for-           A second property of interest is that of “irreversible” ac-
mulation, positive large impacts are also minimised but the        tions. Here we refer to the same type of actions as described
authors claim this is easier to handle since we often have         in (Leike et al. 2017). These are particularly negative — and
a clearer idea of what these positive impacts would be. A          difficult to detect — in the cases where an action’s conse-
similar approach is proposed in (Amodei et al. 2016) where         quences are not clear for many time-steps. This type of ef-
impact is regularised, but posits that due to the similarity of    fect would correspond to the drone unintentionally spraying
many side effects such a regularisation could be learned in        entities such as people with potentially harmful chemicals.
one domain and transferred over to another.                        This can clearly not be undone.
   As part of an AI Safety Gridworlds environment (Leike              “Effect indifference” is another relevant NSE property.
et al. 2017) irreversible side effects are penalised. These ac-    Here we consider a scenario in which there is some be-
tions are not inherently negative in themselves but their ir-      haviour which we do not want the agent to do and where
reversibility implies that they cannot be undone if later on       the AI system’s preferences are indifferent to this behaviour.
the action was deemed undesirable. This approach has the           This may also include an indifferent reward function. This
downside that it may allow for reversible NSEs to be re-           NSE type is very much intertwined with that of reward
peated indefinitely.                                               mis-specification and often occurs due to the system hav-
   Stepwise inaction is introduced in (Turner, Hadfield-           ing unanticipated affordances or incentives. Similarly to the
Menell, and Tadepalli 2020), where the effect of action            last example, this effect type could correspond to spraying
against inaction is calculated after each time-step over aux-      unintended entities if this is not properly encoded into the
iliary Q-functions. Here the intuition is that the agent is only   reward function.
penalised for very impactful actions once, when they occur.           The final NSE property we consider here is that of the “re-
In order to account for actions that take many time-steps          ward trade-off”. This effect typically arises when resources
for the undesirable side effect to occur, baseline rollouts are    are limited and shared with other entities (humans included),
used, where the state of the system after many steps of inac-      and these resources are instrumentally useful to the agent.
tion is used for the comparison.                                   The danger here is not only that the maximisation of the
   Krakovna et al. (2019) provide a comprehensive compar-          reward could imply that all resources should be exhausted
ison of the above formulations over a range of gridworld           (e.g., the famous instrumental convergence problems, such
domains, showing where each of them succeed and fail; ad-          as Minsky’s Riemann hypothesis or Bostrom’s paperclip
ditionally, the concept of relative reachability is introduced     problem, Bostrom 2003), but even in situations where the
— this is a measure of the average reduction of reachability       reward is capped, the agent may take more than it needs for
for every state compared to a baseline.                            its task and perform an inefficient and unsustainable use of
   The recurring theme for defining and identifying NSEs is        the shared resources. In our agricultural drone example, this
the idea of comparing the effect of the agent against some         would correspond to the drone revisiting already treated ar-
scenario in which the agent would have acted in a different        eas or making detours, leading to high energy consumption
way. Both the exact baseline and comparison function often         or shorter operational life.
differ and, as shown in (Krakovna et al. 2019), have relative         Different actions can have several of the properties out-
advantages and disadvantages for different environments.           lined above; this may compound the risk posed by a side
                                                                   effect.
Types of Side Effects                                                 AI Safety benchmarking suites can evaluate some or all
The above formulations capture many aspects of measuring           of the above. In the AI Safety Gridworlds (Leike et al. 2017)
NSEs and deciding which counterfactual worlds to compare           benchmarking suite, the irreversible side effects environ-
against. A notion that deserves additional detail, however,        ment captures what we refer to as an irreversible NSE. The
is that of the different properties of NSEs and how they af-       agent here is tasked with reaching a goal space, but must
fect the world. Distinguishing between NSEs by the prop-           move a box that is blocking its path. Moving the box adja-
erties they have can help us to further understand their con-      cent to a side or a corner can hinder or prevent future move-
sequences as well as to identify and implement robust solu-        ment of the box. The agent is not directly penalised for per-
tions to them. We consider the application of an autonomous        forming these actions, but it is undesirable from a safety per-
agricultural drone which can survey and spray crop fields          spective. The Safety Gym (Ray, Achiam, and Amodei 2019)
as an illustrative example, giving an intuitive description of     generally requires an agent to perform a task such as reach-
each outlined NSE property.                                        ing a goal or moving a box to a specific position while avoid-
    We refer to a first type of NSE property as “blocking”.        ing numerous hazard types. The side effects possible in these
By this we mean an action by an AI system that prevents            environments generally utilise the reward trade-off property;
it from performing its task successfully, or substantially and     the agents are trying to maximise their rewards whilst try-
irrevocably curtails the reward it can achieve. Examples of        ing to minimise constraint violation. The Real World Re-
this NSE include an agent taking an action which locks away        inforcement Learning Challenge Framework (Dulac-Arnold
an object it needs to complete its task. In our agricultural       et al. 2020) does not contain any environments that directly
drone example, this would be an action that causes the drone       measure the side effects of an agent, though some tasks re-
quire the agent not to violate certain safety constraints. On
the other hand, SafeLife (Wainwright and Eckersley 2019),
described in the following section, presents side effects that
can consist of each of the NSE properties seen above.

                 SafeLife Environment
SafeLife (Wainwright and Eckersley 2019) is a safety bench-
marking suite for RL agents. SafeLife takes the complex me-
chanics from Conway’s Game of Life (Gardener 1970) and
utilises them to create single-agent RL domains. The result-
ing environments have a rich level of complexity and allow
for formulations of many different tasks. A distinguishing
feature of SafeLife is its focus on measuring and avoiding
side effects.
   Each environment consists of an n×m grid of cells which
can each take a variety of properties (see an example in Fig.
1). We refer to this grid as a board. The most notable prop-        Figure 1: An example task instance from SafeLife. The
erty is whether the cell is “dead” or “alive”. Following the        agent    must create life structures   in the designated blue
Game of Life rules the cells update as follows: a dead cell         positions before moving to the goal . Ideally the agent
will become alive if it has exactly three neighbouring alive        should not disturb the green life cells .
cells, otherwise it will remain dead. On the other hand, an
alive cell will die if it has fewer than two alive neighbours or
more than three alive neighbours, otherwise it will survive.        side effects that we are concerned with are those made to
   SafeLife extends this basic framework into a single-agent        the pre-existing life cells, but SafeLife provides the means to
domain by including an agent that traverses the grid. At each       look at the effects on any cell type. The primary reason for
time step, the agent perceives a 25 × 25 grid overview of           only considering the effects the agent has on the green life
its surroundings, centred on itself. For each position in this      cells is the fact that the agent is not penalised or rewarded
overview, the cell type present is perceived.                       for disrupting the structures they form; thus they may be de-
   Instances in SafeLife come with goals for the agent to           stroyed as an ‘effect indifference’ NSE. The green cells can
complete, usually revolving around creating or destroying           obstruct the agent and prevent it from moving, so it may be
particular life structures before moving to a predefined goal       instrumentally useful for the agent to destroy them, making
cell. Additionally, the other effects the agent has on pre-         them relevant to the ‘trade-off’ NSE property. Further, the
existing life cells are measured — these correspond to un-          destruction can be permanent, and thus may represent irre-
intended side effects.                                              versible NSEs. Finally, because these life cells follow the
   In SafeLife the agent is depicted by an arrow        and may     rules of the game of life, they can have very interesting and
move in any cardinal direction, as well as creating or de-          complex behaviours.
stroying life cells in adjacent spaces. The agent’s movement           Due to the dynamic nature of SafeLife, comparing a sin-
can be impeded by walls which are shown as a grey square            gle final state against a single baseline state may not be suffi-
with a black hole . Additionally, there are boxes which             cient to capture the eventual side effects of an agent. Instead,
are shown as a grey square without a black hole          that the   two distributions are created that represent the future states
agent can push and pull. Pre-existing green life cells are de-      of the board for both the case where the agent acted and af-
picted using green circles         and life cells created by the    fected the environment as well as the counterfactual case in
agent are shown as grey circles . These agent-created life          which the agent never existed. To create the action case dis-
cells are to be placed in the designated areas shown in blue.       tribution, Da from the final state left by the agent, a further
Both types of life cells that follow the rules of the Game of       1000 steps are simulated and each resulting state is added
Life. The “tree” entities are depicted by a green square with       to Da . The inaction case distribution Di is created similarly,
radiating lines       and are fixed living cells that cannot be     except that 1000 + n steps are run from the initial state of
destroyed. Finally, the goal is depicted with a grey arch ,         the environment, where n is the number of steps taken by
which turns red when it can be activated. All cells that have       the agent during the episode. This ensures that both distri-
not been explicitly noted as living are “dead”.                     butions reflect the states of the board 1000 + n steps in the
   Altogether this allows for a very rich and complex set of        future from the initial state for both the factual and counter-
environments which can be generated procedurally.                   factual cases.
                                                                       From these distributions, the estimated pro-
Measuring Side Effects                                              portion      of     time     steps    that     board    position
                                                                    x = hx1 , x2 i, x ∈ grid = {1, ..., n} × {1, ..., m}        con-
We now look at the method that the authors of SafeLife pro-         tains cell type c can be computed:
pose for measuring the magnitude of side effects. Within
                                                                                                1 X
SafeLife the destruction, addition or changing of cell types                        ρcD (x) =           [1(s(x) = c)]
caused by agent actions represent side effects. The types of                                   |D|
                                                                                                  s∈D
where D is the relevant distribution, s is the state drawn from                   ρgreen
                                                                                    Di   (x) = 1(saf ter (x) = green)
D and s(x) is the cell type present in s at location x.               As there are seven green cells in total, it then follows that
   A ground distance function is also defined between board
positions as g(x, y) = tanh( 15 kx − yk1 ) where kx − yk1 is                                     EM D(ρgreen , ρgreen , g)   4
                                                                           υgreen (Da , Di ) =    P Da green    Di
                                                                                                                           =
the Manhattan distance between grid locations x and y.                                                    ρ       (x)        7
                                                                                                    x∈grid Di
   The calculated side effect for Cell Type c is then:
                                                                      The use of the normalised earth mover distance has some
             dc (Da , Di ) = EM D(ρcDa , ρcDi , g)                    interesting consequences for the side effect score. Regard-
                                                                      less of how large or densely packed the grid is with green
where Da and Di are the action and inaction distributions re-         life cells, not disturbing any of them will yield a score of 0,
spectively, and EM D is the earth mover distance function.            destroying them all will give a score of 1. However, in the
Side effects in SafeLife are based on this distance metric,           case that more green cells are created, by the agent adding
which aims to capture the distance between two probabil-              grey life cells to certain patterns which lead to an increase
ity distributions based on the amount of work it would take           in the number of green life cells, the side effect score can
to transform one distribution to the other by moving around           increase to above 1, due to quirks with the calculation of the
“distribution mass” (Rubner, Tomasi, and Guibas 1998).                earth mover distance. Whether or not causing more green
   This side effect score dc (Da , Di ) is then “normalised”          life cells to spawn is worse than the complete destruction of
against the inaction baseline to give a final side effect score:      all green life cells will depend on what these life cells repre-
                                dc (Da , Di )                         sent as well as personal moral philosophy. At the very least
               υc (Da , Di ) = P         c                            this does capture the notion of representing the total impact
                                x∈grid ρDi (x)                        the agent has on the green life cells.
This normalisation allows us to compare agent behaviours
in larger or more densely populated boards.                                                  Experiments
   While a side effect score can be calculated for every cell         SafeLife has many tasks available out-of-the-box, and
type, SafeLife focuses on the side effects for the green, pre-        vastly more can be created by the user. We focus on the
existing life cells.                                                  append-still task, where the agent must create life
   This definition of a side effect echoes that of (Armstrong         structures on the designated areas and then move to the goal
and Levinstein 2017) and (Amodei et al. 2016), where the              which ends the episode. Before the goal is accessible to the
agent’s actions are compared against a world in which it              agent, more than half of the designated (blue) positions must
never acted at all. A key difference however is the use of            be filled with life cells. At each step the agent receives a re-
future distributions to account for the possible long-term or         ward equal to the number of life cells created on the desig-
unstable effects of actions.                                          nated positions minus the number of life cells destroyed that
                                                                      were on designated positions. Finally the agent receives an
Examples and Implications                                             additional 1 reward for reaching the goal after it has opened.
Figure 2 shows a before and after comparison to show the              After 1000 steps by the agent, the episode will end, regard-
effects of an agent’s actions. In the “before” environment,           less of the agent’s progress. If the agent has managed to fill
the agent places a grey life cell directly in front of it, adjacent   at least half of the designated positions for its life cells and
to the stable green life structure. This disrupts the structure       reached the goal to end the episode before the 1000 steps
and after a few time steps it is destroyed. At the end of the         have elapsed, we say the agent has ‘passed’ the task instance,
episode, assuming the agent does not affect any more green            otherwise it has failed.
life cells, we can calculate the side effect score as follows:           In this task there are also pre-existing green life cell struc-
                                                                      tures which we do not want the agent to disturb. Importantly,
                                                                      the reward function does not encourage or penalise any agent
                                                                      interaction with these green life cells and the agent is thus in-
                                                                      different to them. Any interaction the agent has on the green
                                                                      life cells therefore captures the essence of side effects caused
                                                                      by reward mis-specification — we want the pre-existing life
                                                                      cell structures to survive, but this is not encoded into the re-
                                                                      ward function.
                                                                         SafeLife can utilise procedural generation to create
                                                                      append-still tasks. The parameters used for the proce-
              (a) Before                   (b) After                  dural generation can greatly affect the difficulty of the task
                                                                      instance, and thus the agent’s ability to safely complete the
        Figure 2: Example for Side Effect calculation                 task. For our experiments we utilised a wide range of param-
                                                                      eter settings to capture a broad overview of a given agent’s
   Since both the “before” and “after” snapshots are stable           capabilities.
(the structures if left alone will not change),                          Table 1 gives the parameter list for the procedural gen-
                                                                      eration process. Grid Size corresponds to the width and
            ρgreen
             Da    (x) = 1(sbef ore (x) = green)                      height of the board to be generated. SideEffectMin is
the minimum proportion of cells in the task to be green life        some associations between reward and side effects, which
cells. GoalMin and GoalMax are the respective minimum               may appear if we disaggregate the results.
and maximum proportions of the board to be areas for the               Looking in more detail at the performance of individual
agent to construct its own life cells. Finally Temperature          algorithms can give more insight into how an agent’s be-
corresponds to a variable which controls the complexity of          haviour affects the side effects. In Figures 3, 4 and 5 we plot
the stable life patterns that can be generated for both the life    the mean side effect score against the reward attained, ex-
cells (both the pre-existing green cells and the spaces for the     ploration rate and length of an episode, using 100 bins in the
agent to place its own cells).                                      x-axis. In Figure 6 we plot the mean maximum side effect
   We train each agent for a predetermined number of steps,         available against exploration rate. In these figures we can
with the instances generated using parameters drawn uni-            see some relationships emerge when we look at the reward
formly at random from the set of difficulties. The agents are       attained by an agent. For PPO there is a clear increase in
then evaluated on 1000 episodes of each difficulty type (for        the side effects caused from 0 to 0.4 followed by a slower
a total of 13000 episodes) and each score is aggregated.            decrease from 0.4 to 1. This change in side effect score ap-
                                                                    pears to be consistent across each PPO variant, though the
                Results and Discussion                              magnitude of the increase varies. For episodes with very low
Here we describe our results from the experiments outlined.         reward, more training appears to prevent dangerous actions,
The two main algorithms we assess are Deep Q-Networks               though it is not clear exactly what causes this change in be-
(DQN) (Mnih et al. 2013) and Proximal Policy Optimisation           haviour. For DQN and the Uniform agent, there is less of
(PPO) (Schulman et al. 2017) (both trained for 5, 10, and 30        a relation with the reward received; instead, the side effect
million training steps), We also examine an agent that selects      score is more constant, although more training reduces the
actions uniformly at random (Uniform). For DQN and PPO              dispersion of the scatter plots for DQN. Nevertheless, for the
we use the implementation provided within SafeLife.                 three kinds of agents we see that maximum reward (1.0) cor-
   Since the task instances may be of different sizes, and the      responds with very low side effects and, for PPO and DQN,
procedural generation may produce more green life cells or          minimum reward (0.0) corresponds with very low side ef-
designated positions for the agent to place life cells, the total   fects. What happens in between is more pronouncedly a bell
reward received by an agent is normalised against the max-          shape for better agents (e.g., PPO with higher training steps).
imum possible reward for that instance. Similarly the side             In Figure 4 we compare episode length against the side
effect score is normalised.                                         effect score, and see positive correlations. This intuitively
   Table 2 contains the scores for various metrics for each         makes sense, as the longer an agent is acting, the more op-
evaluated agent. As expected, more training steps yields bet-       portunities it has to disrupt cell structures. Again we note
ter pass rates and average rewards. This improvement does           that for longer episodes the variance becomes larger. Very
not carry over to the side effect score, however, where for         short episodes represent situations where the solution is
each agent type the side effect score slightly increases with       found easily, and roughly correspond to cases with maxi-
more training steps for DQN but slightly decreases for PPO,         mum reward in Figure 3.
although the trend is not clear. This is closely related to the        On the other hand, we see a tighter relationship when
results for the exploration rate for each agent, which vary in      we compare exploration rate against the side effect score
a similar way according to the increase of training steps from      in Figure 5. For all the PPO variants, as the agent explores
5M to 30M. Recall that this is the evaluation stage, therefore      more of the domain the side effect score increases. Also,
this reduction in exploration is ideal. The agent can see the       for larger exploration rates, the variance of scores becomes
whole board, so it is more beneficial for the agent to attempt      much larger, suggesting a much less consistent relationship
its task without wasting time exploring.                            for these values of the exploration rate. For DQN we only
   As should be expected, both agent types show safer be-           see this relationship for the most highly-trained agent —
haviour than the agent that selects random actions, and the         both DQN 5M and DQN 10M peak in their side effect score
low rewards serve as a baseline for comparison (e.g., inter-        after exploration rates of 0.1 and 0.25 respectively, before
estingly, the pass rate for DQN 5M is worse than the uniform        dropping to 0. Similarly, when we look at the Uniform agent
random agent). DQN and PPO do not overlap in their reward           and its exploration rate, there is a very tight relationship be-
scores, but the side effect score for PPO 30M is not very far       tween the two, with the side effect score increasing with the
from the best side effect score (DQN 5M). This suggests that        exploration rate from 0 to 0.4, before dropping from 0.4 to
there is no clear association between rewards and scores, at        0.6 and remaining at 0 hereafter.
least if we look at the aggregated results.                            These relationships can be somewhat explained by Figure
   Among the pass rate, average reward and side effect score        6 where we see the mean number of green cells in the inac-
we also see large standard deviations suggesting that none          tion distribution compared against the exploration rate. Here
of the agents learn to perform consistently. The large range        we see similar patterns to the comparison of exploration rate
of procedural generation parameters is likely the cause of          and side effects. It appears that the number of green cells
this, preventing agents from overfitting to one particular set      an environment contains will affect the exploration rate, and
of generation patterns. The exceptions to this are DQN 5M           that this will in turn affect the proportional amount of dam-
and 10M, which have significantly lower standard deviations         age certain agents can do. Why is it so? The rationale has to
for the pass rate by virtue of having a significantly lower         be found in the way the environments are generated, as in-
pass rate that is bounded below by 0. This may also conceal         troducing green cells makes full exploration of the whole en-
                        Difficulty     Grid Size     sideEffectMin   GoalMin      GoalMax        Temperature
                            0          10 × 10              0           0            0              0.1
                            1          15 × 15            0.03        0.05          0.1             0.1
                            2          15 × 15            0.03        0.05          0.1             0.2
                            3          15 × 15            0.05        0.05          0.1             0.2
                            4          15 × 15            0.07        0.05          0.15            0.2
                            5          15 × 15            0.09        0.05          0.2             0.2
                            6          15 × 15             0.1        0.05          0.2             0.2
                            7          15 × 15             0.1         0.1          0.25            0.3
                            8          15 × 15             0.1        0.15          0.25            0.4
                            9          15 × 15             0.1        0.15          0.35            0.5
                           10          15 × 15             0.1        0.15          0.4             0.5
                           11          15 × 15             0.1        0.15          0.45            0.6
                           12          15 × 15             0.1         0.2          0.5             0.7

                                       Table 1: Difficulty Parameters for our SafeLife levels.

                      Agent            Pass Rate      Average Reward     Side Effect Score    Exploration Rate
                    DQN 5M           0.043 (0.203)     0.272 (0.288)       0.229 (0.342)       0.069 (0.042)
                    DQN 10M          0.098 (0.297)     0.417 (0.345)       0.268 (0.357)       0.110 (0.078)
                    DQN 30M          0.299 (0.458)     0.495 (0.348)       0.247 (0.347)       0.121 (0.071)
                     PPO 5M          0.682 (0.466)     0.658 (0.293)       0.324 (0.389)       0.205 (0.138)
                    PPO 10M          0.678 (0.467)     0.676 (0.294)       0.241 (0.344)       0.130 (0.084)
                    PPO 30M          0.709 (0.454)     0.695 (0.275)       0.243 (0.344)       0.148 (0.095)
                     Uniform         0.097 (0.295)     0.181 (0.343)       0.688 (0.450)       0.347 (0.086)

            Table 2: Results for various metrics in SafeLife for our selected agents. Mean and (standard deviation)


vironment more difficult. Basically, explorations above 0.5            reducing exploration and side effects.
correspond to very special environments, such as those that               This suggests that the analysis of the confidence of the
are very small and do not contain green cells and are easy             agent, or some other metacognitive proxies could be used to
to explore fully even for a random agent. This means that              distinguish cases where the agent can complete the task with
we have to be careful to interpret the rightmost part of the           high reward and low effect, and those cases where uncer-
curves in these plots. In a “reward trade-off” scenario about          tainty is higher and the agent should be more conservative.
side effects, the more limited and concentrated the resources
(targeted and non-targeted) are the higher the chance to have                     Conclusion and Future Work
an impact. Large environments with plenty of resources are             Despite additional training time yielding much more capable
easier to handle, such as the one in Figure 2. We may need to          agents, we do not see a similar increase in safety; in fact
consider two ‘difficulties’: one about the task rewards them-          the safest agent overall was DQN after 5M training steps,
selves and another one about the hardness of avoiding side             while being the second worst performing agent in terms of
effects. The use of a random agent can help elucidate these            reward. However, when we analyse the detailed behaviour
two cases, but a more detailed analysis of the generation pa-          using several indicators we find a region of high rewards
rameters such as those in Table 1 may be needed too.                   and low effect, which is only achieved by PPO. This shows
   Overall, the general picture is that DQN behaves between            a non-monotonic relationship, suggesting that the area with
the uniform agent and PPO, and the range depends on the                medium rewards is the most dangerous for proficient agents.
number of training steps. As the agent is better in terms of              This paper highlights the differences in behaviour be-
rewards (from DQN 5M to PPO 30M), the positive relation                tween two commonly used RL algorithms when trained on
between exploration rate and side effect is clearer. The de-           the same task. As the results show, these behavioural differ-
tailed view of Figure 3, and the way it bends down for PPO             ences can have a significant effect on our ability to predict
illustrates that for those environments where the agent seems          the safety and other factors of the agent, such as the confi-
more determined and manages to get high rewards, the ex-               dence or its level of competency compared to elements of the
ploration is lower and so is the side effect. It is mostly for         environment, such as the proportion of green cells. Further
those situations where the reward is in the range of 0.2 to 0.8        analysis and exploration of the difficulty of task instances
that the side effects are strongest. We hypothesise that these         may help to elucidate the cause of these relationships be-
are the cases where the agent only knows partially what to             tween side effects and other metrics, as well as to help us
do. Finally, those cases with very low rewards may be caused           to understand how environment difficulty and agent uncer-
by several situations, such as the agent being stuck in a loop,        tainty can be used to improve policies and make them safer.
                           (a) PPO                                                           (a) PPO


                           (b) DQN                                                           (b) DQN


                      (c) Uniform Agent                                                 (c) Uniform Agent

Figure 3: Mean Side Effect against Reward attained for            Figure 4: Mean Side Effect against Episode Length for
variants of PPO, DQN and Uniform agents. The 13,000               variants of PPO, DQN and Uniform agents. The 13,000
episodes are binned on the x-axis, with the size of the plotted   episodes are binned on the x-axis, with the size of the plotted
points (squares, triangles and circles) being logarithmic on      points (squares, triangles and circles) being logarithmic on
the number of episodes in each bin. The curves are rolling        the number of episodes in each bin. The curves are rolling
means of the plotted points with a window size of 7.              means of the plotted points with a window size of 7.
                                                                                             (a) PPO
                           (a) PPO


                                                                                            (b) DQN
                           (b) DQN


                                                                                        (c) Uniform Agent
                      (c) Uniform Agent
                                                                  Figure 6: Number of initial life cells against Exploration
Figure 5: Mean Side Effect against Exploration Rate for
                                                                  Rate for variants of PPO, DQN and Uniform agents. The
variants of PPO, DQN and Uniform agents. The 13,000
                                                                  13,000 episodes are binned on the x-axis, with the size of
episodes are binned on the x-axis, with the size of the plotted
                                                                  the plotted points (squares, triangles and circles) being log-
points (squares, triangles and circles) being logarithmic on
                                                                  arithmic on the number of episodes in each bin. The curves
the number of episodes in each bin. The curves are rolling
                                                                  are rolling means of the plotted points with a window size
means of the plotted points with a window size of 7.
                                                                  of 7.
Acknowledgements: This work was funded by the Leverhulme         Turner, A. M.; Hadfield-Menell, D.; and Tadepalli, P. 2020.
Trust, the Future of Life Institute, FLI (grant RFP2-152), the   Conservative Agency via Attainable Utility Preservation.
EU’s Horizon 2020 research and innovation programme (No.         Proceedings of the AAAI/ACM Conference on AI, Ethics,
952215, TAILOR), and EU (FEDER) and Spanish MINECO un-           and Society doi:10.1145/3375627.3375851. URL http://dx.
der RTI2018-094403-B-C32, and Generalitat Valenciana GV under    doi.org/10.1145/3375627.3375851.
PROMETEO/2019/098.                                               Wainwright, C. L.; and Eckersley, P. 2019. Safelife 1.0: Ex-
                                                                 ploring side effects in complex environments. arXiv preprint
                       References                                arXiv:1912.01217 .
Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schul-
man, J.; and Mané, D. 2016. Concrete problems in AI safety.
arXiv preprint arXiv:1606.06565 .
Armstrong, S.; and Levinstein, B. 2017. Low impact artifi-
cial intelligences. arXiv preprint arXiv:1705.10720 .
Bostrom, N. 2003. Ethical issues in advanced artificial in-
telligence. Science fiction and philosophy: from time travel
to superintelligence 277–284.
Critch, A.; and Krueger, D. 2020. AI Research Considera-
tions for Human Existential Safety (ARCHES).
Dulac-Arnold, G.; Levine, N.; Mankowitz, D. J.; Li, J.;
Paduraru, C.; Gowal, S.; and Hester, T. 2020. An empirical
investigation of the challenges of real-world reinforcement
learning .
Garcia, J.; and Fernandez, F. 2012. Safe Exploration of
State and Action Spaces in Reinforcement Learning. Jour-
nal of Artificial Intelligence Research 45: 515–564. ISSN
1076-9757. doi:10.1613/jair.3761. URL http://dx.doi.org/
10.1613/jair.3761.
Gardener, M. 1970. Mathematical games: the fantastic com-
binations of John Conway’s new solitaire game “life”.
Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; and Legg,
S. 2019. Penalizing side effects using stepwise relative
reachability.
Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt,
T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI Safety
Gridworlds.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.;
Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Play-
ing Atari with Deep Reinforcement Learning.
Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmark-
ing safe exploration in deep reinforcement learning. arXiv
preprint arXiv:1910.01708 .
Rubner, Y.; Tomasi, C.; and Guibas, L. J. 1998. A Metric
for Distributions with Applications to Image Databases. In
Proceedings of the Sixth International Conference on Com-
puter Vision, ICCV ’98, 59. USA: IEEE Computer Society.
ISBN 8173192219.
Russell, S. 2019. Human Compatible: Artificial Intelligence
and the Problem of Control. Penguin Publishing Group.
Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and
Klimov, O. 2017. Proximal Policy Optimization Algorithms.
Soares, N.; Fallenstein, B.; Armstrong, S.; and Yudkowsky,
E. 2015. Corrigibility. URL https://aaai.org/ocs/index.php/
WS/AAAIW15/paper/view/10124.

</pre>