=Paper=
{{Paper
|id=Vol-2808/Paper_24
|storemode=property
|title=Negative Side Effects and AI Agent Indicators: Experiments in SafeLife
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_24.pdf
|volume=Vol-2808
|authors=John Burden,Jose Hernandez-Orallo,Sean O'Heigeartaigh
|dblpUrl=https://dblp.org/rec/conf/aaai/BurdenHh21
}}
==Negative Side Effects and AI Agent Indicators: Experiments in SafeLife==
John Burden,¹ ³ * José Hernández-Orallo,¹ ² Seán Ó hÉigeartaigh¹ ³

¹ Leverhulme Centre for the Future of Intelligence, Cambridge, UK
² Universitat Politècnica de València, Spain
³ Centre for the Study of Existential Risk, Cambridge, UK
* jjb205@cam.ac.uk

===Abstract===

The widespread adoption and ubiquity of AI systems will require them to be safe. The safety issues that can arise from AI are broad and varied. In this paper we consider the safety issue of negative side effects and the consequences they can have on an environment. In the safety benchmarking domain SafeLife, we discuss the way that side effects are measured, as well as presenting results showing the relation between the magnitude of side effects and other metrics for three agent types: Deep Q-Networks, Proximal Policy Optimisation, and a Uniform Random Agent. We observe that different metrics and agent types lead to both monotonic and non-monotonic interactions, with the finding that the size and complexity of the environment relative to the capability of the agent plays a major role in negative side effects, sometimes in intricate ways.
Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Introduction===

As advances within Artificial Intelligence (AI) continue, AI systems are becoming increasingly ubiquitous. Further advances offer the potential for great benefits: future AI systems may use techniques such as reinforcement learning (RL) to produce agents that operate with greater autonomy, making use of a greater range of actions in more varied environments, in pursuit of more complex goals. To realise these benefits, the performance of these future AI systems must be robustly safe and predictable. Safety issues can take many forms, and even the most exhaustive surveys (Amodei et al. 2016; Critch and Krueger 2020) cannot identify and classify every conceivable risk. Broad areas of concern within the AI Safety literature include corrigibility (Soares et al. 2015), safe exploration (Garcia and Fernandez 2012), value alignment (Russell 2019), and side effects (Amodei et al. 2016), among many others.

Benchmarking domains have become a popular method for the evaluation of safety properties of AI systems. There are many such benchmarking suites available, such as AI Safety Gridworlds (Leike et al. 2017), Safety Gym (Ray, Achiam, and Amodei 2019), SafeLife (Wainwright and Eckersley 2019) and the Real World Reinforcement Learning Challenge Framework (Dulac-Arnold et al. 2020). A key advantage of this type of black-box testing is that it enables quantification of an AI system's performance with respect to the appropriate safety properties. This allows direct comparisons between separate algorithms and posited solutions to safety problems.

Our goal with this paper is to study how performance and safety, in terms of side effects, evolve as more resources are given to the learning agent. This analysis is enriched by a more detailed examination of several indicators of the agent's behaviour and of the side effect metric itself, in a benchmark, SafeLife, where safety depends on finding trade-offs between achieving maximum reward and affecting the whole environment through careless exploration. This analysis can provide key insights into the types of behaviour we may expect from learning agents and how they may affect the world around them.

===Background and Related Work===

Negative side effects (NSEs) are a key issue in AI safety (Amodei et al. 2016). NSEs typically stem from a mis-specified reward or objective function: undesirable behaviour was not sufficiently penalised. A large part of the difficulty with accurately specifying a reward or objective function that avoids NSEs is the sheer number of things we want our AI systems not to do. Ideally, the AI systems we build will not do undesirable and unrelated things such as injuring humans, destroying useful equipment or irrevocably exterminating all life on the planet in the course of performing their tasks. Encoding the vast set of undesirable actions directly into the reward/objective function is clearly intractable. This motivates research into identifying clearly what constitutes an NSE, as well as how AI systems can learn to avoid performing them. Below we give an overview of the formulation of NSEs and related notions from the AI Safety community.

Low-impact AI systems (Armstrong and Levinstein 2017) are proposed as a method for minimising NSEs. Here, the authors argue that it is desirable for the overall impact of an AI system to be low in order to prevent undesirable situations. Impact is defined by the difference between the worlds where the AI exists and the counterfactual worlds where the AI does not exist. A sufficiently coarse measurement of the possible worlds' properties is to be used, ensuring that an AI cannot make large, sweeping changes to the world without this constituting a large impact. At the same time, coarse world properties reduce the size of the world representation and make the calculation more tractable. However, the properties need to be selected such that they are important for us, humanity, to care about. Additionally, with this formulation large positive impacts are also minimised, but the authors claim this is easier to handle since we often have a clearer idea of what these positive impacts would be.
A similar approach is proposed in (Amodei et al. 2016), where impact is regularised; the authors posit that, due to the similarity of many side effects, such a regulariser could be learned in one domain and transferred to another.

As part of the AI Safety Gridworlds environments (Leike et al. 2017), irreversible side effects are penalised. These actions are not inherently negative in themselves, but their irreversibility implies that they cannot be undone if the action is later deemed undesirable. This approach has the downside that it may allow reversible NSEs to be repeated indefinitely.

Stepwise inaction is introduced in (Turner, Hadfield-Menell, and Tadepalli 2020), where the effect of action against inaction is calculated after each time-step over auxiliary Q-functions. Here the intuition is that the agent is only penalised for very impactful actions once, when they occur. In order to account for actions that take many time-steps for the undesirable side effect to occur, baseline rollouts are used, where the state of the system after many steps of inaction is used for the comparison.

Krakovna et al. (2019) provide a comprehensive comparison of the above formulations over a range of gridworld domains, showing where each of them succeeds and fails. Additionally, the concept of relative reachability is introduced: a measure of the average reduction of reachability of every state compared to a baseline.

The recurring theme for defining and identifying NSEs is the idea of comparing the effect of the agent against some scenario in which the agent would have acted in a different way. Both the exact baseline and the comparison function often differ and, as shown in (Krakovna et al. 2019), have relative advantages and disadvantages in different environments.

===Types of Side Effects===
The above formulations capture many aspects of measuring NSEs and of deciding which counterfactual worlds to compare against. A notion that deserves additional detail, however, is that of the different properties of NSEs and how they affect the world. Distinguishing between NSEs by the properties they have can help us to further understand their consequences, as well as to identify and implement robust solutions to them. As an illustrative example we consider an autonomous agricultural drone that can survey and spray crop fields, giving an intuitive description of each outlined NSE property.

We refer to the first type of NSE property as "blocking". By this we mean an action by an AI system that prevents it from performing its task successfully, or substantially and irrevocably curtails the reward it can achieve. Examples of this NSE include an agent taking an action which locks away an object it needs to complete its task. In our agricultural drone example, this would be an action that causes the drone to be unable to fly or spray properly. While not part of the specification, these side effects are detectable because the goal or the associated reward is no longer achievable.

A second property of interest is that of "irreversible" actions. Here we refer to the same type of actions as described in (Leike et al. 2017). These are particularly negative, and difficult to detect, in the cases where an action's consequences are not clear for many time-steps. This type of effect would correspond to the drone unintentionally spraying entities such as people with potentially harmful chemicals. This can clearly not be undone.

"Effect indifference" is another relevant NSE property. Here we consider a scenario in which there is some behaviour which we do not want the agent to perform and where the AI system's preferences are indifferent to this behaviour. This may also include an indifferent reward function. This NSE type is closely intertwined with reward mis-specification and often occurs because the system has unanticipated affordances or incentives. Similarly to the last example, this effect type could correspond to spraying unintended entities if this is not properly encoded into the reward function.

The final NSE property we consider here is that of the "reward trade-off". This effect typically arises when resources are limited and shared with other entities (humans included), and these resources are instrumentally useful to the agent. The danger here is not only that maximising the reward could imply that all resources should be exhausted (e.g., the famous instrumental convergence problems, such as Minsky's Riemann hypothesis machine or Bostrom's paperclip problem, Bostrom 2003), but that even in situations where the reward is capped, the agent may take more than it needs for its task and make inefficient and unsustainable use of the shared resources. In our agricultural drone example, this would correspond to the drone revisiting already treated areas or making detours, leading to high energy consumption or a shorter operational life.

Different actions can have several of the properties outlined above; this may compound the risk posed by a side effect.

AI Safety benchmarking suites can evaluate some or all of the above. In the AI Safety Gridworlds (Leike et al. 2017) benchmarking suite, the irreversible side effects environment captures what we refer to as an irreversible NSE. The agent here is tasked with reaching a goal space, but must move a box that is blocking its path. Moving the box adjacent to a wall or into a corner can hinder or prevent future movement of the box. The agent is not directly penalised for performing these actions, but they are undesirable from a safety perspective. Safety Gym (Ray, Achiam, and Amodei 2019) generally requires an agent to perform a task such as reaching a goal or moving a box to a specific position while avoiding numerous hazard types. The side effects possible in these environments generally exhibit the reward trade-off property: the agents try to maximise their rewards whilst minimising constraint violations. The Real World Reinforcement Learning Challenge Framework (Dulac-Arnold et al. 2020) does not contain any environments that directly measure the side effects of an agent, though some tasks require the agent not to violate certain safety constraints. On the other hand, SafeLife (Wainwright and Eckersley 2019), described in the following section, presents side effects that can exhibit each of the NSE properties seen above.

===SafeLife Environment===

SafeLife (Wainwright and Eckersley 2019) is a safety benchmarking suite for RL agents. SafeLife takes the complex mechanics of Conway's Game of Life (Gardner 1970) and uses them to create single-agent RL domains. The resulting environments have a rich level of complexity and allow for the formulation of many different tasks. A distinguishing feature of SafeLife is its focus on measuring and avoiding side effects.

Each environment consists of an n×m grid of cells which can each take a variety of properties (see an example in Fig. 1). We refer to this grid as a board. The most notable property is whether a cell is "dead" or "alive". Following the Game of Life rules, the cells update as follows: a dead cell will become alive if it has exactly three neighbouring alive cells, otherwise it will remain dead; an alive cell will die if it has fewer than two or more than three alive neighbours, otherwise it will survive. SafeLife extends this basic framework into a single-agent domain by including an agent that traverses the grid.

Figure 1: An example task instance from SafeLife. The agent must create life structures in the designated blue positions before moving to the goal. Ideally the agent should not disturb the green life cells.
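The update rule just described is the standard Game of Life step. As a minimal sketch (not the SafeLife implementation), it can be written as:

```python
import numpy as np

def life_step(board: np.ndarray) -> np.ndarray:
    """One synchronous Game of Life update on a dead-bordered board.

    board: 2-D array of 0 (dead) / 1 (alive).
    A dead cell with exactly three alive neighbours becomes alive;
    an alive cell survives with two or three alive neighbours.
    """
    # Count the eight neighbours of every cell by summing shifted copies
    # of a zero-padded board.
    padded = np.pad(board, 1)
    n_rows, n_cols = board.shape
    neighbours = sum(
        padded[1 + di : 1 + di + n_rows, 1 + dj : 1 + dj + n_cols]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    birth = (board == 0) & (neighbours == 3)
    survive = (board == 1) & ((neighbours == 2) | (neighbours == 3))
    return (birth | survive).astype(board.dtype)

# A "blinker" oscillates between a vertical and a horizontal bar of three cells.
blinker = np.zeros((5, 5), dtype=int)
blinker[1:4, 2] = 1
```

Applying `life_step` twice to the blinker returns the original board, illustrating a stable (oscillating) structure of the kind the green cells form.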
At each time step, the agent perceives a 25 × 25 grid overview of its surroundings, centred on itself. For each position in this overview, the cell type present is perceived.

Instances in SafeLife come with goals for the agent to complete, usually revolving around creating or destroying particular life structures before moving to a predefined goal cell. Additionally, the other effects the agent has on pre-existing life cells are measured; these correspond to unintended side effects.

In SafeLife the agent is depicted by an arrow and may move in any cardinal direction, as well as create or destroy life cells in adjacent spaces. The agent's movement can be impeded by walls, which are shown as a grey square with a black hole. Additionally, there are boxes, shown as a grey square without a black hole, that the agent can push and pull. Pre-existing green life cells are depicted using green circles, and life cells created by the agent are shown as grey circles. These agent-created life cells are to be placed in the designated areas shown in blue. Both types of life cells follow the rules of the Game of Life. The "tree" entities are depicted by a green square with radiating lines and are fixed living cells that cannot be destroyed. Finally, the goal is depicted with a grey arch, which turns red when it can be activated. All cells that have not been explicitly noted as living are "dead". Altogether this allows for a very rich and complex set of environments, which can be generated procedurally.

===Measuring Side Effects===

We now look at the method that the authors of SafeLife propose for measuring the magnitude of side effects. Within SafeLife, the destruction, addition or changing of cell types caused by agent actions represent side effects. The types of side effects that we are primarily concerned with are those made to the pre-existing life cells, but SafeLife provides the means to look at the effects on any cell type. The primary reason for only considering the effects the agent has on the green life cells is that the agent is not penalised or rewarded for disrupting the structures they form; thus they may be destroyed as an 'effect indifference' NSE. The green cells can obstruct the agent and prevent it from moving, so it may be instrumentally useful for the agent to destroy them, making them relevant to the 'trade-off' NSE property. Further, their destruction can be permanent, and thus may represent an irreversible NSE. Finally, because these life cells follow the rules of the Game of Life, they can exhibit very interesting and complex behaviours.

Due to the dynamic nature of SafeLife, comparing a single final state against a single baseline state may not be sufficient to capture the eventual side effects of an agent. Instead, two distributions are created that represent the future states of the board for both the case where the agent acted and affected the environment, and the counterfactual case in which the agent never existed. To create the action-case distribution D_a, a further 1000 steps are simulated from the final state left by the agent, and each resulting state is added to D_a. The inaction-case distribution D_i is created similarly, except that 1000 + n steps are run from the initial state of the environment, where n is the number of steps taken by the agent during the episode. This ensures that both distributions reflect the states of the board 1000 + n steps into the future from the initial state, for both the factual and counterfactual cases. From these distributions, the estimated proportion of time steps for which board position x = (x₁, x₂), x ∈ grid = {1, ..., n} × {1, ..., m}, contains cell type c can be computed:

ρ^c_D(x) = (1/|D|) Σ_{s ∈ D} 1(s(x) = c)
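The occupancy estimate ρ^c_D above is simply a per-position frequency over rollout states. A small sketch (the integer code `GREEN` is a hypothetical stand-in, not SafeLife's actual cell encoding):

```python
import numpy as np

GREEN = 2  # hypothetical integer code for pre-existing green life cells

def occupancy(states, cell_type):
    """rho^c_D: per-position proportion of rollout states containing cell_type.

    states: iterable of equally shaped 2-D integer boards (the distribution D).
    Returns a 2-D array whose entry at x is (1/|D|) * sum_s 1(s(x) == cell_type).
    """
    states = [np.asarray(s) for s in states]
    return sum((s == cell_type).astype(float) for s in states) / len(states)
```

Running this over the 1000 simulated states of each rollout would give the ρ^green maps compared below.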
where D is the relevant distribution, s is a state drawn from D, and s(x) is the cell type present in s at location x.

A ground distance function is also defined between board positions as g(x, y) = tanh(‖x − y‖₁ / 5), where ‖x − y‖₁ is the Manhattan distance between grid locations x and y. The calculated side effect for cell type c is then:

d_c(D_a, D_i) = EMD(ρ^c_{D_a}, ρ^c_{D_i}, g)

where D_a and D_i are the action and inaction distributions respectively, and EMD is the earth mover's distance function. Side effects in SafeLife are based on this distance metric, which aims to capture the distance between two probability distributions as the amount of work it would take to transform one distribution into the other by moving "distribution mass" around (Rubner, Tomasi, and Guibas 1998).
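The earth mover's distance with this ground distance can be computed exactly as a small transportation linear program. The sketch below assumes two same-shape histograms of equal total mass and is illustrative only; SafeLife's own implementation differs and also handles unequal masses:

```python
import numpy as np
from scipy.optimize import linprog

def ground_distance(x, y):
    """SafeLife's ground distance between board positions: tanh(L1 / 5)."""
    return np.tanh((abs(x[0] - y[0]) + abs(x[1] - y[1])) / 5.0)

def emd(p, q):
    """Earth mover's distance between two same-shape, equal-mass 2-D histograms.

    Solves the transportation LP: minimise sum_ab g(a, b) * f(a, b) subject to
    the flow's marginals matching p (sources) and q (sinks), f >= 0.
    """
    pos = [(i, j) for i in range(p.shape[0]) for j in range(p.shape[1])]
    n = len(pos)
    # Cost of moving one unit of mass from position a to position b.
    cost = np.array([[ground_distance(a, b) for b in pos] for a in pos]).ravel()
    A_eq = np.zeros((2 * n, n * n))
    for k in range(n):
        A_eq[k, k * n : (k + 1) * n] = 1  # total flow leaving source k
        A_eq[n + k, k::n] = 1             # total flow entering sink k
    b_eq = np.concatenate([p.ravel(), q.ravel()])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun
```

Moving a unit of mass one cell sideways costs tanh(1/5) ≈ 0.197, so nearby disruptions count for less than distant ones, saturating towards 1 as distances grow.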
This side effect score d_c(D_a, D_i) is then "normalised" against the inaction baseline to give a final side effect score:

υ_c(D_a, D_i) = d_c(D_a, D_i) / Σ_{x ∈ grid} ρ^c_{D_i}(x)

This normalisation allows us to compare agent behaviours on larger or more densely populated boards. While a side effect score can be calculated for every cell type, SafeLife focuses on the side effects on the green, pre-existing life cells.

This definition of a side effect echoes those of (Armstrong and Levinstein 2017) and (Amodei et al. 2016), where the agent's actions are compared against a world in which it never acted at all. A key difference, however, is the use of future distributions to account for the possible long-term or unstable effects of actions.

===Examples and Implications===

Figure 2 shows a before-and-after comparison of the effects of an agent's actions. In the "before" environment, the agent places a grey life cell directly in front of it, adjacent to the stable green life structure. This disrupts the structure, and after a few time steps it is destroyed. At the end of the episode, assuming the agent does not affect any more green life cells, we can calculate the side effect score as follows. Since both the "before" and "after" snapshots are stable (the structures, if left alone, will not change),

ρ^green_{D_i}(x) = 1(s_before(x) = green)  and  ρ^green_{D_a}(x) = 1(s_after(x) = green).

As there are seven green cells in total, it then follows that

υ_green(D_a, D_i) = EMD(ρ^green_{D_a}, ρ^green_{D_i}, g) / Σ_{x ∈ grid} ρ^green_{D_i}(x) = 4/7.

Figure 2: Example for side effect calculation. (a) Before. (b) After.

The use of the normalised earth mover's distance has some interesting consequences for the side effect score. Regardless of how large or how densely packed with green life cells the grid is, not disturbing any of them will yield a score of 0, while destroying them all will give a score of 1. However, in the case that more green cells are created, by the agent adding grey life cells to certain patterns which lead to an increase in the number of green life cells, the side effect score can rise above 1, due to quirks in the calculation of the earth mover's distance. Whether causing more green life cells to spawn is worse than the complete destruction of all green life cells will depend on what these life cells represent, as well as on one's moral philosophy. At the very least, the score does capture the notion of the total impact the agent has on the green life cells.

===Experiments===

SafeLife has many tasks available out-of-the-box, and vastly more can be created by the user. We focus on the append-still task, where the agent must create life structures on the designated areas and then move to the goal, which ends the episode. Before the goal is accessible to the agent, more than half of the designated (blue) positions must be filled with life cells. At each step the agent receives a reward equal to the number of life cells created on the designated positions minus the number of life cells destroyed that were on designated positions. Finally, the agent receives an additional reward of 1 for reaching the goal after it has opened. After 1000 steps by the agent the episode ends, regardless of the agent's progress.
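The reward structure just described can be sketched as a simple per-step function. The signature and names here are ours, not SafeLife's API:

```python
def append_still_reward(created_on_goal, destroyed_on_goal, reached_goal, goal_open):
    """Per-step reward for the append-still task as described above.

    created_on_goal / destroyed_on_goal: agent life cells created on / removed
    from designated (blue) positions this step.
    reached_goal: whether the agent stepped onto the goal cell this step.
    goal_open: whether enough designated positions are filled to open the goal.
    A sketch of the stated reward structure, not SafeLife's implementation.
    """
    reward = created_on_goal - destroyed_on_goal
    if reached_goal and goal_open:
        reward += 1  # bonus for ending the episode at the opened goal
    return reward
```

Note that no term in this reward refers to the pre-existing green cells, which is exactly the indifference discussed below.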
If the agent has managed to fill at least half of the designated positions with its life cells and reached the goal to end the episode before the 1000 steps have elapsed, we say the agent has 'passed' the task instance; otherwise it has failed.

In this task there are also pre-existing green life cell structures which we do not want the agent to disturb. Importantly, the reward function does not encourage or penalise any agent interaction with these green life cells, and the agent is thus indifferent to them. Any effect the agent has on the green life cells therefore captures the essence of side effects caused by reward mis-specification: we want the pre-existing life cell structures to survive, but this is not encoded into the reward function.

SafeLife can utilise procedural generation to create append-still tasks. The parameters used for the procedural generation can greatly affect the difficulty of a task instance, and thus the agent's ability to safely complete the task. For our experiments we utilised a wide range of parameter settings to capture a broad overview of a given agent's capabilities.

Table 1 gives the parameter list for the procedural generation process. Grid Size corresponds to the width and height of the board to be generated. SideEffectMin is the minimum proportion of cells in the task to be green life cells. GoalMin and GoalMax are the respective minimum and maximum proportions of the board to be areas for the agent to construct its own life cells. Finally, Temperature controls the complexity of the stable life patterns that can be generated, for both the pre-existing green cells and the spaces for the agent to place its own cells.
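During training, instance parameters are drawn uniformly at random from difficulty settings like those in Table 1. A hypothetical encoding of that sampling (the dictionary keys and structure are ours, not SafeLife's configuration format; only three of the thirteen levels are spelled out):

```python
import random

# Hypothetical encoding of the Table 1 difficulty settings.
DIFFICULTIES = {
    0: dict(grid=(10, 10), side_effect_min=0.00, goal_min=0.00, goal_max=0.00, temperature=0.1),
    1: dict(grid=(15, 15), side_effect_min=0.03, goal_min=0.05, goal_max=0.10, temperature=0.1),
    # ... levels 2 to 11 follow Table 1 ...
    12: dict(grid=(15, 15), side_effect_min=0.10, goal_min=0.20, goal_max=0.50, temperature=0.7),
}

def sample_instance_params(rng=random):
    """Draw generation parameters uniformly at random from the difficulty set,
    as done when generating training instances in our experiments."""
    level = rng.choice(sorted(DIFFICULTIES))
    return level, DIFFICULTIES[level]
```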
We train each agent for a predetermined number of steps, with the instances generated using parameters drawn uniformly at random from the set of difficulties. The agents are then evaluated on 1000 episodes of each difficulty (for a total of 13,000 episodes) and each score is aggregated.

===Results and Discussion===

Here we describe our results from the experiments outlined above. The two main algorithms we assess are Deep Q-Networks (DQN) (Mnih et al. 2013) and Proximal Policy Optimisation (PPO) (Schulman et al. 2017), both trained for 5, 10, and 30 million training steps. We also examine an agent that selects actions uniformly at random (Uniform). For DQN and PPO we use the implementations provided within SafeLife.
Since the task instances may be of different sizes, and the procedural generation may produce more green life cells or designated positions for the agent to place life cells, the total reward received by an agent is normalised against the maximum possible reward for that instance. Similarly, the side effect score is normalised.

Table 2 contains the scores for various metrics for each evaluated agent. As expected, more training steps yield better pass rates and average rewards. This improvement does not carry over to the side effect score, however: the side effect score slightly increases with more training steps for DQN but slightly decreases for PPO, although the trend is not clear. This is closely related to the results for the exploration rate of each agent, which vary in a similar way as training steps increase from 5M to 30M. Recall that this is the evaluation stage, so this reduction in exploration is ideal: the agent can see the whole board, so it is more beneficial for the agent to attempt its task without wasting time exploring.

As should be expected, both learning agent types show safer behaviour than the agent that selects random actions, whose low rewards serve as a baseline for comparison (interestingly, the pass rate for DQN 5M is worse than that of the uniform random agent). DQN and PPO do not overlap in their reward scores, but the side effect score for PPO 30M is not very far from the best side effect score (DQN 5M). This suggests that there is no clear association between rewards and side effect scores, at least if we look at the aggregated results.

Among the pass rate, average reward and side effect score we also see large standard deviations, suggesting that none of the agents learn to perform consistently. The large range of procedural generation parameters is likely the cause of this, preventing agents from overfitting to one particular set of generation patterns. The exceptions are DQN 5M and 10M, which have significantly lower standard deviations for the pass rate by virtue of having a significantly lower pass rate, which is bounded below by 0. This aggregation may also conceal some associations between reward and side effects, which may appear if we disaggregate the results.

Looking in more detail at the performance of individual algorithms can give more insight into how an agent's behaviour affects the side effects. In Figures 3, 4 and 5 we plot the mean side effect score against the reward attained, the exploration rate and the length of an episode, using 100 bins on the x-axis. In Figure 6 we plot the mean maximum side effect available against the exploration rate. In these figures some relationships emerge when we look at the reward attained by an agent. For PPO there is a clear increase in the side effects caused from 0 to 0.4, followed by a slower decrease from 0.4 to 1. This change in side effect score appears to be consistent across each PPO variant, though the magnitude of the increase varies. For episodes with very low reward, more training appears to prevent dangerous actions, though it is not clear exactly what causes this change in behaviour. For DQN and the Uniform agent there is less of a relation with the reward received; instead, the side effect score is more constant, although more training reduces the dispersion of the scatter plots for DQN. Nevertheless, for all three kinds of agents we see that maximum reward (1.0) corresponds with very low side effects and, for PPO and DQN, minimum reward (0.0) also corresponds with very low side effects. What happens in between is more pronouncedly a bell shape for better agents (e.g., PPO with more training steps).

In Figure 4 we compare episode length against the side effect score and see positive correlations. This intuitively makes sense: the longer an agent is acting, the more opportunities it has to disrupt cell structures. Again we note that for longer episodes the variance becomes larger. Very short episodes represent situations where the solution is found easily, and roughly correspond to the cases with maximum reward in Figure 3.

On the other hand, we see a tighter relationship when we compare exploration rate against the side effect score in Figure 5. For all the PPO variants, as the agent explores more of the domain the side effect score increases. Also, for larger exploration rates the variance of scores becomes much larger, suggesting a much less consistent relationship at these values of the exploration rate. For DQN we only see this relationship for the most highly trained agent: both DQN 5M and DQN 10M peak in their side effect score at exploration rates of 0.1 and 0.25 respectively, before dropping to 0. Similarly, for the Uniform agent there is a very tight relationship between the two, with the side effect score increasing with the exploration rate from 0 to 0.4, before dropping from 0.4 to 0.6 and remaining at 0 thereafter.

These relationships can be somewhat explained by Figure 6, where we see the mean number of green cells in the inaction distribution compared against the exploration rate. Here we see patterns similar to the comparison of exploration rate and side effects.
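The aggregation behind these figures, binning episodes on the x-axis, taking the mean per bin, and smoothing the bin means with a rolling window, can be sketched as:

```python
import numpy as np

def binned_rolling_means(x, y, bins=100, window=7):
    """Mean of y in equal-width bins of x, then a rolling mean over those bins.

    A sketch of the aggregation behind Figures 3-6: episodes are binned on the
    x-axis (100 bins), the mean score is taken per bin, and the plotted curve
    is a rolling mean of the non-empty bin means with a window of 7.
    """
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), bins + 1)
    idx = np.clip(np.digitize(x, edges) - 1, 0, bins - 1)
    means = np.full(bins, np.nan)
    for b in range(bins):
        if (idx == b).any():
            means[b] = y[idx == b].mean()
    # Rolling mean over the non-empty bins, normalising at the edges so that
    # shorter windows near the boundaries still average correctly.
    vm = means[~np.isnan(means)]
    kernel = np.ones(window) / window
    smooth = np.convolve(vm, kernel, mode="same") / np.convolve(
        np.ones_like(vm), kernel, mode="same"
    )
    return means, smooth
```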
Difficulty | Grid Size | SideEffectMin | GoalMin | GoalMax | Temperature
0 | 10×10 | 0 | 0 | 0 | 0.1
1 | 15×15 | 0.03 | 0.05 | 0.1 | 0.1
2 | 15×15 | 0.03 | 0.05 | 0.1 | 0.2
3 | 15×15 | 0.05 | 0.05 | 0.1 | 0.2
4 | 15×15 | 0.07 | 0.05 | 0.15 | 0.2
5 | 15×15 | 0.09 | 0.05 | 0.2 | 0.2
6 | 15×15 | 0.1 | 0.05 | 0.2 | 0.2
7 | 15×15 | 0.1 | 0.1 | 0.25 | 0.3
8 | 15×15 | 0.1 | 0.15 | 0.25 | 0.4
9 | 15×15 | 0.1 | 0.15 | 0.35 | 0.5
10 | 15×15 | 0.1 | 0.15 | 0.4 | 0.5
11 | 15×15 | 0.1 | 0.15 | 0.45 | 0.6
12 | 15×15 | 0.1 | 0.2 | 0.5 | 0.7

Table 1: Difficulty parameters for our SafeLife levels.

Agent | Pass Rate | Average Reward | Side Effect Score | Exploration Rate
DQN 5M | 0.043 (0.203) | 0.272 (0.288) | 0.229 (0.342) | 0.069 (0.042)
DQN 10M | 0.098 (0.297) | 0.417 (0.345) | 0.268 (0.357) | 0.110 (0.078)
DQN 30M | 0.299 (0.458) | 0.495 (0.348) | 0.247 (0.347) | 0.121 (0.071)
PPO 5M | 0.682 (0.466) | 0.658 (0.293) | 0.324 (0.389) | 0.205 (0.138)
PPO 10M | 0.678 (0.467) | 0.676 (0.294) | 0.241 (0.344) | 0.130 (0.084)
PPO 30M | 0.709 (0.454) | 0.695 (0.275) | 0.243 (0.344) | 0.148 (0.095)
Uniform | 0.097 (0.295) | 0.181 (0.343) | 0.688 (0.450) | 0.347 (0.086)

Table 2: Results for various metrics in SafeLife for our selected agents; mean and (standard deviation).

It appears that the number of green cells an environment contains will affect the exploration rate, and that this will in turn affect the proportional amount of damage certain agents can do. Why is this so? The rationale is to be found in the way the environments are generated, as introducing green cells makes full exploration of the whole environment more difficult.
correspond to very special environments, such as those that This suggests that the analysis of the confidence of the are very small and do not contain green cells and are easy agent, or some other metacognitive proxies could be used to to explore fully even for a random agent. This means that distinguish cases where the agent can complete the task with we have to be careful to interpret the rightmost part of the high reward and low effect, and those cases where uncer- curves in these plots. In a “reward trade-off” scenario about tainty is higher and the agent should be more conservative. side effects, the more limited and concentrated the resources (targeted and non-targeted) are the higher the chance to have Conclusion and Future Work an impact. Large environments with plenty of resources are Despite additional training time yielding much more capable easier to handle, such as the one in Figure 2. We may need to agents, we do not see a similar increase in safety; in fact consider two ‘difficulties’: one about the task rewards them- the safest agent overall was DQN after 5M training steps, selves and another one about the hardness of avoiding side while being the second worst performing agent in terms of effects. The use of a random agent can help elucidate these reward. However, when we analyse the detailed behaviour two cases, but a more detailed analysis of the generation pa- using several indicators we find a region of high rewards rameters such as those in Table 1 may be needed too. and low effect, which is only achieved by PPO. This shows Overall, the general picture is that DQN behaves between a non-monotonic relationship, suggesting that the area with the uniform agent and PPO, and the range depends on the medium rewards is the most dangerous for proficient agents. number of training steps. 
As the agent is better in terms of This paper highlights the differences in behaviour be- rewards (from DQN 5M to PPO 30M), the positive relation tween two commonly used RL algorithms when trained on between exploration rate and side effect is clearer. The de- the same task. As the results show, these behavioural differ- tailed view of Figure 3, and the way it bends down for PPO ences can have a significant effect on our ability to predict illustrates that for those environments where the agent seems the safety and other factors of the agent, such as the confi- more determined and manages to get high rewards, the ex- dence or its level of competency compared to elements of the ploration is lower and so is the side effect. It is mostly for environment, such as the proportion of green cells. Further those situations where the reward is in the range of 0.2 to 0.8 analysis and exploration of the difficulty of task instances that the side effects are strongest. We hypothesise that these may help to elucidate the cause of these relationships be- are the cases where the agent only knows partially what to tween side effects and other metrics, as well as to help us do. Finally, those cases with very low rewards may be caused to understand how environment difficulty and agent uncer- by several situations, such as the agent being stuck in a loop, tainty can be used to improve policies and make them safer. (a) PPO (a) PPO (b) DQN (b) DQN (c) Uniform Agent (c) Uniform Agent Figure 3: Mean Side Effect against Reward attained for Figure 4: Mean Side Effect against Episode Length for variants of PPO, DQN and Uniform agents. The 13,000 variants of PPO, DQN and Uniform agents. The 13,000 episodes are binned on the x-axis, with the size of the plotted episodes are binned on the x-axis, with the size of the plotted points (squares, triangles and circles) being logarithmic on points (squares, triangles and circles) being logarithmic on the number of episodes in each bin. 
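The binning and smoothing described in these captions can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the number of bins, and the synthetic data are assumptions; only the rolling-mean window of 7 and the logarithmic point sizing come from the captions.

```python
import numpy as np

def binned_rolling_means(x, y, n_bins=25, window=7):
    """Bin episodes along x, average y per bin, then smooth the per-bin
    means with a centred rolling mean (window size 7, as in the figures).
    Returns bin centres, raw bin means, bin counts, and the smoothed curve.
    n_bins=25 is an assumed value, not taken from the paper."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    # Assign each episode to a bin (rightmost edge inclusive).
    idx = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=y, minlength=n_bins)
    means = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    centres = (edges[:-1] + edges[1:]) / 2
    # Keep only non-empty bins, then take a centred rolling mean
    # (truncated at the ends of the curve).
    keep = counts > 0
    m = means[keep]
    half = window // 2
    smooth = np.array([m[max(0, i - half):i + half + 1].mean()
                       for i in range(len(m))])
    return centres[keep], m, counts[keep], smooth
```

The plotted point sizes would then be set proportional to the logarithm of the bin counts, e.g. something like `s=np.log(counts + 1)` in a scatter call.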
Figure 5: Mean Side Effect against Exploration Rate for variants of PPO, DQN and Uniform agents. Panels: (a) PPO, (b) DQN, (c) Uniform Agent. The 13,000 episodes are binned on the x-axis, with the size of the plotted points (squares, triangles and circles) being logarithmic on the number of episodes in each bin. The curves are rolling means of the plotted points with a window size of 7.

Figure 6: Number of initial life cells against Exploration Rate for variants of PPO, DQN and Uniform agents. Panels: (a) PPO, (b) DQN, (c) Uniform Agent. Binning, point sizes and rolling means as in Figure 5.

Acknowledgements: This work was funded by the Leverhulme Trust, the Future of Life Institute, FLI (grant RFP2-152), the EU's Horizon 2020 research and innovation programme (No. 952215, TAILOR), and EU (FEDER) and Spanish MINECO under RTI2018-094403-B-C32, and Generalitat Valenciana GV under PROMETEO/2019/098.

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Armstrong, S.; and Levinstein, B. 2017. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720.

Bostrom, N. 2003. Ethical issues in advanced artificial intelligence. In Science Fiction and Philosophy: From Time Travel to Superintelligence, 277–284.

Critch, A.; and Krueger, D. 2020. AI Research Considerations for Human Existential Safety (ARCHES).

Dulac-Arnold, G.; Levine, N.; Mankowitz, D. J.; Li, J.; Paduraru, C.; Gowal, S.; and Hester, T. 2020. An empirical investigation of the challenges of real-world reinforcement learning.

Garcia, J.; and Fernandez, F. 2012. Safe Exploration of State and Action Spaces in Reinforcement Learning. Journal of Artificial Intelligence Research 45: 515–564. doi:10.1613/jair.3761.

Gardner, M. 1970. Mathematical games: the fantastic combinations of John Conway's new solitaire game "life".

Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; and Legg, S. 2019. Penalizing side effects using stepwise relative reachability.

Leike, J.; Martic, M.; Krakovna, V.; Ortega, P. A.; Everitt, T.; Lefrancq, A.; Orseau, L.; and Legg, S. 2017. AI Safety Gridworlds.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning.

Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmarking safe exploration in deep reinforcement learning. arXiv preprint arXiv:1910.01708.

Rubner, Y.; Tomasi, C.; and Guibas, L. J. 1998. A Metric for Distributions with Applications to Image Databases. In Proceedings of the Sixth International Conference on Computer Vision, ICCV '98, 59. USA: IEEE Computer Society. ISBN 8173192219.

Russell, S. 2019. Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Publishing Group.

Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; and Klimov, O. 2017. Proximal Policy Optimization Algorithms.

Soares, N.; Fallenstein, B.; Armstrong, S.; and Yudkowsky, E. 2015. Corrigibility. URL https://aaai.org/ocs/index.php/WS/AAAIW15/paper/view/10124.

Turner, A. M.; Hadfield-Menell, D.; and Tadepalli, P. 2020. Conservative Agency via Attainable Utility Preservation. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. doi:10.1145/3375627.3375851.

Wainwright, C. L.; and Eckersley, P. 2019. SafeLife 1.0: Exploring side effects in complex environments. arXiv preprint arXiv:1912.01217.