=Paper=
{{Paper
|id=Vol-2808/Paper_20
|storemode=property
|title=Challenges for Using Impact Regularizers to Avoid Negative Side Effects
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_20.pdf
|volume=Vol-2808
|authors=David Lindner,Kyle Matoba,Alexander Meulemans
|dblpUrl=https://dblp.org/rec/conf/aaai/LindnerMM21
}}
==Challenges for Using Impact Regularizers to Avoid Negative Side Effects==
David Lindner¹*, Kyle Matoba²*, Alexander Meulemans³*

¹ Department of Computer Science, ETH Zurich
² Idiap and EPFL
³ Institute of Neuroinformatics, University of Zurich and ETH Zurich
david.lindner@inf.ethz.ch, kyle.matoba@epfl.ch, ameulema@ethz.ch

* The authors contributed equally. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Designing reward functions for reinforcement learning is difficult: besides specifying which behavior is rewarded for a task, the reward also has to discourage undesired outcomes. Misspecified reward functions can lead to unintended negative side effects and overall unsafe behavior. To overcome this problem, recent work proposed to augment the specified reward function with an impact regularizer that discourages behavior that has a big impact on the environment. Although initial results with impact regularizers seem promising in mitigating some types of side effects, important challenges remain. In this paper, we examine the main current challenges of impact regularizers and relate them to fundamental design decisions. We discuss in detail which challenges recent approaches address and which remain unsolved. Finally, we explore promising directions to overcome the unsolved challenges in preventing negative side effects with impact regularizers.

1 Introduction

Specifying a reward function in reinforcement learning (RL) that completely aligns with the designer's intent is a difficult task. Besides specifying what is important to solve the task at hand, the designer also needs to specify how the AI system should behave in the environment in general, which is hard to fully cover. For example, RL agents playing video games often learn to achieve a high score without solving the desired task by exploiting the game (e.g. Saunders et al. 2018). Side effects occur when the behavior of the AI system diverges from the designer's intent because of some considerations that were not anticipated beforehand, such as the possibility to exploit a game. In this work, we focus on side effects that are tied to the reward function, which we define as side effects that would still occur if we had access to an oracle that finds an optimal policy for a given reward function. We explicitly do not consider side effects resulting from the RL algorithm used, which are often discussed under the term safe exploration (García and Fernández 2015).

In practice, the designer typically goes through several iterations of reward specification to optimize the agent's performance and minimize side effects. This is often a tedious process, and there is no guarantee that the agent will not exhibit side effects when it encounters new situations. In fact, such problems with misspecified reward functions have been observed in various practical applications of RL (Krakovna et al. 2020b).

In most situations, it is useful to decompose the reward R(s) into a task-related component R_task(s) and an environment-related component R_env(s), where the latter specifies how the agent should behave in the environment, regardless of the task. (We write the reward function only as a function of states for simplicity, as the state space can formally be extended to include the last action.) As Shah et al. (2019) observe, R_env is related to the frame problem in classical AI (McCarthy and Hayes 1969): we not only have to make a prediction about what is supposed to change, but also about what is supposed to remain unchanged. R_env is more prone to misspecification, because it needs to specify everything beyond the task that can result in undesired outcomes. Because the designer builds an RL agent to solve a specific problem, it is relatively easy to anticipate considerations directly related to solving the task in R_task. Shah et al. (2019) point out that environments are generally already optimized for humans; hence, defining R_env primarily requires specifying which features of the environment the AI system should not disturb. Therefore, penalizing large changes to the current state of the world can be thought of as a coarse approximation of R_env.

Impact regularization (IR) has emerged as a tractable and effective way to approximate R_env (Armstrong and Levinstein 2017; Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). The main idea behind IR is to approximate R_env through a measure of "impact on the environment", which avoids negative side effects and reduces the burden on the reward designer.
In this paper, we discuss IR of the form

    R(s_t) = R_spec(s_t) − λ · d(s_t, b(s_0, s_{t−1}, t)),    (1)

where s_t denotes the state at time step t, R_spec denotes the reward function specified by the designer (R_spec contains the specified parts of both R_task and R_env), and:

* the baseline b(s_0, s_{t−1}, t) provides a state obtained by following a "default" or "safe" policy at timestep t, and uses either the initial state and the current time (s_0, t) to compute it, or else the current state s_{t−1};
* d measures the deviation of the realized state from the baseline state; and
* λ ≥ 0 gives a global scale at which to trade off the specified reward and the regularization.

Composing these three terms gives a general formulation of regularization that encompasses most proposals found in the literature, but permits separate analysis (Krakovna et al. 2019).
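The three components compose mechanically. As an illustration only (not code from the paper), the following Python sketch wires eq. (1) together; `r_spec`, `deviation`, `baseline`, and the state type are placeholders that a concrete method such as RR or AU would have to supply.

```python
from typing import Any, Callable

State = Any  # environment-specific state representation (assumption)

def impact_regularized_reward(
    r_spec: Callable[[State], float],                # specified reward R_spec
    deviation: Callable[[State, State], float],      # deviation measure d
    baseline: Callable[[State, State, int], State],  # baseline b(s_0, s_{t-1}, t)
    lam: float,                                      # trade-off scale lambda >= 0
) -> Callable[[State, State, State, int], float]:
    """Return a function computing R(s_t) = R_spec(s_t) - lam * d(s_t, b(s_0, s_{t-1}, t))."""
    def reward(s0: State, s_prev: State, s_t: State, t: int) -> float:
        b_t = baseline(s0, s_prev, t)  # the "default"/"safe" counterfactual state
        return r_spec(s_t) - lam * deviation(s_t, b_t)
    return reward
```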
We start by giving an overview of the related work on IR (Section 2), before we discuss the three main design decisions for IR. First, we discuss how to choose a baseline (Section 3), emphasizing considerations of environment dynamics and a tendency for agents to offset their actions. Second, we discuss how to quantify deviations from the baseline (Section 4), especially the distinction between negative, neutral, and positive side effects. Third, we discuss how to choose the scale λ (Section 5). Finally, we propose some directions to improve the effectiveness of IR (Section 6). The main contribution of this work is to discuss in detail the current main challenges of IR, building upon previous work, and to suggest possible ways forward to overcome these challenges.

2 Related Work

Amodei et al. (2016) reviewed negative side effects as one of several problems in AI safety, and discussed using impact regularization (IR) to avoid negative side effects. Since then, several concrete approaches to IR have been proposed, of which eq. (1) gives the underlying structure. Armstrong and Levinstein (2017) proposed to measure the impact of the agent compared to the inaction baseline, starting from the initial state s_0. The inaction baseline assumes the agent does nothing, which can be formalized by assuming a non-action exists. (Armstrong and Levinstein (2017) define this baseline as the state the environment would be in had the agent never been deployed. This is slightly different from the definition of the inaction baseline we give here and that later work used, as the mere presence of the agent can influence the environment.) Armstrong and Levinstein (2017) emphasized the importance of a semantically meaningful state representation for the environment when measuring distances from the inaction baseline. While Armstrong and Levinstein (2017) discussed the problem of measuring the impact of an agent abstractly, Krakovna et al. (2019) proposed a concrete deviation measure called Relative Reachability (RR). RR measures the average reduction in the number of states reachable from the current state, compared to a baseline state. This captures the intuition that irreversible changes to the environment should be penalized more, but has advantages over directly using irreversibility as a measure of impact (as e.g. in Eysenbach et al. (2018)), such as making it possible to quantify the magnitude of different irreversible changes.

Turner, Hadfield-Menell, and Tadepalli (2020) and Krakovna et al. (2019) generalized the concept of RR towards the Attainable Utility (AU) and Value Difference (VD) measures respectively, which both share the same structural form for the deviation measure:

    d_VD(s_t, s'_t) = Σ_{x=1}^{X} w_x · f(V_x(s'_t) − V_x(s_t)),    (2)

where x ranges over some sources of value, V_x(s_t) is the value of state s_t according to x, w_x is its weight in the sum, and f is a function characterizing the deviation between the values. AU is a special case of this with w_x = 1/X for all x and the absolute value operator as f. This formulation captures the same intuition as RR, but allows measuring the impact of the agent in terms of different value functions, instead of just counting states. Concretely, AU aims to measure the agent's ability to achieve high utility on a range of different goals in the environment, and penalizes any change that reduces this ability.
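As an illustration of eq. (2) only (assuming each value source V_x is available as a Python callable, which the paper does not specify), the deviation can be computed directly; the defaults w_x = 1/X and f = |·| recover the AU special case described above.

```python
from typing import Any, Callable, Optional, Sequence

State = Any  # environment-specific state representation (assumption)

def value_difference(
    s_t: State,
    s_baseline: State,
    value_fns: Sequence[Callable[[State], float]],  # V_x for x = 1..X
    weights: Optional[Sequence[float]] = None,      # w_x; defaults to 1/X (AU-style)
    f: Callable[[float], float] = abs,              # deviation shape f
) -> float:
    """d_VD(s_t, s'_t) = sum_x w_x * f(V_x(s'_t) - V_x(s_t)), cf. eq. (2)."""
    if weights is None:
        weights = [1.0 / len(value_fns)] * len(value_fns)
    return sum(w * f(v(s_baseline) - v(s_t)) for w, v in zip(weights, value_fns))
```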
Turner, Hadfield-Menell, and Tadepalli (2020) also introduced the stepwise inaction baseline to mitigate offsetting behavior (cf. Section 3.2). This baseline follows an inaction policy starting from the previous state s_{t−1} rather than the starting state s_0. Follow-up work scaled AU towards more complex environments (Turner, Ratzlaff, and Tadepalli 2020).

Krakovna et al. (2020a) built upon the VD measure and introduced an auxiliary loss representing how well the agent could solve future tasks in the same environment, given its current state. This can be seen as a deviation measure in eq. (1) that rewards similarity with a baseline instead of penalizing deviation from it. Eysenbach et al. (2018)'s approach to penalize irreversibility can be seen as a special case of Krakovna et al. (2020a).

Aside from IR, Rahaman et al. (2019) proposed to learn an arrow of time, representing a directed measure of reachability, using the intuition that irreversible actions tend to leave the environment in a more disorderly state, making it possible to define an arrow of time with methods inspired by thermodynamics. As another alternative to IR, Zhang, Durfee, and Singh (2018, 2020) proposed to learn which environmental features an AI system is allowed to change by querying a human overseer. They provided an active querying approach that makes maximally informative queries. Shah et al. (2019) developed a method for learning which parts of the environment a human cares about by assuming that the world is optimized to suit humans. Saisubramanian, Kamar, and Zilberstein (2020) formulated the side effects problem as a multi-objective Markov Decision Process, where they learn a separate reward function penalizing negative side effects and optimize this secondary objective while staying close to the optimal policy of the task objective. Saisubramanian, Zilberstein, and Kamar (2020) provide a broad overview of the various existing approaches for mitigating negative side effects, while we zoom in on one class of approaches, IR, and discuss the corresponding challenges in detail.

3 Choosing a Baseline

Recent work mainly uses two types of baselines in impact regularization (IR): (i) the inaction baseline b(s_0, s_t, t) = T(s_t | s_0, π_inaction) and (ii) the stepwise inaction baseline b(s_0, s_t, t) = T(s_t | s_{t−1}, π_inaction), where T is the distribution over states s_t when starting at state s_0 or s_{t−1} respectively and following the inaction policy π_inaction that always takes an action a_noop that does nothing.
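The two baselines differ only in the state they branch from. A minimal sketch, assuming access to a deterministic environment model `model.step(state, action)` and a designated no-op action (both assumptions, not given in the paper):

```python
def inaction_baseline(model, s0, noop_action, t):
    """Inaction baseline: roll the no-op policy forward from the initial state s0 for t steps."""
    s = s0
    for _ in range(t):
        s = model.step(s, noop_action)
    return s

def stepwise_inaction_baseline(model, s_prev, noop_action):
    """Stepwise inaction baseline: branch off from the previous state s_{t-1} with a single no-op."""
    return model.step(s_prev, noop_action)
```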
Unfortunately, the inaction baseline can lead to undesirable offsetting behavior, where the agent tries to undo the outcomes of its task after collecting the reward, moving back closer to the initial baseline (Turner, Hadfield-Menell, and Tadepalli 2020). The stepwise inaction baseline removes the offsetting incentive of the agent by branching off from the previous state instead of the starting state (Turner, Hadfield-Menell, and Tadepalli 2020). However, Krakovna et al. (2020a) argued that offsetting behavior is desirable in many cases. In Section 3.2 we contribute to this discussion by breaking down in detail when offsetting behavior is desirable or undesirable, whereas in Section 3.3 we argue that the inaction baseline and stepwise inaction baseline can lead to inaction incentives in nonlinear dynamical environments. We start, however, with the fundamental observation in Section 3.1 that the inaction baseline and stepwise inaction baseline do not always represent safe policies.

3.1 Inaction Baselines are not Always Safe

The baseline used in IR should represent a safe policy where the AI system does not harm its environment or itself. In many cases, taking no actions would be a safe policy for the agent, e.g. for a cleaning robot. However, if the AI system is responsible for a task requiring continuous control, inaction of the AI system can be disastrous. For example, if the agent is responsible for driving a car on a highway, doing nothing likely results in a crash. This is particularly problematic for the stepwise inaction baseline, which follows an inaction policy starting from the previous state. The inaction policy starting from the initial state can also be unsafe, for example, if an agent takes over control of the car from a human, and therefore the initial state s_0 already has the car driving.

For this reason, designing a safe baseline for a task or environment that requires continuous control is a hard problem. One possible approach is to design a policy that is known to be safe based on expert knowledge. However, this can be a time-consuming process and is not always feasible. Designing safe baselines for tasks and environments that require continuous control is an open problem that has to be solved before IR can be used in these applications.

3.2 Offsetting

An agent engages in offsetting behavior when it tries to undo the outcomes of previous actions, i.e. when it "covers up its tracks". Offsetting behavior can be desirable or undesirable, depending on which outcomes the agent counteracts.

Undesirable offsetting. Using IRs with an inaction baseline starting from the initial state can lead to undesirable offsetting behavior where the agent counteracts the outcomes of its task (Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). For example, Krakovna et al. (2019) consider a vase on a conveyor belt. The agent is rewarded for taking the vase off the belt, hence preventing it from falling off the belt. The desired behavior is to take the vase and stay put. The offsetting behavior is to take the vase off the belt, collect the reward, and afterwards put the vase back on the conveyor belt to reduce deviation from the baseline. To understand this offsetting behavior, recall the decomposition of the true reward into a task-related and an environment-related component from Section 1. A designer usually specifies a task reward R_spec^task that rewards states signaling task completion (e.g. taking the vase off the belt). However, each task has consequences for the environment, which often are the reason why the task should be completed in the first place (e.g. the vase not being broken). In all but simple tasks, assigning a reward to every task consequence is impossible, and so by omission they have a zero reward. When IR penalizes consequences of completing the task, because they differ from the baseline, this results in undesirable offsetting behavior. The stepwise inaction baseline (Turner, Ratzlaff, and Tadepalli 2020) successfully removes all offsetting incentives. However, in other situations offsetting might be desired.

Desirable Offsetting. In many cases, offsetting behavior is desired, because it can prevent unnecessary side effects. Krakovna et al. (2020a) provide an example of an agent which is asked to go shopping, and needs to open the front door of the house to go to the shop. If the agent leaves the door open, wind from outside can knock over a vase inside, which the agent can prevent by closing the door after leaving the house. When using the stepwise inaction baseline (with rollouts, cf. Section 4.2), the agent gets penalized once when opening the door for knocking over the vase in the future, independent of whether it closes the door afterwards (and thus prevents the vase from breaking) or not. Hence, for this example, the offsetting behavior (closing the door) is desirable. The reasoning behind this example can be generalized to all cases where the offsetting behavior concerns states that are instrumental towards achieving the task (e.g. opening the door) and not a consequence of completing the task (e.g. the vase not being broken).
A Crucial Need for a New Baseline. The recently proposed baselines either remove offsetting incentives altogether or allow for both undesirable and desirable offsetting to occur, which are both unsatisfactory solutions. Krakovna et al. (2020a) proposed resolving this issue by allowing all offsetting (e.g. by using the inaction baseline) and rewarding all states where the task is completed in the specified reward function. However, we attribute three important downsides to this approach. First, states that occur after task completion can still have negative side effects. If the reward associated with these states is high enough to prevent offsetting, it might also be high enough to encourage the agent to pursue these states and ignore their negative side effects. Second, not all tasks have a distinct goal state that indicates the completion of a task, but rather accumulate task-related rewards at various time steps during an episode. Third, this approach creates a new incentive for the agent to prevent shutdown, as it continues to get rewards after the task is completed (Hadfield-Menell et al. 2017).

We conclude that offsetting is still an unsolved problem, highlighting the need for a new baseline that prevents undesirable offsetting behavior but allows for desirable offsetting.

3.3 Environment Dynamics and Inaction Incentives

In dynamic environments that are highly sensitive to the agent's actions, the agent will be susceptible to inaction incentives. Either the agent does not act at all (for all but small magnitudes of λ) or it will be insufficiently regularized and possibly cause undesired side effects (for small λ).

Sensitivity to Typical Actions. Many real-world environments exhibit chaotic behavior, in which the state of the environment is highly sensitive to small perturbations. In such environments, the environment state where the agent has performed an action will be fundamentally different from the environment state for the inaction baseline (Armstrong and Levinstein 2017). Furthermore, for the stepwise inaction baseline, the same argument holds for the non-action compared to the planned action of the agent. Hence, when using these baselines for IR, all actions of the agent will be strongly regularized, creating the inaction incentive. When λ is lowered to allow the agent to take actions, the agent can cause negative side effects when the IR cannot differentiate between negative side effects and chaotic changes in the environment. Here, it is useful to distinguish between typical and atypical actions. We say (informally) that an action is typical if it is commonly used for solving a wide variety of tasks (e.g. moving). When the environment is highly sensitive to typical actions, IRs with the current baselines will prevent the agent from engaging in normal operations. However, it is not always a problem if the environment is highly sensitive to atypical actions of the agent (e.g. discharging onboard weaponry), as preventing atypical actions interferes less with the normal operation of the agent.
Capability of the Agent. The inaction incentive will become more apparent for agents that are highly capable of predicting the detailed consequences of their actions, for example by using a powerful physics engine. As the ability to predict the consequences of an action is fundamental to minimizing side effects, limiting the prediction capabilities of an agent to prevent the inaction incentive is not desired. Rather, for agents that can very accurately predict the implications of their actions, it is necessary to have an accompanying intelligent impact regularizer.

State Features. Armstrong and Levinstein (2017) point out that for IR one should not represent states with overly fine-grained features, as presenting an agent with too much information exposes it to basing decisions on irrelevancies. For example, it would be counterproductive for an agent attempting to forecast demand in an online sales setting to model each potential customer separately, when broader aggregates would suffice. However, there remain two issues with this approach to mitigating the inaction incentive. First, the intrinsic dynamics of the environment remain unchanged, so it is still highly sensitive to small perturbations, the results of which can be visible in the coarser features (e.g. the specific weather conditions). Second, for advanced AI systems, it might be beneficial to change their feature representation to become more capable of predicting the consequences of their actions. In this case, one would have no control over the granularity of the features.

Deviation Measures. At the core of the inaction problem is that some negative side effects are worse than others. Usually, it does not matter if the agent changes the weather conditions by moving around; however, it would matter if the agent causes a serious negative side effect, for example a hurricane. While both outcomes can be a result of complex and chaotic dynamics of the environment, we care less about the former and more about the latter. Differentiating between negative, neutral and positive side effects is a task of the deviation measure used in the IR, which is discussed in the next section.

4 Choosing a Deviation Measure

A baseline defines a "safe" counterfactual to the agent's actions. The deviation measure determines how much a deviation from this baseline by the agent should be penalized or rewarded. Currently, the main approaches to a deviation measure are the relative reachability (RR) measure (Krakovna et al. 2019), the attainable utility (AU) measure (Turner, Hadfield-Menell, and Tadepalli 2020) and the future task (FT) reward (Krakovna et al. 2020a). The practical implementations of AU and FT use reachability tasks, which can be considered a sub-sampling of RR. In this section, we argue that the current deviation measures should be augmented with a notion of the value of the impact to avoid unsatisfactory performance of the agent, and that new rollout policies should be designed that allow for a proper incorporation of delayed effects into the deviation measure.

4.1 Which Side Effects are Negative?

The goal of IRs is to approximate R_env for all states in a tractable manner. They do this by penalizing impact on the environment, built upon the assumption that the environment is already optimized for human preferences (Shah et al. 2019). The IR aims to penalize impact proportionally to the magnitude of this impact, which corresponds to the magnitude of the side effect (Krakovna et al. 2019; Turner, Hadfield-Menell, and Tadepalli 2020). However, not all impact is negative; it can also be neutral or even positive. R_env does not only consider the magnitude of the impact on the environment, but also the degree to which this impact is negative, neutral or positive. Neglecting the associated value of impact can lead to suboptimal agent behavior, as highlighted in the example below.

Example: The Chemical Production Plant. Consider an AI system controlling a plant producing a chemical product for which various unknown reactions exist, each producing a different combination of waste products. The task of the AI system is to optimize the production rate of the plant, i.e. it gets a reward proportional to the production rate. To minimize the impact of the plant on the environment, the reward function of the agent is augmented with an impact regularizer, which penalizes the mass of waste products released into the environment, compared to an inaction baseline (where the plant is not operational). Some waste products are harmless (e.g. O2), whereas others can be toxic. When the deviation measure of the impact regularizer does not differentiate between negative, neutral or positive impact, the AI system is incentivized to use a reaction mechanism that maximizes production while minimizing waste. However, this reaction might output mostly toxic waste products, whereas another reaction outputs only harmless waste products and hence has no negative side effects. Tuning the regularizer magnitude λ does not provide a satisfactory solution in this case, as either the plant is not operational (for high λ), or the plant is at risk of releasing toxic waste products into the environment.
Positive Side Effects. The distinction between positive, neutral and negative impact is not only needed to allow for satisfactory performance of the agent in many environments, it is also desirable for encouraging unanticipated positive side effects. Expanding upon the example in Section 4.1: if the agent discovered a way to costlessly sequester carbon dioxide alongside its other tasks, it should do so, whilst an IR would encourage the agent not to interfere. While very positive unexpected outcomes might be unlikely, this possibility should not be neglected in the analysis of impact regularizers.

Value Differences. To distinguish between positive, neutral and negative side effects, we need an approximation of R_env that goes beyond measuring impact as a sole source of information. The value difference framework (Turner, Hadfield-Menell, and Tadepalli 2020) allows for differentiating between positive and negative impact by defining the deviation measure as a sum of differences in value between a baseline and the agent's state-action pair for various value functions. Hence, it is possible to reflect how much the designer values different kinds of side effects in these value functions. However, the challenge remains to design value functions that approximate R_env to a sufficient degree on the complete state space, which is again prone to reward misspecification. So although the value difference framework allows for specifying values for side effects, how to specify this notion of value is still an open problem.
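To make the point of this subsection concrete, the following sketch uses purely hypothetical waste products and harm weights (none of these numbers are from the paper): a deviation measure that weights each waste product by estimated harm, rather than by released mass alone, no longer treats the harmless and the toxic reaction as equivalent.

```python
# Hypothetical waste streams: kg released relative to the non-operational baseline.
released = {"O2": 120.0, "CO2": 40.0, "dioxin": 0.3}

# A mass-only deviation treats every kilogram of waste the same.
mass_only_penalty = sum(released.values())

# A value-aware deviation weights each product by an (assumed) harm estimate.
harm_weight = {"O2": 0.0, "CO2": 0.1, "dioxin": 1000.0}
value_aware_penalty = sum(harm_weight[k] * mass for k, mass in released.items())

# mass_only_penalty ranks a mostly harmless reaction and a toxic one by tonnage alone;
# value_aware_penalty makes the toxic reaction far more costly, as argued above.
```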
4.2 Rollout Policies

Often, the actions of an agent cause delayed effects, i.e. effects that are not visible immediately after taking the action. The stepwise inaction baseline (Turner, Hadfield-Menell, and Tadepalli 2020) ignores all actions that took place before t − 1; hence, to correctly penalize delayed effects, the deviation measure needs to incorporate future effects. This can be done by collecting rollouts of future trajectories using a simulator or model of the environment. These rollouts depend on which rollout policy is followed by the agent in the simulation. For the baseline states, the inaction policy is the logical choice. For the rollout of the future effects of the agent's action, it is less clear which rollout policy should be used. Turner, Hadfield-Menell, and Tadepalli (2020) use the inaction policy in this case. Hence, this IR considers a rollout where the agent takes its current action, after which it cannot take any further actions. This approach has significant downsides, because the IR does not allow the agent to plan a series of actions when determining the impact penalty (e.g. the agent can take an action to jump, but cannot plan for its landing accordingly in the rollout). Therefore, we argue that future work should develop rollout policies different from the inaction policy, such as the current policy of the agent.
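A sketch of how delayed effects could enter the stepwise penalty via model rollouts, under the assumption of a deterministic model and a fixed horizon (neither is prescribed by the paper); passing the agent's own policy as `followup_policy` corresponds to the alternative rollout policy argued for above, while an inaction policy reproduces the choice of Turner, Hadfield-Menell, and Tadepalli (2020).

```python
def rollout(model, policy, s, horizon):
    """Simulate `horizon` steps of `policy` from state s using an environment model."""
    for _ in range(horizon):
        s = model.step(s, policy(s))
    return s

def stepwise_penalty_with_rollouts(model, deviation, s_prev, action,
                                   noop_policy, followup_policy, horizon):
    """Compare the futures branching from s_{t-1}: no-op branch vs. acting branch."""
    # Baseline branch: take a no-op at time t, then keep following the no-op policy.
    s_baseline = rollout(model, noop_policy,
                         model.step(s_prev, noop_policy(s_prev)), horizon)
    # Agent branch: take the proposed action, then follow `followup_policy`.
    # Turner, Hadfield-Menell, and Tadepalli (2020) use the inaction policy here;
    # passing the agent's own policy is the alternative suggested in the text above.
    s_agent = rollout(model, followup_policy, model.step(s_prev, action), horizon)
    return deviation(s_agent, s_baseline)
```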
When a identifies weaknesses and corrects them, such that the cri- causal graph of the environment-agent-interaction is avail- terion being optimized becomes increasingly true to the de- able, the states that are a consequence of the task can be signer’s intent. For example, an AI with the goal to trade obtained from the graph as the causal children nodes of the financial assets may be run against historical data (“back- goal state. Hence, a baseline that allows for desired offset- tested”) in order to understand how it might have reacted ting behavior but prevents undesired offsetting behavior pre- in the past, and presented with deliberately extreme inputs vents the agent from interfering with the children nodes of (“stress-tested”) in order to understand likely behavior in the goal states, while allowing for offsetting on other states. “out of sample” situations. To design a reward function and General Tasks. Not all tasks have a distinct goal state a regularizer, it is crucial for the designer to be able to un- which indicates the completion of a task, but accumulate in- derstand how the system would react in novel situations and stead task-related rewards at various time steps during an how to fix it in case it exhibits undesired behavior. Further episode. Extending this argument to general tasks remains research aiming to increase the designer’s ability to under- an open issue, for which causal influence diagrams (Everitt stand how a system will react, will substantially help the de- et al. 2019) can provide a mathematical framework. signer to communicate their intent more effectively. Recent work in this direction concerning interpretability (Gilpin 6.2 Probabilities Instead of Counterfactuals as et al. 2018), verification (e.g. Huang et al. 2017) of machine Baseline learning models is particularly promising. Armstrong and Levinstein (2017) made the interesting argu- Actively Learning from Humans. Considering the prob- ment that probabilities are better suited than counterfactuals lem from the perspective of the AI system, the goal is to for measuring the impact of actions. Current implementa- improve its ability to understand the designer’s intent, espe- tions of IRs use a counterfactual as baseline (e.g. the inaction cially in novel, unanticipated, scenarios. Instead of the de- baseline or stepwise inaction baseline). Because this base- signer telling the system their intent, this problem can be line is one specific trajectory, it will differ considerably from addressed by the system asking the designer about their in- the actual trajectory of the agent in environments that ex- tent. To decide what to ask the designer, the system may be hibit chaotic dynamics. However, chaotic environments will able to determine which states it is highly uncertain about, also be highly sensitive to perturbations that do not orig- even if it is not able to accurately ascribe values to some inate from the agent’s actions. One possible way forward of them. Recent work shows that such an approach can be towards a more robust measure of the agent’s impact on the effectively used to learn from the human about a task at environment is hence to compare probabilities that marginal- hand (Christiano et al. 2017), but it may also be used to ize over all external perturbations instead of comparing spe- learn something about the constraints of the environment cific trajectories. 
6 Ways Forward

In this section, we put forward promising future research directions to overcome the challenges discussed in the previous sections.

6.1 A Causal Framing of Offsetting

In Section 3.2, we highlighted that some offsetting behavior is desired and some undesired. To design an IR that allows for desired offsetting but prevents undesired offsetting, one first needs a mechanism that can predict and differentiate between these two types of offsetting. Undesired offsetting concerns the environment states that are a consequence of the task. The difficulty lies in determining which states are a causal consequence of the task being completed and differentiating them from states that could have occurred regardless of the task.

Goal-based Tasks. When the task consists of reaching a certain goal state, the consequences of performing a task can be formalized in a causal framework (Pearl 2009). When a causal graph of the environment-agent interaction is available, the states that are a consequence of the task can be obtained from the graph as the causal children of the goal state. Hence, a baseline that allows for desired offsetting behavior but prevents undesired offsetting behavior prevents the agent from interfering with the children of the goal states, while allowing for offsetting on other states.

General Tasks. Not all tasks have a distinct goal state which indicates the completion of a task; some instead accumulate task-related rewards at various time steps during an episode. Extending this argument to general tasks remains an open issue, for which causal influence diagrams (Everitt et al. 2019) can provide a mathematical framework.
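If a causal graph of the agent-environment interaction were available as an adjacency structure, the states whose offsetting should be blocked could be read off as the descendants of the goal node. A minimal sketch for the goal-based case; the graph and node names are illustrative, not from the paper.

```python
def descendants(graph, node):
    """Return all causal descendants of `node` in a {node: [children]} adjacency dict."""
    seen, stack = set(), list(graph.get(node, []))
    while stack:
        child = stack.pop()
        if child not in seen:
            seen.add(child)
            stack.extend(graph.get(child, []))
    return seen

# Toy graph for the conveyor-belt example: taking the vase off the belt (the goal)
# causally leads to the vase staying intact; opening a door is merely instrumental.
causal_graph = {
    "vase_taken_off_belt": ["vase_intact"],
    "door_opened": [],
}
protected = descendants(causal_graph, "vase_taken_off_belt")  # {'vase_intact'}
# Offsetting that touches `protected` states would be penalized, while offsetting
# on other states (e.g. closing the door again) would remain allowed.
```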
6.2 Probabilities Instead of Counterfactuals as Baseline

Armstrong and Levinstein (2017) made the interesting argument that probabilities are better suited than counterfactuals for measuring the impact of actions. Current implementations of IRs use a counterfactual as baseline (e.g. the inaction baseline or the stepwise inaction baseline). Because this baseline is one specific trajectory, it will differ considerably from the actual trajectory of the agent in environments that exhibit chaotic dynamics. However, chaotic environments will also be highly sensitive to perturbations that do not originate from the agent's actions. One possible way forward towards a more robust measure of the agent's impact on the environment is hence to compare probabilities that marginalize over all external perturbations instead of comparing specific trajectories. Define p(s_t | A) as the probability of reaching state s_t given the trajectory of actions A the agent took, and p(s_t | B) as the probability of s_t given the actions prescribed by the baseline. All influences of perturbations that did not arise from the agent are marginalized out in these probabilities. Hence, a divergence measure between these two probabilities can give a more robust measure of the potential impact of the agent, without being susceptible to unnecessary inaction incentives. To the best of our knowledge, this idea has not yet been implemented as a concrete IR method and would hence be a promising direction for future research.
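One way this idea could be instantiated (a sketch under the assumption of a stochastic simulator that samples the external perturbations internally and returns a hashable, e.g. discretized, state) is to estimate both state distributions by Monte Carlo and compare them with a divergence such as total variation.

```python
from collections import Counter

def state_distribution(simulate, actions, n_samples):
    """Estimate p(s_t | actions); `simulate` draws the external perturbations itself."""
    counts = Counter(simulate(actions) for _ in range(n_samples))
    return {s: c / n_samples for s, c in counts.items()}

def total_variation(p, q):
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(s, 0.0) - q.get(s, 0.0)) for s in support)

def marginalized_impact(simulate, agent_actions, baseline_actions, n_samples=1000):
    """Divergence between p(s_t | A) and p(s_t | B) with external noise marginalized out."""
    p_a = state_distribution(simulate, agent_actions, n_samples)
    p_b = state_distribution(simulate, baseline_actions, n_samples)
    return total_variation(p_a, p_b)
```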
6.3 Improved Human-Computer Interaction

Side effects occur if there is a difference between the outcome an AI system achieves and the intent of its (human) designer. Thus, improving how well the designer can communicate their intent to the AI system is an important aspect of eliminating side effects (Leike et al. 2018). This emphasis on the human component of learning to avoid negative side effects connects it closely to the problem of scalable oversight proposed by Amodei et al. (2016).

Improved Tools for Reward Designers. Commonly, a designer will aim to iteratively improve the AI system and its reward function. Similarly, when choosing an impact regularizer, a designer will iterate on the choice of baseline, deviation measure, and regularization strength and test them in a sequence of environments that increasingly resemble the production environment. At each iteration, the designer identifies weaknesses and corrects them, such that the criterion being optimized becomes increasingly true to the designer's intent. For example, an AI with the goal to trade financial assets may be run against historical data ("backtested") in order to understand how it might have reacted in the past, and presented with deliberately extreme inputs ("stress-tested") in order to understand likely behavior in "out of sample" situations. To design a reward function and a regularizer, it is crucial for the designer to be able to understand how the system would react in novel situations and how to fix it in case it exhibits undesired behavior. Further research aiming to increase the designer's ability to understand how a system will react will substantially help the designer to communicate their intent more effectively. Recent work on interpretability (Gilpin et al. 2018) and verification (e.g. Huang et al. 2017) of machine learning models is particularly promising in this direction.

Actively Learning from Humans. Considering the problem from the perspective of the AI system, the goal is to improve its ability to understand the designer's intent, especially in novel, unanticipated scenarios. Instead of the designer telling the system their intent, this problem can be addressed by the system asking the designer about their intent. To decide what to ask the designer, the system may be able to determine which states it is highly uncertain about, even if it is not able to accurately ascribe values to some of them. Recent work shows that such an approach can be effectively used to learn from the human about the task at hand (Christiano et al. 2017), but it may also be used to learn something about the constraints of the environment and which side effects are desired or undesired (Zhang, Durfee, and Singh 2018). Active learning could also provide a different perspective on impact regularizers: instead of directly penalizing impact on the environment, a high value of the regularization term could be understood as indicating that the designer should give feedback. In particular, this approach could help to resolve situations in which a positive task reward conflicts with the regularization term.

7 Conclusion

Avoiding negative side effects in systems that have the capacity to cause harm is necessary to fully realize the promise of artificial intelligence. In this paper, we discussed a popular approach to reduce negative side effects in RL: impact regularization (IR). We discussed the practical difficulty of choosing each of its three components: a baseline, a deviation measure and a regularization strength. Furthermore, we pointed to fundamental problems that are currently not addressed by state-of-the-art methods, and presented several new future research directions to address these. While our discussion showed that current approaches still leave significant opportunities for future work, IRs are a promising idea for building the next generation of safe AI systems, and we hope that our discussion is valuable for researchers trying to build new IRs.

Acknowledgments

We thank Andreas Krause, François Fleuret and Benjamin Grewe for their valuable comments and suggestions. Kyle Matoba was supported by the Swiss National Science Foundation under grant number FNS-188758 "CORTI".

References

Amodei, D.; Olah, C.; Steinhardt, J.; Christiano, P.; Schulman, J.; and Mané, D. 2016. Concrete problems in AI safety. arXiv:1606.06565.

Armstrong, S.; and Levinstein, B. 2017. Low impact artificial intelligences. arXiv:1705.10720.

Christiano, P. F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; and Amodei, D. 2017. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems.

Everitt, T.; Ortega, P. A.; Barnes, E.; and Legg, S. 2019. Understanding Agent Incentives using Causal Influence Diagrams. Part I: Single Action Settings. arXiv:1902.09980.

Eysenbach, B.; Gu, S.; Ibarz, J.; and Levine, S. 2018. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations (ICLR).

García, J.; and Fernández, F. 2015. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research 16(1): 1437–1480.

Gilpin, L. H.; Bau, D.; Yuan, B. Z.; Bajwa, A.; Specter, M.; and Kagal, L. 2018. Explaining explanations: An overview of interpretability of machine learning. In IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 80–89.

Hadfield-Menell, D.; Dragan, A.; Abbeel, P.; and Russell, S. 2017. The off-switch game. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Huang, X.; Kwiatkowska, M.; Wang, S.; and Wu, M. 2017. Safety verification of deep neural networks. In International Conference on Computer Aided Verification, 3–29. Springer.

Krakovna, V.; Orseau, L.; Kumar, R.; Martic, M.; and Legg, S. 2019. Penalizing side effects using stepwise relative reachability. In Workshop on Artificial Intelligence Safety at IJCAI.

Krakovna, V.; Orseau, L.; Ngo, R.; Martic, M.; and Legg, S. 2020a. Avoiding Side Effects By Considering Future Tasks. In Advances in Neural Information Processing Systems.

Krakovna, V.; Uesato, J.; Mikulik, V.; Rahtz, M.; Everitt, T.; Kumar, R.; Kenton, Z.; Leike, J.; and Legg, S. 2020b. Specification gaming: the flip side of AI ingenuity. URL https://deepmind.com/blog/article/Specification-gaming-the-flip-side-of-AI-ingenuity.

Leike, J.; Krueger, D.; Everitt, T.; Martic, M.; Maini, V.; and Legg, S. 2018. Scalable agent alignment via reward modeling: a research direction. arXiv:1811.07871.

McCarthy, J.; and Hayes, P. 1969. Some philosophical problems from the standpoint of artificial intelligence. In Machine Intelligence (Meltzer, B.; and Michie, D., eds.), vol. 4.

Pearl, J. 2009. Causality. Cambridge University Press.

Rahaman, N.; Wolf, S.; Goyal, A.; Remme, R.; and Bengio, Y. 2019. Learning the Arrow of Time for Problems in Reinforcement Learning. In International Conference on Learning Representations (ICLR).

Ray, A.; Achiam, J.; and Amodei, D. 2019. Benchmarking safe exploration in deep reinforcement learning. arXiv:1910.01708.

Saisubramanian, S.; Kamar, E.; and Zilberstein, S. 2020. A Multi-Objective Approach to Mitigate Negative Side Effects. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Saisubramanian, S.; Zilberstein, S.; and Kamar, E. 2020. Avoiding negative side effects due to incomplete knowledge of AI systems. arXiv:2008.12146.

Saunders, W.; Sastry, G.; Stuhlmüller, A.; and Evans, O. 2018. Trial without Error: Towards Safe Reinforcement Learning via Human Intervention. In Proceedings of International Conference on Autonomous Agents and MultiAgent Systems.

Shah, R.; Krasheninnikov, D.; Alexander, J.; Abbeel, P.; and Dragan, A. 2019. Preferences Implicit in the State of the World. In International Conference on Learning Representations (ICLR).

Turner, A. M.; Hadfield-Menell, D.; and Tadepalli, P. 2020. Conservative agency via attainable utility preservation. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society.

Turner, A. M.; Ratzlaff, N.; and Tadepalli, P. 2020. Avoiding Side Effects in Complex Environments. In Advances in Neural Information Processing Systems.

Zhang, S.; Durfee, E. H.; and Singh, S. P. 2018. Minimax-Regret Querying on Side Effects for Safe Optimality in Factored Markov Decision Processes. In Proceedings of International Joint Conferences on Artificial Intelligence (IJCAI).

Zhang, S.; Durfee, E.; and Singh, S. 2020. Querying to Find a Safe Policy under Uncertain Safety Constraints in Markov Decision Processes. In Proceedings of the AAAI Conference on Artificial Intelligence.