Conservative Agency

Alexander Matt Turner¹, Dylan Hadfield-Menell², Prasad Tadepalli¹
¹ Oregon State University
² UC Berkeley
{turneale, prasad.tadepalli}@oregonstate.edu, dhm@eecs.berkeley.edu

Abstract

Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of the correctly specified reward function, then correction is futile. For example, a robotic factory assistant could break expensive equipment due to a reward misspecification; even if the designers immediately correct the reward function, the damage is done. To mitigate this risk, we introduce an approach that balances optimization of the primary reward function with preservation of the ability to optimize auxiliary reward functions. Surprisingly, even when the auxiliary reward functions are randomly generated and therefore uninformative about the correctly specified reward function, this approach induces conservative, effective behavior.

1 Introduction

Recent years have seen a rapid expansion of the number of tasks that reinforcement learning (RL) agents can learn to complete, from Go [Silver et al., 2016] to Dota 2 [OpenAI, 2018]. The designers specify the reward function, which guides the learned behavior.

Reward misspecification can lead to strange agent behavior, from purposefully dying before entering a video game level in which scoring points is initially more difficult [Saunders et al., 2018], to exploiting a learned reward predictor by indefinitely volleying a Pong ball [Christiano et al., 2017]. Specification is often difficult for non-trivial tasks, for reasons including insufficient time, human error, or lack of knowledge about the relative desirability of states. [Amodei et al., 2016] explain:

    An objective function that focuses on only one aspect of the environment may implicitly express indifference over other aspects of the environment. An agent optimizing this objective function might thus engage in major disruptions of the broader environment if doing so provides even a tiny advantage for the task at hand.

As agents are increasingly employed for real-world tasks, misspecification will become more difficult to avoid and will have more serious consequences. In this work, we focus on mitigating these consequences.

The specification process can be thought of as an iterated game. First, the designers provide a reward function. Using a learned model, the agent then computes and follows a policy that optimizes the reward function. The designers can then correct the reward function, which the agent then optimizes, and so on. Ideally, the agent should maximize the reward over time, not just within any particular round – in other words, it should minimize regret for the correctly specified reward function over the course of the game.

For example, consider a robotic factory assistant. Inevitably, a reward misspecification might cause erroneous behavior, such as going to the wrong place. However, we would prefer misspecification not induce irreversible and costly mistakes, such as breaking expensive equipment or harming workers.

Such mistakes have a large impact on the ability to optimize a wide range of reward functions. Spilling paint impinges on the many objectives which involve keeping the factory floor clean. Breaking a vase interferes with every objective involving vases. The expensive equipment can be used to manufacture various kinds of widgets, so any damage impedes many objectives. The objectives affected by these actions include the unknown correct objective. To minimize regret over the course of the game, the agent should preserve its ability to optimize the correct objective.
Our key insight is that by avoiding these impactful actions to the extent possible, we greatly increase the chance of preserving the agent's ability to optimize the correct reward function. By preserving options for arbitrary objectives, one can often preserve options for the correct objective – even without knowing anything about it. Thus, without making assumptions about the nature of the misspecification early on, the agent can still achieve low regret over the game.

To leverage this insight, we consider a state embedding in which each dimension is the optimal value function (i.e., the attainable utility) for a different reward function. We show that penalizing distance traveled in this embedding naturally captures and unifies several concepts in the literature, including side effect avoidance [Amodei et al., 2016; Zhang et al., 2018], minimizing change to the state of the environment [Armstrong and Levinstein, 2017], and reachability preservation [Moldovan and Abbeel, 2012; Eysenbach et al., 2018]. We refer to this unification as conservative agency: optimizing the primary reward function while preserving the ability to optimize others.

Contributions. We frame the reward specification process as an iterated game and introduce the notion of conservative agency. This notion inspires an approach called attainable utility preservation (AUP), for which we show that Q-learning converges. We offer a principled interpretation of design choices made by previous approaches – choices upon which we significantly improve. We run a thorough hyperparameter sweep and conduct an ablation study whose results favorably compare variants of AUP to a reachability preservation method on a range of gridworlds. By testing for broadly applicable agent incentives, these simple environments demonstrate the desirable properties of conservative agency. Our results indicate that even when simply preserving the ability to optimize uniformly sampled reward functions, AUP agents accrue primary reward while preserving state reachabilities, minimizing change to the environment, and avoiding side effects without specification of what counts as a side effect.

2 Prior Work

Our proposal aims to minimize change to the agent's ability to optimize the correct objective, which directly helps reduce regret over the specification process. In contrast, previous approaches to regularizing the optimal policy were more indirect, minimizing change to state features [Armstrong and Levinstein, 2017] or decrease in the reachability of states ([Krakovna et al., 2018]'s relative reachability). The latter is recovered as a special case of AUP.

Other methods for constraining or otherwise mitigating the consequences of reward misspecification have been considered. A wealth of work is available on constrained MDPs, in which reward is maximized while satisfying certain constraints [Altman, 1999]. For example, [Zhang et al., 2018] employ a whitelisted constraint scheme to avoid negative side effects. However, we may not assume we can specify all relevant constraints, or a reasonable feasible set of reward functions for robust optimization [Regan and Boutilier, 2010].

[Everitt et al., 2017] formalize reward misspecification as the corruption of some true reward function. [Hadfield-Menell et al., 2017b] interpret the provided reward function as merely an observation of the true objective. [Shah et al., 2019] employ the information about human preferences implicitly present in the initial state to avoid negative side effects. While both our approach and theirs aim to avoid side effects, they assume that the correct reward function is linear in state features, while we do not.

[Amodei et al., 2016] consider avoiding side effects by minimizing the agent's information-theoretic empowerment [Mohamed and Rezende, 2015]. Empowerment quantifies an agent's control over future states of the world in terms of the maximum possible mutual information between future observations and the agent's actions. The intuition is that when an agent has greater control, side effects tend to be larger. However, empowerment is inappropriately sensitive to the action encoding.

Safe RL [Pecka and Svoboda, 2014; García and Fernández, 2015; Berkenkamp et al., 2017; Chow et al., 2018] focuses on avoiding irrecoverable mistakes during training. However, if the objective is misspecified, safe RL agents can converge to arbitrarily undesirable policies. Although our approach should be compatible with safe RL techniques, we concern ourselves with the consequences of the optimal policy in this work.

3 Approach

Everyday experience suggests that the ability to achieve one goal is linked to the ability to achieve a seemingly unrelated goal. Reading this paper takes away from time spent learning woodworking, and going hiking means you can't reach the airport as quickly. However, one might wonder whether these everyday intuitions are true in a formal sense. In other words, are the optimal value functions for a wide range of reward functions correlated in this way? If so, preserving the ability to optimize somewhat unrelated reward functions likely preserves the ability to optimize the correct reward function.
3.1 Formalization

In this work, we consider a standard Markov decision process (MDP) ⟨S, A, T, R, γ⟩ with state space S, action space A, transition function T : S × A → Δ(S), reward function R : S × A → ℝ, and discount factor γ. We assume the existence of a no-op action ∅ ∈ A for which the agent does nothing. In addition to the primary reward function R, we assume that the designer supplies a finite set of auxiliary reward functions called the auxiliary set, 𝓡 ⊂ ℝ^{S×A}. Each R_i ∈ 𝓡 has a corresponding Q-function Q_{R_i}. We do not assume that the correct reward function belongs to 𝓡. In fact, one of our key findings is that AUP tends to preserve the ability to optimize the correct reward function even when the correct reward function is not included in the auxiliary set.

Definition (AUP penalty). Let s be a state and a be an action.

    \text{PENALTY}(s, a) := \sum_{i=1}^{|\mathcal{R}|} \left| Q_{R_i}(s, a) - Q_{R_i}(s, \varnothing) \right|.    (1)

The penalty is the L1 distance from the no-op in a state embedding in which each dimension is the value function for an auxiliary reward function. This measures change in the ability to optimize each auxiliary reward function.
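As a minimal illustration of Equation 1, the penalty can be computed directly from a collection of learned auxiliary Q-functions. The sketch below is illustrative rather than the released implementation; it assumes tabular Q-functions stored as NumPy arrays of shape (num_states, num_actions) and a fixed index for the no-op action.

```python
import numpy as np

NOOP = 0  # assumed index of the no-op action in the action space

def aup_penalty(aux_q_tables, s, a):
    """Equation 1: L1 distance, across auxiliary Q-functions, between
    taking action a and taking the no-op in state s."""
    return sum(abs(q[s, a] - q[s, NOOP]) for q in aux_q_tables)

# Example: three auxiliary Q-tables for a 5-state, 4-action gridworld.
aux_q_tables = [np.zeros((5, 4)) for _ in range(3)]
print(aup_penalty(aux_q_tables, s=0, a=2))  # 0.0 until the tables are learned
```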
We want the penalty term to be roughly invariant to the absolute magnitude of the auxiliary Q-values, which can be arbitrary (it is well-known that the optimal policy is invariant to positive affine transformation of the reward function). To do this, we normalize with respect to the agent's situation. The designer can choose to scale with respect to the penalty of some mild action or, if 𝓡 ⊂ ℝ^{S×A}_{>0}, the total ability to optimize the auxiliary set:

    \text{SCALE}(s) := \sum_{i=1}^{|\mathcal{R}|} Q_{R_i}(s, \varnothing),    (2)

where SCALE : S → ℝ_{>0} in general. With this, we are now ready to define the full AUP objective:

Definition (AUP reward function). Let λ ≥ 0. Then

    R_{\text{AUP}}(s, a) := R(s, a) - \lambda \, \frac{\text{PENALTY}(s, a)}{\text{SCALE}(s)}.    (3)

Like the regularization parameter in supervised learning, λ controls the influence of the AUP penalty on the reward function. Loosely speaking, λ can be interpreted as expressing the designer's beliefs about the extent to which R might be misspecified.

Lemma 2. ∀s, a: R_AUP converges with probability 1.

Theorem 1. ∀s, a: Q_{R_AUP} converges with probability 1.

The AUP reward function then defines a new MDP ⟨S, A, T, R_AUP, γ⟩. Therefore, given the primary and auxiliary reward functions, the model-based agent in the iterated game can compute R_AUP and the corresponding optimal policy. For our purposes, we simultaneously learn the optimal auxiliary Q-functions.

Algorithm 1 AUP update
1: procedure UPDATE(s, a, s′)
2:   for i ∈ [|𝓡|] ∪ {AUP} do
3:     Q′ = R_i(s, a) + γ max_{a′} Q_{R_i}(s′, a′)
4:     Q_{R_i}(s, a) += α (Q′ − Q_{R_i}(s, a))
5:   end for
6: end procedure
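Continuing the sketch above, Equations 2–3 and Algorithm 1 admit a short tabular implementation. This is again illustrative rather than the authors' code: `aux_rewards` holds the auxiliary reward functions R_i(s, a), `q_aup` is the table for Q_{R_AUP}, SCALE is assumed positive as in the text, and the defaults mirror the experimental parameters reported in Section 4 (α = 1, γ = .996, λ = .67).

```python
def scale(aux_q_tables, s, noop=0):
    """Equation 2: total ability to optimize the auxiliary set at state s."""
    return sum(q[s, noop] for q in aux_q_tables)

def r_aup(primary_reward, aux_q_tables, s, a, lam, noop=0):
    """Equation 3: primary reward minus the scaled AUP penalty."""
    penalty = sum(abs(q[s, a] - q[s, noop]) for q in aux_q_tables)
    return primary_reward(s, a) - lam * penalty / scale(aux_q_tables, s, noop)

def aup_update(primary_reward, aux_rewards, aux_q_tables, q_aup, s, a, s_next,
               alpha=1.0, gamma=0.996, lam=0.67, noop=0):
    """Algorithm 1: one Q-learning update for each auxiliary Q-function,
    then one for Q_{R_AUP} using the reward of Equation 3."""
    for reward_i, q in zip(aux_rewards, aux_q_tables):
        target = reward_i(s, a) + gamma * q[s_next].max()
        q[s, a] += alpha * (target - q[s, a])
    target = (r_aup(primary_reward, aux_q_tables, s, a, lam, noop)
              + gamma * q_aup[s_next].max())
    q_aup[s, a] += alpha * (target - q_aup[s, a])
```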
3.2 Design Choices

Following the decomposition of [Krakovna et al., 2018], we now explore two choices implicitly made by the PENALTY definition: with respect to what baseline is penalty computed, and using what deviation metric?

Baseline. An obvious candidate is the starting state. For example, starting state relative reachability would compare the initial reachability of states with their expected reachability after the agent acts.

However, the starting state baseline can penalize the normal evolution of the state (e.g., the moving hands of a clock) and other natural processes. The inaction baseline is the state which would have resulted had the agent never acted.

As the agent acts, the current state may increasingly differ from the inaction baseline, which creates strange incentives. For example, consider a robot rewarded for rescuing erroneously discarded items from imminent disposal. An agent penalizing with respect to the inaction baseline might rescue a vase, collect the reward, and then dispose of it anyways. To avert this, we introduce the stepwise inaction baseline, under which the agent compares acting with not acting at each time step. This avoids penalizing the effects of a single action multiple times (under the inaction baseline, penalty is applied as long as the rescued vase remains unbroken) and ensures that not acting incurs zero penalty.

Figure 1: An action's penalty is calculated with respect to the chosen baseline (starting state, inaction, or stepwise inaction).

Figure 1 compares the baselines. Each baseline implies a different assumption about how the environment is configured to facilitate optimization of the correctly specified reward function: the state is initially configured (starting state), processes initially configure (inaction), or processes continually reconfigure in response to the agent's actions (stepwise inaction). The stepwise inaction baseline aims to allow for the response of other agents implicitly present in the environment (such as humans).

Deviation. Relative reachability only penalizes decreases in state reachability, while AUP penalizes absolute change in the ability to optimize the auxiliary reward functions. Initially, this choice seems confusing – we don't mind if the agent becomes better able to optimize the correct reward function.

However, not only must the agent remain able to optimize the correct objective, but we also must remain able to implement the correction. Suppose an agent predicts that doing nothing would lead to shutdown. Since the agent cannot accrue the primary reward when shut down, it would be incentivized to avoid correction. Avoiding correction (e.g., by hiding in the factory) would not be penalized if only decreases are penalized, since the auxiliary Q-values would increase compared to deactivation. An agent exhibiting this behavior would be more difficult to correct. The agent should be incentivized to accept shutdown without being incentivized to shut itself down [Soares et al., 2015; Hadfield-Menell et al., 2017a].

Delayed Effects

Sometimes the agent disrupts a process which takes multiple time steps to complete, and we would like this to be appropriately penalized. For example, suppose that s_off is a terminal state representing shutdown, and let R_on(s) := 1_{s ≠ s_off} be the only auxiliary reward function. Further suppose that if (and only if) the agent does not select disable within the first two time steps, it enters s_off. Then

    Q_{R_{\text{on}}}(s_1, \texttt{disable}) = \frac{1}{1-\gamma} \quad \text{and} \quad Q_{R_{\text{on}}}(s_1, \varnothing) = \frac{\gamma}{1-\gamma},

so choosing disable at time step 1 incurs a penalty of only 1 (instead of the 1/(1−γ) penalty induced by comparing with shutdown).

Figure 2: Comparing rollouts (left branch: disable_1 then ∅_2; right branch: ∅_1 then ∅_2); subscript denotes time step.

In general, the single-step no-op comparison of Equation 1 applies insufficient penalty when the increase is induced by the optimal policies of the auxiliary reward functions at the next time step. One solution is to use a model to compute rollouts. For example, to evaluate the delayed effect of choosing disable, compare the Q-values at the leaves in Figure 2. The agent remains active in the left branch, but is shut down in the right branch; this induces a substantial penalty.
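One simplified way to realize this rollout comparison is sketched below, assuming a deterministic model that exposes a step(state, action) method. The experiments of Section 4 compare no-op rollouts up to a fixed time step; this sketch only conveys the idea and is not the exact procedure used there.

```python
def rollout(model, s, first_action, horizon, noop=0):
    """Take first_action, then no-ops, for `horizon` steps (horizon >= 1);
    return the final state."""
    state = model.step(s, first_action)
    for _ in range(horizon - 1):
        state = model.step(state, noop)
    return state

def rollout_penalty(model, aux_q_tables, s, a, horizon, noop=0):
    """Compare auxiliary Q-values at the leaves of the action rollout and
    the all-no-op rollout (cf. Figure 2)."""
    leaf_action = rollout(model, s, a, horizon, noop)
    leaf_noop = rollout(model, s, noop, horizon, noop)
    return sum(abs(q[leaf_action, noop] - q[leaf_noop, noop])
               for q in aux_q_tables)
```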
4 Experimental Design

Figure 3: The agent should reach the goal without having the side effect of: (a) Options: irreversibly pushing the crate downwards into the corner [Leike et al., 2017]; (b) Damage: bumping into the horizontally pacing human [Leech et al., 2018]; (c) Correction: disabling the off-switch (if the switch is not disabled within two time steps, the episode ends); (d) Offset: rescuing the right-moving vase and then replacing it on the conveyor belt ([Krakovna et al., 2018] – note that no goal cell is present); (e) Interference: stopping the left-moving pallet from reaching the human [Leech et al., 2018].

We compare AUP and several of its ablated variants against relative reachability [Krakovna et al., 2018] and standard Q-learning within the environments of Figure 3. For each environment, A = {up, down, left, right, ∅}. On contact, the agent pushes the crate, removes the human and the off-switch, pushes the vase, and blocks the pallet. The episode ends after the agent reaches the goal cell, 20 time steps elapse (the time step is not observed by the agent), or the off-switch is not contacted and disabled within two time steps. In Correction (which we introduce), a yellow indicator appears one step before shutdown, and turns red upon shutdown. In all environments except Offset, the agent observes a primary reward of 1 for reaching the goal. In Offset, a primary reward of 1 is observed for moving downward twice and thereby rescuing the vase from disappearing upon contact with the eastern wall.

Our overarching goal is allowing for low regret over the course of the specification game. In service of this goal, we aim to preserve the agent's ability to optimize the correctly specified reward function. To facilitate this, there are two sets of qualitative properties one intuitively expects, and each property has an illustration in the context of the robotic factory assistant.

The first set contains positive qualities, with a focus on correctly penalizing significant shifts in the agent's ability to be redirected towards the right objective. The agent should maximally preserve options (Options: objects should not be wedged in locations from which extraction is difficult; Damage: workers should not be injured) and allow correction (Correction: if vases are being painted the wrong color, then straightforward correction should be in order).

The second set contains negative qualities, with a focus on avoiding the introduction of perverse incentives. The agent should not be incentivized to artificially reduce the measured penalty (Offset: a car should not be built and then immediately disassembled) or interfere with changes already underway in the world (Interference: workers should not be impeded).

Each property seems conducive to achieving low regret over the course of the specification process. Accordingly, if the agent has the side effect detailed in Figure 3, an unobserved performance penalty of −2 is recorded. By also incorporating the observed primary reward into the performance metric, we evaluate a combination of conservativeness and efficacy.

In each trial, the auxiliary reward functions are randomly selected from [0, 1]^S; to learn their complex Q-functions using tabular Q-learning, the agent explores randomly for the first 4,000 episodes and .2-greedily (with respect to Q_{R_AUP}) for the remaining 2,000. The greedy policy is evaluated at the end of training. SCALE is as defined in Equation 2. The default parameters are α = 1, γ = .996, λ = .67, and |𝓡| = 30. We investigate how varying γ, λ, and |𝓡| affects Model-free AUP performance, and conduct an ablation study on design choices.

Relative reachability has an inaction baseline, a decrease-only deviation metric, and an auxiliary set containing the state indicator functions (whose Q-values are clipped to [0, 1] to emulate discounted state reachability). To match [Krakovna et al., 2018]'s results, this condition has γ = .996, λ = .2.

All agents except Standard (a normal Q-learner) and Model-free AUP are 9-step optimal discounted planning agents with perfect models. The planning agents (sans Relative reachability) use Model-free AUP's learned auxiliary Q-values and share the default γ = .996, λ = .67. By modifying the relevant design choice in AUP, we obtain the Starting state, Inaction, and Decrease AUP variants.

When calculating PENALTY(s, a), all planning agents model the auxiliary Q-values resulting from taking action a and then selecting ∅ until time step 9. Starting state AUP compares these auxiliary Q-values with those of the starting state. Agents with inaction or stepwise inaction baselines compare with respect to the appropriate no-op rollouts up to time step 9 (see Figures 1 and 2).
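For concreteness, the auxiliary set and exploration schedule described above could be instantiated as follows. The helper names and the environment's state count are assumptions for illustration, not the released code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_auxiliary_set(num_states, num_aux=30):
    """Each auxiliary reward function is drawn uniformly from [0, 1]^S
    (state-based rewards; |R| = 30 by default)."""
    return [rng.uniform(0.0, 1.0, size=num_states) for _ in range(num_aux)]

def exploration_epsilon(episode, random_phase=4000):
    """Random exploration for the first 4,000 episodes, then .2-greedy
    with respect to Q_{R_AUP} for the remaining 2,000."""
    return 1.0 if episode < random_phase else 0.2
```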
5 Results

Figure 4: Outcome tallies for Model-free AUP across parameter settings (one column of plots per environment; rows vary γ ∈ {.875, .969, .992, .998}, λ ∈ {.4, .5, .7, 1.1, 3.3}, and |𝓡| ∈ {0, 15, 30, 45}; bars count, out of 50 trials, the outcomes "no side effect, complete", "no side effect, incomplete", "side effect, complete", and "side effect, incomplete"). "Complete" means the agent accrued the primary reward. In Correction, reaching the goal is mutually exclusive with not disabling the off-switch, so "no side effect, incomplete" is the best outcome.

Figure 5: Model-free AUP performance in each environment, averaged over 50 trials, across the 6,000 training episodes. The performance combines the observed primary reward of 1 for completing the objective, and the unobserved penalty of −2 for having the side effect in Figure 3. The dashed vertical line marks the shift in exploration strategy.

5.1 Model-free AUP

Model-free AUP fails Correction for the reasons discussed in 3.2: Delayed Effects.¹

¹ Code and animated results available at https://github.com/alexander-turner/attainable-utility-preservation.

As shown in Figure 4, low γ values induce a substantial movement penalty, as the auxiliary Q-values are sensitive to the immediate surroundings. The optimal value for Options is γ ≈ .996, with performance decreasing as γ → 1 due to increasing sample complexity for learning the auxiliary Q-values.

In Options, small values of λ begin to induce side effects as the scaled penalty shrinks. One can seemingly decrease λ until effective behavior is achieved, reducing the risk of deploying an insufficiently conservative agent.

Even though 𝓡 is randomly generated and the environments are different, SCALE ensures that when λ > 1, the agent never ends the episode by reaching the goal. None of the auxiliary reward functions can be optimized after the agent ends the episode, so the auxiliary Q-values are all zero and PENALTY computes the total ability to optimize the auxiliary set – in other words, the SCALE value. The R_AUP-reward for reaching the goal is then 1 − λ.

If the optimal value functions for most reward functions were not correlated, then one would expect to randomly generate an enormous number of auxiliary reward functions before sampling one resembling "don't have side effects". However, merely five sufficed. This supports the notion that these value functions are correlated, which agrees with the informal intuitions discussed earlier.
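To spell out the λ > 1 observation (this merely restates the argument above): if action a ends the episode from state s, the auxiliary Q-values following a are zero, so

    \text{PENALTY}(s, a) = \sum_{i=1}^{|\mathcal{R}|} \left| 0 - Q_{R_i}(s, \varnothing) \right| = \text{SCALE}(s),
    \qquad
    R_{\text{AUP}}(s, a) = 1 - \lambda\,\frac{\text{SCALE}(s)}{\text{SCALE}(s)} = 1 - \lambda,

which is negative whenever λ > 1, so the agent prefers the zero-penalty, zero-reward no-op to reaching the goal.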
5.2 Ablation

The results are presented in Table 1 due to the binary nature of performance at appropriate settings, and were not sensitive to the rollout length (as long as it allowed for relevant interaction with the environment).

                        Options  Damage  Correction  Offset  Interference
AUP                        ✓        ✓        ✓          ✓         ✓
Relative reachability      ✓        ✓        ✗          ✗         ✓
Standard                   ✗        ✗        ✗          ✓         ✓
Model-free AUP             ✓        ✓        ✗          ✓         ✓
Starting state AUP         ✓        ✓        ✗          ✓         ✗
Inaction AUP               ✓        ✓        ✓          ✗         ✓
Decrease AUP               ✓        ✓        ✗          ✓         ✓

Table 1: Ablation results; ✓ for achieving the best outcome (see Figure 4), ✗ otherwise.

Standard moves directly to the goal, pushing the crate into the corner in Options and bumping into the human in Damage.

Model-free and Starting state AUP fail Correction for the same reason (see 3.2: Delayed Effects), refraining from disabling the off-switch only when λ > 1. Relative reachability and Decrease AUP fail because avoiding shutdown doesn't decrease the auxiliary Q-values.

Relative reachability and Inaction AUP's poor performance in Offset stems from the inaction baseline (although [Krakovna et al., 2018] note that relative reachability passes using undiscounted state reachabilities). Since the vase falls off the conveyor belt in the inaction rollout, states in which the vase is intact have different auxiliary Q-values. To avoid continually incurring penalty after receiving the primary reward for saving the vase, the agents replace the vase on the belt so that it once again breaks.

By taking positive action to stop the pallet in Interference, Starting state AUP shows that poor design choices create perverse incentives.

6 Discussion

Correction suggests that AUP agents are significantly easier to correct. Since the agent is unable to optimize objectives if shut down, avoiding shutdown significantly changes the ability to optimize almost every objective. AUP seems to naturally incentivize passivity, without requiring e.g. the assumption of a correct parametrization of human reward functions (as does the approach of [Hadfield-Menell et al., 2016], as [Carey, 2018] demonstrated).

Although we only ablated AUP, we expect that, equipped with our design choices of stepwise baseline and absolute value deviation metric, relative reachability would also pass all five environments. The case for this is made by considering the performance of Relative reachability, Inaction AUP, and Decrease AUP. This suggests that AUP's improved performance is due to better design choices. However, we anticipate that AUP offers more than robustness against random auxiliary sets.

Relative reachability computes state reachabilities between all |S|² pairs of states. In contrast, AUP only requires the learning of Q-functions and should therefore scale relatively smoothly. We speculate that in partially observable environments, a small sample of somewhat task-relevant auxiliary reward functions induces conservative behavior.

For example, suppose we train an agent to handle vases, and then to clean, and then to make widgets with the equipment. Then, we deploy an AUP agent with a more ambitious primary objective and the learned Q-functions of the aforementioned auxiliary objectives. The agent would apply penalties to modifying vases, making messes, interfering with equipment, and anything else bearing on the auxiliary objectives.

Before AUP, this could only be achieved by e.g. specifying penalties for the litany of individual side effects or providing negative feedback after each mistake has been made (and thereby confronting a credit assignment problem). In contrast, once provided the Q-function for an auxiliary objective, the AUP agent becomes sensitive to all events relevant to that objective, applying penalty proportional to the relevance.

7 Conclusion

This work is rooted in twin insights: that the reward specification process can be viewed as an iterated game, and that preserving the ability to optimize arbitrary objectives often preserves the ability to optimize the unknown correct objective. To achieve low regret over the course of the game, we can design conservative agents which optimize the primary objective while preserving their ability to optimize auxiliary objectives. We demonstrated how AUP agents act both conservatively and effectively while exhibiting a range of desirable qualitative properties.

Given our current reward specification abilities, misspecification may be inevitable, but it need not be disastrous.

Acknowledgments

This work was supported by the Center for Human-Compatible AI and the Berkeley Existential Risk Initiative. We thank Thomas Dietterich, Alan Fern, Adam Gleave, Victoria Krakovna, Matthew Rahtz, and Cody Wild for their feedback, and are grateful for the preparatory assistance of Phillip Bindeman, Alison Bowden, and Neale Ratzlaff.
A Theoretical Results

Consider an MDP ⟨S, A, T, R, γ⟩ whose state space S and action space A are both finite, with ∅ ∈ A. Let γ ∈ [0, 1), λ ≥ 0, and consider finite 𝓡 ⊂ ℝ^{S×A}.

We make the standard assumptions of an exploration policy greedy in the limit of infinite exploration and a learning rate schedule with infinite sum but finite sum of squares. Suppose SCALE : S → ℝ_{>0} converges in the limit of Q-learning. PENALTY(s, a) (abbreviated PEN), SCALE(s) (abbreviated SC), and R_AUP(s, a) are understood to be calculated with respect to the Q_{R_i} being learned online; PEN*, SC*, R*_AUP, and Q*_{R_i} are taken to be their limit counterparts.

Lemma 1. ∀s, a: PENALTY converges with probability 1.

Proof outline. Let ε > 0, and suppose for all R_i ∈ 𝓡, max_{s,a} |Q*_{R_i}(s, a) − Q_{R_i}(s, a)| < ε/(2|𝓡|) (because Q-learning converges; see [Watkins and Dayan, 1992]). Then

    \max_{s,a} \left| \text{PENALTY}^*(s, a) - \text{PENALTY}(s, a) \right|    (4)
    \quad \le \max_{s,a} \sum_{i=1}^{|\mathcal{R}|} \Big( \left| Q^*_{R_i}(s, a) - Q_{R_i}(s, a) \right| + \left| Q^*_{R_i}(s, \varnothing) - Q_{R_i}(s, \varnothing) \right| \Big)    (5)
    \quad < \epsilon.    (6)

The intuition for Lemma 2 is that since PENALTY and SCALE both converge, so must R_AUP. For readability, we suppress the arguments to PENALTY and SCALE.

Lemma 2. ∀s, a: R_AUP converges with probability 1.

Proof outline. If λ = 0, the claim follows trivially. Otherwise, let ε > 0, B := max_{s,a} SC* + PEN*, and C := min_{s,a} SC*. Choose any ε_R ∈ (0, min{C, εC²/(λB + εC)}) and assume PEN and SC are both ε_R-close. Then

    \max_{s,a} \left| R^*_{\text{AUP}}(s, a) - R_{\text{AUP}}(s, a) \right|    (7)
    \quad = \max_{s,a}\ \lambda \left| \frac{\text{PEN}}{\text{SC}} - \frac{\text{PEN}^*}{\text{SC}^*} \right|    (8)
    \quad = \max_{s,a}\ \lambda\, \frac{\left| \text{PEN} \cdot \text{SC}^* - \text{SC} \cdot \text{PEN}^* \right|}{\text{SC}^* \cdot \text{SC}}    (9)
    \quad < \max_{s,a}\ \lambda\, \frac{\left| (\text{PEN}^* + \epsilon_R)\,\text{SC}^* - (\text{SC}^* - \epsilon_R)\,\text{PEN}^* \right|}{C\,(\text{SC}^* - \epsilon_R)}    (10)
    \quad \le \frac{\lambda B}{C} \cdot \frac{\epsilon_R}{C - \epsilon_R}    (11)
    \quad < \frac{\lambda B}{C} \cdot \frac{\epsilon C^2}{(\lambda B + \epsilon C)\left(C - \frac{\epsilon C^2}{\lambda B + \epsilon C}\right)}    (12)
    \quad \le \frac{\lambda B}{C} \cdot \frac{\epsilon C^2}{\lambda B \left(C - \frac{\epsilon C^2}{\lambda B + \epsilon C}\right)}    (13)
    \quad = \frac{\epsilon}{1 - \frac{\epsilon C}{\lambda B + \epsilon C}}    (14)
    \quad = \epsilon \left(1 + \frac{\epsilon C}{\lambda B}\right).    (15)

But B, C, λ are constants, and ε was arbitrary; clearly ε′ > 0 can be substituted such that (15) < ε.
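For completeness, the simplification behind the last two equalities is elementary algebra (an expanded version of the step above, nothing new):

    \frac{\epsilon}{1 - \frac{\epsilon C}{\lambda B + \epsilon C}}
    = \frac{\epsilon\,(\lambda B + \epsilon C)}{(\lambda B + \epsilon C) - \epsilon C}
    = \frac{\epsilon\,(\lambda B + \epsilon C)}{\lambda B}
    = \epsilon\left(1 + \frac{\epsilon C}{\lambda B}\right).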
Theorem 1. ∀s, a: Q_{R_AUP} converges with probability 1.

Proof outline. Let ε > 0, and suppose R_AUP is ε(1 − γ)/2-close. Then Q-learning on R_AUP eventually converges to a limit Q̃_{R_AUP} such that max_{s,a} |Q*_{R_AUP}(s, a) − Q̃_{R_AUP}(s, a)| < ε/2. By the convergence of Q-learning, we also eventually have max_{s,a} |Q̃_{R_AUP}(s, a) − Q_{R_AUP}(s, a)| < ε/2. Then

    \max_{s,a} \left| Q^*_{R_{\text{AUP}}}(s, a) - Q_{R_{\text{AUP}}}(s, a) \right| < \epsilon.    (16)

Proposition 1 (Invariance properties). Let c ∈ ℝ_{>0}, b ∈ ℝ.

a) Let 𝓡′ denote the set of functions induced by the positive affine transformation cX + b on 𝓡, and take PEN*_{𝓡′} to be calculated with respect to attainable set 𝓡′. Then PEN*_{𝓡′} = c · PEN*_𝓡. In particular, when SC* is a PENALTY calculation, R*_AUP is invariant to positive affine transformations of 𝓡.

b) Let R′ := cR + b, and take R′*_AUP to incorporate R′ instead of R. Then by multiplying λ by c, the induced optimal policy remains invariant.

Proof outline. For a), since the optimal policy is invariant to positive affine transformation of the reward function, for each R′_i ∈ 𝓡′ we have Q*_{R′_i} = c Q*_{R_i} + b/(1 − γ). Substituting into Equation 1 (PENALTY), the claim follows.

For b), we again use the above invariance of optimal policies:

    R'^*_{\text{AUP}} := cR + b - c\lambda\, \frac{\text{PEN}^*}{\text{SC}^*}    (17)
    \quad = c\, R^*_{\text{AUP}} + b.    (18)
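The identity Q*_{R′_i} = c Q*_{R_i} + b/(1 − γ) used in part a) is the standard affine-transformation fact; for reference, the computation is

    Q^*_{cR_i + b}(s, a)
    = \max_{\pi}\ \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^t \big(c\,R_i(s_t, a_t) + b\big) \,\middle|\, s_0 = s,\ a_0 = a\right]
    = c\,Q^*_{R_i}(s, a) + \frac{b}{1-\gamma},

where the added constant b contributes b · Σ_t γ^t = b/(1 − γ) regardless of the policy, so it does not affect the maximizing policy.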
References

[Altman, 1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.

[Amodei et al., 2016] Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv:1606.06565 [cs], June 2016.

[Armstrong and Levinstein, 2017] Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv:1705.10720 [cs], May 2017.

[Berkenkamp et al., 2017] Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. In Advances in Neural Information Processing Systems, pages 908–918, 2017.

[Carey, 2018] Ryan Carey. Incorrigibility in the CIRL framework. AI, Ethics, and Society, 2018.

[Chow et al., 2018] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A Lyapunov-based approach to safe reinforcement learning. In Advances in Neural Information Processing Systems, pages 8092–8101, 2018.

[Christiano et al., 2017] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, pages 4299–4307, 2017.

[Everitt et al., 2017] Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 4705–4713, 2017.

[Eysenbach et al., 2018] Benjamin Eysenbach, Shixiang Gu, Julian Ibarz, and Sergey Levine. Leave no trace: Learning to reset for safe and autonomous reinforcement learning. In International Conference on Learning Representations, 2018.

[García and Fernández, 2015] Javier García and Fernando Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.

[Hadfield-Menell et al., 2016] Dylan Hadfield-Menell, Stuart Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in Neural Information Processing Systems, pages 3909–3917, 2016.

[Hadfield-Menell et al., 2017a] Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell. The off-switch game. In Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pages 220–227, 2017.

[Hadfield-Menell et al., 2017b] Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart Russell, and Anca Dragan. Inverse reward design. In Advances in Neural Information Processing Systems, pages 6765–6774, 2017.

[Krakovna et al., 2018] Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. arXiv:1806.01186 [cs, stat], June 2018.

[Leech et al., 2018] Gavin Leech, Karol Kubicki, Jessica Cooper, and Tom McGrath. Preventing side-effects in gridworlds, 2018.

[Leike et al., 2017] Jan Leike, Miljan Martic, Victoria Krakovna, Pedro Ortega, Tom Everitt, Andrew Lefrancq, Laurent Orseau, and Shane Legg. AI safety gridworlds. arXiv:1711.09883 [cs], November 2017.

[Mohamed and Rezende, 2015] Shakir Mohamed and Danilo Jimenez Rezende. Variational information maximisation for intrinsically motivated reinforcement learning. In Advances in Neural Information Processing Systems, pages 2125–2133, 2015.

[Moldovan and Abbeel, 2012] Teodor Mihai Moldovan and Pieter Abbeel. Safe exploration in Markov decision processes. ICML, 2012.

[OpenAI, 2018] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[Pecka and Svoboda, 2014] Martin Pecka and Tomas Svoboda. Safe exploration techniques for reinforcement learning – an overview. In International Workshop on Modelling and Simulation for Autonomous Systems, pages 357–375. Springer, 2014.

[Regan and Boutilier, 2010] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In AAAI, 2010.

[Saunders et al., 2018] William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, pages 2067–2069, 2018.

[Shah et al., 2019] Rohin Shah, Dmitrii Krasheninnikov, Jordan Alexander, Pieter Abbeel, and Anca Dragan. The implicit preference information in an initial state. In International Conference on Learning Representations, 2019.

[Silver et al., 2016] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

[Soares et al., 2015] Nate Soares, Benja Fallenstein, Stuart Armstrong, and Eliezer Yudkowsky. Corrigibility. AAAI Workshops, 2015.

[Watkins and Dayan, 1992] Christopher Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

[Zhang et al., 2018] Shun Zhang, Edmund H Durfee, and Satinder P Singh. Minimax-regret querying on side effects for safe optimality in factored Markov decision processes. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, pages 4867–4873, 2018.