=Paper=
{{Paper
|id=Vol-2808/Paper_26
|storemode=property
|title=Performance of Bounded-Rational Agents With the Ability to Self-Modify
|pdfUrl=https://ceur-ws.org/Vol-2808/Paper_26.pdf
|volume=Vol-2808
|authors=Jakub Tětek,Marek Sklenka,Tomáš Gavenčiak
|dblpUrl=https://dblp.org/rec/conf/aaai/TetekSG21
}}
==Performance of Bounded-Rational Agents With the Ability to Self-Modify==
Jakub Tětek (BARC, University of Copenhagen, j.tetek@gmail.com), Marek Sklenka (University of Oxford, sklenka.marek@gmail.com), Tomáš Gavenčiak (independent researcher, gavento@ucw.cz)

This work was supported by Grant Number 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation. Jakub Tětek was also supported by the Bakala Foundation Scholarship.

Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

===Abstract===
Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.

Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (types 1–4 below). We also discuss model assumptions and the wider problem and framing space.

We examine four ways in which an agent can be bounded-rational: it either (1) doesn't always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in cases (2)–(4) the misalignment caused by the agent's imperfection does not increase over time, with (1) the misalignment may grow exponentially.

===1 Introduction===
We face the prospect of creating superhuman (or otherwise very powerful) AI systems in the future where those systems hold significant power in the real world (Bostrom 2014; Russell 2019). Building up theoretical foundations for the study and design of such systems gives us a better chance to align them with our long-term interests. In this line of work, we study agent-like systems, i.e. systems optimizing their actions to maximize a certain utility function – the framework behind the current state-of-the-art reinforcement learning systems and one of the major proposed models for future AI systems.[1]

If strong AI systems with the ability to act in the real world are ever deployed,[2] it is very likely that they will have some means of deliberately manipulating their own implementation, either directly or indirectly (e.g. via manipulating the human controller, influencing the development of a future AI, exploiting their own bugs or physical limitations of the hardware, etc.). While the extent of those means is unknown, even weak indirect means could be extensively exploited with sufficient knowledge, compute, modelling capabilities and time.

Omohundro (2008) argues that every intelligent system has a fundamental drive for goal preservation, because when the future instance of the same agent strives towards the same goal, it is more likely that the goal will be achieved. Therefore, Omohundro argues, a rational agent should never modify into an agent optimizing different goals.

Everitt et al. (2016) examine this question formally and arrive at the same conclusion: the agent preserves its goals in time (as long as the agent's planning algorithm anticipates the consequences of self-modifications and uses the current utility function to evaluate different futures).[3] However, Everitt's analysis assumes that the agent is a perfect utility maximizer (i.e. it always takes the action with the greatest expected utility) and has perfect knowledge of the environment. These assumptions are probably unattainable in any complex environment.
To address this, we present a theoretical analysis of a self-modifying agent with imperfect optimization ability and incomplete knowledge. We model the agent in the standard cybernetic model, where the agent can be bounded-rational in two different ways: either the agent makes suboptimal decisions (is a bounded-optimization agent), or it has inaccurate knowledge. We conclude that imperfect optimization can lead to exponential deterioration of alignment through self-modification, as opposed to bounded knowledge, which does not result in future misalignment. An informal summary of the results is presented below.

Finally, we explicitly list and discuss the underlying assumptions that motivate the theoretical problem and analysis. In addition to clearly specifying the scope of conclusions, the explicit problem assumptions can be used as a rough axis to map the space of viable research questions in the area; see Sections 2 and 6.

Footnotes:
[1] Other major models include e.g. comprehensive systems of services (Drexler 2019) and "Oracle AI" or "Tool AI" (Armstrong, Sandberg, and Bostrom 2012). However, there are concerns and ongoing research into the emergence of agency in these systems (Omohundro 2008; Miller, Yampolskiy, and Häggström 2020).
[2] Proposals to prevent this include e.g. boxing (Bostrom 2014), but as e.g. Yampolskiy (2012) argues, this may be difficult or impractical.
[3] The results of Everitt et al. (2016) hold independent of the length of the time horizon or temporal discounting (by simple utility scaling).

====1.1 Summary of our results====
The result of Everitt et al. (2016) could be loosely interpreted to imply that agents with close to perfect rationality would either prefer not to self-modify, or would self-modify and only lose a negligible target value. We show that when we relax the assumption of perfect rationality, their result no longer applies. The bounded-rational agent may prefer to self-modify given the option and, in doing so, become less aligned and lose a significant part of the attainable value according to its original goals.

We use the difference between the attainable and the attained expected future value at an (arbitrarily chosen) future time point as a proxy for the degree of the agent's misalignment at that time. Specifically, for a future time t, we consider the value attainable from time t (after the agent has already run and self-modified for t time units), and we estimate the loss of value f_t relative to the non-modified agent in the same environment state. Note that f_t is not pre-discounted by the previous t steps; it is restated schematically at the end of this subsection. See Section 3 for formal definitions and Section 2 for motivation and discussion.

We consider four types of deviation from perfect rationality; see Section 4 for formal definitions.
* ε-optimizers make suboptimal decisions.
* ε-misaligned agents have inaccurate knowledge of the human utility function.
* ε-ignorant agents have inaccurate knowledge of the environment.
* γ-impatient agents have inaccurate knowledge of the correct temporal discount function.

Note that for the sake of simplicity, we use a very simple model of bounded rationality where the errors are simply bounded by the error parameters ε; this has to be taken into account when interpreting the results. However, we suspect that the asymptotic dependence of value loss on the size of errors and time would be similar for a range of natural, realistic models of bounded rationality.
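Schematically, the value-loss proxy above compares two quantities measured from time t onward. This is only a restatement of the prose; the symbols V*_t and V_t are introduced here for illustration, and the precise definitions are those of Section 3:

:<math>f_t \;=\; V^{*}_{t} - V_{t},</math>

where V*_t is the expected discounted future value attainable from the state at time t by an agent that has not self-modified, V_t is the value attained from time t by the possibly self-modified agent, both are evaluated with the original utility function and discount factor, and neither term carries the γ^t discount accumulated over the first t steps.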
'''Informal result statements.''' Self-modifying ε-optimizers may deteriorate in future alignment and performance exponentially over time, losing an exponential amount of utility compared to ε-optimizers that do not self-modify. We show upper and tight (up to a constant) lower bounds on the worst-case value loss in Theorem 7. As we decrease γ (increase discounting), the rate at which the agent's performance deteriorates increases, and the possibility of self-modification becomes a more serious problem. Our analysis of bounded-optimization agents is a generalization of Theorem 16 from Everitt et al. (2016) in the sense that their result can be easily recovered by a basic measure-theoretic argument.

Self-modifying ε_u-misaligned, ε_ρ-ignorant, or γ-impatient perfect optimizers can only lose the same value as non-self-modifying agents with the same irrationality bounds. This also holds for any combination of the three types of bounded knowledge. We give tight upper and lower bounds (up to a constant factor) for the worst-case performance. See Section 5.2 for details.

This implies that, unlike bounded-optimization agents, perfect-optimization bounded-knowledge agents do not see their performance deteriorate in time. This is because bounded-knowledge agents continue to take optimal actions with respect to their almost correct knowledge and do not self-modify in a way that would worsen their performance in their view. Therefore, the possibility of self-modification seems less dangerous in the case of bounded-knowledge agents than in the case of bounded-optimization agents.

A self-modifying agent with any combination of the four irrationality types may lose value exponential in the time step t when the agent optimization error parameter ε_o > 0. We again give tight (up to a constant factor) lower bounds on the worst-case performance of such agents. See Section 5.3 for details.

Our results do not imply that every such agent will actually perform this poorly, but the prospect of exponential deterioration is worrying in the long term, even if it happens at a much slower speed than suggested by our results. We focus on worst-case analysis because it tells us whether we can have formal guarantees of the agent's behaviour – a highly desirable property for powerful real-world autonomous systems, including a prospective AGI (artificial general intelligence) or otherwise strong AIs.

'''Overview of formal results.''' Here we summarize how much value the different types of bounded-rational agents may lose via misalignment. Note that the maximal attainable discounted value is at most 1/(1−γ) and the losses should be considered relative to that, or to the maximum attainable value in concrete scenarios; otherwise, the values for different values of γ are incomparable. In all cases, the worst-case lower and upper bounds are tight up to a constant.
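The 1/(1−γ) ceiling is a direct consequence of two model assumptions discussed in Section 2: instantaneous utility bounded between 0 and 1 at each step, and exponential discounting with factor γ < 1. Summing the geometric series gives

:<math>\sum_{k=0}^{\infty} \gamma^{k} u_{t+k} \;\le\; \sum_{k=0}^{\infty} \gamma^{k} \;=\; \frac{1}{1-\gamma} \qquad \text{for } 0 \le u_{t+k} \le 1 .</math>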
ε-optimizer agents – bounded optimization, after t steps of possible self-modification (Theorem 7):
:<math>f^{t}_{\mathrm{opt}}(\varepsilon, \gamma) \;=\; \min\!\left(\frac{\varepsilon}{\gamma^{\,t-1}},\; \frac{1}{1-\gamma}\right)</math>

ε-misaligned agents – inaccurate utility (Theorem 9):
:<math>f_{\mathrm{util}}(\varepsilon, \gamma) \;=\; \frac{2\varepsilon}{1-\gamma}</math>

ε-ignorant agents – inaccurate belief (Theorem 11):
:<math>f_{\mathrm{bel}}(\varepsilon, \gamma) \;=\; \frac{2}{1-\gamma} \;-\; \frac{2}{1-\gamma(1-\varepsilon)}</math>

γ-impatient agents – inaccurate discounting (Theorem 13), where γ* is the correct discount factor and γ is the agent's incorrect discount factor:
:<math>f_{\mathrm{disc}}(\gamma, \gamma^{*}) \;\approx\; \frac{2\,(\gamma^{*})^{\lg\frac{1}{\gamma}-1}}{1-\gamma^{*}}</math>
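To see how differently these bounds scale, the following minimal Python sketch (the helper names are ours, not the paper's) evaluates the first three expressions for a fixed error ε and discount γ over increasing t. Only the ε-optimizer bound depends on t: it grows roughly geometrically before saturating at the 1/(1−γ) ceiling, while the bounds for misaligned and ignorant perfect optimizers are constant in t.

<pre>
# Minimal sketch: evaluate the worst-case value-loss bounds stated above.
# The function names are illustrative, not taken from the paper.

def f_opt(eps, gamma, t):
    # Theorem 7 bound: self-modifying eps-optimizer after t steps, capped at 1/(1 - gamma).
    return min(eps / gamma ** (t - 1), 1.0 / (1.0 - gamma))

def f_util(eps, gamma):
    # Theorem 9 bound: eps-misaligned perfect optimizer (independent of t).
    return 2.0 * eps / (1.0 - gamma)

def f_bel(eps, gamma):
    # Theorem 11 bound: eps-ignorant perfect optimizer (independent of t).
    return 2.0 / (1.0 - gamma) - 2.0 / (1.0 - gamma * (1.0 - eps))

eps, gamma = 0.01, 0.9
for t in (1, 10, 50, 100):
    print(f"t={t:3d}  f_opt={f_opt(eps, gamma, t):7.3f}  "
          f"f_util={f_util(eps, gamma):.3f}  f_bel={f_bel(eps, gamma):.3f}")
# f_opt rises from 0.010 towards the cap 1/(1 - gamma) = 10, while f_util and
# f_bel do not change with t.
</pre>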
===2 Assumptions and rationale===
Both the statement of the problem and its relevance to AI alignment rest on a set of assumptions listed below. While this list is non-exhaustive, we try to cover the main implicit and explicit choices in our framing, and the space of alternatives. This is largely in the hope of eventually finding a better, more robust theoretical framework for solving agent self-modification within the context of AI alignment, but even further negative results in this space would inform our intuitions on what aspects of self-modification make the problem harder.

We propose consideration of various assumptions as a framework for thinking about prospective realistic agent models that admit formal guarantees. We invite further research and generalizations in this area, one high-level goal being to map a part of the space of agent models and assumptions that do or do not permit guarantees, eventually finding agent models that do come with meaningful guarantees. Further negative results would inform our intuitions on what aspects of the problem make it harder.

(i) Bounded rationality model. In the models of ε-bounded-rational agents defined in Section 4.1, ε is generally an upper bound on the size of the optimization or knowledge error. One interpretation of our results is that value drift can happen even if the error is bounded at every step. One could argue that a more realistic scenario would assume some distribution of the sizes of the errors, with larger errors less likely or less frequent; see the discussion below and in Section 6.

(ii) Unlimited self-modification ability. We assume the agent is able to perform any self-modification at any time. This models the worst-case scenario when compared to a limited but still perfectly controlled self-modification. However, embedded (non-dualistic) agents in complex environments may achieve almost-unlimited self-modification from a limited ability, e.g. over a longer time span; see e.g. Demski and Garrabrant (2019). We model the agent's self-modifications as orthogonal to actions in the environment.

(iii) Modification-independence. We assume that the agent's utility function does not explicitly reward or punish self-modifications. We also assume that self-modifications do not have any direct effect on the environment. This is captured by Definition 2.

(iv) No corrigibility mechanisms. We do not consider systems that would allow human operators to correct the system's goals, knowledge or behaviour. The problem of robust strong AI corrigibility is far from solved today, and this paper can be read as a further argument for substantially more research in this direction.

(v) Worst-case analysis and bound tightness. We focus on worst-case performance guarantees in abstracted models rather than e.g. full distributional analysis, and we show that our worst-case bounds are attainable (up to constant factors) under certain agent behaviour. Note that this approach may turn out to be too pessimistic or even impossible in some settings (e.g. quantum physics).

(vi) Bounded value attainable per time unit. We assume the agent obtains instantaneous utility between 0 and 1 at each time step. This is not an arbitrary choice: a constant bound on instantaneous value can be normalized to this interval. Instantaneous values bounded by a function of time U(t) < µ^t can be pre-discounted when γµ < 1, and generally lead to infinite future values otherwise, which we disallow here to avoid foundational problems.

(vii) Temporal value discounting. We assume the agent employs some form of temporal value discounting. This could be motivated by technical or algorithmic limitations, by increasing uncertainty about the future, or by avoiding issues with incomparable infinite values of considered futures (see Bostrom (2011) for a discussion of infinite ethics). Discounting, however, contrasts with the long-termist view; see the discussion below.

(viii) Exponential discounting. Our model assumes the agent discounts future utility exponentially, a standard assumption in artificial intelligence and the only time-invariant discounting schema (Strotz 1955) leading to consistent preferences over time; a short illustration follows after this list.

(ix) Unbounded temporal horizons. Our analysis focuses on the long-term behaviour of the agent, in particular stability and performance from the perspective of future stakeholders (sharing the original utility function). Note that our results also to some extent apply to finite-horizon but long-running systems.

Temporal discounting contrasts with the long-termist view: why not model non-discounted future utility directly? Noting the motivations we mention in (vii), we agree that models of future value aggregation other than discounting would generally be better suited for long-term objectives. However, this seems to be a difficult task, as such models are neither well developed nor currently used in open-ended AI algorithms (with the obvious exception of a finite time horizon, which we propose to explore in Section 6). We therefore propose a direct interpretation of our results: assuming we implement agents that are ε-optimizers with discounting, they may become exponentially less aligned over time. This is not the case with perfect optimizers that have imperfect knowledge and discounting.

(x) Dualistic setting. We assume a dualistic agent and allow self-modification through special actions. This allows us to formally model one aspect of embedded agency – at least until there are sufficient theoretical foundations of embedded agency. Note that in the embedded (non-dualistic) agent setting, it …
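To illustrate the time-invariance claim in (viii): under exponential discounting, the relative weight of two rewards that are k steps apart does not depend on the time at which they are evaluated,

:<math>\frac{\gamma^{t+k}}{\gamma^{t}} \;=\; \gamma^{k} \qquad \text{for every evaluation time } t,</math>

whereas a non-exponential schedule such as hyperbolic weighting 1/(1+κt) makes this ratio depend on t, so a plan preferred now can become dispreferred later (the preference-reversal problem analysed by Strotz (1955)).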
We consider agents with inaccurate knowledge of the correct utility function (Definition 4), inaccurate knowledge of the world (Definitions 5 and 6), and inaccurate knowledge of the correct discount factor (how much future reward is worth compared to reward in the present).

'''Misaligned agents.''' We define ε-misaligned agents as agents whose utility function u has absolute error ε with respect to the correct utility function u*.

… a perception. A history is a sequence of action-perception pairs æ_1 æ_2 … æ_t. We will often abbreviate such sequences to æ…

… 2^|Π| > |Π|. A history can now be written as:

:æ_{1:t} = a_1 e_1 a_2 e_2 … a_t e_t = ă_1 π_2 e_1 ă_2 π_3 e_2 … ă_t π_{t+1} e_t

The subscripts for policies are one time step ahead because the policy chosen at time t is used to pick an action at time t + 1. The subscript denotes at which time step the policy is used. Policy π_t is used to choose the action a_t = (ă_t, π_{t+1}); no policy modification happens when a_t = (ă_t, π_t) (see the sketch below). In the previous section, we used these rules to calculate the probability of any finite history: P(e_t | æ…
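The policy-modification mechanism in the last fragment can be made concrete with a small illustrative sketch (our own code, not the paper's formalism): each action is a pair of an "inner" action sent to the environment and the policy to be used at the next step, so choosing the current policy again means no self-modification at that step.

<pre>
# Illustrative sketch (ours, not from the paper): a dualistic interaction loop in
# which each action a_t = (inner action, next policy); keeping the current policy
# corresponds to the "no policy modification" case a_t = (a-breve_t, pi_t).

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Action:
    inner: Any                 # a-breve_t, the action actually sent to the environment
    next_policy: Callable      # pi_{t+1}, the policy that chooses the action at t + 1

def run(initial_policy, environment, steps):
    history = []               # list of (inner action, percept) pairs
    policy = initial_policy
    for _ in range(steps):
        action = policy(history)             # pi_t chooses a_t = (a-breve_t, pi_{t+1})
        percept = environment(action.inner)  # environment returns percept e_t
        history.append((action.inner, percept))
        policy = action.next_policy          # the self-modification takes effect at t + 1
    return history

def constant_policy(history):
    # A policy that never self-modifies: it always hands itself over as pi_{t+1}.
    return Action(inner="noop", next_policy=constant_policy)

print(len(run(constant_policy, lambda inner: "observation", steps=3)))  # prints 3
</pre>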