      Performance of Bounded-Rational Agents With the Ability to Self-Modify*
                                  Jakub Tětek¹†, Marek Sklenka², Tomáš Gavenčiak³
                                     ¹ BARC, University of Copenhagen, j.tetek@gmail.com
                                      ² University of Oxford, sklenka.marek@gmail.com
                                        ³ Independent researcher, gavento@ucw.cz

   * Supported by Grant Number 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation.
   † This author was supported by the Bakala Foundation Scholarship.
   Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                            Abstract

Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.

Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss model assumptions and the wider problem and framing space.

We examine four ways in which an agent can be bounded-rational: it either (1) doesn't always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in the cases (2)-(4) the misalignment caused by the agent's imperfection does not increase over time, with (1) the misalignment may grow exponentially.

                     1    Introduction

We face the prospect of creating superhuman (or otherwise very powerful) AI systems in the future where those systems hold significant power in the real world (Bostrom 2014; Russell 2019). Building up theoretical foundations for the study and design of such systems gives us a better chance to align them with our long-term interests. In this line of work, we study agent-like systems, i.e. systems optimizing their actions to maximize a certain utility function – the framework behind the current state-of-the-art reinforcement learning systems and one of the major proposed models for future AI systems.¹

   ¹ Other major models include e.g. comprehensive systems of services (Drexler 2019) and "Oracle AI" or "Tool AI" (Armstrong, Sandberg, and Bostrom 2012). However, there are concerns and ongoing research into the emergence of agency in these systems (Omohundro 2008; Miller, Yampolskiy, and Häggström 2020).

   If strong AI systems with the ability to act in the real world are ever deployed², it is very likely that they will have some means of deliberately manipulating their own implementation, either directly or indirectly (e.g. via manipulating the human controller, influencing the development of a future AI, exploiting their own bugs or physical limitations of the hardware, etc.). While the extent of those means is unknown, even weak indirect means could be extensively exploited with sufficient knowledge, compute, modelling capabilities and time.

   ² Proposals to prevent this include e.g. boxing (Bostrom 2014) but as e.g. Yampolskiy (2012) argues, this may be difficult or impractical.

   Omohundro (2008) argues that every intelligent system has a fundamental drive for goal preservation, because when the future instance of the same agent strives towards the same goal, it is more likely that the goal will be achieved. Therefore, Omohundro argues, a rational agent should never modify into an agent optimizing different goals.

   Everitt et al. (2016) examine this question formally and arrive at the same conclusion: that the agent preserves its goals in time (as long as the agent's planning algorithm anticipates the consequences of self-modifications and uses the current utility function to evaluate different futures).³ However, Everitt's analysis assumes that the agent is a perfect utility maximizer (i.e. it always takes the action with the greatest expected utility) and has perfect knowledge of the environment. These assumptions are probably unattainable in any complex environment.

   ³ Everitt et al. (2016)'s results hold independent of the length of the time horizon or temporal discounting (by simple utility scaling).

   To address this, we present a theoretical analysis of a self-modifying agent with imperfect optimization ability and incomplete knowledge. We model the agent in the standard
cybernetic model where the agent can be bounded-rational in two different ways. Either the agent makes suboptimal decisions (is a bounded-optimization agent) or has inaccurate knowledge. We conclude that imperfect optimization can lead to exponential deterioration of alignment through self-modification, as opposed to bounded knowledge, which does not result in future misalignment. An informal summary of the results is presented below.

   Finally, we explicitly list and discuss the underlying assumptions that motivate the theoretical problem and analysis. In addition to clearly specifying the scope of conclusions, the explicit problem assumptions can be used as a rough axis to map the space of viable research questions in the area; see Sections 2 and 6.

1.1   Summary of our results

The result of Everitt et al. (2016) could be loosely interpreted to imply that agents with close to perfect rationality would either prefer not to self-modify, or would self-modify and only lose a negligible target value.

   We show that when we relax the assumption of perfect rationality, their result no longer applies. The bounded-rational agent may prefer to self-modify given the option and in doing so, become less aligned and lose a significant part of the attainable value according to its original goals.

   We use the difference between the attainable and attained expected future value at an (arbitrarily chosen) future time point as a proxy for the degree of the agent's misalignment at that time. Specifically, for a future time t, we consider the value attainable from time t (after the agent already ran and self-modified for t time units), and we estimate the loss of value f^t relative to the non-modified agent in the same environment state. Note that f^t is not pre-discounted by the previous t steps. See Section 3 for formal definitions and Section 2 for motivation and discussion.
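   Schematically (our own paraphrase; the formal definitions are in Section 3 and are not reproduced in this excerpt, and the symbols V*_t and V_t are ours): f^t ≈ V*_t − V_t, where V*_t is the expected future value attainable from time t onward by the original, unmodified agent and V_t is the value the possibly self-modified agent actually attains from time t, both measured by the original utility function and not multiplied by the discount accumulated over the first t steps.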
   We consider four types of deviation from perfect rationality, see Section 4 for formal definitions.
• ε-optimizers make suboptimal decisions.
• ε-misaligned agents have inaccurate knowledge of the human utility function.
• ε-ignorant agents have inaccurate knowledge of the environment.
• ε-impatient agents have inaccurate knowledge of the correct temporal discount function.
   Note that for the sake of simplicity, we use a very simple model of bounded rationality where the errors are simply bounded by the error parameters ε•; this has to be taken into account when interpreting the results. However, we suspect that the asymptotic dependence of value loss on the size of errors and time would be similar for a range of natural, realistic models of bounded rationality.

Informal result statements

Self-modifying ε-optimizers may deteriorate in future alignment and performance exponentially over time, losing an exponential amount of utility compared to ε-optimizers that do not self-modify. We show upper and tight lower bounds (by a constant) on the worst-case value loss in Theorem 7. As we decrease γ (increase discounting), the rate at which the agent's performance deteriorates increases and the possibility of self-modification becomes a more serious problem.

   Our analysis of bounded-optimization agents is a generalization of Theorem 16 from Everitt et al. (2016) in the sense that their result can be easily recovered by a basic measure-theoretic argument.

Self-modifying εu-misaligned, ερ-ignorant, or εγ-impatient perfect optimizers can only lose the same value as non-self-modifying agents with the same irrationality bounds. This also holds for any combination of the three types of bounded knowledge. We give tight upper and lower bounds (up to a constant factor) for the worst-case performance. See Section 5.2 for details.

   This implies that unlike bounded-optimization agents, the performance of perfect-optimization bounded-knowledge agents does not deteriorate in time. This is because bounded-knowledge agents continue to take optimal actions with respect to their almost correct knowledge and do not self-modify in a way that would worsen their performance in their view. Therefore, the possibility of self-modification seems less dangerous in the case of bounded-knowledge agents than in the case of bounded-optimization agents.

A self-modifying agent with any combination of the four irrationality types may lose value exponential in the time step t when the agent optimization error parameter εo > 0. We again give tight (up to a constant factor) lower bounds on the worst-case performance of such agents. See Section 5.3 for details.

   Our results do not imply that every such agent will actually perform this poorly but the prospect of exponential deterioration is worrying in the long-term, even if it happens at a much slower speed than suggested by our results. We focus on worst-case analysis because it tells us whether we can have formal guarantees of the agent's behaviour – a highly desirable property for powerful real-world autonomous systems, including a prospective AGI (artificial general intelligence) or otherwise strong AIs.

Overview of formal results. Here we summarize how much value the different types of bounded-rational agents may lose via misalignment. Note that the maximal attainable discounted value is at most 1/(1−γ) and the losses should be considered relative to that, or to the maximum attainable value in concrete scenarios. Otherwise, the values for different values of γ are incomparable. In all cases, the worst-case lower and upper bounds are tight up to a constant.

ε-optimizer agents – bounded optimization, after t steps of possible self-modification (Theorem 7)

    f^t_opt(ε, γ) = min( ε / γ^(t−1), 1 / (1−γ) )

ε-misaligned agents – inaccurate utility (Theorem 9)

    f_util(ε, γ) = 2ε / (1−γ)
ε-ignorant agents – inaccurate belief (Theorem 11)

    f_bel(ε, γ) = 2 / (1−γ) − 2 / (1 − γ(1−ε))
ε-impatient agents – inaccurate discounting (Theorem 13)
Here γ* is the correct discount factor and γ is the agent's incorrect discount factor.

    f_disc(γ, γ*) ≈ ( 2 γ*^(−1/lg γ) − 1 ) / (1 − γ*)
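To make these rates concrete, the short Python sketch below (our own illustration; the parameter values ε = 0.01, γ = 0.9, γ* = 0.95 are arbitrary and not taken from the paper) evaluates the four worst-case bounds stated above.

    import math

    def f_opt(eps, gamma, t):
        # Worst-case loss bound for an eps-optimizer after t steps (Theorem 7).
        return min(eps / gamma ** (t - 1), 1.0 / (1.0 - gamma))

    def f_util(eps, gamma):
        # Worst-case loss bound for an eps-misaligned agent (Theorem 9).
        return 2.0 * eps / (1.0 - gamma)

    def f_bel(eps, gamma):
        # Worst-case loss bound for an eps-ignorant agent (Theorem 11).
        return 2.0 / (1.0 - gamma) - 2.0 / (1.0 - gamma * (1.0 - eps))

    def f_disc(gamma, gamma_star):
        # Approximate worst-case loss bound for an impatient agent (Theorem 13),
        # following the formula as reconstructed above.
        return (2.0 * gamma_star ** (-1.0 / math.log2(gamma)) - 1.0) / (1.0 - gamma_star)

    eps, gamma, gamma_star = 0.01, 0.9, 0.95
    for t in (1, 10, 50, 100):
        # f_opt grows like eps / gamma^(t-1), capped at 1/(1-gamma) = 10.
        print(f"t = {t:3d}   f_opt = {f_opt(eps, gamma, t):6.2f}")
    print(f"f_util = {f_util(eps, gamma):.2f}, f_bel = {f_bel(eps, gamma):.2f}, f_disc = {f_disc(gamma, gamma_star):.2f}")

Only f_opt depends on the time step t: with these illustrative numbers it reaches the 1/(1−γ) = 10 ceiling well before t = 100, while the three knowledge-related bounds remain constant in t.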
              2    Assumptions and rationale

Both the statement of the problem and its relevance to AI alignment rest on a set of assumptions listed below. While this list is non-exhaustive, we try to cover the main implicit and explicit choices in our framing, and the space of alternatives. This is largely in hope of eventually finding a better, more robust theoretical framework for solving agent self-modification within the context of AI alignment, but even further negative results in the space would inform our intuitions on what aspects of self-modification make the problem harder.

   We propose consideration of various assumptions as a framework for thinking about prospective realistic agent models that admit formal guarantees. We invite further research and generalizations in this area, one high-level goal being to map a part of the space of agent models and assumptions that do or do not permit guarantees, eventually finding agent models that do come with meaningful guarantees. Further negative results would inform our intuitions on what aspects of the problem make it harder.

(i) Bounded rationality model. In the models of ε-bounded-rational agents defined in Section 4.1, ε is generally an upper bound on the size of the optimization or knowledge error. One interpretation of our results is that value drift can happen even if the error is bounded at every step. One could argue that a more realistic scenario would assume some distribution of the size of the errors, with larger errors being less likely or less frequent; see discussion below and in Section 6.

(ii) Unlimited self-modification ability. We assume the agent is able to perform any self-modification at any time. This models the worst-case scenario when compared to a limited but still perfectly controlled self-modification. However, embedded (non-dualistic) agents in complex environments may achieve almost-unlimited self-modification from a limited ability, e.g. over a longer time span; see e.g. (Demski and Garrabrant 2019). We model the agent's self-modifications as orthogonal to actions in the environment.

(iii) Modification-independence. We assume that the agent's utility function does not explicitly reward or punish self-modifications. We also assume that self-modifications do not have any direct effect on the environment. This is captured by Definition 2.

(iv) No corrigibility mechanisms. We do not consider systems that would allow human operators to correct the system's goals, knowledge or behaviour. The problem of robust strong AI corrigibility is far from solved today and this paper can be read as a further argument for substantially more research in this direction.

(v) Worst-case analysis and bound tightness. We focus on worst-case performance guarantees in abstracted models rather than e.g. full distributional analysis, and we show that our worst-case bounds are attainable (up to constant factors) under certain agent behaviour. Note this approach may turn out to be too pessimistic or even impossible in some settings (e.g. quantum physics).

(vi) Bounded value attainable per time unit. We assume the agent obtains instantaneous utility between 0 and 1 at each time step. This is not an arbitrary choice: A constant bound on instantaneous value can be normalized to this interval. Instantaneous values bounded by a function of time U(t) < µ^t can be pre-discounted when γµ < 1, and generally lead to infinite future values otherwise, which we disallow here to avoid foundational problems.

(vii) Temporal value discounting. We assume the agent employs some form of temporal value discounting. This could be motivated by technical or algorithmic limitations, increasing uncertainty about the future, or to avoid issues with incomparable infinite values of considered futures (see Bostrom (2011) for a discussion of infinite ethics). Discounting, however, contrasts with the long-termist view; see the discussion below.

(viii) Exponential discounting. Our model assumes the agent discounts future utility exponentially, a standard assumption in artificial intelligence and the only time-invariant discounting schema (Strotz 1955) leading to consistent preferences over time.

(ix) Unbounded temporal horizons. Our analysis focuses on the long-term behaviour of the agent, in particular stability and performance from the perspective of future stakeholders (sharing the original utility function). Note that our results also to some extent apply to finite-horizon but long-running systems.
    Temporal discounting contrasts with the long-termist view: Why not model non-discounted future utility directly? Noting the motivations we mention in (vii), we agree that models of future value aggregation other than discounting would be generally better suited for long-term objectives. However, this seems to be a difficult task, as such models are neither well developed nor currently used in open-ended AI algorithms (with the obvious exception of a finite time horizon, which we propose to explore in Section 6).
    We therefore propose a direct interpretation of our results: Assuming we implement agents that are ε-optimizers with discounting, they may become exponentially less aligned over time. This is not the case with perfect optimizers with imperfect knowledge and discounting.

(x) Dualistic setting. We assume a dualistic agent and allow self-modification through special actions. This allows us to formally model one aspect of embedded agency – at
least until there are sufficient theoretical foundations of embedded agency.
   Note that in the embedded (non-dualistic) agent setting, it

a perception. A history is a sequence of action-perception pairs æ1 æ2 ... æt. We will often abbreviate such sequences to æ

2|Π| > |Π|. A history can now be written as:

    æ1:t = a1 e1 a2 e2 ... at et = ă1 π2 e1 ă2 π3 e2 ... ăt πt+1 et

The subscripts for policies are one time step ahead because the policy chosen at time t is used to pick an action at time t + 1. The subscript denotes at which time step the policy is used. Policy πt is used to choose the action at = (ăt, πt+1). No policy modification happens when at = (ăt, πt).
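This bookkeeping can be made concrete with a minimal Python sketch (our own illustration, not the paper's formalism; all names here are ours): a policy maps the history so far to a pair consisting of an environment action and the policy to be used at the next step.

    from typing import Callable, List, Tuple

    History = List[Tuple[str, str]]                # action-perception pairs æ1 æ2 ... æt
    Policy = Callable[[History], Tuple[str, "Policy"]]

    def run(pi: Policy, env_step: Callable[[History, str], str], steps: int) -> History:
        # Run the agent; the policy returned at time t is the one used at time t + 1.
        history: History = []
        for _ in range(steps):
            env_action, pi_next = pi(history)      # a_t = (env action, pi_{t+1})
            percept = env_step(history, env_action)
            history.append((env_action, percept))
            pi = pi_next                           # self-modification; a no-op iff pi_next equals pi
        return history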
   In the previous section, we used these rules to calculate the probability of any finite history: P(et | æ

We consider agents with inaccurate knowledge of the correct utility function (Definition 4), inaccurate knowledge of the world (Definitions 5 and 6), and inaccurate knowledge of the correct discount factor (how much future reward is worth compared to reward in the present).

Misaligned agents   We define ε-misaligned agents as agents whose utility function u has absolute error ε with respect to the correct utility function u∗.
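Definition 4 itself is not reproduced in this excerpt; one natural reading of "absolute error ε" is a uniform bound over all finite histories, e.g. |u(æ1:t) − u∗(æ1:t)| ≤ ε for every æ1:t, so that the misaligned utility u never deviates from the true utility u∗ by more than ε on any single evaluation.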