      Performance of Bounded-Rational Agents With the Ability to Self-Modify*
                                  Jakub Tětek¹†, Marek Sklenka², Tomáš Gavenčiak³
                                     ¹ BARC, University of Copenhagen, j.tetek@gmail.com
                                      ² University of Oxford, sklenka.marek@gmail.com
                                        ³ Independent researcher, gavento@ucw.cz

   * Supported by Grant Number 16582, Basic Algorithms Research Copenhagen (BARC), from the VILLUM Foundation.
   † This author was supported by the Bakala Foundation Scholarship.
   Copyright © 2021, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




                            Abstract

Self-modification of agents embedded in complex environments is hard to avoid, whether it happens via direct means (e.g. own code modification) or indirectly (e.g. influencing the operator, exploiting bugs or the environment). It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.

Everitt et al. (2016) formally show that providing an option to self-modify is harmless for perfectly rational agents. We show that this result is no longer true for agents with bounded rationality. In such agents, self-modification may cause exponential deterioration in performance and gradual misalignment of a previously aligned agent. We investigate how the size of this effect depends on the type and magnitude of imperfections in the agent's rationality (1-4 below). We also discuss model assumptions and the wider problem and framing space.

We examine four ways in which an agent can be bounded-rational: it either (1) doesn't always choose the optimal action, (2) is not perfectly aligned with human values, (3) has an inaccurate model of the environment, or (4) uses the wrong temporal discounting factor. We show that while in the cases (2)-(4) the misalignment caused by the agent's imperfection does not increase over time, with (1) the misalignment may grow exponentially.

                     1    Introduction

We face the prospect of creating superhuman (or otherwise very powerful) AI systems in the future where those systems hold significant power in the real world (Bostrom 2014; Russell 2019). Building up theoretical foundations for the study and design of such systems gives us a better chance to align them with our long-term interests. In this line of work, we study agent-like systems, i.e. systems optimizing their actions to maximize a certain utility function – the framework behind the current state-of-the-art reinforcement learning systems and one of the major proposed models for future AI systems.¹

   ¹ Other major models include e.g. comprehensive systems of services (Drexler 2019) and "Oracle AI" or "Tool AI" (Armstrong, Sandberg, and Bostrom 2012). However, there are concerns and ongoing research into the emergence of agency in these systems (Omohundro 2008; Miller, Yampolskiy, and Häggström 2020).

   If strong AI systems with the ability to act in the real world are ever deployed², it is very likely that they will have some means of deliberately manipulating their own implementation, either directly or indirectly (e.g. via manipulating the human controller, influencing the development of a future AI, exploiting their own bugs or physical limitations of the hardware, etc.). While the extent of those means is unknown, even weak indirect means could be extensively exploited with sufficient knowledge, compute, modelling capabilities and time.

   ² Proposals to prevent this include e.g. boxing (Bostrom 2014) but as e.g. Yampolskiy (2012) argues, this may be difficult or impractical.

   Omohundro (2008) argues that every intelligent system has a fundamental drive for goal preservation, because when the future instance of the same agent strives towards the same goal, it is more likely that the goal will be achieved. Therefore, Omohundro argues, a rational agent should never modify into an agent optimizing different goals.

   Everitt et al. (2016) examine this question formally and arrive at the same conclusion: that the agent preserves its goals in time (as long as the agent's planning algorithm anticipates the consequences of self-modifications and uses the current utility function to evaluate different futures).³ However, Everitt's analysis assumes that the agent is a perfect utility maximizer (i.e. it always takes the action with the greatest expected utility) and has perfect knowledge of the environment. These assumptions are probably unattainable in any complex environment.

   ³ Everitt et al. (2016)'s results hold independent of the length of the time horizon or temporal discounting (by simple utility scaling).

   To address this, we present a theoretical analysis of a self-modifying agent with imperfect optimization ability and incomplete knowledge. We model the agent in the standard
cybernetic model where the agent can be bounded-rational in two different ways. Either the agent makes suboptimal decisions (is a bounded-optimization agent) or has inaccurate knowledge. We conclude that imperfect optimization can lead to exponential deterioration of alignment through self-modification, as opposed to bounded knowledge, which does not result in future misalignment. An informal summary of the results is presented below.

   Finally, we explicitly list and discuss the underlying assumptions that motivate the theoretical problem and analysis. In addition to clearly specifying the scope of conclusions, the explicit problem assumptions can be used as a rough axis to map the space of viable research questions in the area; see Sections 2 and 6.

1.1   Summary of our results

The result of Everitt et al. (2016) could be loosely interpreted to imply that agents with close to perfect rationality would either prefer not to self-modify, or would self-modify and only lose a negligible target value.

   We show that when we relax the assumption of perfect rationality, their result no longer applies. The bounded-rational agent may prefer to self-modify given the option and in doing so, become less aligned and lose a significant part of the attainable value according to its original goals.

   We use the difference between the attainable and attained expected future value at an (arbitrarily chosen) future time point as a proxy for the degree of the agent's misalignment at that time. Specifically, for a future time t, we consider the value attainable from time t (after the agent already ran and self-modified for t time units), and we estimate the loss of value f^t relative to the non-modified agent in the same environment state. Note that f^t is not pre-discounted by the previous t steps. See Section 3 for formal definitions and Section 2 for motivation and discussion.
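   Schematically (our own paraphrase; the formal definitions are in Section 3 and are not reproduced in this excerpt, and the symbols V*_t and V_t are ours): f^t ≈ V*_t − V_t, where V*_t is the expected future value attainable from time t onward by the original, unmodified agent and V_t is the value the possibly self-modified agent actually attains from time t, both measured by the original utility function and not multiplied by the discount accumulated over the first t steps.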
   We consider four types of deviation from perfect rationality, see Section 4 for formal definitions.
• ε-optimizers make suboptimal decisions.
• ε-misaligned agents have inaccurate knowledge of the human utility function.
• ε-ignorant agents have inaccurate knowledge of the environment.
• ε-impatient agents have inaccurate knowledge of the correct temporal discount function.
   Note that for the sake of simplicity, we use a very simple model of bounded rationality where the errors are simply bounded by the error parameters ε•; this has to be taken into account when interpreting the results. However, we suspect that the asymptotic dependence of value loss on the size of errors and time would be similar for a range of natural, realistic models of bounded rationality.

Informal result statements

Self-modifying ε-optimizers may deteriorate in future alignment and performance exponentially over time, losing an exponential amount of utility compared to ε-optimizers that do not self-modify. We show upper and tight lower bounds (by a constant) on the worst-case value loss in Theorem 7. As we decrease γ (increase discounting), the rate at which the agent's performance deteriorates increases and the possibility of self-modification becomes a more serious problem.

   Our analysis of bounded-optimization agents is a generalization of Theorem 16 from Everitt et al. (2016) in the sense that their result can be easily recovered by a basic measure-theoretic argument.

Self-modifying εu-misaligned, ερ-ignorant, or εγ-impatient perfect optimizers can only lose the same value as non-self-modifying agents with the same irrationality bounds. This also holds for any combination of the three types of bounded knowledge. We give tight upper and lower bounds (up to a constant factor) for the worst-case performance. See Section 5.2 for details.

   This implies that unlike bounded-optimization agents, the performance of perfect-optimization bounded-knowledge agents does not deteriorate in time. This is because bounded-knowledge agents continue to take optimal actions with respect to their almost correct knowledge and do not self-modify in a way that would worsen their performance in their view. Therefore, the possibility of self-modification seems less dangerous in the case of bounded-knowledge agents than in the case of bounded-optimization agents.

A self-modifying agent with any combination of the four irrationality types may lose value exponential in the time step t when the agent optimization error parameter εo > 0. We again give tight (up to a constant factor) lower bounds on the worst-case performance of such agents. See Section 5.3 for details.

   Our results do not imply that every such agent will actually perform this poorly but the prospect of exponential deterioration is worrying in the long-term, even if it happens at a much slower speed than suggested by our results. We focus on worst-case analysis because it tells us whether we can have formal guarantees of the agent's behaviour – a highly desirable property for powerful real-world autonomous systems, including a prospective AGI (artificial general intelligence) or otherwise strong AIs.

Overview of formal results. Here we summarize how much value the different types of bounded-rational agents may lose via misalignment. Note that the maximal attainable discounted value is at most 1/(1−γ) and the losses should be considered relative to that, or to the maximum attainable value in concrete scenarios. Otherwise, the values for different values of γ are incomparable. In all cases, the worst-case lower and upper bounds are tight up to a constant.

ε-optimizer agents – bounded optimization, after t steps of possible self-modification (Theorem 7)

    f^t_opt(ε, γ) = min( ε / γ^(t−1), 1 / (1−γ) )

ε-misaligned agents – inaccurate utility (Theorem 9)

    f_util(ε, γ) = 2ε / (1−γ)
ε-ignorant agents – inaccurate belief (Theorem 11)

    f_bel(ε, γ) = 2 / (1−γ) − 2 / (1 − γ(1−ε))
ε-impatient agents – inaccurate discounting (Theorem 13)
Here γ* is the correct discount factor and γ is the agent's incorrect discount factor.

    f_disc(γ, γ*) ≈ ( 2 γ*^(−1/lg γ) − 1 ) / (1 − γ*)
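To make these rates concrete, the short Python sketch below (our own illustration; the parameter values ε = 0.01, γ = 0.9, γ* = 0.95 are arbitrary and not taken from the paper) evaluates the four worst-case bounds stated above.

    import math

    def f_opt(eps, gamma, t):
        # Worst-case loss bound for an eps-optimizer after t steps (Theorem 7).
        return min(eps / gamma ** (t - 1), 1.0 / (1.0 - gamma))

    def f_util(eps, gamma):
        # Worst-case loss bound for an eps-misaligned agent (Theorem 9).
        return 2.0 * eps / (1.0 - gamma)

    def f_bel(eps, gamma):
        # Worst-case loss bound for an eps-ignorant agent (Theorem 11).
        return 2.0 / (1.0 - gamma) - 2.0 / (1.0 - gamma * (1.0 - eps))

    def f_disc(gamma, gamma_star):
        # Approximate worst-case loss bound for an impatient agent (Theorem 13),
        # following the formula as reconstructed above.
        return (2.0 * gamma_star ** (-1.0 / math.log2(gamma)) - 1.0) / (1.0 - gamma_star)

    eps, gamma, gamma_star = 0.01, 0.9, 0.95
    for t in (1, 10, 50, 100):
        # f_opt grows like eps / gamma^(t-1), capped at 1/(1-gamma) = 10.
        print(f"t = {t:3d}   f_opt = {f_opt(eps, gamma, t):6.2f}")
    print(f"f_util = {f_util(eps, gamma):.2f}, f_bel = {f_bel(eps, gamma):.2f}, f_disc = {f_disc(gamma, gamma_star):.2f}")

Only f_opt depends on the time step t: with these illustrative numbers it reaches the 1/(1−γ) = 10 ceiling well before t = 100, while the three knowledge-related bounds remain constant in t.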
              2    Assumptions and rationale

Both the statement of the problem and its relevance to AI alignment rest on a set of assumptions listed below. While this list is non-exhaustive, we try to cover the main implicit and explicit choices in our framing, and the space of alternatives. This is largely in hope of eventually finding a better, more robust theoretical framework for solving agent self-modification within the context of AI alignment, but even further negative results in the space would inform our intuitions on what aspects of self-modification make the problem harder.

   We propose consideration of various assumptions as a framework for thinking about prospective realistic agent models that admit formal guarantees. We invite further research and generalizations in this area, one high-level goal being to map a part of the space of agent models and assumptions that do or do not permit guarantees, eventually finding agent models that do come with meaningful guarantees. Further negative results would inform our intuitions on what aspects of the problem make it harder.

(i) Bounded rationality model. In the models of ε-bounded-rational agents defined in Section 4.1, ε is generally an upper bound on the size of the optimization or knowledge error. One interpretation of our results is that value drift can happen even if the error is bounded at every step. One could argue that a more realistic scenario would assume some distribution of the size of the errors, with larger errors being less likely or less frequent; see discussion below and in Section 6.

(ii) Unlimited self-modification ability. We assume the agent is able to perform any self-modification at any time. This models the worst-case scenario when compared to a limited but still perfectly controlled self-modification. However, embedded (non-dualistic) agents in complex environments may achieve almost-unlimited self-modification from a limited ability, e.g. over a longer time span; see e.g. (Demski and Garrabrant 2019). We model the agent's self-modifications as orthogonal to actions in the environment.

(iii) Modification-independence. We assume that the agent's utility function does not explicitly reward or punish self-modifications. We also assume that self-modifications do not have any direct effect on the environment. This is captured by Definition 2.

(iv) No corrigibility mechanisms. We do not consider systems that would allow human operators to correct the system's goals, knowledge or behaviour. The problem of robust strong AI corrigibility is far from solved today and this paper can be read as a further argument for substantially more research in this direction.

(v) Worst-case analysis and bound tightness. We focus on worst-case performance guarantees in abstracted models rather than e.g. full distributional analysis, and we show that our worst-case bounds are attainable (up to constant factors) under certain agent behaviour. Note this approach may turn out to be too pessimistic or even impossible in some settings (e.g. quantum physics).

(vi) Bounded value attainable per time unit. We assume the agent obtains instantaneous utility between 0 and 1 at each time step. This is not an arbitrary choice: A constant bound on instantaneous value can be normalized to this interval. Instantaneous values bounded by a function of time U(t) < µ^t can be pre-discounted when γµ < 1, and generally lead to infinite future values otherwise, which we disallow here to avoid foundational problems.

(vii) Temporal value discounting. We assume the agent employs some form of temporal value discounting. This could be motivated by technical or algorithmic limitations, increasing uncertainty about the future, or to avoid issues with incomparable infinite values of considered futures (see Bostrom (2011) for a discussion of infinite ethics). Discounting, however, contrasts with the long-termist view; see the discussion below.

(viii) Exponential discounting. Our model assumes the agent discounts future utility exponentially, a standard assumption in artificial intelligence and the only time-invariant discounting schema (Strotz 1955) leading to consistent preferences over time.

(ix) Unbounded temporal horizons. Our analysis focuses on the long-term behaviour of the agent, in particular stability and performance from the perspective of future stakeholders (sharing the original utility function). Note that our results also to some extent apply to finite-horizon but long-running systems.
    Temporal discounting contrasts with the long-termist view: Why not model non-discounted future utility directly? Noting the motivations we mention in (vii), we agree that models of future value aggregation other than discounting would be generally better suited for long-term objectives. However, this seems to be a difficult task, as such models are neither well developed nor currently used in open-ended AI algorithms (with the obvious exception of a finite time horizon, which we propose to explore in Section 6).
    We therefore propose a direct interpretation of our results: Assuming we implement agents that are ε-optimizers with discounting, they may become exponentially less aligned over time. This is not the case with perfect optimizers with imperfect knowledge and discounting.

(x) Dualistic setting. We assume a dualistic agent and allow self-modification through special actions. This allows us to formally model one aspect of embedded agency – at
least until there are sufficient theoretical foundations of embedded agency.
   Note that in the embedded (non-dualistic) agent setting, it

a perception. A history is a sequence of action-perception pairs æ1 æ2 ... æt. We will often abbreviate such sequences to æ

2|Π| > |Π|. A history can now be written as:

    æ1:t = a1 e1 a2 e2 ... at et = ă1 π2 e1 ă2 π3 e2 ... ăt πt+1 et

The subscripts for policies are one time step ahead because the policy chosen at time t is used to pick an action at time t + 1. The subscript denotes at which time step the policy is used. Policy πt is used to choose the action at = (ăt, πt+1). No policy modification happens when at = (ăt, πt).
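This bookkeeping can be made concrete with a minimal Python sketch (our own illustration, not the paper's formalism; all names here are ours): a policy maps the history so far to a pair consisting of an environment action and the policy to be used at the next step.

    from typing import Callable, List, Tuple

    History = List[Tuple[str, str]]                # action-perception pairs æ1 æ2 ... æt
    Policy = Callable[[History], Tuple[str, "Policy"]]

    def run(pi: Policy, env_step: Callable[[History, str], str], steps: int) -> History:
        # Run the agent; the policy returned at time t is the one used at time t + 1.
        history: History = []
        for _ in range(steps):
            env_action, pi_next = pi(history)      # a_t = (env action, pi_{t+1})
            percept = env_step(history, env_action)
            history.append((env_action, percept))
            pi = pi_next                           # self-modification; a no-op iff pi_next equals pi
        return history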
   In the previous section, we used these rules to calculate the probability of any finite history: P(et | æ

We consider agents with inaccurate knowledge of the correct utility function (Definition 4), inaccurate knowledge of the world (Definitions 5 and 6), and inaccurate knowledge of the correct discount factor (how much future reward is worth compared to reward in the present).

Misaligned agents   We define ε-misaligned agents as agents whose utility function u has absolute error ε with respect to the correct utility function u∗.
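Definition 4 itself is not reproduced in this excerpt; one natural reading of "absolute error ε" is a uniform bound over all finite histories, e.g. |u(æ1:t) − u∗(æ1:t)| ≤ ε for every æ1:t, so that the misaligned utility u never deviates from the true utility u∗ by more than ε on any single evaluation.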