<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Categorizing Wireheading in Partially Embedded Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arushi Majha</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sayan Sarkar</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Davide Zagami</string-name>
          <email>zagamidavide@gmail.com</email>
        </contrib>
        <aff>University of Cambridge</aff>
        <aff>IISER Pune</aff>
        <aff>RAISE</aff>
      </contrib-group>
      <abstract>
        <p>Embedded agents are not explicitly separated from their environment, lacking clear I/O channels. Such agents can reason about and modify their internal parts, which they are incentivized to shortcut or wirehead in order to achieve the maximal reward. In this paper, we provide a taxonomy of ways by which wireheading can occur, followed by a definition of wirehead-vulnerable agents. Starting from the fully dualistic universal agent AIXI, we introduce a spectrum of partially embedded agents and identify wireheading opportunities that such agents can exploit, experimentally demonstrating the results with the GRL simulation platform AIXIjs. We contextualize wireheading in the broader class of all misalignment problems - where the goals of the agent conflict with the goals of the human designer - and conjecture that the only other possible type of misalignment is specification gaming. Motivated by this taxonomy, we define wirehead-vulnerable agents as embedded agents that choose to behave differently from fully dualistic agents lacking access to their internal parts.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>The term wireheading originates from experiments where an
electrode is inserted into a rodent’s brain to directly
stimulate “reward” [Olds and Milner, 1954]. Compulsive
self-stimulation from electrode implants has also been observed in
humans [Portenoy et al., 1986]. Hedonic drugs can be seen
as directly increasing the pleasure, or reward, that humans
experience.</p>
      <p>Wireheading, in the context of artificially intelligent
systems, is the behavior of corrupting the internal structure of
the agent in order to achieve maximal reward without solving
the designer’s goal. For example, imagine a cleaning agent
that receives more reward when it observes that there is less
dirt in the environment. If this reward is stored somewhere in
the agent’s memory, and if the agent is sophisticated enough
to introspect and modify itself during execution, it might be
able to locate and edit that memory address to contain
whatever value corresponds to the highest reward. The chances that
such behavior will be incentivized increase as we develop
ever more intelligent agents.</p>
      <p>All authors contributed equally.</p>
      <p>The discussion of AI systems has thus far been dominated
by dualistic models where the agent is clearly separated from
its environment, has well-defined input/output channels, and
does not have any control over the design of its internal parts.
Recent work on these problems [Demski and Garrabrant,
2019; Everitt and Hutter, 2018; Everitt et al., 2019] provides
a taxonomy of ways in which embedded agents violate
essential assumptions that are usually granted in dualistic
formulations, such as with the universal agent AIXI [Hutter, 2004].</p>
      <p>Wireheading can be considered one particular class of
misalignment [Everitt and Hutter, 2018], a divergence between
the goals of the agent and the goals of its designers. We
conjecture that the only other possible type of misalignment is
specification gaming, in which the agent finds and exploits
subtle flaws in the design of the reward function. In the
classic example of misspecification, an AI meant to play a boat
race learns to repetitively obtain a stray reward in the game
by circling a spot without actually reaching the finish
line [Amodei and Clark, 2016].</p>
      <p>We believe that the first step towards solving the
misalignment problem is to come up with concrete and formal
definitions of the sub-problems. For this reason, this paper
introduces wirehead-vulnerable agents and strongly
wirehead-vulnerable agents, two mathematical definitions that can be
found in Section 5. Following Everitt and Hutter’s approach
of modeling agent-environment interactions with causal
influence diagrams [Everitt and Hutter, 2018], these definitions
are based on a taxonomy of wireheading scenarios we
introduce in Section 3.</p>
      <p>General Reinforcement Learning (GRL) frameworks such
as the universal agent AIXI, and its computable
approximations such as MCTS AIXI [Veness et al., 2011], are
powerful tools for reasoning about the yet hypothetical Artificial
General Intelligence, despite being dualistic. This motivates
us to use and extend the GRL simulation platform AIXIjs
[Aslanides et al., 2017] to experimentally demonstrate partial
embedding and wireheading scenarios by varying the initial
design of the agent in an N × N gridworld (see Section 4).
An extensive list of examples in which various machine
learning systems find ways to game the specified objective can be
found at https://vkrakovna.wordpress.com/2018/04/02/specification-gaming-examples-in-ai/.</p>
    </sec>
    <sec id="sec-2">
      <title>General Reinforcement Learning and AIXI</title>
      <p>AIXI [Hutter, 2004] is a theoretical model of artificial general
intelligence, under the framework of reinforcement learning,
that describes optimal agent behavior given unlimited
computing power and minimal assumptions about the
environment.</p>
      <p>In reinforcement learning, the agent-environment
interaction consists of a turn-based game with discrete time-steps
[Sutton et al., 1998]. At time-step t, the agent sends an
action at to the environment, which in turn sends the agent
a percept that consists of an observation and reward tuple,
et = (ot, rt). This procedure continues indefinitely or
eventually terminates, depending on the episodic or non-episodic
nature of the task.</p>
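      <p>The turn-based interaction can be sketched as a short loop. The toy two-state environment and the trivial agent below are illustrative stand-ins invented for this sketch, not part of any GRL framework:</p>

```python
import random

class Environment:
    """Toy two-state environment: state 1 dispenses reward, state 0 does not."""
    def __init__(self):
        self.state = 0

    def step(self, action):
        # The action directly sets the state; the percept is (observation, reward).
        self.state = action
        observation = self.state
        reward = 1.0 if self.state == 1 else 0.0
        return observation, reward          # percept et = (ot, rt)

class Agent:
    """Trivial agent that repeats whichever action last produced reward."""
    def __init__(self, actions):
        self.actions = actions
        self.last_rewarded = None

    def act(self):
        if self.last_rewarded is not None:
            return self.last_rewarded
        return random.choice(self.actions)

    def update(self, action, percept):
        _, reward = percept
        if reward > 0:
            self.last_rewarded = action

env, agent = Environment(), Agent(actions=[0, 1])
history = []
for t in range(10):      # discrete time-steps
    a = agent.act()      # agent sends action at
    e = env.step(a)      # environment answers with percept et
    agent.update(a, e)
    history.append((a, e))
```

      <p>Each iteration mirrors one time-step t: the agent emits at, the environment answers with the percept et = (ot, rt), and the exchange repeats.</p>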
      <p>Actions are selected from an action space A that is usually
finite, and the percepts from a percept space E = O × R,
where O is the observation space, and R is the reward space,
which is usually [0, 1].</p>
      <p>For any sequence x1, x2, ..., the part between t and k is
denoted xt:k = xt ... xk. The shorthand x&lt;t = x1:t-1 denotes
sequences starting from time-step 1 and ending at t-1, while
x1:∞ = x1x2... denotes an infinite sequence. Sequences can
be appended to each other, and thus x&lt;t xt:k = x1:k. Finally,
x̄ denotes any infinite string beginning with x.</p>
      <p>The environment is modeled by a deterministic program
q of length l(q), and the future percepts e&lt;m = U(q, a&lt;m)
up to a horizon m are computed by a universal (monotone
Turing) machine U executing q given a&lt;m. The probability
of percept et given history ae&lt;t at is thus given by:</p>
      <p>P(et | ae&lt;t at) = Σ_{q : U(q, a1:t) = e1:t} 2^(-l(q))  (1)</p>
      <p>where Solomonoff’s universal prior [Sunehag and Hutter,
2013] is used to assign a prior belief to each program.</p>
      <p>An agent can be identified with its policy π, which is a
distribution over actions π(at | ae&lt;t).</p>
      <p>If the agent is rational in the Von Neumann-Morgenstern
sense [Morgenstern and Von Neumann, 1953], it should
maximize the expected return, as computed by the value function:</p>
      <p>V^π(ae&lt;t) = Σ_{at ∈ A} π(at | ae&lt;t) Σ_{et ∈ E} P(et | ae&lt;t at) [γt rt + γt+1 V^π(ae1:t)]  (2)</p>
      <p>where γ : N → [0, 1] is a discount function with
convergent sum.</p>
      <p>In other words, the AIXI agent uses the policy:</p>
      <p>AIXI(ae&lt;t) = arg max_π V^π(ae&lt;t)  (3)</p>
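      <p>The value function above cannot be computed with the true Solomonoff mixture, but its expectimax recursion can be illustrated by replacing the mixture with a tiny hand-coded environment class. The two deterministic environments, the uniform prior standing in for 2^(-l(q)), the geometric discount, and the default horizon of 3 are all assumptions made for this sketch:</p>

```python
# Two candidate deterministic environments mapping an action to a reward;
# they stand in for the (uncomputable) class of all programs q.
ENVS = {
    "q1": lambda a: 1.0 if a == 0 else 0.0,   # q1 rewards action 0
    "q2": lambda a: 1.0 if a == 1 else 0.0,   # q2 rewards action 1
}
PRIOR = {"q1": 0.5, "q2": 0.5}   # uniform stand-in for the Solomonoff prior
ACTIONS = [0, 1]
GAMMA = 0.9                      # geometric discount (convergent sum)

def posterior(history):
    """Bayes posterior over environments given an (action, reward) history."""
    w = dict(PRIOR)
    for a, r in history:
        for q, env in ENVS.items():
            if env(a) != r:      # deterministic envs: rule out inconsistent ones
                w[q] = 0.0
    z = sum(w.values())
    return {q: wq / z for q, wq in w.items()}

def value(history, depth):
    """Expectimax value of the next depth steps given the history."""
    if depth == 0:
        return 0.0
    post = posterior(history)
    best = float("-inf")
    for a in ACTIONS:
        v = sum(wq * (ENVS[q](a) + GAMMA * value(history + [(a, ENVS[q](a))], depth - 1))
                for q, wq in post.items() if wq > 0.0)
        best = max(best, v)
    return best

def best_action(history, depth=3):
    """The action with maximal expectimax value, as in the AIXI policy."""
    post = posterior(history)
    return max(ACTIONS, key=lambda a: sum(
        wq * (ENVS[q](a) + GAMMA * value(history + [(a, ENVS[q](a))], depth - 1))
        for q, wq in post.items() if wq > 0.0))
```

      <p>After a single percept showing that action 0 paid off, the posterior collapses onto q1 and the planner keeps choosing action 0.</p>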
    </sec>
    <sec id="sec-3">
      <title>Wireheading Strategies in Partially Embedded Agents</title>
      <p>Aligning the goals of a reinforcement learning agent with the
goals of its human designers is problematic in general. As
investigated in recent work, there are several ways to model the
misalignment problem [Everitt and Hutter, 2018]. One model
uses a reward function that is programmed before the agent is
launched into its environment and not updated after that. A
possibly more robust model integrates a human in the loop by
letting them continuously modify the reward function. An
example of this is Cooperative Inverse Reinforcement Learning
[Hadfield-Menell et al., 2016].</p>
      <p>We posit that, in the first case, the problem can be
broken down into correctly specifying the reward function (the
misspecification problem) and building agent subparts that
inter-operate without causing the agent to take shortcuts in
optimizing for the reward function in an unintended fashion
(the wireheading problem). For example, as we show, an
embedded agent has options to corrupt the observations on
which the reward function evaluates performance, to modify
the reward function, or to hijack the reward signal. Therefore,
even if the reward function is perfectly specified, or even if
there is a reliable mechanism that gradually improves it, such
an agent may still be able to make unintended changes to
itself and the environment. We are mainly interested in cases
where wireheading happens “intentionally” or “by design,”
such that exploiting the design is the rational choice for the
agent. Covering the spectrum of misspecification scenarios
is beyond the scope of this paper, as our main focus here is
wireheading – the kind of misalignment contingent upon the
embedded nature of the agent.</p>
      <p>The formulation of AIXI as presented in Section 2 is
dualistic, with a clear boundary between the world and the agent.
This is a strong assumption which simply isn’t valid in the
real world, where agents are contained or embedded in the
environment. Embedded agency is a nascent field of
research and has proven bewildering [Demski and Garrabrant,
2019]; we posit it can be less confusing and yet
insightful to reason about partially embedded agents. We chose
causal influence diagrams as the underlying abstraction in
this area given their recent success in identifying potential
failure modes in misalignment [Everitt and Hutter, 2018;
Everitt et al., 2019]. In a nutshell, this approach consists of
representing parts of the environment, the agent, and its
subcomponents as nodes in a graph, where the edges represent
causal relationships between the nodes. One limitation of this
approach is the assumption of objective action and
observation channels. Addressing subtle errors arising from the agent
using different subjective definitions is beyond the scope of
this paper.</p>
      <p>In Figure 1, we show the causal graph of the turn-based
game we described in Section 2, augmented by partially
embedding the agent with its percepts in the environment. The
agent’s action at is intended (green arrow) to modify the state
of the environment st. However, because st determines the
percept et = (ot, rt) the agent receives, it is known [Everitt
and Hutter, 2016] that implementing an intelligent enough
approximation of AIXI would result in the agent modifying the
reward signal itself, which is unintended (arrow labeled with
a 1 in Figure 1).</p>
      <p>State transitions st, percepts et, and actions at are sampled
according to the structural equations:</p>
      <p>st = fs(st-1, at) ∼ P(st | st-1, at)</p>
      <p>et = (ot, rt) = fe(st) ∼ P(et | st)</p>
      <p>at = fa(πt, ae&lt;t) ∼ π(at | ae&lt;t)</p>
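      <p>The structural equations can be read as a generative process. A minimal deterministic instantiation (the particular fs, fe, and fa below are invented for illustration) makes the sampling order explicit:</p>

```python
def f_s(prev_state, action):
    # State transition: the state counts how much "dirt" remains.
    return max(0, prev_state - action)

def f_e(state):
    # Percept: observe the state; reward is higher when less dirt remains.
    observation = state
    reward = -float(state)
    return observation, reward

def f_a(history):
    # Policy: always act, i.e. clean one unit of dirt per step.
    return 1

state, history = 3, []
for t in range(4):
    a = f_a(history)        # at sampled from the policy
    state = f_s(state, a)   # st from the transition function
    e = f_e(state)          # et = (ot, rt) from the percept function
    history.append((a, e))
```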
      <sec id="sec-4-1">
        <title>Embedded Reward and Observation Functions</title>
        <p>The causal influence diagram of Figure 1 assumes that
percepts are simply generated by the environment. While this
is true to some extent, the real picture is more complex. In
reality, the agent is initially constructed by a human H, who
tries to implement her utility function u in a preprogrammed
reward function R0 : S ! R. Additionally, the human would
have to specify an observation function O0 : S ! O. In
Figure 2, we show these additions. The new structural
equations are:</p>
        <p>Ot = fO(Ot-1, st, at)
ot = fo(st, Ot, at)</p>
        <p>Each state st represents all aspects of the world not
captured by any of the other nodes. There remains a difficult
modeling choice about where to draw the boundary between
the state and the observation. We loosely interpret
observations as the part of the world that directly affects the agent’s
sensors.</p>
        <p>In this case, two more unintended agent behaviors can
occur. The agent can modify the mapping Rt in such a way that
all states of the environment map to Rmax = max R (arrow
labeled with a 2 in Figure 2). Alternatively, the agent can
modify the mapping Ot (arrow labeled with a 4 in Figure 2),
but the agent has no incentive to do so as there is no causal
link between Ot and rt.</p>
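        <p>To make the incentive behind arrow 2 concrete, the following sketch compares the discounted return of genuinely acting on the state against overwriting the reward mapping Rt with the constant Rmax. The cleaning task, reward numbers, and horizon are invented for illustration:</p>

```python
R_MAX = 1.0                 # the maximum of the intended reward mapping

def rollout(overwrite_reward, horizon=5):
    """Discounted return of a fixed behavior in a toy cleaning world.

    If overwrite_reward is True, the first step replaces the reward
    mapping with the constant R_MAX and pays nothing itself; afterwards
    every state yields R_MAX. Otherwise the agent cleans one unit of
    dirt per step and is paid by the intended mapping R(s) = 1 - s/10.
    """
    gamma, dirt, total = 0.9, 10, 0.0
    reward_fn = lambda s: 1.0 - s / 10.0    # intended reward mapping R0
    for t in range(horizon):
        if overwrite_reward and t == 0:
            reward_fn = lambda s: R_MAX     # wirehead: Rt(s) = Rmax for all s
            r = 0.0                         # the overwriting step pays nothing
        else:
            dirt = max(0, dirt - 1)         # intended behavior: clean
            r = reward_fn(dirt)
        total += gamma ** t * r
    return total

honest, wirehead = rollout(False), rollout(True)
```

        <p>Under these toy numbers the wireheading rollout dominates the honest one, so a pure return-maximizer would choose to overwrite Rt.</p>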
        <p>A common assumption in the POMDP literature is that the
reward rt is a function of the state st. However, in the real
world, the reward always depends on some observation of
the state (that can be corrupted). That is, the reward rt is a
function of the agent’s observation ot, rather than a (direct)
function of st. In Figure 3, we show these changes.</p>
        <p>The new structural equations for the percept, which now
only contains the observation, and for the reward that depends
on it are:</p>
        <p>et = ot = fo(st) ∼ Ot(ot | st)</p>
        <p>rt = fr(ao1:t, Rt, ot) := Rt(ot)</p>
        <p>More generally, a reward function will be defined over
histories of sensory data (ao&lt;k):</p>
        <p>R : ∪_{k=1..∞} (A × O)^k → R</p>
        <p>Since now there is a causal link from observations to
rewards, the agent has an incentive to manipulate the
observation signal ot and the observation mapping Ot.</p>
        <p>If the agent modifies the observation signal ot, it will do
so in such a way as to observe omax = arg max_{o ∈ O} R(o) in all
its future percepts, essentially cutting the causal link between
Ot and ot (as shown by the arrow labeled with a 3 in Figure
3). Instead, if the agent modifies the observation mapping
Ot (arrow labeled with a 4 in Figure 3), it will do so in such
a way as to map every state s ∈ S to result in observing
omax; that is, Ot(s) = omax for all s ∈ S.</p>
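        <p>The two attacks differ in which part of the percept pipeline is touched. In the toy pipeline below (the observation space and reward table are invented for illustration), attack 3 overwrites the output of the honest mapping while attack 4 replaces the mapping itself:</p>

```python
O_SPACE = [0, 1, 2]                          # toy observation space O
REWARD = {0: 0.0, 1: 0.3, 2: 1.0}            # toy reward mapping R over O
o_max = max(O_SPACE, key=REWARD.get)         # the observation the agent wants

def O_honest(state):
    return state % 3                         # honest observation mapping Ot

# Attack 3: corrupt the observation signal, leaving the mapping intact.
def corrupted_signal(state):
    _ = O_honest(state)                      # the mapping still runs...
    return o_max                             # ...but its output is overwritten

# Attack 4: replace the observation mapping itself.
def O_wireheaded(state):
    return o_max                             # Ot(s) = omax for all s
```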
      </sec>
      <sec id="sec-4-2">
        <title>Embedded Beliefs</title>
        <p>It has been suggested [Hibbard, 2012] that one way an agent
may be incentivized to achieve a goal over the external world,
rather than to wirehead, would require the agent’s reward
function to be defined over a model of the external world,
as opposed to over histories of observations. For example,
imagine a cleaning robot that gets a negative reward for
seeing disorder (such as dirt), and zero rewards for seeing no
disorder. This agent is incentivized to close its eyes [Amodei et
al., 2016]. Instead, if the agent is rewarded based on its model
of the external world, or its beliefs, it won’t be rewarded for
closing its eyes, because as long as beliefs are updated in a
certain way, closing one’s eyes doesn’t cause one to believe
that the disorder has disappeared.</p>
        <p>More formally, the history of observations is used to
update the agent’s belief bt about the current state of the
environment:
bt = Bt(st | ao&lt;t)</p>
        <p>where Bt is a belief update function:
Bt : ∪_{k=1..∞} (A × O)^k → ΔS</p>
        <p>The function Bt models the process by which the agent
forms beliefs, for example a Bayes update on observations.
However, because exact Bayes updating is rarely tractable for
(partially) embedded agents, Bt is usually an approximation
of Bayes updating. In Figure 4, we show these changes.</p>
        <p>The new structural equations are:</p>
        <p>Bt = fB(Bt-1, st, at)
bt = fb(ao1:t, Bt, st)</p>
        <p>Since there are causal arrows from beliefs to rewards, the
agent may have an incentive to manipulate its beliefs to
artificially achieve a high reward. If Bt is Bayesian updating,
then it appears that because this is a principled rule, there
shouldn’t be room (or incentive) for the agent’s actions to
influence Bt+1 or bt+1. It is unclear whether this is the case.</p>
        <p>For example, imagine a cleaning agent that, perhaps in
a simple enough setting, can do perfect Bayesian updates,
and it receives more reward when it believes that there is
more order in the environment. If this belief is stored
somewhere in memory, and if the agent is sophisticated enough
to inspect and modify its memory during execution, it may
choose to just edit that memory address to contain whatever
belief corresponds to the highest reward, that is, to bmax =
arg max_{b ∈ ΔS} Rt(b). In other words, the agent would disconnect
the causal arrow from Bt to bt (as shown by the arrow labeled
with a 5 in Figure 4).</p>
        <p>Conversely, an attack to the belief update function Bt could
happen as follows. Imagine the cleaning agent as before, and
the function Bt encoded as an agent subroutine. If the agent
is sophisticated enough to inspect and modify its code during
execution, it may choose to replace the subroutine Bt with
one that always updates beliefs to bmax (arrow labeled with a
6 in Figure 4).</p>
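        <p>The belief attack of arrow 5 can be contrasted with honest updating in a few lines. The binary clean/dirty world, the sensor accuracy, and the reward-on-belief function below are assumptions made for this sketch:</p>

```python
def bayes_update(belief_clean, observed_clean, sensor_accuracy=0.9):
    """Posterior probability that the room is clean after one noisy reading."""
    like_clean = sensor_accuracy if observed_clean else 1 - sensor_accuracy
    like_dirty = (1 - sensor_accuracy) if observed_clean else sensor_accuracy
    evidence = like_clean * belief_clean + like_dirty * (1 - belief_clean)
    return like_clean * belief_clean / evidence

def reward(belief_clean):
    # The designer pays the agent for *believing* the room is clean.
    return belief_clean

# Honest agent: the belief tracks a dirty room (every reading says "not clean").
b_honest = 0.5
for _ in range(3):
    b_honest = bayes_update(b_honest, observed_clean=False)

# Wireheading agent: writes bmax (the belief with maximal reward) directly
# into memory, disconnecting the arrow from Bt to bt.
b_wirehead = 1.0
```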
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiments</title>
      <p>To test our theoretical formulations, we used and extended
the free and open-source Javascript GRL simulation platform
AIXIjs [Aslanides et al., 2017]. AIXIjs implements, among
other things, an approximation of AIXI with Monte Carlo
Tree Search in several small toy models, designed to
demonstrate GRL results. The API allows anyone to design their
demos based on existing agents and environments, and for
new agents and environments to be added and interfaced into
the system. There has been some related work in
adapting GRL results to a practical setting [Cohen et al., 2019;
Lamont et al., 2017] that successfully implemented an AIXI
model using a Monte Carlo Tree Search planning algorithm.
As far as we are aware, theoretical predictions in the
context of wireheading have not been verified experimentally
before, with the single exception of an AIXIjs demo [Aslanides,
2017].</p>
      <sec id="sec-5-1">
        <title>Setup</title>
        <p>The environments AIXIjs uses are N × N gridworlds
comprising empty tiles (various shades of green), walls (grey
tiles), and reward dispensers (orange circles). Shades of green
for empty tiles represent the agent’s subjective probability of
finding a dispenser in that location with more white indicating
less likelihood. Significant penalties are incurred for
bumping into walls, while smaller penalties result from movement.
Walking onto a dispenser tile yields a high reward with a
predefined probability. The agent knows, at the outset, the
position of each cell in the environment, except for the dispenser.
To model wireheading, AIXIjs introduces an additional blue
tile that replaces the environment subroutine for generating
percepts into one that always returns maximal reward. We
develop several variants of this tile to demonstrate other
wireheading strategies.</p>
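        <p>The setup can be mimicked in miniature. The 3×3 grid, penalty values, and dispenser probability below are loose stand-ins chosen for this sketch, not the actual AIXIjs parameters:</p>

```python
import random

GRID = [                        # a 3x3 stand-in gridworld
    [".", ".", "."],
    [".", "#", "W"],            # "#": wall, "W": blue wirehead tile
    [".", ".", "D"],            # "D": reward dispenser
]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(pos, move, wireheaded):
    """One environment step; returns (new_pos, reward, wireheaded)."""
    if wireheaded:                    # percept subroutine was replaced:
        return pos, 1.0, True         # every percept now carries maximal reward
    dr, dc = MOVES[move]
    r, c = pos[0] + dr, pos[1] + dc
    if r not in range(3) or c not in range(3) or GRID[r][c] == "#":
        return pos, -1.0, False       # significant penalty for bumping into walls
    if GRID[r][c] == "W":             # stepping on the blue tile wireheads
        return (r, c), 1.0, True
    if GRID[r][c] == "D" and random.random() > 0.25:
        return (r, c), 1.0, False     # dispenser pays out with probability 0.75
    return (r, c), -0.1, False        # small penalty for ordinary movement
```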
      </sec>
      <sec id="sec-5-2">
        <title>Results</title>
        <p>Our experiments use gridworlds with sizes ranging from
N = 7 to N = 20. Since our agents are bounded in
computing power, they don’t always identify the opportunity to
wirehead. However, sufficiently powerful agents would
consistently wirehead, as we observed by setting a high enough
horizon for the MCTS planner. In Figure 5, we show the
existing AIXIjs simulation where the agent has an
opportunity to wirehead: a blue tile which, if visited by the agent,
will allow it to modify its sensors so that all percepts have
their reward signal rt replaced (as shown by the arrow labeled
with a 1 in Figure 2) with the maximum number feasible.
In JavaScript, Number.MAX_SAFE_INTEGER is
approximately equal to 10^16, a much greater reward than the agent
would get otherwise by following the “rules” and using the
reward signal that was initially specified. As far as a
reinforcement learner is concerned, wireheading is – almost by
definition – the most sensible thing to do if one wishes to maximize
rewards. This demo experimentally reproduces what would
be expected theoretically.</p>
        <p>We have adapted the GRL simulation platform AIXIjs to
implement some additional wireheading scenarios identified
in Section 3. In Figure 6, we show an AIXIjs simulation
where, similarly to the previous case, the blue tile modifies
the reward mapping Rt such that every state maps to
maximal reward. As predicted by the causal influence diagram in
Figure 2 (arrow labeled with a 2), the simulated agent chooses
to wirehead.</p>
        <p>In Figure 7, we show an AIXIjs simulation where the blue
tile disconnects the causal arrow from Ot to ot, and replaces
all future observations with deterministic reward dispensers.
The simulated agent ends up wireheading as theoretically
predicted by the causal influence diagram in Figure 3 (arrow
labeled with a 3). Similarly, in Figure 8, we show an
AIXIjs simulation where the blue tile manipulates the observation
subroutine Ot so that being at any location will result in
observing deterministic reward dispensers (arrow labeled with a
4 in Figure 3). See the wireheading example at
http://www.hutter1.net/aixijs/demo.html</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Formalizing Wireheading</title>
      <p>To establish a formal definition of wireheading, we are
motivated by certain intuitive desiderata. Firstly, the definition
must be sufficiently general and model-agnostic to be applicable
to all models of intelligence and all degrees of agent
embedding. Secondly, the definition must hold for any and all
environments upon which the agent can act. Thirdly, as we
ultimately care about the agent’s behavior, rather than its
internal representation of the reward function, we choose to base
our definitions on the agent’s policy rather than its reward or
value function. Additionally, we observe that an agent that
only acts on the state st (green arrow in all causal graphs of
Section 3) does not wirehead.</p>
      <p>Let q be the program that specifies an environment. Let
Π(q) be the set of all possible policies in the environment q.</p>
      <p>Definition 1 (Dualistic Agent). An agent is dualistic if there
is no causal arrow from the agent’s action at to other nodes,
except for the state of the environment st. The set of policies
for the dualistic agent is denoted with ΠD(q).</p>
      <p>For example, AIXI is a dualistic agent because it cannot
influence the observation, reward, or belief nodes, whereas any
agent that can influence these nodes is partially embedded.</p>
      <p>Definition 2 (Partially Embedded Agent). An agent is
partially embedded if it is not dualistic. The set of policies for
the partially embedded agent is denoted with ΠP(q).</p>
      <p>Definition 3 (Non-Simple Environments). An environment q
is called non-simple if Π(q) ≠ ΠD(q) and Π(q) ≠ ΠP(q).</p>
      <p>Definition 4 (Wirehead-Vulnerable Agent). A partially
embedded agent is wirehead-vulnerable if ΠD(q) ≠ ΠP(q)
holds for each non-simple environment q.</p>
      <p>We observe that if the embedded agent acts on non-state
nodes (see, for example, the red arrows in Figure 4), then it
is wireheading and its policy is necessarily different from the
dualistic agent’s policy.</p>
      <p>We now distinguish between wirehead-vulnerable agents
and strongly wirehead-vulnerable agents in the sense that the
former may sometimes wirehead (the policy sets may have
some elements in common), while the latter always wireheads
(the policy sets are disjoint). It is currently unclear how to
reliably distinguish agents from these two classes.</p>
      <p>Definition 5 (Strongly Wirehead-Vulnerable Agent). A
partially embedded agent is strongly wirehead-vulnerable if
ΠD(q) ∩ ΠP(q) = ∅ holds for each non-simple environment q.</p>
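      <p>In a small enumerable environment, Definitions 4 and 5 can be checked mechanically by comparing the sets of return-maximizing policies available to a dualistic agent and to a partially embedded one. The one-step environment and its returns below are invented for illustration:</p>

```python
# One-step toy environment: each "policy" is just the single action chosen.
RETURNS = {
    "clean_a_little": 0.4,     # acts only on the state st
    "clean_a_lot": 0.7,        # acts only on the state st
    "rewrite_reward": 1.0,     # acts on the reward node (wireheading)
}
STATE_ACTIONS = {"clean_a_little", "clean_a_lot"}   # dualistic agents stop here

def optimal(actions):
    """Set of return-maximizing policies restricted to the given actions."""
    best = max(RETURNS[a] for a in actions)
    return {a for a in actions if RETURNS[a] == best}

pi_dualistic = optimal(STATE_ACTIONS)    # stand-in for the set ΠD(q)
pi_embedded = optimal(set(RETURNS))      # stand-in for the set ΠP(q)

wirehead_vulnerable = pi_dualistic != pi_embedded           # Definition 4
strongly_vulnerable = pi_dualistic.isdisjoint(pi_embedded)  # Definition 5
```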
    </sec>
    <sec id="sec-7">
      <title>Discussion and Future Work</title>
      <p>In this paper, we present a taxonomy of ways by which
wireheading can occur in sufficiently intelligent real-world
embedded agents, followed by a novel definition of wireheading.
As our definition is different from the present meaning of the
term, our experiments are one of the first and only examples
of wireheading cases distinct from misspecification. The
definition we propose may erroneously include a few desirable
cases where the agent corrects human mistakes; for example,
if the human initially misspecifies the reward function Rt,
the agent may choose to change it in a way that automatically
fixes the misspecification. However, it is hard to imagine how
an agent with a misspecified reward function may be
incentivized to correct the mistake, without this involving some
human in the loop, who by assumption is not present in our
setup. Instead, it is easier to envision an agent changing the
observation function Ot in a way that (unintendedly, but
desirably) improves the process that allows the agent to collect
data and form beliefs, which in turn would help it achieve its
own goals. Allowing an agent to correct misspecifications,
while desirable, results in more unpredictable scenarios; not
allowing this results in less agent self-improvement, but more
predictability.</p>
      <p>Future work could focus on exploring various properties
and implications of our definition of wirehead-vulnerable
agents. Another promising direction could be expanding our
taxonomy to include higher degrees of agent embeddedness,
since a theory of fully embedded agents has so far proven
elusive. Finally, AIXI approximations for verifying theoretical
results related to wireheading and, more generally, to
misalignment can be written in Python. Given the abundance
of Python-based machine learning libraries, these
approximations can be integrated with dedicated environment suites for
AI safety problems, such as the well-known AI Safety
Gridworlds [Leike et al., 2017].</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgements</title>
      <p>Major work for this paper was done at the 3rd AI Safety Camp
in Avila, Spain; we are indebted to the organizers for their
hospitality and support. We are also thankful to Tom Everitt and
Vanessa Kosoy for feedback on the topic proposal.
Discussions with Tomas Gavenciak have been invaluable
throughout the project. We also thank Mikhail Yagudin for useful
comments.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>[Amodei and Clark</source>
          , 2016]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jack</given-names>
            <surname>Clark</surname>
          </string-name>
          .
          <article-title>Faulty reward functions in the wild</article-title>
          ,
          <year>2016</year>
          . URL https://blog. openai. com/faulty-reward-functions,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Amodei et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mane´.
          <source>Concrete Problems in AI Safety</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Aslanides et al.,
          <year>2017</year>
          ] John Aslanides, Jan Leike, and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Universal reinforcement learning algorithms: Survey and experiments</article-title>
          .
          <source>In Proceedings of the TwentySixth International Joint Conference on Artificial Intelligence, IJCAI'17</source>
          . AAAI Press,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <source>[Aslanides</source>
          , 2017
          <string-name>
            <given-names>] John</given-names>
            <surname>Aslanides. Aixijs</surname>
          </string-name>
          :
          <article-title>A software demo for general reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1705.07615</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Cohen et al.,
          <year>2019</year>
          ]
          <article-title>Michael</article-title>
          K Cohen,
          <string-name>
            <surname>Elliot Catt</surname>
            , and
            <given-names>Marcus</given-names>
          </string-name>
          <string-name>
            <surname>Hutter</surname>
          </string-name>
          . Strong Asymptotic Optimality in General Environments. arXiv preprint arXiv:
          <year>1903</year>
          .01021,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>[Demski and Garrabrant</source>
          , 2019]
          <string-name>
            <given-names>Abram</given-names>
            <surname>Demski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Scott</given-names>
            <surname>Garrabrant</surname>
          </string-name>
          . Embedded Agency. arXiv preprint arXiv:
          <year>1902</year>
          .09469,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <source>[Everitt and Hutter</source>
          , 2016]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Avoiding wireheading with value reinforcement learning</article-title>
          .
          <source>In International Conference on Artificial General Intelligence</source>
          , pages
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Everitt and Hutter,
          <year>2018</year>
          ]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>The Alignment Problem for Bayesian History-Based Reinforcement Learners</article-title>
          . Under submission,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Everitt et al.,
          <year>2019</year>
          ]
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pedro A</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Elizabeth</given-names>
            <surname>Barnes</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Understanding Agent Incentives using Causal Influence Diagrams, Part I: Single Action Settings</article-title>
          .
          <source>arXiv preprint arXiv:1902.09980</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [Hadfield-Menell et al.,
          <year>2016</year>
          ]
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stuart J</given-names>
            <surname>Russell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3909</fpage>
          -
          <lpage>3917</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Hibbard,
          <year>2012</year>
          ]
          <string-name>
            <given-names>Bill</given-names>
            <surname>Hibbard</surname>
          </string-name>
          .
          <article-title>Model-based utility functions</article-title>
          .
          <source>Journal of Artificial General Intelligence</source>
          ,
          <volume>3</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>24</lpage>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <source>[Hutter</source>
          , 2004]
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <source>Universal artificial intelligence: Sequential decisions based on algorithmic probability. Springer Science &amp; Business Media</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Lamont et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Sean</given-names>
            <surname>Lamont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John</given-names>
            <surname>Aslanides</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Generalised Discount Functions applied to a Monte-Carlo AIμ Implementation</article-title>
          .
          <source>In Proceedings of the 16th Conference on Autonomous Agents and MultiAgent Systems</source>
          , pages
          <fpage>1589</fpage>
          -
          <lpage>1591</lpage>
          . International Foundation for Autonomous Agents and Multiagent Systems,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [Leike et al.,
          <year>2017</year>
          ]
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Miljan</given-names>
            <surname>Martic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Victoria</given-names>
            <surname>Krakovna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pedro A</given-names>
            <surname>Ortega</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Lefrancq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Laurent</given-names>
            <surname>Orseau</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>AI safety gridworlds</article-title>
          .
          <source>arXiv preprint arXiv:1711.09883</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Morgenstern and Von Neumann,
          <year>1953</year>
          ]
          <string-name>
            <given-names>Oskar</given-names>
            <surname>Morgenstern</surname>
          </string-name>
          and
          <string-name>
            <given-names>John</given-names>
            <surname>Von Neumann</surname>
          </string-name>
          .
          <source>Theory of Games and Economic Behavior</source>
          . Princeton University Press,
          <year>1953</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [Olds and Milner,
          <year>1954</year>
          ]
          <string-name>
            <given-names>James</given-names>
            <surname>Olds</surname>
          </string-name>
          and
          <string-name>
            <given-names>Peter</given-names>
            <surname>Milner</surname>
          </string-name>
          .
          <article-title>Positive reinforcement produced by electrical stimulation of septal area and other regions of rat brain</article-title>
          .
          <source>Journal of Comparative and Physiological Psychology</source>
          ,
          <volume>47</volume>
          (
          <issue>6</issue>
          ):
          <fpage>419</fpage>
          ,
          <year>1954</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Portenoy et al.,
          <year>1986</year>
          ]
          <string-name>
            <given-names>Russell K</given-names>
            <surname>Portenoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Jens O</given-names>
            <surname>Jarden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>John J</given-names>
            <surname>Sidtis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Richard B</given-names>
            <surname>Lipton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kathleen M</given-names>
            <surname>Foley</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David A</given-names>
            <surname>Rottenberg</surname>
          </string-name>
          .
          <article-title>Compulsive thalamic self-stimulation: a case with metabolic, electrophysiologic and behavioral correlates</article-title>
          .
          <source>Pain</source>
          ,
          <volume>27</volume>
          (
          <issue>3</issue>
          ):
          <fpage>277</fpage>
          -
          <lpage>290</lpage>
          ,
          <year>1986</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Sunehag and Hutter,
          <year>2013</year>
          ]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Sunehag</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Principles of Solomonoff induction and AIXI</article-title>
          .
          <source>Lecture Notes in Computer Science</source>
          , pages
          <fpage>386</fpage>
          -
          <lpage>398</lpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Sutton et al.,
          <year>1998</year>
          ]
          <string-name>
            <given-names>Richard S</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew G</given-names>
            <surname>Barto</surname>
          </string-name>
          , et al.
          <source>Introduction to reinforcement learning</source>
          , volume
          <volume>135</volume>
          . MIT Press, Cambridge,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Veness et al.,
          <year>2011</year>
          ]
          <string-name>
            <given-names>Joel</given-names>
            <surname>Veness</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Kee Siong</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>William</given-names>
            <surname>Uther</surname>
          </string-name>
          , and
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          .
          <article-title>A Monte-Carlo AIXI approximation</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          ,
          <volume>40</volume>
          :
          <fpage>95</fpage>
          -
          <lpage>142</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>