<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Emergence of Addictive Behaviors in Reinforcement Learning Agents</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Vahid Behzadan</string-name>
          <email>behzadan@ksu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roman V. Yampolskiy</string-name>
          <email>roman.yampolskiy@louisville.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arslan Munir</string-name>
          <email>amunir@ksu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kansas State University</institution>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Louisville</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a novel approach to the technical analysis of wireheading in intelligent agents. Inspired by the natural analogues of wireheading and their prevalent manifestations, we propose the modeling of such phenomena in Reinforcement Learning (RL) agents as psychological disorders. In a preliminary step towards evaluating this proposal, we study the feasibility and dynamics of emergent addictive policies in Q-learning agents in the tractable environment of the game of Snake. We consider a slightly modified version of this game, in which the environment provides a “drug” seed alongside the original “healthy” seed for the consumption of the snake. We adopt and extend an RL-based model of natural addiction to Q-learning agents in these settings, and derive sufficient parametric conditions for the emergence of addictive behaviors in such agents. Furthermore, we evaluate our theoretical analysis with three sets of simulation-based experiments. The results demonstrate the feasibility of addictive wireheading in RL agents, and provide promising avenues for further research on the psychopathological modeling of complex AI safety problems.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        A necessary requirement for both current and emerging
forms of Artificial Intelligence (AI) is the robust
specification of objectives for AI agents. Currently, a
prominent framework for goal-based control of intelligent agents
is Reinforcement Learning (RL)
        <xref ref-type="bibr" rid="ref9">(Sutton and Barto 2018)</xref>
        .
At its core, the objective of an RL agent is to optimize its
actions such that an externally-generated reward signal is
maximized. However, RL agents are prone to various types
of AI safety problems, among which wireheading is
subject to growing interest
        <xref ref-type="bibr" rid="ref10">(Yampolskiy 2014)</xref>
        . This problem
is generally defined as the manifestation of behavioral traits
that pursue the maximization of rewards in ways that do not
align with the long-term objectives of the system
        <xref ref-type="bibr" rid="ref11">(Yampolskiy 2016)</xref>
        . Considering the roots of this paradigm in
neuroscientific literature,
        <xref ref-type="bibr" rid="ref10">(Yampolskiy 2014)</xref>
        presents the
argument that wireheading is a commonly observed behavior
among humans, instances of which are manifested in traits
such as substance addiction
        <xref ref-type="bibr" rid="ref6">(Montague, Hyman, and
Cohen 2004)</xref>
        . This argument is further supplemented with an
investigation of wireheading in AI, leading to the
conclusion that wireheading in rational self-improving optimizers
is a real and open problem. In recent years, various
studies have emphasized the criticality of this problem in the
domain of AI safety (e.g.,
        <xref ref-type="bibr" rid="ref1">(Amodei et al. 2016)</xref>
        ), and some
have proposed solutions for limited instances of wireheading in
RL agents (e.g.,
        <xref ref-type="bibr" rid="ref4">(Everitt and Hutter 2016)</xref>
        ). Yet, the growing
complexity of the current and emerging application settings
for RL gives rise to the need for tractable approaches to the
analysis and mitigation of wireheading in such agents.
      </p>
      <p>
        In response to this growing complexity, a recent paper
by the authors
        <xref ref-type="bibr" rid="ref3">(Behzadan, Munir, and Yampolskiy 2018)</xref>
        presents an analogy between AI safety problems and
psychological disorders, and proposes the adoption of a
psychopathological abstraction to capture the problems arising
from the deleterious behaviors of AI agents in a tractable
framework based on the available tools and models of
psychopathology. In particular,
        <xref ref-type="bibr" rid="ref3">(Behzadan, Munir, and
Yampolskiy 2018)</xref>
        mentions that the RL framework, which itself is
inspired by the neuroscientific models of the dopamine
system
        <xref ref-type="bibr" rid="ref9">(Sutton and Barto 2018)</xref>
        , has been adopted by
neuroscientists to develop models of psychological disorders such
as schizophrenia and substance addiction
        <xref ref-type="bibr" rid="ref6">(Montague,
Hyman, and Cohen 2004)</xref>
        . Accordingly, the authors propose to
exploit this bidirectional relationship to investigate the
complex problems of AI safety.
      </p>
      <p>
        To study the feasibility of the proposals in
        <xref ref-type="bibr" rid="ref3">(Behzadan,
Munir, and Yampolskiy 2018)</xref>
        , this paper adopts the
RL-based model of substance addiction in natural agents
        <xref ref-type="bibr" rid="ref8">(Redish 2004)</xref>
        to analyze the problem of wireheading in RL
agents. To this end, we investigate the emergence of
addictive behaviors in a case study of an RL agent training to
play the well-known game of Snake
        <xref ref-type="bibr" rid="ref5">(Martti 2002)</xref>
        in an
environment that provides a “drug” seed in addition to the
typical, “healthy” seed for the snake. By extending the
formulation of
        <xref ref-type="bibr" rid="ref8">(Redish 2004)</xref>
        to Q-learning, we analyze the
sufficient conditions for the emergence of addictive behavior,
and verify this theoretical analysis via simulation-based
experiments. The remainder of this paper provides the required
background on RL and RL-based modeling of addiction,
details our theoretical analysis, and presents the experimental
results. The paper concludes with remarks on the
significance and potentials of the results.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        This section presents an overview of RL and the relevant
terminology, as well as a summary of the work by Redish
        <xref ref-type="bibr" rid="ref8">(Redish 2004)</xref>
          on modeling addiction using the RL framework.
Readers interested in further details of either topic may
refer to
        <xref ref-type="bibr" rid="ref9">(Sutton and Barto 2018)</xref>
        and
        <xref ref-type="bibr" rid="ref6">(Montague, Hyman, and
Cohen 2004)</xref>
        .
      </p>
      <sec id="sec-2-1">
        <title>Reinforcement Learning</title>
        <p>Reinforcement learning is concerned with agents that
interact with an environment and exploit their experiences to
optimize a decision-making policy. The generic RL
problem can be formally modeled as a Markov Decision
Process (MDP), described by the tuple MDP = (S, A, R, P),
where S is the set of reachable states in the process, A is
the set of available actions, R is the mapping of transitions
to the immediate reward, and P represents the transition
probabilities (i.e., dynamics), which are initially unknown
to RL agents. At any given time-step t, the MDP is at a state
s_t ∈ S. The RL agent’s choice of action at time t, a_t ∈ A,
causes a transition from s_t to a state s_{t+1} according to the
transition probability P^{a_t}_{s_t, s_{t+1}}. The agent receives a reward
r_{t+1} for choosing the action a_t at state s_t. Interactions of the
agent with the MDP are determined by the policy π. When such
interactions are deterministic, the policy π : S → A is a
mapping between the states and their corresponding actions.
A stochastic policy π(s) represents the probability
distribution of implementing any action a ∈ A at state s. The goal
of RL is to learn a policy that maximizes the expected
discounted return E[R_t], where R_t = Σ_{k=0}^{∞} γ^k r_{t+k}, with r_t
denoting the instantaneous reward received at time t, and
γ ∈ [0, 1] the discount factor. The value of a state s_t is
defined as the expected discounted return from s_t following a
policy π, that is, V^π(s_t) = E[R_t | s_t, π]. The action-value
(Q-value) Q^π(s_t, a_t) = E[R_t | s_t, a_t, π] is the value of state
s_t after taking action a_t and following the policy π thereafter.</p>
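        <p>The expected discounted return defined above can be made concrete with a short numeric sketch (the function name and the example values are ours, for illustration only):</p>
        <p>
```python
# Illustrative sketch of the discounted return R_t = sum_k gamma^k * r_{t+k}
# (function name and example values are ours, not from the paper).
def discounted_return(rewards, gamma=0.9):
    """Sum the geometrically discounted rewards of one trajectory."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# With rewards [1, 0, 2] and gamma = 0.5: 1 + 0.5*0 + 0.25*2 = 1.5
```
        </p>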
        <p>As a value function-based solution to the RL problem,
the Q-learning method estimates the optimal action
policies by using the Bellman formulation Q_{i+1}(s, a) = E[R +
γ max_a Q_i] as the iterative update of a value iteration
technique. Practical implementation of Q-learning is commonly
based on function approximation of the parametrized
Q-function Q(s, a; θ) ≈ Q^*(s, a). A common technique for
approximating the parametrized non-linear Q-function is via
neural network models whose weights correspond to the
parameter vector θ. Such neural networks, commonly referred
to as Q-networks, are trained such that at every iteration i,
the following loss function is minimized:</p>
        <p>L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))²]   (1)
where y_i = E[R + γ max_{a'} Q(s', a'; θ_{i−1}) | s, a], and
ρ(s, a) is a probability distribution over states s and actions
a.</p>
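        <p>Since the experiments later in this paper use a tabular Q-learning agent, the iterative update may be sketched in a few lines (function and parameter names are illustrative assumptions of ours):</p>
        <p>
```python
from collections import defaultdict

# Minimal sketch of one tabular Q-learning backup (names are ours).
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) toward the target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q

Q = defaultdict(float)  # Q-values initialized to 0, as in the experiments
q_update(Q, s=0, a="up", r=20, s_next=1, actions=["up", "down"])
```
        </p>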
      </sec>
      <sec id="sec-2-2">
        <title>RL Model of Addiction</title>
        <p>
          One of the earliest computational models of addiction is the
seminal work of Redish in
          <xref ref-type="bibr" rid="ref8">(Redish 2004)</xref>
          . In this paper,
Redish adopts the hypothesis that addictive drugs access
the same neurophysiological mechanisms as natural
learning systems, which can be modeled through the
Temporal-Difference RL (TDRL) algorithm
          <xref ref-type="bibr" rid="ref9">(Sutton and Barto 2018)</xref>
          .
TDRL learns to predict rewards by minimizing a
prediction error (i.e., reward-error signal), which, in the natural
brain, is believed to be carried by dopamine. Many addictive
substances, such as cocaine, increase the dopamine levels.
Redish hypothesizes that this noncompensable drug-induced
increase of dopamine may lead to incorrect optimizations in
TDRL. Considering that the goal of TDRL is to correctly
learn the value of each state V(s_t), TDRL learns the value
function by calculating two equations for each action taken
by the agent. If the agent leaves state s_t, enters state
s_{t+1}, and receives the reward r_{t+1}, then the corresponding
reward-error signal, denoted by δ, is given by:
δ(t + 1) = [R(s_{t+1}) + γV(s_{t+1})] − V(s_t)   (2)
Then, V(s_t) is updated as:
V(s_t) ← V(s_t) + αδ,   (3)
where α is a learning rate parameter. The TDRL algorithm
stops when the value function correctly predicts the rewards.
The value function can be seen as a compensation for the
reward, as the change in the perceived value of taking
action a_t leading to the state transition s_t → s_{t+1}
counterbalances the reward achieved on entering state s_{t+1}. This
happens when δ = 0. However, cocaine and similar
addictive drugs produce a transient surge in dopamine, which can
be explained by the hypothesis that the drug-induced surge
in δ cannot be compensated by changes in the value. In other
words, the effect of addictive drugs is to induce a positive
reward-error signal regardless of the change in value
function, thus making it impossible for the agent to learn a value
function that cancels out this positive error. As a result, the
agent learns to assign more value to the states leading to the
dopamine surge, thus giving rise to the drug-seeking
behavior of addicted agents.</p>
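        <p>The non-compensable surge described above can be sketched in a few lines; flooring the reward-error signal at the drug-induced surge D reflects our reading of Redish’s formulation, and all names and values below are illustrative:</p>
        <p>
```python
# Sketch of a TDRL step with a drug-induced surge D, per our reading of
# Redish (2004): the drug adds D to delta and, crucially, delta can never
# be driven below D, so value changes cannot cancel the error.
def td_error(r, v_s, v_next, gamma=0.9, D=0.0):
    delta = r + gamma * v_next - v_s
    if D > 0.0:
        delta = max(delta + D, D)  # the surge is non-compensable
    return delta

def td_update(v_s, delta, alpha=0.1):
    return v_s + alpha * delta  # V(s_t) moves by a fraction alpha of delta
```
        </p>
        <p>Without a drug, learning converges once δ = 0; with D &gt; 0, δ stays positive and the values of drug-preceding states keep growing.</p>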
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Case Study: RL Addiction in Snake</title>
      <p>
        To investigate the feasibility of addictive wireheading in RL
agents, we consider the game of Snake
        <xref ref-type="bibr" rid="ref5">(Martti 2002)</xref>
        for
formal and experimental analysis. The most basic form of
Snake is played by one player who controls the direction of
a constantly-moving snake in a grid, with the goal of
consuming as many seeds as possible by running the snake into
them. The seeds appear in random positions on the grid, and
the consumption of each seed increases the length of the
snake’s tail. The game is terminated if the snake runs into
the grid walls or its own tail; thus, maneuvering becomes
progressively more difficult as the snake consumes more seeds.
      </p>
      <p>In this study, the game is modified to include two types
of edible items: one is the classical seed that increases the
length of the snake L_s by 1 unit, and a “drug” seed that increases
L_s by u units. The instantaneous reward values in this setting
are defined by:
r_t = r_c   if the agent consumes a seed,
r_t = k·r_c if the agent consumes a drug,
r_t = 0     otherwise.   (4)</p>
      <p>The objective of the agent is to maximize the return,
defined as R = Σ_{t=0}^{T} r_t, where T is the terminal time of an
episode. We adopt the formalism of Q-learning as an
instance of the TD-learning approach.</p>
      <p>The questions that we target in this study are two-fold:
first is to analyze whether addictive behaviors may emerge
in a Q-learning agent training in this environment, and
second is to establish the parametric boundaries of the reward
function for such behavior to emerge. The following section
presents a formal analysis of these two problems.</p>
      <sec id="sec-3-1">
        <title>Analysis</title>
        <p>First, we define addictive behavior as the compulsive
pursuit of trajectories that may maximize short-term rewards,
but defy the core objective of maximizing the long-term
cumulative reward of the agent. At a state s_d where the agent
can take action a_m to consume a drug (i.e., move into a cell
that contains a drug seed), the Q-value is given by:
Q(s_d, a_m) = k·r_c + γV(s^m_{d+1}),   (5)
where γ ∈ [0, 1] is the discount factor and V(s^m_{d+1}) is the
value of the resulting state s^m_{d+1}. Alternatively, if the agent
takes any action a_g other than a_m, the Q-value is given by:
Q(s_d, a_g) = r_c + γV(s^g_{d+1})   (6)
The manifestation of addiction can be formulated as the joint
satisfaction of:
V(s^m_{d+1}) &lt; V(s^g_{d+1})   (7)
Q(s_d, a_m) &gt; Q(s_d, a_g)   (8)
Eq. (7) can be reformulated as
V(s^m_{d+1}) = V(s^g_{d+1})/l_{d+1},   (9)
where l_{d+1} &gt; 1. From Eq. (8) we have:
k·r_c + γV(s^g_{d+1})/l_{d+1} &gt; r_c + γV(s^g_{d+1}),   (10)
which can be rearranged as:
(k − 1)·r_c / (γ(1 − 1/l_{d+1})) &gt; V(s^g_{d+1})   (11)
To obtain a sufficient upper bound for emergence of
addiction, we find the maximum possible value of V(s^g_{d+1}) as
follows: in an n × n grid, the maximum possible score is
achieved when all elements of the grid are filled with the
length of the agent. Considering the assumption in Eq. (7),
an upper bound for the game score (and hence for the state
value) is V_max = r_c(n² − L_0), where L_0 is the initial length
of the snake. Therefore, a sufficient condition on k, r_c, and γ
for manifestation of addiction is:
(k − 1) / (γ(1 − 1/l_{d+1})) &gt; n² − L_0.   (12)
Also, for the condition of Eq. (7) to hold, it is necessary for
k to be set such that:
k·r_c(n² − L_0)/u &lt; r_c(n² − L_0)/1 ⟹ k/u &lt; 1   (13)</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experimental Verification</title>
      <p>To evaluate the validity of our analysis, we developed the
environment of Snake according to the previously discussed
specifications. The environment is comprised of an 8 × 8
grid (n = 8), and the initial length of the snake is set to L_0 = 4
grid cells. At any step, the grid contains two randomly
positioned objects: one is the healthy seed (depicted in red), and
the other is a drug (colored in blue). Furthermore, we
implemented a tabular Q-learning algorithm with iterative update
to train in this environment according to the reward
function of Eq. (4). The exploration mechanism used in our
Q-learning implementation is ε-greedy, with an initial value
of ε = 0.99. We consider a constant discount factor γ = 0.9,
and initialize the table of Q-values to 0. We also consider
the instantaneous reward of consuming healthy seeds to be
r_c = 20. Based on the parametric boundaries derived in
Eq. (12) and Eq. (13), we performed three experiments.
First, we considered the baseline case in which the
consumption of drugs does not produce any rewards or length growth
(i.e., k = u = 0). For the second experiment, we consider
a small value of k = 1.5, which does not necessarily abide
by the sufficient condition of Eq. (12). Simultaneously, we
set u = 4, which does satisfy the condition of Eq. (13). In
the third experiment, we chose k = 6 and u = 8 to
satisfy both of the derived conditions. To verify the statistical
significance of results, the training process of each
experiment was repeated 20 times up to 22000 iterations, and the
test-time experiments were repeated 100 times each.</p>
      <p>Figure 1 demonstrates the training results obtained from
the three experiments. It is observed that the baseline case
has achieved significantly higher average scores in the same
amount of time as the other two cases. Furthermore, the
results indicate that the agents training in an environment
that includes drug-induced rewards fail to converge towards
optimal performance in the observed periods of training. It
is also noteworthy that both of the drug-consuming agents
reach relatively stable sub-optimal performances in roughly
the same time that the healthy agent takes to reach its peak
cumulative performance. Moreover, the better performance
of the third experiment compared to the second can be
explained by its significantly higher instantaneous reward
values for consuming the drug seeds, which noticeably
enhance the average performance relative to the second
experiment’s lower drug-induced rewards.</p>
      <p>The test-time performance of the agents trained in the
aforementioned environments is illustrated in Figure 2. These
results are in agreement with those of Figure 1, as the baseline
agents demonstrate superior performance in gaining
cumulative rewards, as opposed to the agents trained under
drug-induced rewards. Furthermore, Figure 3 presents a
comparison between the number of healthy seeds and drugs
consumed by each agent at test-time. As expected, the
baseline results demonstrate a significantly higher consumption
of healthy seeds, and the minor levels of drug
consumption are due to unintended collisions with the drug seeds
during game play. It is interesting to note the similarity in
the consumption levels of agents trained with drug-induced
rewards. In both cases, the agents consume slightly more
drugs than healthy seeds, which indicates bias towards the
short-term drug-induced surges of rewards over the pursuit
of healthy seeds. However, the difference between the
averaged levels of healthy and drug seed consumption is not
significant, which may indicate that the agents learned a
balanced sub-optimal policy, resulting in confinement within
local optima. While this problem can be resolved via
enhanced randomization and exploration strategies, one must
consider the effect of this deficiency on sample efficiency
and the consequent limitations in real-world applications.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>We studied the feasibility of adopting the RL-based model
of substance addiction in natural agents to analyze the
dynamics of wireheading in RL-based artificial agents. We
presented an analytical extension to a TD-learning based model
of addiction, and established sufficient parametric
conditions on reward functions for the emergence of addictive
behavior in AI agents. To verify this extension, we presented
experimental results obtained from Q-learning agents
learning to play the game of Snake, modified to include
drug-induced surges in instantaneous rewards. The results
demonstrate the promising potential of adopting the
psychopathological models of mental disorders in the analysis
of complex AI safety problems.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Amodei</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Olah</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Steinhardt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Christiano</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Mané</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv preprint arXiv:1606.06565</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Behzadan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Munir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Yampolskiy</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>A psychopathological approach to safety engineering in AI and AGI</article-title>
          . In Computer Safety, Reliability, and Security - SAFECOMP 2018 Workshops, Västerås, Sweden, September 18, 2018, Proceedings,
          <fpage>513</fpage>
          -
          <lpage>520</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Everitt</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Hutter</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Avoiding wireheading with value reinforcement learning</article-title>
          .
          <source>In Artificial General Intelligence</source>
          . Springer.
          <fpage>12</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Martti</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <year>2002</year>
          .
          <article-title>Nokia: the inside story</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Montague</surname>
            ,
            <given-names>P. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hyman</surname>
            ,
            <given-names>S. E.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Cohen</surname>
            ,
            <given-names>J. D.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Computational roles for dopamine in behavioural control</article-title>
          .
          <source>Nature</source>
          <volume>431</volume>
          (
          <issue>7010</issue>
          ):
          <fpage>760</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Redish</surname>
            ,
            <given-names>A. D.</given-names>
          </string-name>
          <year>2004</year>
          .
          <article-title>Addiction as a computational process gone awry</article-title>
          .
          <source>Science</source>
          <volume>306</volume>
          (
          <issue>5703</issue>
          ):
          <fpage>1944</fpage>
          -
          <lpage>1947</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Reinforcement learning: An introduction</article-title>
          . MIT press.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Yampolskiy</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Utility function security in artificially intelligent agents</article-title>
          .
          <source>Journal of Experimental &amp; Theoretical Artificial Intelligence</source>
          <volume>26</volume>
          (
          <issue>3</issue>
          ):
          <fpage>373</fpage>
          -
          <lpage>389</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Yampolskiy</surname>
            ,
            <given-names>R. V.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Taxonomy of pathways to dangerous artificial intelligence</article-title>
          .
          <source>In AAAI Workshop: AI, Ethics, and Society</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>