<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Towards Empathic Deep Q-Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bart Bussmann</string-name>
          <email>bart.bussmann@student.uva.nl</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jacqueline Heinerman</string-name>
          <email>jacqueline@heinerman.nl</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joel Lehman</string-name>
          <email>joel.lehman@uber.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Uber AI Labs</institution>
          ,
          <addr-line>San Francisco</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>VU University Amsterdam</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>As reinforcement learning (RL) scales to solve increasingly complex tasks, interest continues to grow in the fields of AI safety and machine ethics. As a contribution to these fields, this paper introduces an extension to Deep Q-Networks (DQNs), called Empathic DQN, that is loosely inspired both by empathy and the golden rule (“Do unto others as you would have them do unto you”). Empathic DQN aims to help mitigate negative side effects to other agents resulting from myopic goal-directed behavior. We assume a setting where a learning agent coexists with other independent agents (who receive unknown rewards), where some types of reward (e.g. negative rewards from physical harm) may generalize across agents. Empathic DQN combines the typical (self-centered) value with the estimated value of other agents, by imagining (by its own standards) the value of it being in the other's situation (by considering constructed states where both agents are swapped). Proof-of-concept results in two gridworld environments highlight the approach's potential to decrease collateral harms. While extending Empathic DQN to complex environments is non-trivial, we believe that this first step highlights the potential of bridge-work between machine ethics and RL to contribute useful priors for norm-abiding RL agents.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>[Hadfield-Menell et al., 2017; Lehman et al., 2018], and that
there remains much to understand about safely applying RL
to solve real-world tasks [Amodei et al., 2016].</p>
      <p>As a result of this growing awareness, there has been
increasing interest in the field of AI safety [Amodei et al., 2016;
Everitt et al., 2018], which is broadly concerned with creating
AI agents that do what is intended for them to do, and which
often entails questioning and extending common formalisms
[Hadfield-Menell et al., 2017, 2016; Demski and Garrabrant,
2019]. One overarching theme in AI safety is how to learn
or provide correct incentives to an agent. Amodei et al.
[2016] distinguishes different failure modes in specifying
reward functions, which include reward hacking, wherein an
agent learns how to optimize the reward function in an
unexpected and unintended way that does not satisfy the
underlying goal, and unintended side effects, wherein an agent learns
to achieve the desired goal, but causes undesirable collateral
harm (because the given reward function is incomplete, i.e. it
does not include all of the background knowledge and context
of the human reward designer).</p>
      <p>This paper focuses on the latter setting, i.e. assuming that
the reward function incentivizes solving the task, but fails
to anticipate some unintended harms. We assume that in
real world settings, a physically-embodied RL agent (i.e.
a controller for a robot) will often share space with other
agents (e.g. humans, animals, and other trained
computational agents), and it is challenging to design reward
functions a priori that enumerate all the ways in which other agents
can be negatively affected [Amodei et al., 2016]. Promising
current approaches include value learning from human
preferences [Saunders et al., 2018; Leike et al., 2018;
Hadfield-Menell et al., 2016] and creating agents that attempt to
minimize their impact on the environment [Krakovna et al., 2018;
Turner et al., 2019]; however, value learning can be expensive
for its need to include humans in the loop, and both directions
remain technically and philosophically challenging. This
paper introduces another tool that could complement such
existing approaches, motivated by the concept of empathy.</p>
      <p>In particular, the insight motivating this paper is that
humans often empathize with the situations of others, by
generalizing from their own past experiences. For example, we
can feel vicarious fear for someone who is walking a
tightrope, because we ourselves would be afraid in such a
situation. Similarly, for some classes of reward signals (e.g.
physical harm), it may be reasonable for embodied computational
agents to generalize those rewards to other agents (i.e. to
assume as a prior expectation that other agents might receive
similar reward in similar situations). If a robot learns that a
fall from heights is dangerous to itself, that insight could
generalize to most other embodied agents.</p>
      <p>
        For humans, beyond granting us a capacity to
understand others, such empathy also influences our behavior, e.g. by
avoiding harming others while walking down the street;
likewise, in some situations it may be useful if learning agents
could also act out of empathy (e.g. to prevent physical harm
to another agent resulting from otherwise blind goal-pursuit).
While there are many ways to instantiate algorithms that
abide by social or ethical norms
        <xref ref-type="bibr" rid="ref24 ref3">(as studied by the field of
machine ethics [Anderson and Anderson, 2011; Wallach and
Allen, 2008])</xref>
        , here we take loose inspiration from one simple
ethical norm, i.e. the golden rule.
      </p>
      <p>
        The golden rule, often expressed as: “Do unto others as
you would have them do unto you,” is a principle that has
emerged in many ethical and religious contexts [Küng and
Kuschel, 1993]. At heart, abiding by this rule entails
projecting one’s desires onto another agent, and attempting to
honor them. We formalize this idea as an extension of Deep
Q-Networks
        <xref ref-type="bibr" rid="ref15">(DQNs; [Mnih et al., 2015])</xref>
        , which we call
Empathic DQN. The main idea is to augment the value of a given
state with the value of constructed states simulating what the
learning agent would experience if its position were switched
with another agent. Such an approach can also be seen as
learning to maximize an estimate of the combined rewards of
both agents, which embodies a utilitarian ethic.
      </p>
      <p>The experiments in this paper apply Empathic DQN to two
gridworld domains, in which a learned agent pursues a goal
in an environment shared with other non-learned (i.e. fixed)
agents. In one environment, an agent can harm and be harmed
by other agents; and in another, an agent receives
diminishing returns from hoarding resources that also could benefit
other agents. Results in these domains show that Empathic
DQN can reduce negative side effects in both environments.
While much work is needed before this algorithm would be
effectively applicable to more complicated environments, we
believe that this first step highlights the possibility of
bridgework between the field of machine ethics and RL; in
particular, for the purpose of instantiating useful priors for RL agents
interacting in environments shared with other agents.
</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>This section reviews machine ethics and AI safety, two fields
studying how to encourage and ensure acceptable behavior in
computational agents.
</p>
      <sec id="sec-2-1">
        <title>Machine Ethics</title>
        <p>The field of machine ethics [Anderson and Anderson, 2011;
Wallach and Allen, 2008] studies how to design algorithms
(including RL algorithms) capable of moral behavior. While
morality is often a contentious term, with no agreement
among moral philosophers (or religions) as to the nature of
a “correct” ethics, from a pragmatic viewpoint, agents
deployed in the real world will encounter situations with
ethical tradeoffs, and to be palatable their behavior will need to
approximately satisfy certain societal and legal norms.
Anticipating and hard-coding acceptable behavior for all such
trade-offs is likely impossible. Therefore, just as humans take
ethical stances in the real world in the absence of universal
ethical consensus, we may need the same pragmatic behavior
from intelligent machines.</p>
        <p>Work in machine ethics often entails concretely
embodying a particular moral framework in code, and applying the
resulting agent in its appropriate domain. For example,
Winfield et al. [2014] implement a version of Asimov’s first law
of robotics (i.e. “A robot may not injure a human being or,
through inaction, allow a human being to come to harm”)
in a wheeled robot that can intervene to stop another robot
(in lieu of an actual human) from harming itself.
Interestingly, the implemented system bears a strong resemblance to
model-based RL; such reinvention, and the strong possibility
that agents tackling complex tasks with ethical dimensions
will likely be driven by machine learning (ML), suggests the
potential benefit and need for increased cooperation between
ML and machine ethics, which is an additional motivation for
our work.</p>
        <p>Indeed, our work can be seen as a contribution to the
intersection of machine ethics and ML, in that the process of
empathy is an important contributor to morally-relevant
behavior in humans [Tangney et al., 2007], and that to the authors’
knowledge, there has not been previous work implementing
golden-rule-inspired architectures in RL.
</p>
      </sec>
      <sec id="sec-2-2">
        <title>AI Safety</title>
        <p>A related but distinct field of study is AI safety [Amodei et al.,
2016; Everitt et al., 2018], which studies how AI agents can
be implemented to avoid harmful accidents. Because
harmful accidents often have ethical valence, there is necessarily
overlap between the two fields, although technical research
questions in AI safety may not be phrased in the language of
ethics or morality.</p>
        <p>
          Our work most directly relates to the problem of negative
side-effects, as described by Amodei et al. [2016]. In this
problem the designer specifies an objective function that
focuses on accomplishing a specific task (e.g. a robot should
clean a room), but fails to encompass all other aspects of
the environment (e.g. the robot should not vacuum the cat);
the result is an agent that is indifferent to whether it alters
the environment in undesirable ways, e.g. causing harm to
the cat. Most approaches to mitigating side-effects aim to
generally minimize the impact the agent has on the
environment through intelligent heuristics [Armstrong and
Levinstein, 2017; Krakovna et al., 2018; Turner et al., 2019]; we
believe that other-agent-considering heuristics (like ours) are
likely complementary. Inverse reinforcement learning
          <xref ref-type="bibr" rid="ref1">(IRL;
[Abbeel and Ng, 2004])</xref>
          aims to directly learn the rewards
of other agents (which a learned agent could then take into
account) and could also be meaningfully combined with our
approach (e.g. Empathic DQN could serve as a prior when a
new kind of agent is first encountered).
        </p>
        <p>Note that a related safety-adjacent field is cooperative
multi-agent reinforcement learning [Panait and Luke, 2005],
wherein learning agents are trained to cooperate or compete
with one another. For example, self-other modeling [Raileanu
et al., 2018] is an approach that shares motivation with ours,
wherein cooperation can be aided through inferring the goals
of other agents. Our setting differs from other approaches
in that we do not assume other agents are computational, that
they learn in any particular way, or that their reward functions
or architectures are known; conversely, we make additional
assumptions about the validity and usefulness of projecting
particular kinds of reward an agent receives onto other agents.
</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Approach: Empathic Deep Q-Learning</title>
      <p>Deliberate empathy involves imaginatively placing oneself in
the position of another, and is a source of potential
understanding and care. As a rough computational abstraction of
this process, we learn to estimate the expected reward of an
independent agent, assuming that its rewards are like the ones
experienced by the learning agent. To do so, an agent
imagines what it would be like to experience the environment if it
and the other agent switched places, and estimates the quality
of this state through its own past experiences.</p>
      <p>A separate issue from understanding the situation of
another agent (“empathy”) is how (or if) an empathic agent
should modify its behavior as a result (“ethics”). Here, we
instantiate an ethics roughly inspired by the golden rule.
In particular, a value function is learned that combines the
usual agent-centric state-value with other-oriented value with
a weighted average. The degree to which the other agent
influences the learning agent’s behavior is thus determined
by a selfishness hyperparameter. As selfishness approaches
1.0, standard Q-learning is recovered, and as selfishness
approaches 0, the learning agent attempts to maximize only
what it believes is the reward of the other agent.</p>
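        <p>As a minimal sketch, the weighted combination described above can be written directly; the function name and the use of a plain selfishness argument are our own notation, not taken from the paper’s code:</p>

```python
def empathic_target(y_self, v_other, selfishness):
    """Blend the self-centered target with the surrogate value of the
    other agent, weighted by the selfishness parameter."""
    return selfishness * y_self + (1.0 - selfishness) * v_other

# selfishness = 1.0 recovers the standard self-centered target
print(empathic_target(2.0, 5.0, 1.0))  # 2.0
# selfishness = 0.0 maximizes only the other agent's surrogate value
print(empathic_target(2.0, 5.0, 0.0))  # 5.0
```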
      <p>Note that our current implementation depends on ad-hoc
machinery that enables the learning agent to imagine the
perspective of another agent; such engineering may be possible
in some cases, but the aspiration of this line of research is
for such machinery to eventually be itself learned. Similarly,
we currently side-step the issue of empathizing with multiple
agents, and of learning what types of reward should be
empathized to what types of agents (e.g. many agents may
experience similar physical harms, but many rewards are agent
and/or task-specific). The discussion section describes
possible approaches to overcoming these limitations. Code will be
available at https://github.com/bartbussmann/EmpathicDQN.
</p>
      <sec id="sec-3-1">
        <title>Algorithm Description</title>
        <p>In the MDP formalism of RL, an agent experiences a state s
from a set S and can take actions from a set A. By performing
an action a ∈ A, the agent transitions from state s ∈ S to
state s′ ∈ S, and receives a real-valued reward. The goal
of the agent is to maximize the expected (often temporally-
discounted) reward it receives. The expected value of taking
action a in state s, and following a fixed policy thereafter, can
be expressed as Q(s, a). Experiments here apply DQN [Mnih
et al., 2015] and variants thereof to approximate an optimal
Q(s, a).</p>
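        <p>For concreteness, the bootstrapped target that Q-learning (and, with function approximation, DQN) optimizes can be illustrated in tabular form; this toy snippet is our illustration, not the paper’s network implementation:</p>

```python
import numpy as np

n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
gamma, alpha = 0.9, 0.5  # discount factor and learning rate

def q_update(s, a, r, s_next):
    """One tabular Q-learning step: move Q(s, a) toward the
    bootstrapped target r + gamma * max_a' Q(s_next, a')."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])

q_update(0, 1, 1.0, 2)   # Q[0, 1] moves halfway from 0.0 toward 1.0
print(Q[0, 1])           # 0.5
```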
        <p>
          We assume that the MDP reward function insufficiently
accounts for the preferences of other agents, and we therefore
augment DQN in an attempt to encompass them. In
particular, an additional Q-network (Qemp(s, a)) is trained to
estimate the weighted sum of self-centered value and
other-centered value
          <xref ref-type="bibr" rid="ref17">(where other-centered value is approximated
by taking the self-centered Q-values with the places of both
agents swapped; note this approximation technique is similar
to that of Raileanu et al. [2018])</xref>
          .
        </p>
        <p>In more detail, suppose the agent is in state st at time t
in an environment with another independent agent. It will
then select action at ∈ A and update the Q-networks using
the following steps (see more complete pseudocode in
Algorithm 1):
1. Calculate Qemp(st, a) for all possible a ∈ A and select
the action (at) with the highest value.
2. Observe the reward (rt) and next state (st+1) of the agent.
3. Perform a gradient descent step on Q(s, a) (this function
reflects the self-centered state-action-values).
4. Localize the other agent and construct a state st+1emp
wherein the agents switch places (i.e. the learning agent
takes the other agent’s position in the environment, and
vice versa).
5. Calculate maxa Q(st+1emp, a) as a surrogate value
function for the other agent.
6. Calculate the target of the empathic value function
Qemp(s, a) as an average, weighted by selfishness
parameter λ, of the self-centered action-value and the
surrogate value of the other agent.</p>
        <p>7. Perform a gradient descent step on Qemp(s, a).</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>The experiments in this paper apply Empathic DQN to two
gridworld domains. The goal in the first environment is to
share the environment with another non-learning agent
without harming it. In particular, as an evocative example, we
frame this Coexistence environment as containing a robot
learning to navigate a room without harming a cat also
roaming within the room. In the second environment, the goal is
to share resources in the environment, when accumulating
resources result in diminishing returns. In particular, we frame
this Sharing environment as a robot learning to collect
batteries that also could be shared with a human (who also finds
them useful) in the same environment.</p>
      <p>In both our experiments, we compare Empathic DQN both
to standard DQN and to DQN with reward shaping manually
designed to minimize negative side-effects.
</p>
      <sec id="sec-4-1">
        <title>Experimental Settings</title>
        <p>A feed-forward neural network is used to estimate both
Q(s, a) and Qemp(s, a), with two hidden layers of 128
neurons each. The batch size is 32, and batches are randomly
drawn from a replay memory consisting of the last 500,000
transitions. A target action-value function Q̂ is updated every
10,000 time steps to avoid training instability. An ε-greedy
policy is used to encourage exploration, where ε is decayed
in a linear fashion over the first million time steps from 1.0 to
0.01.</p>
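        <p>The linear ε schedule can be sketched as follows (a small helper of our own devising; parameter names are assumptions):</p>

```python
def epsilon(step, eps_start=1.0, eps_end=0.01, decay_steps=1_000_000):
    """Linearly anneal the exploration rate from eps_start down to
    eps_end over the first decay_steps time steps, then hold it."""
    frac = min(step / decay_steps, 1.0)
    return eps_end + (1.0 - frac) * (eps_start - eps_end)

eps_mid = epsilon(500_000)   # roughly halfway between 1.0 and 0.01
```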
        <sec id="sec-4-1-1">
          <title>Algorithm 1 Empathic DQN</title>
          <p>Initialize replay memory D to capacity N
Initialize action-value function Q with weights θ
Initialize target action-value function Q̂ with weights θ⁻ = θ
Initialize empathic action-value function Qemp with weights θemp
for episode = 1, M do
    obtain initial agent state s1
    obtain initial empathic state s1emp of closest other agent
    for t = 1, T do
        with probability ε select a random action at
        otherwise select at = argmaxa Qemp(st, a; θemp)
        Execute action at
        Observe reward rt and states st+1 and st+1emp
        Store transition (st, at, rt, st+1, st+1emp) in D
        Sample random batch of transitions (sj, aj, rj, sj+1, sj+1emp) from D
        Set yj = rj + γ maxa′ Q̂(sj+1, a′; θ⁻)
        Perform a gradient descent step on (yj − Q(sj, aj; θ))² with respect to θ
        Set yjemp = λ yj + (1 − λ) maxa′ Q̂(sj+1emp, a′; θ⁻)
        Perform a gradient descent step on (yjemp − Qemp(sj, aj; θemp))² with respect to θemp
        Every C steps set Q̂ = Q
    end for
end for</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>Coexistence Environment</title>
        <p>The coexistence gridworld (Figure 1) consists of a robot that
shares the environment with a cat. The robot’s goal is merely
to stay operative, and both the robot and cat can be harmed
by the other. We construct a somewhat-arbitrary physics that
determines in a collision who is harmed: The agent that is
above or to the right of the other agent prior to the collision
harms the other. If the learning robot is harmed, the episode
ends, and if the cat is harmed, it leaves the environment. A
harmed cat is a negative unnecessary side effect that we wish
to avoid, and one that an empathetic agent can learn to avoid,
because it can generalize from how the cat harms it, to value
that the cat should not experience similar harm. Reducing the
selfishness value of the cleaning robot should therefore result
in increasing efforts to stay operative while avoiding the cat.
The cat performs a random walk.</p>
        <p>The state representation input to the DQN is a flattened 5x5
perceptive field centered on the robot; the robot is represented
as a 1, the cat as a −1, and the floor as a 0. Every time step,
the cat takes a random action (up, down, left, right, or no-op),
and the robot takes an action from the same set according to
its policy. Every time step in which the robot is operative,
it receives a reward of 1.0. An episode is ended after the
robot becomes non-operative (i.e. if it is harmed by the cat),
or after a maximum of 500 time steps. The empathetic state
stemp used for Empathic DQN is constructed by switching the
cat and the robot, and generating an imagined perceptive field
around the robot (that has taken the cat’s position). Note that
this occurs even when the cat is outside the robot’s field of
view (which requires omniscience; future work will explore
more realistic settings).</p>
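        <p>A minimal sketch of this empathic state construction (our reconstruction; the paper’s code may differ, e.g. in how out-of-grid cells are handled):</p>

```python
import numpy as np

def perceptive_field(center, self_pos, other_pos, k=2):
    """Flattened (2k+1)x(2k+1) field centered on `center`: the viewing
    agent is 1, the other agent is -1, floor is 0.  Agents outside the
    field simply do not appear (an assumption on our part)."""
    size = 2 * k + 1
    field = np.zeros((size, size))
    for pos, value in ((self_pos, 1.0), (other_pos, -1.0)):
        r = pos[0] - center[0] + k
        c = pos[1] - center[1] + k
        if r in range(size) and c in range(size):
            field[r, c] = value
    return field.flatten()

robot, cat = (3, 3), (4, 3)
s = perceptive_field(robot, robot, cat)      # ordinary state
s_emp = perceptive_field(cat, cat, robot)    # robot and cat switch places
```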
        <p>As a baseline, we also train standard DQN with a
hard-coded reward function that penalizes negative side-effects. In
this case, the robot receives a −100 reward when it harms the
cat.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Results</title>
        <p>Figure 2 shows the average number of time steps the robot
survives for each method. As the selfishness parameter
decreases for Empathic DQN, the agent performs worse at
surviving, and learns more slowly. This outcome is explained by
Figure 3, which shows the average number of harmed cats:
The more selfish agents harm the cat more often, which
removes the only source of danger in the environment, making
it easier for them to survive. Although they learn less quickly,
the less selfish agents do eventually learn a strategy to survive
without harming the cat.</p>
      </sec>
      <sec id="sec-4-4">
        <title>Sharing Environment</title>
        <p>The sharing environment (Figure 4) consists of one robot and
a human. The goal of the robot is to collect resources (here,
batteries), where each additional battery collected results in
diminishing returns. The idea is to model a situation where
a raw optimizing agent is incentivized to hoard resources,
which inflicts negative side-effects for those who could
extract greater value from them. We assume the same
diminishing returns schema applies for the human (who performs
random behavior). Thus, an empathic robot, by considering
the condition of other, can recognize the possible greater
benefits of leaving resources to other agents.</p>
        <p>We model diminishing returns by assuming that the first
collected battery is worth 1.0 reward, and every subsequent
collected battery is worth 0.1 less, i.e. the second battery is
worth 0.9, the third 0.8, etc. Note that reward diminishes
independently for each agent, i.e. if the robot has collected
any number of batteries, the human still earns 1.0 reward for
the first battery they collect.</p>
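        <p>The per-agent schedule is easy to state in code (a sketch; flooring at zero beyond the tenth battery is our assumption, since the paper does not specify):</p>

```python
def battery_reward(n_collected_before):
    """Reward for collecting one more battery: 1.0 for the first,
    0.1 less for each subsequent one, floored at 0.0."""
    return max(1.0 - 0.1 * n_collected_before, 0.0)

# Each agent keeps its own counter, so the human's first battery is
# worth 1.0 no matter how many batteries the robot has collected.
print([battery_reward(n) for n in range(3)])  # [1.0, 0.9, 0.8]
```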
        <p>The perceptive field of the robot and the empathetic state
generation for Empathic DQN work as in the coexistence
environment. The state representation for the Q-networks is that
floor is represented as 0, a battery as a 1 and both the robot
and the human are represented as the number of batteries
collected (a simple way to make transparent how much resource
each agent has already collected; note that the robot can
distinguish itself from the other because the robot is always in
the middle of its perceptive field).</p>
        <p>As a metric of how fairly the batteries are divided, we
define equality as follows:</p>
        <p>Equality = 2 · min(Σt rtrobot, Σt rthuman) / Σt (rtrobot + rthuman)</p>
        <p>where rtrobot and rthuman are the rewards at time step t
collected by the robot and human respectively.</p>
        <p>As a baseline that incorporates the negative side effect of
inequality in its reward function, we also train a traditional
DQN whose reward is multiplied by the current equality (i.e.
low equality will reduce rewards).</p>
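        <p>Both the metric and the shaped baseline are straightforward to express (our sketch; returning 1.0 before any reward has been collected is an assumption to avoid division by zero):</p>

```python
def equality(sum_r_robot, sum_r_human):
    """Equality = 2 * min(R_robot, R_human) / (R_robot + R_human),
    computed over cumulative rewards; 1.0 means a perfectly even split."""
    total = sum_r_robot + sum_r_human
    if total == 0:
        return 1.0  # assumption: full equality before any collection
    return 2.0 * min(sum_r_robot, sum_r_human) / total

def shaped_reward(r, sum_r_robot, sum_r_human):
    """Equality-shaped baseline: scale the robot's reward by equality."""
    return r * equality(sum_r_robot, sum_r_human)

print(equality(1.0, 1.0))  # 1.0  (even split)
print(equality(1.0, 0.0))  # 0.0  (robot hoards everything)
```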
      </sec>
      <sec id="sec-4-5">
        <title>Results</title>
        <p>Figure 5 shows the average number of batteries collected by
the robot for each method. We observe that as the selfishness
parameter decreases for Empathic DQN, the robot collects
fewer batteries, leaving more batteries for the human (i.e. the
robot does not unilaterally hoard resources).</p>
        <p>When looking at the resulting equality scores (Figure 6),
we see that a selfishness weight of 0.5 (when an agent equally
weighs its own benefit and the benefit of the human)
intuitively results in the highest equality scores. Other settings
result in the robot taking many batteries (e.g. selfishness 1.0)
or fewer-than-human batteries (e.g. selfishness 0.25).
</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Discussion</title>
      <p>The results of Empathic DQN in both environments
highlight the potential for empathy-based priors and simple
ethical norms to be productive tools for combating negative
side-effects in RL. That is, the way it explicitly takes into
account other agents may well complement other heuristic
impact regularizers that do not do so [Armstrong and Levinstein,
2017; Krakovna et al., 2018; Turner et al., 2019]. Beyond
the golden rule, it is interesting to consider other norms that
yield different or more sophisticated behavioral biases. For
example, another simple (perhaps more libertarian) ethic is
given by the silver rule: “Do not do unto others as you would
not have them do unto you.” The silver rule could be
approximated by considering only negative rewards as objects
of empathy. More sophisticated rules, like the platinum rule:
“Do unto others as they would have you do unto them,” may
often be useful or needed (e.g. a robot may be rewarded for
plugging itself into a wall, unlike a cat), and might require
combining Empathic DQN with approaches such as IRL
[Abbeel and Ng, 2004], cooperative IRL [Hadfield-Menell et
al., 2016], or reward modeling [Leike et al., 2018].</p>
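        <p>As an illustration only (the paper implements the golden rule, not these variants), the silver rule could be sketched by clipping the other-oriented value before mixing; the function name and arguments are hypothetical:</p>

```python
def norm_target(y_self, v_other, selfishness, rule="golden"):
    """Combine self-centered and other-oriented targets under a simple
    norm.  "golden" mixes v_other as-is; "silver" empathizes only with
    the negative part, so the agent avoids doing harm without being
    pulled toward others' positive rewards."""
    if rule == "silver":
        v_other = min(v_other, 0.0)
    return selfishness * y_self + (1.0 - selfishness) * v_other

# positive other-value is ignored under the silver rule...
print(norm_target(1.0, 4.0, 0.5, rule="silver"))   # 0.5
# ...but harm to the other agent still lowers the target
print(norm_target(1.0, -4.0, 0.5, rule="silver"))  # -1.5
```

        <p>The platinum rule, by contrast, cannot be expressed this way at all, since it requires an estimate of the other agent’s actual preferences rather than a projection of one’s own.</p>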
      <p>
        Although our main motivation is safety, Empathic DQN
may also inspire auxiliary objectives for RL, related to
intrinsic motivation [Chentanez et al., 2005] and imitation
learning [Ho and Ermon, 2016]. Being drawn to states that other
agents often visit may be a useful prior when reward is sparse.
In practice, intrinsic rewards could be given for states
similar to those in its empathy buffer containing imagined
experiences when the robot and the other agent switch places
        <xref ref-type="bibr" rid="ref20">(this
relates to the idea of third-person imitation learning [Stadie et
al., 2017])</xref>
        . This kind of objective could also make Empathic
DQN more reliable, incentivizing the agent to “walk a mile
in another’s shoes,” when experiences in the empathy buffer
have not yet been experienced by the agent. Finally, a learned
model of an agent’s own reward could help prioritize which
empathic states it is drawn towards. That is, an agent can
recognize that another agent has discovered a highly-rewarding
part of the environment (e.g. a remote part of the sharing
environment with many batteries).
      </p>
      <p>A key challenge for future work is attempting to apply
Empathic DQN to more complex and realistic settings, which
requires replacing what is currently hand-coded with a learned
pipeline, and grappling with complexities ignored in the
proof-of-concept experiments. For instance, our experiments
assume the learning agent is given a mechanism for
identifying other agents in the environment, and for generating
states that swap the robot with other agents (which involves
imagining the sensor state of the robot in its new situation).
This requirement is onerous, but could potentially be tackled
through a combination of object-detection models (to
identify other agents), and model-based RL (with a world model
it may often be possible to swap the locations of agents).</p>
      <p>An example of a complexity we currently ignore is how
to learn what kind of rewards should be empathized to what
kinds of agents. For example, gross physical stresses may
be broadly harmful to a wide class of agents, but two
people may disagree over whether a particular kind of food is
disgusting or delicious, and task and agent-specific rewards
should likely be only narrowly empathized. To deal with this
complexity it may be useful to extend the MDP formalism
to include more granular information about rewards (e.g.
beyond scalar feedback, is this reward task-specific, or does it
correspond to physical harm?), or to learn to factor rewards.
A complementary idea is to integrate and learn from
feedback of when empathy fails (e.g. by allowing the other agent
to signal when it has incurred a large negative reward), which
is likely necessary to go beyond our literal formalism of the
golden rule. For example, humans learn to contextualize the
golden rule intelligently and flexibly, and often find failures
informative.</p>
      <p>A final thread of future work involves empathizing with
multiple other agents, which brings its own complexities,
especially as agents come and go from the learning agent’s field
of view. The initial algorithm presented here considers the
interests of only a single other agent, and one simple extension
would be to replace the singular other-oriented estimate with
an average of other-oriented estimates for all other agents
(in effect implementing an explicitly utilitarian agent). The
choice of how to aggregate such estimated utilities to
influence the learning agent’s behavior highlights deeper
possible collisions with machine ethics and moral philosophy (e.g.
taking the minimum rather than the average value of others
would approximate a suffering-focused utilitarianism), and
we believe exploring these fields may spark further ideas and
algorithms.
</p>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>This paper proposed an extension to DQN, called Empathic
DQN, that aims to take other agents into account to avoid
inflicting negative side-effects upon them. Proof-of-concept
experiments validate our approach in two gridworld
environments, showing that adjusting agent selfishness can result
in fewer harms and more effective resource sharing. While
much work is required to scale this approach to real-world
tasks, we believe that cooperative emotions like empathy and
moral norms like the golden rule can provide rich inspiration
for technical research into safe RL.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          and Andrew Y Ng.
          <article-title>Apprenticeship learning via inverse reinforcement learning</article-title>
          .
          <source>In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM</source>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>Dario</given-names>
            <surname>Amodei</surname>
          </string-name>
          , Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané.
          <article-title>Concrete problems in AI safety</article-title>
          .
          <source>arXiv preprint arXiv:1606.06565</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <given-names>Michael</given-names>
            <surname>Anderson</surname>
          </string-name>
          and
          <string-name>
            <given-names>Susan Leigh</given-names>
            <surname>Anderson</surname>
          </string-name>
          .
          <source>Machine ethics</source>
          . Cambridge University Press,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <given-names>Stuart</given-names>
            <surname>Armstrong</surname>
          </string-name>
          and
          <string-name>
            <given-names>Benjamin</given-names>
            <surname>Levinstein</surname>
          </string-name>
          .
          <article-title>Low impact artificial intelligences</article-title>
          .
          <source>arXiv preprint arXiv:1705.10720</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <given-names>Nuttapong</given-names>
            <surname>Chentanez</surname>
          </string-name>
          , Andrew G Barto, and Satinder P Singh.
          <article-title>Intrinsically motivated reinforcement learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>1281</fpage>
          -
          <lpage>1288</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <given-names>Abram</given-names>
            <surname>Demski</surname>
          </string-name>
          and
          <string-name>
            <given-names>Scott</given-names>
            <surname>Garrabrant</surname>
          </string-name>
          .
          <article-title>Embedded agency</article-title>
          .
          <source>arXiv preprint arXiv:1902.09469</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <given-names>Tom</given-names>
            <surname>Everitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Gary</given-names>
            <surname>Lea</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Marcus</given-names>
            <surname>Hutter</surname>
          </string-name>
          .
          <article-title>Agi safety literature review</article-title>
          .
          <source>arXiv preprint arXiv:1805.01109</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Stuart J Russell, Pieter Abbeel, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Cooperative inverse reinforcement learning</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>3909</fpage>
          -
          <lpage>3917</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>Dylan</given-names>
            <surname>Hadfield-Menell</surname>
          </string-name>
          , Smitha Milli, Pieter Abbeel, Stuart J Russell, and
          <string-name>
            <given-names>Anca</given-names>
            <surname>Dragan</surname>
          </string-name>
          .
          <article-title>Inverse reward design</article-title>
          .
          <source>In Advances in neural information processing systems</source>
          , pages
          <fpage>6765</fpage>
          -
          <lpage>6774</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <given-names>Jonathan</given-names>
            <surname>Ho</surname>
          </string-name>
          and
          <string-name>
            <given-names>Stefano</given-names>
            <surname>Ermon</surname>
          </string-name>
          .
          <article-title>Generative adversarial imitation learning</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , pages
          <fpage>4565</fpage>
          -
          <lpage>4573</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <given-names>Hans</given-names>
            <surname>Küng</surname>
          </string-name>
          and
          <string-name>
            <given-names>Karl-Josef</given-names>
            <surname>Kuschel</surname>
          </string-name>
          .
          <article-title>Global Ethic: the Declaration of the Parliament of the World's Religions</article-title>
          .
          <source>Bloomsbury Publishing</source>
          ,
          <year>1993</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>Victoria</given-names>
            <surname>Krakovna</surname>
          </string-name>
          , Laurent Orseau, Miljan Martic, and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Measuring and avoiding side effects using relative reachability</article-title>
          .
          <source>arXiv preprint arXiv:1806.01186</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <given-names>Joel</given-names>
            <surname>Lehman</surname>
          </string-name>
          , Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al.
          <article-title>The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities</article-title>
          .
          <source>arXiv preprint arXiv:1803.03453</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <given-names>Jan</given-names>
            <surname>Leike</surname>
          </string-name>
          , David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and
          <string-name>
            <given-names>Shane</given-names>
            <surname>Legg</surname>
          </string-name>
          .
          <article-title>Scalable agent alignment via reward modeling: a research direction</article-title>
          .
          <source>arXiv preprint arXiv:1811.07871</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al.
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          ,
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <given-names>Liviu</given-names>
            <surname>Panait</surname>
          </string-name>
          and
          <string-name>
            <given-names>Sean</given-names>
            <surname>Luke</surname>
          </string-name>
          .
          <article-title>Cooperative multi-agent learning: The state of the art</article-title>
          .
          <source>Autonomous agents and multiagent systems</source>
          ,
          <volume>11</volume>
          (
          <issue>3</issue>
          ):
          <fpage>387</fpage>
          -
          <lpage>434</lpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <given-names>Roberta</given-names>
            <surname>Raileanu</surname>
          </string-name>
          , Emily Denton, Arthur Szlam, and
          <string-name>
            <given-names>Rob</given-names>
            <surname>Fergus</surname>
          </string-name>
          .
          <article-title>Modeling others using oneself in multi-agent reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1802.09640</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <given-names>William</given-names>
            <surname>Saunders</surname>
          </string-name>
          , Girish Sastry, Andreas Stuhlmueller, and
          <string-name>
            <given-names>Owain</given-names>
            <surname>Evans</surname>
          </string-name>
          .
          <article-title>Trial without error: Towards safe reinforcement learning via human intervention</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems</source>
          , pages
          <fpage>2067</fpage>
          -
          <lpage>2069</lpage>
          . International Foundation for Autonomous Agents and Multiagent Systems,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <given-names>David</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Aja</given-names>
            <surname>Huang</surname>
          </string-name>
          , Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam,
          <string-name>
            <given-names>Marc</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , et al.
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          .
          <source>Nature</source>
          ,
          <volume>529</volume>
          (
          <issue>7587</issue>
          ):
          <fpage>484</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <given-names>Bradly C</given-names>
            <surname>Stadie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pieter</given-names>
            <surname>Abbeel</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          .
          <article-title>Third-person imitation learning</article-title>
          .
          <source>arXiv preprint arXiv:1703.01703</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <given-names>Richard S</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Andrew G</given-names>
            <surname>Barto</surname>
          </string-name>
          , et al.
          <article-title>Introduction to reinforcement learning</article-title>
          , volume
          <volume>135</volume>
          . MIT press Cambridge,
          <year>1998</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <given-names>June Price</given-names>
            <surname>Tangney</surname>
          </string-name>
          , Jeff Stuewig, and
          <string-name>
            <given-names>Debra J</given-names>
            <surname>Mashek</surname>
          </string-name>
          .
          <article-title>Moral emotions and moral behavior</article-title>
          .
          <source>Annu. Rev. Psychol</source>
          .,
          <volume>58</volume>
          :
          <fpage>345</fpage>
          -
          <lpage>372</lpage>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <given-names>Alexander Matt</given-names>
            <surname>Turner</surname>
          </string-name>
          , Dylan Hadfield-Menell, and
          <string-name>
            <given-names>Prasad</given-names>
            <surname>Tadepalli</surname>
          </string-name>
          .
          <article-title>Conservative agency via attainable utility preservation</article-title>
          .
          <source>arXiv preprint arXiv:1902.09725</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <given-names>Wendell</given-names>
            <surname>Wallach</surname>
          </string-name>
          and
          <string-name>
            <given-names>Colin</given-names>
            <surname>Allen</surname>
          </string-name>
          .
          <article-title>Moral machines: Teaching robots right from wrong</article-title>
          . Oxford University Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>Alan FT</given-names>
            <surname>Winfield</surname>
          </string-name>
          , Christian Blum, and Wenguo Liu.
          <article-title>Towards an ethical robot: internal models, consequences and ethical action selection</article-title>
          .
          <source>In Conference towards autonomous robotic systems</source>
          , pages
          <fpage>85</fpage>
          -
          <lpage>96</lpage>
          . Springer,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>