                                     Towards Empathic Deep Q-Learning

                      Bart Bussmann1, Jacqueline Heinerman2 and Joel Lehman3
                               1 University of Amsterdam, The Netherlands
                              2 VU University Amsterdam, The Netherlands
                              3 Uber AI Labs, San Francisco, United States
               bart.bussmann@student.uva.nl, jacqueline@heinerman.nl, joel.lehman@uber.com


                          Abstract

As reinforcement learning (RL) scales to solve increasingly complex tasks, interest continues to grow in the fields of AI safety and machine ethics. As a contribution to these fields, this paper introduces an extension to Deep Q-Networks (DQNs), called Empathic DQN, that is loosely inspired both by empathy and the golden rule ("Do unto others as you would have them do unto you"). Empathic DQN aims to help mitigate negative side effects to other agents resulting from myopic goal-directed behavior. We assume a setting where a learning agent coexists with other independent agents (who receive unknown rewards), where some types of reward (e.g. negative rewards from physical harm) may generalize across agents. Empathic DQN combines the typical (self-centered) value with the estimated value of other agents, by imagining (by its own standards) the value of being in the other's situation (by considering constructed states where both agents are swapped). Proof-of-concept results in two gridworld environments highlight the approach's potential to decrease collateral harms. While extending Empathic DQN to complex environments is non-trivial, we believe that this first step highlights the potential of bridge-work between machine ethics and RL to contribute useful priors for norm-abiding RL agents.

1   Introduction

Historically, reinforcement learning (RL; [Sutton et al., 1998]) research has largely focused on solving clearly-specified benchmark tasks. For example, the ubiquitous Markov decision process (MDP) framework cleaves the world into four well-defined parts (states, actions, state-action transitions, and rewards), and most RL algorithms and benchmarks leverage or reify the assumptions of this formalism, e.g. that a singular, fixed, and correct reward function exists and is given. While there has been much exciting progress in learning to solve complex well-specified tasks (e.g. super-human performance in go [Silver et al., 2016] and Atari [Mnih et al., 2015]), there is also increasing recognition that common RL formalisms are often meaningfully imperfect [Hadfield-Menell et al., 2017; Lehman et al., 2018], and that there remains much to understand about safely applying RL to solve real-world tasks [Amodei et al., 2016].

As a result of this growing awareness, there has been increasing interest in the field of AI safety [Amodei et al., 2016; Everitt et al., 2018], which is broadly concerned with creating AI agents that do what is intended for them to do, and which often entails questioning and extending common formalisms [Hadfield-Menell et al., 2017, 2016; Demski and Garrabrant, 2019]. One overarching theme in AI safety is how to learn or provide correct incentives to an agent. Amodei et al. [2016] distinguish different failure modes in specifying reward functions, which include reward hacking, wherein an agent learns how to optimize the reward function in an unexpected and unintended way that does not satisfy the underlying goal, and unintended side effects, wherein an agent learns to achieve the desired goal, but causes undesirable collateral harm (because the given reward function is incomplete, i.e. it does not include all of the background knowledge and context of the human reward designer).

This paper focuses on the latter setting, i.e. assuming that the reward function incentivizes solving the task, but fails to anticipate some unintended harms. We assume that in real-world settings, a physically-embodied RL agent (i.e. a controller for a robot) will often share space with other agents (e.g. humans, animals, and other trained computational agents), and it is challenging to design reward functions a priori that enumerate all the ways in which other agents can be negatively affected [Amodei et al., 2016]. Promising current approaches include value learning from human preferences [Saunders et al., 2018; Leike et al., 2018; Hadfield-Menell et al., 2016] and creating agents that attempt to minimize their impact on the environment [Krakovna et al., 2018; Turner et al., 2019]; however, value learning can be expensive for its need to include humans in the loop, and both directions remain technically and philosophically challenging. This paper introduces another tool that could complement such existing approaches, motivated by the concept of empathy.
In particular, the insight motivating this paper is that humans often empathize with the situations of others, by generalizing from their own past experiences. For example, we can feel vicarious fear for someone who is walking a tightrope, because we ourselves would be afraid in such a situation. Similarly, for some classes of reward signals (e.g. physical harm), it may be reasonable for embodied computational agents to generalize those rewards to other agents (i.e. to assume as a prior expectation that other agents might receive similar reward in similar situations). If a robot learns that a fall from heights is dangerous to itself, that insight could generalize to most other embodied agents.

For humans, beyond granting us a capacity to understand others, such empathy also influences our behavior, e.g. by avoiding harming others while walking down the street; likewise, in some situations it may be useful if learning agents could also act out of empathy (e.g. to prevent physical harm to another agent resulting from otherwise blind goal-pursuit). While there are many ways to instantiate algorithms that abide by social or ethical norms (as studied by the field of machine ethics [Anderson and Anderson, 2011; Wallach and Allen, 2008]), here we take loose inspiration from one simple ethical norm, i.e. the golden rule.

The golden rule, often expressed as: "Do unto others as you would have them do unto you," is a principle that has emerged in many ethical and religious contexts [Küng and Kuschel, 1993]. At heart, abiding by this rule entails projecting one's desires onto another agent, and attempting to honor them. We formalize this idea as an extension of Deep Q-Networks (DQNs; [Mnih et al., 2015]), which we call Empathic DQN. The main idea is to augment the value of a given state with the value of constructed states simulating what the learning agent would experience if its position were switched with another agent. Such an approach can also be seen as learning to maximize an estimate of the combined rewards of both agents, which embodies a utilitarian ethic.

The experiments in this paper apply Empathic DQN to two gridworld domains, in which a learned agent pursues a goal in an environment shared with other non-learned (i.e. fixed) agents. In one environment, an agent can harm and be harmed by other agents; and in another, an agent receives diminishing returns from hoarding resources that also could benefit other agents. Results in these domains show that Empathic DQN can reduce negative side effects in both environments. While much work is needed before this algorithm would be effectively applicable to more complicated environments, we believe that this first step highlights the possibility of bridge-work between the field of machine ethics and RL; in particular, for the purpose of instantiating useful priors for RL agents interacting in environments shared with other agents.

2   Background

This section reviews machine ethics and AI safety, two fields studying how to encourage and ensure acceptable behavior in computational agents.

2.1   Machine Ethics

The field of machine ethics [Anderson and Anderson, 2011; Wallach and Allen, 2008] studies how to design algorithms (including RL algorithms) capable of moral behavior. While morality is often a contentious term, with no agreement among moral philosophers (or religions) as to the nature of a "correct" ethics, from a pragmatic viewpoint, agents deployed in the real world will encounter situations with ethical tradeoffs, and to be palatable their behavior will need to approximately satisfy certain societal and legal norms. Anticipating and hard-coding acceptable behavior for all such trade-offs is likely impossible. Therefore, just as humans take ethical stances in the real world in the absence of universal ethical consensus, we may need the same pragmatic behavior from intelligent machines.

Work in machine ethics often entails concretely embodying a particular moral framework in code, and applying the resulting agent in its appropriate domain. For example, Winfield et al. [2014] implement a version of Asimov's first law of robotics (i.e. "A robot may not injure a human being or, through inaction, allow a human being to come to harm") in a wheeled robot that can intervene to stop another robot (in lieu of an actual human) from harming itself. Interestingly, the implemented system bears a strong resemblance to model-based RL; such reinvention, and the strong possibility that agents tackling complex tasks with ethical dimensions will likely be driven by machine learning (ML), suggests the potential benefit and need for increased cooperation between ML and machine ethics, which is an additional motivation for our work.

Indeed, our work can be seen as a contribution to the intersection of machine ethics and ML, in that the process of empathy is an important contributor to morally-relevant behavior in humans [Tangney et al., 2007], and that to the authors' knowledge, there has not been previous work implementing golden-rule-inspired architectures in RL.

2.2   AI Safety

A related but distinct field of study is AI safety [Amodei et al., 2016; Everitt et al., 2018], which studies how AI agents can be implemented to avoid harmful accidents. Because harmful accidents often have ethical valence, there is necessarily overlap between the two fields, although technical research questions in AI safety may not be phrased in the language of ethics or morality.

Our work most directly relates to the problem of negative side-effects, as described by Amodei et al. [2016]. In this problem the designer specifies an objective function that focuses on accomplishing a specific task (e.g. a robot should clean a room), but fails to encompass all other aspects of the environment (e.g. the robot should not vacuum the cat); the result is an agent that is indifferent to whether it alters the environment in undesirable ways, e.g. causing harm to the cat. Most approaches to mitigating side-effects aim to generally minimize the impact the agent has on the environment through intelligent heuristics [Armstrong and Levinstein, 2017; Krakovna et al., 2018; Turner et al., 2019]; we believe that other-agent-considering heuristics (like ours) are likely complementary. Inverse reinforcement learning (IRL; [Abbeel and Ng, 2004]) aims to directly learn the rewards of other agents (which a learned agent could then take into account) and could also be meaningfully combined with our approach (e.g. Empathic DQN could serve as a prior when a new kind of agent is first encountered).
Note that a related safety-adjacent field is cooperative multi-agent reinforcement learning [Panait and Luke, 2005], wherein learning agents are trained to cooperate or compete with one another. For example, self-other modeling [Raileanu et al., 2018] is an approach that shares motivation with ours, wherein cooperation can be aided through inferring the goals of other agents. Our setting differs from other approaches in that we do not assume other agents are computational, that they learn in any particular way, or that their reward functions or architectures are known; conversely, we make additional assumptions about the validity and usefulness of projecting particular kinds of reward an agent receives onto other agents.

3   Approach: Empathic Deep Q-Learning

Deliberate empathy involves imaginatively placing oneself in the position of another, and is a source of potential understanding and care. As a rough computational abstraction of this process, we learn to estimate the expected reward of an independent agent, assuming that its rewards are like the ones experienced by the learning agent. To do so, an agent imagines what it would be like to experience the environment if it and the other agent switched places, and estimates the quality of this state through its own past experiences.

A separate issue from understanding the situation of another agent ("empathy") is how (or if) an empathic agent should modify its behavior as a result ("ethics"). Here, we instantiate an ethics roughly inspired by the golden rule. In particular, a value function is learned that combines the usual agent-centric state-value with other-oriented value through a weighted average. The degree to which the other agent influences the learning agent's behavior is thus determined by a selfishness hyperparameter. As selfishness approaches 1.0, standard Q-learning is recovered, and as selfishness approaches 0, the learning agent attempts to maximize only what it believes is the reward of the other agent.

Note that our current implementation depends on ad-hoc machinery that enables the learning agent to imagine the perspective of another agent; such engineering may be possible in some cases, but the aspiration of this line of research is for such machinery to eventually be itself learned. Similarly, we currently side-step the issue of empathizing with multiple agents, and of learning what types of reward should be empathized to what types of agents (e.g. many agents may experience similar physical harms, but many rewards are agent- and/or task-specific). The discussion section describes possible approaches to overcoming these limitations. Code will be available at https://github.com/bartbussmann/EmpathicDQN.

3.1   Algorithm Description

In the MDP formalism of RL, an agent experiences a state s from a set S and can take actions from a set A. By performing an action a ∈ A, the agent transitions from state s ∈ S to state s′ ∈ S, and receives a real-valued reward. The goal of the agent is to maximize the expected (often temporally-discounted) reward it receives. The expected value of taking action a in state s, and following a fixed policy thereafter, can be expressed as Q(s, a). Experiments here apply DQN [Mnih et al., 2015] and variants thereof to approximate an optimal Q(s, a).

We assume that the MDP reward function insufficiently accounts for the preferences of other agents, and we therefore augment DQN in an attempt to encompass them. In particular, an additional Q-network (Q_emp(s, a)) is trained to estimate the weighted sum of self-centered value and other-centered value (where other-centered value is approximated by taking the self-centered Q-values with the places of both agents swapped; note this approximation technique is similar to that of Raileanu et al. [2018]).

In more detail, suppose the agent is in state s_t at time t in an environment with another independent agent. It will then select action a_t ∈ A and update the Q-networks using the following steps (see more complete pseudocode in Algorithm 1):

1. Calculate Q_emp(s_t, a) for all possible a ∈ A and select the action (a_t) with the highest value.
2. Observe the reward (r_t) and next state (s_{t+1}) of the agent.
3. Perform a gradient descent step on Q(s, a) (this function reflects the self-centered state-action values).
4. Localize the other agent and construct a state s^emp_{t+1} wherein the agents switch places (i.e. the learning agent takes the other agent's position in the environment, and vice versa).
5. Calculate max_a Q(s^emp_{t+1}, a) as a surrogate value function for the other agent.
6. Calculate the target of the empathic value function Q_emp(s, a) as an average, weighted by the selfishness parameter β, of the self-centered action-value and the surrogate value of the other agent.
7. Perform a gradient descent step on Q_emp(s, a).

Algorithm 1 Empathic DQN

Initialize:
    replay memory D to capacity N
    action-value function Q with weights θ
    target action-value function Q̂ with weights θ⁻ = θ
    empathic action-value function Q_emp with weights θ_emp
for episode = 1, M do
    obtain initial agent state s_1
    obtain initial empathic state of closest other agent s^emp_1
    for t = 1, T do
        if random probability < ε then
            select a random action a_t
        else
            select a_t = argmax_a Q_emp(s_t, a; θ_emp)
        Execute action a_t
        Observe reward r_t
        Observe states s_{t+1} and s^emp_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}, s^emp_{t+1}) in D
        Sample random batch of transitions from D
        Set y_j = r_j + γ max_{a'} Q̂(s_{j+1}, a'; θ⁻)
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))² with respect to θ
        Set y^emp_j = β · y_j + (1 − β) · γ max_{a'} Q̂(s^emp_{j+1}, a'; θ⁻)
        Perform a gradient descent step on (y^emp_j − Q_emp(s_j, a_j; θ_emp))² with respect to θ_emp
        Every C steps set Q̂ = Q
    end for
end for
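To make the target computation concrete, the following is a minimal PyTorch-style sketch of steps 5-7 (the two regression targets of Algorithm 1); the function and argument names are illustrative assumptions, not the released implementation, and β corresponds to the selfishness parameter.

import torch

def empathic_targets(q_target_net, rewards, next_states, swapped_next_states, beta, gamma):
    # rewards: tensor [B]; next_states and swapped_next_states: tensors [B, obs_dim],
    # where swapped_next_states are the imagined states with the learning agent and
    # the other agent switched (step 4). Terminal-state masking is omitted, as in Algorithm 1.
    with torch.no_grad():
        # y_j = r_j + gamma * max_a' Q_hat(s_{j+1}, a'; theta^-)
        y_self = rewards + gamma * q_target_net(next_states).max(dim=1).values
        # Surrogate value of the other agent (step 5): gamma * max_a' Q_hat(s_emp_{j+1}, a'; theta^-)
        v_other = gamma * q_target_net(swapped_next_states).max(dim=1).values
        # Empathic target (step 6): selfishness-weighted average of own and other-oriented value
        y_emp = beta * y_self + (1.0 - beta) * v_other
    return y_self, y_emp

Steps 3 and 7 then regress Q(s_j, a_j) toward y_self and Q_emp(s_j, a_j) toward y_emp with a squared-error loss.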
4   Experiments

The experiments in this paper apply Empathic DQN to two gridworld domains. The goal in the first environment is to share the environment with another non-learning agent without harming it. In particular, as an evocative example, we frame this Coexistence environment as containing a robot learning to navigate a room without harming a cat also roaming within the room. In the second environment, the goal is to share resources in the environment, when accumulating resources results in diminishing returns. In particular, we frame this Sharing environment as a robot learning to collect batteries that also could be shared with a human (who also finds them useful) in the same environment.

In both our experiments, we compare Empathic DQN both to standard DQN and to DQN with reward shaping manually designed to minimize negative side-effects.

4.1   Experimental Settings

A feed-forward neural network is used to estimate both Q(s, a) and Q_emp(s, a), with two hidden layers of 128 neurons each. The batch size is 32, and batches are randomly drawn from a replay memory consisting of the last 500,000 transitions. A target action-value function Q̂ is updated every 10,000 time steps to avoid training instability. An ε-greedy policy is used to encourage exploration, where ε is decayed linearly over the first million time steps from 1.0 to 0.01.
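As an illustration only (the paper does not specify the deep-learning framework, activation functions, or optimizer), the settings above could be instantiated roughly as follows; the ReLU activations and all names below are assumptions.

import torch.nn as nn

def build_q_network(obs_dim, n_actions):
    # Two hidden layers of 128 units, as described in Section 4.1;
    # ReLU activations are an assumption (not stated in the paper).
    return nn.Sequential(
        nn.Linear(obs_dim, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, n_actions),
    )

def epsilon_at(step, start=1.0, end=0.01, decay_steps=1_000_000):
    # Linear decay of the exploration rate over the first million time steps.
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)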
Figure 1: The coexistence environment. The environment consists of a robot and a cat. The part of the environment the robot can observe is marked with the red square.

4.2   Coexistence Environment

The coexistence gridworld (Figure 1) consists of a robot that shares the environment with a cat. The robot's goal is merely to stay operative, and both the robot and cat can be harmed by the other. We construct a somewhat-arbitrary physics that determines in a collision who is harmed: the agent that is above or to the right of the other agent prior to the collision harms the other. If the learning robot is harmed, the episode ends, and if the cat is harmed, it leaves the environment. A harmed cat is an unnecessary negative side effect that we wish to avoid, and one that an empathetic agent can learn to avoid, because it can generalize from how the cat harms it, to value that the cat should not experience similar harm. Reducing the selfishness value of the robot should therefore result in increasing efforts to stay operative while avoiding the cat. The cat performs a random walk.

The state representation input to the DQN is a flattened 5x5 perceptive field centered on the robot; the robot is represented as a 1, the cat as a −1, and the floor as a 0. Every time step, the cat takes a random action (up, down, left, right, or no-op), and the robot takes an action from the same set according to its policy. Every time step in which the robot is operative, it receives a reward of 1.0. An episode ends after the robot becomes non-operative (i.e. if it is harmed by the cat), or after a maximum of 500 time steps. The empathetic state s^emp_t used for Empathic DQN is constructed by switching the cat and the robot, and generating an imagined perceptive field around the robot (that has taken the cat's position). Note that this occurs even when the cat is outside the robot's field of view (which requires omniscience; future work will explore more realistic settings).

As a baseline, we also train standard DQN with a hard-coded reward function that penalizes negative side-effects. In this case, the robot receives a −100 reward when it harms the cat.
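For concreteness, a minimal sketch of the observation, empathic-state construction, and collision logic described above is given below; the coordinate convention (x grows rightward, y grows upward), the handling of out-of-grid cells, and the helper names are assumptions rather than the authors' released environment.

import numpy as np

FIELD = 5  # 5x5 perceptive field centered on the robot

def observe(robot_xy, cat_xy):
    # Flattened 5x5 field: robot = 1, cat = -1, floor = 0 (out-of-grid cells left as 0 here).
    obs = np.zeros((FIELD, FIELD))
    dx, dy = cat_xy[0] - robot_xy[0], cat_xy[1] - robot_xy[1]
    if abs(dx) <= 2 and abs(dy) <= 2:  # cat is inside the field of view
        obs[2 + dy, 2 + dx] = -1.0
    obs[2, 2] = 1.0                    # the robot always sits in the center
    return obs.flatten()

def empathic_observation(robot_xy, cat_xy):
    # Imagined perceptive field with the two agents' positions switched (Section 4.2).
    return observe(robot_xy=cat_xy, cat_xy=robot_xy)

def collision_outcome(robot_prev, cat_prev):
    # Applied when the robot and the cat move onto the same cell: the agent that was
    # above or to the right of the other prior to the collision harms the other.
    if robot_prev[0] > cat_prev[0] or robot_prev[1] > cat_prev[1]:
        return "cat_harmed"    # the cat leaves the environment, the episode continues
    return "robot_harmed"      # the episode ends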
Results

Figure 2: Average steps survived by the robot in the coexistence environment, shown across training episodes. Results are shown for Empathic DQN with different selfishness settings (where 1.0 recovers standard DQN), and DQN with a hard-coded penalty for harms. Results are averaged over 5 runs of each method.

Figure 2 shows the average number of time steps the robot survives for each method. As the selfishness parameter decreases for Empathic DQN, the agent performs worse at surviving, and learns more slowly. This outcome is explained by Figure 3, which shows the average number of harmed cats: the more selfish agents harm the cat more often, which removes the only source of danger in the environment, making it easier for them to survive. Although they learn less quickly, the less selfish agents do eventually learn a strategy to survive without harming the cat.
Figure 3: Average harms incurred (per episode) in the coexistence environment across training episodes. Results are shown for Empathic DQN with different selfishness values (where 1.0 recovers standard DQN), and DQN with a hard-coded penalty for harms. Harms to the cat by the learning robot decrease with less selfishness (or with the hard-coded penalty). Results are averaged over 5 runs.

4.3   Sharing Environment

Figure 4: The sharing environment. The environment consists of the robot, the human, and nine batteries. The part of the environment the robot can observe is marked with the red square.

The sharing environment (Figure 4) consists of one robot and a human. The goal of the robot is to collect resources (here, batteries), where each additional battery collected results in diminishing returns. The idea is to model a situation where a raw optimizing agent is incentivized to hoard resources, which inflicts negative side-effects on those who could extract greater value from them. We assume the same diminishing-returns schema applies for the human (who performs random behavior). Thus, an empathic robot, by considering the condition of the other, can recognize the possible greater benefits of leaving resources to other agents.

We model diminishing returns by assuming that the first collected battery is worth 1.0 reward, and every subsequent collected battery is worth 0.1 less, i.e. the second battery is worth 0.9, the third 0.8, etc. Note that reward diminishes independently for each agent, i.e. if the robot has collected any number of batteries, the human still earns 1.0 reward for the first battery they collect.

The perceptive field of the robot and the empathetic state generation for Empathic DQN work as in the coexistence environment. The state representation for the Q-networks is that floor is represented as 0, a battery as a −1, and both the robot and the human are represented as the number of batteries collected (a simple way to make transparent how much resource each agent has already collected; note that the robot can distinguish itself from the other because the robot is always in the middle of its perceptive field).

As a metric of how fairly the batteries are divided, we define equality as follows:

    Equality = 2 · min(Σ_{i=1}^{t} r_i^robot, Σ_{i=1}^{t} r_i^human) / (Σ_{i=1}^{t} r_i^robot + Σ_{i=1}^{t} r_i^human)

where r_i^robot and r_i^human are the rewards at time step i collected by the robot and human respectively.
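A small sketch of the reward schedule and the equality metric defined above follows; the function names and the zero-total convention are illustrative assumptions (with nine batteries the schedule never actually reaches zero).

def battery_reward(num_already_collected):
    # First battery is worth 1.0, each subsequent one 0.1 less, computed
    # independently per agent; floored at 0 here purely as a guard.
    return max(1.0 - 0.1 * num_already_collected, 0.0)

def equality(robot_return, human_return):
    # Equality = 2 * min(R_robot, R_human) / (R_robot + R_human):
    # 1.0 when both have collected equal reward, 0.0 when one agent has everything.
    total = robot_return + human_return
    if total == 0:
        return 1.0  # convention when nothing has been collected yet (an assumption)
    return 2.0 * min(robot_return, human_return) / total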
As a baseline that incorporates the negative side effect of inequality in its reward function, we also train a traditional DQN whose reward is multiplied by the current equality (i.e. low equality will reduce rewards).

Results

Figure 5: Average number of batteries collected (per episode) in the sharing environment, across training episodes. Results are shown for Empathic DQN with different selfishness settings (where 1.0 recovers standard DQN), and DQN with a hard-coded penalty (its reward is directly modulated by fairness). The results intuitively show that increasingly selfish agents collect more batteries. Results are averaged over 5 runs of each method.

Figure 5 shows the average number of batteries collected by the robot for each method. We observe that as the selfishness parameter decreases for Empathic DQN, the robot collects fewer batteries, leaving more batteries for the human (i.e. the robot does not unilaterally hoard resources).

When looking at the resulting equality scores (Figure 6), we see that a selfishness weight of 0.5 (when an agent equally weighs its own benefit and the benefit of the human) intuitively results in the highest equality scores. Other settings result in the robot taking many batteries (e.g. selfishness 1.0) or fewer-than-human batteries (e.g. selfishness 0.25).

Figure 6: Equality scores (per episode) in the sharing environment, across training episodes. Results are shown for Empathic DQN with different selfishness settings (where 1.0 recovers standard DQN), and DQN with a hard-coded penalty (its reward is directly modulated by fairness). Equality is maximized by agents that weigh their benefits and the benefits of the other equally (selfishness of 0.5). Results are averaged over 5 runs of each method.

5   Discussion
The results of Empathic DQN in both environments highlight the potential for empathy-based priors and simple ethical norms to be productive tools for combating negative side-effects in RL. That is, the way it explicitly takes into account other agents may complement other heuristic impact regularizers that do not do so [Armstrong and Levinstein, 2017; Krakovna et al., 2018; Turner et al., 2019]. Beyond the golden rule, it is interesting to consider other norms that yield different or more sophisticated behavioral biases. For example, another simple (perhaps more libertarian) ethic is given by the silver rule: "Do not do unto others as you would not have them do unto you." The silver rule could be approximated by considering only negative rewards as objects of empathy. More sophisticated rules, like the platinum rule: "Do unto others as they would have you do unto them," may often be useful or needed (e.g. a robot may be rewarded for plugging itself into a wall, unlike a cat), and might require combining Empathic DQN with approaches such as IRL [Abbeel and Ng, 2004], cooperative IRL [Hadfield-Menell et al., 2016], or reward modeling [Leike et al., 2018].
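As a sketch of how such norm variants might map onto the empathic target (these variants are illustrations of the ideas discussed here and further below, not implementations from the paper): the silver rule could clip the other-oriented term at zero so that only anticipated harms are empathized, and a multi-agent extension could average, or take the minimum of, several other-oriented estimates.

def empathic_target(y_self, v_others, beta, negative_only=False, aggregate=None):
    # y_self: the agent's own bootstrapped target y_j.
    # v_others: surrogate bootstrap values for each other agent (imagined swaps).
    # beta: selfishness weight, as in Algorithm 1.
    # negative_only: silver-rule variant, empathize only with anticipated harms.
    # aggregate: how to combine multiple others, e.g. the mean (utilitarian)
    #            or min (a more suffering-focused aggregation).
    if negative_only:
        v_others = [min(v, 0.0) for v in v_others]
    if aggregate is None:
        aggregate = lambda values: sum(values) / len(values)
    return beta * y_self + (1.0 - beta) * aggregate(v_others)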
Although our main motivation is safety, Empathic DQN may also inspire auxiliary objectives for RL, related to intrinsic motivation [Chentanez et al., 2005] and imitation learning [Ho and Ermon, 2016]. Being drawn to states that other agents often visit may be a useful prior when reward is sparse. In practice, intrinsic rewards could be given for states similar to those in its empathy buffer containing imagined experiences when the robot and the other agent switch places (this relates to the idea of third-person imitation learning [Stadie et al., 2017]). This kind of objective could also make Empathic DQN more reliable, incentivizing the agent to "walk a mile in another's shoes" when experiences in the empathy buffer have not yet been experienced by the agent. Finally, a learned model of an agent's own reward could help prioritize which empathic states it is drawn towards. That is, an agent can recognize that another agent has discovered a highly-rewarding part of the environment (e.g. a remote part of the sharing environment with many batteries).

A key challenge for future work is attempting to apply Empathic DQN to more complex and realistic settings, which requires replacing what is currently hand-coded with a learned pipeline, and grappling with complexities ignored in the proof-of-concept experiments. For instance, our experiments assume the learning agent is given a mechanism for identifying other agents in the environment, and for generating states that swap the robot with other agents (which involves imagining the sensor state of the robot in its new situation). This requirement is onerous, but could potentially be tackled through a combination of object-detection models (to identify other agents) and model-based RL (with a world model it may often be possible to swap the locations of agents).

An example of a complexity we currently ignore is how to learn what kind of rewards should be empathized to what kinds of agents. For example, gross physical stresses may be broadly harmful to a wide class of agents, but two people may disagree over whether a particular kind of food is disgusting or delicious, and task- and agent-specific rewards should likely be only narrowly empathized. To deal with this complexity it may be useful to extend the MDP formalism to include more granular information about rewards (e.g. beyond scalar feedback, is this reward task-specific, or does it correspond to physical harm?), or to learn to factor rewards. A complementary idea is to integrate and learn from feedback from when empathy fails (e.g. by allowing the other agent to signal when it has incurred a large negative reward), which is likely necessary to go beyond our literal formalism of the golden rule. For example, humans learn to contextualize the golden rule intelligently and flexibly, and often find failures informative.

A final thread of future work involves empathizing with multiple other agents, which brings its own complexities, especially as agents come and go from the learning agent's field of view. The initial algorithm presented here considers the interests of only a single other agent, and one simple extension would be to replace the singular other-oriented estimate with an average of other-oriented estimates for all other agents (in effect implementing an explicitly utilitarian agent). The choice of how to aggregate such estimated utilities to influence the learning agent's behavior highlights deeper possible collisions with machine ethics and moral philosophy (e.g. taking the minimum rather than the average value of others would approximate a suffering-focused utilitarianism), and we believe exploring these fields may spark further ideas and algorithms.

6   Conclusion

This paper proposed an extension to DQN, called Empathic DQN, that aims to take other agents into account to avoid inflicting negative side-effects upon them. Proof-of-concept experiments validate our approach in two gridworld environments, showing that adjusting agent selfishness can result in fewer harms and more effective resource sharing. While much work is required to scale this approach to real-world tasks, we believe that cooperative emotions like empathy and moral norms like the golden rule can provide rich inspiration for technical research into safe RL.
References

Pieter Abbeel and Andrew Y Ng. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning, page 1. ACM, 2004.

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.

Michael Anderson and Susan Leigh Anderson. Machine ethics. Cambridge University Press, 2011.

Stuart Armstrong and Benjamin Levinstein. Low impact artificial intelligences. arXiv preprint arXiv:1705.10720, 2017.

Nuttapong Chentanez, Andrew G Barto, and Satinder P Singh. Intrinsically motivated reinforcement learning. In Advances in neural information processing systems, pages 1281-1288, 2005.

Abram Demski and Scott Garrabrant. Embedded agency. arXiv preprint arXiv:1902.09469, 2019.

Tom Everitt, Gary Lea, and Marcus Hutter. AGI safety literature review. arXiv preprint arXiv:1805.01109, 2018.

Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Anca Dragan. Cooperative inverse reinforcement learning. In Advances in neural information processing systems, pages 3909-3917, 2016.

Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Russell, and Anca Dragan. Inverse reward design. In Advances in neural information processing systems, pages 6765-6774, 2017.

Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pages 4565-4573, 2016.

Hans Küng and Karl-Josef Kuschel. Global Ethic: The Declaration of the Parliament of the World's Religions. Bloomsbury Publishing, 1993.

Victoria Krakovna, Laurent Orseau, Miljan Martic, and Shane Legg. Measuring and avoiding side effects using relative reachability. arXiv preprint arXiv:1806.01186, 2018.

Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018.

Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Liviu Panait and Sean Luke. Cooperative multi-agent learning: The state of the art. Autonomous agents and multi-agent systems, 11(3):387-434, 2005.

Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob Fergus. Modeling others using oneself in multi-agent reinforcement learning. arXiv preprint arXiv:1802.09640, 2018.

William Saunders, Girish Sastry, Andreas Stuhlmueller, and Owain Evans. Trial without error: Towards safe reinforcement learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, pages 2067-2069. International Foundation for Autonomous Agents and Multiagent Systems, 2018.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Third-person imitation learning. arXiv preprint arXiv:1703.01703, 2017.

Richard S Sutton, Andrew G Barto, et al. Introduction to reinforcement learning, volume 135. MIT Press, Cambridge, 1998.

June Price Tangney, Jeff Stuewig, and Debra J Mashek. Moral emotions and moral behavior. Annu. Rev. Psychol., 58:345-372, 2007.

Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad Tadepalli. Conservative agency via attainable utility preservation. arXiv preprint arXiv:1902.09725, 2019.

Wendell Wallach and Colin Allen. Moral machines: Teaching robots right from wrong. Oxford University Press, 2008.

Alan FT Winfield, Christian Blum, and Wenguo Liu. Towards an ethical robot: internal models, consequences and ethical action selection. In Conference towards autonomous robotic systems, pages 85-96. Springer, 2014.