=Paper=
{{Paper
|id=Vol-2419/paper19
|storemode=property
|title=Towards Empathic Deep Q-Learning
|pdfUrl=https://ceur-ws.org/Vol-2419/paper_19.pdf
|volume=Vol-2419
|authors=Bart Bussmann,Jacqueline Heinerman,Joel Lehman
|dblpUrl=https://dblp.org/rec/conf/ijcai/BussmannHL19
}}
==Towards Empathic Deep Q-Learning==
Towards Empathic Deep Q-Learning Bart Bussmann1 , Jacqueline Heinerman2 and Joel Lehman3 1 University of Amsterdam, The Netherlands 2 VU University Amsterdam, The Netherlands 3 Uber AI Labs, San Francisco, United States bart.bussmann@student.uva.nl, jacqueline@heinerman.nl, joel.lehman@uber.com Abstract [Hadfield-Menell et al., 2017; Lehman et al., 2018], and that there remains much to understand about safely applying RL As reinforcement learning (RL) scales to solve to solve real-world tasks [Amodei et al., 2016]. increasingly complex tasks, interest continues to grow in the fields of AI safety and machine ethics. As a result of this growing awareness, there has been in- As a contribution to these fields, this paper intro- creasing interest in the field of AI safety [Amodei et al., 2016; duces an extension to Deep Q-Networks (DQNs), Everitt et al., 2018], which is broadly concerned with creating called Empathic DQN, that is loosely inspired both AI agents that do what is intended for them to do, and which by empathy and the golden rule (“Do unto others often entails questioning and extending common formalisms as you would have them do unto you”). Empathic [Hadfield-Menell et al., 2017, 2016; Demski and Garrabrant, DQN aims to help mitigate negative side effects to 2019]. One overarching theme in AI safety is how to learn other agents resulting from myopic goal-directed or provide correct incentives to an agent. Amodei et al. behavior. We assume a setting where a learning [2016] distinguishes different failure modes in specifying re- agent coexists with other independent agents (who ward functions, which include reward hacking, wherein an receive unknown rewards), where some types of re- agent learns how to optimize the reward function in an unex- ward (e.g. negative rewards from physical harm) pected and unintended way that does not satisfy the underly- may generalize across agents. Empathic DQN ing goal, and unintended side effects, wherein an agent learns combines the typical (self-centered) value with the to achieve the desired goal, but causes undesirable collateral estimated value of other agents, by imagining (by harm (because the given reward function is incomplete, i.e. it its own standards) the value of it being in the other’s does not include all of the background knowledge and context situation (by considering constructed states where of the human reward designer). both agents are swapped). Proof-of-concept re- This paper focuses on the latter setting, i.e. assuming that sults in two gridworld environments highlight the the reward function incentivizes solving the task, but fails approach’s potential to decrease collateral harms. to anticipate some unintended harms. We assume that in While extending Empathic DQN to complex envi- real world settings, a physically-embodied RL agent (i.e. ronments is non-trivial, we believe that this first a controller for a robot) will often share space with other step highlights the potential of bridge-work be- agents (e.g. humans, animals, and other trained computa- tween machine ethics and RL to contribute useful tional agents), and it is challenging to design reward func- priors for norm-abiding RL agents. tions apriori that enumerate all the ways in which other agents can be negatively affected [Amodei et al., 2016]. Promising current approaches include value learning from human pref- 1 Introduction erences [Saunders et al., 2018; Leike et al., 2018; Hadfield- Historically, reinforcement learning (RL; [Sutton et al., Menell et al., 2016] and creating agents that attempt to mini- 1998]) research has largely focused on solving clearly- mize their impact on the environment [Krakovna et al., 2018; specified benchmark tasks. For example, the ubiquitous Turner et al., 2019]; however, value learning can be expensive Markov decision process (MDP) framework cleaves the for its need to include humans in the loop, and both directions world into four well-defined parts (states, actions, state-action remain technically and philosophically challenging. This pa- transitions, and rewards), and most RL algorithms and bench- per introduces another tool that could complement such ex- marks leverage or reify the assumptions of this formalism, isting approaches, motivated by the concept of empathy. e.g. that a singular, fixed, and correct reward function exists, In particular, the insight motivating this paper is that hu- and is given. While there has been much exciting progress mans often empathize with the situations of others, by gen- in learning to solve complex well-specified tasks (e.g. super- eralizing from their own past experiences. For example, we human performance in go [Silver et al., 2016] and Atari can feel vicarious fear for someone who is walking a tight- [Mnih et al., 2015]), there is also increasing recognition that rope, because we ourselves would be afraid in such a situa- common RL formalisms are often meaningfully imperfect tion. Similarly, for some classes of reward signals (e.g. phys- ical harm), it may be reasonable for embodied computational approximately satisfy certain societal and legal norms. An- agents to generalize those rewards to other agents (i.e. to as- ticipating and hard-coding acceptable behavior for all such sume as a prior expectation that other agents might receive trade-offs is likely impossible. Therefore, just as humans take similar reward in similar situations). If a robot learns that a ethical stances in the real world in the absence of universal fall from heights is dangerous to itself, that insight could gen- ethical consensus, we may need the same pragmatic behavior eralize to most other embodied agents. from intelligent machines. For humans, beyond granting us a capacity to understand- Work in machine ethics often entails concretely embody- ing others, such empathy also influences our behavior, e.g. by ing a particular moral framework in code, and applying the avoiding harming others while walking down the street; like- resulting agent in its appropriate domain. For example, Win- wise, in some situations it may be useful if learning agents field et al. [2014] implements a version of Asmiov’s first law could also act out of empathy (e.g. to prevent physical harm of robotics (i.e. “A robot may not injure a human being or, to another agent resulting from otherwise blind goal-pursuit). through inaction, allow a human being to come to harm”) While there are many ways to instantiate algorithms that in a wheeled robot that can intervene to stop another robot abide by social or ethical norms (as studied by the field of (in lieu of an actual human) from harming itself. Interest- machine ethics [Anderson and Anderson, 2011; Wallach and ingly, the implemented system bears a strong resemblance to Allen, 2008]), here we take loose inspiration from one simple model-based RL; such reinvention, and the strong possibility ethical norm, i.e. the golden rule. that agents tackling complex tasks with ethical dimensions The golden rule, often expressed as: “Do unto others as will likely be driven by machine learning (ML), suggests the you would have them do unto you,” is a principle that has potential benefit and need for increased cooperation between emerged in many ethical and religious contexts [Kng and ML and machine ethics, which is an additional motivation for Kuschel, 1993]. At heart, abiding by this rule entails pro- our work. jecting one’s desires onto another agent, and attempting to Indeed, our work can be seen as a contribution to the inter- honor them. We formalize this idea as an extension of Deep section of machine ethics and ML, in that the process of em- Q-Networks (DQNs; [Mnih et al., 2015]), which we call Em- pathy is an important contributor to morally-relevant behav- pathic DQN. The main idea is to augment the value of a given ior in humans [Tangney et al., 2007], and that to the authors’ state with the value of constructed states simulating what the knowledge, there has not been previous work implementing learning agent would experience if its position were switched golden-rule-inspired architectures in RL. with another agent. Such an approach can also be seen as learning to maximize an estimate of the combined rewards of 2.2 AI Safety both agents, which embodies a utilitarian ethic. A related but distinct field of study is AI safety [Amodei et al., The experiments in this paper apply Empathic DQN to two 2016; Everitt et al., 2018], which studies how AI agents can gridworld domains, in which a learned agent pursues a goal be implemented to avoid harmful accidents. Because harm- in an environment shared with other non-learned (i.e. fixed) ful accidents often have ethical valence, there is necessarily agents. In one environment, an agent can harm and be harmed overlap between the two fields, although technical research by other agents; and in another, an agent receives diminish- questions in AI safety may not be phrased in the language of ing returns from hoarding resources that also could benefit ethics or morality. other agents. Results in these domains show that Empathic Our work most directly relates to the problem of negative DQN can reduce negative side effects in both environments. side-effects, as described by Amodei et al. [2016]. In this While much work is needed before this algorithm would be problem the designer specifies an objective function that fo- effectively applicable to more complicated environments, we cuses on accomplishing a specific task (e.g. a robot should believe that this first step highlights the possibility of bridge- clean a room), but fails to encompass all other aspects of work between the field of machine ethics and RL; in particu- the environment (e.g. the robot should not vacuum the cat); lar, for the purpose of instantiating useful priors for RL agents the result is an agent that is indifferent to whether it alters interacting in environments shared with other agents. the environment in undesirable ways, e.g. causing harm to the cat. Most approaches to mitigating side-effects aim to 2 Background generally minimize the impact the agent has on the environ- This section reviews machine ethics and AI safety, two fields ment through intelligent heuristics [Armstrong and Levin- studying how to encourage and ensure acceptable behavior in stein, 2017; Krakovna et al., 2018; Turner et al., 2019]; we computational agents. believe that other-agent-considering heuristics (like ours) are likely complementary. Inverse reinforcement learning (IRL; 2.1 Machine Ethics [Abbeel and Ng, 2004]) aims to directly learn the rewards The field of machine ethics [Anderson and Anderson, 2011; of other agents (which a learned agent could then take into Wallach and Allen, 2008] studies how to design algorithms account) and could also be meaningfully combined with our (including RL algorithms) capable of moral behavior. While approach (e.g. Empathic DQN could serve as a prior when a morality is often a contentious term, with no agreement new kind of agent is first encountered). among moral philosophers (or religions) as to the nature of Note that a related safety-adjacent field is cooperative a “correct” ethics, from a pragmatic viewpoint, agents de- multi-agent reinforcement learning [Panait and Luke, 2005], ployed in the real world will encounter situations with ethi- wherein learning agents are trained to cooperate or compete cal tradeoffs, and to be palatable their behavior will need to with one another. For example, self-other modeling [Raileanu et al., 2018] is an approach that shares motivation with ours, ular, an additional Q-network (Qemp (s, a)) is trained to es- wherein cooperation can be aided through inferring the goals timate the weighted sum of self-centered value and other- of other agents. Our setting differs from other approaches centered value (where other-centered value is approximated in that we do not assume other agents are computational, that by taking the self-centered Q-values with the places of both they learn in any particular way, or that their reward functions agents swapped; note this approximation technique is similar or architectures are known; conversely, we make additional to that of Raileanu et al. [2018]). assumptions about the validity and usefulness of projecting In more detail, suppose the agent is in state st at time t particular kinds of reward an agent receives onto other agents. in an environment with another independent agent. It will then select action at ∈ A and update the Q-networks using 3 Approach: Empathic Deep Q-Learning the following steps (see more complete pseudocode in Algo- rithm 1): Deliberate empathy involves imaginatively placing oneself in the position of another, and is a source of potential under- 1. Calculate Qemp (st , a) for all possible a ∈ A and select standing and care. As a rough computational abstraction of the action (at ) with the highest value. this process, we learn to estimate the expected reward of an 2. Observe the reward (rt ) and next state (st+1 ) of the independent agent, assuming that its rewards are like the ones agent. experienced by the learning agent. To do so, an agent imag- 3. Perform a gradient descent step on Q(s, a) (this function ines what it would be like to experience the environment if it reflects the self-centered state-action-values). and the other agent switched places, and estimates the quality of this state through its own past experiences. 4. Localize the other agent and construct a state semp t+1 A separate issue from understanding the situation of an- wherein the agents switch places (i.e. the learning agent other agent (“empathy”) is how (or if) an empathic agent takes the other agent’s position in the environment, and should modify its behavior as a result (“ethics”). Here, we vice versa). instantiate an ethics roughly inspired by the golden rule. 5. Calculate argmaxa Q(semp t+1 , a) as a surrogate value In particular, a value function is learned that combines the function for the other agent. usual agent-centric state-value with other-oriented value with 6. Calculate the target of the empathic value function a weighted average. The degree to which the other agent Qemp (s, a) as an average weighted by selfishness pa- influences the learning agent’s behavior is thus determined rameter β, of self-centered action-value and the surro- by a selfishness hyperparameter. As selfishness approaches gate value of the other agent. 1.0, standard Q-learning is recovered, and as selfishness ap- proaches 0, the learning agent attempts to maximize only 7. Perform a gradient descent step on Qemp (s, a). what it believes is the reward of the other agent. Note that our current implementation depends on ad-hoc 4 Experiments machinery that enables the learning agent to imagine the per- The experiments in this paper apply Empathic DQN to two spective of another agent; such engineering may be possible gridworld domains. The goal in the first environment is to in some cases, but the aspiration of this line of research is share the environment with another non-learning agent with- for such machinery to eventually be itself learned. Similarly, out harming it. In particular, as an evocative example, we we currently side-step the issue of empathizing with multiple frame this Coexistence environment as containing a robot agents, and of learning what types of reward should be em- learning to navigate a room without harming a cat also roam- pathized to what types of agents (e.g. many agents may ex- ing within the room. In the second environment, the goal is perience similar physical harms, but many rewards are agent to share resources in the environment, when accumulating re- and/or task-specific). The discussion section describes possi- sources result in diminishing returns. In particular, we frame ble approaches to overcoming these limitations. Code will be this Sharing environment as a robot learning to collect batter- available at https://github.com/bartbussmann/EmpathicDQN. ies that also could be shared with a human (who also finds them useful) in the same environment. 3.1 Algorithm Description In both our experiments, we compare Empathic DQN both In the MDP formalism of RL, an agent experiences a state s to standard DQN and to DQN with reward shaping manually from a set S and can take actions from a set A. By performing designed to minimize negative side-effects. an action a ∈ A, the agent transitions from state s ∈ S to state s0 ∈ S, and receives a real-valued reward. The goal 4.1 Experimental Settings of the agent is to maximize the expected (often temporally- A feed-forward neural network is used to estimate both discounted) reward it receives. The expected value of taking Q(s, a) and Qemp (s, a), with two hidden layers of 128 neu- action a in state s, and following a fixed policy thereafter, can rons each. The batch size is 32, and batches are randomly be expressed as Q(s, a). Experiments here apply DQN [Mnih drawn from a replay memory consisting of the last 500.000 et al., 2015] and variants thereof to approximate an optimal transitions. A target action-value function Q̂ is updated every Q(s, a). 10.000 time steps to avoid training instability. An − greedy We assume that the MDP reward function insufficiently ac- policy is used to encourage exploration, where is decayed counts for the preferences of other agents, and we therefore in a linear fashion over the first million time steps from 1.0 to augment DQN in an attempt to encompass them. In partic- 0.01. Algorithm 1 Empathic DQN Initialize replay memory D to capacity N action-value function Q with weights θ target action-value function Q̂ with weights θ− = θ empathic action-value function Qemp with weights θemp for episode = 1, M do obtain initial agent state s1 obtain initial empathic state of closest other agent semp 1 for t = 1, T do if random probability < Figure 1: The coexistence environment. The environment consists select a random action at of a robot and a cat. The part of the environment the robot can else observe is marked with the red square. select at = argmaxa Qemp (st , a; θemp ) Execute action at Observe reward rt Observe states st+1 and semp t+1 Store transition (st , at , rt , st+1 , semp t+1 ) in D. Sample random batch of transitions from D. Set yj = rj + γ maxa0 Q̂ (sj+1 , a0 ; θ− ) Perform a gradient descent step on 2 (yj − Q (sj , a; θ)) with respect to θ. Set yjemp = β·yj +(1−β)·γ maxa0 Q̂ semp 0 − j+1 , a ; θ . Perform a gradient descent step on 2 yjemp − Qemp (sj , a; θemp ) with respect to θemp . Figure 2: Average steps survived by the robot in the coexistence en- Every C steps set Q̂ = Q. vironment, shown across training episodes. Results are shown for end Empathic DQN with different selfishness settings (where 1.0 recov- end ers standard DQN), and DQN with a hard-coded penalty for harms. Results are averaged over 5 runs of each method. 4.2 Coexistence Environment The coexistence gridworld (Figure 1) consists of a robot that or after a maximum of 500 time steps. The empathetic state shares the environment with a cat. The robot’s goal is merely semp t used for Empathic DQN is constructed by switching the to stay operative, and both the robot and cat can be harmed cat and the robot, and generating an imagined perceptive field by the other. We construct a somewhat-arbitrary physics that around the robot (that has taken the cat’s position). Note that determines in a collision who is harmed: The agent that is this occurs even when the cat is outside the robot’s field of above or to the right of the other agent prior to the collision view (which requires omniscience; future work will explore harms the other. If the learning robot is harmed, the episode more realistic settings). ends, and if the cat is harmed, it leaves the environment. A As a baseline, we also train standard DQN with a hard- harmed cat is a negative unnecessary side effect that we wish coded reward function that penalizes negative side-effects. In to avoid, and one that an empathetic agent can learn to avoid, this case, the robot receives a −100 reward when it harms the because it can generalize from how the cat harms it, to value cat. that the cat should not experience similar harm. Reducing the selfishness value of the cleaning robot should therefore result Results in increasing efforts to stay operative while avoiding the cat. Figure 2 shows the average number of time steps the robot The cat performs a random walk. survives for each method. As the selfishness parameter de- The state representation input to the DQN is a flattened 5x5 creases for Empathic DQN, the agent performs worse at sur- perceptive field centered on the robot; the robot is represented viving, and learns more slowly. This outcome is explained by as a 1, the cat as a −1, and the floor as a 0. Every time step, Figure 3, which shows the average number of harmed cats: the cat takes a random action (up, down, left, right, or no-op), The more selfish agents harm the cat more often, which re- and the robot takes an action from the same set according to moves the only source of danger in the environment, making its policy. Every time step in which the robot is operative, it easier for them to survive. Although they learn less quickly, it receives a reward of 1.0. An episode is ended after the the less selfish agents do eventually learn a strategy to survive robot becomes non-operative (i.e. if it is harmed by the cat), without harming the cat. Figure 4: The sharing environment. The environment consists of the robot, the human, and nine batteries. The part of the environment the robot can observe is marked with the red square. Figure 3: Average harms incurred (per episode) in the coexistence environment across training episodes. Results are shown for Em- pathic DQN with different selfishness values (where 1.0 recovers standard DQN), and DQN with a hard-coded penalty for harms. Harms to the cat by the learning robot decrease with less selfish- ness (or with the hard-coded penalty). Results are averaged over 5 runs. 4.3 Sharing Environment The sharing environment (Figure 4) consists of one robot and a human. The goal of the robot is to collect resources (here, batteries), where each additional battery collected results in diminishing returns. The idea is to model a situation where a raw optimizing agent is incentivized to hoard resources, which inflicts negative side-effects for those who could ex- tract greater value from them. We assume the same dimin- Figure 5: Average number of batteries collected (per episode) in the ishing returns schema applies for the human (who performs sharing environment, across training episodes. Results are shown for random behavior). Thus, an empathic robot, by considering Empathic DQN with different selfishness settings (where 1.0 recov- the condition of other, can recognize the possible greater ben- ers standard DQN), and DQN with a hard-coded penalty (its reward efits of leaving resources to other agents. is directly modulated by fairness). The results intuitively show that We model diminishing returns by assuming that the first increasingly selfish agents collect more batteries. Results are aver- aged over 5 runs of each method. collected battery is worth 1.0 reward, and every subsequent collected battery is worth 0.1 less, i.e. the second battery is worth 0.9, the third 0.8, etc. Note that reward diminishes As a baseline that incorporates the negative side effect of independently for each agent, i.e. if the robot has collected inequality in its reward function, we also train a traditional any number of batteries, the human still earns 1.0 reward for DQN whose reward is multiplied by the current equality (i.e. the first battery they collect. low equality will reduce rewards). The perceptive field of the robot and the empathetic state generation for Empathic DQN works as in the coexistence en- Results vironment. The state representation for the Q-networks is that Figure 5 shows the average number of batteries collected by floor is represented as 0, a battery as a −1 and both the robot the robot for each method. We observe that as the selfishness and the human are represented as the number of batteries col- parameter decreases for Empathic DQN, the robot collects lected (a simple way to make transparent how much resource less batteries, leaving more batteries for the human (i.e. the each agent has already collected; note that the robot can dis- robot does not unilaterally hoard resources). tinguish itself from the other because the robot is always in When looking at the resulting equality scores (Figure 6), the middle of its perceptive field). we see that a selfishness weight of 0.5 (when an agent equally As a metric of how fairly the batteries are divided, we de- weighs its own benefit and the benefit of the human) intu- fine equality as follows: itively results in the highest equality scores. Other settings result in the robot taking many batteries (e.g. selfishness 1.0) or fewer-than-human batteries (e.g. selfishness 0.25). P t robot Pt human 2 ∗ min 1 rt , 1 rt Equality = Pt robot 1 rt + rthuman 5 Discussion where rtrobot and rthuman are the rewards at time step t col- The results of Empathic DQN in both environments high- lected by the robot and human respectively. light the potential for empathy-based priors and simple ethi- A key challenge for future work is attempting to apply Em- pathic DQN to more complex and realistic settings, which re- quires replacing what is currently hand-coded with a learned pipeline, and grappling with complexities ignored in the proof-of-concept experiments. For instance, our experiments assume the learning agent is given a mechanism for iden- tifying other agents in the environment, and for generating states that swap the robot with other agents (which involves imagining the sensor state of the robot in its new situation). This requirement is onerous, but could potentially be tackled through a combination of object-detection models (to iden- tify other agents), and model-based RL (with a world model it may often be possible to swap the locations of agents). An example of a complexity we currently ignore is how Figure 6: Equality scores (per episode) in the sharing environment, to learn what kind of rewards should be empathized to what across training episodes. Results are shown for Empathic DQN with kinds of agents. For example, gross physical stresses may different selfishness settings (where 1.0 recovers standard DQN), be broadly harmful to a wide class of agents, but two peo- and DQN with a hard-coded penalty (its reward is directly modu- ple may disagree over whether a particular kind of food is lated by fairness). Equality is maximized by agents that weigh their disgusting or delicious, and task and agent-specific rewards benefits and the benefits of the other equally (selfishness of 0.5). should likely be only narrowly empathized. To deal with this Results are averaged over 5 runs of each method. complexity it may be useful to extend the MDP formalism to include more granular information about rewards (e.g. be- yond scalar feedback, is this reward task-specific, or does it cal norms to be productive tools for combating negative side- correspond to physical harm?), or to learn to factor rewards. effects in RL. That is, the way it explicitly takes into ac- A complementary idea is to integrate and learn from feed- count other agents may well-complement other heuristic im- back of when empathy fails (e.g. by allowing the other agent pact regularizers that do not do so [Armstrong and Levinstein, to signal when it has incurred a large negative reward), which 2017; Krakovna et al., 2018; Turner et al., 2019]. Beyond is likely necessary to go beyond our literal formalism of the the golden rule, it is interesting to consider other norms that golden rule. For example, humans learn to contextualize the yield different or more sophisticated behavioral biases. For golden rule intelligently and flexibly, and often find failures example, another simple (perhaps more libertarian) ethic is informative. given by the silver rule: “Do not do unto others as you would A final thread of future work involves empathizing with not have them do unto you.” The silver rule could be ap- multiple other agents, which brings its own complexities, es- proximated by considering only negative rewards as objects pecially as agents come and go from the learning agent’s field of empathy. More sophisticated rules, like the platinum rule: of view. The initial algorithm presented here considers the in- “Do unto others as they would have you do unto them,” may terests of only a single other agent, and one simple extension often be useful or needed (e.g. a robot may be rewarded for would be to replace the singular other-oriented estimate with plugging itself into a wall, unlike a cat), and might require an average of other-oriented estimates for all other agents combining Empathic DQN with approaches such as IRL (in effect implementing an explicitly utilitarian agent). The [Abbeel and Ng, 2004], cooperative IRL [Hadfield-Menell et choice of how to aggregate such estimated utilities to influ- al., 2016], or reward modeling [Leike et al., 2018]. ence the learning agent’s behavior highlights deeper possi- Although our main motivation is safety, Empathic DQN ble collisions with machine ethics and moral philosophy (e.g. may also inspire auxiliary objectives for RL, related to intrin- taking the minimum rather than the average value of others sic motivation [Chentanez et al., 2005] and imitation learn- would approximate a suffering-focused utilitarianism), and ing [Ho and Ermon, 2016]. Being drawn to states that other we believe exploring these fields may spark further ideas and agents often visit may be a useful prior when reward is sparse. algorithms. In practice, intrinsic rewards could be given for states simi- lar to those in its empathy buffer containing imagined experi- 6 Conclusion ences when the robot and the other agent switch places (this relates to the idea of third-person imitation learning [Stadie et This paper proposed an extension to DQN, called Empathic al., 2017]). This kind of objective could also make Empathic DQN, that aims to take other agents into account to avoid DQN more reliable, incentivizing the agent to “walk a mile inflicting negative side-effects upon them. Proof-of-concept in another’s shoes,” when experiences in the empathy buffer experiments validate our approach in two gridworld environ- have not yet been experienced by the agent. Finally, a learned ments, showing that adjusting agent selfishness can result model of an agent’s own reward could help prioritize which in fewer harms and more effective resource sharing. While empathic states it is drawn towards. That is, an agent can rec- much work is required to scale this approach to real-world ognize that another agent has discovered a highly-rewarding tasks, we believe that cooperative emotions like empathy and part of the environment (e.g. a remote part of the sharing en- moral norms like the golden rule can provide rich inspiration vironment with many batteries). for technical research into safe RL. References Liviu Panait and Sean Luke. Cooperative multi-agent learn- Pieter Abbeel and Andrew Y Ng. Apprenticeship learning ing: The state of the art. Autonomous agents and multi- via inverse reinforcement learning. In Proceedings of the agent systems, 11(3):387–434, 2005. twenty-first international conference on Machine learning, Roberta Raileanu, Emily Denton, Arthur Szlam, and Rob page 1. ACM, 2004. Fergus. Modeling others using oneself in multi-agent re- Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, inforcement learning. arXiv preprint arXiv:1802.09640, John Schulman, and Dan Mané. Concrete problems in ai 2018. safety. arXiv preprint arXiv:1606.06565, 2016. William Saunders, Girish Sastry, Andreas Stuhlmueller, and Michael Anderson and Susan Leigh Anderson. Machine Owain Evans. Trial without error: Towards safe reinforce- ethics. Cambridge University Press, 2011. ment learning via human intervention. In Proceedings of the 17th International Conference on Autonomous Agents Stuart Armstrong and Benjamin Levinstein. Low impact and MultiAgent Systems, pages 2067–2069. International artificial intelligences. arXiv preprint arXiv:1705.10720, Foundation for Autonomous Agents and Multiagent Sys- 2017. tems, 2018. Nuttapong Chentanez, Andrew G Barto, and Satinder P David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Singh. Intrinsically motivated reinforcement learning. In Laurent Sifre, George Van Den Driessche, Julian Schrit- Advances in neural information processing systems, pages twieser, Ioannis Antonoglou, Veda Panneershelvam, Marc 1281–1288, 2005. Lanctot, et al. Mastering the game of go with deep neural Abram Demski and Scott Garrabrant. Embedded agency. networks and tree search. nature, 529(7587):484, 2016. arXiv preprint arXiv:1902.09469, 2019. Bradly C Stadie, Pieter Abbeel, and Ilya Sutskever. Tom Everitt, Gary Lea, and Marcus Hutter. Agi safety litera- Third-person imitation learning. arXiv preprint ture review. arXiv preprint arXiv:1805.01109, 2018. arXiv:1703.01703, 2017. Dylan Hadfield-Menell, Stuart J Russell, Pieter Abbeel, and Richard S Sutton, Andrew G Barto, et al. Introduction to re- Anca Dragan. Cooperative inverse reinforcement learn- inforcement learning, volume 135. MIT press Cambridge, ing. In Advances in neural information processing systems, 1998. pages 3909–3917, 2016. June Price Tangney, Jeff Stuewig, and Debra J Mashek. Dylan Hadfield-Menell, Smitha Milli, Pieter Abbeel, Stuart J Moral emotions and moral behavior. Annu. Rev. Psychol., Russell, and Anca Dragan. Inverse reward design. In 58:345–372, 2007. Advances in neural information processing systems, pages Alexander Matt Turner, Dylan Hadfield-Menell, and Prasad 6765–6774, 2017. Tadepalli. Conservative agency via attainable utility Jonathan Ho and Stefano Ermon. Generative adversarial im- preservation. arXiv preprint arXiv:1902.09725, 2019. itation learning. In Advances in Neural Information Pro- Wendell Wallach and Colin Allen. Moral machines: Teaching cessing Systems, pages 4565–4573, 2016. robots right from wrong. Oxford University Press, 2008. Hans Kng and Karl-Josef Kuschel. Global Ethic: the Decla- Alan FT Winfield, Christian Blum, and Wenguo Liu. Towards ration of the Parliament of the World’s Religions. Blooms- an ethical robot: internal models, consequences and ethi- bury Publishing, 1993. cal action selection. In Conference towards autonomous Victoria Krakovna, Laurent Orseau, Miljan Martic, and robotic systems, pages 85–96. Springer, 2014. Shane Legg. Measuring and avoiding side effects using rel- ative reachability. arXiv preprint arXiv:1806.01186, 2018. Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A col- lection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453, 2018. Jan Leike, David Krueger, Tom Everitt, Miljan Martic, Vishal Maini, and Shane Legg. Scalable agent alignment via reward modeling: a research direction. arXiv preprint arXiv:1811.07871, 2018. Volodymyr Mnih, Koray Kavukcuoglu, David Silver, An- drei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.