Coordination-driven learning in multi-agent problem spaces

Sean L. Barton, Nicholas R. Waytowich, and Derrik E. Asher
U.S. Army Research Laboratory, Aberdeen Proving Ground, Aberdeen, Maryland 21005

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October 2018, published at http://ceur-ws.org.

Abstract

We discuss the role of coordination as a direct learning objective in multi-agent reinforcement learning (MARL) domains. To this end, we present a novel means of quantifying coordination in multi-agent systems, and discuss the implications of using such a measure to optimize coordinated agent policies. This concept has important implications for adversary-aware RL, which we take to be a sub-domain of multi-agent learning.

1 Introduction

Modern reinforcement learning (RL) has demonstrated a number of striking achievements in the realm of intelligent behavior by leveraging the power of deep neural networks (Mnih et al. 2015). However, like any deep-learning system, RL agents are vulnerable to adversarial attacks that seek to undermine their learned behaviors (Huang et al. 2017). For RL agents to function effectively alongside humans in real-world problems, their behaviors must be resilient against such adversarial assaults.

Promisingly, there is recent evidence that deep RL agents learn policies robust to adversarial attacks at test time when they train with adversaries during learning (Behzadan and Munir 2017). This has important implications for robust deep RL, as it suggests that security against attacks can be derived from learning. Here, we build on this idea and suggest that deriving adversary-aware agents from learning is a subset of the multi-agent reinforcement learning (MARL) problem.

At the heart of this problem is the need for an individual agent to coordinate its actions with those taken by other agents (Fulda and Ventura 2007). Given the role of inter-agent coordination in MARL, we suggest that operationalizing coordination between agent actions as a direct learning objective may lead to better policies for multi-agent tasks. Here, we present a quantitative metric that can be used to measure the degree of coordination between agents over the course of learning. Further, we present a research concept for using this metric to shape agent learning towards coordinated behavior, as well as the impact that different degrees of coordination can have on multi-agent task performance.

1.1 Adversary-aware RL as MARL

Understanding adversary-aware RL agents in terms of MARL is straightforward when we consider that training in the presence of adversarial attacks is similar to training in the presence of agents pursuing competing goals. In competitive RL, outcomes are often considered zero-sum, where agents' rewards and losses are in direct opposition (Busoniu, Babuska, and De Schutter 2008; Crandall and Goodrich 2011). In the case of attacks on RL agents, the adversary's goal is typically to learn a cost function that, when optimized, minimizes the returns of the attacked agent (Pattanaik et al. 2017). Thus, the adversary's reward is the opposite of the attacked agent's.

If we take these comparisons seriously, the problem of creating adversary-aware agents is largely one of developing agents that can learn to coordinate their behaviors effectively with the actions of an adversary so as to minimize the impact of its attacks. Thus, adversary-aware RL is an inherently multi-agent problem.
1.2 Coordination in MARL

In MARL problems, the simultaneous actions of multiple actors obscure the ground truth from any individual agent. This uncertainty about the state of the world is primarily studied in terms of 1) partial observability, wherein information about a given state is only probabilistic (Omidshafiei et al. 2017), and 2) non-stationarity, where the goal of the task is "moving" with respect to any individual agent's perspective (Hernandez-Leal et al. 2017).

To the extent that uncertainty from an agent's perspective can be resolved, performance in multi-agent tasks depends critically on the degree to which agents are able to coordinate their efforts (Matignon, Laurent, and Le Fort-Piat 2012). With collaborative MARL goals, individual agents must learn policies that increase their own reward without diminishing the reward received by other agents. Simple tasks, such as matrix or climbing games, present straightforward constraints that promote the emergence of coordination between agents, as these small state-space problems make the Pareto-optimal solution readily discoverable.

Matignon, Laurent, and Le Fort-Piat (2012) enumerate the challenges for cooperative MARL, and show that no single algorithm is successful across all of them. Instead, existing algorithms tend to address specific challenges at the expense of others. Further, in more complex state spaces, Pareto-optimal solutions can be "shadowed" by individually optimal solutions that constrain learned behavior to selfish policies (Fulda and Ventura 2007). This undermines the performance gains achievable through coordinated actions in MARL problems. For these reasons, coordination between agents can only be guaranteed in limited cases where the challenges of MARL can be reasonably constrained (Lauer and Riedmiller 2000). As such, partial observability and non-stationarity are problems that must be overcome for coordination to emerge (Matignon, Laurent, and Le Fort-Piat 2012). For complex tasks, modern advances with deep neural networks have leveraged joint-action learning to overcome the inherent uncertainty of MARL (Foerster et al. 2017). Indeed, these algorithms show improved performance over decentralized and independent learning alternatives.

Though this work is promising, we recently showed that when coordination is directly measured, it cannot explain the improved performance of these algorithms in all cases (Barton et al. In Press). Coordination between learning agents, as measured by the causal influence between agent actions (method described below), was found to be almost indistinguishable from that of hard-coded agents forced to act independently. This leads to an interesting question about how to achieve coordinated actions between learning agents in real-world tasks, where there is strong interest in the deployment of RL-equipped agents.

2 Approach

A possible solution to overcome the challenges of MARL is to address coordination directly. This concept was recently put to the test in several simple competitive tasks performed by two deep RL agents (Foerster et al. 2018). The study explicitly took into account an opponent's change in learning parameters during the agent's own learning step. Accounting for opponent behavior during learning in this manner was shown to yield human-like cooperative behaviors previously unobserved in MARL agents.

In a similar thrust, we propose here that coordination should not be left to emerge from the constraints of the multi-agent task, but should instead be a direct objective of learning. This may be accomplished by including a coordination measure in the loss of a MARL agent's optimization step.

2.1 A novel measure for coordination in MARL

The first step towards optimizing coordinated behavior in MARL is to define an adequate measure of coordination. Historically, coordinated behavior has been evaluated by agent performance in tasks where cooperation is explicitly required (Lauer and Riedmiller 2000). As we showed previously, performance alone is insufficient for evaluating coordination in more complex cases, and it does not provide any new information during learning.

Fortunately, a metric borrowed from ecological research has shown promise as a quantitative measure of inter-agent coordination, independent of performance. Convergent cross mapping (CCM) quantifies the unique causal influence one time series has on another (Sugihara et al. 2012). This is accomplished by embedding each time series in its own high-dimensional attractor space, and then using the embedded data of one time series as a model for the other. Each model's accuracy is taken as a measure of the causal influence between the two time series.

In multi-agent tasks, we can define collaboration as the amount of causal influence between time series of agent actions, as measured by CCM. The advantage of this metric is that it provides a measure of coordination between agents that is independent of performance, and thus can be used as a novel training signal to optimize coordinated behavior. Coordination is then no longer exclusively an emergent property of the task, but rather a signal for driving agents' learned behavior.
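For illustration, the following is a minimal sketch of cross mapping between two one-dimensional action time series (e.g., one predator's heading and the prey's heading). It is not the authors' implementation: the names delay_embed, ccm_score, embed_dim, and lag are our own, a simplex-style exponential weighting is assumed for the neighbour model, and the convergence check over increasing library lengths that full CCM calls for is omitted.

```python
# Minimal cross-mapping sketch between two 1-D action time series (assumes numpy).
import numpy as np


def delay_embed(x, embed_dim, lag):
    """Time-delay embedding of a 1-D series into its shadow manifold."""
    n = len(x) - (embed_dim - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(embed_dim)])


def ccm_score(x, y, embed_dim=3, lag=1):
    """Cross-map skill: how well y's shadow manifold reconstructs x.

    High skill is read as x leaving a causal signature in y's dynamics.
    """
    m_y = delay_embed(np.asarray(y, dtype=float), embed_dim, lag)
    x_target = np.asarray(x, dtype=float)[(embed_dim - 1) * lag:]
    k = embed_dim + 1                             # simplex projection uses E + 1 neighbours

    x_hat = np.empty(len(m_y))
    for t in range(len(m_y)):
        dist = np.linalg.norm(m_y - m_y[t], axis=1)
        dist[t] = np.inf                          # exclude the query point itself
        nn = np.argsort(dist)[:k]                 # nearest points on y's manifold
        w = np.exp(-dist[nn] / max(dist[nn[0]], 1e-12))
        x_hat[t] = np.dot(w, x_target[nn]) / w.sum()

    return float(np.corrcoef(x_hat, x_target)[0, 1])   # skill in [-1, 1]


# Toy check: a "predator" series that loosely follows a "prey" series should score
# much higher than an unrelated series.
rng = np.random.default_rng(0)
prey = np.cumsum(rng.normal(size=500))
predator = 0.8 * np.concatenate(([0.0, 0.0], prey[:-2])) + rng.normal(scale=0.3, size=500)
print(ccm_score(prey, predator))                  # coupled: skill near 1
print(ccm_score(rng.normal(size=500), predator))  # unrelated: skill near 0
```

In the proposed paradigm, such a score would be computed over a window of recent actions for each ordered pair of agents and tracked over the course of learning.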
2.2 Coordination in an example MARL task

We propose an experimental paradigm that is designed to measure the role of coordination in a continuous cooperative/competitive task: online learning of coordination during multi-agent predator-prey pursuit. In this exemplary experiment, CCM is used as a direct learning signal that influences how agents learn to complete a cooperative task.

The task is an adaptation of discrete predator-prey pursuit (Benda 1985) into a continuous bounded 2D particle environment with three identical predator agents and a single prey agent. Predators score points each time they make contact with the prey, while the prey's points are decremented if contacted by any predator.
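For concreteness, here is a minimal sketch of such an environment, assuming a unit-square arena, velocity-style actions, and a fixed contact radius. The class and parameter names (PursuitEnv, contact_radius, max_speed) and all numeric choices are illustrative assumptions, not the authors' simulation.

```python
# Toy continuous pursuit arena: n predators and one prey in the unit square (assumes numpy).
import numpy as np


class PursuitEnv:
    def __init__(self, n_predators=3, contact_radius=0.05, max_speed=0.05, seed=0):
        self.n_agents = n_predators + 1           # prey occupies the last index
        self.contact_radius = contact_radius
        self.max_speed = max_speed
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.pos = self.rng.uniform(0.0, 1.0, size=(self.n_agents, 2))
        return self.pos.copy()

    def step(self, actions):
        """actions: (n_agents, 2) desired velocities; returns (positions, rewards)."""
        v = np.clip(actions, -self.max_speed, self.max_speed)
        self.pos = np.clip(self.pos + v, 0.0, 1.0)          # bounded arena

        dists = np.linalg.norm(self.pos[:-1] - self.pos[-1], axis=1)
        contacts = dists < self.contact_radius

        rewards = np.zeros(self.n_agents)
        rewards[:-1] = contacts.astype(float)     # +1 per predator touching the prey
        rewards[-1] = -float(contacts.sum())      # prey loses a point per contact
        return self.pos.copy(), rewards


# One random step: three predators and one prey take random moves.
env = PursuitEnv()
positions, rewards = env.step(env.rng.uniform(-0.05, 0.05, size=(4, 2)))
print(rewards)
```

Action time series logged from episodes in an environment like this are what the cross-mapping measure above would be applied to.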
Typically, agent learning would be driven solely by the environmental reward (in this case, the agent's score). With this typical framework, coordination may emerge, but it is not guaranteed (see Barton et al. In Press). In contrast, CCM provides a direct measure of inter-agent coordination, which can be used to modify agent learning by incorporating CCM as a term in the learning loss. This can be done either indirectly, as a secondary reward, or directly, as a term applied during back-propagation. Learned behavior is thus shaped by both task success and inter-agent coordination.

This paradigm provides an opportunity for coordination to be manipulated experimentally by setting a desired coordination threshold. As agents learn, they should coordinate their behaviors with their partners and/or adversaries up to this threshold. Minimizing this threshold should yield agents that optimize the task at the expense of a partner, while maximizing it would likely produce high-dimensional oscillations between agent actions that ignore task demands. Effective coordination likely lies between these extremes. Thus, we can directly observe the impact of coordinated behaviors in a MARL environment by varying this coordination threshold. To our knowledge, this has not been previously attempted.
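As a sketch of the "secondary reward" variant, the snippet below adds a coordination term to the episode return that penalizes the gap between a measured CCM value (e.g., from the ccm_score sketch above) and the desired coordination threshold. The names (coordination_bonus, shaped_return, lambda_coord) and the absolute-error penalty are assumptions used for illustration; the direct variant would instead add an analogous term to the policy loss during back-propagation.

```python
# Sketch of shaping the learning signal with a coordination term (assumes numpy).
import numpy as np


def coordination_bonus(ccm_value, coord_threshold, lambda_coord=0.1):
    """Secondary reward: penalize deviation of measured coordination from the target."""
    return -lambda_coord * abs(ccm_value - coord_threshold)


def shaped_return(task_rewards, ccm_value, coord_threshold, lambda_coord=0.1):
    """Episode return = environmental return + coordination shaping term."""
    return float(np.sum(task_rewards)) + coordination_bonus(
        ccm_value, coord_threshold, lambda_coord
    )


# Same task outcome under three coordination thresholds (selfish -> highly coupled).
episode_rewards = np.array([0.0, 1.0, 0.0, 1.0])   # e.g., one predator's contact rewards
measured_ccm = 0.4                                  # e.g., per-episode cross-map skill
for threshold in (0.0, 0.5, 1.0):
    print(threshold, shaped_return(episode_rewards, measured_ccm, threshold))
```

Sweeping the threshold in this way is what would allow the impact of different degrees of coordination on task performance to be observed directly.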
3 Implications and Discussion

Explicit coordination between agents can lead to greater success in multi-agent systems. Our concept provides a paradigm shift towards making coordination between agents an intended goal of learning. In contrast, many previous MARL approaches assume that coordination will emerge as performance is optimized. In summary, we suggest that coordination is better thought of as a necessary driver of learning, as important as (or possibly more important than) performance measures alone.

Our proposed use of CCM as a signal for inter-agent coordination provides a new source of information for learning agents that can be integrated into a compound loss function during learning. This would allow agents to learn coordinated behaviors explicitly, rather than gambling on agents discovering coordinated policies during exploration.

With the addition of coordination-driven learning, the policies an agent learns will account for adversary behavior not by chance, but by design. Such an algorithm would actively seek out policies that account for the actions of partners and competitors, limiting the policy search space to those that reason over the behavior of other agents in the system. We believe this is a reasonable avenue for more efficiently training multi-agent policies.

Driving learning with coordination also creates an opportunity to develop agents that are inherently driven to coordinate their actions with a human partner. This is important, as without such a drive it is not clear how to guarantee that humans and agents will work well together. In particular, if modeling human policies is too difficult for agents, they may settle on policies that minimize the degree of coordination in an attempt to recover some selfishly optimal behavior. Forcing coordination to be optimized during learning ensures that agents only seek out policies that are well integrated with the actions of their partners.

Our concept, as presented here, is to promote coordinated behaviors in intelligent learning agents by providing a quantitative measure of coordination that can be optimized during learning. The importance of implementing coordination to overcome adversarial attacks in the MARL problem cannot be overstated. Furthermore, an explicit drive towards coordinated behavior between intelligent agents constitutes a significant advancement within the fields of artificial intelligence and computational learning.

Acknowledgements

This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-18-2-0058. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Barton, S. L.; Waytowich, N. R.; Zaroukian, E.; and Asher, D. E. In Press. Measuring collaborative emergent behavior in multi-agent reinforcement learning. In 1st International Conference on Human Systems Engineering and Design (IHSED).

Behzadan, V., and Munir, A. 2017. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344.

Benda, M. 1985. On optimal cooperation of knowledge sources. Technical Report BCS-G2010-28.

Busoniu, L.; Babuska, R.; and De Schutter, B. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38(2).

Crandall, J. W., and Goodrich, M. A. 2011. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning 82(3):281-314.

Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.

Foerster, J.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 122-130. International Foundation for Autonomous Agents and Multiagent Systems.

Fulda, N., and Ventura, D. 2007. Predicting and preventing coordination problems in cooperative Q-learning systems. In IJCAI, volume 2007, 780-785.

Hernandez-Leal, P.; Kaisers, M.; Baarslag, T.; and de Cote, E. M. 2017. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183.

Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.

Lauer, M., and Riedmiller, M. 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning.

Matignon, L.; Laurent, G. J.; and Le Fort-Piat, N. 2012. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. The Knowledge Engineering Review 27(1):1-31.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529-533.

Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; and Vian, J. 2017. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. arXiv preprint arXiv:1703.06182.

Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; and Chowdhary, G. 2017. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632.

Sugihara, G.; May, R.; Ye, H.; Hsieh, C.-h.; Deyle, E.; Fogarty, M.; and Munch, S. 2012. Detecting causality in complex ecosystems. Science 338(6106):496-500.