Coordination-driven learning in multi-agent problem spaces

Sean L. Barton, Nicholas R. Waytowich, and Derrik E. Asher
U.S. Army Research Laboratory, Aberdeen Proving Ground, Aberdeen, Maryland 21005

Copyright © by the paper's authors. Copying permitted for private and academic purposes. In: Joseph Collins, Prithviraj Dasgupta, Ranjeev Mittu (eds.): Proceedings of the AAAI Fall 2018 Symposium on Adversary-Aware Learning Techniques and Trends in Cybersecurity, Arlington, VA, USA, 18-19 October 2018, published at http://ceur-ws.org.

Abstract

We discuss the role of coordination as a direct learning objective in multi-agent reinforcement learning (MARL) domains. To this end, we present a novel means of quantifying coordination in multi-agent systems, and discuss the implications of using such a measure to optimize coordinated agent policies. This concept has important implications for adversary-aware RL, which we take to be a sub-domain of multi-agent learning.

1 Introduction

Modern reinforcement learning (RL) has demonstrated a number of striking achievements in the realm of intelligent behavior by leveraging the power of deep neural networks (Mnih et al. 2015). However, like any deep-learning system, RL agents are vulnerable to adversarial attacks that seek to undermine their learned behaviors (Huang et al. 2017). For RL agents to function effectively alongside humans in real-world problems, their behaviors must be resilient against such adversarial assaults.

Promisingly, there is recent evidence that deep RL agents learn policies robust to adversarial attacks at test time when they train with adversaries during learning (Behzadan and Munir 2017). This has important implications for robust deep RL, as it suggests that security against attacks can be derived from learning. Here, we build on this idea and suggest that deriving adversary-aware agents from learning is a subset of the multi-agent reinforcement learning (MARL) problem.

At the heart of this problem is the need for an individual agent to coordinate its actions with those taken by other agents (Fulda and Ventura 2007). Given the role of inter-agent coordination in MARL, we suggest that operationalizing coordination between agent actions as a direct learning objective may lead to better policies for multi-agent tasks. Here, we present a quantitative metric that can be used to measure the degree of coordination between agents over the course of learning. Further, we present a research concept for using this metric to shape agent learning towards coordinated behavior, as well as the impact that different degrees of coordination can have on multi-agent task performance.

1.1 Adversary-aware RL as MARL

Understanding adversary-aware RL agents in terms of MARL is straightforward when we consider that training in the presence of adversarial attacks is similar to training in the presence of agents pursuing competing goals. In competitive RL, outcomes are often considered zero-sum, where agents' rewards and losses are in direct opposition (Busoniu, Babuska, and De Schutter 2008; Crandall and Goodrich 2011). In the case of attacks on RL agents, the adversary's goal is typically to learn a cost function that, when optimized, minimizes the returns of the attacked agent (Pattanaik et al. 2017). Thus, the adversary's reward is the opposite of the attacked agent's.

If we take these comparisons seriously, the problem of creating adversary-aware agents is largely one of developing agents that can learn to coordinate their behaviors effectively with the actions of an adversary so as to minimize the impact of its attacks. Thus, adversary-aware RL is an inherently multi-agent problem.
1.2 Coordination in MARL

In MARL problems, the simultaneous actions of multiple actors obscure the ground truth from any individual agent. This uncertainty about the state of the world is primarily studied in terms of 1) partial observability, wherein information about a given state is only probabilistic (Omidshafiei et al. 2017), and 2) non-stationarity, where the goal of the task is "moving" with respect to any individual agent's perspective (Hernandez-Leal et al. 2017).

To the extent that uncertainty from an agent's perspective can be resolved, performance in multi-agent tasks depends critically on the degree to which agents are able to coordinate their efforts (Matignon, Laurent, and Le Fort-Piat 2012). With collaborative MARL goals, individual agents must learn policies that increase their own reward without diminishing the reward received by other agents. Simple tasks, such as matrix or climbing games, present straightforward constraints that promote the emergence of coordination between agents, as these small state-space problems make the Pareto-optimal solution readily discoverable.

Matignon, Laurent, and Le Fort-Piat (2012) enumerate the challenges for cooperative MARL, and show that no single algorithm is successful across all of them. Instead, existing algorithms tend to address specific challenges at the expense of others. Further, in more complex state spaces, Pareto-optimal solutions can be "shadowed" by individually optimal solutions that constrain learned behavior to selfish policies (Fulda and Ventura 2007). This undermines the performance gains achievable through coordinated actions in MARL problems. For these reasons, coordination between agents can only be guaranteed in limited cases where the challenges of MARL can be reasonably constrained (Lauer and Riedmiller 2000). As such, partial observability and non-stationarity are problems that must be overcome for coordination to emerge (Matignon, Laurent, and Le Fort-Piat 2012). For complex tasks, modern advances with deep neural networks have leveraged joint-action learning to overcome the inherent uncertainty of MARL (Foerster et al. 2017). Indeed, these algorithms show improved performance over decentralized and independent learning alternatives.

Though this work is promising, we recently showed that when coordination is directly measured, it cannot explain the improved performance of these algorithms in all cases (Barton et al. In Press). Coordination between learning agents, as measured by the causal influence between agent actions (method described below), was found to be almost indistinguishable from that of hard-coded agents forced to act independently. This leads to an interesting question about how to achieve coordinated actions between learning agents in real-world tasks, where there is strong interest in the deployment of RL-equipped agents.

2 Approach

A possible solution to overcome the challenges of MARL is to address coordination directly. This concept was recently put to the test in several simple competitive tasks performed by two deep RL agents (Foerster et al. 2018). The study explicitly took into account an opponent's change in learning parameters during the agent's own learning step. Accounting for opponent behavior during learning in this manner was shown to yield human-like cooperative behaviors previously unobserved in MARL agents.

In a similar thrust, we propose here that coordination should not be left to emerge from the constraints of the multi-agent task, but should instead be a direct objective of learning. This may be accomplished by including a coordination measure in the loss of a MARL agent's optimization step.

2.1 A novel measure for coordination in MARL

The first step towards optimizing coordinated behavior in MARL is to define an adequate measure of coordination. Historically, coordinated behavior has been evaluated by agent performance in tasks where cooperation is explicitly required (Lauer and Riedmiller 2000). As we showed previously, performance alone is insufficient for evaluating coordination in more complex cases, and it does not provide any new information during learning.

Fortunately, a metric borrowed from ecological research has shown promise as a quantitative measure of inter-agent coordination, independent of performance. Convergent cross mapping (CCM) quantifies the unique causal influence one time series has on another (Sugihara et al. 2012). This is accomplished by embedding each time series in its own high-dimensional attractor space, and then using the embedded data of one time series as a model for the other. Each model's accuracy is taken as a measure of the causal influence between the two time series.

In multi-agent tasks, we can define collaboration as the amount of causal influence between time series of agent actions, as measured by CCM. The advantage of this metric is that it provides a measure of coordination between agents that is independent of performance, and thus can be used as a novel training signal to optimize coordinated behavior. Coordination is then no longer exclusively an emergent property of the task, but rather a signal for driving agents' learned behavior.
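For illustration, the following is a minimal sketch of cross mapping between two one-dimensional action time series (e.g., one predator's heading and the prey's heading). It is not the authors' implementation: the names delay_embed, ccm_score, embed_dim, and lag are our own, a simplex-style exponential weighting is assumed for the neighbour model, and the convergence check over increasing library lengths that full CCM calls for is omitted.

```python
# Minimal cross-mapping sketch between two 1-D action time series (assumes numpy).
import numpy as np


def delay_embed(x, embed_dim, lag):
    """Time-delay embedding of a 1-D series into its shadow manifold."""
    n = len(x) - (embed_dim - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(embed_dim)])


def ccm_score(x, y, embed_dim=3, lag=1):
    """Cross-map skill: how well y's shadow manifold reconstructs x.

    High skill is read as x leaving a causal signature in y's dynamics.
    """
    m_y = delay_embed(np.asarray(y, dtype=float), embed_dim, lag)
    x_target = np.asarray(x, dtype=float)[(embed_dim - 1) * lag:]
    k = embed_dim + 1                             # simplex projection uses E + 1 neighbours

    x_hat = np.empty(len(m_y))
    for t in range(len(m_y)):
        dist = np.linalg.norm(m_y - m_y[t], axis=1)
        dist[t] = np.inf                          # exclude the query point itself
        nn = np.argsort(dist)[:k]                 # nearest points on y's manifold
        w = np.exp(-dist[nn] / max(dist[nn[0]], 1e-12))
        x_hat[t] = np.dot(w, x_target[nn]) / w.sum()

    return float(np.corrcoef(x_hat, x_target)[0, 1])   # skill in [-1, 1]


# Toy check: a "predator" series that loosely follows a "prey" series should score
# much higher than an unrelated series.
rng = np.random.default_rng(0)
prey = np.cumsum(rng.normal(size=500))
predator = 0.8 * np.concatenate(([0.0, 0.0], prey[:-2])) + rng.normal(scale=0.3, size=500)
print(ccm_score(prey, predator))                  # coupled: skill near 1
print(ccm_score(rng.normal(size=500), predator))  # unrelated: skill near 0
```

In the proposed paradigm, such a score would be computed over a window of recent actions for each ordered pair of agents and tracked over the course of learning.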
2.2 Coordination in an example MARL task

We propose an experimental paradigm that is designed to measure the role of coordination in a continuous cooperative/competitive task: online learning of coordination during multi-agent predator-prey pursuit. In this exemplary experiment, CCM is used as a direct learning signal that influences how agents learn to complete a cooperative task.

The task is an adaptation of discrete predator-prey pursuit (Benda 1985) into a continuous bounded 2D particle environment with three identical predator agents and a single prey agent. Predators score points each time they make contact with the prey, while the prey's points are decremented if contacted by any predator.
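For concreteness, here is a minimal sketch of such an environment, assuming a unit-square arena, velocity-style actions, and a fixed contact radius. The class and parameter names (PursuitEnv, contact_radius, max_speed) and all numeric choices are illustrative assumptions, not the authors' simulation.

```python
# Toy continuous pursuit arena: n predators and one prey in the unit square (assumes numpy).
import numpy as np


class PursuitEnv:
    def __init__(self, n_predators=3, contact_radius=0.05, max_speed=0.05, seed=0):
        self.n_agents = n_predators + 1           # prey occupies the last index
        self.contact_radius = contact_radius
        self.max_speed = max_speed
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.pos = self.rng.uniform(0.0, 1.0, size=(self.n_agents, 2))
        return self.pos.copy()

    def step(self, actions):
        """actions: (n_agents, 2) desired velocities; returns (positions, rewards)."""
        v = np.clip(actions, -self.max_speed, self.max_speed)
        self.pos = np.clip(self.pos + v, 0.0, 1.0)          # bounded arena

        dists = np.linalg.norm(self.pos[:-1] - self.pos[-1], axis=1)
        contacts = dists < self.contact_radius

        rewards = np.zeros(self.n_agents)
        rewards[:-1] = contacts.astype(float)     # +1 per predator touching the prey
        rewards[-1] = -float(contacts.sum())      # prey loses a point per contact
        return self.pos.copy(), rewards


# One random step: three predators and one prey take random moves.
env = PursuitEnv()
positions, rewards = env.step(env.rng.uniform(-0.05, 0.05, size=(4, 2)))
print(rewards)
```

Action time series logged from episodes in an environment like this are what the cross-mapping measure above would be applied to.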
Typically, agent learning would be driven solely by the environmental reward (in this case, the agent's score). With this typical framework, coordination may emerge, but it is not guaranteed (see Barton et al. In Press). In contrast, CCM provides a direct measure of inter-agent coordination, which can be used to modify agent learning by incorporating CCM as a term in the learning loss. This can be done either indirectly, as a secondary reward, or directly, as a term applied during back-propagation. Learned behavior is thus shaped by both task success and inter-agent coordination.

This paradigm provides an opportunity for coordination to be manipulated experimentally by setting a desired coordination threshold. As agents learn, they should coordinate their behaviors with their partners and/or adversaries up to this threshold. Minimizing this threshold should yield agents that optimize the task at the expense of a partner, while maximizing it would likely produce high-dimensional oscillations between agent actions that ignore task demands. Effective coordination likely lies between these extremes. Thus, we can directly observe the impact of coordinated behaviors in a MARL environment by varying this coordination threshold. To our knowledge, this has not been previously attempted.
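As a sketch of the "secondary reward" variant, the snippet below adds a coordination term to the episode return that penalizes the gap between a measured CCM value (e.g., from the ccm_score sketch above) and the desired coordination threshold. The names (coordination_bonus, shaped_return, lambda_coord) and the absolute-error penalty are assumptions used for illustration; the direct variant would instead add an analogous term to the policy loss during back-propagation.

```python
# Sketch of shaping the learning signal with a coordination term (assumes numpy).
import numpy as np


def coordination_bonus(ccm_value, coord_threshold, lambda_coord=0.1):
    """Secondary reward: penalize deviation of measured coordination from the target."""
    return -lambda_coord * abs(ccm_value - coord_threshold)


def shaped_return(task_rewards, ccm_value, coord_threshold, lambda_coord=0.1):
    """Episode return = environmental return + coordination shaping term."""
    return float(np.sum(task_rewards)) + coordination_bonus(
        ccm_value, coord_threshold, lambda_coord
    )


# Same task outcome under three coordination thresholds (selfish -> highly coupled).
episode_rewards = np.array([0.0, 1.0, 0.0, 1.0])   # e.g., one predator's contact rewards
measured_ccm = 0.4                                  # e.g., per-episode cross-map skill
for threshold in (0.0, 0.5, 1.0):
    print(threshold, shaped_return(episode_rewards, measured_ccm, threshold))
```

Sweeping the threshold in this way is what would allow the impact of different degrees of coordination on task performance to be observed directly.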
3 Implications and Discussion

Explicit coordination between agents can lead to greater success in multi-agent systems. Our concept provides a paradigm shift towards making coordination between agents an intended goal of learning. In contrast, many previous MARL approaches assume that coordination will emerge as performance is optimized. In summary, we suggest that coordination is better thought of as a necessary driver of learning, as important as (or possibly more important than) performance measures alone.

Our proposed use of CCM as a signal for inter-agent coordination provides a new source of information for learning agents that can be integrated into a compound loss function during learning. This would allow agents to learn coordinated behaviors explicitly, rather than gambling on agents discovering coordinated policies during exploration.

With the addition of coordination-driven learning, the policies an agent learns will account for adversary behavior not by chance, but by design. Such an algorithm would actively seek out policies that account for the actions of partners and competitors, limiting the policy search space to those that reason over the behavior of other agents in the system. We believe this is a reasonable avenue for more efficiently training multi-agent policies.

Driving learning with coordination also creates an opportunity to develop agents that are inherently driven to coordinate their actions with a human partner. This is important, as without such a drive it is not clear how to guarantee that humans and agents will work well together. In particular, if modeling human policies is too difficult for agents, they may settle on policies that minimize the degree of coordination in an attempt to recover some selfishly optimal behavior. Forcing coordination to be optimized during learning ensures that agents only seek out policies that are well integrated with the actions of their partners.

Our concept, as presented here, is to promote coordinated behaviors in intelligent learning agents by providing a quantitative measure of coordination that can be optimized during learning. The importance of implementing coordination to overcome adversarial attacks in the MARL problem cannot be overstated. Furthermore, an explicit drive towards coordinated behavior between intelligent agents constitutes a significant advancement within the fields of artificial intelligence and computational learning.

Acknowledgements

This research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-18-2-0058. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation herein.

References

Barton, S. L.; Waytowich, N. R.; Zaroukian, E.; and Asher, D. E. In Press. Measuring collaborative emergent behavior in multi-agent reinforcement learning. In 1st International Conference on Human Systems Engineering and Design (IHSED).

Behzadan, V., and Munir, A. 2017. Whatever does not kill deep reinforcement learning, makes it stronger. arXiv preprint arXiv:1712.09344.

Benda, M. 1985. On optimal cooperation of knowledge sources. Technical Report BCS-G2010-28.

Busoniu, L.; Babuska, R.; and De Schutter, B. 2008. A comprehensive survey of multiagent reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 38(2).

Crandall, J. W., and Goodrich, M. A. 2011. Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning. Machine Learning 82(3):281-314.

Foerster, J.; Farquhar, G.; Afouras, T.; Nardelli, N.; and Whiteson, S. 2017. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926.

Foerster, J.; Chen, R. Y.; Al-Shedivat, M.; Whiteson, S.; Abbeel, P.; and Mordatch, I. 2018. Learning with opponent-learning awareness. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, 122-130. International Foundation for Autonomous Agents and Multiagent Systems.

Fulda, N., and Ventura, D. 2007. Predicting and preventing coordination problems in cooperative Q-learning systems. In IJCAI, volume 2007, 780-785.

Hernandez-Leal, P.; Kaisers, M.; Baarslag, T.; and de Cote, E. M. 2017. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183.

Huang, S.; Papernot, N.; Goodfellow, I.; Duan, Y.; and Abbeel, P. 2017. Adversarial attacks on neural network policies. arXiv preprint arXiv:1702.02284.

Lauer, M., and Riedmiller, M. 2000. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning.

Matignon, L.; Laurent, G. J.; and Le Fort-Piat, N. 2012. Independent reinforcement learners in cooperative Markov games: A survey regarding coordination problems. The Knowledge Engineering Review 27(1):1-31.

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; Petersen, S.; Beattie, C.; Sadik, A.; Antonoglou, I.; King, H.; Kumaran, D.; Wierstra, D.; Legg, S.; and Hassabis, D. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529-533.

Omidshafiei, S.; Pazis, J.; Amato, C.; How, J. P.; and Vian, J. 2017. Deep decentralized multi-task multi-agent reinforcement learning under partial observability. arXiv preprint arXiv:1703.06182.

Pattanaik, A.; Tang, Z.; Liu, S.; Bommannan, G.; and Chowdhary, G. 2017. Robust deep reinforcement learning with adversarial attacks. arXiv preprint arXiv:1712.03632.

Sugihara, G.; May, R.; Ye, H.; Hsieh, C.-h.; Deyle, E.; Fogarty, M.; and Munch, S. 2012. Detecting causality in complex ecosystems. Science 338(6106):496-500.