<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Coordination-driven learning in multi-agent problem spaces</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name><given-names>Sean L.</given-names> <surname>Barton</surname></string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Nicholas R.</given-names> <surname>Waytowich</surname></string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <contrib contrib-type="author">
          <string-name><given-names>Derrik E.</given-names> <surname>Asher</surname></string-name>
          <xref ref-type="aff" rid="aff0"/>
        </contrib>
        <aff id="aff0">
          <institution>U.S. Army Research Laboratory</institution>, <addr-line>Aberdeen Proving Ground, Maryland 21005</addr-line>, <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2018</year>
      </pub-date>
      <fpage>18</fpage>
      <lpage>19</lpage>
      <abstract>
        <p>We discuss the role of coordination as a direct learning objective in multi-agent reinforcement learning (MARL) domains. To this end, we present a novel means of quantifying coordination in multi-agent systems, and discuss the implications of using such a measure to optimize coordinated agent policies. This concept has important implications for adversary-aware RL, which we take to be a sub-domain of multi-agent learning.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        Modern reinforcement learning (RL) has demonstrated a
number of striking achievements in the realm of intelligent
behavior by leveraging the power of deep neural networks
        <xref ref-type="bibr" rid="ref15">(Mnih et al. 2015)</xref>
        . However, like any deep-learning system,
RL agents are vulnerable to adversarial attacks that seek to
undermine their learned behaviors
        <xref ref-type="bibr" rid="ref11">(Huang et al. 2017)</xref>
        . For
RL agents to function effectively alongside humans in
real-world problems, their behaviors must be resilient
against such adversarial attacks.
      </p>
      <p>
        Promisingly, recent evidence shows that deep RL
agents learn policies robust to adversarial attacks at test time
when they train with adversaries during learning
        <xref ref-type="bibr" rid="ref2">(Behzadan
and Munir 2017)</xref>
        . This has important implications for robust
deep RL, as it suggests that security against attacks can be
derived from learning. Here, we build on this idea and
suggest that deriving adversary-aware agents from learning is
a subset of the multi-agent reinforcement learning (MARL)
problem.
      </p>
      <p>
        At the heart of this problem is the need for an
individual agent to coordinate its actions with those taken by other
agents
        <xref ref-type="bibr" rid="ref9">(Fulda and Ventura 2007)</xref>
        . Given the role of
inter-agent coordination in MARL, we suggest that
operationalizing coordination between agent actions as a direct learning
objective may lead to better policies for multi-agent tasks.
Here, we present a quantitative metric that can be used to
measure the degree of coordination between agents over the
course of learning. Further, we present a research concept
for using this metric to shape agent learning towards
coordinated behavior, and discuss the impact that different degrees
of coordination can have on multi-agent task performance.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1.1 Adversary-aware RL as MARL</title>
      <p>
        Understanding adversary-aware RL agents in terms of
MARL is straightforward when we consider that training in
the presence of adversarial attacks is similar to training in the
presence of agents pursuing competing goals. In competitive
RL, outcomes are often considered zero-sum, with agents’
rewards and losses in direct opposition
        <xref ref-type="bibr" rid="ref4 ref6">(Busoniu, Babuska, and
De Schutter 2008; Crandall and Goodrich 2011)</xref>
        . In the case
of attacks on RL agents, the adversary’s goal is typically to
learn a cost function that, when optimized, minimizes the
returns of the attacked agent
        <xref ref-type="bibr" rid="ref18">(Pattanaik et al. 2017)</xref>
        . Thus, the
adversary’s reward is the opposite of the attacked agent’s.
      </p>
      <p>If we take these comparisons seriously, the problem of
creating adversary-aware agents is largely one of
developing agents that can learn to coordinate their behaviors
effectively with the actions of an adversary so as to minimize the
impact of its attacks. Thus, adversary-aware RL is an
inherently multi-agent problem.</p>
    </sec>
    <sec id="sec-3">
      <title>1.2 Coordination in MARL</title>
      <p>
        In MARL problems, the simultaneous actions of multiple
actors obscure the ground-truth state from any individual agent.
This uncertainty about the state of the world is primarily
studied in terms of 1) partial observability, wherein
information about a given state is only probabilistic
        <xref ref-type="bibr" rid="ref16">(Omidshafiei et al. 2017)</xref>
        , and 2) non-stationarity, wherein the goal of
the task is “moving” with respect to any individual agent’s
perspective
        <xref ref-type="bibr" rid="ref10">(Hernandez-Leal et al. 2017)</xref>
        .
      </p>
      <p>
        To the extent that uncertainty from an agent’s
perspective can be resolved, performance in multi-agent tasks
depends critically on the degree to which agents are able to
coordinate their efforts
        <xref ref-type="bibr" rid="ref13">(Matignon, Laurent, and Le Fort-Piat
2012)</xref>
        . With collaborative MARL goals, individual agents
must learn policies that increase their own reward without
diminishing the reward received by other agents. Simple
tasks, such as matrix or climbing games, present
straightforward constraints that promote the emergence of
coordination between agents, as these small state-space problems
make the Pareto-optimal solution readily discoverable.
      </p>
      <p>
        <xref ref-type="bibr" rid="ref13">Matignon et al. (2012)</xref>
        enumerate the challenges of
cooperative MARL and show that no single algorithm
succeeds across all of them. Instead, existing
algorithms tend to address specific challenges at the expense
of others. Further, in more complex state-spaces,
Pareto-optimal solutions can be “shadowed” by individually
optimal solutions that constrain learned behavior to selfish
policies
        <xref ref-type="bibr" rid="ref9">(Fulda and Ventura 2007)</xref>
        . This undermines the
performance gains achievable through coordinated actions in
MARL problems. For these reasons, coordination between
agents can only be guaranteed in limited cases where the
challenges of MARL can be reasonably constrained
        <xref ref-type="bibr" rid="ref12">(Lauer
and Riedmiller 2000)</xref>
        . As such, partial observability and
non-stationarity are problems that must be overcome for
coordination to emerge
        <xref ref-type="bibr" rid="ref13">(Matignon, Laurent, and Le
FortPiat 2012)</xref>
        . For complex tasks, modern advances with DNNs
have leveraged joint action learning to overcome the
inherent uncertainty of MARL
        <xref ref-type="bibr" rid="ref7">(Foerster et al. 2017)</xref>
        . Indeed, these
algorithms show improved performance over decentralized
and independent learning alternatives.
      </p>
      <p>Though this work is promising, we recently showed that
when coordination is directly measured, it cannot explain the
improved performance of these algorithms in all cases
<xref ref-type="bibr" rid="ref1">(Barton et al. In Press)</xref>. Coordination between agents, as
measured by the causal influence between agent actions (method
described below), was found to be almost indistinguishable
from hard-coded agents forced to act independently. This
leads to an interesting question about how to achieve
coordinated actions between learning agents in real-world tasks
where there is strong interest in the deployment of
RL-equipped agents.</p>
      <sec id="sec-3-1">
        <title>2 Approach</title>
        <p>
          A possible solution to overcome the challenges of MARL is
to address coordination directly. This concept was recently
put to the test in several simple competitive tasks
performed by two deep RL agents
          <xref ref-type="bibr" rid="ref8">(Foerster et al. 2018)</xref>
          . The
study’s learning rule explicitly accounted for an opponent’s
change in parameters during the agent’s own learning step.
Accounting for opponent behavior during learning in this manner
was shown to yield human-like cooperative behaviors
previously unobserved in MARL agents.
        </p>
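        <p>As a rough Python sketch of this kind of opponent-aware update (our illustration only: the differentiable value estimates V1 and V2, the step sizes, and the function name are assumptions, not the published algorithm):</p>
        <preformat>
import torch

def opponent_aware_step(theta1, theta2, V1, V2, eta=0.1, lr=0.01):
    # anticipate the opponent's naive gradient step on its own value estimate
    grad2 = torch.autograd.grad(V2(theta1, theta2), theta2, create_graph=True)[0]
    theta2_lookahead = theta2 + eta * grad2
    # differentiate our own value through the opponent's anticipated update
    grad1 = torch.autograd.grad(V1(theta1, theta2_lookahead), theta1)[0]
    return (theta1 + lr * grad1).detach().requires_grad_()
        </preformat>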
        <p>In a similar vein, we propose here that coordination
should not be left to emerge from the constraints on the
multi-agent task, but instead be a direct objective of
learning. This may be accomplished by providing a coordination
measure in the loss of a MARL agent’s optimization step.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>2.1 A novel measure for coordination in MARL</title>
      <p>
        The first step towards optimizing coordinated behavior in
MARL is to define an adequate measure of coordination.
Historically, coordinated behavior has been evaluated by
agent performance in tasks where cooperation is explicitly
required
        <xref ref-type="bibr" rid="ref12">(Lauer and Riedmiller 2000)</xref>
        . As we showed
previously, performance alone is insufficient for evaluating
coordination in more complex cases, and does not provide any
new information during learning.
      </p>
      <p>
        Fortunately, a metric borrowed from ecological research
has shown promise as a quantitative measure of inter-agent
coordination, independent of performance. Convergent cross
mapping (CCM) quantifies the unique causal influence one
time-series has on another
        <xref ref-type="bibr" rid="ref19">(Sugihara et al. 2012)</xref>
        . This is
accomplished by embedding each time-series in its own
high-dimensional attractor space, and then using the
embedded data of one time-series as a model for the other. Each
model’s accuracy is taken as a measure of the causal
influence between the two time-series.
      </p>
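      <p>For concreteness, the following minimal Python sketch computes a simplified cross-map score between two time-series; the embedding dimension, lag, neighbor count, and weighting scheme are illustrative assumptions rather than settings prescribed by the CCM literature.</p>
      <preformat>
import numpy as np

def delay_embed(x, dim=3, lag=1):
    # build the delay-coordinate "shadow" attractor of a 1-D series
    n = len(x) - (dim - 1) * lag
    return np.column_stack([x[i * lag:i * lag + n] for i in range(dim)])

def ccm_score(x, y, dim=3, lag=1, k=4):
    # cross-map: use neighbors on y's attractor to predict x
    Mx, My = delay_embed(x, dim, lag), delay_embed(y, dim, lag)
    preds = []
    for t in range(len(My)):
        d = np.linalg.norm(My - My[t], axis=1)
        d[t] = np.inf                      # exclude the point itself
        nbrs = np.argsort(d)[:k]           # k nearest neighbors on My
        w = np.exp(-d[nbrs] / (d[nbrs].min() + 1e-12))
        preds.append(np.sum(w * Mx[nbrs, 0]) / w.sum())
    # correlation between observed and cross-mapped values
    return np.corrcoef(Mx[:, 0], np.array(preds))[0, 1]
      </preformat>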
      <p>In multi-agent tasks, we can define collaboration to be the
amount of causal influence between time-series of agent
actions, as measured by CCM. The advantage of this metric
is that it provides a measure of coordination between agents
that is independent of performance, and thus can be used
as a novel training signal to optimize coordinated
behavior. Thus, coordination is no longer exclusively an emergent
property of the task, but rather a signal for driving agents’
learned behavior.</p>
    </sec>
    <sec id="sec-5">
      <title>2.2 Coordination in an example MARL task</title>
      <p>We propose an experimental paradigm that is designed to
measure the role of coordination in a continuous
cooperative/competitive task: online learning of coordination during
multi-agent predator-prey pursuit. In this example
experiment, CCM is used as a direct learning signal that influences
how agents learn to complete a cooperative task.</p>
      <p>
        The task is an adaptation of discrete predator-prey pursuit
        <xref ref-type="bibr" rid="ref3">(Benda 1985)</xref>
        into a continuous bounded 2D particle
environment with three identical predator agents and a single
prey agent. Predators score points each time they make
contact with the prey, while the prey’s points are decremented if
contacted by any predator.
      </p>
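      <p>A minimal sketch of such an environment might look as follows; the state layout, movement scale, contact radius, and reward magnitudes are our assumptions for illustration.</p>
      <preformat>
import numpy as np

class PursuitEnv:
    # continuous bounded 2-D particle world: three predators, one prey
    def __init__(self, contact_radius=0.05):
        self.r = contact_radius
        self.pos = np.random.uniform(-1, 1, size=(4, 2))  # rows 0-2 predators, row 3 prey

    def step(self, actions):
        # actions: (4, 2) array of velocity commands; positions clipped to the unit box
        self.pos = np.clip(self.pos + 0.05 * actions, -1.0, 1.0)
        dists = np.linalg.norm(self.pos[:3] - self.pos[3], axis=1)
        contact = dists < self.r
        rewards = np.concatenate([contact.astype(float),   # +1 per predator contact
                                  [-contact.sum()]])       # prey loses a point per contact
        return self.pos.copy(), rewards
      </preformat>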
      <p>Typically, agent learning would be driven solely by the
environmental reward (in this case, agent score). With this
typical framework, coordination may emerge, but is not
guaranteed (see <xref ref-type="bibr" rid="ref1">Barton et al. In Press</xref>). In contrast, CCM provides
a direct measure of inter-agent coordination, which can be
used to modify agent learning through the incorporation of
CCM as a term in learning loss. This can be done either
indirectly as a secondary reward or directly as a term applied
during back-propagation. Thus, learned behavior is shaped
by both task success and inter-agent coordination.</p>
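      <p>Of these two routes, the secondary-reward variant is the simpler to sketch; the direct back-propagation route would additionally require a differentiable surrogate of CCM, which we do not attempt here. The weighting below is an illustrative assumption.</p>
      <preformat>
def shaped_reward(env_reward, ccm_score, weight=0.5):
    # indirect route: treat measured inter-agent coordination (CCM)
    # as a secondary reward added to the environmental score
    return env_reward + weight * ccm_score
      </preformat>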
      <p>This paradigm provides an opportunity for coordination
to be manipulated experimentally by setting a desired
coordination threshold. As agents learn, they should coordinate
their behaviors with their partners and/or adversaries up to
this threshold. Minimizing this threshold should yield agents
that optimize the task at the expense of a partner, while
maximizing this threshold would likely produce
high-dimensional oscillations between agent actions that ignore task
demands. Effective coordination likely lies between these
extremes. Thus, we can directly observe the impact of
coordinated behaviors in a MARL environment by varying this
coordination threshold. To our knowledge, this has not been
previously attempted.</p>
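      <p>One hedged way to realize such a threshold is as a target coordination level with a deviation penalty; the quadratic penalty and weighting below are assumptions for illustration.</p>
      <preformat>
def threshold_shaped_reward(env_reward, ccm_score, threshold, weight=0.5):
    # penalize deviation from the experimenter-set coordination threshold;
    # low thresholds discourage coordination, high thresholds demand it
    return env_reward - weight * (ccm_score - threshold) ** 2
      </preformat>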
      <sec id="sec-5-1">
        <title>3 Implications and Discussion</title>
        <p>Explicit coordination between agents can lead to greater
success in multi-agent systems. Our concept provides a
paradigm shift towards making coordination between agents
an intended goal of learning. In contrast, many previous
MARL approaches assume that coordination will emerge as
performance is optimized. In summary, we suggest that
coordination is better thought of as a necessary driver of
learning, as important as (or possibly more important than)
performance measures alone.</p>
        <p>Our proposed use of CCM as a signal for inter-agent
coordination provides a new source of information for learning
agents that can be integrated into a compound loss function
during learning. This would allow agents to learn
coordinated behaviors explicitly, rather than gambling on agents
discovering coordinated policies during exploration.</p>
        <p>With the addition of coordination driven learning, the
policies an agent learns will not take into account adversary
behavior by chance, but rather by design. Such an algorithm
would actively seek out policies that account for the actions
of partners and competitors, limiting the policy search space
to those that reason over the behavior of other agents in the
system. We believe this is a reasonable avenue for more
efficiently training multi-agent policies.</p>
        <p>Driving learning with coordination creates an opportunity
for the development of agents that are inherently determined
to coordinate their actions with a human partner. This is
important, as without such a drive it is not clear how to
guarantee that humans and agents will work well together. In
particular, if modeling of human policies is too difficult for agents,
they may settle on policies that try to minimize the degree
of coordination in an attempt to recover some selfishly
optimal behavior. Forcing coordination to be optimized during
learning ensures that agents only seek out policies that are
well integrated with the actions of their partners.</p>
        <p>Our concept, as presented here, is to promote coordinated
behaviors in intelligent learning agents by providing a
quantitative measure of coordination that can be optimized
during learning. The importance of implementing coordination
to overcome adversarial attacks in the MARL problem
cannot be overstated. Furthermore, an explicit drive towards
coordinated behavior between intelligent agents constitutes
a significant advancement within the fields of artificial
intelligence and computational learning.</p>
        <p>Acknowledgements This research was sponsored by the
Army Research Laboratory and was accomplished under
Cooperative Agreement Number W911NF-18-2-0058. The
views and conclusions contained in this document are those
of the authors and should not be interpreted as
representing the official policies, either expressed or implied, of
the Army Research Laboratory or the U.S. Government.
The U.S. Government is authorized to reproduce and
distribute reprints for Government purposes notwithstanding
any copyright notation herein.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Barton</surname>
            ,
            <given-names>S. L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Waytowich</surname>
            ,
            <given-names>N. R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zaroukian</surname>
          </string-name>
          , E.; and
          <string-name>
            <surname>Asher</surname>
          </string-name>
          , D. E. In Press.
          <article-title>Measuring collaborative emergent behavior in multi-agent reinforcement learning</article-title>
          .
          <source>In 1st International Conference on Human Systems Engineering and Design</source>
          . IHSED.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Behzadan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Munir</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Whatever does not kill deep reinforcement learning, makes it stronger</article-title>
          .
          <source>arXiv preprint arXiv:1712</source>
          .
          <fpage>09344</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Benda</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>1985</year>
          .
          <article-title>On optimal cooperation of knolwedge sources</article-title>
          .
          <source>Technical Report BCS-G2010-28.</source>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Busoniu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babuska</surname>
          </string-name>
          , R.; and
          <string-name>
            <surname>De Schutter</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <year>2008</year>
          .
          <article-title>A comprehensive survey of multiagent reinforcement learning</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <source>IEEE Transactions on Systems, Man</source>
          , And
          <string-name>
            <surname>Cybernetics-Part</surname>
            <given-names>C</given-names>
          </string-name>
          :
          <article-title>Applications</article-title>
          and Reviews,
          <volume>38</volume>
          (
          <issue>2</issue>
          ),
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Crandall</surname>
            ,
            <given-names>J. W.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Goodrich</surname>
            ,
            <given-names>M. A.</given-names>
          </string-name>
          <year>2011</year>
          .
          <article-title>Learning to compete, coordinate, and cooperate in repeated games using reinforcement learning</article-title>
          .
          <source>Machine Learning</source>
          <volume>82</volume>
          (
          <issue>3</issue>
          ):
          <fpage>281</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Foerster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Farquhar,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Afouras,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Nardelli</surname>
          </string-name>
          , N.; and
          <string-name>
            <surname>Whiteson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Counterfactual Multi-Agent Policy Gradients</article-title>
          . arXiv:
          <volume>1705</volume>
          .08926 [cs].
          <source>arXiv: 1705</source>
          .
          <fpage>08926</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Foerster</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Chen, R. Y.;
          <string-name>
            <surname>Al-Shedivat</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Whiteson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Abbeel,
          <string-name>
            <given-names>P.</given-names>
            ; and
            <surname>Mordatch</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Learning with opponentlearning awareness</article-title>
          .
          <source>In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems</source>
          ,
          <volume>122</volume>
          -
          <fpage>130</fpage>
          . International Foundation for Autonomous Agents and
          <string-name>
            <given-names>Multiagent</given-names>
            <surname>Systems</surname>
          </string-name>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Fulda</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ventura</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2007</year>
          .
          <article-title>Predicting and preventing coordination problems in cooperative q-learning systems</article-title>
          .
          <source>In IJCAI</source>
          , volume
          <year>2007</year>
          ,
          <fpage>780</fpage>
          -
          <lpage>785</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Hernandez-Leal</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kaisers</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Baarslag</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>de Cote</surname>
            ,
            <given-names>E. M.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A survey of learning in multiagent environments: Dealing with non-stationarity</article-title>
          .
          <source>arXiv preprint arXiv:1707</source>
          .
          <fpage>09183</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Papernot,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Goodfellow</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          ; Duan,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          ; and Abbeel,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Adversarial attacks on neural network policies</article-title>
          .
          <source>arXiv preprint arXiv:1702</source>
          .
          <fpage>02284</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Lauer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2000</year>
          .
          <article-title>An algorithm for distributed reinforcement learning in cooperative multi-agent systems</article-title>
          .
          <source>In In Proceedings of the Seventeenth International Conference on Machine Learning. Citeseer.</source>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Matignon</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Laurent</surname>
            ,
            <given-names>G. J.;</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Le</given-names>
            <surname>Fort-Piat</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>Independent reinforcement learners in cooperative markov games: a survey regarding coordination problems</article-title>
          .
          <source>The Knowledge Engineering Review</source>
          <volume>27</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Bellemare,
          <string-name>
            <given-names>M. G.</given-names>
            ;
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            ;
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Petersen,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Beattie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Sadik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          ; King,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Kumaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Legg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ; and
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Omidshafiei</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pazis</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Amato,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>How</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. P.</given-names>
            ; and
            <surname>Vian</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Deep Decentralized Multi-task MultiAgent Reinforcement Learning under Partial Observability</article-title>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <source>arXiv:1703</source>
          .06182 [cs].
          <source>arXiv: 1703</source>
          .
          <fpage>06182</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Pattanaik</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Bommannan, G.; and Chowdhary,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Robust deep reinforcement learning with adversarial attacks</article-title>
          .
          <source>arXiv preprint arXiv:1712</source>
          .
          <fpage>03632</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Sugihara</surname>
            , G.; May,
            <given-names>R.</given-names>
          </string-name>
          ; Ye,
          <string-name>
            <given-names>H.</given-names>
            ; Hsieh, C.-h.;
            <surname>Deyle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            ;
            <surname>Fogarty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Munch</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Detecting causality in complex ecosystems</article-title>
          .
          <source>science 1227079.</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>