<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Deep Models
and Artificial Intelligence for Defense Applications: Potentials, Theories,
Practices, Tools, and Risks, November</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Multi-Agent Mission Planning with Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sean Soleyman</string-name>
          <email>ssoleyman@hrl.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Deepak Khosla</string-name>
          <email>dkhosla@hrl.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>HRL Laboratories, LLC</institution>
        </aff>
        <author-comment>
          <p>Distribution Statement “A”: Approved for Public Release, Distribution Unlimited.</p>
        </author-comment>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <volume>1</volume>
      <fpage>1</fpage>
      <lpage>12</lpage>
      <abstract>
        <p>State-of-the-art mission planning software packages such as AFSIM use traditional AI approaches, including allocation algorithms and scripted state machines, to control the simulated behavior of military aircraft, ships, and ground units. We have developed a novel AI system that uses reinforcement learning to produce more effective high-level strategies for military engagements. Instead of learning a policy from scratch with initially random behavior, it leverages existing traditional AI approaches to automate simple low-level behaviors, to simplify the cooperative multi-agent aspect of the problem, and to bootstrap learning with available prior knowledge, achieving order-of-magnitude faster training.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Simulation software for military applications has
revolutionized battle management and analytics, and also
provides a gateway for integrating recent developments in
machine learning with real-world applications. AFSIM
(Advanced Framework for Simulation, Integration, and
Modeling) allows military analysts to build a detailed
model of a mission scenario that includes aircraft, ships,
ground units, weapons, sensors, and communication
systems
        <xref ref-type="bibr" rid="ref1">(Clive et al. 2015)</xref>
        . However, no mission simulation
would be complete without models for how the platforms
behave at both the strategic and tactical levels. Therefore,
users of this software are not only required to model
physical systems and their capabilities, but must also serve as AI
designers.
      </p>
      <p>The end objective of our work is development of a more
generalizable form of artificial intelligence to address
multi-domain military scenarios, with initial focus on battle
management and air-to-air engagements. Our goal is to
produce a decision-making engine that provides enhanced
automation of tactical and strategic decision-making.</p>
      <p>The current rule-based approach for specifying platform
behaviors in AFSIM is based on video-game-style AI. Each
unit is given a processor that executes tasks such as
following a pre-set route, firing a weapon at the appropriate time,
or pursuing a particular opponent. However, this approach
has several drawbacks. The development of
scripted policies is time-consuming, and must be performed
by analysts with an aptitude for computer programming as
well as an understanding of military strategy and tactics. In
addition, scripted policies are fragile. Minor changes to the
scenario (such as those that would be explored when
analyzing possible contingencies) can often cause the scripted
platform behavior to become nonsensical, necessitating the
expenditure of even more scenario development resources.
Most importantly, there is always the possibility that a
human analyst could fail to consider an unexpected strategy
employed by a particularly clever adversary.</p>
      <p>Figure 1 - Example of a complex AFSIM scenario involving air,
sea, and ground units. Analysts must model all of these platforms
and specify their behaviors with rule-based systems.</p>
      <p>
        Model-free reinforcement learning algorithms provide
an alternative solution that eliminates the need for
scripting. Instead of specifying behaviors for each platform, the
analyst needs only to design an agent-environment
interface with a well-defined observation space, action space,
and reward function. A reinforcement learning agent takes
care of the rest by starting out with completely random
behavior and improving by trial and error
        <xref ref-type="bibr" rid="ref10">(Lapan 2018)</xref>
        .
      </p>
      <p>First, we will describe our initial effort to apply this
naïve baseline approach in a simplified AFSIM-like 2D
multi-agent simulated environment (MA2D) that we developed
in-house. This simulator is easier to experiment with
because it is written entirely in Python. Then, we will provide
experimental evidence that reinforcement learning can be
much more effective when combined with more traditional
non-learning-based AI techniques that constitute the
current state of the art in practical applications, and will
finally demonstrate that this hybrid approach can produce
robust results in an actual AFSIM-based scenario that models
aircraft and missile dynamics.</p>
      <p>Figure 3 - Simplified MA2D environment, written entirely in
Python. This example contains two blue fighters and two red
fighters. Dark gray areas represent each unit's weapon zone. The
objective is to destroy all opponents by getting each within this
zone, while avoiding similar destruction of friendly aircraft. This
simplification eliminates the need for modeling missile flight.</p>
      <p>
        In recent years, deep reinforcement learning agents have
achieved super-human performance in complex
multiplayer games such as StarCraft II
        <xref ref-type="bibr" rid="ref2">(DeepMind 2019)</xref>
        ,
Defense of the Ancients (DOTA)
        <xref ref-type="bibr" rid="ref14">(OpenAI 2018)</xref>
        , and Quake / Capture the Flag
        <xref ref-type="bibr" rid="ref7">(Jaderberg et al. 2019)</xref>
        . Although these
computer games are not intended to simulate real-world
military engagements, they do possess several key
similarities that demonstrate the applicability of deep
reinforcement learning technology to military decision making.
      </p>
      <p>
        First, all of these games consist of two adversarial
teams, each composed of a number of cooperative
platforms. In StarCraft II, each team may contain over 100
individual units with capabilities loosely resembling those of
military ground units and aircraft. DeepMind’s approach is
to use a single centralized reinforcement learning agent to
control each team by selecting a set of platforms and
issuing a command to the entire set
        <xref ref-type="bibr" rid="ref19">(Vinyals et al. 2017)</xref>
        .
OpenAI Five’s DOTA solution uses a different type of
multi-agent environment interface, where each agent
receives a separate command at each time-step
        <xref ref-type="bibr" rid="ref11">(Matiisen 2018)</xref>
        . DeepMind’s Capture the Flag AI uses a distributed
approach, where a separate agent controls each unit
        <xref ref-type="bibr" rid="ref7">(Jaderberg et al. 2019)</xref>
        . The multi-agent solution we will describe
in this paper relates most closely to the last of these three,
but also includes a novel hybridization of RL with the
non-learning Kuhn-Munkres (Hungarian) algorithm
        <xref ref-type="bibr" rid="ref8">(Kuhn 1955)</xref>
        .
      </p>
      <p>
        Another major similarity between these computer games
and real-world military simulations is that both are
designed to model continuous time with short discrete
time-steps. As a consequence, each episode may consist of
thousands of discrete time-steps and each agent may therefore
need to select thousands of actions before it receives a final
win/loss reward. This creates a challenging temporal
exploration problem that is a key focus of existing work in
hierarchical reinforcement learning
        <xref ref-type="bibr" rid="ref16">(Sutton, Precup, and Singh 1999)</xref>
        <xref ref-type="bibr" rid="ref3">(Frans et al. 2018)</xref>
        . Our hybrid hierarchical
approach is more closely related to dynamic scripting,
which has been applied to computer games (Spronck et al.
2006) as well as simple air engagements
        <xref ref-type="bibr" rid="ref18">(Toubman et al. 2014)</xref>
        .
      </p>
      <p>
        Finally, the success of model-free deep RL in computer
game environments suggests that this approach can
extend naturally to partially observable environments. In
StarCraft II and DOTA, each team can only perceive
enemy units that are within visual range of one of their own
units. In Capture the Flag, the agent actually perceives
visual images of the 3D simulated environment, and it is
possible for enemies to hide behind walls. In real-world air
engagements, pilots identify enemy units using sensing
modalities such as radar, vision, and IR. Implementation of
realistic partially-observable air engagement scenarios is
the subject of future planned work, and successes in
computer game environments demonstrate the capability of
deep reinforcement learning agents with LSTM units
        <xref ref-type="bibr" rid="ref6">(Hochreiter and Schmidhuber 1997)</xref>
        to achieve good
results even when confronted with imperfect information.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Reinforcement Learning Baseline Method</title>
      <p>Our initial experiments were performed using a simple
MA2D environment similar to the one illustrated in Figure
3. A reinforcement learning agent was given control over a
single blue fighter, and traditional scripted behavior was
used to control the red fighter. In some experiments, the
red fighter was set to use a pure pursuit strategy against the
blue fighter. In others, it simply traveled straight, providing
a moving target for the blue agent to intercept. We
introduced variation to the problem by having each of the
fighters start each episode in a random location on the map,
with random heading. This ensures that the agent learns a
generalizable policy, not just a point solution to a single
scenario. Each fighter’s turn rate is limited to 2.5 degrees
per time-step, and each fighter’s acceleration is limited to 5
m/s/time-step. An opponent is instantly defeated if it
comes within the circular sector shown in dark gray, which has a
radius of 2 km and an angle of 30 degrees.</p>
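      <p>The weapon zone test described above can be sketched as follows. This is a minimal illustration and not the MA2D implementation: it assumes planar coordinates in meters, headings measured counter-clockwise from the x-axis in degrees, and that the 30-degree angle is the full width of the sector centered on the shooter's heading (the text does not state whether it is the full or half angle). The function name <monospace>in_weapon_zone</monospace> is hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math

ZONE_RADIUS_M = 2000.0   # radius of 2 km, from the text
ZONE_ANGLE_DEG = 30.0    # assumed to be the full sector width


def in_weapon_zone(shooter_xy, shooter_heading_deg, target_xy):
    """Return True if the target lies inside the shooter's circular sector."""
    dx = target_xy[0] - shooter_xy[0]
    dy = target_xy[1] - shooter_xy[1]
    if math.hypot(dx, dy) > ZONE_RADIUS_M:
        return False
    bearing_deg = math.degrees(math.atan2(dy, dx))
    # Wrap the angular offset into [-180, 180) before comparing.
    offset = (bearing_deg - shooter_heading_deg + 180.0) % 360.0 - 180.0
    return abs(offset) <= ZONE_ANGLE_DEG / 2.0
```
]]></preformat>
      <p>Under these assumptions, a target 1 km directly ahead of the shooter is inside the zone, while a target 3 km ahead or 1 km directly abeam is not.</p>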
      <p>Each episode lasts for a maximum of 1000 time-steps.
The reward function consists of sparse and dense
components. At the end of each episode, the agent receives a
large positive reward if it has destroyed its opponent and a
large negative reward if it has been destroyed. The exact
size of this reward is 10.0 times the number of time-steps
remaining when one side has won. This time-based factor
provides the agent with an incentive to destroy its
opponent as quickly as possible, or to postpone its own demise.
In addition, the blue agent receives a small reward of 1.0
whenever it gets closer to the opponent, so it still accumulates
reward even in a draw where neither side wins within 1000 steps. This
helps to remedy the temporal exploration problem, where it
is statistically unlikely that an agent will learn to produce a
long sequence of correct actions needed to catch its
opponent without the aid of a dense reward. Later, we will see
that our novel approach allows us to simplify this reward
function while achieving even better results.</p>
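      <p>The baseline reward described above can be sketched as two small functions. This is an illustrative reading of the text and not the authors' code: the function names and the string outcome labels are hypothetical, and the dense term is assumed to be evaluated once per time-step by comparing consecutive distances.</p>
      <preformat preformat-type="code"><![CDATA[
```python
MAX_STEPS = 1000          # maximum episode length, from the text
TERMINAL_SCALE = 10.0     # reward per remaining time-step, from the text
APPROACH_REWARD = 1.0     # dense reward for closing distance, from the text


def step_reward(prev_distance, distance):
    """Dense term: 1.0 whenever the agent has moved closer to its opponent."""
    return APPROACH_REWARD if distance < prev_distance else 0.0


def terminal_reward(outcome, steps_elapsed):
    """Sparse term: +/- 10.0 times the number of time-steps remaining."""
    remaining = MAX_STEPS - steps_elapsed
    if outcome == "win":
        return TERMINAL_SCALE * remaining
    if outcome == "loss":
        return -TERMINAL_SCALE * remaining
    return 0.0  # draw: neither side won within MAX_STEPS
```
]]></preformat>
      <p>With this scaling, a win after 400 steps yields 6000.0, whereas a loss after 900 steps yields only -1000.0, which reflects the incentive to win quickly or to postpone defeat.</p>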
      <p>
        In this simple 1v1 environment, the blue agent’s
observation is a vector consisting of the opponent’s relative
distance, bearing, heading, closing speed, and cross speed. At
each time-step, the agent receives this observation and
selects one of the following discrete actions: turn left, turn
right, speed up, slow down, hold course. The agent uses an
actor-critic reinforcement learning architecture with
completely separate value and policy networks. Each network
consists of a hidden layer with 36 neurons and ReLU
activations, as well as an output layer. The output layer for the
policy network contains five neurons (one corresponding
to each action listed above) and uses a softmax activation
layer with distribution sampling, while the output layer for
the value network is a single linear neuron that predicts net
reward. Weights are initialized using the method described
by He et al.
        <xref ref-type="bibr" rid="ref5">(2015)</xref>
        , with a truncated normal distribution scaled by the average
of the number of inputs and outputs. Use of the value network for bootstrapping does not
improve performance in this particular application, so it is
used only as a baseline to reduce variance when computing
advantage values
        <xref ref-type="bibr" rid="ref17">(Sutton and Barto 2018)</xref>
        .
      </p>
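      <p>A forward pass through a policy network of the size described above (five observation inputs, 36 ReLU hidden units, five softmax outputs with sampling) might look like the following pure-Python sketch. It is an assumption-laden illustration: the helper names are hypothetical, the weights use a plain (not truncated) Gaussian whose scale is only one plausible reading of the initialization described in the text, the observation values are made up, and the separate value network and all training code are omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math
import random

N_OBS, N_HIDDEN, N_ACTIONS = 5, 36, 5
ACTIONS = ["turn left", "turn right", "speed up", "slow down", "hold course"]


def init_layer(n_in, n_out, rng):
    """Gaussian weights scaled by the average of fan-in and fan-out (assumed)."""
    std = math.sqrt(2.0 / ((n_in + n_out) / 2.0))
    weights = [[rng.gauss(0.0, std) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(weights, biases, x):
    """Affine layer: one weighted sum plus bias per output neuron."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_forward(params, observation, rng):
    """Return (action probabilities, sampled action index)."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in dense(w1, b1, observation)]  # ReLU
    probs = softmax(dense(w2, b2, hidden))
    action = rng.choices(range(N_ACTIONS), weights=probs, k=1)[0]
    return probs, action


rng = random.Random(0)
params = (init_layer(N_OBS, N_HIDDEN, rng), init_layer(N_HIDDEN, N_ACTIONS, rng))
# Made-up observation: distance, bearing, heading, closing speed, cross speed.
probs, action = policy_forward(params, [0.5, -0.1, 0.3, 0.0, 0.2], rng)
```
]]></preformat>
      <p>Sampling from the softmax distribution, rather than always taking the most probable action, is what gives the untrained agent its initially random behavior.</p>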
      <p>
        To train the networks, we apply gradient updates using
an RMSProp optimizer with learning rate 0.0007,
momentum 0.0, and epsilon 1e-10. We use the A3C
(Asynchronous Advantage Actor-Critic) parallelization
scheme, where 20 workers each run simulations and
compute gradients, and these gradients are applied to a
centralized learner
        <xref ref-type="bibr" rid="ref13">(Mnih et al. 2016)</xref>
        . We have experimented
with adding an entropy term to the objective function to
help encourage exploration, but this has not been shown to
produce a substantial performance improvement. Reward
discounting was also determined experimentally to be of
limited use in our application, and was therefore omitted.
We trained for up to 200,000 episodes, but found 10,000
episodes to be sufficient when training against the
straight-flying opponent. In this simplified environment, it has
proven difficult to achieve a high win rate against a pure
pursuit opponent. However, the reinforcement learning
agent does learn to achieve roughly equal numbers of wins
and losses (it is able to match, but unable to exceed, the
performance of the MA2D scripted opponent). In the next
section, we will compare quantitative performance metrics
of this machine learning system with those of our hybrid
approach.
      </p>
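      <p>The advantage computation implied by this section (value network used only as a baseline, no reward discounting) can be sketched as follows. The function names are hypothetical, and the use of the undiscounted reward-to-go at each time-step is an assumption; the text does not specify exactly how returns are accumulated.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def undiscounted_returns(rewards):
    """Reward-to-go for each time-step, with no discount factor."""
    returns, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        returns.append(running)
    returns.reverse()
    return returns


def advantages(rewards, value_estimates):
    """Advantage = Monte Carlo return minus the value-network baseline."""
    returns = undiscounted_returns(rewards)
    return [g - v for g, v in zip(returns, value_estimates)]
```
]]></preformat>
      <p>For example, rewards of 1, 0, 1, and 10 over four steps give returns of 12, 11, 11, and 10; subtracting a constant baseline of 10 leaves advantages of 2, 1, 1, and 0.</p>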
    </sec>
    <sec id="sec-3">
      <title>High-Level Behavior-Based RL</title>
      <p>Our novel hybrid approach builds upon this pure
reinforcement learning baseline by leveraging traditional AI
techniques to produce low-level behaviors and to aid in
multi-target allocation. This allows the reinforcement
learning agent to focus on the part of the problem for
which traditional AI does not offer an out-of-the-box
solution. We will continue to discuss the 1v1 case in this
section and the next, and will subsequently move on to the
multi-agent MvN case, which we will explore in a more
advanced AFSIM-based environment.</p>
      <p>The 1v1 architecture consists of a high-level controller
and a set of low-level scripted behaviors. The high-level
controller is a reinforcement learning agent that takes in
observations from the environment, and uses a neural net
to select behaviors such as “lead pursuit,” “lag pursuit,”
“pure pursuit,” or “evade.” Once the behavior has been
selected, a low-level controller produces output actions
with direct control over the fighter’s motion. For example,
if an autonomous aircraft in a 1v1 engagement selects
“pure pursuit,” the corresponding low-level behavior script
will generate stick-and-throttle actions that cause the plane
to head directly toward its opponent. These low-level
actions are simply “turn right,” “turn left,” etc. in the MA2D
case, but the same scripts could instead produce the continuous control signals
needed to pilot high-fidelity aircraft models or even real
aircraft.</p>
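      <p>The two-level control flow described above can be illustrated with a small sketch. The names <monospace>pure_pursuit_action</monospace> and <monospace>hybrid_step</monospace>, the heading convention (degrees counter-clockwise from the x-axis, with a left turn increasing the heading), and the alignment tolerance are all assumptions made for illustration; the actual MA2D scripts are not given in the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def pure_pursuit_action(own_xy, own_heading_deg, target_xy, tolerance_deg=2.5):
    """Low-level script: pick a discrete action that points the nose at the target."""
    bearing = math.degrees(math.atan2(target_xy[1] - own_xy[1],
                                      target_xy[0] - own_xy[0]))
    offset = (bearing - own_heading_deg + 180.0) % 360.0 - 180.0
    if offset > tolerance_deg:
        return "turn left"    # target is counter-clockwise of the nose
    if offset < -tolerance_deg:
        return "turn right"
    return "speed up"         # roughly aligned, so close the distance


def hybrid_step(select_behavior, scripts, observation, state):
    """High-level agent selects a behavior; its script emits the control action."""
    behavior = select_behavior(observation)
    return behavior, scripts[behavior](**state)


# Usage with a stand-in for the neural net that always picks "pure pursuit".
scripts = {"pure pursuit": pure_pursuit_action}
state = {"own_xy": (0.0, 0.0), "own_heading_deg": 0.0, "target_xy": (0.0, 1000.0)}
behavior, action = hybrid_step(lambda obs: "pure pursuit", scripts, None, state)
```
]]></preformat>
      <p>In this arrangement the learner only ever outputs a behavior label, so the low-level scripts can be swapped for higher-fidelity controllers without changing the learner's action space.</p>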
      <p>Figure 4 - Overview of our hybrid architecture that pairs a
high-level reinforcement learner with low-level scripted behavior
policies. The reinforcement learning agent selects a scripted
behavior, which then produces the actual control output sent to
the environment.</p>
      <p>The high-level controller’s neural net is trained using
reinforcement learning. For each training episode, the
system keeps track of the high-level behaviors it has selected,
the observations that resulted from applying the
corresponding low-level actions to the environment, and the
rewards that were obtained from the same environment’s
reward function. After each episode has been completed,
we train the agent using a method similar to that described
in the previous section.</p>
      <p>
        One potential shortcoming of this approach is that the
high-level agent must still select a large number of actions
within a single episode. This leads to a potentially
intractable credit assignment problem
        <xref ref-type="bibr" rid="ref4">(Geron 2017)</xref>
        . We now
consider three possible remedies, each of which provides a
mechanism that restricts the times at which the high-level
controller is given a choice to switch to a different
behavior.
      </p>
      <p>The first alternative still performs high-level behavior
selection at a fixed frequency, but this frequency is lower
than the update rate of the low-level controller as
illustrated in Figure 6. Similar approaches have been used with
pure reinforcement learning (Mnih et al. 2013). In the next
section, we will show that this approach provides a slight
improvement in performance over the basic hybrid agent,
at the expense of increased complexity. We will refer to
this add-on as “action repetition.”</p>
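      <p>Action repetition can be sketched as a loop in which the behavior is re-selected only every fixed number of low-level steps. The function and parameter names below are hypothetical, and the callables stand in for the neural net, the scripted behaviors, and the environment.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def run_with_action_repetition(select_behavior, low_level_action, observe,
                               n_steps, period):
    """Re-select the behavior every `period` steps; act at every step."""
    behavior, decisions, actions = None, [], []
    for t in range(n_steps):
        obs = observe(t)
        if t % period == 0:
            behavior = select_behavior(obs)   # high-level decision point
            decisions.append((t, behavior))
        actions.append(low_level_action(behavior, obs))
    return decisions, actions
```
]]></preformat>
      <p>The number of high-level decisions per episode is then the episode length divided by the period, rounded up, while the low-level controller still acts at every time-step.</p>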
      <p>
        The second alternative uses traditional rule-based AI to
specify a termination condition for each behavior. Once a
behavior has been selected, execution will continue until
this termination condition has been reached, at which time
the high-level controller will select a new behavior. This is
similar to the “Dynamic Scripting” approach
        <xref ref-type="bibr" rid="ref18">(Toubman et
al. 2014)</xref>
        . The disadvantage of this approach is that it lacks
flexibility. Once the reinforcement learning agent initiates
a behavior, it has no way of terminating that behavior even if
the situation later changes entirely.
      </p>
      <p>The third alternative is illustrated in Figure 7. It includes
additional neural nets that restrict the times at which the
high-level controller can switch to a different behavior.
The agent starts out each episode in the “strategic” state.
When the agent is in this state, it selects a low-level
behavior using the method described earlier in this section.
However, once the agent has selected a behavior, it continues
executing this behavior until a low-level “tactical” learner
decides to transition control back to the “strategic” learner.
Each time the selected low-level controller produces an
output action, its corresponding neural net produces
probabilities for continuing with the current behavior, or for
handing control back to the high-level controller that may
then decide to switch to a different behavior. The objective
of this approach is to provide improved credit assignment
for decisions made by the strategic learner, while still
providing the learnable flexibility needed for precision
timing of behavior transitions.</p>
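      <p>The strategic/tactical hand-off of the third alternative can be sketched as a two-state loop. This is only an illustration of the control flow: the names are hypothetical, <monospace>continue_prob</monospace> stands in for the tactical neural net of the active behavior, and no learning is shown.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random


def run_episode(select_behavior, continue_prob, low_level_action, observe,
                n_steps, rng):
    """Strategic learner picks a behavior; a tactical learner decides when to hand back."""
    mode, behavior, log = "strategic", None, []
    for t in range(n_steps):
        obs = observe(t)
        if mode == "strategic":
            behavior = select_behavior(obs)
            mode = "tactical"
        log.append((t, behavior, low_level_action(behavior, obs)))
        # The tactical net for the active behavior outputs P(continue).
        if rng.random() >= continue_prob(behavior, obs):
            mode = "strategic"
    return log


# Stand-in callables: continue with probability 0.9 at every step.
log = run_episode(lambda obs: "lead pursuit", lambda b, o: 0.9,
                  lambda b, o: "turn left", lambda t: t, 100, random.Random(0))
```
]]></preformat>
      <p>A continuation probability near 1 makes the agent behave like the termination-condition variant, while a probability near 0 recovers behavior selection at every time-step.</p>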
    </sec>
    <sec id="sec-4">
      <title>Behavior-Based RL Experiments and Results</title>
      <p>Experiments were performed using the same MA2D
simulated environment described in the section on the
reinforcement learning baseline method. No changes were made to
the observation space. However, the action space for the
reinforcement learning agent now consists of the set of
behaviors listed in Figure 8. When the neural net selects a
lag pursuit, it causes the platform that it is controlling to
pursue a point behind its opponent. Pure pursuit and lead
pursuit are similar, except that the point is at or in front of
the target in each respective case. The evade action causes
the platform to turn away from its opponent and increase
speed as much as possible so that it can escape. Once a
behavior is selected, the corresponding low-level script
produces an output in the same action space that was
described in the previous section so that an apples-to-apples
comparison with the baseline approach can be obtained.</p>
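      <p>The difference between lag, pure, and lead pursuit comes down to the point the pursuer steers toward. A minimal sketch, assuming the aim point is displaced along the target's heading by a fixed distance (the 500 m default and the function name are hypothetical; the text does not say how far behind or ahead the point is placed):</p>
      <preformat preformat-type="code"><![CDATA[
```python
import math


def aim_point(target_xy, target_heading_deg, behavior, offset_m=500.0):
    """Point to steer toward: behind (lag), at (pure), or ahead of (lead) the target."""
    signs = {"lag pursuit": -1.0, "pure pursuit": 0.0, "lead pursuit": 1.0}
    shift = signs[behavior] * offset_m
    heading = math.radians(target_heading_deg)
    return (target_xy[0] + shift * math.cos(heading),
            target_xy[1] + shift * math.sin(heading))
```
]]></preformat>
      <p>The resulting point can then be handed to the same steering logic for all three behaviors, so they differ only in this one geometric parameter.</p>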
      <p>One unexpected benefit of the hybrid approach
described in the previous section is that it eliminates the need
for dense rewards and reward function engineering. In
reinforcement learning applications, it is typical for the
environment to provide the agent with a more informative
“dense reward” function that provides a more continuous
spectrum of outcome desirability than just win or loss.
These dense reward functions can be difficult to design,
especially as scenarios become more complex. Elimination
of this requirement makes the method much easier to apply
to new scenarios because it removes the need for this
trial-and-error design process.</p>
      <p>The hybrid agent is able to learn effectively with only a
win-loss reward. Each episode ends when one of the
platforms enters the other’s weapon engagement zone, at
which point a reward of +5000 is given to the platform in
firing position, and -5000 to the platform that is about to be
fired upon. If neither platform enters the other’s
engagement zone within 1000 time-steps, a draw is declared and
each platform receives 0 reward.</p>
      <p>Experimental results are shown in Figure 9. The baseline
result uses pure reinforcement learning. It takes
approximately 2,500 episodes of experience before the agent
learns to win more episodes than it loses. In contrast, the
hybrid approach described in this section uses one of its
scripted policies to achieve learning that appears almost
instantaneous by comparison. Indeed, the prior knowledge
encoded in the scripted policy greatly simplifies the
reinforcement learning task. We also experimented with an
action repetition variant where the high-level behavior is
selected 256 times less frequently than the low-level
action. This makes it even easier for the reinforcement
learning module to find a winning strategy, because it only
needs to select a behavior four times per episode instead of
1000 times (assuming that each episode lasts for 1000
steps).</p>
      <p>These results demonstrate that our novel method has
advantages over both constituent technologies from which
it is composed. It can be much faster than reinforcement
learning with a flat architecture, and more effective than a
simple scripted (traditional) AI opponent.</p>
    </sec>
    <sec id="sec-5">
      <title>Multi-Agent Hybrid Learning and Allocation</title>
      <p>Having demonstrated that the hybrid RL approach
produces vastly improved results in the simple MA2D
environment, we apply this AI solution to a more complex
decision environment developed with AFSIM. In this scenario,
each fighter has five possible actions. It can pursue an
opponent, fire a salvo of weapons, provide weapon support,
perform evasive maneuvers, or maintain a steady course.
When there is more than one opponent, the AI can also
select which one to target. In addition to observed enemy
positions and velocities, the environment also returns a
simple sparse reward at the end of each episode that is
+3000 for the winning team, and -3000 for the losing team.
For simplicity, a team is declared victorious if it destroys
all of the opponents within a time limit. Otherwise, the
outcome is declared to be a draw and each team receives
zero reward.</p>
      <p>In the 1v1 case, our hybrid reinforcement learning agent
quickly learns to defeat the scripted AFSIM opponent with
58% win rate, 26% loss rate, and 16% draw rate. Only
50,000 episodes of training are required to reach this level
of performance. Due to limitations of the AFSIM-based
scenario, we were not able to perform a baseline
experiment for comparison as we did for MA2D.</p>
      <p>We turn now to the MvN case, where each team
contains more than one fighter. Our solution uses traditional
target allocation algorithms to handle this part of the
problem. First, we compute a matrix with M rows and N
columns that contains the distance from each blue agent to
each red agent. Then, we either assign each agent to the
nearest target, or use the Hungarian algorithm to produce
an assignment. If there are more blue fighters than red
targets, multiple iterations of the Hungarian algorithm are
performed until all blue fighters have been assigned
(multiple fighters can be assigned to one target). The following
cost matrix is used to formulate this linear sum assignment
problem, where D is the distance matrix (with certain rows
removed if multiple iterations are needed – those
corresponding to already-assigned blue fighters):</p>
      <p>
        <disp-formula>
          <tex-math><![CDATA[C_{i,j} = -\frac{1.0}{D_{i,j} + 0.001}]]></tex-math>
        </disp-formula>
      </p>
      <p>This effectively reduces the reinforcement learning
problem to a 1v1 scenario for each pair. The assignment is
re-computed at each time-step so that targets can be
reassigned dynamically. This solution is based on the
heuristic assumption that it is better for fighters to engage
opponents that are close by. This tends to hold up in practice
because rapid destruction of enemy threats involves
minimizing the time spent in flight, and therefore the distance
travelled. This approach has excellent scalability because
an efficient version of the Hungarian algorithm runs in
O(n^3) time. It also provides excellent generalizability in
the sense that an agent can be trained for a 1v1
engagement, and then used in a much larger scenario. It is
challenging to train a reinforcement learning agent to control
multiple platforms, and even more challenging to control
an arbitrary number of platforms. Although our software
framework allows us to train the reinforcement learning
agent in up to a 6v6 AFSIM environment, we achieved
some interesting results just by training a 1v1 agent and
placing it in the 6v6 scenario. Nevertheless, there are still
some potential benefits of training within the 6v6
environment. Most importantly, it appears that agents optimized
for a 1v1 scenario may be prone to use up all of their
missiles very quickly. Training within the 6v6 environment
may solve this problem by rewarding agents more
frequently when they try to save missiles for later
engagements.</p>
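      <p>The allocation step described in this section can be sketched as follows. To keep the example dependency-free, a brute-force search over permutations stands in for the Hungarian algorithm, so it is only suitable for very small teams; a library solver such as SciPy's <monospace>linear_sum_assignment</monospace> would normally be used instead. The function names are hypothetical, and the rule for repeated iterations (re-running the assignment on the blue fighters that remain unassigned) is one reading of the text.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from itertools import permutations


def cost_matrix(distances):
    """Cost of pairing blue i with red j: -1.0 / (D[i][j] + 0.001)."""
    return [[-1.0 / (d + 0.001) for d in row] for row in distances]


def solve_assignment(cost, rows, n_cols):
    """Return {row: col} minimizing total cost over min(len(rows), n_cols) pairs."""
    if len(rows) <= n_cols:
        candidates = (dict(zip(rows, cols))
                      for cols in permutations(range(n_cols), len(rows)))
    else:
        candidates = (dict(zip(picked, range(n_cols)))
                      for picked in permutations(rows, n_cols))
    return min(candidates,
               key=lambda pairs: sum(cost[r][c] for r, c in pairs.items()))


def allocate_targets(distances):
    """Assign every blue fighter (row) to a red target (column)."""
    cost = cost_matrix(distances)
    n_cols = len(distances[0])
    unassigned = list(range(len(distances)))
    assignment = {}
    while unassigned:
        pairs = solve_assignment(cost, unassigned, n_cols)
        assignment.update(pairs)
        # Rows assigned in this pass are removed before the next iteration.
        unassigned = [r for r in unassigned if r not in pairs]
    return assignment


# Three blue fighters, two red targets (distances in meters, made-up values).
result = allocate_targets([[1000.0, 5000.0], [4000.0, 2000.0], [1500.0, 6000.0]])
```
]]></preformat>
      <p>In this example the first pass pairs blue 0 with red 0 and blue 1 with red 1; the second pass then assigns the remaining blue 2 to red 0, its nearer target, so that two fighters share one target.</p>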
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>
        When combined with traditional AI approaches,
reinforcement learning can produce high-level strategies that
are more effective than the previous state of the art.
However, a game theoretic perspective is needed to produce
truly robust strategies for a pair of adversaries. In this
paper, the blue agent learned an approximate best response to
a scripted red opponent. This capability is useful in and of
itself, but we are also applying empirical game theoretic
methods
        <xref ref-type="bibr" rid="ref9">(Lanctot et al. 2017)</xref>
        that allow the reinforcement
learning agent to learn without a pre-existing opponent
against which to train. This is the subject of a future
planned publication.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This work was funded by DARPA as part of the Serial
Interactions in Imperfect Information Games Applied to
Complex Military Decision Making (SI3-CMD) program
(contract # HR0011-19-90018). The authors thank Boeing
for providing AFSIM scenarios and scripted behaviors.
The AFSIM software is property of the Air Force Research
Laboratory. Any opinions, findings, conclusions, or
recommendations expressed in this material are those of the
authors and do not necessarily reflect the views of DARPA
or the Air Force Research Laboratory.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          M.; and
          <string-name>
            <surname>Hodson</surname>
            ,
            <given-names>D. D.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Advanced Framework for Simulation, Integration, and Modeling (AFSIM)</article-title>
          .
          <source>In Proceedings of the 2015 International Conference on Scientific Computing</source>
          . Las Vegas: CSREA Press.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>DeepMind</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>AlphaStar: Mastering the Real-Time Strategy Game of StarCraft II</article-title>
          .
          <uri>https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii</uri>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Frans</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ho</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Abbeel</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Schulman</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Meta Learning Shared Hierarchies</article-title>
          .
          <source>Paper presented at the International Conference on Learning Representations</source>
          . Vancouver, BC, April 30 - May 3.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Geron</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2017</year>
          .
          <source>Hands-On Machine Learning with Scikit-Learn &amp; TensorFlow</source>
          . Sebastopol: O'Reilly.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2015</year>
          .
          <article-title>Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification</article-title>
          .
          <source>Paper presented at the IEEE International Conference on Computer Vision</source>
          , Santiago, Chile, December 7-13.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>1997</year>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
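          S.;
          <string-name>
            <surname>Ruderman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sonnerat</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Green</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Deason</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Leibo</surname>
            ,
            <given-names>J. Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hassabis</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graepel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>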
          <year>2019</year>
          .
          <article-title>Human-level performance in First-Person Multiplayer Games with Population-Based Deep Reinforcement Learning</article-title>
          .
          <source>Science</source>
          <volume>364</volume>
          (
          <issue>6443</issue>
          ):
          <fpage>859</fpage>
          -
          <lpage>865</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Kuhn</surname>
            ,
            <given-names>H. W.</given-names>
          </string-name>
          <year>1955</year>
          .
          <article-title>The Hungarian Method for the Assignment Problem</article-title>
          .
          <source>Naval Research Logistics Quarterly</source>
          <volume>2</volume>
          (
          <issue>1-2</issue>
          ):
          <fpage>83</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zambaldi</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gruslys</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lazaridou</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Tuyls</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Perolat</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Graepel</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning</article-title>
          .
          <source>Paper presented at the 31st Conference on Neural Information Processing Systems</source>
          . Long Beach, CA, December 4-9.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Lapan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep Reinforcement Learning Hands-On</article-title>
          . Birmingham, UK: Packt Publishing.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <string-name>
            <surname>Matiisen</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>The Use of Embeddings in OpenAI Five</article-title>
          . https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wierstra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Riedmiller</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Playing Atari with Deep Reinforcement Learning</article-title>
          .
          <source>arXiv preprint</source>
          . arXiv:1312.5602v1 [cs.LG]. Ithaca, NY: Cornell University Library.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Asynchronous Methods for Deep Reinforcement Learning</article-title>
          .
          <source>In Proceedings of the 33rd International Conference on Machine Learning</source>
          . New York: Association for Computing Machinery.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          .
          <year>2018</year>
          . OpenAI Five. https://openai.com/blog/openaifive/
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Spronck</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ponsen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sprinkhuizen-Kuyper</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Postma</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <year>2006</year>
          .
          <article-title>Adaptive Game AI with Dynamic Scripting</article-title>
          .
          <source>Machine Learning</source>
          ,
          <volume>63</volume>
          (
          <issue>3</issue>
          ),
          <fpage>217</fpage>
          -
          <lpage>248</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Precup</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>1999</year>
          .
          <article-title>Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning</article-title>
          .
          <source>Artificial Intelligence</source>
          ,
          <volume>112</volume>
          (
          <issue>1-2</issue>
          ),
          <fpage>181</fpage>
          -
          <lpage>211</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Barto</surname>
            ,
            <given-names>A. G.</given-names>
          </string-name>
          <year>2018</year>
          .
          <article-title>Reinforcement Learning: An Introduction</article-title>
          . Cambridge: The MIT Press.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Toubman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Roessingh</surname>
            ,
            <given-names>J. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Spronck</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Plaat</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Herik</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>Dynamic Scripting with Team Coordination in Air Combat Simulation</article-title>
          .
          <source>In Proceedings of the 27th International Conference on Industrial, Engineering &amp; Other Applications of Applied Intelligent Systems</source>
          . Kaohsiung: Springer International.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
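          S.;
          <string-name>
            <surname>Yeo</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Makhzani</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kuttler</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Agapiou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schrittwieser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Gaffney</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Petersen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Hasselt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Calderone</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Keet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Brunasso</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lawrence</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ekermo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Repp</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Tsing</surname>
            ,
            <given-names>R.</given-names>
          </string-name>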
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <source>arXiv preprint</source>
          . arXiv:1708.04782 [cs.LG]. Ithaca, NY: Cornell University Library.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>