Multi-Agent Mission Planning with Reinforcement Learning

Sean Soleyman, Deepak Khosla
HRL Laboratories, LLC
ssoleyman@hrl.com, dkhosla@hrl.com

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)

Abstract

State of the art mission planning software packages such as AFSIM use traditional AI approaches, including allocation algorithms and scripted state machines, to control the simulated behavior of military aircraft, ships, and ground units. We have developed a novel AI system that uses reinforcement learning to produce more effective high-level strategies for military engagements. However, instead of learning a policy from scratch with initially random behavior, it leverages existing traditional AI approaches to automate simple low-level behaviors, to simplify the cooperative multi-agent aspect of the problem, and to bootstrap learning with available prior knowledge, achieving order-of-magnitude faster training.

Introduction

Simulation software for military applications has revolutionized battle management and analytics, and also provides a gateway for integrating recent developments in machine learning with real-world applications. AFSIM (Advanced Framework for Simulation, Integration, and Modeling) allows military analysts to build a detailed model of a mission scenario that includes aircraft, ships, ground units, weapons, sensors, and communication systems (Clive et al. 2015). However, no mission simulation would be complete without models of how the platforms behave, at both a strategic and a tactical level. Therefore, users of this software are not only required to model physical systems and their capabilities, but must also serve as AI designers.

The end objective of our work is the development of a more generalizable form of artificial intelligence to address multi-domain military scenarios, with an initial focus on battle management and air-to-air engagements. Our goal is to produce a decision-making engine that provides enhanced automation of tactical and strategic decision-making.

The current rule-based approach for specifying platform behaviors in AFSIM is based on video-game-style AI. Each unit is given a processor that executes tasks such as following a pre-set route, firing a weapon at the appropriate time, or pursuing a particular opponent. However, this approach has several detrimental properties. The development of scripted policies is time consuming, and must be performed by analysts with an aptitude for computer programming as well as an understanding of military strategy and tactics. In addition, scripted policies are fragile. Minor changes to the scenario (such as those that would be explored when analyzing possible contingencies) can often cause the scripted platform behavior to become nonsensical, necessitating the expenditure of even more scenario development resources. Most importantly, there is always the possibility that a human analyst could fail to consider an unexpected strategy employed by a particularly clever adversary.

Figure 1 - Example of a complex AFSIM scenario involving air, sea, and ground units. Analysts must model all of these platforms and specify their behaviors with rule-based systems.
Model-free reinforcement learning algorithms provide an alternative solution that eliminates the need for scripting. Instead of specifying behaviors for each platform, the analyst needs only to design an agent-environment interface with a well-defined observation space, action space, and reward function. A reinforcement learning agent takes care of the rest by starting out with completely random behavior and improving by trial and error (Lapan 2018).

First, we will describe our initial effort to apply this naïve baseline approach in a simplified AFSIM-like 2D multi-agent simulated environment (MA2D) that we developed in-house. This simulator is easier to experiment with because it is written entirely in Python. Then, we will provide experimental evidence that reinforcement learning can be much more effective when combined with more traditional non-learning-based AI techniques that constitute the current state of the art in practical applications, and will finally demonstrate that this hybrid approach can produce robust results in an actual AFSIM-based scenario that models aircraft and missile dynamics.

Related Work

In recent years, deep reinforcement learning agents have achieved super-human performance in complex multi-player games such as StarCraft II (DeepMind 2019), Defense of the Ancients (DOTA) (OpenAI 2018), and Quake / Capture the Flag (Jaderberg et al. 2019). Although these computer games are not intended to simulate real-world military engagements, they do possess several key similarities that demonstrate the applicability of deep reinforcement learning technology to military decision making.

First, all of these games consist of two adversarial teams, each composed of a number of cooperative platforms. In StarCraft II, each team may contain over 100 individual units with capabilities loosely resembling those of military ground units and aircraft. DeepMind's approach is to use a single centralized reinforcement learning agent to control each team by selecting a set of platforms and issuing a command to the entire set (Vinyals et al. 2017). OpenAI Five's DOTA solution uses a different type of multi-agent environment interface, where each agent receives a separate command at each time-step (Matiisen 2018). DeepMind's Capture the Flag AI uses a distributed approach, where a separate agent controls each unit (Jaderberg et al. 2019). The multi-agent solution we will describe in this paper relates most closely to the last of these three, but also includes a novel hybridization of RL with the non-learning Kuhn-Munkres Hungarian algorithm (Kuhn 1955).

Another major similarity between these computer games and real-world military simulations is that both are designed to model continuous time with short discrete time-steps. As a consequence, each episode may consist of thousands of discrete time-steps, and each agent may therefore need to select thousands of actions before it receives a final win/loss reward. This creates a challenging temporal exploration problem that is a key focus of existing work in hierarchical reinforcement learning (Sutton, Precup, and Singh 1999; Frans et al. 2018). Our hybrid hierarchical approach is more closely related to dynamic scripting, which has been applied to computer games (Spronck et al. 2006) as well as simple air engagements (Toubman et al. 2014).

Figure 2 - Conceptual illustration of the AFSIM scenario that we are exploring initially. In each episode, a number of red and blue fighters are placed at random locations on a map. A baseline scripted AI is used to control the red team, and our new hybrid RL agent learns a policy for defeating the red team.
Finally, the success of model-free deep RL in computer game environments demonstrates that this approach will extend naturally to partially-observable environments. In StarCraft II and DOTA, each team can only perceive enemy units that are within visual range of one of their own units. In Capture the Flag, the agent actually perceives visual images of the 3D simulated environment, and it is possible for enemies to hide behind walls. In real-world air engagements, pilots identify enemy units using sensing modalities such as radar, vision, and IR. Implementation of realistic partially-observable air engagement scenarios is the subject of future planned work, and successes in computer game environments demonstrate the capability of deep reinforcement learning agents with LSTM units (Hochreiter and Schmidhuber 1997) to achieve good results even when confronted with imperfect information.

Figure 3 - Simplified MA2D environment, written entirely in Python. This example contains two blue fighters and two red fighters. Dark gray areas represent each unit's weapon zone. The objective is to destroy all opponents by getting each within this zone, while avoiding similar destruction of friendly aircraft. This simplification eliminates the need for modeling missile flight.

Reinforcement Learning Baseline Method

Our initial experiments were performed using a simple MA2D environment similar to the one illustrated in Figure 3. A reinforcement learning agent was given control over a single blue fighter, and traditional scripted behavior was used to control the red fighter. In some experiments, the red fighter was set to use a pure pursuit strategy against the blue fighter. In others, it simply traveled straight, providing a moving target for the blue agent to intercept. We introduced variation to the problem by having each fighter start each episode at a random location on the map, with a random heading. This ensures that the agent learns a generalizable policy, not just a point solution to a single scenario. Each fighter's turn rate is limited to 2.5 degrees per time-step, and each fighter's acceleration is limited to 5 m/s per time-step. An opponent is instantly defeated if it comes within the circular sector shown in dark gray, which has a radius of 2 km and an angle of 30 degrees.

Each episode lasts for a maximum of 1000 time-steps. The reward function consists of sparse and dense components. At the end of each episode, the agent receives a large positive reward if it has destroyed its opponent and a large negative reward if it has been destroyed. The exact size of this reward is 10.0 times the number of time-steps remaining when one side has won. This time-based factor provides the agent with an incentive to destroy its opponent as quickly as possible, or to postpone its own demise. In addition, even if there is a draw where neither side wins within 1000 steps, the blue agent still receives a small reward of 1.0 whenever it gets closer to the opponent. This helps to remedy the temporal exploration problem, where it is statistically unlikely that an agent will learn to produce the long sequence of correct actions needed to catch its opponent without the aid of a dense reward. Later, we will see that our novel approach allows us to simplify this reward function while achieving even better results.
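For concreteness, this reward can be written as a small function. The sketch below illustrates the computation described in the text; the function and variable names are placeholders and are not taken from the MA2D code.

```python
MAX_STEPS = 1000         # maximum episode length in time-steps
TERMINAL_SCALE = 10.0    # terminal reward per remaining time-step
APPROACH_BONUS = 1.0     # dense shaping reward for closing the distance

def baseline_reward(step, blue_destroyed_red, red_destroyed_blue,
                    prev_distance, distance):
    """Sparse win/loss reward plus dense distance shaping (illustrative only)."""
    reward = 0.0
    remaining = MAX_STEPS - step
    if blue_destroyed_red:
        reward += TERMINAL_SCALE * remaining   # winning sooner earns a larger reward
    elif red_destroyed_blue:
        reward -= TERMINAL_SCALE * remaining   # losing later costs a smaller penalty
    if distance < prev_distance:
        reward += APPROACH_BONUS               # small reward whenever the agent gets closer
    return reward
```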
In this simple 1v1 environment, the blue agent's observation is a vector consisting of the opponent's relative distance, bearing, heading, closing speed, and cross speed. At each time-step, the agent receives this observation and selects one of the following discrete actions: turn left, turn right, speed up, slow down, hold course. The agent uses an actor-critic reinforcement learning architecture with completely separate value and policy networks. Each network consists of a hidden layer with 36 neurons and ReLU activations, as well as an output layer. The output layer for the policy network contains five neurons (one corresponding to each action listed above) and uses a softmax activation layer with distribution sampling, while the output layer for the value network is a single linear neuron that predicts net reward. Weights are initialized using the method described by He et al., with a truncated normal distribution and based on averaging the number of inputs and outputs (He et al. 2015). Use of the value network for bootstrapping does not improve performance in this particular application, so it is used only as a baseline to reduce variance when computing advantage values (Sutton and Barto 2018).

To compute the gradients needed to train the networks, we use an RMSProp optimizer with learning rate 0.0007, momentum 0.0, and epsilon 1e-10. We use the A3C (Asynchronous Advantage Actor-Critic) parallelization scheme, where 20 workers each run simulations and compute gradients, and these gradients are applied to a centralized learner (Mnih et al. 2016). We have experimented with adding an entropy term to the objective function to help encourage exploration, but this has not been shown to produce a substantial performance improvement. Reward discounting was also determined experimentally to be of limited use in our application, and was therefore omitted.
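The description above does not assume a particular deep learning framework. The sketch below shows one way the two networks and the optimizer settings could be expressed in PyTorch; the layer sizes and hyperparameters follow the text, while the helper names, truncation bounds, and bias initialization are illustrative assumptions rather than a description of the actual implementation.

```python
import math
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, HIDDEN = 5, 5, 36    # observation size, discrete actions, hidden units

def he_truncated_normal_(weight):
    # He-style scale with a truncated normal, averaging fan-in and fan-out as described above.
    fan_in, fan_out = weight.shape[1], weight.shape[0]
    std = math.sqrt(2.0 / ((fan_in + fan_out) / 2.0))
    nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2.0 * std, b=2.0 * std)

def make_net(out_dim):
    net = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, out_dim))
    for layer in net:
        if isinstance(layer, nn.Linear):
            he_truncated_normal_(layer.weight)
            nn.init.zeros_(layer.bias)
    return net

policy_net = make_net(N_ACTIONS)   # logits for a softmax over the five discrete actions
value_net = make_net(1)            # single linear output predicting net reward

optimizer = torch.optim.RMSprop(
    list(policy_net.parameters()) + list(value_net.parameters()),
    lr=0.0007, momentum=0.0, eps=1e-10)

def select_action(obs):
    """Sample an action from the softmax policy distribution."""
    logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
    return torch.distributions.Categorical(logits=logits).sample().item()
```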
We trained for up to 200,000 episodes, but found 10,000 episodes to be sufficient when training against the straight-flying opponent. In this simplified environment, it has proven difficult to achieve a high win rate against a pure pursuit opponent. However, the reinforcement learning agent does learn to achieve roughly equal numbers of wins and losses (it is able to match, but unable to exceed, the performance of the MA2D scripted opponent). In the next section, we will compare quantitative performance metrics of this machine learning system with those of our hybrid approach.

High-Level Behavior-Based RL

Our novel hybrid approach builds upon this pure reinforcement learning baseline by leveraging traditional AI techniques to produce low-level behaviors and to aid in multi-target allocation. This allows the reinforcement learning agent to focus on the part of the problem for which traditional AI does not offer an out-of-the-box solution. We will continue to discuss the 1v1 case in this section and the next, and will subsequently move on to the multi-agent MvN case, which we will explore in a more advanced AFSIM-based environment.

The 1v1 architecture consists of a high-level controller and a set of low-level scripted behaviors. The high-level controller is a reinforcement learning agent that takes in observations from the environment, and uses a neural net to select behaviors such as "lead pursuit," "lag pursuit," "pure pursuit," or "evade." Once the behavior has been selected, a low-level controller produces output actions with direct control over the fighter's motion. For example, if an autonomous aircraft in a 1v1 engagement selects "pure pursuit," the corresponding low-level behavior script will generate stick-and-throttle actions that cause the plane to head directly toward its opponent. These low-level actions are simply "turn right," "turn left," etc. in the MA2D case, but could also produce the continuous control signals needed to pilot high-fidelity aircraft models or even real aircraft.

Figure 4 - Overview of our hybrid architecture that pairs a high-level reinforcement learner with low-level scripted behavior policies. The reinforcement learning agent selects a scripted behavior, which then produces the actual control output sent to the environment.

The high-level controller's neural net is trained using reinforcement learning. For each training episode, the system keeps track of the high-level behaviors it has selected, the observations that resulted from applying the corresponding low-level actions to the environment, and the rewards that were obtained from the same environment's reward function. After each episode has been completed, we train the agent using a method similar to that described in the previous section.

Figure 5 - Pseudocode for the hybrid system consisting of an actor-critic agent and a number of scripted low-level behaviors.
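As a rough Python rendering of the control loop summarized by the Figure 5 pseudocode, the basic hybrid agent can be sketched as follows. The env, select_behavior, and behaviors interfaces are assumptions made for illustration and are not the interfaces of the actual implementation.

```python
def run_episode(env, select_behavior, behaviors, max_steps=1000):
    """Hybrid loop: the RL agent picks a scripted behavior, which emits the low-level action."""
    trajectory = []                                   # (observation, behavior, reward) for training
    obs = env.reset()
    for _ in range(max_steps):
        behavior_id = select_behavior(obs)            # high-level reinforcement learning decision
        action = behaviors[behavior_id](obs)          # scripted policy -> "turn left", "speed up", ...
        next_obs, reward, done, _ = env.step(action)  # low-level action applied to the environment
        trajectory.append((obs, behavior_id, reward))
        obs = next_obs
        if done:
            break
    return trajectory                                 # used afterwards for the actor-critic update
```

In this basic variant, a behavior is re-selected at every time-step; the alternatives discussed next restrict when that selection is allowed to happen.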
One potential shortcoming of this approach is that the high-level agent must still select a large number of actions within a single episode. This leads to a potentially intractable credit assignment problem (Geron 2017). We now consider three possible remedies, each of which provides a mechanism that restricts the times at which the high-level controller is given a choice to switch to a different behavior.

The first alternative still performs high-level behavior selection at a fixed frequency, but this frequency is lower than the update rate of the low-level controller, as illustrated in Figure 6. Similar approaches have been used with pure reinforcement learning (Mnih et al. 2013). In the next section, we will show that this approach provides a slight improvement in performance over the basic hybrid agent, at the expense of increased complexity. We will refer to this add-on as "action repetition."

Figure 6 - Fixed-frequency behavior selection with action repetition. In this example, the high-level learner selects four behaviors, but the environment receives 32 low-level actions.

The second alternative uses traditional rule-based AI to specify a termination condition for each behavior. Once a behavior has been selected, execution will continue until this termination condition has been reached, at which time the high-level controller will select a new behavior. This is similar to the "Dynamic Scripting" approach (Toubman et al. 2014). The disadvantage of this approach is that it lacks flexibility. Once the reinforcement learning agent initiates an action, it has no way of terminating this action even if the situation changes entirely at a later time.

The third alternative is illustrated in Figure 7. It includes additional neural nets that restrict the times at which the high-level controller can switch to a different behavior. The agent starts out each episode in the "strategic" state. When the agent is in this state, it selects a low-level behavior using the method described earlier in this section. However, once the agent has selected a behavior, it continues executing this behavior until a low-level "tactical" learner decides to transition control back to the "strategic" learner. Each time the selected low-level controller produces an output action, its corresponding neural net produces probabilities for continuing with the current behavior, or for handing control back to the high-level controller, which may then decide to switch to a different behavior. The objective of this approach is to provide improved credit assignment for decisions made by the strategic learner, while still providing the learnable flexibility needed for precise timing of behavior transitions.

Figure 7 - Depiction of a hierarchical learning agent with seven behaviors as a state machine with eight states. Each state is tied to a separate reinforcement learner. There is one "strategic" learner and there are seven "tactical" learners.
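A minimal sketch of this third alternative's control flow is shown below, assuming one strategic policy and one termination ("tactical") network per behavior. The interfaces, and the sampling of the hand-back decision, are illustrative assumptions rather than a specification of the system depicted in Figure 7.

```python
import random

def run_hierarchical_episode(env, strategic_policy, tactical_policies, behaviors,
                             max_steps=1000):
    """Strategic learner picks a behavior; a per-behavior tactical learner decides when to hand back control."""
    obs = env.reset()
    behavior_id = strategic_policy(obs)              # the episode starts in the "strategic" state
    for _ in range(max_steps):
        action = behaviors[behavior_id](obs)         # scripted low-level behavior output
        obs, reward, done, _ = env.step(action)
        if done:
            break
        # The tactical net outputs the probability of returning control to the strategic learner.
        p_hand_back = tactical_policies[behavior_id](obs)
        if random.random() < p_hand_back:
            behavior_id = strategic_policy(obs)      # strategic learner may switch behaviors
```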
Behavior-Based RL Experiments and Results

Experiments were performed using the same MA2D simulated environment described in the section on the baseline reinforcement learning solution. No changes were made to the observation space. However, the action space for the reinforcement learning agent now consists of the set of behaviors listed in Figure 8. When the neural net selects a lag pursuit, it causes the platform that it is controlling to pursue a point behind its opponent. Pure pursuit and lead pursuit are similar, except that the point is at or in front of the target, respectively. The evade action causes the platform to turn away from its opponent and increase speed as much as possible so that it can escape. Once a behavior is selected, the corresponding low-level script produces an output in the same action space that was described in the previous section, so that an apples-to-apples comparison with the baseline approach can be obtained.

Figure 8 - Behaviors available to the reinforcement learning agent. The first 13 behaviors consist of lead, lag, and pure pursuits with various offsets. The final behavior causes the agent to fly away from the opponent.
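These scripted behaviors differ mainly in where the aim point is placed relative to the opponent. The sketch below shows one possible way a behavior script could map an aim point onto the discrete MA2D actions; the geometry, sign conventions, and function names are simplifying assumptions, not the scripts used in our experiments.

```python
import math

def pursuit_action(own_x, own_y, own_heading, tgt_x, tgt_y, tgt_heading,
                   offset=0.0, turn_rate=2.5):
    """Pure pursuit when offset == 0; positive offsets lead the target, negative offsets lag it."""
    # Aim point displaced along the opponent's heading (units and conventions are illustrative).
    aim_x = tgt_x + offset * math.cos(math.radians(tgt_heading))
    aim_y = tgt_y + offset * math.sin(math.radians(tgt_heading))
    desired = math.degrees(math.atan2(aim_y - own_y, aim_x - own_x))
    error = (desired - own_heading + 180.0) % 360.0 - 180.0   # wrap heading error to [-180, 180)
    if error > turn_rate:
        return "turn left"    # headings assumed to increase counter-clockwise in this sketch
    if error < -turn_rate:
        return "turn right"
    return "hold course"

def evade_action(own_x, own_y, own_heading, tgt_x, tgt_y, tgt_heading, turn_rate=2.5):
    """Turn away from the opponent, then accelerate once roughly pointed away."""
    away_heading = (own_heading + 180.0) % 360.0
    act = pursuit_action(own_x, own_y, away_heading, tgt_x, tgt_y, tgt_heading,
                         offset=0.0, turn_rate=turn_rate)
    return "speed up" if act == "hold course" else act
```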
One unexpected benefit of the hybrid approach described in the previous section is that it eliminates the need for dense rewards and reward function engineering. In reinforcement learning applications, it is typical for the environment to provide the agent with a more informative "dense reward" function that provides a more continuous spectrum of outcome desirability than just win or loss. These dense reward functions can be difficult to design, especially as scenarios become more complex. Eliminating this requirement makes the method much easier to apply to new scenarios because it removes the need for this trial-and-error design process.

The hybrid agent is able to learn effectively with only a win-loss reward. Each episode ends when one of the platforms enters the other's weapon engagement zone, at which point a reward of +5000 is given to the platform in firing position, and -5000 to the platform that is about to be fired upon. If neither platform enters the other's engagement zone within 1000 time-steps, a draw is declared and each platform receives 0 reward.

Experimental results are shown in Figure 9. The baseline result uses pure reinforcement learning. It takes approximately 2,500 episodes of experience before the agent learns to win more episodes than it loses. In contrast, the hybrid approach described in this section uses one of its scripted policies to achieve learning that appears almost instantaneous by comparison. Indeed, the prior knowledge encoded in the scripted policy greatly simplifies the reinforcement learning task. We also experimented with an action repetition variant where the high-level behavior is selected 256 times less frequently than the low-level action. This makes it even easier for the reinforcement learning module to find a winning strategy, because it only needs to select a behavior four times per episode instead of 1000 times (assuming that each episode lasts for 1000 steps).

These results demonstrate that our novel method has advantages over both of the constituent technologies from which it is composed. It can be much faster than reinforcement learning with a flat architecture, and more effective than a simple scripted (traditional) AI opponent.

Figure 9 - Results of training the baseline agent, the basic hybrid learner, and an action repetition variant that produces 256 low-level actions per high-level selection.

Multi-Agent Hybrid Learning and Allocation

Having demonstrated that the hybrid RL approach produces vastly improved results in the simple MA2D environment, we apply this AI solution to a more complex decision environment developed with AFSIM. In this scenario, each fighter has five possible actions. It can pursue an opponent, fire a salvo of weapons, provide weapon support, perform evasive maneuvers, or maintain a steady course. When there is more than one opponent, the AI can also select which one to target. In addition to observed enemy positions and velocities, the environment also returns a simple sparse reward at the end of each episode that is +3000 for the winning team and -3000 for the losing team. For simplicity, a team is declared victorious if it destroys all of the opponents within a time limit. Otherwise, the outcome is declared to be a draw and each team receives zero reward.

Figure 11 - Multi-agent AFSIM-based environment with 6 blue fighters and 6 red fighters. The blue station on the left and the red ship on the right serve only to command their fighters. The fighters fire missiles at one another, and enemy destruction is determined based on missile dynamics and weapon models.

In the 1v1 case, our hybrid reinforcement learning agent quickly learns to defeat the scripted AFSIM opponent with a 58% win rate, 26% loss rate, and 16% draw rate. Only 50,000 episodes of training are required to reach this level of performance. Due to limitations of the AFSIM-based scenario, we were not able to perform a baseline experiment for comparison as we did for MA2D.

We turn now to the MvN case, where each team contains more than one fighter. Our solution uses traditional target allocation algorithms to handle this part of the problem. First, we compute a matrix with M rows and N columns that contains the distance from each blue agent to each red agent. Then, we either assign each agent to the nearest target, or use the Hungarian algorithm to produce an assignment. If there are more blue fighters than red targets, multiple iterations of the Hungarian algorithm are performed until all blue fighters have been assigned (multiple fighters can be assigned to one target). The following cost matrix is used to formulate this linear sum assignment problem, where D is the distance matrix (with the rows corresponding to already-assigned blue fighters removed if multiple iterations are needed):

C_{i,j} = -1.0 / (D_{i,j} + 0.001)
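The sketch below illustrates this allocation step using SciPy's linear_sum_assignment as an example Hungarian algorithm implementation (the choice of implementation is an assumption; the text does not specify one). Repeated passes over the remaining rows mirror the iteration described above for the case where blue fighters outnumber red targets.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_targets(blue_positions, red_positions):
    """Assign every blue fighter a red target via iterated Hungarian (linear sum) assignment."""
    blue = np.asarray(blue_positions, dtype=float)    # shape (M, 2)
    red = np.asarray(red_positions, dtype=float)      # shape (N, 2)
    distances = np.linalg.norm(blue[:, None, :] - red[None, :, :], axis=-1)
    cost = -1.0 / (distances + 0.001)                 # cost matrix C from the equation above
    assignment = {}
    unassigned = list(range(len(blue)))
    while unassigned:
        rows, cols = linear_sum_assignment(cost[unassigned])
        for r, c in zip(rows, cols):
            assignment[unassigned[r]] = c             # map blue fighter index -> red target index
        unassigned = [b for b in unassigned if b not in assignment]
    return assignment
```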
This effectively reduces the reinforcement learning problem to a 1v1 scenario for each pair. The assignment is re-computed at each time-step so that targets can be re-assigned dynamically. This solution is based on the heuristic assumption that it is better for fighters to engage opponents that are close by. This tends to hold up in practice because rapid destruction of enemy threats involves minimizing the time spent in flight, and therefore the distance travelled. This approach has excellent scalability because an efficient version of the Hungarian algorithm runs in O(n^3) time. It also provides excellent generalizability in the sense that an agent can be trained for a 1v1 engagement, and then used in a much larger scenario. It is challenging to train a reinforcement learning agent to control multiple platforms, and even more challenging to control an arbitrary number of platforms. Although our software framework allows us to train the reinforcement learning agent in up to a 6v6 AFSIM environment, we achieved some interesting results just by training a 1v1 agent and placing it in the 6v6 scenario. Nevertheless, there are still some potential benefits of training within the 6v6 environment. Most importantly, it appears that agents optimized for a 1v1 scenario may be prone to use up all of their missiles very quickly. Training within the 6v6 environment may solve this problem by rewarding agents more frequently when they try to save missiles for later engagements.

Figure 10 - Win/loss/draw results for engagements with up to 12 fighters, with two different target allocation algorithms that we investigated. Each experiment consisted of 1000 trials. These results demonstrate that the hybrid RL agent with Hungarian assignment achieved more wins than losses against a standard AFSIM scripted AI in all experiments, from 1v1 up to 6v6.

Conclusion

When combined with traditional AI approaches, reinforcement learning can produce high-level strategies that are more effective than the previous state of the art. However, a game theoretic perspective is needed to produce truly robust strategies for a pair of adversaries. In this paper, the blue agent learned an approximate best response to a scripted red opponent. This capability is useful in and of itself, but we are also applying empirical game theoretic methods (Lanctot et al. 2017) that allow the reinforcement learning agent to learn without a pre-existing opponent against which to train. This is the subject of a future planned publication.

Acknowledgements

This work was funded by DARPA as part of the Serial Interactions in Imperfect Information Games Applied to Complex Military Decision Making (SI3-CMD) program (contract # HR0011-19-90018). The authors thank Boeing for providing AFSIM scenarios and scripted behaviors. The AFSIM software is property of the Air Force Research Laboratory. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or the Air Force Research Laboratory.

References

Clive, P. D.; Johnson, J. A.; Moss, M. J.; Zeh, J. M.; Birkmire, B. M.; and Hodson, D. D. 2015. Advanced Framework for Simulation, Integration, and Modeling (AFSIM). In Proceedings of the 2015 International Conference on Scientific Computing. Las Vegas: CSREA Press.

DeepMind. 2019. AlphaStar: Mastering the Real-Time Strategy Game of StarCraft II. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

Frans, K.; Ho, J.; Chen, X.; Abbeel, P.; and Schulman, J. 2018. Meta Learning Shared Hierarchies. Paper presented at the International Conference on Learning Representations. Vancouver, BC, April 30 - May 3.

Geron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol: O'Reilly.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Paper presented at the IEEE International Conference on Computer Vision, Santiago, Chile, December 7-13.

Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735-1780.

Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A. G.; Beattie, C.; Rabinowitz, N. C.; Morcos, A. S.; Ruderman, A.; Sonnerat, N.; Green, T.; Deason, L.; Leibo, J. Z.; Silver, D.; Hassabis, D.; Kavukcuoglu, K.; and Graepel, T. 2019. Human-Level Performance in First-Person Multiplayer Games with Population-Based Deep Reinforcement Learning. Science 364(6443): 859-865.

Kuhn, H. W. 1955. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2(1-2): 83-97.

Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; and Graepel, T. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. Paper presented at the 31st Conference on Neural Information Processing Systems. Long Beach, CA, December 4-9.
Lapan, M. 2018. Deep Reinforcement Learning Hands-On. Birmingham, UK: Packt Publishing.

Matiisen, T. 2018. The Use of Embeddings in OpenAI Five. https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arXiv preprint. arXiv:1312.5602v1 [cs.LG]. Ithaca, NY: Cornell University Library.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning. New York: Association for Computing Machinery.

OpenAI. 2018. OpenAI Five. https://openai.com/blog/openai-five/

Spronck, P.; Ponsen, M.; Sprinkhuizen-Kuyper, I.; and Postma, E. 2006. Adaptive Game AI with Dynamic Scripting. Machine Learning 63(3): 217-248.

Sutton, R.; Precup, D.; and Singh, S. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112(1-2): 181-211.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge: The MIT Press.

Toubman, A.; Roessingh, J. J.; Spronck, P.; Plaat, A.; and Herik, B. J. 2014. Dynamic Scripting with Team Coordination in Air Combat Simulation. In Proceedings of the 27th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems. Kaohsiung: Springer International.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Kuttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. 2017. StarCraft II: A New Challenge for Reinforcement Learning. arXiv preprint. arXiv:1708.04782 [cs.LG]. Ithaca, NY: Cornell University Library.