Multi-Agent Mission Planning with Reinforcement Learning

Sean Soleyman, Deepak Khosla
HRL Laboratories, LLC
ssoleyman@hrl.com, dkhosla@hrl.com

Copyright 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). In: Proceedings of AAAI Symposium on the 2nd Workshop on Deep Models and Artificial Intelligence for Defense Applications: Potentials, Theories, Practices, Tools, and Risks, November 11-12, 2020, Virtual, published at http://ceur-ws.org
Distribution Statement "A" (Approved for Public Release, Distribution Unlimited)

Abstract

State of the art mission planning software packages such as AFSIM use traditional AI approaches, including allocation algorithms and scripted state machines, to control the simulated behavior of military aircraft, ships, and ground units. We have developed a novel AI system that uses reinforcement learning to produce more effective high-level strategies for military engagements. However, instead of learning a policy from scratch with initially random behavior, it leverages existing traditional AI approaches to automate simple low-level behaviors, to simplify the cooperative multi-agent aspect of the problem, and to bootstrap learning with available prior knowledge, achieving order-of-magnitude faster training.

Introduction

Simulation software for military applications has revolutionized battle management and analytics, and also provides a gateway for integrating recent developments in machine learning with real-world applications. AFSIM (Advanced Framework for Simulation, Integration, and Modeling) allows military analysts to build a detailed model of a mission scenario that includes aircraft, ships, ground units, weapons, sensors, and communication systems (Clive et al. 2015). However, no mission simulation would be complete without models of how the platforms behave, at both a strategic and a tactical level. Therefore, users of this software are not only required to model physical systems and their capabilities, but must also serve as AI designers.

The end objective of our work is the development of a more generalizable form of artificial intelligence to address multi-domain military scenarios, with an initial focus on battle management and air-to-air engagements. Our goal is to produce a decision-making engine that provides enhanced automation of tactical and strategic decision-making.

The current rule-based approach for specifying platform behaviors in AFSIM is based on video-game-style AI. Each unit is given a processor that executes tasks such as following a pre-set route, firing a weapon at the appropriate time, or pursuing a particular opponent. However, this approach has several detrimental properties. The development of scripted policies is time consuming, and must be performed by analysts with an aptitude for computer programming as well as an understanding of military strategy and tactics. In addition, scripted policies are fragile. Minor changes to the scenario (such as those that would be explored when analyzing possible contingencies) can often cause the scripted platform behavior to become nonsensical, necessitating the expenditure of even more scenario development resources. Most importantly, there is always the possibility that a human analyst could fail to consider an unexpected strategy employed by a particularly clever adversary.

Figure 1 - Example of a complex AFSIM scenario involving air, sea, and ground units. Analysts must model all of these platforms and specify their behaviors with rule-based systems.
Model-free reinforcement learning algorithms provide an alternative solution that eliminates the need for scripting. Instead of specifying behaviors for each platform, the analyst needs only to design an agent-environment interface with a well-defined observation space, action space, and reward function. A reinforcement learning agent takes care of the rest by starting out with completely random behavior and improving by trial and error (Lapan 2018).

First, we will describe our initial effort to apply this naïve baseline approach in a simplified AFSIM-like 2D multi-agent simulated environment (MA2D) that we developed in-house. This simulator is easier to experiment with because it is written entirely in Python. Then, we will provide experimental evidence that reinforcement learning can be much more effective when combined with more traditional non-learning-based AI techniques that constitute the current state of the art in practical applications, and will finally demonstrate that this hybrid approach can produce robust results in an actual AFSIM-based scenario that models aircraft and missile dynamics.

Related Work

In recent years, deep reinforcement learning agents have achieved super-human performance in complex multi-player games such as StarCraft II (DeepMind 2019), Defense of the Ancients (DOTA) (OpenAI 2018), and Quake / Capture the Flag (Jaderberg et al. 2019). Although these computer games are not intended to simulate real-world military engagements, they do possess several key similarities that demonstrate the applicability of deep reinforcement learning technology to military decision making.

First, all of these games consist of two adversarial teams, each composed of a number of cooperative platforms. In StarCraft II, each team may contain over 100 individual units with capabilities loosely resembling those of military ground units and aircraft. DeepMind's approach is to use a single centralized reinforcement learning agent to control each team by selecting a set of platforms and issuing a command to the entire set (Vinyals et al. 2017). OpenAI Five's DOTA solution uses a different type of multi-agent environment interface, where each agent receives a separate command at each time-step (Matiisen 2018). DeepMind's Capture the Flag AI uses a distributed approach, where a separate agent controls each unit (Jaderberg et al. 2019). The multi-agent solution we will describe in this paper relates most closely to the last of these three, but also includes a novel hybridization of RL with the non-learning Kuhn-Munkres Hungarian algorithm (Kuhn 1955).

Another major similarity between these computer games and real-world military simulations is that both are designed to model continuous time with short discrete time-steps. As a consequence, each episode may consist of thousands of discrete time-steps, and each agent may therefore need to select thousands of actions before it receives a final win/loss reward. This creates a challenging temporal exploration problem that is a key focus of existing work in hierarchical reinforcement learning (Sutton, Precup, and Singh 1999; Frans et al. 2018). Our hybrid hierarchical approach is more closely related to dynamic scripting, which has been applied to computer games (Spronck et al. 2006) as well as simple air engagements (Toubman et al. 2014).

Figure 2 - Conceptual illustration of the AFSIM scenario that we are exploring initially. In each episode, a number of red and blue fighters are placed at random locations on a map. A baseline scripted AI is used to control the red team, and our new hybrid RL agent learns a policy for defeating the red team.
Finally, the success of model-free deep RL in computer game environments demonstrates that this approach will extend naturally to partially-observable environments. In StarCraft II and DOTA, each team can only perceive enemy units that are within visual range of one of their own units. In Capture the Flag, the agent actually perceives visual images of the 3D simulated environment, and it is possible for enemies to hide behind walls. In real-world air engagements, pilots identify enemy units using sensing modalities such as radar, vision, and IR. Implementation of realistic partially-observable air engagement scenarios is the subject of future planned work, and successes in computer game environments demonstrate the capability of deep reinforcement learning agents with LSTM units (Hochreiter and Schmidhuber 1997) to achieve good results even when confronted with imperfect information.

Figure 3 - Simplified MA2D environment, written entirely in Python. This example contains two blue fighters and two red fighters. Dark gray areas represent each unit's weapon zone. The objective is to destroy all opponents by getting each within this zone, while avoiding similar destruction of friendly aircraft. This simplification eliminates the need for modeling missile flight.

Reinforcement Learning Baseline Method

Our initial experiments were performed using a simple MA2D environment similar to the one illustrated in Figure 3. A reinforcement learning agent was given control over a single blue fighter, and traditional scripted behavior was used to control the red fighter. In some experiments, the red fighter was set to use a pure pursuit strategy against the blue fighter. In others, it simply traveled straight, providing a moving target for the blue agent to intercept. We introduced variation to the problem by having each fighter start each episode at a random location on the map, with a random heading. This ensures that the agent learns a generalizable policy, not just a point solution to a single scenario. Each fighter's turn rate is limited to 2.5 degrees per time-step, and each fighter's acceleration is limited to 5 m/s per time-step. An opponent is instantly defeated if it comes within the circular sector shown in dark gray, which has a radius of 2 km and an angle of 30 degrees.

Each episode lasts for a maximum of 1000 time-steps. The reward function consists of sparse and dense components. At the end of each episode, the agent receives a large positive reward if it has destroyed its opponent and a large negative reward if it has been destroyed. The exact size of this reward is 10.0 times the number of time-steps remaining when one side has won. This time-based factor provides the agent with an incentive to destroy its opponent as quickly as possible, or to postpone its own demise. In addition, even if there is a draw where neither side wins within 1000 steps, the blue agent still receives a small reward of 1.0 whenever it gets closer to the opponent. This helps to remedy the temporal exploration problem, where it is statistically unlikely that an agent will learn to produce the long sequence of correct actions needed to catch its opponent without the aid of a dense reward. Later, we will see that our novel approach allows us to simplify this reward function while achieving even better results.
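For concreteness, this reward can be written as a small function. The sketch below illustrates the computation described in the text; the function and variable names are placeholders and are not taken from the MA2D code.

```python
MAX_STEPS = 1000         # maximum episode length in time-steps
TERMINAL_SCALE = 10.0    # terminal reward per remaining time-step
APPROACH_BONUS = 1.0     # dense shaping reward for closing the distance

def baseline_reward(step, blue_destroyed_red, red_destroyed_blue,
                    prev_distance, distance):
    """Sparse win/loss reward plus dense distance shaping (illustrative only)."""
    reward = 0.0
    remaining = MAX_STEPS - step
    if blue_destroyed_red:
        reward += TERMINAL_SCALE * remaining   # winning sooner earns a larger reward
    elif red_destroyed_blue:
        reward -= TERMINAL_SCALE * remaining   # losing later costs a smaller penalty
    if distance < prev_distance:
        reward += APPROACH_BONUS               # small reward whenever the agent gets closer
    return reward
```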
In this simple 1v1 environment, the blue agent's observation is a vector consisting of the opponent's relative distance, bearing, heading, closing speed, and cross speed. At each time-step, the agent receives this observation and selects one of the following discrete actions: turn left, turn right, speed up, slow down, hold course. The agent uses an actor-critic reinforcement learning architecture with completely separate value and policy networks. Each network consists of a hidden layer with 36 neurons and ReLU activations, as well as an output layer. The output layer for the policy network contains five neurons (one corresponding to each action listed above) and uses a softmax activation layer with distribution sampling, while the output layer for the value network is a single linear neuron that predicts net reward. Weights are initialized using the method described by He et al., with a truncated normal distribution and based on averaging the number of inputs and outputs (He et al. 2015). Use of the value network for bootstrapping does not improve performance in this particular application, so it is used only as a baseline to reduce variance when computing advantage values (Sutton and Barto 2018).

To compute the gradients needed to train the networks, we use an RMSProp optimizer with learning rate 0.0007, momentum 0.0, and epsilon 1e-10. We use the A3C (Asynchronous Advantage Actor-Critic) parallelization scheme, where 20 workers each run simulations and compute gradients, and these gradients are applied to a centralized learner (Mnih et al. 2016). We have experimented with adding an entropy term to the objective function to help encourage exploration, but this has not been shown to produce a substantial performance improvement. Reward discounting was also determined experimentally to be of limited use in our application, and was therefore omitted.
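The description above does not assume a particular deep learning framework. The sketch below shows one way the two networks and the optimizer settings could be expressed in PyTorch; the layer sizes and hyperparameters follow the text, while the helper names, truncation bounds, and bias initialization are illustrative assumptions rather than a description of the actual implementation.

```python
import math
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, HIDDEN = 5, 5, 36    # observation size, discrete actions, hidden units

def he_truncated_normal_(weight):
    # He-style scale with a truncated normal, averaging fan-in and fan-out as described above.
    fan_in, fan_out = weight.shape[1], weight.shape[0]
    std = math.sqrt(2.0 / ((fan_in + fan_out) / 2.0))
    nn.init.trunc_normal_(weight, mean=0.0, std=std, a=-2.0 * std, b=2.0 * std)

def make_net(out_dim):
    net = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, out_dim))
    for layer in net:
        if isinstance(layer, nn.Linear):
            he_truncated_normal_(layer.weight)
            nn.init.zeros_(layer.bias)
    return net

policy_net = make_net(N_ACTIONS)   # logits for a softmax over the five discrete actions
value_net = make_net(1)            # single linear output predicting net reward

optimizer = torch.optim.RMSprop(
    list(policy_net.parameters()) + list(value_net.parameters()),
    lr=0.0007, momentum=0.0, eps=1e-10)

def select_action(obs):
    """Sample an action from the softmax policy distribution."""
    logits = policy_net(torch.as_tensor(obs, dtype=torch.float32))
    return torch.distributions.Categorical(logits=logits).sample().item()
```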
We trained for up to 200,000 episodes, but found 10,000 episodes to be sufficient when training against the straight-flying opponent. In this simplified environment, it has proven difficult to achieve a high win rate against a pure pursuit opponent. However, the reinforcement learning agent does learn to achieve roughly equal numbers of wins and losses (it is able to match, but unable to exceed, the performance of the MA2D scripted opponent). In the next section, we will compare quantitative performance metrics of this machine learning system with those of our hybrid approach.

High-Level Behavior-Based RL

Our novel hybrid approach builds upon this pure reinforcement learning baseline by leveraging traditional AI techniques to produce low-level behaviors and to aid in multi-target allocation. This allows the reinforcement learning agent to focus on the part of the problem for which traditional AI does not offer an out-of-the-box solution. We will continue to discuss the 1v1 case in this section and the next, and will subsequently move on to the multi-agent MvN case, which we will explore in a more advanced AFSIM-based environment.

The 1v1 architecture consists of a high-level controller and a set of low-level scripted behaviors. The high-level controller is a reinforcement learning agent that takes in observations from the environment, and uses a neural net to select behaviors such as "lead pursuit," "lag pursuit," "pure pursuit," or "evade." Once the behavior has been selected, a low-level controller produces output actions with direct control over the fighter's motion. For example, if an autonomous aircraft in a 1v1 engagement selects "pure pursuit," the corresponding low-level behavior script will generate stick-and-throttle actions that cause the plane to head directly toward its opponent. These low-level actions are simply "turn right," "turn left," etc. in the MA2D case, but could also produce the continuous control signals needed to pilot high-fidelity aircraft models or even real aircraft.

Figure 4 - Overview of our hybrid architecture that pairs a high-level reinforcement learner with low-level scripted behavior policies. The reinforcement learning agent selects a scripted behavior, which then produces the actual control output sent to the environment.

The high-level controller's neural net is trained using reinforcement learning. For each training episode, the system keeps track of the high-level behaviors it has selected, the observations that resulted from applying the corresponding low-level actions to the environment, and the rewards that were obtained from the same environment's reward function. After each episode has been completed, we train the agent using a method similar to that described in the previous section.

Figure 5 - Pseudocode for the hybrid system consisting of an actor-critic agent and a number of scripted low-level behaviors.
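As a rough Python rendering of the control loop summarized by the Figure 5 pseudocode, the basic hybrid agent can be sketched as follows. The env, select_behavior, and behaviors interfaces are assumptions made for illustration and are not the interfaces of the actual implementation.

```python
def run_episode(env, select_behavior, behaviors, max_steps=1000):
    """Hybrid loop: the RL agent picks a scripted behavior, which emits the low-level action."""
    trajectory = []                                   # (observation, behavior, reward) for training
    obs = env.reset()
    for _ in range(max_steps):
        behavior_id = select_behavior(obs)            # high-level reinforcement learning decision
        action = behaviors[behavior_id](obs)          # scripted policy -> "turn left", "speed up", ...
        next_obs, reward, done, _ = env.step(action)  # low-level action applied to the environment
        trajectory.append((obs, behavior_id, reward))
        obs = next_obs
        if done:
            break
    return trajectory                                 # used afterwards for the actor-critic update
```

In this basic variant, a behavior is re-selected at every time-step; the alternatives discussed next restrict when that selection is allowed to happen.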
One potential shortcoming of this approach is that the high-level agent must still select a large number of actions within a single episode. This leads to a potentially intractable credit assignment problem (Geron 2017). We now consider three possible remedies, each of which provides a mechanism that restricts the times at which the high-level controller is given a choice to switch to a different behavior.

The first alternative still performs high-level behavior selection at a fixed frequency, but this frequency is lower than the update rate of the low-level controller, as illustrated in Figure 6. Similar approaches have been used with pure reinforcement learning (Mnih et al. 2013). In the next section, we will show that this approach provides a slight improvement in performance over the basic hybrid agent, at the expense of increased complexity. We will refer to this add-on as "action repetition."

Figure 6 - Fixed-frequency behavior selection with action repetition. In this example, the high-level learner selects four behaviors, but the environment receives 32 low-level actions.

The second alternative uses traditional rule-based AI to specify a termination condition for each behavior. Once a behavior has been selected, execution will continue until this termination condition has been reached, at which time the high-level controller will select a new behavior. This is similar to the "Dynamic Scripting" approach (Toubman et al. 2014). The disadvantage of this approach is that it lacks flexibility. Once the reinforcement learning agent initiates an action, it has no way of terminating this action even if the situation changes entirely at a later time.

The third alternative is illustrated in Figure 7. It includes additional neural nets that restrict the times at which the high-level controller can switch to a different behavior. The agent starts out each episode in the "strategic" state. When the agent is in this state, it selects a low-level behavior using the method described earlier in this section. However, once the agent has selected a behavior, it continues executing this behavior until a low-level "tactical" learner decides to transition control back to the "strategic" learner. Each time the selected low-level controller produces an output action, its corresponding neural net produces probabilities for continuing with the current behavior, or for handing control back to the high-level controller, which may then decide to switch to a different behavior. The objective of this approach is to provide improved credit assignment for decisions made by the strategic learner, while still providing the learnable flexibility needed for precise timing of behavior transitions.

Figure 7 - Depiction of a hierarchical learning agent with seven behaviors as a state machine with eight states. Each state is tied to a separate reinforcement learner. There is one "strategic" learner and there are seven "tactical" learners.
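A minimal sketch of this third alternative's control flow is shown below, assuming one strategic policy and one termination ("tactical") network per behavior. The interfaces, and the sampling of the hand-back decision, are illustrative assumptions rather than a specification of the system depicted in Figure 7.

```python
import random

def run_hierarchical_episode(env, strategic_policy, tactical_policies, behaviors,
                             max_steps=1000):
    """Strategic learner picks a behavior; a per-behavior tactical learner decides when to hand back control."""
    obs = env.reset()
    behavior_id = strategic_policy(obs)              # the episode starts in the "strategic" state
    for _ in range(max_steps):
        action = behaviors[behavior_id](obs)         # scripted low-level behavior output
        obs, reward, done, _ = env.step(action)
        if done:
            break
        # The tactical net outputs the probability of returning control to the strategic learner.
        p_hand_back = tactical_policies[behavior_id](obs)
        if random.random() < p_hand_back:
            behavior_id = strategic_policy(obs)      # strategic learner may switch behaviors
```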
Behavior-Based RL Experiments and Results

Experiments were performed using the same MA2D simulated environment described in the section on the baseline reinforcement learning solution. No changes were made to the observation space. However, the action space for the reinforcement learning agent now consists of the set of behaviors listed in Figure 8. When the neural net selects a lag pursuit, it causes the platform that it is controlling to pursue a point behind its opponent. Pure pursuit and lead pursuit are similar, except that the point is at or in front of the target, respectively. The evade action causes the platform to turn away from its opponent and increase speed as much as possible so that it can escape. Once a behavior is selected, the corresponding low-level script produces an output in the same action space that was described in the previous section, so that an apples-to-apples comparison with the baseline approach can be obtained.

Figure 8 - Behaviors available to the reinforcement learning agent. The first 13 behaviors consist of lead, lag, and pure pursuits with various offsets. The final behavior causes the agent to fly away from the opponent.
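These scripted behaviors differ mainly in where the aim point is placed relative to the opponent. The sketch below shows one possible way a behavior script could map an aim point onto the discrete MA2D actions; the geometry, sign conventions, and function names are simplifying assumptions, not the scripts used in our experiments.

```python
import math

def pursuit_action(own_x, own_y, own_heading, tgt_x, tgt_y, tgt_heading,
                   offset=0.0, turn_rate=2.5):
    """Pure pursuit when offset == 0; positive offsets lead the target, negative offsets lag it."""
    # Aim point displaced along the opponent's heading (units and conventions are illustrative).
    aim_x = tgt_x + offset * math.cos(math.radians(tgt_heading))
    aim_y = tgt_y + offset * math.sin(math.radians(tgt_heading))
    desired = math.degrees(math.atan2(aim_y - own_y, aim_x - own_x))
    error = (desired - own_heading + 180.0) % 360.0 - 180.0   # wrap heading error to [-180, 180)
    if error > turn_rate:
        return "turn left"    # headings assumed to increase counter-clockwise in this sketch
    if error < -turn_rate:
        return "turn right"
    return "hold course"

def evade_action(own_x, own_y, own_heading, tgt_x, tgt_y, tgt_heading, turn_rate=2.5):
    """Turn away from the opponent, then accelerate once roughly pointed away."""
    away_heading = (own_heading + 180.0) % 360.0
    act = pursuit_action(own_x, own_y, away_heading, tgt_x, tgt_y, tgt_heading,
                         offset=0.0, turn_rate=turn_rate)
    return "speed up" if act == "hold course" else act
```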
One unexpected benefit of the hybrid approach described in the previous section is that it eliminates the need for dense rewards and reward function engineering. In reinforcement learning applications, it is typical for the environment to provide the agent with a more informative "dense reward" function that provides a more continuous spectrum of outcome desirability than just win or loss. These dense reward functions can be difficult to design, especially as scenarios become more complex. Eliminating this requirement makes the method much easier to apply to new scenarios because it removes the need for this trial-and-error design process.

The hybrid agent is able to learn effectively with only a win-loss reward. Each episode ends when one of the platforms enters the other's weapon engagement zone, at which point a reward of +5000 is given to the platform in firing position, and -5000 to the platform that is about to be fired upon. If neither platform enters the other's engagement zone within 1000 time-steps, a draw is declared and each platform receives 0 reward.

Experimental results are shown in Figure 9. The baseline result uses pure reinforcement learning. It takes approximately 2,500 episodes of experience before the agent learns to win more episodes than it loses. In contrast, the hybrid approach described in this section uses one of its scripted policies to achieve learning that appears almost instantaneous by comparison. Indeed, the prior knowledge encoded in the scripted policy greatly simplifies the reinforcement learning task. We also experimented with an action repetition variant where the high-level behavior is selected 256 times less frequently than the low-level action. This makes it even easier for the reinforcement learning module to find a winning strategy, because it only needs to select a behavior four times per episode instead of 1000 times (assuming that each episode lasts for 1000 steps).

These results demonstrate that our novel method has advantages over both of the constituent technologies from which it is composed. It can be much faster than reinforcement learning with a flat architecture, and more effective than a simple scripted (traditional) AI opponent.

Figure 9 - Results of training the baseline agent, the basic hybrid learner, and an action repetition variant that produces 256 low-level actions per high-level selection.

Multi-Agent Hybrid Learning and Allocation

Having demonstrated that the hybrid RL approach produces vastly improved results in the simple MA2D environment, we apply this AI solution to a more complex decision environment developed with AFSIM. In this scenario, each fighter has five possible actions. It can pursue an opponent, fire a salvo of weapons, provide weapon support, perform evasive maneuvers, or maintain a steady course. When there is more than one opponent, the AI can also select which one to target. In addition to observed enemy positions and velocities, the environment also returns a simple sparse reward at the end of each episode that is +3000 for the winning team and -3000 for the losing team. For simplicity, a team is declared victorious if it destroys all of the opponents within a time limit. Otherwise, the outcome is declared to be a draw and each team receives zero reward.

Figure 11 - Multi-agent AFSIM-based environment with 6 blue fighters and 6 red fighters. The blue station on the left and the red ship on the right serve only to command their fighters. The fighters fire missiles at one another, and enemy destruction is determined based on missile dynamics and weapon models.

In the 1v1 case, our hybrid reinforcement learning agent quickly learns to defeat the scripted AFSIM opponent with a 58% win rate, 26% loss rate, and 16% draw rate. Only 50,000 episodes of training are required to reach this level of performance. Due to limitations of the AFSIM-based scenario, we were not able to perform a baseline experiment for comparison as we did for MA2D.

We turn now to the MvN case, where each team contains more than one fighter. Our solution uses traditional target allocation algorithms to handle this part of the problem. First, we compute a matrix with M rows and N columns that contains the distance from each blue agent to each red agent. Then, we either assign each agent to the nearest target, or use the Hungarian algorithm to produce an assignment. If there are more blue fighters than red targets, multiple iterations of the Hungarian algorithm are performed until all blue fighters have been assigned (multiple fighters can be assigned to one target). The following cost matrix is used to formulate this linear sum assignment problem, where D is the distance matrix (with the rows corresponding to already-assigned blue fighters removed if multiple iterations are needed):

C_{i,j} = -1.0 / (D_{i,j} + 0.001)
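The sketch below illustrates this allocation step using SciPy's linear_sum_assignment as an example Hungarian algorithm implementation (the choice of implementation is an assumption; the text does not specify one). Repeated passes over the remaining rows mirror the iteration described above for the case where blue fighters outnumber red targets.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def allocate_targets(blue_positions, red_positions):
    """Assign every blue fighter a red target via iterated Hungarian (linear sum) assignment."""
    blue = np.asarray(blue_positions, dtype=float)    # shape (M, 2)
    red = np.asarray(red_positions, dtype=float)      # shape (N, 2)
    distances = np.linalg.norm(blue[:, None, :] - red[None, :, :], axis=-1)
    cost = -1.0 / (distances + 0.001)                 # cost matrix C from the equation above
    assignment = {}
    unassigned = list(range(len(blue)))
    while unassigned:
        rows, cols = linear_sum_assignment(cost[unassigned])
        for r, c in zip(rows, cols):
            assignment[unassigned[r]] = c             # map blue fighter index -> red target index
        unassigned = [b for b in unassigned if b not in assignment]
    return assignment
```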
This effectively reduces the reinforcement learning problem to a 1v1 scenario for each pair. The assignment is re-computed at each time-step so that targets can be re-assigned dynamically. This solution is based on the heuristic assumption that it is better for fighters to engage opponents that are close by. This tends to hold up in practice because rapid destruction of enemy threats involves minimizing the time spent in flight, and therefore the distance travelled. This approach has excellent scalability because an efficient version of the Hungarian algorithm runs in O(n^3) time. It also provides excellent generalizability in the sense that an agent can be trained for a 1v1 engagement, and then used in a much larger scenario. It is challenging to train a reinforcement learning agent to control multiple platforms, and even more challenging to control an arbitrary number of platforms. Although our software framework allows us to train the reinforcement learning agent in up to a 6v6 AFSIM environment, we achieved some interesting results just by training a 1v1 agent and placing it in the 6v6 scenario. Nevertheless, there are still some potential benefits of training within the 6v6 environment. Most importantly, it appears that agents optimized for a 1v1 scenario may be prone to use up all of their missiles very quickly. Training within the 6v6 environment may solve this problem by rewarding agents more frequently when they try to save missiles for later engagements.

Figure 10 - Win/loss/draw results for engagements with up to 12 fighters, with two different target allocation algorithms that we investigated. Each experiment consisted of 1000 trials. These results demonstrate that the hybrid RL agent with Hungarian assignment achieved more wins than losses against a standard AFSIM scripted AI in all experiments, from 1v1 up to 6v6.

Conclusion

When combined with traditional AI approaches, reinforcement learning can produce high-level strategies that are more effective than the previous state of the art. However, a game theoretic perspective is needed to produce truly robust strategies for a pair of adversaries. In this paper, the blue agent learned an approximate best response to a scripted red opponent. This capability is useful in and of itself, but we are also applying empirical game theoretic methods (Lanctot et al. 2017) that allow the reinforcement learning agent to learn without a pre-existing opponent against which to train. This is the subject of a future planned publication.

Acknowledgements

This work was funded by DARPA as part of the Serial Interactions in Imperfect Information Games Applied to Complex Military Decision Making (SI3-CMD) program (contract # HR0011-19-90018). The authors thank Boeing for providing AFSIM scenarios and scripted behaviors. The AFSIM software is property of the Air Force Research Laboratory. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DARPA or the Air Force Research Laboratory.

References

Clive, P. D.; Johnson, J. A.; Moss, M. J.; Zeh, J. M.; Birkmire, B. M.; and Hodson, D. D. 2015. Advanced Framework for Simulation, Integration, and Modeling (AFSIM). In Proceedings of the 2015 International Conference on Scientific Computing. Las Vegas: CSREA Press.

DeepMind. 2019. AlphaStar: Mastering the Real-Time Strategy Game of StarCraft II. https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii

Frans, K.; Ho, J.; Chen, X.; Abbeel, P.; and Schulman, J. 2018. Meta Learning Shared Hierarchies. Paper presented at the International Conference on Learning Representations. Vancouver, BC, April 30 - May 3.

Geron, A. 2017. Hands-On Machine Learning with Scikit-Learn & TensorFlow. Sebastopol: O'Reilly.

He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Paper presented at the IEEE International Conference on Computer Vision, Santiago, Chile, December 7-13.

Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation 9(8): 1735-1780.

Jaderberg, M.; Czarnecki, W. M.; Dunning, I.; Marris, L.; Lever, G.; Castaneda, A. G.; Beattie, C.; Rabinowitz, N. C.; Morcos, A. S.; Ruderman, A.; Sonnerat, N.; Green, T.; Deason, L.; Leibo, J. Z.; Silver, D.; Hassabis, D.; Kavukcuoglu, K.; and Graepel, T. 2019. Human-Level Performance in First-Person Multiplayer Games with Population-Based Deep Reinforcement Learning. Science 364(6443): 859-865.

Kuhn, H. W. 1955. The Hungarian Method for the Assignment Problem. Naval Research Logistics Quarterly 2(1-2): 83-97.

Lanctot, M.; Zambaldi, V.; Gruslys, A.; Lazaridou, A.; Tuyls, K.; Perolat, J.; Silver, D.; and Graepel, T. 2017. A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning. Paper presented at the 31st Conference on Neural Information Processing Systems. Long Beach, CA, December 4-9.
Lapan, M. 2018. Deep Reinforcement Learning Hands-On. Birmingham, UK: Packt Publishing.

Matiisen, T. 2018. The Use of Embeddings in OpenAI Five. https://neuro.cs.ut.ee/the-use-of-embeddings-in-openai-five/

Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; and Riedmiller, M. 2013. Playing Atari with Deep Reinforcement Learning. arXiv preprint. arXiv:1312.5602v1 [cs.LG]. Ithaca, NY: Cornell University Library.

Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Harley, T.; Lillicrap, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous Methods for Deep Reinforcement Learning. In Proceedings of the 33rd International Conference on Machine Learning. New York: Association for Computing Machinery.

OpenAI. 2018. OpenAI Five. https://openai.com/blog/openai-five/

Spronck, P.; Ponsen, M.; Sprinkhuizen-Kuyper, I.; and Postma, E. 2006. Adaptive Game AI with Dynamic Scripting. Machine Learning 63(3): 217-248.

Sutton, R.; Precup, D.; and Singh, S. 1999. Between MDPs and Semi-MDPs: A Framework for Temporal Abstraction in Reinforcement Learning. Artificial Intelligence 112(1-2): 181-211.

Sutton, R. S., and Barto, A. G. 2018. Reinforcement Learning: An Introduction. Cambridge: The MIT Press.

Toubman, A.; Roessingh, J. J.; Spronck, P.; Plaat, A.; and Herik, B. J. 2014. Dynamic Scripting with Team Coordination in Air Combat Simulation. In Proceedings of the 27th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems. Kaohsiung: Springer International.

Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Kuttler, H.; Agapiou, J.; Schrittwieser, J.; Quan, J.; Gaffney, S.; Petersen, S.; Simonyan, K.; Schaul, T.; Hasselt, H.; Silver, D.; Lillicrap, T.; Calderone, K.; Keet, P.; Brunasso, A.; Lawrence, D.; Ekermo, A.; Repp, J.; and Tsing, R. 2017. StarCraft II: A New Challenge for Reinforcement Learning. arXiv preprint. arXiv:1708.04782 [cs.LG]. Ithaca, NY: Cornell University Library.