<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Transfer Learning Between RTS Combat Scenarios Using Component-Action Deep Reinforcement Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Richard Kelly</string-name>
          <email>richard.kelly@mun.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>David Churchill</string-name>
          <email>dave.churchill@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, Memorial University of Newfoundland, St. John's</institution>
          ,
          <addr-line>NL</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Real-time Strategy (RTS) games provide a challenging environment for AI research, due to their large state and action spaces, hidden information, and real-time gameplay. StarCraft II has become a new test-bed for deep reinforcement learning systems using the StarCraft II Learning Environment (SC2LE). Recently the full game of StarCraft II has been approached with a complex multi-agent reinforcement learning (RL) system; however, this is currently only possible with extremely large financial investments out of the reach of most researchers. In this paper we show progress on using variations of easier-to-use RL techniques, modified to accommodate actions with multiple components used in the SC2LE. Our experiments show that we can effectively transfer trained policies between RTS combat scenarios of varying complexity. First, we train combat policies on varying numbers of StarCraft II units, and then carry out those policies on larger scale battles, maintaining similar win rates. Second, we demonstrate the ability to train combat policies on one StarCraft II unit type (Terran Marine) and then apply those policies to another unit type (Protoss Stalker) with similar success.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Real-Time Strategy (RTS) games are a popular testbed
for research in Artificial Intelligence, with complex
subproblems providing many algorithmic challenges (
        <xref ref-type="bibr" rid="ref7">Ontañón
et al. 2015</xref>
        ). With the availability of multiple RTS game APIs
        <xref ref-type="bibr" rid="ref13 ref18 ref3">(Heinermann 2013; Synnaeve et al. 2016; Vinyals et al.
2017)</xref>
        , they provide an ideal environment for testing novel
AI methods. In 2018, Google DeepMind unveiled an AI
agent called AlphaStar
        <xref ref-type="bibr" rid="ref19 ref20">(Vinyals et al. 2019a)</xref>
        , which used
machine learning (ML) techniques to play StarCraft II at a
professional human level. AlphaStar was initially trained
using supervised learning from hundreds of thousands of
human game traces, and then continued to improve via self-play
with deep RL, a method by which the agent improves its
policy by learning to take actions that lead to higher
rewards more often. While this method was successful in
producing a strong agent, it required a massive engineering
effort, with a team of more than 30 world-class AI
researchers and software engineers. AlphaStar also required
an enormous financial investment in hardware for training,
using over 80,000 CPU cores to run simultaneous instances
of StarCraft II and 1,200 Tensor Processing Units (TPUs) to train
the networks, as well as a large amount of infrastructure and
electricity to drive this large-scale computation. While
AlphaStar is estimated to be the strongest existing RTS AI agent
and was capable of beating many players at the
Grandmaster rank on the StarCraft II ladder, it does not yet play at the
level of the world’s best human players (e.g. in a tournament
setting). The creation of AlphaStar demonstrated that
deep learning is a powerful approach to RTS AI;
however, applying it to the game as a whole is not
economically viable for anyone but the world’s largest
companies. In this paper we attempt to demonstrate that one
possible solution is transfer learning:
learning policies for sub-problems of RTS games, and
then using those learned policies to generate actions for many
other sub-problems within the game, which can yield savings
in both training time and infrastructure costs.
      </p>
      <p>
        In 2017, Blizzard Entertainment (developer of the
StarCraft games) released SC2API, an API for external control
of StarCraft II. DeepMind, in collaboration with Blizzard,
simultaneously released the SC2LE with a Python interface
called PySC2 designed to enable ML research with the game
        <xref ref-type="bibr" rid="ref18">(Vinyals et al. 2017)</xref>
        . AlphaStar was created with tools built
on top of SC2API and SC2LE. PySC2 allows commands to
be issued in a way similar to how a human would play; units
are selected with point coordinates or by specifying a
rectangle the way a human would with the mouse. Actions are
formatted as an action function (e.g. move, attack, cast a spell,
unload transport, etc.) with varying numbers of arguments
depending on the function. This action representation differs
from that used in other RTS AI research APIs, including the
TorchCraft ML library for StarCraft: Brood War
        <xref ref-type="bibr" rid="ref13">(Synnaeve et
al. 2016)</xref>
        . In this paper we refer to the action function and
its arguments as action components. Representing actions
as functions with parameters creates a complicated
action-space that requires modifications to classic RL algorithms.
      </p>
      <p>
        Since the release of PySC2, several works other than
AlphaStar have been published using this environment with
deep RL.
        <xref ref-type="bibr" rid="ref15">Tang et al. (2018)</xref>
        trained an agent to learn unit and
building production decisions (build order) using RL while
handling resource harvesting and combat with scripted
modules.
        <xref ref-type="bibr" rid="ref12">Sun et al. (2018)</xref>
        reduced the action space by training
an agent to use macro actions that combine actions available
in PySC2 to play the full game of StarCraft II against the
built-in AI. Macro actions have also been used in
combination with a hierarchical RL architecture of sub-policies and
states and curriculum transfer learning to progressively train
an agent on harder levels of the built-in AI to learn to play the
full game of StarCraft II
        <xref ref-type="bibr" rid="ref8">(Pang et al. 2019)</xref>
        .
        <xref ref-type="bibr" rid="ref9">Samvelyan et al.
(2019)</xref>
        introduced a platform using the SC2LE called the
StarCraft Multi-Agent Challenge used for testing multi-agent RL
by treating each unit as a separate agent. Notably, these other
works have not used the complex action-space directly
provided by PySC2, which mirrors how humans play the game.
      </p>
      <p>This paper is organized as follows: in the next section
we describe how our experiments interact with the
SC2LE; following that, we present our implementation of
component-action DQN; in the Experiments section we
describe our experimental setup and results; finally, we present
our conclusions and ideas for expanding on this research.</p>
    </sec>
    <sec id="sec-2">
      <title>StarCraft II Learning Environment</title>
      <p>
        The PySC2 component of the SC2LE is designed for ML/RL
research, exposing the gamestate mainly as 2D feature maps
and using an action-space similar to human input. An RL
player receives a state observation from PySC2 and then
specifies an action to take once every n frames, where n is
adjustable. The game normally runs at 24 frames per
second, and all our experiments have the RL player acting every
8 frames, as in previous work
        <xref ref-type="bibr" rid="ref18">(Vinyals et al. 2017)</xref>
        . At this
speed the RL player acts at a rate similar to that of a human, and the
gamestate can change meaningfully between states.
      </p>
      <sec id="sec-2-1">
        <title>PySC2 observations</title>
        <p>PySC2 observations consist of a number of 2D feature maps
conveying categorical and scalar information from the main
game map and minimap, as well as a 1D vector of other
information that does not have a spatial component. The
observation also includes a list of valid actions for the current frame,
which we use to mask out illegal actions. In our experiments
we use a subset of the main map spatial features relevant to
the combat scenarios we use:</p>
        <list list-type="bullet">
          <list-item><p>player_relative: categorical feature describing whether units are “self” or enemy (and some other categories we don’t use)</p></list-item>
          <list-item><p>selected: Boolean feature showing which units are currently selected (i.e. if the player can give them commands)</p></list-item>
          <list-item><p>unit_hit_points: scalar feature giving remaining health of units, which we convert to 3 categories</p></list-item>
        </list>
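        <p>The conversion of unit_hit_points to 3 categories and the one-hot encoding of categorical feature maps could look roughly like the following minimal sketch. It is not the authors' code: the function names, the toy map, and in particular the equal-width bucket thresholds and the maximum health value are assumptions made for illustration, since the text does not specify them.</p>
        <preformat>
```python
def bucket_hit_points(hp_map, max_hp):
    """Map raw hit points to 3 categories: 0 = low, 1 = medium, 2 = high.

    Assumption: equal-width thirds of max_hp; the paper gives no thresholds.
    """
    return [[min(2, int(3 * hp / max_hp)) for hp in row] for row in hp_map]


def one_hot(feature_map, num_categories):
    """Return a height x width x num_categories nested list of 0/1 values."""
    return [
        [[1 if value == c else 0 for c in range(num_categories)] for value in row]
        for row in feature_map
    ]


# Toy 2x2 feature map with a hypothetical maximum health of 45.
hp = [[0, 15], [30, 45]]
categories = bucket_hit_points(hp, max_hp=45)
encoded = one_hot(categories, num_categories=3)
```
        </preformat>
        <p>In a full pipeline the one-hot planes of each feature would be stacked along the channel dimension before being passed to the network.</p>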
      </sec>
      <sec id="sec-2-2">
        <title>PySC2 actions</title>
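        <p>Actions in PySC2 are conceptually similar to how a human
player interacts with the game, and consist of a function
selection and 0 or more arguments to the function. We use a
small subset of the over 500 action functions in PySC2, which
is sufficient for the combat scenarios in our experiments.
Those functions and their parameters are:</p>
        <list list-type="bullet">
          <list-item><p>no_op: do nothing</p></list-item>
          <list-item><p>select_rect: select units in a rectangular region</p>
            <list list-type="simple">
              <list-item><p>screen (x, y): top-left position</p></list-item>
              <list-item><p>screen2 (x, y): bottom-right position</p></list-item>
            </list>
          </list-item>
          <list-item><p>select_army: select all friendly combat units</p></list-item>
          <list-item><p>attack_screen: attack unit or move towards a position and stop to attack enemies in range on the way</p>
            <list list-type="simple">
              <list-item><p>screen (x, y)</p></list-item>
            </list>
          </list-item>
          <list-item><p>move_screen: move to a position while ignoring enemies</p>
            <list list-type="simple">
              <list-item><p>screen (x, y)</p></list-item>
            </list>
          </list-item>
        </list>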
        <p>Any player controlled with PySC2 will still have its units
automated to an extent by built-in AI, as with a human player.
For example, if units are standing idly and an enemy enters
within range, they will attack that enemy.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Component-Action DQN</title>
      <p>
        For this research we implemented the DQN RL algorithm,
first used to achieve human-level performance on a suite of
Atari 2600 games
        <xref ref-type="bibr" rid="ref5">(Mnih et al. 2015)</xref>
        , with the double DQN
enhancement
        <xref ref-type="bibr" rid="ref17">(Van Hasselt, Guez, and Silver 2016)</xref>
        , a
dueling network architecture
        <xref ref-type="bibr" rid="ref21">(Wang et al. 2015)</xref>
        , and prioritized
experience replay
        <xref ref-type="bibr" rid="ref10">(Schaul et al. 2016)</xref>
        . The number of unique
actions even in the small subset we are using is too large to
output an action-value for each one (for instance, if the screen
size is 84x84 pixels, there are 7056 unique “attack screen”
commands alone). To address this problem we output a
separate set of action-values for each action component. Our
version of DQN, implemented using Python and TensorFlow,
features modifications for choosing actions and calculating
training loss with actions that consist of multiple components
(component-actions). Algorithm 1 shows our
implementation, emphasizing changes for handling component-actions.
      </p>
      <p>Invalid action component choices are masked in several
parts of the algorithm. Random action function selection in
step 8 is masked according to the valid actions for that state.
When choosing a non-random action using the network
output in step 11, the action function with the highest
action-value is chosen first, disregarding actions marked as
unavailable in the state observation, and then each parameter to the
action function is chosen according to the highest
action-value. Invalid choices for parameters to the action
function could be masked out in steps 9 and 11 in general, but
are not in this work since all values of the parameters we used
(screen and screen2) are valid. Only parameters that are used
by the chosen action function <inline-formula><tex-math>a_t^0</tex-math></inline-formula> are used in the environment
update in step 12.</p>
      <boxed-text>
        <caption><title>Algorithm 1: Double DQN with prioritized experience replay and component-actions for StarCraft II</title></caption>
        <preformat>
 1: Input: minibatch size k, batch update frequency K,
    target update frequency C, max steps T, memory size N,
    M_min ≤ N, initial ε and annealing schedule
 2: M ← []
 3: Initialize Q with random weights θ, components d ∈ D
 4: Initialize Q_tar with weights θ_tar = θ
 5: Initialize environment env with state s_1
 6: for t = 1 to T do
 7:   if rand() &lt; ε or t ≤ M_min then
 8:     Select valid action function a_t^0 randomly
 9:     Select action parameters a_t^1, a_t^2, ... randomly
10:   else
11:     a_t = { a_t^d = argmax_{a^d} Q^d(s_t, a^d) | d ∈ D }
12 onward: [remaining steps missing from the source text]
        </preformat>
      </boxed-text>
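      <p>The masked, component-wise action selection of steps 7 to 11 can be illustrated with the following minimal sketch. It is an illustration under stated assumptions rather than the authors' implementation: the names FUNCTION_ARGS and select_action are hypothetical, screen positions are represented as flattened indices, the warm-up condition on M_min is omitted, and only the parameters used by the chosen function are selected (the algorithm selects all parameters and discards unused ones at the environment update).</p>
      <preformat>
```python
import random

# Hypothetical table of the parameter components each action function uses.
FUNCTION_ARGS = {
    "no_op": [],
    "select_rect": ["screen", "screen2"],
    "select_army": [],
    "attack_screen": ["screen"],
    "move_screen": ["screen"],
}
FUNCTIONS = list(FUNCTION_ARGS)


def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])


def select_action(q_values, valid_functions, epsilon, rng=random):
    """Epsilon-greedy choice of a component-action.

    q_values: dict mapping component name to a list of action-values.
    valid_functions: set of function names available in the current state.
    """
    # Mask: only functions listed as valid may be chosen, randomly or greedily.
    valid_idx = [i for i, f in enumerate(FUNCTIONS) if f in valid_functions]
    explore = epsilon > rng.random()
    if explore:
        fn_idx = rng.choice(valid_idx)
    else:
        fn_idx = max(valid_idx, key=lambda i: q_values["function"][i])
    function = FUNCTIONS[fn_idx]
    action = {"function": function}
    # Parameters are not masked here, as all screen positions are valid.
    for arg in FUNCTION_ARGS[function]:
        if explore:
            action[arg] = rng.randrange(len(q_values[arg]))
        else:
            action[arg] = argmax(q_values[arg])
    return action


example_q = {
    "function": [0.1, 0.9, 0.2, 0.5, 0.3],
    "screen": [0.0, 2.0, 1.0],
    "screen2": [3.0, 0.0, 1.0],
}
greedy = select_action(example_q, {"no_op", "attack_screen", "move_screen"}, 0.0)
```
      </preformat>
      <p>In the example, select_rect has the highest function action-value but is masked out as unavailable, so the greedy choice falls to attack_screen with the highest-valued screen position.</p>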
      <sec id="sec-3-1">
        <title>Loss Functions</title>
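        <p>In double DQN, network updates are made during training
using the TD target value
<disp-formula><tex-math>y = r + \gamma\, Q_{\mathrm{tar}}\bigl(s', \arg\max_{a'} Q(s', a')\bigr),</tex-math></disp-formula>
where <inline-formula><tex-math>r</tex-math></inline-formula> is the reward, <inline-formula><tex-math>\gamma</tex-math></inline-formula> is the discount, <inline-formula><tex-math>Q</tex-math></inline-formula> and <inline-formula><tex-math>Q_{\mathrm{tar}}</tex-math></inline-formula> are
the primary and target networks, and <inline-formula><tex-math>s'</tex-math></inline-formula> and <inline-formula><tex-math>a'</tex-math></inline-formula> are the next
state and the action taken in the next state. With a single action
component the loss to be minimized is <inline-formula><tex-math>L(y, Q(s, a))</tex-math></inline-formula>, where
<inline-formula><tex-math>L(\cdot)</tex-math></inline-formula> is some loss function (e.g. squared error). When using
component-actions, there are separate action-values for each
component of the action. In general, an action is a tuple of
action components <inline-formula><tex-math>(a^0, a^1, \ldots, a^n)</tex-math></inline-formula>, with an individual action
depending on a subset of those components. That is, some
components are not used for every action. This raises the
question of how to calculate training loss when the action
components used from one action to the next differ.</p>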
        <p>Several previous works have used variations of action
components with different RL algorithms. Tavakoli, Pardo,
and Kormushev (2018) use a method they call Branching
Dueling Q-Network (BDQ) in several MuJoCo physical control
tasks with multidimensional action spaces. In these tasks, all
components are used in each complete action. They
experimented with 1) using a separate TD target, <inline-formula><tex-math>y^{d}</tex-math></inline-formula>, for each action
component <inline-formula><tex-math>a^{d}</tex-math></inline-formula> corresponding to a component <inline-formula><tex-math>d \in D</tex-math></inline-formula>, the set
of all action components; 2) using the maximum <inline-formula><tex-math>y^{d}</tex-math></inline-formula> as a
single TD target; and 3) using the average of the <inline-formula><tex-math>y^{d}</tex-math></inline-formula> as a single
TD target, which was found to give the best performance:
<disp-formula><label>(1)</label><tex-math>y = r + \frac{\gamma}{|D|} \sum_{d \in D} Q^{d}_{\mathrm{tar}}\bigl(s', \arg\max_{{a'}^{d}} Q^{d}(s', {a'}^{d})\bigr)</tex-math></disp-formula>
Their training loss is the average of the squared errors of each
component’s action-value and <inline-formula><tex-math>y</tex-math></inline-formula> from Equation 1.</p>
        <p>
          Action components were also used
          <xref ref-type="bibr" rid="ref4">(Huang and Ontañón
2019)</xref>
          to represent actions in the microRTS environment to
train an agent using the Advantage Actor Critic (A2C)
algorithm, a synchronous version of the policy gradient method
Asynchronous Advantage Actor Critic (A3C)
          <xref ref-type="bibr" rid="ref6">(Mnih et al.
2016)</xref>
          . In that work not every component is used in
every action, but the policy gradient action log-probability
(<inline-formula><tex-math>\log \pi(s_t, a_t)</tex-math></inline-formula>, where <inline-formula><tex-math>\pi</tex-math></inline-formula> is the policy network) used in
training is always the sum of the action log-probabilities
for each action component. AlphaStar
          <xref ref-type="bibr" rid="ref19 ref20">(Vinyals et al. 2019b)</xref>
          also uses a policy gradient method and sums the action
log-probabilities from each action component, but masks out the
contribution from unused components.
        </p>
        <p>In our implementation, on step 24, we use the mean target
component <inline-formula><tex-math>y</tex-math></inline-formula> as in Equation 1, but modified so that unused
components of the target action are masked out using <inline-formula><tex-math>m^{d} = 1</tex-math></inline-formula>
if component <inline-formula><tex-math>d</tex-math></inline-formula> is in use for an action, <inline-formula><tex-math>m^{d} = 0</tex-math></inline-formula> otherwise:
<disp-formula><label>(2)</label><tex-math>y = r + \frac{\gamma}{|D|} \sum_{d \in D} m^{d}\, Q^{d}_{\mathrm{tar}}\bigl(s', \arg\max_{{a'}^{d}} Q^{d}(s', {a'}^{d})\bigr)</tex-math></disp-formula></p>
        <p>
          The loss to be minimized with gradient descent on step 25
of Algorithm 1 is the sum of the losses of each used
component’s action-value compared to the target <inline-formula><tex-math>y</tex-math></inline-formula>:
<disp-formula><tex-math>L_{Total} = \sum_{d \in D_{used}} L\bigl(y, Q^{d}(s, a^{d})\bigr)</tex-math></disp-formula>
where <inline-formula><tex-math>D_{used}</tex-math></inline-formula> are the components in use for a particular
action, and <inline-formula><tex-math>L(\cdot)</tex-math></inline-formula> is the Huber loss used in DQN
          <xref ref-type="bibr" rid="ref5">(Mnih et al.
2015)</xref>
          , which is squared error for error with absolute value
less than 1, and linear otherwise.
        </p>
        <p>
          We considered and tested two alternative loss calculations.
The first method was to sum the Huber losses of the
components compared pairwise with
<disp-formula><tex-math>y^{d} = \frac{r}{|D_{used}|} + \gamma\, Q^{d}_{\mathrm{tar}}\bigl(s', \arg\max_{{a'}^{d}} Q^{d}(s', {a'}^{d})\bigr),</tex-math></disp-formula>
masking out unused components from the action taken but
not from the target action. This method is similar to method
1) tested with BDQ
          <xref ref-type="bibr" rid="ref16">(Tavakoli, Pardo, and Kormushev 2018)</xref>
          .
It has the advantage of comparing losses from different
action component branches of the network to their
corresponding branch, but since it updates the primary network toward
the action-values of unused components of the target action
we didn’t expect it to perform well.
        </p>
        <p>[Figure 2: network design. A shared CNN feeds a value branch and the function, screen, and screen2 branches, which output the function, screen, and screen2 action values.]</p>
        <p>The second alternative method we tried was to use the
Huber loss of the average of the action-values of the action
taken by the primary network (unused components masked
out) compared with the same average <inline-formula><tex-math>y</tex-math></inline-formula> from Equation 1.
This method seems reasonable since it uses the same
formula to calculate both the primary network total action-value
and discounted target network next state value. However,
branches with positive and negative action-values can cancel
each other out in both inputs to the loss function, resulting in
a small loss when individual components of the primary
network action-value may have a large error relative to the target
network. Both alternative loss calculations resulted in worse
performance and were not used in our final experiments.</p>
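        <p>To make the masked mean target of Equation 2 and the summed per-component Huber loss concrete, the following minimal sketch computes both for a single transition. It is a plain-Python illustration, not the authors' TensorFlow code: the function names and the dictionary layout are assumptions, the Huber loss uses the common 0.5 scaling of the squared term, and terminal states and prioritized-replay importance weights are left out.</p>
        <preformat>
```python
def huber(error, delta=1.0):
    """Quadratic for |error| up to delta, linear beyond it."""
    abs_err = abs(error)
    if abs_err > delta:
        return delta * (abs_err - 0.5 * delta)
    return 0.5 * error * error


def masked_mean_target(reward, gamma, q_online_next, q_target_next, used_next):
    """Equation 2: y = r + gamma / |D| * sum_d m^d Q_tar^d(s', argmax Q^d(s', .)).

    q_online_next, q_target_next: dict component -> list of action-values at s'.
    used_next: components used by the target action (those with m^d = 1).
    """
    total = 0.0
    for d, online_values in q_online_next.items():
        if d in used_next:
            # Double DQN: the primary network picks, the target network evaluates.
            best = max(range(len(online_values)), key=lambda i: online_values[i])
            total += q_target_next[d][best]
    return reward + gamma * total / len(q_online_next)


def component_loss(y, q_online, action, used):
    """Sum of Huber losses over the components used by the action taken."""
    return sum(huber(y - q_online[d][action[d]]) for d in used)


q_online_next = {"function": [1.0, 3.0], "screen": [0.5, 2.0], "screen2": [4.0, 0.0]}
q_target_next = {"function": [0.0, 2.0], "screen": [1.0, 1.5], "screen2": [9.0, 9.0]}
y = masked_mean_target(1.0, 0.9, q_online_next, q_target_next, {"function", "screen"})

q_online = {"function": [2.0, 0.0], "screen": [0.0, 2.55], "screen2": [0.0, 0.0]}
loss = component_loss(y, q_online, {"function": 0, "screen": 1}, ["function", "screen"])
```
        </preformat>
        <p>In this toy case the screen2 branch is masked out of the target even though its target values are large, and the target is still divided by the total number of components, as in Equation 2.</p>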
      </sec>
      <sec id="sec-3-2">
        <title>Network and training parameters</title>
        <p>The neural network used to predict action-values consists of
a shared convolutional neural network (CNN), followed by
separate branches for state value and the three action
components we used (function, screen, screen2). The network’s
overall design is shown in Figure 2.</p>
        <p>The state input categorical feature maps are first
preprocessed into one-hot format with dimensions 84x84x6.
Next the input is run through a series of blocks consisting
of parallel convolutional layers whose outputs are
concatenated. First the one-hot input is run through each of 3
convolutional layers with 16 filters each and kernel sizes 7, 5,
and 3. The output of those layers is concatenated along the
channel dimension and given as input to each of two
convolutional layers with 32 filters and kernel sizes 5 and 3. Those
two outputs are concatenated and fed to a final convolutional
layer with 32 filters and kernel size 3. All convolutional
layers here use padding and a stride of 1 to keep the output to
the same width and height, so as to preserve spatial data for
the action parameters which target parts of the screen.</p>
        <p>Next the network splits into a value branch and one branch
for each action component. The value and function branches
each have a max pooling layer of size and stride 3, followed
by 2 dense layers of size 256. The function branch ends with
a final dense layer outputting the action advantages of the
function action component.</p>
        <p>
          Both the screen and screen2 branches receive as input
the output of the shared CNN, which is fed into a 32 filter
3x3 convolutional layer, followed by a 1 filter 1x1
convolutional layer as described in
          <xref ref-type="bibr" rid="ref18">(Vinyals et al. 2017)</xref>
          , giving
output dimensions of 84x84x1 and the action advantages of
each screen position. We experimented with adding the
one-hot encoded output from the function branch as input to the
screen branch followed by additional convolutional layers,
and similarly with the screen and screen2 branch with a
single layer added with value 1 in the position corresponding to
the screen choice and 0 elsewhere, but found that for the
action functions and scenarios used in these experiments there
was surprisingly no gain in performance. We believe a larger
network combined with more training time may be required
to take advantage of these connections.
        </p>
        <p>
          Each convolutional and dense layer (except those leading
to action advantage outputs) is followed by a ReLU activation
and batch normalization. Training hyperparameters were
selected through informal search. We use the Adam optimizer
with learning rate of 0.001, a discount of 0.99, batch size
of 64, and L2 regularization. Minibatch training updates are
performed every 4 steps, and the target network is updated
every 10,000 steps. The prioritized experience replay
memory size is 60,000 transitions, and all parameters are as
described in the sum tree implementation
          <xref ref-type="bibr" rid="ref10">(Schaul et al. 2016)</xref>
          .
In each training run the exploration rate ε is exponentially annealed
from 1 to 0.05 over the first 80% of total steps.
        </p>
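        <p>One way to realize such an exploration schedule is sketched below. The exact functional form is an assumption: the text only states that ε is annealed exponentially from 1 to 0.05 over the first 80% of steps, so this sketch uses geometric interpolation between the two values and then holds ε constant. The function name epsilon_at is hypothetical.</p>
        <preformat>
```python
def epsilon_at(step, total_steps, start=1.0, end=0.05, anneal_fraction=0.8):
    """Exploration rate at a given training step.

    Decays geometrically from start to end over the first anneal_fraction
    of training, then stays at end (assumed form of the schedule).
    """
    anneal_steps = anneal_fraction * total_steps
    progress = min(step / anneal_steps, 1.0)
    return start * (end / start) ** progress


# Example: a 300k-step run, sampled at a few points.
schedule = [epsilon_at(s, 300_000) for s in (0, 120_000, 240_000, 300_000)]
```
        </preformat>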
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>We conducted three experiments to test the performance of our
network structure and action representation in different
environments. First, we compared performance when training for
different numbers of training steps; second, we compared
performance on different scenarios, ranging from small (4
units each) to large (32 units each) instances of a combat
scenario; third, we tested transfer learning performance by using
models trained exclusively on one scenario in other
scenarios. Performance of all trained models is compared to that of
a number of simple scripted players. All training and
evaluation is done with the opponent being the game’s built-in AI,
i.e. the same opponent a human player playing the
single-player game would face.</p>
      <sec id="sec-4-1">
        <title>StarCraft II Combat Scenarios</title>
        <p>
          Experiments were conducted on custom StarCraft II maps
that each contain a specific combat scenario. The scenarios
limit the size of the battlefield to a single-screen sized map,
which removes the requirement to navigate the view to
different areas of the environment. In each episode an equal
number of units are spawned for the RL or scripted player
being tested and for a player controlled by the built-in AI. Units appear
in two randomized clusters, randomly assigned to the left or
right side of the map, which are symmetric about the
centre of the map. The built-in AI enemy is immediately given
an order to attack the opposite side of the screen, causing
its units to rush at the RL player’s units, attacking the first
enemies they reach. Once given this command, the in-game AI
takes over control of the enemy units, executing a policy
that prioritizes attacking the closest units of the RL player.
This use of the built-in AI to control the enemy for combat
experiments has been shown to be effective for testing and
training in combat algorithm development
          <xref ref-type="bibr" rid="ref1">(Churchill, Lin, and
Synnaeve 2017)</xref>
          .
        </p>
        <p>[Figure 3: training rewards per episode; y-axis: Reward / (Scenario Unit Count).]</p>
        <p>We trained and evaluated our models and scripted players
on scenarios with equal numbers of Terran Marines (a
basic ranged unit in the game) per side, numbering 4, 8, 16,
or 32. The Marine scenarios will be referred to as “#m vs.
#m”, where # is the number of Marines. We also
evaluated our models on scenarios with the same counts of Protoss
Stalkers, referred to as “#s vs. #s”. Stalkers are larger units
with longer range and a shield which must be reduced to zero
before their health will deplete. These scenarios were
constructed in a symmetric fashion in order to give both sides an
equal chance at winning each battle. If one side is victorious
from an even starting position, it must mean that its method
is more effective at controlling units for combat. An
effective human policy for such scenarios involves first grouping
up the player’s units into a tight formation, and then using
focus-targeting to most efficiently destroy enemy units.</p>
        <p>
          Episodes end when all units of one side are destroyed
(health reduced to 0), or when a timer runs out. The timer
is set at 45 real-time game seconds for Marine maps, and
1 minute 30 seconds for the Stalker maps, since Stalkers have more health
and shields, causing the battles to last longer. The map then
resets with a new randomized unit configuration and a new
episode begins. The maps output custom reward
information to PySC2, which is calculated using the LTD2 formula
          <xref ref-type="bibr" rid="ref2">(Churchill 2016)</xref>
          , which values a unit by its lifetime
damage, a function of its damage output per second multiplied
by remaining health. The unit values are normalized to equal
1 for a full health unit of the highest value in the scenario,
and the difference in total unit value (self minus enemy)
between steps is added to the step reward. Since there are
positive rewards for damaging enemy units and negative rewards
for taking damage, episode rewards tend to be similar in
scenarios with different numbers of units. The damage part of
the LTD2 calculation doesn’t affect the rewards in scenarios
that feature only one unit type, as in the experiments
presented here, but it will affect future experiments with
mixed unit types. The total reward per step is
observed by the RL player in step 12 of Algorithm 1.
        </p>
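        <p>The step reward described above can be sketched as follows. This is an illustration only: the exact unit value function is an assumption (here the square root of remaining health multiplied by damage per second, a form commonly associated with LTD2), the unit statistics in the example are made up, and the function names are hypothetical.</p>
        <preformat>
```python
import math


def unit_value(hit_points, damage_per_second):
    """Assumed LTD2-style value: sqrt(remaining health) * damage per second."""
    return math.sqrt(hit_points) * damage_per_second


def army_value(units, max_unit_value):
    """Total value of a list of (hit_points, dps) units, normalized so that
    a full-health unit of the highest value in the scenario is worth 1."""
    return sum(unit_value(hp, dps) for hp, dps in units) / max_unit_value


def step_reward(prev_self, prev_enemy, cur_self, cur_enemy, max_unit_value):
    """Change in (self minus enemy) total unit value between two steps."""
    previous = army_value(prev_self, max_unit_value) - army_value(prev_enemy, max_unit_value)
    current = army_value(cur_self, max_unit_value) - army_value(cur_enemy, max_unit_value)
    return current - previous


# Toy 2 vs. 2 example with made-up statistics: 100 health, 10 damage per second.
full = (100, 10)
max_value = unit_value(*full)
reward = step_reward([full, full], [full, full], [full, full], [full, (25, 10)], max_value)
```
        </preformat>
        <p>Damaging an enemy unit lowers the enemy total and therefore yields a positive reward, while damage taken by friendly units yields a negative one.</p>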
      </sec>
      <sec id="sec-4-2">
        <title>Training and Evaluation</title>
        <p>We trained models in scenarios with 4, 8, 16, and 32 Marines
per side, for 300k and 600k steps. In informal testing we
found that 300k steps was enough to see good performance
in these scenarios, and we chose double the number of steps
to compare with a longer training time. Training rewards per
episode for 300k and 600k steps are shown in Figure 3.
Training was performed on a machine with an Intel i7-7700K CPU
running at 4.2 GHz and an NVIDIA GeForce GTX 1080 Ti
video card. It takes 5 hours to train for 300k steps using the
GPU and running the game as fast as possible. To train a
model we use PySC2 to run our custom combat scenarios for
as many episodes as needed to reach the step limit. The RL
player receives input from the game environment and takes
actions as described in Algorithm 1.</p>
        <p>For each scenario/step count combination we trained three
models and used the best performing of those for the transfer
learning experiments. Trained models were evaluated by
running 1000 episodes with a deterministic policy (i.e. ε = 0),
using the trained model for inference only. In evaluation, the
result can be a win (1 point), draw (0.5) or loss (0). Wins
occur when all enemy units are destroyed. Draws happen if both
sides simultaneously lose their last unit. If the map timer runs
out, the side with the most combined health/shields wins. If
that metric results in a tie then the episode is counted as a
draw. Evaluation results are presented as a score, equal to the
number of wins plus half of the number of draws, divided by
the number of evaluation episodes.</p>
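        <p>The scoring rule can be written out as a short sketch. The function names and the string labels for outcomes are hypothetical; comparing the combined remaining health and shields of both sides covers the win, simultaneous-loss and timeout cases described above.</p>
        <preformat>
```python
def episode_result(self_health, enemy_health):
    """Outcome from each side's combined remaining health and shields.

    A destroyed side has 0 remaining, so this also covers the timeout rule.
    """
    if self_health > enemy_health:
        return "win"
    if enemy_health > self_health:
        return "loss"
    return "draw"


def evaluation_score(results):
    """(wins + 0.5 * draws) / number of evaluation episodes."""
    points = {"win": 1.0, "draw": 0.5, "loss": 0.0}
    results = list(results)
    return sum(points[r] for r in results) / len(results)


outcomes = [episode_result(s, e) for s, e in [(90, 0), (40, 10), (0, 0), (0, 135)]]
score = evaluation_score(outcomes)
```
        </preformat>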
        <p>We also evaluated several scripted players:</p>
        <list list-type="bullet">
          <list-item><p>No Action (NA): This player takes the no_op action only, which is equivalent to a human player providing no input. Its units will attack and follow enemy units if they move into range, controlled by the in-game AI.</p></list-item>
          <list-item><p>Random (R): This player takes random actions. Both the function choice and any arguments are randomized.</p></list-item>
          <list-item><p>Random Attack (RA): This player selects all friendly units, and subsequently attacks random screen positions.</p></list-item>
          <list-item><p>Attack Weakest Nearest (AWN): This player selects all friendly units on the first frame and then attacks the enemy with the lowest health, choosing the enemy nearest to the average of the friendly units’ positions as a tie breaker.</p></list-item>
        </list>
      </sec>
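      <p>The target choice of the Attack Weakest Nearest player can be sketched as below. The data layout (tuples of position and health) and the function name awn_target are assumptions for illustration; in practice the positions and health values would be read from the PySC2 observation.</p>
      <preformat>
```python
import math


def awn_target(friendly_positions, enemies):
    """Pick the enemy with the lowest health, breaking ties by distance
    to the average friendly position.

    friendly_positions: list of (x, y); enemies: list of (x, y, health).
    Returns the (x, y) position to attack.
    """
    cx = sum(x for x, _ in friendly_positions) / len(friendly_positions)
    cy = sum(y for _, y in friendly_positions) / len(friendly_positions)
    x, y, _ = min(
        enemies,
        key=lambda e: (e[2], math.hypot(e[0] - cx, e[1] - cy)),
    )
    return (x, y)


# Two enemies share the lowest health; the one nearer the friendly centre wins.
target = awn_target([(0, 0), (2, 0)], [(10, 0, 20), (3, 0, 10), (8, 0, 10)])
```
      </preformat>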
    </sec>
    <sec id="sec-5">
      <title>Results and Discussion</title>
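      <p>Results of our experiments can be seen in Table 1. The first
column shows the scenario for which the experiment is being
conducted along a row, with the other column values showing
which policy controls the player units for the experiment.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <thead>
            <tr>
              <th rowspan="3">Scenario</th>
              <th colspan="4" rowspan="2">Scripted Players</th>
              <th colspan="4">Learned Policy - 300k Steps</th>
              <th colspan="4">Learned Policy - 600k Steps</th>
            </tr>
            <tr>
              <th colspan="4">Trained on Marines</th>
              <th colspan="4">Trained on Marines</th>
            </tr>
            <tr>
              <th>NA</th><th>R</th><th>RA</th><th>AWN</th>
              <th>4m</th><th>8m</th><th>16m</th><th>32m</th>
              <th>4m</th><th>8m</th><th>16m</th><th>32m</th>
            </tr>
          </thead>
          <tbody>
            <tr><td>4m vs. 4m</td><td>0.173</td><td>0.013</td><td>0.570</td><td>0.907</td><td>0.691</td><td>0.661</td><td>0.619</td><td>0.625</td><td>0.663</td><td>0.606</td><td>0.444</td><td>0.414</td></tr>
            <tr><td>8m vs. 8m</td><td>0.146</td><td>0.019</td><td>0.561</td><td>0.721</td><td>0.694</td><td>0.691</td><td>0.692</td><td>0.541</td><td>0.693</td><td>0.585</td><td>0.356</td><td>0.440</td></tr>
            <tr><td>16m vs. 16m</td><td>0.119</td><td>0.004</td><td>0.391</td><td>0.038</td><td>0.591</td><td>0.546</td><td>0.695</td><td>0.450</td><td>0.643</td><td>0.213</td><td>0.300</td><td>0.503</td></tr>
            <tr><td>32m vs. 32m</td><td>0.044</td><td>0.002</td><td>0.116</td><td>0.000</td><td>0.329</td><td>0.280</td><td>0.441</td><td>0.332</td><td>0.340</td><td>0.247</td><td>0.130</td><td>0.471</td></tr>
            <tr><td>4s vs. 4s</td><td>0.142</td><td>0.000</td><td>0.352</td><td>0.910</td><td>0.497</td><td>0.491</td><td>0.008</td><td>0.479</td><td>0.525</td><td>0.291</td><td>0.236</td><td>0.408</td></tr>
            <tr><td>8s vs. 8s</td><td>0.179</td><td>0.000</td><td>0.240</td><td>0.678</td><td>0.420</td><td>0.406</td><td>0.015</td><td>0.386</td><td>0.512</td><td>0.192</td><td>0.204</td><td>0.303</td></tr>
            <tr><td>16s vs. 16s</td><td>0.211</td><td>0.001</td><td>0.030</td><td>0.013</td><td>0.114</td><td>0.183</td><td>0.034</td><td>0.356</td><td>0.288</td><td>0.152</td><td>0.154</td><td>0.196</td></tr>
            <tr><td>32s vs. 32s</td><td>0.130</td><td>0.000</td><td>0.002</td><td>0.000</td><td>0.075</td><td>0.239</td><td>0.223</td><td>0.320</td><td>0.386</td><td>0.193</td><td>0.355</td><td>0.170</td></tr>
          </tbody>
        </table>
      </table-wrap>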
      <sec id="sec-5-1">
        <title>Scripted Player Benchmarks</title>
        <p>The results of the scripted players can be seen in columns 2-5
of Table 1. The performance of the benchmark scripted
players on the Marine scenarios shows a large range of results.
The NA policy scores as high as 0.173, as units will attack
when enemies get within range even if the player does
nothing. The random bot gets scores from 0 to 0.019, whereas
the random attacking bot ranges from a score of 0.116 in the
32m vs. 32m scenario to 0.570 in the 4m vs. 4m scenario.
This shows that simply choosing to attack all the time is a
good strategy, but that it isn’t enough to win often with higher
unit counts (in the 32m vs. 32m scenario the bot is likely
attacking its own units sometimes). AWN achieves a score of
0.907 in the 4m vs. 4m scenario, but scores 0 in the 32m
vs. 32m scenario, indicating that formation becomes a
bigger factor in more complex scenarios. We observed that units
often get stuck while trying to move to attack the same target.
The scripted players perform best in the 4m vs. 4m scenario
and perform worse as the number of units increases in all
cases except R in the 8m vs. 8m scenario. Scripted players
perform similarly in the Stalker scenarios.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Learned Policies - Battle Size Transfer</title>
        <p>Results for the learned policies can be seen in columns
6-9 (300k training steps) and 10-13 (600k training steps) of
Table 1. These results yield two major
observations. The first is that, as with the scripted players, smaller
battles yielded higher scores for learned policies. We believe
this is because smaller battles are less complex, with far fewer
possible states, and are therefore easier to learn effective
policies for. Smaller battles also end faster, so more battles can
be carried out in the same number of training steps.
The second observation is that a policy trained on battles of one
size can be transferred to battles of another size while
maintaining similar results. For
example, a policy trained on 4 vs. 4 Marines for 600k time steps
obtained a score of 0.663 when applied to the 4
vs. 4 Marine scenario and 0.643 when applied to the 16 vs.
16 Marine scenario, which was even better than the policy
trained on 16 Marines itself. We did, however, notice that
results get worse as the difference in unit count between
training and testing scenarios grows, which was expected.</p>
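        <p>As a worked illustration of the example above, the short sketch below computes how much of the training-scenario score is kept after transfer, using only the two scores quoted in the text. The name <monospace>retention</monospace> and the ratio itself are assumptions introduced here for illustration; the paper does not define such a metric.</p>
        <preformat>
```python
# Scores quoted in the text for the policy trained on 4 vs. 4 Marines (600k steps).
score_on_training_size = 0.663  # evaluated on the 4 vs. 4 Marine scenario
score_on_transfer_size = 0.643  # evaluated on the 16 vs. 16 Marine scenario


def retention(source_score, target_score):
    """Fraction of the source-scenario score kept in the target scenario.

    Hypothetical helper for illustration; not a metric defined in the paper.
    """
    if source_score == 0:
        raise ValueError("source score must be non-zero")
    return target_score / source_score


ratio = retention(score_on_training_size, score_on_transfer_size)
print(f"retention: {ratio:.3f}")  # prints "retention: 0.970"
```
</preformat>
        <p>Under this illustrative measure, the 4 vs. 4 policy keeps roughly 97% of its score when moved to the 16 vs. 16 scenario, which is one way to read "similar results" numerically.</p>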
        <p>One surprising result, however, was that for several
scenarios the best policy was not the one trained on that
same scenario. For example, the policy trained on 16 vs. 16
Marines for 600k time steps was the second-worst policy when
applied in the 16 vs. 16 Marine scenario. Another
surprising result was that most of the policies trained for 300k time
steps ended up performing as well as, or even better than, those
trained for 600k time steps. This indicates that either the
scenarios are not complicated enough for more training to result
in better policies, or that the variability in model performance
is too large to see a trend for the number of models we trained.</p>
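        <p>The second explanation can be made concrete with a simple check on repeated training runs. The sketch below is a minimal illustration, not the analysis used in this paper: it assumes several independently trained models per training budget, and the function names and the threshold of two combined standard errors (<monospace>z=2.0</monospace>) are assumptions chosen for the example.</p>
        <preformat>
```python
import math
import statistics


def mean_and_stderr(scores):
    """Return (mean, standard error of the mean) for a list of scores."""
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two scores.
    stderr = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, stderr


def trend_is_distinguishable(scores_a, scores_b, z=2.0):
    """True if the gap between the two means exceeds z combined standard errors.

    The threshold z=2.0 is an illustrative assumption, not a value from the paper.
    """
    mean_a, se_a = mean_and_stderr(scores_a)
    mean_b, se_b = mean_and_stderr(scores_b)
    combined = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(mean_a - mean_b) > z * combined
```
</preformat>
        <p>Passing the per-model scores for the 300k and 600k budgets to <monospace>trend_is_distinguishable</monospace> would return False whenever the spread across models is large relative to the gap between the two means, which is the situation described above.</p>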
        <p>By visual observation, most trained models learn to select
all friendly units and then mainly attack. Some learn to target
the ground near friendly units, causing those units to cluster.
Some models also learn to target damaged enemy units.</p>
      </sec>
      <sec id="sec-5-3">
        <title>Learned Policies - Unit Type Transfer</title>
        <p>The results in the bottom four rows of Table 1 are for
experiments carried out with policies learned on scenarios with
Marine vs. Marine battles, but applied to scenarios with
Stalker vs. Stalker battles. In general, the results show that
while the scores for these Stalker scenarios are not as high
as those for the Marine scenarios, the policies can indeed be
transferred to units of different types. In particular, the
policies perform well in smaller scenarios, with scores
tapering off for larger scenarios. We believe this is because,
as more units enter the battlefield, the differences
between the unit types, such as size and damage type, become more
apparent, causing the policy to perform worse. One
surprising result was that the policies trained on 16 vs. 16 Marines,
especially the one trained for 300k steps, performed much
worse than the other policies overall, for which we currently
have no explanation.</p>
        <p>In this paper we presented an application of
component-action DQN to combat scenarios in a complex real-time
strategy game domain, StarCraft II. We showed that with short
training times and a relatively easy to implement RL system,
good performance can be achieved in these combat
scenarios. We successfully demonstrated transfer learning between
battle scenarios of different sizes: policies can be learned
in a scenario of a given size and then applied to scenarios
of different sizes with comparable results. We also
demonstrated transfer learning between different unit types, with
policies learned in Marine battles
being successfully applied to battles with Stalker units. We
believe that these results show promise for the future of RL
in RTS games, by allowing us to train policies in smaller, less
complex scenarios and then apply those policies to different
areas of the game, reducing the need for longer training times
and much larger networks, like those found in AlphaGo.</p>
        <p>Future work for this project can include implementing
a similar network architecture and action component
system using policy gradient RL methods to compare them to
component-action DQN. Also, testing in scenarios that
require using more action types to achieve high scores may
help to better explore contributions of the component-action
method, which may improve transfer learning performance.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; and Synnaeve,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>An analysis of model-based heuristic search techniques for StarCraft combat scenarios</article-title>
          .
          <source>In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Heuristic Search Techniques for Real-Time Strategy Games</article-title>
          .
          <source>Ph.D. Dissertation</source>
          , University of Alberta.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          <string-name>
            <surname>Heinermann</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>Broodwar API</article-title>
          . https://github.com/bwapi/bwapi.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>Comparing observation and action representations for deep reinforcement learning in MicroRTS</article-title>
          . arXiv preprint arXiv:1910.12134.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rusu</surname>
            ,
            <given-names>A. A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Veness</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ; Bellemare,
          <string-name>
            <given-names>M. G.</given-names>
            ;
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Fidjeland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K.</given-names>
            ;
            <surname>Ostrovski</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; et al.
          <year>2015</year>
          .
          <article-title>Human-level control through deep reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>518</volume>
          (
          <issue>7540</issue>
          ):
          <fpage>529</fpage>
          -
          <lpage>533</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <string-name>
            <surname>Mnih</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Badia</surname>
            ,
            <given-names>A. P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Mirza</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lillicrap</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Harley</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          . In Balcan, M. F., and
          <string-name>
            <surname>Weinberger</surname>
          </string-name>
          , K. Q., eds.,
          <source>Proceedings of The 33rd International Conference on Machine Learning</source>
          , volume
          <volume>48</volume>
          <source>of Proceedings of Machine Learning Research</source>
          ,
          <fpage>1928</fpage>
          -
          <lpage>1937</lpage>
          . New York, New York, USA: PMLR.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Synnaeve,
          <string-name>
            <given-names>G.</given-names>
            ;
            <surname>Uriarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Richoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ;
            <surname>Churchill</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; and Preuss,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>RTS AI problems and techniques</article-title>
          . In Lee, N., ed.,
          <source>Encyclopedia of Computer Graphics and Games</source>
          . Cham: Springer International Publishing.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Pang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          -J.; Liu, R.-Z.;
          <string-name>
            <surname>Meng</surname>
            ,
            <given-names>Z</given-names>
          </string-name>
          .-Y.; Zhang,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          ; and Lu,
          <string-name>
            <surname>T.</surname>
          </string-name>
          <year>2019</year>
          .
          <article-title>On reinforcement learning for full-length game of StarCraft</article-title>
          .
          <source>In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)</source>
          , volume
          <volume>33</volume>
          ,
          <fpage>4691</fpage>
          -
          <lpage>4698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <surname>Samvelyan</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Rashid</surname>
            , T.; de Witt, C. S.; Farquhar,
            <given-names>G.</given-names>
          </string-name>
          ; Nardelli,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Rudner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. G. J.</given-names>
            ; Hung,
            <surname>C.-M.; Torr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. H. S.</given-names>
            ;
            <surname>Foerster</surname>
          </string-name>
          , J.; and
          <string-name>
            <surname>Whiteson</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2019</year>
          .
          <article-title>The StarCraft multi-agent challenge</article-title>
          . CoRR abs/1902.04043.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          <string-name>
            <surname>Schaul</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Quan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.;</given-names>
          </string-name>
          and
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <article-title>Prioritized experience replay</article-title>
          . In Bengio, Y., and LeCun, Y., eds.,
          <source>4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sun</surname>
            , X.; Han,
            <given-names>L</given-names>
          </string-name>
          .;
          <string-name>
            <surname>Xiong</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zheng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ; Liu,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Liu,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          ; Liu, H.; and Zhang, T.
          <year>2018</year>
          .
          <article-title>TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game</article-title>
          . arXiv preprint arXiv:1809.07193.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Synnaeve</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ; Nardelli,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Auvolat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Chintala</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Lacroix,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ;
            <surname>Richoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ; and
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          <article-title>TorchCraft: A library for machine learning research on real-time strategy games</article-title>
          .
          <source>arXiv preprint arXiv:1611.00625</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Zhao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; Zhu,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          ; and Guo,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Reinforcement learning for build-order production in StarCraft II</article-title>
          . In
          <source>2018 Eighth International Conference on Information Science and Technology (ICIST)</source>
          ,
          <fpage>153</fpage>
          -
          <lpage>158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          <string-name>
            <surname>Tavakoli</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Pardo</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ; and Kormushev,
          <string-name>
            <surname>P.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Action branching architectures for deep reinforcement learning</article-title>
          .
          <source>In Thirty-Second AAAI Conference on Artificial Intelligence</source>
          ,
          <fpage>4131</fpage>
          -
          <lpage>4138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Van Hasselt</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Deep reinforcement learning with double Q-learning</article-title>
          .
          <source>In Thirtieth AAAI Conference on Artificial Intelligence.</source>
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ewalds</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bartunov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Georgiev,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Vezhnevets</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. S.</surname>
          </string-name>
          ; Yeo,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Makhzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ;
            <surname>Agapiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; et al.
          <year>2017</year>
          .
          <article-title>Starcraft II: A new challenge for reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1708.04782</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babuschkin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Chung,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Mathieu,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Jaderberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Czarnecki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            ;
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Powell</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; Ewalds,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Horgan</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; Kroiss,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Danihelka</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          ; Agapiou,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Oh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Dalibard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            ;
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            ;
            <surname>Sulsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Vezhnevets</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Molloy</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; Cai,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Budden</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; Paine,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          ; Pfaff,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Pohlen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          ; Cohen,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>McKinney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            ;
            <surname>Schaul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Apps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ;
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            ;
            <surname>Hassabis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ; and
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          <year>2019a</year>
          .
          <article-title>AlphaStar: Mastering the real-time strategy game StarCraft II</article-title>
          . https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Babuschkin</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ; Czarnecki,
          <string-name>
            <given-names>W. M.</given-names>
            ;
            <surname>Mathieu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Dudzik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <surname>D. H.</surname>
          </string-name>
          ; Powell,
          <string-name>
            <surname>R.</surname>
          </string-name>
          ; Ewalds,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          ; et al. 2019b.
          <article-title>Grandmaster level in StarCraft II using multi-agent reinforcement learning</article-title>
          .
          <source>Nature</source>
          <volume>575</volume>
          (
          <issue>7782</issue>
          ):
          <fpage>350</fpage>
          -
          <lpage>354</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ; Schaul,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Hessel</surname>
          </string-name>
          , M.; Van Hasselt,
          <string-name>
            <surname>H.</surname>
          </string-name>
          ; Lanctot,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>De Freitas</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>2015</year>
          .
          <article-title>Dueling network architectures for deep reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1511.06581</source>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>