    Transfer Learning Between RTS Combat Scenarios Using Component-Action
                          Deep Reinforcement Learning
                                               Richard Kelly and David Churchill
                                                     Department of Computer Science
                                                  Memorial University of Newfoundland
                                                         St. John’s, NL, Canada
                                            richard.kelly@mun.ca, dave.churchill@gmail.com


                              Abstract

Real-time Strategy (RTS) games provide a challenging environment for AI research, due to their large state and action spaces, hidden information, and real-time gameplay. StarCraft II has become a new test-bed for deep reinforcement learning systems using the StarCraft II Learning Environment (SC2LE). Recently the full game of StarCraft II has been approached with a complex multi-agent reinforcement learning (RL) system; however, this is currently only possible with extremely large financial investments out of the reach of most researchers. In this paper we show progress on using variations of easier-to-use RL techniques, modified to accommodate the actions with multiple components used in the SC2LE. Our experiments show that we can effectively transfer trained policies between RTS combat scenarios of varying complexity. First, we train combat policies on varying numbers of StarCraft II units, and then carry out those policies on larger scale battles, maintaining similar win rates. Second, we demonstrate the ability to train combat policies on one StarCraft II unit type (Terran Marine) and then apply those policies to another unit type (Protoss Stalker) with similar success.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
                 1    Introduction and Related Work

Real-Time Strategy (RTS) games are a popular testbed for research in Artificial Intelligence: with complex sub-problems providing many algorithmic challenges (Ontañón et al. 2015), and with multiple RTS game APIs available (Heinermann 2013; Synnaeve et al. 2016; Vinyals et al. 2017), they provide an ideal environment for testing novel AI methods. In 2018, Google DeepMind unveiled an AI agent called AlphaStar (Vinyals et al. 2019a), which used machine learning (ML) techniques to play StarCraft II at a professional human level. AlphaStar was initially trained using supervised learning from hundreds of thousands of human game traces, and then continued to improve via self-play with deep RL, a method by which the agent improves its policy by learning to take actions which lead to higher rewards more often. While this method was successful in producing a strong agent, it required a massive engineering effort, with a team comprised of more than 30 world-class AI researchers and software engineers. AlphaStar also required an enormous financial investment in hardware for training, using over 80000 CPU cores to run simultaneous instances of StarCraft II, 1200 Tensor Processing Units (TPUs) to train the networks, as well as a large amount of infrastructure and electricity to drive this large-scale computation. While AlphaStar is estimated to be the strongest existing RTS AI agent and was capable of beating many players at the Grandmaster rank on the StarCraft II ladder, it does not yet play at the level of the world's best human players (e.g. in a tournament setting). The creation of AlphaStar demonstrated that using deep learning to tackle RTS AI is a powerful solution; however, applying it to the entire game as a whole is not economically viable for anyone but the world's largest companies. In this paper we attempt to demonstrate that one possible solution is to use the idea of transfer learning: learning to generate policies for sub-problems of RTS games, and then using those learned policies to generate actions for many other sub-problems within the game, which can yield savings in both training time and infrastructure costs.

   In 2017, Blizzard Entertainment (developer of the StarCraft games) released SC2API: an API for external control of StarCraft II. DeepMind, in collaboration with Blizzard, simultaneously released the SC2LE with a Python interface called PySC2, designed to enable ML research with the game (Vinyals et al. 2017). AlphaStar was created with tools built on top of SC2API and SC2LE. PySC2 allows commands to be issued in a way similar to how a human would play; units are selected with point coordinates or by specifying a rectangle the way a human would with the mouse. Actions are formatted as an action function (e.g. move, attack, cast a spell, unload transport, etc.) with varying numbers of arguments depending on the function. This action representation differs from that used in other RTS AI research APIs, including the TorchCraft ML library for StarCraft: Broodwar (Synnaeve et al. 2016). In this paper we refer to the action function and its arguments as action components. Representing actions as functions with parameters creates a complicated action-space that requires modifications to classic RL algorithms.

   Since the release of PySC2, several works other than AlphaStar have been published using this environment with deep RL. Tang et al. (2018) trained an agent to learn unit and building production decisions (build order) using RL while handling resource harvesting and combat with scripted modules. Sun et al. (2018) reduced the action space by training
an agent to use macro actions that combine actions available in PySC2 to play the full game of StarCraft II against the built-in AI. Macro actions have also been used in combination with a hierarchical RL architecture of sub-policies and states and curriculum transfer learning to progressively train an agent on harder levels of the built-in AI to learn to play the full game of StarCraft II (Pang et al. 2019). Samvelyan et al. (2019) introduced a platform using the SC2LE called the StarCraft Multi-Agent Challenge, used for testing multi-agent RL by treating each unit as a separate agent. Notably, these other works have not used the complex action-space directly provided by PySC2, which mirrors how humans play the game.

   This paper is organized as follows: in the next section we describe how our experiments interact with the SC2LE; following that, we present our implementation of component-action DQN; in the Experiments section we describe our experimental setup and results; finally, we present our conclusions and ideas for expanding on this research.

              2    StarCraft II Learning Environment

The PySC2 component of the SC2LE is designed for ML/RL research, exposing the gamestate mainly as 2D feature maps and using an action-space similar to human input. An RL player receives a state observation from PySC2 and then specifies an action to take once every n frames, where n is adjustable. The game normally runs at 24 frames per second, and all our experiments have the RL player acting every 8 frames, as in previous work (Vinyals et al. 2017). At this speed the RL player acts at a similar rate to a human, and the gamestate can change meaningfully between states.

2.1   PySC2 observations

PySC2 observations consist of a number of 2D feature maps conveying categorical and scalar information from the main game map and minimap, as well as a 1D vector of other information that does not have a spatial component. The observation also includes a list of valid actions for the current frame, which we use to mask out illegal actions. In our experiments we use a subset of the main map spatial features relevant to the combat scenarios we use:

player_relative - categorical feature describing if units are "self" or enemy (and some other categories we don't use)
selected - Boolean feature showing which units are currently selected (i.e. if the player can give them commands)
unit_hit_points - scalar feature giving remaining health of units, which we convert to 3 categories

Figure 1: 8m vs. 8m scenario showing game screen (left) and PySC2 features used (right).
2.2   PySC2 actions

Actions in PySC2 are conceptually similar to how a human player interacts with the game, and consist of a function selection and 0 or more arguments to the function. We use a small subset of the over 500 action functions in PySC2, which are sufficient for the combat scenarios in our experiments. Those functions and their parameters are:

no_op - do nothing
select_rect - select units in a rectangular region
   • screen (x, y) (top-left position)
   • screen2 (x, y) (bottom-right position)
select_army - select all friendly combat units
attack_screen - attack unit or move towards a position and stop to attack enemies in range on the way
   • screen (x, y)
move_screen - move to a position while ignoring enemies
   • screen (x, y)

   Any player controlled with PySC2 will still have its units automated to an extent by the built-in AI, as with a human player. For example, if units are standing idly and an enemy enters within range, they will attack that enemy.
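As an illustration, the following minimal sketch shows how such a component-action is assembled through PySC2's Python interface. FUNCTIONS, available_actions, Attack_screen, and select_army are part of the public PySC2 API (exact attribute names can vary between PySC2 versions); the helper name attack_point is ours, and PySC2's extra "queued" flag ("now") is not one of the components modelled in this paper but must still be supplied to the call.

    # Sketch of issuing a component-action through PySC2 (assumptions noted above).
    from pysc2.lib import actions as sc2_actions

    def attack_point(obs, x, y):
        """Select the army if attacking is not yet available, otherwise attack (x, y)."""
        if sc2_actions.FUNCTIONS.Attack_screen.id in obs.observation.available_actions:
            # function component: attack_screen; screen component: (x, y)
            return sc2_actions.FUNCTIONS.Attack_screen("now", (x, y))
        # function component: select_army (no screen arguments needed)
        return sc2_actions.FUNCTIONS.select_army("select_all")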
                  3    Component-Action DQN

For this research we implemented the DQN RL algorithm, first used to achieve human-level performance on a suite of Atari 2600 games (Mnih et al. 2015), with the double DQN enhancement (Van Hasselt, Guez, and Silver 2016), a dueling network architecture (Wang et al. 2015), and prioritized experience replay (Schaul et al. 2016). The number of unique actions, even in the small subset we are using, is too large to output an action-value for each one (for instance, if the screen size is 84x84 pixels, there are 7056 unique "attack screen" commands alone). To address this problem we output a separate set of action-values for each action component. Our version of DQN, implemented using Python and TensorFlow, features modifications for choosing actions and calculating training loss with actions that consist of multiple components (component-actions). Algorithm 1 shows our implementation, emphasizing changes for handling component-actions.
Algorithm 1 Double DQN with prioritized experience replay and component-actions for StarCraft II
 1: Input: minibatch size k, batch update frequency K, target update frequency C, max steps T, memory size N, M_min ≤ N, initial ε and annealing schedule
 2: M[] ← ∅
 3: Initialize Q with random weights θ, components d ∈ D
 4: Initialize Q_tar with weights θ_tar = θ
 5: Initialize environment env with state s_1
 6: for t = 1 to T do
 7:     if rand() < ε or t ≤ M_min then
 8:         Select valid action function a^0_t randomly
 9:         Select action parameters a^1_t, a^2_t, ... randomly
10:     else
11:         a_t = {a^d_t = argmax_{a^d} Q^d(s_t, a^d) | d ∈ D}
12:     Take action a_t in environment, observe s_{t+1}, r_{t+1}
13:     M[t mod N] ← (s_t, a_t, r_{t+1}, s_{t+1})
14:     Priority[M[t mod N]] ← max(Priority)
15:     Anneal ε according to schedule
16:     if t ≡ 0 mod C and t > M_min then θ_tar ← θ
17:     if t ≡ 0 mod K and t > M_min then
18:         Sample k transitions (s_j, a_j, r_j, s'_j) from M
19:         a'^0 ← argmax_{a'^0} Q^0(s'_j, a'^0)
20:         m^d ← 1 if a^d is a parameter to a'^0, else m^d ← 0
21:         if s'_j is terminal then
22:             y_j ← r_j
23:         else
24:             Set y_j as in Equation 2
25:         Update θ minimizing Σ_d m^d L(y_j, Q^d(s_j, a^d_j))
26:         Priority[j] ← Σ_d m^d |y_j − Q^d(s_j, a^d_j)|

   Invalid action component choices are masked in several parts of the algorithm. Random action function selection in step 8 is masked according to the valid actions for that state. When choosing a non-random action using the network output in step 11, the action function with the highest action-value is chosen first, disregarding actions marked as unavailable in the state observation, and then each parameter to the action function is chosen according to the highest action-value (step 11). Invalid choices for parameters to the action function could be masked out in steps 9 and 11 in general, but are not in this work since all values of the parameters we used (screen and screen2) are valid. Only parameters that are used by the chosen action function a^0_t are used in the environment update in step 12.
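A minimal sketch of this masked, per-component greedy choice (step 11), assuming q_values maps each component name to a NumPy array of that component's action-values and available_function_ids lists the valid PySC2 function ids for the state (the variable names are ours):

    import numpy as np

    def greedy_component_action(q_values, available_function_ids):
        """Pick the argmax of each component's action-values, masking invalid functions."""
        # Mask function ids that PySC2 reports as unavailable in this state.
        masked = np.full_like(q_values["function"], -np.inf)
        masked[available_function_ids] = q_values["function"][available_function_ids]
        action = {"function": int(np.argmax(masked))}
        # screen/screen2 values index flattened 84x84 positions; all positions are
        # valid in our scenarios, so no mask is applied to the parameter components.
        for component in ("screen", "screen2"):
            action[component] = int(np.argmax(q_values[component]))
        return action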
3.1   Loss Functions

In double DQN, network updates are made during training using the TD target value

    y = r + γ Q_tar(s', argmax_{a'} Q(s', a')),

where r is the reward, γ is the discount, Q and Q_tar are the primary and target networks, and s' and a' are the next state and the action taken in the next state. With a single action component the loss to be minimized is L(y, Q(s, a)), where L() is some loss function (e.g. squared error). When using component-actions, there are separate action-values for each component of the action. In general, an action is a tuple of action components (a^0, a^1, ..., a^n), with an individual action depending on a subset of those components. That is, some components are not used for every action. This raises the question of how to calculate the training loss when the action components used from one action to the next differ.

   Several previous works have used variations of action components with different RL algorithms. Tavakoli, Pardo, and Kormushev (2018) use a method they call Branching Dueling Q-Network (BDQ) in several MuJoCo physical control tasks with multidimensional action spaces. In these tasks, all components are used in each complete action. They experimented with 1) using a separate TD target, y^d, for each action component a^d corresponding to a component d ∈ D, the set of all action components; 2) using the maximum y^d as a single TD target; and 3) using the average of the y^d as a single TD target, which was found to give the best performance:

    y = r + (γ / |D|) Σ_{d ∈ D} Q^d_tar(s', argmax_{a'^d} Q^d(s', a'^d))          (1)

Their training loss is the average of the squared errors of each component's action-value and y from Equation 1.

   Action components were also used (Huang and Ontañón 2019) to represent actions in the microRTS environment to train an agent using the Advantage Actor Critic (A2C) algorithm, a synchronous version of the policy gradient method Asynchronous Advantage Actor Critic (A3C) (Mnih et al. 2016). In that work each component is not used in every action, but the policy gradient action log-probability (log(π_θ(s_t, a_t)), where π_θ is the policy network) used in training is always the sum of the action log-probabilities for each action component. AlphaStar (Vinyals et al. 2019b) also uses a policy gradient method and sums the action log-probabilities from each action component, but masks out the contribution from unused components.

   In our implementation, on step 24, we use the mean target component y as in Equation 1, but modified so that unused components of the target action are masked out using m^d = 1 if component d is in use for an action, m^d = 0 otherwise:

    y = r + (γ / |D|) Σ_{d ∈ D} m^d Q^d_tar(s', argmax_{a'^d} Q^d(s', a'^d))      (2)

   The loss to be minimized with gradient descent on step 25 of Algorithm 1 is the sum of the losses of each used component's action-value compared to the target y:

    L_Total = Σ_{d ∈ D_used} L(y, Q^d(s, a^d))

where D_used are the components in use for a particular action, and L() is the Huber loss used in DQN (Mnih et al. 2015), which is squared error for errors with absolute value less than 1, and linear otherwise.
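The following sketch computes the masked mean target of Equation 2 and the summed per-component Huber loss L_Total for a batch of transitions. The tensor names, shapes, and dictionary layout are our own assumptions for illustration, not code from the paper:

    import tensorflow as tf

    def huber(error, delta=1.0):
        # Squared error for |error| < delta, linear beyond that (as in DQN).
        abs_err = tf.abs(error)
        quad = tf.minimum(abs_err, delta)
        return 0.5 * quad ** 2 + delta * (abs_err - quad)

    def masked_target_and_loss(reward, discount, terminal, q_online, q_target,
                               q_taken, mask):
        """q_online/q_target: dicts of [batch, |A_d|] next-state tensors per component;
        q_taken: dict of [batch] action-values of the taken components;
        mask: dict of [batch] 0/1 indicators for components used by the action."""
        next_values = []
        for d in q_online:
            greedy = tf.argmax(q_online[d], axis=1)               # double-DQN argmax
            q_next = tf.gather(q_target[d], greedy, axis=1, batch_dims=1)
            next_values.append(mask[d] * q_next)
        # Mean over all components |D|; unused components contribute zero (Equation 2).
        y = reward + discount * (1.0 - terminal) * tf.add_n(next_values) / len(next_values)
        y = tf.stop_gradient(y)
        # Summed Huber loss over the used components only (L_Total).
        return tf.reduce_mean(tf.add_n([mask[d] * huber(y - q_taken[d]) for d in q_taken]))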
   We considered and tested two alternative loss calculations. The first method was to sum the Huber losses of the components compared pairwise with

    y^d = r / |D_used| + γ Q^d_tar(s', argmax_{a'^d} Q^d(s', a'^d)),

masking out unused components from the action taken but not from the target action. This method is similar to method 1) tested with BDQ (Tavakoli, Pardo, and Kormushev 2018). It has the advantage of comparing losses from different action component branches of the network to their corresponding branch, but since it updates the primary network toward the action-values of unused components of the target action, we didn't expect it to perform well.

   The second alternative method we tried was to use the Huber loss of the average of the action-values of the action taken by the primary network (unused components masked out) compared with the same average y from Equation 1. This method seems reasonable since it uses the same formula to calculate both the primary network's total action-value and the discounted target network next-state value. However, branches with positive and negative action-values can cancel each other out in both inputs to the loss function, resulting in a small loss when individual components of the primary network action-value may have a large error relative to the target network. Both alternative loss calculations resulted in worse performance and were not used in our final experiments.

3.2   Network and training parameters

The neural network used to predict action-values consists of a shared convolutional neural network (CNN), followed by separate branches for state value and the three action components we used (function, screen, screen2). The network's overall design is shown in Figure 2.

Figure 2: Diagram showing overall design of the network.

   The state input categorical feature maps are first preprocessed to be in one-hot format with dimensions 84x84x6. Next the input is run through a series of blocks consisting of parallel convolutional layers whose outputs are concatenated. First the one-hot input is run through each of 3 convolutional layers with 16 filters each and kernel sizes 7, 5, and 3. The output of those layers is concatenated along the channel dimension and given as input to each of two convolutional layers with 32 filters and kernel sizes 5 and 3. Those two outputs are concatenated and fed to a final convolutional layer with 32 filters and kernel size 3. All convolutional layers here use padding and a stride of 1 to keep the output at the same width and height, so as to preserve spatial data for the action parameters which target parts of the screen.

   Next the network splits into a value branch and one branch for each action component. The value and function branches each have a max pooling layer of size and stride 3, followed by 2 dense layers of size 256. The function branch ends with a final dense layer outputting the action advantages of the function action component.

   Both the screen and screen2 branches receive as input the output of the shared CNN, which is fed into a 32-filter 3x3 convolutional layer, followed by a 1-filter 1x1 convolutional layer as described in (Vinyals et al. 2017), giving output dimensions of 84x84x1 and the action advantages of each screen position. We experimented with adding the one-hot encoded output from the function branch as input to the screen branch followed by additional convolutional layers, and similarly with the screen and screen2 branches with a single layer added with value 1 in the position corresponding to the screen choice and 0 elsewhere, but found that for the action functions and scenarios used in these experiments there was surprisingly no gain in performance. We believe a larger network combined with more training time may be required to take advantage of these connections.
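A condensed Keras-style sketch of this architecture, under our reading of the description: layer counts and sizes follow the text, but the padding mode, the dueling combination (V + A - mean(A)), and the omission of the batch normalization described below are our assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_network(num_functions, screen_size=84, channels=6):
        inp = layers.Input(shape=(screen_size, screen_size, channels))  # one-hot features

        # Shared CNN: parallel convolutional blocks whose outputs are concatenated.
        b1 = [layers.Conv2D(16, k, padding="same", activation="relu")(inp) for k in (7, 5, 3)]
        x = layers.Concatenate()(b1)
        b2 = [layers.Conv2D(32, k, padding="same", activation="relu")(x) for k in (5, 3)]
        x = layers.Concatenate()(b2)
        shared = layers.Conv2D(32, 3, padding="same", activation="relu")(x)

        # Value and function branches: max pooling then two dense layers of size 256.
        def pooled_dense(t):
            t = layers.MaxPooling2D(pool_size=3, strides=3)(t)
            t = layers.Flatten()(t)
            t = layers.Dense(256, activation="relu")(t)
            return layers.Dense(256, activation="relu")(t)

        value = layers.Dense(1)(pooled_dense(shared))                 # state value
        func_adv = layers.Dense(num_functions)(pooled_dense(shared))  # function advantages

        # Screen branches: 3x3 conv then 1x1 conv giving one advantage per screen position.
        def screen_branch(t):
            t = layers.Conv2D(32, 3, padding="same", activation="relu")(t)
            t = layers.Conv2D(1, 1, padding="same")(t)
            return layers.Flatten()(t)                                # 84*84 advantages

        screen_adv = screen_branch(shared)
        screen2_adv = screen_branch(shared)

        # Dueling combination (one assumed variant): Q = V + A - mean(A).
        def dueling(adv):
            return value + adv - tf.reduce_mean(adv, axis=1, keepdims=True)

        outputs = {"function": dueling(func_adv),
                   "screen": dueling(screen_adv),
                   "screen2": dueling(screen2_adv)}
        return tf.keras.Model(inp, outputs)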
   Each convolutional and dense layer (except those leading to action advantage outputs) is followed by a ReLU activation and batch normalization. Training hyperparameters were selected through informal search. We use the Adam optimizer with a learning rate of 0.001, a discount of 0.99, a batch size of 64, and L2 regularization. Minibatch training updates are performed every 4 steps, and the target network is updated every 10,000 steps. The prioritized experience replay memory size is 60,000 transitions, and all parameters are as described in the sum tree implementation (Schaul et al. 2016). In each training run the exploration ε is exponentially annealed from 1 to 0.05 over the first 80% of total steps.
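For example, such an exponential anneal could be computed per step as follows; this is one reasonable interpretation of the schedule described above, not the exact code used in our experiments.

    import math

    def epsilon(step, total_steps, eps_start=1.0, eps_end=0.05, anneal_fraction=0.8):
        """Exponentially decay epsilon from eps_start to eps_end over the first
        anneal_fraction of training, then hold it constant."""
        anneal_steps = total_steps * anneal_fraction
        if step >= anneal_steps:
            return eps_end
        decay_rate = math.log(eps_end / eps_start) / anneal_steps
        return eps_start * math.exp(decay_rate * step)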
                         4    Experiments

We did three experiments to test the performance of our network structure and action representation in different environments. First, we compared performance when training for different amounts of training steps; second, we compared performance on different scenarios, ranging from small (4 units each) to large (32 units each) instances of a combat scenario; third, we tested transfer learning performance by using models trained exclusively on one scenario in other scenarios. Performance of all trained models is compared to that of a number of simple scripted players. All training and evaluation is done with the opponent being the game's built-in AI, i.e. the same opponent a human player playing the single-player game would face.

4.1   StarCraft II Combat Scenarios

Experiments were conducted on custom StarCraft II maps that each contain a specific combat scenario. The scenarios limit the size of the battlefield to a single-screen-sized map, which removes the requirement to navigate the view to different areas of the environment. In each episode an equal number of units is spawned for both the RL or scripted player being tested and a built-in AI controlled player. Units appear in two randomized clusters, randomly assigned to the left or right side of the map, which are symmetric about the centre of the map. The built-in AI enemy is immediately given an order to attack the opposite side of the screen, causing its units to rush at the RL player's units, attacking the first enemies they reach. Once given this command, the in-game AI takes over control of the enemy units, which executes a policy
that prioritizes attacking the closest units of the RL player. This use of the built-in AI to control the enemy for combat experiments has been shown to be effective for testing and training combat algorithm development (Churchill, Lin, and Synnaeve 2017).

   We trained and evaluated our models and scripted players on scenarios with equal numbers of Terran Marines (a basic ranged unit in the game) per side, numbering 4, 8, 16, or 32. The Marine scenarios will be referred to as "#m vs. #m", where # is the number of Marines. We also evaluated our models on scenarios with the same counts of Protoss Stalkers, referred to as "#s vs. #s". Stalkers are larger units with longer range and a shield which must be reduced to zero before their health will deplete. These scenarios were constructed in a symmetric fashion in order to give both sides an equal chance at winning each battle. If one side is victorious from an even starting position, it must mean that its method is more effective at controlling units for combat. An effective human policy for such scenarios involves first grouping up the player's units into a tight formation, and then using focus-targeting to most efficiently destroy enemy units.

   Episodes end when all units of one side are destroyed (health reduced to 0), or when a timer runs out. The timer is set at 45 real-time game seconds for Marine maps, and 1:30 for the Stalker maps, since Stalkers have more health and shields, causing the battles to last longer. The map then resets with a new randomized unit configuration and a new episode begins. The maps output custom reward information to PySC2, which is calculated using the LTD2 formula (Churchill 2016), which values a unit by its lifetime damage, a function of its damage output per second multiplied by remaining health. The unit values are normalized to equal 1 for a full-health unit of the highest value in the scenario, and the difference in total unit value (self minus enemy) between steps is added to the step reward. Since there are positive rewards for damaging enemy units and negative rewards for taking damage, episode rewards tend to be similar in scenarios with different numbers of units. The damage part of the LTD2 calculation doesn't affect the rewards in scenarios that feature only one unit type, as in the experiments presented here, but it will affect future experiments with more scenarios with mixed unit types. The total reward per step is observed by the RL player in step 12 of Algorithm 1.
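A sketch of this step-reward calculation as we interpret it; in the paper the custom maps themselves compute and output the reward, so the helper names and unit fields below are illustrative assumptions only.

    # Illustration of the LTD2-based step reward described above (Churchill 2016).
    def unit_value(unit, max_value):
        # Damage output per second times remaining health (plus shields), normalized
        # so that a full-health unit of the highest-valued type in the scenario is 1.
        return (unit.dps * (unit.health + unit.shields)) / max_value

    def step_reward(own_units, enemy_units, prev_diff, max_value):
        diff = (sum(unit_value(u, max_value) for u in own_units)
                - sum(unit_value(u, max_value) for u in enemy_units))
        # Reward is the change in the (self minus enemy) value difference this step.
        return diff - prev_diff, diff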
                                                                     which is equivalent to a human player providing no input.
4.2   Training and Evaluation                                        Its units will attack and follow enemy units if they move
We trained models in scenarios with 4, 8, 16, and 32 Marines         into range, controlled by the in-game AI.
per side, for 300k and 600k steps. In informal testing we         Random (R) - This player takes random actions. Both the
found that 300k steps was enough to see good performance             function choice and any arguments are randomized.
in these scenarios, and we chose double the number of steps       Random Attack (RA) - This player selects all friendly
to compare with a longer training time. Training rewards per         units, and subsequently attacks random screen positions.
episode for 300k and 600k steps is shown in Figure 3. Train-
ing was performed on a machine with an Intel i7-7700K CPU         Attack Weakest Nearest (AWN) - This player selects all
running at 4.2 GHz and an NVIDIA GeForce GTX 1080 Ti                 friendly units on the first frame and then attacks the en-
video card. It takes 5 hours to train for 300k steps using the       emy with the lowest health, choosing the enemy nearest to
GPU and running the game as fast as possible. To train a             the average of the friendly units’ positions as a tie breaker.
model we use PySC2 to run our custom combat scenarios for
as many episodes as needed to reach the step limit. The RL                                                    5   Results and Discussion
player receives input from the game environment and takes         Results of our experiments can be seen in Table 1. The first
actions as described in Algorithm 1.                              column shows the scenario for which the experiment is being
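As an example of how such a scripted baseline can be expressed against the same interface, a simplified sketch of the AWN targeting rule is given below. We assume friendly unit coordinates and enemy units (with x, y, and health attributes) have already been extracted from the PySC2 feature layers; that bookkeeping, and the actual attack_screen call, are omitted.

    import numpy as np

    # Simplified sketch of the Attack Weakest Nearest (AWN) target choice.
    def awn_target(friendly_xy, enemies):
        centroid = np.mean(np.asarray(friendly_xy, dtype=float), axis=0)
        min_health = min(e.health for e in enemies)
        weakest = [e for e in enemies if e.health == min_health]
        # Tie-break: the weakest enemy nearest to the friendly units' average position.
        target = min(weakest, key=lambda e: np.hypot(e.x - centroid[0], e.y - centroid[1]))
        return (target.x, target.y)   # screen argument for attack_screen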
                     5    Results and Discussion

Results of our experiments can be seen in Table 1. The first column shows the scenario for which the experiment is being conducted along a row, with the other column values showing which policy controls the player units for the experiment.

                       Scripted Players           Learned Policy - 300k Steps     Learned Policy - 600k Steps
                                                       Trained on Marines             Trained on Marines
    Scenario      NA      R      RA     AWN        4m     8m     16m    32m        4m     8m     16m    32m
   4m vs. 4m    0.173  0.013  0.570   0.907      0.691  0.661  0.619  0.625      0.663  0.606  0.444  0.414
   8m vs. 8m    0.146  0.019  0.561   0.721      0.694  0.691  0.692  0.541      0.693  0.585  0.356  0.440
 16m vs. 16m    0.119  0.004  0.391   0.038      0.591  0.546  0.695  0.450      0.643  0.213  0.300  0.503
 32m vs. 32m    0.044  0.002  0.116   0.000      0.329  0.280  0.441  0.332      0.340  0.247  0.130  0.471
   4s vs. 4s    0.142  0.000  0.352   0.910      0.497  0.491  0.008  0.479      0.525  0.291  0.236  0.408
   8s vs. 8s    0.179  0.000  0.240   0.678      0.420  0.406  0.015  0.386      0.512  0.192  0.204  0.303
 16s vs. 16s    0.211  0.001  0.030   0.013      0.114  0.183  0.034  0.356      0.288  0.152  0.154  0.196
 32s vs. 32s    0.130  0.000  0.002   0.000      0.075  0.239  0.223  0.320      0.386  0.193  0.355  0.170

Table 1: Experiment scores ((wins + draws/2)/# episodes) of scripted players and learned policies evaluated for 1k episodes
in multiple scenarios. Columns give the policy being evaluated, and rows give the scenario. e.g. the number in the 300k 4m
column, 8m vs. 8m row gives the score of the model trained for 300k steps on the 4m vs. 4m scenario when evaluated on the
8m vs. 8m scenario. The best performing policy in each category (scripted, trained for 300k steps, and 600k steps) is bolded.


5.1   Scripted Player Benchmarks

The results of the scripted players can be seen in columns 2-5 of Table 1. The performance of the benchmark scripted players on the Marine scenarios shows a large range of results. The NA policy scores as high as 0.173, as units will attack when enemies get within range even if the player does nothing. The random bot gets scores from 0 to 0.019, whereas the random attacking bot ranges from a score of 0.116 in the 32m vs. 32m scenario to 0.570 in the 4m vs. 4m scenario. This shows that simply choosing to attack all the time is a good strategy, but that it isn't enough to win often with higher unit counts (in the 32m vs. 32m scenario the bot is likely attacking its own units sometimes). AWN achieves a score of 0.907 in the 4m vs. 4m scenario, but scores 0 in the 32m vs. 32m scenario, indicating that formation becomes a bigger factor in more complex scenarios. We observed that units often get stuck while trying to move to attack the same target. The scripted players perform best in the 4m vs. 4m scenario and perform worse as the number of units increases in all cases except R in the 8m vs. 8m scenario. Scripted players perform similarly in the Stalker scenarios.

5.2   Learned Policies - Battle Size Transfer

Results for the learned policies can be seen in columns 6-9 (300k training steps) and 10-13 (600k training steps) of Table 1. We believe that these results yield two major observations. The first is that, like the scripted players, smaller sized battles yielded higher scores for learned policies. We believe this is because smaller battles are less complex, with far fewer possible states, and therefore are easier for learning effective policies. Also, smaller battles end faster, so more battles are able to be carried out in the same number of training steps. The second observation is that the experiments demonstrated the ability to transfer a policy trained on battles of one given size to another, while maintaining similar results. For example, a policy trained on 4 vs. 4 Marines for 600k time steps was able to obtain a score of 0.663 when applied to the 4 vs. 4 Marine scenario, and 0.643 when applied to the 16 vs. 16 Marine scenario, which was even better than the policy that was itself trained on 16 Marines. We did however notice that results get worse as the difference in unit count between training and testing scenarios gets larger, which was expected.

   One surprising result, however, was that for several scenarios, the best policy was not one that was trained on that same scenario. For example, the policy trained on 16 vs. 16 Marines for 600k time steps was the 2nd worst policy when applied in the 16 vs. 16 Marine scenario. Another surprising result was that most of the policies trained for 300k time steps ended up performing as well as, or even better than, those trained for 600k time steps. This indicates that either the scenarios are not complicated enough for more training to result in better policies, or that the variability in model performance is too large to see a trend for the number of models we trained.

   By visual observation, most trained models learn to select all friendly units and then mainly attack. Some learn to target the ground near friendly units, causing those units to cluster. Some models also learn to target damaged enemy units.

5.3   Learned Policies - Unit Type Transfer

The results in the bottom 4 rows of Table 1 are for experiments carried out with policies learned on scenarios with Marine vs. Marine battles, but applied to scenarios with Stalker vs. Stalker battles. In general, the results show that while the scores for these Stalker scenarios are not as high as for the Marine scenarios, the policies can indeed be transferred to units of different types. In particular, we can see that smaller sized scenarios perform well, with scores tapering off for larger scenarios. We believe this is due to the fact that as more units enter the battlefield, the differences between those units, such as size and damage type, become more apparent, causing the policy to perform worse. One surprising result was that the policies trained on 16 vs. 16 Marines, especially the one trained for 300k steps, performed much worse than the other policies overall, for which we currently have no explanation.
                 6    Conclusion and Future Work

In this paper we presented an application of component-action DQN to combat scenarios in a complex real-time strategy game domain, StarCraft II. We showed that with short training times and a relatively easy to implement RL system, good performance can be achieved in these combat scenarios. We successfully demonstrated transfer learning between battle scenarios of different sizes: policies can be learned in a scenario of a given size, and then applied to scenarios of different sizes with comparable results. We also demonstrated transfer learning between different unit types, with policies learned in scenarios with Marine unit battles being successfully applied to battles with Stalker units. We believe that these results show promise for the future of RL in RTS games, by allowing us to train policies in smaller, less complex scenarios, and then apply those policies to different areas of the game, reducing the need for longer training times and much larger networks, like those found in AlphaGo.

   Future work for this project can include implementing a similar network architecture and action component system using policy gradient RL methods to compare them to component-action DQN. Also, testing in scenarios that require using more action types to achieve high scores may help to better explore contributions of the component-action method, which may improve transfer learning performance.
                           References

Churchill, D.; Lin, Z.; and Synnaeve, G. 2017. An analysis of model-based heuristic search techniques for StarCraft combat scenarios. In Thirteenth Artificial Intelligence and Interactive Digital Entertainment Conference.
Churchill, D. 2016. Heuristic Search Techniques for Real-Time Strategy Games. Ph.D. Dissertation, University of Alberta.
Heinermann, A. 2013. Broodwar API. https://github.com/bwapi/bwapi.
Huang, S., and Ontañón, S. 2019. Comparing observation and action representations for deep reinforcement learning in MicroRTS. arXiv preprint arXiv:1910.12134.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540):529–533.
Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In Balcan, M. F., and Weinberger, K. Q., eds., Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, 1928–1937. New York, New York, USA: PMLR.
Ontañón, S.; Synnaeve, G.; Uriarte, A.; Richoux, F.; Churchill, D.; and Preuss, M. 2015. RTS AI problems and techniques. In Lee, N., ed., Encyclopedia of Computer Graphics and Games. Cham: Springer International Publishing. 1–12.
Pang, Z.-J.; Liu, R.-Z.; Meng, Z.-Y.; Zhang, Y.; Yu, Y.; and Lu, T. 2019. On reinforcement learning for full-length game of StarCraft. In Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19), volume 33, 4691–4698.
Samvelyan, M.; Rashid, T.; de Witt, C. S.; Farquhar, G.; Nardelli, N.; Rudner, T. G. J.; Hung, C.-M.; Torr, P. H. S.; Foerster, J.; and Whiteson, S. 2019. The StarCraft multi-agent challenge. CoRR abs/1902.04043.
Schaul, T.; Quan, J.; Antonoglou, I.; and Silver, D. 2016. Prioritized experience replay. In Bengio, Y., and LeCun, Y., eds., 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
Sun, P.; Sun, X.; Han, L.; Xiong, J.; Wang, Q.; Li, B.; Zheng, Y.; Liu, J.; Liu, Y.; Liu, H.; and Zhang, T. 2018. TStarBots: Defeating the cheating level builtin AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193.
Synnaeve, G.; Nardelli, N.; Auvolat, A.; Chintala, S.; Lacroix, T.; Lin, Z.; Richoux, F.; and Usunier, N. 2016. TorchCraft: A library for machine learning research on real-time strategy games. arXiv preprint arXiv:1611.00625.
Tang, Z.; Zhao, D.; Zhu, Y.; and Guo, P. 2018. Reinforcement learning for build-order production in StarCraft II. In 2018 Eighth International Conference on Information Science and Technology (ICIST), 153–158.
Tavakoli, A.; Pardo, F.; and Kormushev, P. 2018. Action branching architectures for deep reinforcement learning. In Thirty-Second AAAI Conference on Artificial Intelligence, 4131–4138.
Van Hasselt, H.; Guez, A.; and Silver, D. 2016. Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.
Vinyals, O.; Ewalds, T.; Bartunov, S.; Georgiev, P.; Vezhnevets, A. S.; Yeo, M.; Makhzani, A.; Küttler, H.; Agapiou, J.; Schrittwieser, J.; et al. 2017. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782.
Vinyals, O.; Babuschkin, I.; Chung, J.; Mathieu, M.; Jaderberg, M.; Czarnecki, W.; Dudzik, A.; Huang, A.; Georgiev, P.; Powell, R.; Ewalds, T.; Horgan, D.; Kroiss, M.; Danihelka, I.; Agapiou, J.; Oh, J.; Dalibard, V.; Choi, D.; Sifre, L.; Sulsky, Y.; Vezhnevets, S.; Molloy, J.; Cai, T.; Budden, D.; Paine, T.; Gulcehre, C.; Wang, Z.; Pfaff, T.; Pohlen, T.; Yogatama, D.; Cohen, J.; McKinney, K.; Smith, O.; Schaul, T.; Lillicrap, T.; Apps, C.; Kavukcuoglu, K.; Hassabis, D.; and Silver, D. 2019a. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/.
Vinyals, O.; Babuschkin, I.; Czarnecki, W. M.; Mathieu, M.; Dudzik, A.; Chung, J.; Choi, D. H.; Powell, R.; Ewalds, T.; Georgiev, P.; et al. 2019b. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575(7782):350–354.
Wang, Z.; Schaul, T.; Hessel, M.; Van Hasselt, H.; Lanctot, M.; and De Freitas, N. 2015. Dueling network architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581.