<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Extracting Policies from Replays to Improve MCTS in Real Time Strategy Games</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zuozhi Yang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Santiago Ontañón</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Drexel University</institution>
          ,
          <addr-line>Philadelphia, Pennsylvania 19104</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper we study the topic of integrating supervised learning models into Monte Carlo Tree Search (MCTS) in the context of RTS games. Specifically, we focus on learning a tree policy for MCTS using existing supervised learning algorithms. We evaluate and compare two families of models: Bayesian classifiers and decision tree classifiers. Our results show that in experiments under the same iteration budget for MCTS, the models with higher classification performance also have better gameplay strength when used within MCTS. However, when we constrain the computation budget by time, faster models tend to outperform slower, more accurate models. Surprisingly, the classic C4.5 algorithm stands out in our experiments as the best model, since it has good classification performance and fast classification speed.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Board and computer games have always been popular in the AI research community, since they offer a challenging and rich testbed for both machine learning and search techniques. A prominent example is AlphaGo
        <xref ref-type="bibr" rid="ref25 ref28">(Silver et al. 2016)</xref>
        , which demonstrated that it is possible to deal with large search spaces by integrating machine learning and search. Real-time strategy (RTS) games have become a more active research area, as they offer a greater challenge because of their enormous state space and branching factor, real-time nature, and partial observability. In this paper we study the topic of integrating supervised learning models into Monte Carlo Tree Search (MCTS) in the context of RTS games. Specifically, we focus on learning a tree policy for MCTS using existing supervised learning algorithms. Due to the large state space and branching factor, it is hard for MCTS to obtain a good policy in real time. Thus, learning a reliable policy offline and biasing the tree search towards a more promising part of the search space can significantly improve the performance of MCTS.
      </p>
      <p>
        In this paper we build upon previous work on Informed MCTS, where Bayesian models were trained from data to inform the tree policy
        <xref ref-type="bibr" rid="ref19">(Ontañón 2016)</xref>
        . Moreover, given that RTS games allow a much more limited computation budget between two consecutive game decisions than other games, like board games, the model we build must make fast predictions at test time, in addition to being accurate. In consideration of this time constraint, we evaluate and compare two families of models: Bayesian classifiers and decision tree classifiers. Our results show that in experiments under the same iteration budget for MCTS, the models with higher classification performance also have better gameplay strength when used within MCTS. However, when we constrain the computation budget by time, faster models tend to outperform slower, more accurate models. Specifically, the C4.5 model stands out in our experiments as the best model, since it has good classification performance and fast classification speed.
      </p>
      <p>The rest of this paper is structured as follows: (1) In the background section we introduce RTS games research, the research platform we employ, and work on applying MCTS to RTS games. (2) Then we describe and compare the two algorithm families that we study. (3) After that, we specify the details of our experiments comparing the models in classification performance and in gameplay strength as a tree policy for MCTS. (4) Finally, we discuss the empirical results and lay out possible lines of future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Background</title>
      <p>
        Real-time strategy (RTS) games have been receiving an increased amount of attention, as they are even more challenging than games like Go or Chess in at least three different ways: (1) the combinatorial growth of the branching factor
        <xref ref-type="bibr" rid="ref20 ref30">(Ontañón 2017)</xref>
        , (2) the limited computation budget between actions due to their real-time nature, and (3) the lack of a forward model in most research environments, such as StarCraft. Many research environments and tools, such as TorchCraft
        <xref ref-type="bibr" rid="ref25 ref28">(Synnaeve et al. 2016)</xref>
        , SC2LE
        <xref ref-type="bibr" rid="ref29 ref30">(Vinyals et al. 2017)</xref>
        , µRTS
        <xref ref-type="bibr" rid="ref17 ref18">(Ontañón 2013)</xref>
        , ELF
        <xref ref-type="bibr" rid="ref29 ref30">(Tian et al. 2017)</xref>
        , and Deep RTS
        <xref ref-type="bibr" rid="ref1">(Andersen, Goodwin, and Granmo 2018)</xref>
        have been developed to promote research in the area. In this paper, to stay focused on the problem of interest, we chose µRTS as our experimental domain, as it offers a forward model for game tree search approaches such as minimax or Monte Carlo Tree Search.
      </p>
      <sec id="sec-2-1">
        <title>Real-Time Strategy Games</title>
        <p>RTS is a sub-genre of strategy games where players aim to defeat their opponents (destroying their army and base) by strategically building an economy (gathering resources and building a base), building military power (training units and researching technologies), and controlling those units. The main differences between RTS games and traditional board games are: they are simultaneous-move games (more than one player can issue actions at the same time), they have durative actions (actions are not instantaneous), they are real-time (each player has a very small amount of time to decide the next move), they are partially observable (players can only see the part of the map that has been explored, although in this paper we assume full observability), and they might be non-deterministic.</p>
        <p>
          Furthermore, compared to traditional board games, RTS games have a very large state space and action space at each decision cycle. For example, the branching factor in StarCraft can reach numbers between 30^50 and 30^200
          <xref ref-type="bibr" rid="ref17 ref18">(Ontañón et al. 2013)</xref>
          .
        </p>
        <sec id="sec-2-1-1">
          <title>µRTS</title>
          <p>µRTS (https://github.com/santiontanon/microrts) is a simple RTS game designed for testing AI techniques. µRTS provides the essential features that make RTS games challenging from an AI point of view: simultaneous and durative actions, combinatorial branching factors, and real-time decision making. The game can be configured to be partially observable and non-deterministic, but those settings are turned off for all the experiments presented in this paper. We chose µRTS since, in addition to featuring the above properties, it does so in a very minimalistic way, by defining only four unit types and two building types, all of them occupying one tile, and only one resource type. Additionally, as required by our experiments, µRTS allows maps of arbitrary sizes and initial configurations.</p>
          <p>There is one type of environment unit (Minerals) and six types of units controlled by the players, which are:</p>
        <sec id="sec-2-1-1">
          <title>Base: can train Workers and accumulate resources</title>
        </sec>
        <sec id="sec-2-1-2">
          <title>Barracks: can train attack units</title>
        </sec>
        <sec id="sec-2-1-3">
          <title>Worker: collects resources and construct buildings</title>
        </sec>
        <sec id="sec-2-1-4">
          <title>Light: low power but fast melee unit</title>
        </sec>
        <sec id="sec-2-1-5">
          <title>Heavy: high power but slow melee unit</title>
        </sec>
        <sec id="sec-2-1-6">
          <title>Ranged: long range attack unit</title>
          <p>Additionally, the environment can have walls to block the
movement of units. A example screenshot of game is shown
in Figure 1. The squared units in green are Minerals with
numbers on them indicating the remaining resources. The
units with blue outline belong to player 1 (which we will call
max) and those with red outline belong to player 2 (which
we will call min). The light grey squared units are Bases
with numbers indicating the amount of resources owned by
the player, while the darker grey squared units are the
Barracks. Movable units have round shapes with grey units
being Workers, orange units being Lights, yellow being Heavy
units (now shown in the figure) and blue units being Ranged.
1https://github.com/santiontanon/microrts
"max"
player
units
"min"
player
units</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Monte Carlo Tree Search in RTS Games</title>
        <p>
          Monte Carlo Tree Search
          <xref ref-type="bibr" rid="ref3">(Browne et al. 2012)</xref>
          <xref ref-type="bibr" rid="ref8">(Coulom
2006)</xref>
          is a method for sequential decision making for
domains that can be represented by search trees. It has been a
successful approach to tackle complex games like Go as it
takes random samples in the search space to estimate state
value. However, most of the successful variants of MCTS,
e.g. UCT
          <xref ref-type="bibr" rid="ref9">(Kocsis and Szepesva´ri 2006)</xref>
          , do not scale up well
to RTS games due to the combinatorial growth of the branching factor with respect to the number of units. Sampling techniques for combinatorial branching factors such as Naïve Sampling
          <xref ref-type="bibr" rid="ref17 ref18">(Ontan˜o´n 2013)</xref>
          or LSI
          <xref ref-type="bibr" rid="ref24">(Shleyfman, Komenda,
and Domshlak 2014)</xref>
          were proposed to improve the exploration of MCTS by exploiting combinatorial multi-armed bandits. Inspired by AlphaGo, another approach to address this problem is the use of Naïve Bayes models to learn an action
probability distribution as a tree policy prior to guide the
search
          <xref ref-type="bibr" rid="ref19">(Ontan˜o´n 2016)</xref>
          . Other work to deal with this
problem involves limiting the search space by introducing action
abstractions. For example, instead of searching directly in
the raw unit action space, Portfolio greedy search
          <xref ref-type="bibr" rid="ref6 ref7">(Churchill
and Buro 2013)</xref>
          and Stratified Alpha-Beta Search
          <xref ref-type="bibr" rid="ref15">(Moraes
and Lelis 2017)</xref>
          search in the abstracted action spaces
generated by hard-coded scripts.
        </p>
        <p>
          In this paper, we extend the work on policy distribution learning
          <xref ref-type="bibr" rid="ref19">(Ontañón 2016)</xref>
          by comparing existing machine learning models and applying the trained models to the tree policy of MCTS.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>Naïve Monte Carlo Tree Search</title>
        <p>
          Tree Search (Na¨ıveMCTS)
          <xref ref-type="bibr" rid="ref17 ref18">(Ontan˜o´n 2013)</xref>
          is a variant
of MCTS specifically designed to handle RTS games. For
that purpose, Na¨ıveMCTS can also handle durative and
simultaneous actions. Most importantly, the key feature of
Na¨ıveMCTS is that rather than using standard -greedy or
UCB1, it uses a sampling policy for Combinatorial
Multiarmed Bandits (CMABs) called Na¨ıve Sampling in order
to scale up to the combinatorial branching factors of RTS
games.
        </p>
        <p>Specifically, in our experiments, we use the implementation of Informed NaïveMCTS built into µRTS, which can be used to incorporate machine learning models into the tree policy, as described below.</p>
        <p>
          Informed Tree Policy. MCTS algorithms can be broken
down into the following four stages
          <xref ref-type="bibr" rid="ref3">(Browne et al. 2012)</xref>
          :
1. Selection: Starting at the root node, recursively select child nodes according to a pre-defined tree policy until a leaf node L is reached.
2. Expansion: If the selected node L is not a terminal node, then add a child node C of L to the tree (also using the tree policy).
3. Simulation: Run a simulation from C according to a playout policy until a terminal state is reached.
4. Backpropagation: Update the statistics of the current move sequence with the simulation result.
        </p>
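        <p>To make these four stages concrete, the following minimal Python sketch shows how they compose into a single MCTS iteration. The Node, Game, and tree policy interfaces here (children, visits, value_sum, is_terminal, result, utility, and so on) are our own illustrative assumptions, not the µRTS implementation.</p>
        <preformat>
def mcts_iteration(root, game, tree_policy, playout_policy):
    # 1. Selection: follow the tree policy down to a leaf node L.
    node = root
    while node.children and not game.is_terminal(node.state):
        node = tree_policy.select_child(node)
    # 2. Expansion: if L is not terminal, add a child C (also via the tree policy).
    if not game.is_terminal(node.state):
        action = tree_policy.pick_untried_action(node)
        node = node.add_child(action, game.result(node.state, action))
    # 3. Simulation: run the playout policy from C until a terminal state.
    state = node.state
    while not game.is_terminal(state):
        state = game.result(state, playout_policy(state))
    value = game.utility(state)
    # 4. Backpropagation: update statistics along the selected path.
    while node is not None:
        node.visits += 1
        node.value_sum += value
        node = node.parent
</preformat>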
        <p>
          The tree policy serves as the criterion for child node selection within the search tree and balances exploration and exploitation. It can also be informed by a given prior distribution, biasing the search towards the desired part of the action space. For example, the PUCB algorithm (Predictor+UCB)
          <xref ref-type="bibr" rid="ref22">(Rosin 2011)</xref>
          extends the standard UCB1 strategy commonly used as the tree policy with an existing distribution. A variation of PUCB was used in AlphaGo, and Silver et al.
          <xref ref-type="bibr" rid="ref27">(Silver, Sutton, and Müller 2012)</xref>
          used a temporal-difference method in computer Go to learn a value function to inform the tree policy. This prior distribution can be trained offline, e.g. from existing game replays. AlphaGo, for example, trained a neural network using human expert plays as supervision, and then refined it using reinforcement learning. Unlike the game of Go, however, RTS games are real-time, which means the computational budget per action is much smaller than in Go. Therefore, large neural network models become impractical, since they are currently too slow for the available computational budget. Thus, in this paper, we select a collection of statistical learning algorithms that are potentially fast at classification time as our subjects of study.
        </p>
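        <p>As an illustration of how such a prior can enter the selection step, the sketch below implements a PUCB-style rule in Python: the exploration term of each child is weighted by the learned probability of its action. The exact formula and constants in Rosin (2011) and in µRTS differ, so treat this as one common simplified form; the node fields are the same illustrative ones as above.</p>
        <preformat>
import math

def pucb_select(children, prior, c=1.0):
    """Pick the child maximizing mean value plus prior-weighted exploration.

    children: nodes with .action, .visits and .value_sum fields
    prior:    dict mapping action to probability from a learned model
    """
    total_visits = sum(child.visits for child in children) + 1

    def score(child):
        mean = child.value_sum / child.visits if child.visits else 0.0
        bonus = c * prior[child.action] * math.sqrt(
            math.log(total_visits) / (child.visits + 1))
        return mean + bonus

    return max(children, key=score)
</preformat>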
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Methods for Tree Policy Learning</title>
      <p>We use supervised learning to model and predict the move probabilities of scripted bots, and bias the tree search to mimic, and hopefully improve upon, those bots with the help of tree search. A supervised learning model p_u(a_i | s) takes a feature vector s representing the game state from the point of view of a unit u, and outputs a probability distribution over all n possible actions (a_1, a_2, ..., a_n) the unit can perform. The model is trained beforehand, and a prediction is made each time the tree policy needs to be used. Thus, the learning algorithm used to model the problem must be fast at classification time. Also, in order to increase the sampling quality, the classifier should provide good class probability estimates. Therefore, we selected a relatively small set of categorical features (eight features) to represent the game states. Moreover, two types of classification algorithms that are known to be suitable for categorical data and fast at classification time are studied in this paper: (semi-)Naïve Bayes and (ensembles of) decision trees.</p>
      <sec id="sec-3-1">
        <title>Naïve Bayes and Semi-Naïve Bayes</title>
        <p>The Naïve Bayes classifier is a simple probabilistic classifier that makes the strong assumption that all the features are independent given the class. This allows us to write the joint density as the product of the per-feature densities:</p>
        <p>p(\mathbf{x} \mid y = c, \theta) = \prod_{i=1}^{D} p(x_i \mid y = c, \theta_{ic})</p>
        <p>Naïve Bayes classifiers are highly scalable and can be trained and tested in linear time. However, the assumption of independent features is violated in most situations. Thus, the research community has developed many semi-Naïve Bayes algorithms that weaken the independence assumption.</p>
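        <p>For categorical features such as ours, this model is only a few lines of code. The sketch below estimates the class priors and per-feature conditionals from counts (the Laplace smoothing constant is our illustrative choice) and evaluates the product in log space to avoid numerical underflow.</p>
        <preformat>
import math
from collections import Counter, defaultdict

def train_naive_bayes(X, y, laplace=1.0):
    """X: list of categorical feature tuples; y: list of class labels."""
    class_counts = Counter(y)
    feature_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    for features, c in zip(X, y):
        for i, v in enumerate(features):
            feature_counts[(i, c)][v] += 1

    def predict_proba(features):
        log_posterior = {}
        for c, n_c in class_counts.items():
            lp = math.log(n_c / len(y))
            for i, v in enumerate(features):
                counts = feature_counts[(i, c)]
                num_values = len(counts) + 1  # extra slot for unseen values
                lp += math.log((counts[v] + laplace) / (n_c + laplace * num_values))
            log_posterior[c] = lp
        top = max(log_posterior.values())     # normalize in log space
        unnorm = {c: math.exp(lp - top) for c, lp in log_posterior.items()}
        z = sum(unnorm.values())
        return {c: p / z for c, p in unnorm.items()}

    return predict_proba
</preformat>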
        <p>
          The semi-Naïve Bayes algorithm tested in our paper is called averaged one-dependence estimators (A1DE)
          <xref ref-type="bibr" rid="ref31">(Webb, Boughton, and Wang 2005)</xref>
          . Sahami
          <xref ref-type="bibr" rid="ref23">(Sahami 1996)</xref>
          introduced n-dependence estimators, where the probability of each feature is conditioned on the class and n other features. The averaged one-dependence estimators work as follows: for each feature x, build a classifier that assumes the other features are independent given the label and feature x. The model makes predictions by averaging the outcomes of all these classifiers. Therefore, A1DE can also be viewed as an ensemble method over Naïve Bayes classifiers. We also test a more relaxed variation of A1DE called A2DE, which builds estimators conditioned on each pair of features.
        </p>
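        <p>A hedged sketch of A1DE in the same style follows: one Naïve Bayes-like estimator per parent feature, with predictions averaged over all estimators. Real implementations (e.g. in Weka) add details such as frequency thresholds on parent values, which we omit here.</p>
        <preformat>
from collections import Counter, defaultdict

def train_a1de(X, y, laplace=1.0):
    """Averaged one-dependence estimators over categorical data (sketch)."""
    n = len(y)
    parent_counts = Counter()            # (class, j, x_j) occurrence counts
    child_counts = defaultdict(Counter)  # (class, j, x_j, i) -> counts of x_i
    classes = set(y)
    for features, c in zip(X, y):
        for j, vj in enumerate(features):
            parent_counts[(c, j, vj)] += 1
            for i, vi in enumerate(features):
                if i != j:
                    child_counts[(c, j, vj, i)][vi] += 1

    def predict_proba(features):
        scores = {}
        for c in classes:
            total = 0.0
            for j, vj in enumerate(features):  # one estimator per parent feature
                p = (parent_counts[(c, j, vj)] + laplace) / (n + laplace)
                for i, vi in enumerate(features):
                    if i == j:
                        continue
                    counts = child_counts[(c, j, vj, i)]
                    p *= (counts[vi] + laplace) / (
                        parent_counts[(c, j, vj)] + laplace * (len(counts) + 1))
                total += p                     # averaging is a constant factor
            scores[c] = total
        z = sum(scores.values()) or 1.0
        return {c: s / z for c, s in scores.items()}

    return predict_proba
</preformat>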
      </sec>
      <sec id="sec-3-2">
        <title>Decision Trees and Ensembles</title>
        <p>
          The C4.5 classifier
          <xref ref-type="bibr" rid="ref21">(Quinlan 2014)</xref>
          iteratively builds a decision tree by splitting the instances on the feature with the highest information gain (reduction of entropy) at each internal node, and assigns class predictions at the leaf nodes. The algorithm then prunes the tree using a confidence threshold of 25%. We can estimate class probabilities naturally from the label frequencies at the leaves. However, this estimate can be skewed towards 0 or 1, since the leaves are mostly dominated by one class. In our experiments, Laplace smoothing is applied in order to produce more accurate class probability estimates.
        </p>
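        <p>The Laplace correction mentioned above is simple additive smoothing of the leaf's label frequencies, as in this small helper (illustrative, not Weka's exact code):</p>
        <preformat>
def leaf_class_proba(leaf_counts, classes):
    """Laplace-smoothed class probabilities at a decision-tree leaf.

    leaf_counts: dict mapping class to training instances at this leaf.
    E.g. with counts {'a': 9, 'b': 0} the raw estimate (1.0, 0.0)
    becomes (10/11, 1/11).
    """
    total = sum(leaf_counts.values())
    k = len(classes)
    return {c: (leaf_counts.get(c, 0) + 1) / (total + k) for c in classes}
</preformat>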
        <p>
          Bootstrap aggregating (bagging)
          <xref ref-type="bibr" rid="ref2">(Breiman 1996)</xref>
          is an ensemble method that helps to improve accuracy and stability by combining the results of models trained on multiple training sets resampled with replacement. In our experiments, we apply 10 iterations of bagging to C4.5.
        </p>
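        <p>A minimal version of this procedure, assuming any of the learners above as the base learner (our own interface, mapping training data to a predict_proba function), looks as follows:</p>
        <preformat>
import random

def train_bagging(X, y, base_learner, iterations=10, seed=0):
    """Train base models on bootstrap resamples and average their outputs."""
    rng = random.Random(seed)
    n = len(y)
    models = []
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        models.append(base_learner([X[i] for i in idx], [y[i] for i in idx]))

    def predict_proba(features):
        dists = [m(features) for m in models]
        labels = {c for d in dists for c in d}
        return {c: sum(d.get(c, 0.0) for d in dists) / len(dists) for c in labels}

    return predict_proba
</preformat>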
        <p>
          Random forest
          <xref ref-type="bibr" rid="ref11 ref12">(Liaw, Wiener, and others 2002)</xref>
          is a meta-classifier that fits a collection of random tree classifiers on resamples of the dataset, and thus improves predictive accuracy and controls overfitting by averaging. The internal random trees each select a subset of features to build the model; in our case, each random tree selects a subset of log(n) + 1 features, where n is the number of features. In our experiments, 100 random trees are built by the random forest algorithm. The class probability estimate is generated from the proportion of votes of the trees in the ensemble.
        </p>
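        <p>Two details from this paragraph are easy to state in code: the per-split feature subset of size log(n) + 1 (we assume a base-2 logarithm, as in Weka's RandomTree) and the vote-proportion probability estimate.</p>
        <preformat>
import math
import random
from collections import Counter

def random_feature_subset(num_features, rng=random):
    """Features considered at each split: floor(log2(n)) + 1 of the n features."""
    k = int(math.log2(num_features)) + 1
    return rng.sample(range(num_features), k)

def forest_class_proba(trees, features):
    """Class probabilities as the proportion of trees voting for each class.

    trees: list of callables, each returning a single class label.
    """
    votes = Counter(tree(features) for tree in trees)
    total = sum(votes.values())
    return {c: v / total for c, v in votes.items()}
</preformat>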
        <p>In practice, we observed that learning a different model for each unit type in the game (workers, bases, barracks, etc. in µRTS) resulted in better estimates of the probabilities. So, for the experiments reported in the remainder of this paper, we generated the probability models in the following way: (1) For each unit type, we train a model using the subset of the training data corresponding to that unit type. (2) If this subset is empty, we train with the whole training set instead (i.e., if we have no training data to model the way a specific unit type is controlled, we just train a model with the whole training set for that unit type, hoping it will reflect what the script we are collecting data from would have done).</p>
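        <p>A sketch of this per-unit-type training scheme, including the fallback described in step (2), follows; the function names and the (unit_type, features, action) tuple layout are ours.</p>
        <preformat>
def train_per_unit_type(dataset, unit_types, fit):
    """dataset: list of (unit_type, features, action) tuples.
    fit: any learner above, mapping (X, y) to a predict_proba function."""
    models = {}
    fallback = None
    for unit_type in unit_types:
        rows = [(x, a) for t, x, a in dataset if t == unit_type]
        if rows:
            X, y = zip(*rows)
            models[unit_type] = fit(list(X), list(y))
        else:
            # No data for this unit type: fall back to a model trained
            # on the whole training set, as described in step (2).
            if fallback is None:
                X, y = zip(*[(x, a) for _, x, a in dataset])
                fallback = fit(list(X), list(y))
            models[unit_type] = fallback
    return models
</preformat>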
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments and Results</title>
      <sec id="sec-4-1">
        <title>Bots for Data Collection</title>
        <p>
          We generated training data in the following way: we collected data from a round-robin tournament consisting of six bots on 8 different maps, four of which are 8×8 maps and the other four 12×12. Four of the bots are built-in hard-coded bots: WorkerRush, LightRush, HeavyRush, and RangedRush. The other two were Monte Carlo search bots, LSI and NaïveMCTS, whose computational budgets are configurable. We ran the experiments for the LSI and NaïveMCTS bots four times, with computation budgets of 500, 1000, 2000, and 5000 playouts per game frame respectively. Each bot is described below:
        </p>
        <list list-type="bullet">
          <list-item>
            <p>Rush bots: hard-coded deterministic bots that implement a rush strategy, constantly producing one type of unit and sending it to attack the opponent's base. Specifically, we used the WorkerRush, LightRush, RangedRush, and HeavyRush bots of µRTS, which implement rushes with worker, light, ranged, and heavy units respectively.</p>
          </list-item>
          <list-item>
            <p>
              LSI: Monte Carlo search with Linear Side Information
              <xref ref-type="bibr" rid="ref24">(Shleyfman, Komenda, and Domshlak 2014)</xref>
              .
            </p>
          </list-item>
          <list-item>
            <p>
              NaïveMCTS: an MCTS algorithm with a tree policy specifically designed for games with combinatorial branching factors. The tree policy exploits Combinatorial Multi-Armed Bandits
              <xref ref-type="bibr" rid="ref20 ref30">(Ontañón 2017)</xref>
              to handle the combinatorial explosion of possible moves.
            </p>
          </list-item>
        </list>
        <p>
          The feature vector x(u, s) used to represent each game state from the point of view of a unit u contains only eight features: the number of resources available to the player, the cardinal direction (north, east, south, west) toward which most friendly units lie, the cardinal direction toward which most enemy units lie, whether we have a barracks or not, and four features indicating the type of unit in the cell two positions north, east, south, or west (or whether that cell is empty or a wall). Adding more features could certainly improve performance, which will be part of our future work. We employed these 8 features, as used in previous work
          <xref ref-type="bibr" rid="ref19">(Ontañón 2016)</xref>
          , for comparison purposes.
        </p>
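        <p>For concreteness, the sketch below assembles these eight categorical features for one unit. The state accessors (player_resources, direction_of_most, has_barracks, cell_type) are hypothetical stand-ins, not the real µRTS API, and the resource binning is our own choice, since the paper does not specify one.</p>
        <preformat>
def discretize(n, cap=5):
    # coarse binning so the resource count stays categorical (illustrative)
    return min(n, cap)

def unit_features(state, unit):
    """The eight categorical features described above, for one unit."""
    return (
        discretize(state.player_resources(unit.player)),  # available resources
        state.direction_of_most(unit, friendly=True),     # N/E/S/W
        state.direction_of_most(unit, friendly=False),
        state.has_barracks(unit.player),                  # boolean
        state.cell_type(unit.x, unit.y - 2),              # two cells north
        state.cell_type(unit.x + 2, unit.y),              # two cells east
        state.cell_type(unit.x, unit.y + 2),              # two cells south
        state.cell_type(unit.x - 2, unit.y),              # two cells west
    )
</preformat>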
        <p>The 12 datasets collected are described below:</p>
        <list list-type="bullet">
          <list-item>
            <p>I_WR: consisting of all the unit-actions of WorkerRush (32679 instances)</p>
          </list-item>
          <list-item>
            <p>I_LR: consisting of all the unit-actions of LightRush (30139 instances)</p>
          </list-item>
          <list-item>
            <p>I_HR: consisting of all the unit-actions of HeavyRush (24385 instances)</p>
          </list-item>
          <list-item>
            <p>I_RR: consisting of all the unit-actions of RangedRush (31279 instances)</p>
          </list-item>
          <list-item>
            <p>I_LSI^500, I_LSI^1000, I_LSI^2000, I_LSI^5000: consisting of all the unit-actions of LSI under computing budgets of 500, 1000, 2000, and 5000 respectively (85918, 88816, 85611, and 68236 instances respectively)</p>
          </list-item>
          <list-item>
            <p>I_NMCTS^500, I_NMCTS^1000, I_NMCTS^2000, I_NMCTS^5000: consisting of all the unit-actions of NaïveMCTS under computing budgets of 500, 1000, 2000, and 5000 respectively (73249, 83925, 78896, and 68158 instances respectively)</p>
          </list-item>
        </list>
      </sec>
      <sec id="sec-4-2">
        <title>Empirical Comparison of Models</title>
        <p>For all six models, we used the Weka implementations (the open-source implementation of C4.5 in Weka is called J48). In this section we compare the six machine learning models described before along the following dimensions: action prediction (accuracy and probability estimation), classification speed, gameplay strength as tree policy under an iteration budget, and gameplay strength as tree policy under a time budget.</p>
        <p>Unit Action Classification. First we evaluate the models on classification and class probability estimation performance. Table 1 shows the predictive accuracy and expected log-likelihood of the six models on each of the 12 datasets using 10-fold cross-validation. There is a total of 69 different actions a unit can perform in µRTS, but each individual unit can perform only between 5 and 33 unit-actions. We can see that the models predict the behavior of the four hard-coded bots better than that of the Monte Carlo search bots. But as the computing budget increases, prediction performance also improves for the Monte Carlo search bots. This indicates that the Monte Carlo search bots converge to more stable strategies as the computing budget increases.</p>
        <p>We also report the expected log-likelihood of the actions in the dataset given the model. This is a better metric to consider than classification accuracy, given that we want to estimate the probability distribution over the actions, not just predict the most likely action. The best possible log-likelihood would be 0. We can observe that for the Bayesian models, the more we relax the independence assumption, the better the accuracy and the log-likelihood. However, for the decision trees, bagging and random forest outperformed J48 in accuracy but not in log-likelihood.</p>
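        <p>Concretely (our formalization of the metric just described), for a dataset of N recorded unit-actions the expected log-likelihood of a model p is \frac{1}{N} \sum_{n=1}^{N} \log p(a_n \mid s_n), where a_n is the action the bot took in the state with feature vector s_n; a model that assigns probability 1 to every recorded action attains the maximum value of 0.</p>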
        <p>Table 2 reports the speed at classification time, where k is the number of classes, n is the number of features, and m is the number of iterations and of trees built for bagging and random forest respectively. The fastest models are Naïve Bayes and J48. Classification speed is very important in the time-constrained MCTS experiments, as we will show later.</p>
        <p>Win rates under Iteration Budget. We now evaluate our models as the prior distribution for the tree policy. We use the trained models in the tree policy as the sampling prior for the Naïve Sampling process, and we used the RndBiased bot as the default policy. In this experiment, the NaïveMCTS bots coupled with models trained from the different datasets are tested against the eight baseline bots that were used for training trace generation. The individual results for each dataset are reported in Tables 3-8. Figure 2 shows an aggregated bar chart view of the average win rates grouped by training set and model. The overall average performance of every model is reported in Table 9.</p>
        <p>The best model overall is random forest, with a 0.865 win rate against the baselines; it is also the best performing model in three out of the six experiments trained separately for each dataset. A general observation is that the decision tree models, especially the tree ensembles, have better gameplay performance than the Bayesian models when the training dataset comes from one of the rush bots, while the Bayesian models outperformed the decision tree models on the more complex datasets, I_LSI^5000 and I_NMCTS^5000. This seems to be because decision trees are better at predicting the deterministic behavior of the rush bots, but struggle to estimate the probability distribution over the moves of the more stochastic search bots.</p>
        <p>Win rates under Time Budget. Due to the nature of RTS games, the computational budget is normally constrained by time. In that sense, complex models that work well under an iteration budget might lose their edge, because they run fewer iterations than simpler but faster models. Thus, we ran the gameplay strength experiments again against seven of the baseline bots (the current version of LSI does not support time budgets), with a computational budget of 50 milliseconds per cycle. The results are reported in Table 10.</p>
        <p>We can observe that model performance is significantly affected by the time budget. Although the more sophisticated models like A1DE, bagging, and random forest perform better in the experiments with an iteration budget, they do not perform as well in the experiments with a time budget. The reason is that those models are significantly slower than Naïve Bayes and C4.5, as shown in Table 2. The C4.5 model stands out as the best performing model (0.853 overall win rate) in the experiments with time constraints, due to its speed and its better class probability estimates compared to the Naïve Bayes model.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Conclusions and Future Work</title>
      <p>The long-term goal of this research is to study the dynamics of the integration of machine learning models and MCTS methods, and the factors that affect the performance of their interaction in the context of RTS games. Specifically, this paper presents a comparative study of two types of machine learning models, Bayesian models and decision tree models, integrated into MCTS as part of the tree policy. We compare the models in terms of classification performance, classification speed, and gameplay strength as part of the tree policy under iteration and time budgets.</p>
      <p>Our empirical results showed that both classification accuracy and speed play a significant role when integrating machine learning models into MCTS. Specifically, a random forest model has the best classification performance and the best gameplay performance as tree policy when the models are compared under iteration constraints. However, in experiments under time constraints, the random forest model performs much worse than faster models. Instead, the C4.5 model, which has logarithmic complexity at classification time, performs best.</p>
      <p>
        There are a number of future work directions we intend to pursue. For example, the models in this paper ignore the interdependence of actions (when selecting an action for a unit, knowing which action was selected for another unit in the same game state might be relevant). Training a model that takes action interdependence into account could potentially result in better performing agents. Another possible improvement can come from models learned online using techniques like contextual bandit algorithms
        <xref ref-type="bibr" rid="ref13 ref5">(Chu et al.
2011; Mandai and Kaneko 2016)</xref>
        .
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [Andersen, Goodwin, and Granmo 2018] Andersen,
          <string-name>
            <given-names>P.</given-names>
            -A.;
            <surname>Goodwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            ; and
            <surname>Granmo</surname>
          </string-name>
          ,
          <string-name>
            <surname>O.-C.</surname>
          </string-name>
          <year>2018</year>
          .
          <article-title>Deep rts: A game environment for deep reinforcement learning in real-time strategy games</article-title>
          .
          <source>In 2018 IEEE Conference on Computational Intelligence and Games (CIG)</source>
          ,
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [Breiman 1996]
          <string-name>
            <surname>Breiman</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <year>1996</year>
          .
          <article-title>Bagging predictors</article-title>
          .
          <source>Machine learning 24(2)</source>
          :
          <fpage>123</fpage>
          -
          <lpage>140</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [Browne et al. 2012]
          <string-name>
            <surname>Browne</surname>
            ,
            <given-names>C. B.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Powley</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Whitehouse</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lucas</surname>
            ,
            <given-names>S. M.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Cowling</surname>
            ,
            <given-names>P. I.</given-names>
          </string-name>
          ; Rohlfshagen,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Tavener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ;
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ;
            <surname>Samothrakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            ; and
            <surname>Colton</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>A survey of monte carlo tree search methods</article-title>
          .
          <source>IEEE Transactions on Computational Intelligence and AI in games 4</source>
          (
          <issue>1</issue>
          ):
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [Chu et al. 2011]
          <string-name>
            <surname>Chu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Reyzin</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ; and Schapire,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2011</year>
          .
          <article-title>Contextual bandits with linear payoff functions</article-title>
          .
          <source>In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics</source>
          ,
          <volume>208</volume>
          -
          <fpage>214</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [Churchill and Buro 2013]
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , and
          <string-name>
            <surname>Buro</surname>
          </string-name>
          , M.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          2013.
          <article-title>Portfolio greedy search and simulation for large-scale combat in starcraft</article-title>
          .
          <source>In Computational Intelligence in Games (CIG)</source>
          ,
          <source>2013 IEEE Conference on, 1-8</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [Coulom 2006] Coulom,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <year>2006</year>
          .
          <article-title>Efficient selectivity and backup operators in monte-carlo tree search</article-title>
          .
          <source>In International conference on computers and games</source>
          ,
          <volume>72</volume>
          -
          <fpage>83</fpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [Kocsis and Szepesvári 2006]
          <string-name>
            <surname>Kocsis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          , and Szepesvári, C.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          2006.
          <article-title>Bandit based monte-carlo planning</article-title>
          .
          <source>In European conference on machine learning</source>
          ,
          <fpage>282</fpage>
          -
          <lpage>293</lpage>
          . Springer.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [Liaw, Wiener, and others 2002]
          <string-name>
            <surname>Liaw</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Wiener</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; et al.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          2002.
          <article-title>Classification and regression by randomforest</article-title>
          .
          <source>R news 2</source>
          <volume>(3)</volume>
          :
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [Mandai and Kaneko 2016] Mandai,
          <string-name>
            <given-names>Y.</given-names>
            , and
            <surname>Kaneko</surname>
          </string-name>
          , T.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          2016.
          <article-title>Linucb applied to monte carlo tree search</article-title>
          .
          <source>Theoretical Computer Science</source>
          <volume>644</volume>
          :
          <fpage>114</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [Moraes and Lelis 2017] Moraes,
          <string-name>
            <given-names>R. O.</given-names>
            , and
            <surname>Lelis</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. H.</surname>
          </string-name>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          2017.
          <article-title>Asymmetric action abstractions for multi-unit control in adversarial real-time games</article-title>
          .
          <source>arXiv preprint arXiv:1711</source>
          .
          <fpage>08101</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [Ontañón et al. 2013]
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Synnaeve</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Uriarte</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Richoux</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Churchill</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ; and Preuss,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2013</year>
          .
          <article-title>A survey of real-time strategy game ai research and competition in starcraft</article-title>
          .
          <source>IEEE Transactions on Computational Intelligence and AI in games 5</source>
          (
          <issue>4</issue>
          ):
          <fpage>293</fpage>
          -
          <lpage>311</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [Ontañón 2013]
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2013</year>
          .
          <article-title>The combinatorial multiarmed bandit problem and its application to real-time strategy games</article-title>
          .
          <source>In Ninth Artificial Intelligence and Interactive Digital Entertainment Conference.</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [Ontañón 2016]
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2016</year>
          .
          <article-title>Informed monte carlo tree search for real-time strategy games</article-title>
          .
          <source>In Computational Intelligence and Games (CIG)</source>
          ,
          <source>2016 IEEE Conference on, 1-8</source>
          . IEEE.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [Ontañón 2017]
          <string-name>
            <surname>Ontañón</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <year>2017</year>
          .
          <article-title>Combinatorial multiarmed bandits for real-time strategy games</article-title>
          .
          <source>Journal of Artificial Intelligence Research</source>
          <volume>58</volume>
          :
          <fpage>665</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [Quinlan 2014]
          <string-name>
            <surname>Quinlan</surname>
            ,
            <given-names>J. R.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>C4.5: programs for machine learning</article-title>
          .
          <source>Elsevier.</source>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [Rosin 2011] Rosin,
          <string-name>
            <surname>C. D.</surname>
          </string-name>
          <year>2011</year>
          .
          <article-title>Multi-armed bandits with episode context</article-title>
          .
          <source>Annals of Mathematics and Artificial Intelligence</source>
          <volume>61</volume>
          (
          <issue>3</issue>
          ):
          <fpage>203</fpage>
          -
          <lpage>230</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [Sahami 1996] Sahami,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>1996</year>
          .
          <article-title>Learning limited dependence bayesian classifiers</article-title>
          .
          <source>In KDD</source>
          , volume
          <volume>96</volume>
          ,
          <fpage>335</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [Shleyfman, Komenda, and Domshlak 2014]
          <string-name>
            <surname>Shleyfman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Komenda</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ; and
          <string-name>
            <surname>Domshlak</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <year>2014</year>
          .
          <article-title>On combinatorial actions and cmabs with linear side information</article-title>
          .
          <source>In ECAI</source>
          ,
          <fpage>825</fpage>
          -
          <lpage>830</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [Silver et al. 2016]
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Maddison</surname>
            ,
            <given-names>C. J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Guez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sifre</surname>
          </string-name>
          , L.; Van Den Driessche, G.;
          <string-name>
            <surname>Schrittwieser</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Antonoglou</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Panneershelvam</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Lanctot</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ; et al.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          2016.
          <article-title>Mastering the game of go with deep neural networks and tree search</article-title>
          .
          <source>nature</source>
          <volume>529</volume>
          (
          <issue>7587</issue>
          ):
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [Silver, Sutton, and Müller 2012]
          <string-name>
            <surname>Silver</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Sutton</surname>
            ,
            <given-names>R. S.</given-names>
          </string-name>
          ; and Müller,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <year>2012</year>
          .
          <article-title>Temporal-difference search in computer go</article-title>
          .
          <source>Machine learning 87(2)</source>
          :
          <fpage>183</fpage>
          -
          <lpage>219</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [Synnaeve et al. 2016] Synnaeve,
          <string-name>
            <surname>G.</surname>
          </string-name>
          ; Nardelli,
          <string-name>
            <given-names>N.</given-names>
            ;
            <surname>Auvolat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ;
            <surname>Chintala</surname>
          </string-name>
          ,
          <string-name>
            <surname>S.</surname>
          </string-name>
          ; Lacroix,
          <string-name>
            <given-names>T.</given-names>
            ;
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            ;
            <surname>Richoux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ; and
            <surname>Usunier</surname>
          </string-name>
          ,
          <string-name>
            <surname>N.</surname>
          </string-name>
          <year>2016</year>
          .
          <article-title>Torchcraft: a library for machine learning research on real-time strategy games</article-title>
          .
          <source>arXiv preprint arXiv:1611</source>
          .
          <fpage>00625</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [Tian et al. 2017] Tian,
          <string-name>
            <given-names>Y.</given-names>
            ;
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            ;
            <surname>Shang</surname>
          </string-name>
          ,
          <string-name>
            <surname>W.</surname>
          </string-name>
          ; Wu,
          <string-name>
            <given-names>Y.</given-names>
            ; and
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <surname>C. L.</surname>
          </string-name>
          <year>2017</year>
          .
          <article-title>Elf: An extensive, lightweight and flexible research platform for real-time strategy games</article-title>
          .
          <source>Advances in Neural Information Processing Systems (NIPS).</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [Vinyals et al. 2017]
          <string-name>
            <surname>Vinyals</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Ewalds</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ;
          <string-name>
            <surname>Bartunov</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ; Georgiev,
          <string-name>
            <given-names>P.</given-names>
            ;
            <surname>Vezhnevets</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. S.</surname>
          </string-name>
          ; Yeo,
          <string-name>
            <given-names>M.</given-names>
            ;
            <surname>Makhzani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            ; Küttler, H.;
            <surname>Agapiou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            ;
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.</surname>
          </string-name>
          ; et al.
          <year>2017</year>
          .
          <article-title>Starcraft ii: a new challenge for reinforcement learning</article-title>
          .
          <source>arXiv preprint arXiv:1708</source>
          .
          <fpage>04782</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [Webb, Boughton, and Wang 2005] Webb,
          <string-name>
            <surname>G. I.</surname>
          </string-name>
          ; Boughton,
          <string-name>
            <given-names>J. R.</given-names>
            ; and
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Z.</surname>
          </string-name>
          <year>2005</year>
          .
          <article-title>Not so naive bayes: aggregating one-dependence estimators</article-title>
          .
          <source>Machine learning 58(1)</source>
          :
          <fpage>5</fpage>
          -
          <lpage>24</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>