Bandit-Based Policy Optimization for Monte Carlo Tree Search in RTS Games

Zuozhi Yang (1) and Santiago Ontañón (1,2)
(1) Drexel University, Philadelphia, USA
(2) Google AI, Mountain View, USA
zy337@drexel.edu, santiontanon@google.com

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Monte Carlo Tree Search has been successfully applied to complex domains such as computer Go. However, despite its success in building game-playing agents, there are still many questions to be answered regarding the general principles used to design or learn its playout policy, or the interaction between tree policy and playout policy. Many systems, such as AlphaGo, use a policy optimized to mimic human experts as the playout policy of MCTS. In our recent work, we have shown that strong gameplay policies do not necessarily make the best playout policies. In this paper, we take a step further and use bandit algorithms to optimize stochastic policies as gameplay policies, tree policies, and playout policies for MCTS in the context of RTS games. Our results show that strong playout policies do not need to be strong gameplay policies, and that policies that maximize MCTS performance as playout policies are actually weak in terms of gameplay strength. We also found that optimizing the tree policy directly has an edge over optimizing for gameplay strength. Finally, we show that jointly optimizing the tree policy and the playout policy can be beneficial to overall performance compared to optimizing them separately.

Introduction

Monte Carlo Tree Search (MCTS) tends to outperform systematic search in domains with large branching factors. The most prominent success of MCTS is in the domain of computer Go, where an agent, AlphaGo, built using a combination of MCTS and neural networks, achieved superhuman performance (Silver et al. 2016). In AlphaGo, a policy optimized for gameplay strength is used as the playout policy of MCTS. However, previous work has shown that having good gameplay strength is not a sufficient condition for being a good playout policy (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016). Motivated by the question of what makes a good playout policy, in this paper we empirically study the effect of optimizing playout policies with different objectives for MCTS in the domain of real-time strategy (RTS) games.

In almost all variations of MCTS, playout policies, also called simulation policies, are used to select actions for both players during the forward simulation phase of the search process. Since the quality of the playout policy has a great impact on the overall performance of MCTS, previous work has covered various methods to generate these policies, such as handcrafted patterns (Munos and Teytaud 2006), supervised learning (Coulom 2007), reinforcement learning (Gelly and Silver 2007), simulation balancing (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016), and online adaptation (Silver, Sutton, and Müller 2012; Baier and Drake 2010). However, there is little generalizable understanding about how to design or learn good playout policies in systematic ways. Optimizing directly for the gameplay strength of the playout policy often yields decreased performance (Gelly and Silver 2007) (an effect we also observed in preliminary experiments, and which partially motivated this work). In recent Go research, playout policies are sometimes abandoned and replaced by refined evaluation functions (Silver et al. 2017b; 2017a). This paper extends our previous work (Yang and Ontañón 2020), where we showed that policies that are weak at gameplay can still be strong playout policies.

Specifically, in this paper we evaluate the difference in behavior of game-playing policies when optimized for gameplay strength, for playout policy performance, and as tree policies. Since our goal is just to understand what makes a good playout or tree policy, we employ very simple policies, and use bandit algorithms for the optimization process. µRTS (https://github.com/santiontanon/microrts) is used as the testbed, as it offers a minimalistic yet complete RTS game environment and a collection of MCTS implementations. We optimize for different objectives: 1) the win rate of the policy directly, and 2) the win rate of an MCTS agent when using the policy as the playout or tree policy.

The rest of the paper is structured as follows. First, we provide background on RTS games, MCTS, and policy optimization. Then we describe the baseline, and our approach for optimizing gameplay policies, tree policies, and playout policies, as well as the joint optimization of tree policy and playout policy. We show visualizations of the distributions of the trained policies, then compare them with each other and with baseline policies. Finally, we draw conclusions and discuss lines of future work.
Background

Real-time strategy (RTS) is a sub-genre of strategy games where players aim to defeat their opponents (destroying their army and base) by strategically building an economy (gathering resources and building a base), military power (training units and researching technologies), and controlling those units. The main differences between RTS games and traditional board games are: they are simultaneous-move games (more than one player can issue actions at the same time), they have durative actions (actions are not instantaneous), they are real-time (each player has a very small amount of time to decide the next move), they are partially observable (players can only see the part of the map that has been explored, although in this paper we assume full observability), and they might be non-deterministic.

RTS games have been receiving an increased amount of attention (Ontañón et al. 2013) as they are more challenging than games like Go or Chess in at least three different ways: (1) the combinatorial growth of the branching factor (Ontañón 2017), (2) the limited computation budget between actions due to the real-time nature, and (3) the lack of a forward model in most research environments, such as StarCraft. Specifically, in this paper we chose µRTS as our experimental domain, as it offers a forward model for the application of Monte Carlo Tree Search as well as existing implementations of MCTS and stochastic policies for optimization.

µRTS is a simple RTS game designed for testing AI techniques. µRTS provides the essential features that make RTS games challenging from an AI point of view: simultaneous and durative actions, combinatorial branching factors, and real-time decision making. The game can be configured to be partially observable and non-deterministic, but those settings are turned off for all the experiments presented in this paper. We chose µRTS since, in addition to featuring the above properties, it does so in a very minimalistic way, by defining only four unit types and two building types, all of them occupying one tile, and using only a single resource type. Additionally, as required by our experiments, µRTS allows maps of arbitrary sizes and initial configurations.

There is one type of environment unit (minerals) and six types of units controlled by players (bases, barracks, workers, and light, heavy and ranged military units). Additionally, the environment can have walls to block the movement of units. An example screenshot of the game is shown in Figure 1. The squared units in green are Minerals, with numbers on them indicating the remaining resources. The units with blue outline belong to player 1 and those with red outline belong to player 2. The light grey squared units are Bases, with numbers indicating the amount of resources owned by the player, while the darker grey squared units are the Barracks.

[Figure 1: A screenshot of µRTS. The labels in the figure mark the "max" player's units and the "min" player's units.]

Monte Carlo Tree Search in RTS Games

Monte Carlo Tree Search (Browne et al. 2012; Coulom 2006) is a method for sequential decision making in domains that can be represented by search trees. It has been a successful approach to tackle complex games like Go, as it takes random samples in the search space to estimate state values. Most of the classic tree policies of MCTS, e.g. UCT (Kocsis and Szepesvári 2006), do not scale up well to RTS games due to the combinatorial growth of the branching factor with respect to the number of units. Sampling techniques for combinatorial branching factors such as Naïve Sampling (Ontañón 2017) or LSI (Shleyfman, Komenda, and Domshlak 2014) were proposed to improve the exploration of MCTS by exploiting combinatorial multi-armed bandits (CMABs). There have been many other enhancement techniques for the tree policy, but since our focus is on the playout (a.k.a. simulation) policy, we employ MCTS with Naïve Sampling in this paper for simplicity (NaïveMCTS).
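To make the CMAB idea concrete, the following is a minimal Python sketch of a Naïve-Sampling-style combinatorial bandit in the spirit of (Ontañón 2017): one local bandit per action component plus a global bandit over the full combinations tried so far. It is an illustration only, not the NaïveMCTS implementation shipped with µRTS, and the parameter names (epsilon_0, epsilon_l) are our own assumptions.

import random
from collections import defaultdict

class NaiveSampling:
    """Simplified Naive Sampling CMAB: one local MAB per action component,
    plus a global MAB over full combinations that have already been tried."""

    def __init__(self, component_values, epsilon_0=0.75, epsilon_l=0.33):
        self.component_values = component_values      # candidate values per component
        self.epsilon_0 = epsilon_0                    # prob. of exploring via the local MABs
        self.epsilon_l = epsilon_l                    # epsilon-greedy parameter of the local MABs
        n = len(component_values)
        self.local_sum = [defaultdict(float) for _ in range(n)]
        self.local_cnt = [defaultdict(int) for _ in range(n)]
        self.global_sum = defaultdict(float)
        self.global_cnt = defaultdict(int)

    def _local_pick(self, i):
        values = self.component_values[i]
        if random.random() < self.epsilon_l or not self.local_cnt[i]:
            return random.choice(values)
        return max(values, key=lambda v: self.local_sum[i][v] / max(1, self.local_cnt[i][v]))

    def sample(self):
        if random.random() < self.epsilon_0 or not self.global_cnt:
            # explore: assemble a combination from the local MABs
            return tuple(self._local_pick(i) for i in range(len(self.component_values)))
        # exploit: best combination seen so far according to the global MAB
        return max(self.global_cnt, key=lambda a: self.global_sum[a] / self.global_cnt[a])

    def observe_reward(self, arm, reward):
        for i, v in enumerate(arm):
            self.local_sum[i][v] += reward
            self.local_cnt[i][v] += 1
        self.global_sum[arm] += reward
        self.global_cnt[arm] += 1

With probability epsilon_0 the sampler explores by assembling a combination from the per-component estimates; otherwise it exploits the best complete combination observed so far.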
Playout Policies in MCTS

If we had the optimal policy available, playouts according to this policy would produce accurate evaluations of states. However, having such an optimal policy is not always possible. If a policy is not one of the optimal ones, no matter how good the policy is, some error is introduced into the evaluation and accumulated along the playout sequences. If the error is unbalanced, even a strong policy can result in a very inaccurate state evaluation. Previous work on simulation balancing (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016) approaches this problem by optimizing not for policy strength but for policy balance, so that the errors cancel out in the long run.

Although the general principles for generating good playout policies are not yet fully understood, in practice, when learning a playout policy, the policy is trained to mimic a simulation-balanced agent. This can be either an expert that can evaluate states accurately or a strong agent that can analyse the positions deeply. In the work of Silver and Tesauro (2009), an expert agent is used, and in other work (Huang, Coulom, and Lin 2010; Graf and Platzner 2016) apprenticeship learning from deep MCTS is shown to be effective. However, it isn't clear that simulation balancing is the only factor to take into account when designing playout policies. Thus, in this paper, we take a different approach, and optimize playout policies to maximize MCTS performance directly.
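To illustrate the role the playout policy plays in evaluation, the sketch below estimates a state's value by averaging the outcomes of playouts driven by a given policy. The GameState interface (clone, game_over, apply, winner) is a hypothetical stand-in for the µRTS forward model, not its actual API. Any systematic bias of the playout policy toward one player shows up directly as bias in this estimate, which is precisely the imbalance that simulation balancing tries to cancel out.

def evaluate_by_playouts(state, playout_policy, n_playouts=100, max_steps=200):
    """Monte Carlo state evaluation: average outcome of n_playouts simulations.
    Assumes a hypothetical forward model with clone(), game_over(), apply() and
    winner() (+1 max-player win, -1 loss, 0 draw or cutoff)."""
    total = 0.0
    for _ in range(n_playouts):
        sim = state.clone()
        steps = 0
        while not sim.game_over() and steps < max_steps:
            sim.apply(playout_policy(sim))   # both players act via the playout policy
            steps += 1
        total += sim.winner()
    return total / n_playouts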
Policy Optimization in µRTS

In order to study the differences between policies optimized for gameplay and those optimized directly as playout policies and tree policies, we define a very simple parametrized policy, and use an optimization process to optimize these parameters.

Policy Parameterization

We employ a simple stochastic parameterization of the policy, where we define a weight vector w = (w1, ..., w6), where each of the six weights wi ∈ [0, 1] corresponds to one of the six types of actions in the game:

• NONE: no action.
• MOVE: move to an adjacent position.
• HARVEST: harvest a resource in an adjacent position.
• RETURN: return a resource to a nearby base.
• PRODUCE: produce a new unit (only bases and barracks can produce units, and only workers can produce new buildings).
• ATTACK: attack an enemy unit that is within range.

A policy is thus completely represented by the vector w. During gameplay, the action for each unit is selected proportionally to this weight vector. To choose the action for a given unit, the following procedure is used: given all the available actions for a unit, a probability distribution is formed by assigning each of these actions the corresponding weight in w, and then normalizing the resulting vector into a probability distribution. If the weights of all the available actions are 0, then an action is chosen uniformly at random (a minimal sketch of this procedure is given below). Notice that this defines a very simple space of policies but, as we will see below, it is surprisingly expressive, and includes policies that are stronger than it might initially seem.

The goal of keeping the policy space simple is to be able to find near-optimal policies (within the policy space) in a computationally inexpensive way. The same ideas presented here would apply to more expressive policies, parameterized by larger parameter vectors, such as those represented by a neural network, for example (although a different optimization algorithm might be required, such as reinforcement learning).
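The following is a minimal sketch of the action-selection procedure just described, assuming each available action is tagged with the index of its action type; it is illustrative Python, not the actual µRTS implementation.

import random

ACTION_TYPES = ["NONE", "MOVE", "HARVEST", "RETURN", "PRODUCE", "ATTACK"]

def choose_action(available_actions, w):
    """available_actions: list of (action, type_index) pairs for one unit.
    w: the six action-type weights. Returns one action."""
    weights = [w[type_index] for _, type_index in available_actions]
    if sum(weights) == 0:
        # all available actions have weight 0: fall back to a uniform choice
        return random.choice(available_actions)[0]
    action, _ = random.choices(available_actions, weights=weights, k=1)[0]
    return action

For example, with w = (0, 1, 5, 5, 5, 5), a unit that can both move and harvest is five times more likely to harvest than to move.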
Policy Optimization

Given the parameterization, we can optimize the policy for many purposes using different optimization algorithms. In this paper, we use repeated games of bandits (RGB) (Cesa-Bianchi and Lugosi 2006; Slivkins 2019). RGB works as in Algorithm 1, where two regret-minimizing agents repeatedly play against each other. If the repeated game is zero-sum, the empirical distribution of play converges to a Nash equilibrium. The motivation is that if a policy is optimized to maximize win rates against a single other agent, cycles might be created, where we have three policies A, B, and C, and A beats B, B beats C, and C beats A. To avoid these cycles and compute the least exploitable agent, we need to approximate the Nash equilibrium. In each iteration of RGB, the best response against our current belief of the optimal strategy needs to be computed. Many algorithms can be used to compute this best response.

In particular, in this work we use multi-armed bandits as a way to compute the best response. For bandit optimization, we discretized the search space, allowing each weight to take values in {0, 1, 2, 3, 4, 5}. Specifically, we model the problem using combinatorial bandits, since the problem has a combinatorial structure where there are 6 types of actions and for each action type there are 6 different weights to choose from. Moreover, notice that if we multiply a weight vector by a scalar strictly larger than zero, the resulting policy is identical in behavior. Internally, when interpreting a weight vector as a policy, the vector is normalized to a probability distribution (that sums up to one).

Zero-Sum Repeated Game of Bandits. In order to find the optimal policy within the space of policies defined by our 6-parameter vector, we use Naïve Sampling within the RGB framework. Specifically, we use Algorithm 1. Given a target set of maps m, we use RGB as follows. We initialize a set of policies N, and then execute T iterations of repeated games between two regret-minimizing bandit agents B1 and B2. At each iteration k, two arms, π1_k and π2_k, are pulled from each bandit independently and simultaneously. Then 10 games are played between the policies and the averaged reward r ∈ [0, 1] is revealed to both bandits (r to B1, 1 − r to B2). Both of the selected arms are added to N. As we discussed above, it has been shown that N converges to the Nash equilibrium.

Algorithm 1: Repeated Game of Bandits (with Naïve Sampling)
  Initialize Nash equilibrium strategy set N = ∅.
  Initialize two bandit agents B1 and B2:
    CMAB1 = new NaïveSampling() bandit
    CMAB2 = new NaïveSampling() bandit
  for k = 1, 2, 3, ..., T do
    Choose arm π1_k = CMAB1.sample()
    Choose arm π2_k = CMAB2.sample()
    r = play a game π1_k vs π2_k in map m
    CMAB1.observeReward(π1_k, r)
    CMAB2.observeReward(π2_k, 1 − r)
    N ← N ∪ {π1_k, π2_k}

However, in this study we stick to a single policy for analysis, in order to obtain a policy represented by just a vector of 6 numbers and to keep the results interpretable, so that we can compare the result of optimizing for gameplay strength versus optimizing for playout strength. The final weight vector is the most visited arm after the bandit optimization process.

In order to optimize a policy for being a strong playout policy, rather than a strong gameplay policy, we use the exact same procedure, except that when playing a game between π1_k and π2_k, we use MCTS agents where π1_k and π2_k are used as the playout policies.

Furthermore, we also experiment with optimizing the policy as the tree policy of MCTS, in order to observe its difference from policies optimized for game-playing strength. The research question is whether it is enough to optimize only for game-playing strength to obtain a good playout or tree policy.

Finally, we optimize the tree policy and the playout policy directly at the same time. The purpose is to see if there are possible interactions between the two types of policies, and to potentially obtain policy combinations that work better than optimizing them separately.
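A runnable rendering of Algorithm 1 and its variants is sketched below. The two bandit objects are assumed to expose sample() and observe_reward() (in our setting they would be Naïve Sampling CMABs over the discretized weight vectors), and play_games is a stand-in for running 10 µRTS games between the two candidate policies (either directly, or wrapped inside MCTS agents as playout or tree policies, depending on the optimization objective) and returning the average result for the first arm.

from collections import Counter

def repeated_game_of_bandits(bandit1, bandit2, play_games, iterations=10000):
    """Two regret-minimizing bandits repeatedly play against each other.
    Returns the set of pulled arms (the empirical Nash support) and the
    most visited arm of the first bandit, the single policy we keep."""
    pulled = set()
    visits = Counter()
    for _ in range(iterations):
        arm1 = bandit1.sample()
        arm2 = bandit2.sample()
        r = play_games(arm1, arm2)            # average reward in [0, 1] for arm1
        bandit1.observe_reward(arm1, r)
        bandit2.observe_reward(arm2, 1.0 - r)
        pulled.update([arm1, arm2])
        visits[arm1] += 1
    most_visited = visits.most_common(1)[0][0]
    return pulled, most_visited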
well in RTS games. Then, we further investigate the bandit Now we experiment the performance of SB. We first col- based optimization approach and the effect of the different lect a dataset of estimated true state values from the three optimization objectives describe above. maps using Naı̈veMCTS of 100000 iterations and the value Three different maps are used to test the generalizability estimation of the root node is recorded as the estimated true of our comparison. The maps are: state value. 1000 states are sampled from 200 self-played • Map 1: 8x8/basesWorkers8x8A.xml: In this map of size games of two random agents. During training, we first cal- 8 by 8, each player starts with one base and one worker. culate the state value estimated by the playout policy V by Games are cut-off at 3000 cycles. averaging 1000 playouts. Then we run another 1000 play- outs to calculate policy gradient ψθt . The resulting policy • Map 2: 8x8/FourBasesWorkers8x8.xml: In this map of of SB optimization is characterized by the parameter vector size 8 by 8, each player starts with four bases and four [0.02, 0.32, 0.18, 0.18, 0.17, 0.13]. Together with a purely worker. Games are cut-off at 3000 cycles. random and the built-in RandomBiased bot in µRTS, the re- • Map 3: NoWhereToRun9x8.xml: In this map of size nine sult from SB will be used as the baseline in our study. by eight, each player starts with one base and the players are initially separated by a wall of resources, that needs to Optimization for Gameplay Strength be mined through in order to reach each other. Games are In the first experiment, we optimize the policy with multi- cut-off at 3000 cycles. ple maps together and compare with the policies in (Yang and Ontañón 2020). Specifically, we run 10000 iterations Monte Carlo Simulation Balancing of the repeated game of bandits between two Naı̈ve Sam- Simulation Balancing (SB) (Silver and Tesauro 2009; pling agents to obtain a history distribution of the process. Huang, Coulom, and Lin 2010; Graf and Platzner 2016) ap- The arms pulled by the two bandits correspond to the game- proach the problem of optimizing for good playout policy playing policies and play against each other for 10 games to by not optimizing policy strength but optimizing policy bal- calculate the reward. The result is 20000 policies (the policy ance. It is a policy gradient-based method that minimizes of each of the two players over 10000 itertions). We visual- “imbalance” in the policies so that so that the small errors ized the weight distribution of these 20000 policies. cancel each other out during the whole playout. The pseu- The result for gameplay strength optimization is shown in docode is given in Algorithm 2. The algorithm first con- Figure 2-a. we observe that NONE, MOVE, and PRODUCE structs a training set of state/state value pairs. The true state are mostly assigned a 0 weight in most of the policies in value can be estimated by performing a deep MCTS search the distribution. RETURN and ATTACK are mostly given when expert play is not available Then the algorithm uses weight of 1. HARVEST and RETURN are given more di- Monte Carlo simulation to calculate the actual state value verse weights, probably due to the fact that we use different estimation of the given policy. Finally, the algorithm cal- maps, and some values might work better in some maps than culates the difference of the true state value and estimated in others. Later in the paper, we will evaluate how strong state value to do policy gradient update. 
Optimization for Gameplay Strength

In the first experiment, we optimize the policy on multiple maps together and compare with the policies in (Yang and Ontañón 2020). Specifically, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents to obtain a history distribution of the process. The arms pulled by the two bandits correspond to the game-playing policies, which play against each other for 10 games to calculate the reward. The result is 20000 policies (the policy of each of the two players over 10000 iterations). We visualized the weight distribution of these 20000 policies.

The result for gameplay strength optimization is shown in Figure 2-a. We observe that NONE, MOVE, and PRODUCE are mostly assigned a 0 weight in most of the policies in the distribution. RETURN and ATTACK are mostly given a weight of 1. HARVEST and RETURN are given more diverse weights, probably due to the fact that we use different maps, and some values might work better in some maps than in others. Later in the paper, we will evaluate how strong these policies are in actual gameplay.

[Figure 2: Weight distribution by action types in the history distribution for different optimization objectives. Panels: (a) gameplay strength optimization, (b) tree policy optimization, (c) playout policy optimization, (d) tree policy in the joint optimization, (e) playout policy in the joint optimization.]
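Distributions like those in Figure 2 can be obtained by simple counting over the pulled arms. The sketch below (with hypothetical variable names) turns a history of sampled weight vectors into, for each action type, the fraction of policies that assign each discrete weight.

from collections import Counter

ACTION_TYPES = ["NONE", "MOVE", "HARVEST", "RETURN", "PRODUCE", "ATTACK"]
WEIGHT_VALUES = range(6)   # the discretized weights {0, ..., 5}

def weight_distribution(pulled_arms):
    """pulled_arms: iterable of 6-tuples of discrete weights (one tuple per pulled arm).
    Returns, for each action type, the fraction of arms using each weight value."""
    arms = list(pulled_arms)
    distribution = {}
    for i, name in enumerate(ACTION_TYPES):
        counts = Counter(arm[i] for arm in arms)
        distribution[name] = {w: counts[w] / len(arms) for w in WEIGHT_VALUES}
    return distribution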
Optimization for Tree Policy

In the second experiment we optimize the tree policy directly, as opposed to optimizing for gameplay strength. The experimental setup is similar to the gameplay strength optimization, but the performance of the policies is measured directly by using them as tree policies of MCTS. Again, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents. The arms pulled by the two bandits are used as tree policies by MCTS agents, which play against each other for 10 games to calculate the reward.

The result of the optimization is visualized in Figure 2-b. The results agree with the gameplay strength optimization in that NONE and MOVE are assigned a 0 weight with high probability, but disagree in that PRODUCE now has a low probability of being assigned weight 0. Also, HARVEST, RETURN, and ATTACK have more spread-out weights than under gameplay optimization. This shows that strong gameplay policies tend to be different from strong tree policies.

Optimization for Playout Policy

In the third experiment we optimize for the playout policy, with a similar experimental setup as in the tree policy optimization. We run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents, and the arms pulled are interpreted as playout policies and used by MCTS agents to play against each other for 10 games to calculate the reward.

The result of the optimization is visualized in Figure 2-c. We can observe that the weight distribution is very different from the distributions obtained when optimizing the tree policy or the gameplay policy. In this weight distribution, NONE is mostly assigned weight 1. MOVE and PRODUCE are mostly assigned weight 0. ATTACK is mostly assigned the highest weight of 5. HARVEST is spread between 1, 2, and 3. RETURN has most of the weights assigned to 1 and 5. Again, we see that strong playout policies are very different from strong gameplay policies.

Joint Optimization for Tree Policy and Playout Policy

To test whether the tree policy and the playout policy interact with each other, we further investigate by optimizing both at the same time. Similarly, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents that choose values for 12 parameters rather than 6, and the arms pulled are interpreted as two policies, one as the tree policy and the other as the playout policy, used by MCTS agents to play against each other for 10 games to calculate the reward. Thus, in this experiment, arms pulled by the bandits have 12 parameters: the first six are interpreted as the tree policy and the other six as the playout policy.

The result of the optimization is visualized in Figure 2-d and Figure 2-e, showing that the jointly optimized policies are different from the policies obtained when optimizing them separately. Let us now compare how strong these policies are in actual gameplay.

Comparing Performance as Gameplay Policies vs. Playout Policies

So far we have policies optimized for different objectives:

• Optimizing "simulation balance".
• Optimizing the gameplay strength of the policy.
• Optimized as the tree policy of MCTS.
• Optimized as the playout policy of MCTS.
• Joint optimization of tree policy and playout policy.

Now, together with the two baselines, Random and RandomBiased, we compare these policies in two tasks: gameplay strength when used directly to play (without MCTS), and gameplay strength when used as playout policies within MCTS. We run 10 rounds of round-robin between all the policies. The winrates are reported in Figure 3 (we tested the tree policies separately, as reported below).

[Figure 3: Comparison of winrates of policies serving as the game-playing policy and as the playout policy of an MCTS agent.]

First, simulation balancing has a winrate of 0.16 as a game-playing policy and a winrate of 0.29 as the playout policy. It is outperformed by the two baselines as a game-playing policy (winrates of 0.20 and 0.37, respectively), but is better than the baselines as the playout policy of MCTS (winrates of 0.02 and 0.09, respectively). This is expected, as simulation balancing is designed to produce strong playout policies.

Second, the policy optimized for gameplay strength outperformed the baselines by a large margin and has the best gameplay winrate (0.58) and the third best winrate as a playout policy (0.52). For performance as a playout policy, it is only worse than the two policies optimized as playout policies directly. The policy optimized as the tree policy of MCTS also outperformed the baselines, but has worse winrates than the gameplay-optimized policy in both tracks (winrates of 0.53 and 0.47).

Now we look at the policy optimized as a playout policy directly. The result is interesting, since it is very weak in terms of gameplay (winrate of merely 0.08), but very strong as a playout policy (winrate of 0.62). This suggests that strong gameplay strength is not a requirement for being a good playout policy, and that simulation balancing does not capture all that is required for a strong playout policy.

Lastly, we have the pair of policies that were optimized together, one as the tree policy and the other as the playout policy. The tree policy achieved winrates similar to the singly optimized tree policy (0.55 and 0.45, respectively). The policy optimized as the playout policy is interesting in that not only is it good as a playout policy (winrate of 0.56), but it also has good gameplay strength (winrate of 0.52).
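A minimal round-robin harness of the kind we assume for these comparisons is sketched below; play_game is a placeholder that runs one µRTS game between two agents (raw policies, or MCTS agents using them as playout policies) and returns 1, 0.5, or 0 for a win, draw, or loss of the first agent.

from collections import defaultdict
from itertools import combinations

def round_robin(agents, play_game, rounds=10):
    """agents: dict mapping name -> agent. Returns each agent's average score."""
    score = defaultdict(float)
    games = defaultdict(int)
    for _ in range(rounds):
        for (name1, a1), (name2, a2) in combinations(agents.items(), 2):
            r = play_game(a1, a2)
            score[name1] += r
            score[name2] += 1.0 - r
            games[name1] += 1
            games[name2] += 1
    return {name: score[name] / games[name] for name in agents}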
Strength of Tree Policies

The results of the jointly optimized policies suggest there could be some factor of "match" between the tree policy and the playout policy for them to work well together. Thus, to further verify this hypothesis, we take the best pairs of singly optimized tree policies and playout policies to play against the pair of jointly optimized policies, in two MCTS agents.

We run the jointly optimized pair against the pair of best gameplay policy (as tree policy) and best singly optimized playout policy (as playout policy) for 1000 games, and the jointly optimized pair has a winrate of 0.64. We also run the jointly optimized pair against the pair of best tree policy and best singly optimized playout policy for 1000 games, and the jointly optimized pair has a winrate of 0.67.

Moreover, we run the gameplay-optimized policy against an optimized tree policy, each as the tree policy of an MCTS agent, for 1000 games, both with the optimized playout policy. We found the gameplay policy has a winrate of 0.44, which means that a gameplay-optimized policy does not necessarily make for a good tree policy.

Conclusions

In this paper, we have studied policy optimization in several settings. First, we tried simulation balancing for playout policy optimization. We found that although it is better than the baselines as a playout policy, its performance is not comparable to optimizing as a playout policy directly. We also tried optimizing the gameplay policy, the tree policy, and the playout policy on three maps at the same time. We observed that for some action types, like NONE and MOVE, the weight distributions are in consensus across all maps, but for others the weight distribution is spread over multiple values. This might be because certain weights are good for only some maps. Furthermore, we compared the performance as tree policy between the optimized gameplay policy and the optimized tree policy, and confirmed that optimizing the tree policy directly does help. Finally, we optimized the tree policy and the playout policy jointly. The resulting pair of policies outperforms the combination of the best tree policy and the best playout policy, which suggests that the "match" between the tree policy and the playout policy can also play an important role in the performance of MCTS.

For future work, we want to further investigate the simulation balancing algorithm, since there have been good advances in policy gradient algorithms that might help scale up SB. Also, the joint optimization of different components of the MCTS algorithm seemed to be beneficial. It will be interesting to take more factors, like evaluation function tuning and exploration parameters, into the optimization process, to see if we can push the progress further and gain insight into the interplay between the different pieces of MCTS.

References

Baier, H., and Drake, P. D. 2010. The power of forgetting: Improving the last-good-reply policy in Monte Carlo Go. IEEE Transactions on Computational Intelligence and AI in Games 2(4):303–309.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, 72–83. Springer.

Coulom, R. 2007. Computing "Elo ratings" of move patterns in the game of Go. ICGA Journal 30(4):198–208.

Gelly, S., and Silver, D. 2007. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, 273–280.

Graf, T., and Platzner, M. 2016. Monte-Carlo simulation balancing revisited. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), 1–7. IEEE.

Huang, S.-C.; Coulom, R.; and Lin, S.-S. 2010. Monte-Carlo simulation balancing in practice. In International Conference on Computers and Games, 81–92. Springer.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 282–293. Springer.

Munos, S. G. W., and Teytaud, O. 2006. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, 32:30–56.
Ontañón, S.; Synnaeve, G.; Uriarte, A.; Richoux, F.; Churchill, D.; and Preuss, M. 2013. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games 5(4):293–311.

Ontañón, S. 2017. Combinatorial multi-armed bandits for real-time strategy games. Journal of Artificial Intelligence Research 58:665–702.

Shleyfman, A.; Komenda, A.; and Domshlak, C. 2014. On combinatorial actions and CMABs with linear side information. In ECAI, 825–830.

Silver, D., and Tesauro, G. 2009. Monte-Carlo simulation balancing. In Proceedings of the 26th Annual International Conference on Machine Learning, 945–952.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T. P.; Simonyan, K.; and Hassabis, D. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354–359.

Silver, D.; Sutton, R. S.; and Müller, M. 2012. Temporal-difference search in computer Go. Machine Learning 87(2):183–219.

Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.

Yang, Z., and Ontañón, S. 2020. Are strong policies also good playout policies? Playout policy optimization for RTS games. In Sixteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE).