Bandit-Based Policy Optimization for Monte Carlo Tree Search in RTS Games

Zuozhi Yang (1) and Santiago Ontañón (1,2)
(1) Drexel University, Philadelphia, USA
(2) Google AI, Mountain View, USA
zy337@drexel.edu, santiontanon@google.com

Copyright © 2020, for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Monte Carlo Tree Search has been successfully applied to complex domains such as computer Go. However, despite its success in building game-playing agents, there are still many questions to be answered regarding the general principles used to design or learn its playout policy, or the interaction between tree policy and playout policy. Many systems, such as AlphaGo, use a policy optimized to mimic human experts as the playout policy of MCTS. In our recent work, we have shown that strong gameplay policies do not necessarily make the best playout policies. In this paper, we take a step further and use bandit algorithms to optimize stochastic policies as gameplay policies, tree policies, and playout policies for MCTS in the context of RTS games. Our results show that strong playout policies do not need to be strong gameplay policies, and that policies that maximize MCTS performance as playout policies are actually weak in terms of gameplay strength. We also found that optimizing the tree policy directly has an edge over optimizing for gameplay strength. Finally, we show that jointly optimizing the tree policy and the playout policy can be beneficial to overall performance compared to optimizing them separately.

Introduction

Monte Carlo Tree Search (MCTS) tends to outperform systematic search in domains with large branching factors. The most prominent success of MCTS is in the domain of computer Go, where an agent, AlphaGo, built using a combination of MCTS and neural networks, achieved superhuman performance (Silver et al. 2016). In AlphaGo, a policy optimized for gameplay strength is used as the playout policy of MCTS. However, previous work has shown that having good gameplay strength is not a sufficient condition for being a good playout policy (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016). Motivated by the question of what makes a good playout policy, in this paper we empirically study the effect of optimizing playout policies with different objectives for MCTS in the domain of real-time strategy (RTS) games.

In almost all variations of MCTS, playout policies, also called simulation policies, are used to select actions for both players during the forward simulation phase of the search process. Since the quality of the playout policy has a great impact on the overall performance of MCTS, previous work has covered various methods to generate these policies, such as handcrafted patterns (Munos and Teytaud 2006), supervised learning (Coulom 2007), reinforcement learning (Gelly and Silver 2007), simulation balancing (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016), and online adaptation (Silver, Sutton, and Müller 2012; Baier and Drake 2010). However, there is little generalizable understanding about how to design or learn good playout policies in systematic ways. Optimizing directly for the gameplay strength of the playout policy often yields decreased performance (Gelly and Silver 2007) (an effect we also observed in preliminary experiments, and which partially motivated this work). In recent Go research, playout policies are sometimes abandoned and replaced by refined evaluation functions (Silver et al. 2017b; 2017a). This paper extends our previous work (Yang and Ontañón 2020), where we showed that policies that are weak at gameplay can still be strong playout policies.

Specifically, in this paper we evaluate the difference in behavior of game-playing policies when optimized for gameplay strength, for playout policy performance, and as tree policies. Since our goal is just to understand what makes a good playout or tree policy, we employ very simple policies, and use bandit algorithms for the optimization process. µRTS (https://github.com/santiontanon/microrts) is used as the testbed, as it offers a minimalistic yet complete RTS game environment and a collection of MCTS implementations. We optimize for different objectives: 1) the win rate of the policy directly, and 2) the win rate of an MCTS agent when using the policy as the playout or tree policy.

The rest of the paper is structured as follows. First, we provide background on RTS games, MCTS, and policy optimization. Then we describe the baseline, and our approach for optimizing gameplay policies, tree policies, and playout policies, as well as the joint optimization of tree policy and playout policy. We show visualizations of the distributions of the trained policies, then compare them with each other and with baseline policies. Finally, we draw conclusions and discuss lines of future work.
Background

Real-time strategy (RTS) is a sub-genre of strategy games where players aim to defeat their opponents (destroying their army and base) by strategically building an economy (gathering resources and building a base), military power (training units and researching technologies), and controlling those units. The main differences between RTS games and traditional board games are: they are simultaneous-move games (more than one player can issue actions at the same time), they have durative actions (actions are not instantaneous), they are real-time (each player has a very small amount of time to decide the next move), they are partially observable (players can only see the part of the map that has been explored, although in this paper we assume full observability), and they might be non-deterministic.

RTS games have been receiving an increased amount of attention (Ontañón et al. 2013) as they are more challenging than games like Go or Chess in at least three different ways: (1) the combinatorial growth of the branching factor (Ontañón 2017), (2) the limited computation budget between actions due to the real-time nature, and (3) the lack of a forward model in most research environments, such as StarCraft. Specifically, in this paper we chose µRTS as our experimental domain, as it offers a forward model for the application of Monte Carlo Tree Search as well as existing implementations of MCTS and stochastic policies for optimization.

µRTS is a simple RTS game designed for testing AI techniques. µRTS provides the essential features that make RTS games challenging from an AI point of view: simultaneous and durative actions, combinatorial branching factors, and real-time decision making. The game can be configured to be partially observable and non-deterministic, but those settings are turned off for all the experiments presented in this paper. We chose µRTS since, in addition to featuring the above properties, it does so in a very minimalistic way, by defining only four unit types and two building types, all of them occupying one tile, and using only a single resource type. Additionally, as required by our experiments, µRTS allows maps of arbitrary sizes and initial configurations.

There is one type of environment unit (minerals) and six types of units controlled by players (bases, barracks, workers, and light, heavy and ranged military units). Additionally, the environment can have walls to block the movement of units. An example screenshot of the game is shown in Figure 1. The squared units in green are Minerals, with numbers on them indicating the remaining resources. The units with blue outline belong to player 1 and those with red outline belong to player 2. The light grey squared units are Bases, with numbers indicating the amount of resources owned by the player, while the darker grey squared units are the Barracks.

[Figure 1: A screenshot of µRTS. The labels in the figure mark the "max" player's units and the "min" player's units.]

Monte Carlo Tree Search in RTS Games

Monte Carlo Tree Search (Browne et al. 2012; Coulom 2006) is a method for sequential decision making in domains that can be represented by search trees. It has been a successful approach to tackle complex games like Go, as it takes random samples in the search space to estimate state values. Most of the classic tree policies of MCTS, e.g. UCT (Kocsis and Szepesvári 2006), do not scale up well to RTS games due to the combinatorial growth of the branching factor with respect to the number of units. Sampling techniques for combinatorial branching factors such as Naïve Sampling (Ontañón 2017) or LSI (Shleyfman, Komenda, and Domshlak 2014) were proposed to improve the exploration of MCTS by exploiting combinatorial multi-armed bandits (CMABs). There have been many other enhancement techniques for the tree policy, but since our focus is on the playout (a.k.a. simulation) policy, we employ MCTS with Naïve Sampling in this paper for simplicity (NaïveMCTS).
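To make the CMAB idea concrete, the following is a minimal Python sketch of a Naïve-Sampling-style combinatorial bandit in the spirit of (Ontañón 2017): one local bandit per action component plus a global bandit over the full combinations tried so far. It is an illustration only, not the NaïveMCTS implementation shipped with µRTS, and the parameter names (epsilon_0, epsilon_l) are our own assumptions.

import random
from collections import defaultdict

class NaiveSampling:
    """Simplified Naive Sampling CMAB: one local MAB per action component,
    plus a global MAB over full combinations that have already been tried."""

    def __init__(self, component_values, epsilon_0=0.75, epsilon_l=0.33):
        self.component_values = component_values      # candidate values per component
        self.epsilon_0 = epsilon_0                    # prob. of exploring via the local MABs
        self.epsilon_l = epsilon_l                    # epsilon-greedy parameter of the local MABs
        n = len(component_values)
        self.local_sum = [defaultdict(float) for _ in range(n)]
        self.local_cnt = [defaultdict(int) for _ in range(n)]
        self.global_sum = defaultdict(float)
        self.global_cnt = defaultdict(int)

    def _local_pick(self, i):
        values = self.component_values[i]
        if random.random() < self.epsilon_l or not self.local_cnt[i]:
            return random.choice(values)
        return max(values, key=lambda v: self.local_sum[i][v] / max(1, self.local_cnt[i][v]))

    def sample(self):
        if random.random() < self.epsilon_0 or not self.global_cnt:
            # explore: assemble a combination from the local MABs
            return tuple(self._local_pick(i) for i in range(len(self.component_values)))
        # exploit: best combination seen so far according to the global MAB
        return max(self.global_cnt, key=lambda a: self.global_sum[a] / self.global_cnt[a])

    def observe_reward(self, arm, reward):
        for i, v in enumerate(arm):
            self.local_sum[i][v] += reward
            self.local_cnt[i][v] += 1
        self.global_sum[arm] += reward
        self.global_cnt[arm] += 1

With probability epsilon_0 the sampler explores by assembling a combination from the per-component estimates; otherwise it exploits the best complete combination observed so far.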
Playout Policies in MCTS

If we had the optimal policy available, playouts according to this policy would produce accurate evaluations of states. However, having such an optimal policy is not always possible. If a policy is not one of the optimal ones, no matter how good the policy is, some error is introduced into the evaluation and accumulated along the playout sequences. If the error is unbalanced, even a strong policy can result in a very inaccurate state evaluation. Previous work on simulation balancing (Silver and Tesauro 2009; Huang, Coulom, and Lin 2010; Graf and Platzner 2016) approaches this problem by optimizing not for policy strength but for policy balance, so that the errors cancel out in the long run.

Although the general principles for generating good playout policies are not yet fully understood, in practice, when learning a playout policy, the policy is trained to mimic a simulation-balanced agent. This can be either an expert that can evaluate states accurately or a strong agent that can analyse the positions deeply. In the work of Silver and Tesauro (2009), an expert agent is used, and in other work (Huang, Coulom, and Lin 2010; Graf and Platzner 2016) apprenticeship learning from deep MCTS is shown to be effective. However, it isn't clear that simulation balancing is the only factor to take into account when designing playout policies. Thus, in this paper, we take a different approach, and optimize playout policies to maximize MCTS performance directly.
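To illustrate the role the playout policy plays in evaluation, the sketch below estimates a state's value by averaging the outcomes of playouts driven by a given policy. The GameState interface (clone, game_over, apply, winner) is a hypothetical stand-in for the µRTS forward model, not its actual API. Any systematic bias of the playout policy toward one player shows up directly as bias in this estimate, which is precisely the imbalance that simulation balancing tries to cancel out.

def evaluate_by_playouts(state, playout_policy, n_playouts=100, max_steps=200):
    """Monte Carlo state evaluation: average outcome of n_playouts simulations.
    Assumes a hypothetical forward model with clone(), game_over(), apply() and
    winner() (+1 max-player win, -1 loss, 0 draw or cutoff)."""
    total = 0.0
    for _ in range(n_playouts):
        sim = state.clone()
        steps = 0
        while not sim.game_over() and steps < max_steps:
            sim.apply(playout_policy(sim))   # both players act via the playout policy
            steps += 1
        total += sim.winner()
    return total / n_playouts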
Policy Optimization in µRTS

In order to study the differences between policies optimized for gameplay and those optimized directly as playout policies and tree policies, we define a very simple parametrized policy, and use an optimization process to optimize these parameters.

Policy Parameterization

We employ a simple stochastic parameterization of the policy, where we define a weight vector w = (w1, ..., w6), where each of the six weights wi ∈ [0, 1] corresponds to one of the six types of actions in the game:

• NONE: no action.
• MOVE: move to an adjacent position.
• HARVEST: harvest a resource in an adjacent position.
• RETURN: return a resource to a nearby base.
• PRODUCE: produce a new unit (only bases and barracks can produce units, and only workers can produce new buildings).
• ATTACK: attack an enemy unit that is within range.

A policy is thus completely represented by the vector w. During gameplay, the action for each unit is selected proportionally to this weight vector. To choose the action for a given unit, the following procedure is used: given all the available actions for a unit, a probability distribution is formed by assigning each of these actions the corresponding weight in w, and then normalizing the resulting vector into a probability distribution. If the weights of all the available actions are 0, then an action is chosen uniformly at random (a minimal sketch of this procedure is given below). Notice that this defines a very simple space of policies but, as we will see below, it is surprisingly expressive, and includes policies that are stronger than it might initially seem.

The goal of keeping the policy space simple is to be able to find near-optimal policies (within the policy space) in a computationally inexpensive way. The same ideas presented here would apply to more expressive policies, parameterized by larger parameter vectors, such as those represented by a neural network, for example (although a different optimization algorithm might be required, such as reinforcement learning).
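The following is a minimal sketch of the action-selection procedure just described, assuming each available action is tagged with the index of its action type; it is illustrative Python, not the actual µRTS implementation.

import random

ACTION_TYPES = ["NONE", "MOVE", "HARVEST", "RETURN", "PRODUCE", "ATTACK"]

def choose_action(available_actions, w):
    """available_actions: list of (action, type_index) pairs for one unit.
    w: the six action-type weights. Returns one action."""
    weights = [w[type_index] for _, type_index in available_actions]
    if sum(weights) == 0:
        # all available actions have weight 0: fall back to a uniform choice
        return random.choice(available_actions)[0]
    action, _ = random.choices(available_actions, weights=weights, k=1)[0]
    return action

For example, with w = (0, 1, 5, 5, 5, 5), a unit that can both move and harvest is five times more likely to harvest than to move.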
Policy Optimization

Given the parameterization, we can optimize the policy for many purposes using different optimization algorithms. In this paper, we use repeated games of bandits (RGB) (Cesa-Bianchi and Lugosi 2006; Slivkins 2019). RGB works as in Algorithm 1, where two regret-minimizing agents repeatedly play against each other. If the repeated game is zero-sum, the empirical distribution of play converges to a Nash equilibrium. The motivation is that if a policy is optimized to maximize win rates against a single other agent, cycles might be created, where we have three policies A, B, and C, and A beats B, B beats C, and C beats A. To avoid these cycles and compute the least exploitable agent, we need to approximate the Nash equilibrium. In each iteration of RGB, the best response against our current belief of the optimal strategy needs to be computed. Many algorithms can be used to compute this best response.

In particular, in this work we use multi-armed bandits as a way to compute the best response. For bandit optimization, we discretized the search space, allowing each weight to take values in {0, 1, 2, 3, 4, 5}. Specifically, we model the problem using combinatorial bandits, since the problem has a combinatorial structure where there are 6 types of actions and for each action type there are 6 different weights to choose from. Moreover, notice that if we multiply a weight vector by a scalar strictly larger than zero, the resulting policy is identical in behavior. Internally, when interpreting a weight vector as a policy, the vector is normalized to a probability distribution (that sums up to one).

Zero-Sum Repeated Game of Bandits. In order to find the optimal policy within the space of policies defined by our 6-parameter vector, we use Naïve Sampling within the RGB framework. Specifically, we use Algorithm 1. Given a target set of maps m, we use RGB as follows. We initialize a set of policies N, and then execute T iterations of repeated games between two regret-minimizing bandit agents B1 and B2. At each iteration k, two arms, π1_k and π2_k, are pulled from each bandit independently and simultaneously. Then 10 games are played between the policies and the averaged reward r ∈ [0, 1] is revealed to both bandits (r to B1, 1 − r to B2). Both of the selected arms are added to N. As we discussed above, it has been shown that N converges to the Nash equilibrium.

Algorithm 1: Repeated Game of Bandits (with Naïve Sampling)
  Initialize Nash equilibrium strategy set N = ∅.
  Initialize two bandit agents B1 and B2:
    CMAB1 = new NaïveSampling() bandit
    CMAB2 = new NaïveSampling() bandit
  for k = 1, 2, 3, ..., T do
    Choose arm π1_k = CMAB1.sample()
    Choose arm π2_k = CMAB2.sample()
    r = play a game π1_k vs π2_k in map m
    CMAB1.observeReward(π1_k, r)
    CMAB2.observeReward(π2_k, 1 − r)
    N ← N ∪ {π1_k, π2_k}

However, in this study we stick to a single policy for analysis, in order to obtain a policy represented by just a vector of 6 numbers and to keep the results interpretable, so that we can compare the result of optimizing for gameplay strength versus optimizing for playout strength. The final weight vector is the most visited arm after the bandit optimization process.

In order to optimize a policy for being a strong playout policy, rather than a strong gameplay policy, we use the exact same procedure, except that when playing a game between π1_k and π2_k, we use MCTS agents where π1_k and π2_k are used as the playout policies.

Furthermore, we also experiment with optimizing the policy as the tree policy of MCTS, in order to observe its difference from policies optimized for game-playing strength. The research question is whether it is enough to optimize only for game-playing strength to obtain a good playout or tree policy.

Finally, we optimize the tree policy and the playout policy directly at the same time. The purpose is to see if there are possible interactions between the two types of policies, and to potentially obtain policy combinations that work better than optimizing them separately.
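A runnable rendering of Algorithm 1 and its variants is sketched below. The two bandit objects are assumed to expose sample() and observe_reward() (in our setting they would be Naïve Sampling CMABs over the discretized weight vectors), and play_games is a stand-in for running 10 µRTS games between the two candidate policies (either directly, or wrapped inside MCTS agents as playout or tree policies, depending on the optimization objective) and returning the average result for the first arm.

from collections import Counter

def repeated_game_of_bandits(bandit1, bandit2, play_games, iterations=10000):
    """Two regret-minimizing bandits repeatedly play against each other.
    Returns the set of pulled arms (the empirical Nash support) and the
    most visited arm of the first bandit, the single policy we keep."""
    pulled = set()
    visits = Counter()
    for _ in range(iterations):
        arm1 = bandit1.sample()
        arm2 = bandit2.sample()
        r = play_games(arm1, arm2)            # average reward in [0, 1] for arm1
        bandit1.observe_reward(arm1, r)
        bandit2.observe_reward(arm2, 1.0 - r)
        pulled.update([arm1, arm2])
        visits[arm1] += 1
    most_visited = visits.most_common(1)[0][0]
    return pulled, most_visited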
well in RTS games. Then, we further investigate the bandit Now we experiment the performance of SB. We first col- based optimization approach and the effect of the different lect a dataset of estimated true state values from the three optimization objectives describe above. maps using Naı̈veMCTS of 100000 iterations and the value Three different maps are used to test the generalizability estimation of the root node is recorded as the estimated true of our comparison. The maps are: state value. 1000 states are sampled from 200 self-played • Map 1: 8x8/basesWorkers8x8A.xml: In this map of size games of two random agents. During training, we first cal- 8 by 8, each player starts with one base and one worker. culate the state value estimated by the playout policy V by Games are cut-off at 3000 cycles. averaging 1000 playouts. Then we run another 1000 play- outs to calculate policy gradient ψθt . The resulting policy • Map 2: 8x8/FourBasesWorkers8x8.xml: In this map of of SB optimization is characterized by the parameter vector size 8 by 8, each player starts with four bases and four [0.02, 0.32, 0.18, 0.18, 0.17, 0.13]. Together with a purely worker. Games are cut-off at 3000 cycles. random and the built-in RandomBiased bot in µRTS, the re- • Map 3: NoWhereToRun9x8.xml: In this map of size nine sult from SB will be used as the baseline in our study. by eight, each player starts with one base and the players are initially separated by a wall of resources, that needs to Optimization for Gameplay Strength be mined through in order to reach each other. Games are In the first experiment, we optimize the policy with multi- cut-off at 3000 cycles. ple maps together and compare with the policies in (Yang and Ontañón 2020). Specifically, we run 10000 iterations Monte Carlo Simulation Balancing of the repeated game of bandits between two Naı̈ve Sam- Simulation Balancing (SB) (Silver and Tesauro 2009; pling agents to obtain a history distribution of the process. Huang, Coulom, and Lin 2010; Graf and Platzner 2016) ap- The arms pulled by the two bandits correspond to the game- proach the problem of optimizing for good playout policy playing policies and play against each other for 10 games to by not optimizing policy strength but optimizing policy bal- calculate the reward. The result is 20000 policies (the policy ance. It is a policy gradient-based method that minimizes of each of the two players over 10000 itertions). We visual- “imbalance” in the policies so that so that the small errors ized the weight distribution of these 20000 policies. cancel each other out during the whole playout. The pseu- The result for gameplay strength optimization is shown in docode is given in Algorithm 2. The algorithm first con- Figure 2-a. we observe that NONE, MOVE, and PRODUCE structs a training set of state/state value pairs. The true state are mostly assigned a 0 weight in most of the policies in value can be estimated by performing a deep MCTS search the distribution. RETURN and ATTACK are mostly given when expert play is not available Then the algorithm uses weight of 1. HARVEST and RETURN are given more di- Monte Carlo simulation to calculate the actual state value verse weights, probably due to the fact that we use different estimation of the given policy. Finally, the algorithm cal- maps, and some values might work better in some maps than culates the difference of the true state value and estimated in others. Later in the paper, we will evaluate how strong state value to do policy gradient update. 
Optimization for Gameplay Strength

In the first experiment, we optimize the policy on multiple maps together and compare with the policies in (Yang and Ontañón 2020). Specifically, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents to obtain a history distribution of the process. The arms pulled by the two bandits correspond to the game-playing policies, which play against each other for 10 games to calculate the reward. The result is 20000 policies (the policy of each of the two players over 10000 iterations). We visualized the weight distribution of these 20000 policies.

The result for gameplay strength optimization is shown in Figure 2-a. We observe that NONE, MOVE, and PRODUCE are mostly assigned a 0 weight in most of the policies in the distribution. RETURN and ATTACK are mostly given a weight of 1. HARVEST and RETURN are given more diverse weights, probably due to the fact that we use different maps, and some values might work better in some maps than in others. Later in the paper, we will evaluate how strong these policies are in actual gameplay.

[Figure 2: Weight distribution by action types in the history distribution for different optimization objectives. Panels: (a) gameplay strength optimization, (b) tree policy optimization, (c) playout policy optimization, (d) tree policy in the joint optimization, (e) playout policy in the joint optimization.]
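Distributions like those in Figure 2 can be obtained by simple counting over the pulled arms. The sketch below (with hypothetical variable names) turns a history of sampled weight vectors into, for each action type, the fraction of policies that assign each discrete weight.

from collections import Counter

ACTION_TYPES = ["NONE", "MOVE", "HARVEST", "RETURN", "PRODUCE", "ATTACK"]
WEIGHT_VALUES = range(6)   # the discretized weights {0, ..., 5}

def weight_distribution(pulled_arms):
    """pulled_arms: iterable of 6-tuples of discrete weights (one tuple per pulled arm).
    Returns, for each action type, the fraction of arms using each weight value."""
    arms = list(pulled_arms)
    distribution = {}
    for i, name in enumerate(ACTION_TYPES):
        counts = Counter(arm[i] for arm in arms)
        distribution[name] = {w: counts[w] / len(arms) for w in WEIGHT_VALUES}
    return distribution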
Optimization for Tree Policy

In the second experiment we optimize the tree policy directly, as opposed to optimizing for gameplay strength. The experimental setup is similar to the gameplay strength optimization, but the performance of the policies is measured directly by using them as tree policies of MCTS. Again, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents. The arms pulled by the two bandits are used as tree policies by MCTS agents, which play against each other for 10 games to calculate the reward.

The result of the optimization is visualized in Figure 2-b. The results agree with the gameplay strength optimization in that NONE and MOVE are assigned a 0 weight with high probability, but disagree in that PRODUCE now has a low probability of being assigned weight 0. Also, HARVEST, RETURN, and ATTACK have more spread-out weights than under gameplay optimization. This shows that strong gameplay policies tend to be different from strong tree policies.

Optimization for Playout Policy

In the third experiment we optimize for the playout policy, with a similar experimental setup as in the tree policy optimization. We run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents, and the arms pulled are interpreted as playout policies and used by MCTS agents to play against each other for 10 games to calculate the reward.

The result of the optimization is visualized in Figure 2-c. We can observe that the weight distribution is very different from the distributions obtained when optimizing the tree policy or the gameplay policy. In this weight distribution, NONE is mostly assigned weight 1. MOVE and PRODUCE are mostly assigned weight 0. ATTACK is mostly assigned the highest weight of 5. HARVEST is spread between 1, 2, and 3. RETURN has most of the weights assigned to 1 and 5. Again, we see that strong playout policies are very different from strong gameplay policies.

Joint Optimization for Tree Policy and Playout Policy

To test whether the tree policy and the playout policy interact with each other, we further investigate by optimizing both at the same time. Similarly, we run 10000 iterations of the repeated game of bandits between two Naïve Sampling agents that choose values for 12 parameters rather than 6, and the arms pulled are interpreted as two policies, one as the tree policy and the other as the playout policy, used by MCTS agents to play against each other for 10 games to calculate the reward. Thus, in this experiment, arms pulled by the bandits have 12 parameters: the first six are interpreted as the tree policy and the other six as the playout policy.

The result of the optimization is visualized in Figure 2-d and Figure 2-e, showing that the jointly optimized policies are different from the policies obtained when optimizing them separately. Let us now compare how strong these policies are in actual gameplay.

Comparing Performance as Gameplay Policies vs. Playout Policies

So far we have policies optimized for different objectives:

• Optimizing "simulation balance".
• Optimizing the gameplay strength of the policy.
• Optimized as the tree policy of MCTS.
• Optimized as the playout policy of MCTS.
• Joint optimization of tree policy and playout policy.

Now, together with the two baselines, Random and RandomBiased, we compare these policies in two tasks: gameplay strength when used directly to play (without MCTS), and gameplay strength when used as playout policies within MCTS. We run 10 rounds of round-robin between all the policies. The winrates are reported in Figure 3 (we tested the tree policies separately, as reported below).

[Figure 3: Comparison of winrates of policies serving as the game-playing policy and as the playout policy of an MCTS agent.]

First, simulation balancing has a winrate of 0.16 as a game-playing policy and a winrate of 0.29 as the playout policy. It is outperformed by the two baselines as a game-playing policy (winrates of 0.20 and 0.37, respectively), but is better than the baselines as the playout policy of MCTS (winrates of 0.02 and 0.09, respectively). This is expected, as simulation balancing is designed to produce strong playout policies.

Second, the policy optimized for gameplay strength outperformed the baselines by a large margin and has the best gameplay winrate (0.58) and the third best winrate as a playout policy (0.52). For performance as a playout policy, it is only worse than the two policies optimized as playout policies directly. The policy optimized as the tree policy of MCTS also outperformed the baselines, but has worse winrates than the gameplay-optimized policy in both tracks (winrates of 0.53 and 0.47).

Now we look at the policy optimized as a playout policy directly. The result is interesting, since it is very weak in terms of gameplay (winrate of merely 0.08), but very strong as a playout policy (winrate of 0.62). This suggests that strong gameplay strength is not a requirement for being a good playout policy, and that simulation balancing does not capture all that is required for a strong playout policy.

Lastly, we have the pair of policies that were optimized together, one as the tree policy and the other as the playout policy. The tree policy achieved winrates similar to the singly optimized tree policy (0.55 and 0.45, respectively). The policy optimized as the playout policy is interesting in that not only is it good as a playout policy (winrate of 0.56), but it also has good gameplay strength (winrate of 0.52).
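A minimal round-robin harness of the kind we assume for these comparisons is sketched below; play_game is a placeholder that runs one µRTS game between two agents (raw policies, or MCTS agents using them as playout policies) and returns 1, 0.5, or 0 for a win, draw, or loss of the first agent.

from collections import defaultdict
from itertools import combinations

def round_robin(agents, play_game, rounds=10):
    """agents: dict mapping name -> agent. Returns each agent's average score."""
    score = defaultdict(float)
    games = defaultdict(int)
    for _ in range(rounds):
        for (name1, a1), (name2, a2) in combinations(agents.items(), 2):
            r = play_game(a1, a2)
            score[name1] += r
            score[name2] += 1.0 - r
            games[name1] += 1
            games[name2] += 1
    return {name: score[name] / games[name] for name in agents}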
Strength of Tree Policies

The results of the jointly optimized policies suggest there could be some factor of "match" between the tree policy and the playout policy for them to work well together. Thus, to further verify this hypothesis, we take the best pairs of singly optimized tree policies and playout policies to play against the pair of jointly optimized policies, in two MCTS agents.

We run the jointly optimized pair against the pair of best gameplay policy (as tree policy) and best singly optimized playout policy (as playout policy) for 1000 games, and the jointly optimized pair has a winrate of 0.64. We also run the jointly optimized pair against the pair of best tree policy and best singly optimized playout policy for 1000 games, and the jointly optimized pair has a winrate of 0.67.

Moreover, we run the gameplay-optimized policy against an optimized tree policy, each as the tree policy of an MCTS agent, for 1000 games, both with the optimized playout policy. We found the gameplay policy has a winrate of 0.44, which means that a gameplay-optimized policy does not necessarily make for a good tree policy.

Conclusions

In this paper, we have studied policy optimization in several settings. First, we tried simulation balancing for playout policy optimization. We found that although it is better than the baselines as a playout policy, its performance is not comparable to optimizing as a playout policy directly. We also tried optimizing the gameplay policy, the tree policy, and the playout policy on three maps at the same time. We observed that for some action types, like NONE and MOVE, the weight distributions are in consensus across all maps, but for others the weight distribution is spread over multiple values. This might be because certain weights are good for only some maps. Furthermore, we compared the performance as tree policy between the optimized gameplay policy and the optimized tree policy, and confirmed that optimizing the tree policy directly does help. Finally, we optimized the tree policy and the playout policy jointly. The resulting pair of policies outperforms the combination of the best tree policy and the best playout policy, which suggests that the "match" between the tree policy and the playout policy can also play an important role in the performance of MCTS.

For future work, we want to further investigate the simulation balancing algorithm, since there have been good advances in policy gradient algorithms that might help scale up SB. Also, the joint optimization of different components of the MCTS algorithm seemed to be beneficial. It will be interesting to take more factors, like evaluation function tuning and exploration parameters, into the optimization process, to see if we can push the progress further and gain insight into the interplay between the different pieces of MCTS.

References

Baier, H., and Drake, P. D. 2010. The power of forgetting: Improving the last-good-reply policy in Monte Carlo Go. IEEE Transactions on Computational Intelligence and AI in Games 2(4):303–309.

Browne, C. B.; Powley, E.; Whitehouse, D.; Lucas, S. M.; Cowling, P. I.; Rohlfshagen, P.; Tavener, S.; Perez, D.; Samothrakis, S.; and Colton, S. 2012. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1):1–43.

Cesa-Bianchi, N., and Lugosi, G. 2006. Prediction, Learning, and Games. Cambridge University Press.

Coulom, R. 2006. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, 72–83. Springer.

Coulom, R. 2007. Computing "Elo ratings" of move patterns in the game of Go. ICGA Journal 30(4):198–208.

Gelly, S., and Silver, D. 2007. Combining online and offline knowledge in UCT. In Proceedings of the 24th International Conference on Machine Learning, 273–280.

Graf, T., and Platzner, M. 2016. Monte-Carlo simulation balancing revisited. In 2016 IEEE Conference on Computational Intelligence and Games (CIG), 1–7. IEEE.

Huang, S.-C.; Coulom, R.; and Lin, S.-S. 2010. Monte-Carlo simulation balancing in practice. In International Conference on Computers and Games, 81–92. Springer.

Kocsis, L., and Szepesvári, C. 2006. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, 282–293. Springer.

Munos, S. G. W., and Teytaud, O. 2006. Modification of UCT with patterns in Monte-Carlo Go. Technical Report RR-6062, 32:30–56.
Ontañón, S.; Synnaeve, G.; Uriarte, A.; Richoux, F.; Churchill, D.; and Preuss, M. 2013. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games 5(4):293–311.

Ontañón, S. 2017. Combinatorial multi-armed bandits for real-time strategy games. Journal of Artificial Intelligence Research 58:665–702.

Shleyfman, A.; Komenda, A.; and Domshlak, C. 2014. On combinatorial actions and CMABs with linear side information. In ECAI, 825–830.

Silver, D., and Tesauro, G. 2009. Monte-Carlo simulation balancing. In Proceedings of the 26th Annual International Conference on Machine Learning, 945–952.

Silver, D.; Huang, A.; Maddison, C. J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M.; et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature 529(7587):484–489.

Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; Lillicrap, T. P.; Simonyan, K.; and Hassabis, D. 2017a. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815.

Silver, D.; Schrittwieser, J.; Simonyan, K.; Antonoglou, I.; Huang, A.; Guez, A.; Hubert, T.; Baker, L.; Lai, M.; Bolton, A.; et al. 2017b. Mastering the game of Go without human knowledge. Nature 550(7676):354–359.

Silver, D.; Sutton, R. S.; and Müller, M. 2012. Temporal-difference search in computer Go. Machine Learning 87(2):183–219.

Slivkins, A. 2019. Introduction to multi-armed bandits. arXiv preprint arXiv:1904.07272.

Yang, Z., and Ontañón, S. 2020. Are strong policies also good playout policies? Playout policy optimization for RTS games. In Sixteenth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE).