=Paper=
{{Paper
|id=Vol-2719/paper11
|storemode=property
|title=Parametric action pre-selection for MCTS in real-time strategy games
|pdfUrl=https://ceur-ws.org/Vol-2719/paper11.pdf
|volume=Vol-2719
|authors=Abdessamed Ouessai,Mohammed Salem,Antonio M. Mora
|dblpUrl=https://dblp.org/rec/conf/cosecivi/OuessaiSM20
}}
==Parametric action pre-selection for MCTS in real-time strategy games==
Abdessamed Ouessai¹, Mohammed Salem¹, and Antonio M. Mora²

¹ Dept. of Computer Science, University of Mascara, Mascara, Algeria (abdessamed.ouessai@univ-mascara.dz, salem@univ-mascara.dz)
² Dept. of Signal Theory, Telematics and Communications, ETSIIT-CITIC, University of Granada, Granada, Spain (amorag@ugr.es)

This work has been supported in part by projects B-TIC-402-UGR18 (FEDER and Junta de Andalucía), RTI2018-102002-A-I00 (Ministerio Español de Ciencia, Innovación y Universidades), TIN2017-85727-C4-1-2-P (Ministerio Español de Economía y Competitividad), and TEC2015-68752 (also funded by FEDER).

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract. The core challenge facing search techniques used to play Real-Time Strategy (RTS) games is the extensive combinatorial decision space. Several approaches have been proposed to alleviate this dimensionality burden, using scripts or action probability distributions based on expert knowledge. We propose to replace expert-authored scripts with a collection of smaller parametric scripts we call heuristics, and to use them to pre-select actions for Monte Carlo Tree Search (MCTS). The advantages of this proposal are granular control of the decision space and the ability to adapt the agent's strategy in-game, both achieved by altering the heuristics and their parameters. Experimental results in µRTS using a proposed implementation show a significant performance gain over state-of-the-art agents.

Keywords: Game AI · Real-Time Strategy · MCTS · µRTS

===1 Introduction===

Video games belonging to the Real-Time Strategy (RTS) sub-genre can be viewed as an evolution of classic board games. Games such as Chess, Checkers, and Go depict an abstract conflict situation between two parties. The capabilities of modern computing devices, usually paired with advanced game engines, allow RTS games to portray concrete conflict situations, approximately simulating the dynamics of real-world military disputes. Players in an RTS game simultaneously control multiple entities (units) in real time, within a large, dynamic, and uncertain environment.

Designing a successful artificial RTS player is a demanding task that requires overcoming many challenges. In particular, the large combinatorial decision and state spaces constitute a significant bottleneck for game-playing agents based on search or machine learning. Task decomposition is a common way to mitigate the domain's complexities, usually by defining multiple sub-tasks at each level of abstraction. Decision making in an RTS game flows through three levels of abstraction: strategic, tactical, and reactive control [14].

Monte Carlo Tree Search (MCTS) is a sampling-based search approach with a successful track record in many board games, especially those known for their high branching factor, such as Go. MCTS has also been adapted for use as a holistic agent in RTS games, but it still struggles to replicate the degree of success achieved in Go, due to the branching-factor bottleneck. Proposed improvements usually fall within two main scopes: (1) the use of expert knowledge to abstract the decision space and lower the branching factor, and (2) the use of MCTS solely as a component dedicated to smaller tactical and reactive situations, as part of a larger agent.
While there are efforts to combine both scopes [10], the reliance on a fixed set of expert-authored scripts remains a vector of exploitation. In this paper, we propose an action pre-selection algorithm that relies on the building blocks of expert scripts rather than on full scripts. An expert-authored script combines several smaller scripts, which we call heuristics, to form a scripted agent. A heuristic represents an isolated task performed by a unit or a group of units, such as harvesting or attacking. Our algorithm relies on parametric heuristics to generate a wide variety of scripts that can be used to pre-select actions for predefined groups of units and feed them into an MCTS agent. The heuristics' parameters can be modified in real time to adjust both the decision granularity and the adopted strategy, which opens the door to dynamic strategy adaptation through MCTS. Experimental results in the µRTS test-bed reveal a significant MCTS performance gain related to the reduced branching factor and the more focused decision space.

In the next section, we review some preliminaries about RTS games and MCTS. In Section 3 we explore relevant work related to our proposal, and in Section 4 we detail the theoretical and implementation aspects of our approach. Experimental results are presented and discussed in Section 5, followed by a conclusion in Section 6.

===2 Background===

====2.1 RTS Games====

An RTS game is a zero-sum, multiplayer, non-deterministic, imperfect-information game in which players compete for resources in order to defeat an opposing side using military force. To win, a player is expected to devise and execute a strategy that can overcome the opponent and ultimately eliminate all of their units. The player manages their units by issuing a unit-action to each of them; together, these unit-actions form a player-action at each decision point. The game's environment usually consists of a large map with topographic features, covered by a fog-of-war layer that reduces visibility.

RTS games gained significant popularity on the PC platform thanks in part to their PC-friendly user interfaces and control schemes, reminiscent of typical PC software. The most commercially successful RTS titles include StarCraft, Command & Conquer, and Age of Empires. Enabling AI research on commercial games started as an independent community effort that resulted in several unofficial APIs, such as BWAPI and Wargus. Much later, Blizzard provided an official StarCraft II AI API and toolset in collaboration with DeepMind (https://github.com/deepmind/pysc2). Independent RTS AI research platforms have also emerged, such as ORTS, µRTS, ELF, and DeepRTS.

We used µRTS [12] as our experimentation test-bed. Conceived by Santiago Ontañón, µRTS is the most lightweight and accessible RTS AI research platform [15]. It features a minimalistic RTS implementation focused on the most fundamental RTS challenges and includes an efficient forward model, necessary for implementing lookahead-based approaches.

Formally, an RTS game can be defined as a tuple G = (S, A, P, τ, L, W, s_init) [12], where S, A, and P represent the state space, the player-action space, and the set of players, respectively. The transition function τ : S × A × A → S takes a game state at time t and the player-action of each player (assuming a two-player setting) and returns a new game state at time t+1. The function L : S × A × P → {true, false} returns true when a player-action considered in a given game state by a given player is legal.
The function W : S → P ∪ {ongoing, draw} returns the winner of the game, if any, or whether the game is still ongoing or is a draw. s_init is the initial game state.

====2.2 MCTS====

MCTS [5] is a family of anytime sampling-based search algorithms (they yield a solution under any processing-time budget) applicable to Markov decision processes (MDPs) [17]. They work by incrementally constructing a game tree across multiple iterations. A single MCTS iteration consists of four basic phases. The first phase, selection, employs a tree policy to select which tree node to expand. Next, an expansion phase creates a new node and appends it to the selected node. A simulation (or playout) is then executed from the new node using a default policy, and lastly a backpropagation phase uses the simulation's outcome to update the reward estimates and visit counts of all nodes on the path back up to the root. The action leading to the most visited node is usually the one returned.

Upper Confidence Bounds for Trees (UCT) [7] is a popular MCTS algorithm that frames the selection phase as a Multi-Armed Bandit (MAB) [1] problem and uses the UCB1 formula to select nodes for expansion. UCT works well in high branching-factor domains such as Go, but suffers greatly when the decision space also has a combinatorial structure, as in RTS games. This drawback stems from UCB1's requirement to explore every possible move at least once before exploitation can begin. The short decision cycle and the huge average number of possible moves at a decision point in RTS games do not give UCT the chance to explore all moves.

NaïveMCTS [12] addressed UCT's shortcomings in combinatorial decision spaces by framing the selection phase as a Combinatorial MAB (CMAB) problem. This change exposes the reward contribution of each component (unit-action) of a decision (player-action) and makes it possible to employ a naïve sampling strategy. Naïve sampling assumes that the reward estimates of unit-actions can be summed to obtain the reward estimate of the player-action they form. The problem is thus subdivided into n local MABs (n being the number of units) and one global MAB. NaïveMCTS does not need to explore all moves before exploitation since it relies on an ε-greedy strategy instead of UCB1.

===3 State of the Art===

Managing the complex decision space of RTS games in the context of search is usually done by defining a proxy layer that guides the search towards high-value actions while ignoring the rest. This layer is built from expert knowledge and may take the form of expert-authored scripts or of action probability distributions learned from expert traces. State abstraction through unit clustering is another way to further reduce the branching factor, by considering clusters of units instead of individual units.

In RTS combat scenarios, Balla and Fern [2] defined a pair of abstract actions and a unit-grouping approach to facilitate search using UCT. Justesen et al. [6] used UCT in the space of actions proposed by scripts and further simplified search by clustering units with K-means. For full RTS scenarios, Barriga et al. [4] proposed to configure scripts with exposed choice points and to use UCT to plan in the space of choice points. In [10], Moraes and Lelis combined NaïveMCTS with the concept of asymmetric abstraction to search in a decision space where each group of units draws its actions from a distinct set of scripts, or from the low-level action set.
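Since NaïveMCTS, and several of the abstraction approaches surveyed in this section, build on naïve sampling, a minimal sketch of that selection step may help fix ideas. This is an illustrative simplification under assumed names and defaults (NaiveSampler, eps_global, eps_local), not the µRTS implementation; in particular, the exploitation branch below is plain greedy rather than ε-greedy over the global MAB.

```python
import random
from collections import defaultdict

class NaiveSampler:
    """Sketch of naive sampling for a combinatorial multi-armed bandit (CMAB).

    A player-action is a tuple of unit-actions, one per unit. Naive sampling
    assumes the player-action reward decomposes as a sum of per-unit rewards,
    so each unit keeps a local MAB while previously sampled tuples form a
    global MAB."""

    def __init__(self, legal_unit_actions, eps_global=0.25, eps_local=0.33):
        # legal_unit_actions: one list of hashable unit-actions per unit.
        self.legal = legal_unit_actions
        self.eps_global = eps_global   # chance of generating a brand-new combination
        self.eps_local = eps_local     # per-unit exploration when generating one
        self.local = [defaultdict(lambda: [0.0, 0]) for _ in legal_unit_actions]
        self.globl = defaultdict(lambda: [0.0, 0])   # player-action -> [reward sum, visits]

    @staticmethod
    def _best(stats, candidates):
        # Highest empirical mean reward; unvisited candidates count as 0.
        return max(candidates, key=lambda a: stats[a][0] / stats[a][1] if stats[a][1] else 0.0)

    def sample(self):
        if not self.globl or random.random() < self.eps_global:
            # Explore: assemble a new player-action from the local MABs (eps-greedy per unit).
            return tuple(
                random.choice(acts) if random.random() < self.eps_local
                else self._best(self.local[i], acts)
                for i, acts in enumerate(self.legal))
        # Exploit: replay the best combination observed so far (global MAB).
        return self._best(self.globl, list(self.globl))

    def update(self, player_action, reward):
        # Credit the reward to both the global arm and each unit's local arms.
        for i, a in enumerate(player_action):
            s = self.local[i][a]; s[0] += reward; s[1] += 1
        g = self.globl[player_action]; g[0] += reward; g[1] += 1
```

NaïveMCTS applies such a sampler during the selection phase at each tree node, in place of UCB1, so that exploitation can start before every player-action has been tried.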
Scripts were also used to bias the selection phase of NaïveMCTS in [19]. Inspired by AlphaGo, Ontañón [13] trained a Bayesian model on expert traces to guide NaïveMCTS selection. Notable non-MCTS abstraction-based search algorithms include Stratified Strategy Selection (SSS) [8], which uses a type system for unit clustering and hill climbing to search for an optimal script for each cluster.

The main drawback of action-abstraction approaches is their reliance on a fixed set of scripts, which may produce a predictable agent. In [18], the authors propose to generate a larger set of scripts from a small initial set through a voting scheme. Mariño et al. [9] sought an optimal action abstraction by evolving a large pool of scripts generated by varying the parameters of predefined rule-based scripts.

Planning exclusively in a script-induced decision space can compromise low-level tactical performance. For this reason, several works paired action abstraction with low-level search. StrategyTactics [3] uses a pre-trained model based on Puppet Search for strategic planning and NaïveMCTS for tactics. Similarly, Neufeld et al. [11] used NaïveMCTS for tactical planning in their HTN (Hierarchical Task Network)-MCTS hybrid agent. In [18], the authors used ABCD (AlphaBeta Considering Duration) for tactical decisions, in a limited game state.

We propose a parametric action pre-selection scheme that, in contrast with previous approaches, relies on heuristics instead of full scripts. Our approach works by altering the decision space of NaïveMCTS according to a set of provided parameters that govern the possible strategic and tactical decisions. Moreover, our approach supports assigning heuristics to individual units or to groups of units. The advantage of this approach is its ability to express a wide spectrum of strategies by adjusting the heuristics' parameters pre-game or in-game.

===4 Parametric Action Pre-Selection===

An expert-authored strategy or script in an RTS game can be broken down into a set of heuristics, each controlling a group of units. Changing or replacing one heuristic can affect the goal and the performance of the strategy to varying degrees. If we assume that all possible RTS heuristics are known, it becomes possible to generate all RTS strategies by combining heuristics. An agent can adapt its strategy to its opponent or environment by replacing problematic heuristics. To preserve tactical performance, a heuristic can delegate all tactical choices to a search approach.

We suggest that one way to capture a broad spectrum of possible heuristics is to define parametric heuristics. Each parameter may govern one aspect of the heuristic, so that every combination of parameter values yields a different concrete heuristic. The combination of heuristics and their parameters therefore defines a strategy. A heuristic can be expert-authored or automatically learned.

Formally, we define a heuristic h ∈ H, where H is the set of heuristics, as a function h : S × U × A^l × R^h → A^k that takes as input a game state s ∈ S, a unit u ∈ U, all legal unit-actions (a_1, ..., a_l) ∈ A^l possible for u in s, and a parameter vector p ∈ R^h, and produces a unit-action tuple α = (a_1, ..., a_k) ∈ A^k, where A^k ⊆ A^l and k ≤ l. The sets U, R^h, and A represent the set of units, the parameter vectors of heuristic h, and the legal unit-actions, respectively.
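To make the definition concrete, here is a minimal sketch of how a parametric heuristic could look in code. The type aliases and the rules inside harvest_heuristic are illustrative assumptions (µRTS unit-actions are richer objects); the only point is the h : S × U × A^l × R^h → A^k shape: the heuristic receives one unit's legal actions plus a parameter vector and returns a reduced list of at most k options.

```python
from typing import Callable, List, Sequence

# Illustrative placeholders; the real µRTS types are richer Java objects.
GameState = dict
Unit = str                # a unit identifier
UnitAction = str          # e.g. "harvest", "return", "move_north", "build_barracks"

# h : S x U x A^l x R^h -> A^k
Heuristic = Callable[[GameState, Unit, Sequence[UnitAction], Sequence[float]], List[UnitAction]]

def harvest_heuristic(state: GameState, unit: Unit,
                      legal_actions: Sequence[UnitAction],
                      params: Sequence[float]) -> List[UnitAction]:
    """Hypothetical harvest heuristic for a Worker: keep only the unit-actions
    that shuttle the unit between a resource deposit and the Base, plus
    Barracks-build options. params[0] caps the number of options kept (k)."""
    max_options = int(params[0]) if params else 1
    keep_prefixes = ("harvest", "return", "move", "build_barracks")
    wanted = [a for a in legal_actions if a.startswith(keep_prefixes)]
    # If the rules filter everything out, fall back to the untouched legal set.
    return (wanted or list(legal_actions))[:max_options]
```

A deterministic heuristic would return exactly one action (k = 1); returning several leaves the final choice to the search algorithm, as discussed next.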
In hard-coded scripts, parameter vectors are usually constant, but in our proposed approach the parameter vector of a heuristic can vary. A heuristic h is fully deterministic if k = 1 and no random aspect intervenes in its execution. If k > 1, a search algorithm can select an optimal action from the heuristic's output α. A heuristic may employ any suitable algorithm, including pathfinding, to narrow down the number of actions in α and reduce the overall branching factor. For instance, a common heuristic in RTS games is the harvest heuristic, which, when applied to a Worker unit, discards all unit-actions except those that guide the Worker back and forth between a resource deposit (harvest) and a Base (return). harvest may expose parameters such as the maximum amount of resources to harvest or the pathfinding algorithm to use.

A heuristic h can be associated with a group of units g ∈ P(U), in which case it is applied to each unit in g under parameter vector p; we denote this h[g, p]. We define D as the set of all possible unit partitionings, where a partitioning d ∈ D is defined as d = {g_1, ..., g_m | g_i ∈ P(U)}.

An action pre-selection process T(s, U, A_0, x_1, ..., x_n) is an n-phase algorithm where each phase x_i(A_{i-1}, d_i, H_i, θ_i) consists of a unit partitioning scheme d_i ∈ D that generates a set of m_i unit groups g_j, to each of which a heuristic h_j[g_j, p_j] ∈ H_i (H_i ⊂ H) is applied, for all j ∈ {1, ..., m_i}, under the parameter vector p_j ∈ θ_i. Each pre-selection phase operates on the unit-action output of the previous phase, A_{i-1}, and the initial phase operates on A_0, the set of all legal unit-actions of the units in U. The output of phase x_i is the set of unit-actions A_i calculated by the heuristics of H_i for each unit in U. The final output of T is the unit-action set A_n. Figure 1 illustrates the pre-selection process.

[Fig. 1. The action pre-selection process: starting from the game state s, the units U, and A_0, each phase x_i applies its partitioning d_i and heuristics H_i under parameters θ_i to refine A_{i-1} into A_i, before search and execution.]

Intuitively, an action pre-selection process is a successive refinement strategy that manipulates the set of unit-actions available to each unit in a game state according to a global strategy expressed by the sets of heuristics, partitionings, and parameters defined for each pre-selection phase. The resulting set of unit-actions represents either a decision to execute or the options admissible under the global strategy. In the latter case, a search algorithm such as MCTS can be employed to find an optimal player-action in accordance with the global strategy, in a much smaller and more focused decision space. A global strategy is represented by σ_n = (d_1, ..., d_n, H_1, ..., H_n, θ_1, ..., θ_n) in an action pre-selection process T.

As an example, it is possible to define a hierarchical two-phase pre-selection process in which, in the first phase, the units are split into two large groups d_1 = {defense, offense} and assigned to two heuristics H_1 = {defend, attack} under parameters θ_1. In the second phase, the units could be split into more specialized groups, such as d_2 = {baseDef, barracksDef, offense}. The first phase ensures a common behavior for defense units, and the second phase builds on it to create specialized defense units. We describe a proposed action pre-selection implementation for RTS games next.
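The following sketch shows one possible way to realize the n-phase process T described above, reusing the Heuristic shape from the previous sketch. Function and variable names are assumptions for illustration; the actual ParaMCTS code in the UMSBot repository is organized differently.

```python
def pre_select(state, legal_actions, phases):
    """Sketch of the action pre-selection process T.

    legal_actions: dict mapping each unit to its legal unit-actions (A_0).
    phases: one entry per phase x_i, each a list of (group, heuristic, params)
            triples encoding the partitioning d_i, the heuristics H_i, and the
            parameter vectors theta_i.
    Returns A_n, the refined unit-action sets handed to the search algorithm."""
    actions = dict(legal_actions)                    # A_0
    for phase in phases:                             # x_1 ... x_n
        refined = dict(actions)
        for group, heuristic, params in phase:       # h_j[g_j, p_j]
            for unit in group:
                # Each heuristic only sees the options left by the previous phase.
                refined[unit] = heuristic(state, unit, actions[unit], params)
        actions = refined                            # A_i feeds phase i+1
    return actions
```

A search algorithm such as NaïveMCTS then operates on the returned unit-action sets instead of the full legal sets, which is where the branching-factor reduction comes from.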
====4.1 Implementation====

We propose an action pre-selection implementation for RTS games that delegates tactical decision making to NaïveMCTS and allows strategic decisions to be expressed through parametric heuristics. Our implementation, ParaMCTS, relies on a 2-phase action pre-selection process that partitions units into four functional groups in the first phase and into two situational groups in the second phase. The intuition behind both group types comes from the way human RTS players assign different functions to different unit groups while macro-managing, and from how they switch to micro-management for units in conflict situations. We define d_1 and d_2, in the context of µRTS, as follows (a simplified sketch of both partitionings and their heuristic assignments is given after the heuristics list below):

d_1: Functional groups
– Harvesters: Worker units dedicated to gathering resources and building structures.
– Offense: mobile units tasked with assaulting opponent units.
– Defense: mobile units assigned to defend the Base's perimeter.
– Structures: Barracks and Base, responsible for producing mobile units.

d_2: Situational groups
– Front-Line: mobile units in close contact with opponent units.
– Back: all units not in the Front-Line group.

d_1 assigns each unit to its relevant group based on a predefined unit-composition that declares the maximum count of each unit type in each group. For instance, it is possible to specify the maximum number of Workers in the Harvesters, Offense, or Defense group. As for d_2, it populates the Front-Line group by selecting units within the fire range of opponent units, or units targeting an opponent unit. The unit-composition and the front-line selection method are both partitioning parameters to be provided.

The heuristics in H_1 and H_2 are rule-based and are described as follows:

H_1:
– Harvest: applies to the Harvesters group. Automates the resource-harvesting process and provides Barracks-building options whenever possible. Parameters include the building-location selection mode (isolated, random, etc.) and the number of build options.
– Attack: applies to the Offense group. Finds and tracks opponent units for suppression. Parameters include the tracking mode (closest, minHP, etc.), the maximum number of units to track, and the number of escape routes to consider in a close encounter.
– Defend: applies to the Defense group. Remains within a defense perimeter around the Base and attacks incoming opponent units. Parameters include the geometry and size of the defense perimeter, the defense mode, and the maximum number of units to attack.
– Train: applies to the Structures group. Trains units following the unit-composition provided to d_1. Parameters include the unit-composition, the training mode (isolated side, random side, etc.), and the number of training options.

Note how in each heuristic a numeric parameter decides the number of options for certain actions (build locations, targets, ...). This type of parameter dictates how many choices (k) NaïveMCTS can operate on for each unit of the same group, which directly impacts the branching factor. Harvest, Attack, and Defend use a pathfinding algorithm to direct units towards their goals.

H_2:
– Front-Line Tactics: applies to the Front-Line group. Reduces the Wait unit-action duration to increase the units' reactivity while in combat.
– Back Tactics: applies to the Back group. Keeps the default Wait unit-action duration.
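As noted above, here is a simplified sketch of how the two partitionings and their heuristic assignments could be wired into the two phases consumed by pre_select, reusing harvest_heuristic and the (group, heuristic, params) phase format from the earlier sketches. The composition limits, the fire-range test, and the passthrough stand-in for the remaining heuristics are placeholders for the rule-based implementations described in the lists above, not the actual ParaMCTS code.

```python
def passthrough(state, unit, legal_actions, params):
    """Stand-in with the Heuristic signature; the real Attack, Defend, Train,
    and tactics heuristics apply the rules described above instead."""
    return list(legal_actions)[: int(params[0]) if params else len(legal_actions)]

def build_phases(state, units, cfg):
    """Hypothetical assembly of ParaMCTS's two pre-selection phases.

    units: list of unit ids; cfg: assumed configuration dict holding the
    unit-composition limits, a unit-type lookup, a fire-range predicate,
    and one parameter vector per heuristic."""
    # d1: functional groups, filled up to the unit-composition limits.
    workers    = [u for u in units if cfg["type"][u] == "Worker"]
    mobiles    = [u for u in units if cfg["type"][u] not in ("Base", "Barracks")]
    harvesters = workers[: cfg["max_harvesters"]]
    defenders  = [u for u in mobiles if u not in harvesters][: cfg["max_defenders"]]
    offense    = [u for u in mobiles if u not in harvesters and u not in defenders]
    structures = [u for u in units if cfg["type"][u] in ("Base", "Barracks")]
    phase1 = [(harvesters, harvest_heuristic, cfg["harvest_params"]),
              (offense,    passthrough,       cfg["attack_params"]),
              (defenders,  passthrough,       cfg["defend_params"]),
              (structures, passthrough,       cfg["train_params"])]
    # d2: situational groups, cutting across the functional ones.
    front_line = [u for u in units if cfg["in_fire_range"](state, u)]
    back       = [u for u in units if u not in front_line]
    phase2 = [(front_line, passthrough, cfg["front_params"]),
              (back,       passthrough, cfg["back_params"])]
    return [phase1, phase2]
```

The numeric entries in each parameter vector (number of build, target, or training options) are what bound k for each group, and hence the branching factor seen by NaïveMCTS.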
Note that Front-Line units do not depend on the d_1 partitioning, meaning that any unit in a d_1 group can also be considered a Front-Line unit. In total, 47 parameters were defined across all heuristics and partitionings; the full parameter list can be consulted in this approach's source-code repository (https://github.com/Acemad/UMSBot).

Switching heuristics on the fly is a way to adapt the agent's strategy to changes in the overall situation. We propose to switch the heuristic of d_1's Defense group from Defend to Attack according to a conditional trigger. The switch fires whenever the score of the player's army composition exceeds the opponent's by a predefined margin we call the overpower factor. This score simply counts all mobile units, attributing a greater weight to assault units.

Lastly, ParaMCTS uses a NaïveMCTS enhancement previously conceived by the authors (to appear in [16]). This enhancement hard-prunes a portion of the player-actions that include a Wait unit-action, in order to decrease the branching factor.

===5 Experiments & Results===

Using ParaMCTS, we conducted a series of experiments to gauge the benefits of action pre-selection on the performance of NaïveMCTS. Since the main effect of action pre-selection is a significantly reduced branching factor, we want to determine which MCTS parameter can best exploit the downsized decision space. We focus on the two prominent search parameters, maximum depth and playout duration, and experiment with the following sets of values: depthVals = {10, 15, 20, 30, 50} and durationVals = {100, 150, 200, 300, 500}.

We define ParaMCTS(depth, duration) as the ParaMCTS variant using depth as the maximum search depth and duration as the playout duration. All experiments were run on two PCs with relatively similar processors, using the latest version of µRTS as of 10 July 2020. Each agent was given a 100 ms computation budget, as in the µRTS competition setting, and each experiment was repeated on three maps representing increasingly larger branching factors, namely basesWorkers 8×8, 16×16, and 32×32.

[Fig. 2. Results obtained by each ParaMCTS variant in the two tournaments of Experiment 1: (1) search-depth experiment, (2) playout-duration experiment. The score is the win rate of each variant against the other variants, per map size (8×8, 16×16, 32×32).]

Table 1. Experiment 2 results. Each cell is the score obtained by ParaMCTS(depth, duration) against MixedBot; columns are playout durations in game cycles.

8×8 map
Max Depth \ Duration   100    150    200    300    500  | Avg
10                    74.5   69.5   59     46     19    | 53.6
15                    82     65     56     46     28.5  | 55.5
20                    86     74     51     40     23    | 54.8
30                    81     68     62     40     18    | 53.8
50                    83.5   68     69     46     21    | 57.5
Avg                   81.4   68.9   59.4   43.6   21.9  |

16×16 map
Max Depth \ Duration   100    150    200    300    500  | Avg
10                     3     74     63     20     20    | 36
15                     1     80     71     10     28    | 38
20                    88     77     59     15     25    | 52.8
30                    71     73     66      9     21    | 48
50                    68     74     62      4     29    | 47.4
Avg                   46.2   75.6   64.2   11.6   24.6  |

32×32 map
Max Depth \ Duration   100    150    200    300    500  | Avg
10                    68.5   54.5   35.5   33     20.5  | 42.4
15                    72.5   44     42     34     25.5  | 43.6
20                    65.5   50.5   46     39.5   17.5  | 43.8
30                    74     46     43     38.5   25    | 45.3
50                    76.5   41     46.5   45.5   24    | 46.7
Avg                   71.4   47.2   42.6   38.1   22.5  |
To guarantee a fair comparison, ParaMCTS was parameterized to adopt strategies similar to its opponents', even when there was an opportunity for exploitation. In all experiments, we calculate the score of an agent as score = wins + draws/2, normalized between 0 and 100.

====5.1 Experiments 1 & 2: Search Depth and Playout Duration====

To study the effect of various search depths and playout durations on ParaMCTS, we ran 120 iterations of a round-robin tournament between the ParaMCTS(10, duration) variants for each duration in durationVals, and another 120 round-robin iterations between the ParaMCTS(depth, 100) variants for each depth in depthVals. The fixed values 10 and 100 for depth and duration, respectively, are the default NaïveMCTS values. The results of this experiment are shown in Figure 2.

In the second experiment, we took all possible depth and duration combinations from depthVals × durationVals and ran 100 matches (switching sides after 50 matches) between each resulting ParaMCTS(depth, duration) and MixedBot, a state-of-the-art agent combining various techniques [9][8][10][3]. The results of this experiment on the three maps are presented in Table 1.

From the results of both experiments, we can see that overall ParaMCTS performance is particularly sensitive to the playout duration. In Figure 2-(2), performance varies significantly across playout durations. On the smallest map, short playouts work best, while on larger maps slightly longer playouts work well up to a certain threshold. As for search depth, it is clear that on the largest map a deeper search yields the most benefit, as seen in Figure 2-(1). On the small and medium maps, deeper search holds fewer benefits.

Against MixedBot (Table 1), the best results are obtained when the playout duration equals 100 cycles, on all three map sizes, even though 150 cycles seems promising for the 16×16 map. Regarding search depth, going down 20 levels in the tree is optimal for the 8×8 and 16×16 maps, while on the 32×32 map searching as deep as 50 levels yields the best performance. Overall, deeper search yields a greater performance increase than longer playouts. We believe this is because deeper search produces more accurate player-action reward estimates, thanks to the increased number of visited nodes and playouts deeper in the game tree. Longer playouts, on the other hand, result in fewer visited nodes and playouts, which can negatively impact performance. Larger maps benefit the most from deeper search because the map's dimensions contribute to the sparsity of rewards, and a deeper search can reach rewarding states more frequently.
====5.2 Experiment 3: Comparison Against the State of the Art====

To compare the overall performance of ParaMCTS against the current best-performing agents, we took three top-ranking agents from the 2019 µRTS competition, MixedBot, Izanagi [10][9], and Droplet [19], and performed a 100-iteration round-robin tournament between them, ParaMCTS (with the optimal depth and duration values found in the previous experiments for each map), NaïveMCTS (as a baseline), and NMCTS*, a NaïveMCTS variant using the same search depth and playout duration as ParaMCTS. The results of the tournament are presented in Table 2 and Figure 3.

Table 2. Overall results of the Experiment 3 tournament, on all maps (row vs. column).

            ParaMCTS  MixedBot  Izanagi  Droplet  NMCTS*  NMCTS  | Average
ParaMCTS        -       84.8      50.0     72.0    96.0    94.7  |  79.5
MixedBot      22.8        -       32.8     53.5    96.8    96.0  |  60.4
Izanagi       50.7      64.8        -      42.7    89.5    90.2  |  67.6
Droplet       30.2      45.0      53.3       -     89.7    90.2  |  61.7
NMCTS*         7.3       2.2       9.2      7.2      -     50.2  |  15.2
NMCTS          6.8       2.5       9.7      8.7    47.7      -   |  15.1

[Fig. 3. The scores obtained by each agent on each map (8×8, 16×16, 32×32) and overall, in the Experiment 3 tournament.]

In terms of overall performance, ParaMCTS outperformed all state-of-the-art agents by a comfortable margin: it achieved an 11.9-point margin over the second-best agent, Izanagi, and a 19.1-point margin over the fourth-best, MixedBot, both of which combine several advanced techniques. This result is direct evidence of the potency of our action pre-selection approach when coupled with NaïveMCTS. On individual maps, ParaMCTS outperformed the other agents on the 8×8 and 32×32 maps, but it was outdone on the 16×16 map by Droplet. This can be explained by Droplet's adoption of an opportunistic strategy on that map. Although ParaMCTS can easily be configured to adopt such strategies, we chose not to do so to keep the comparison as fair as possible. NMCTS* did not offer any tangible performance gain over NaïveMCTS, which indicates that increasing the search depth does not yield any performance gain unless it is paired with a significant decision-space reduction.

===6 Conclusions & Future Work===

We have presented an integrated action and state abstraction process that partitions units into multiple groups and assigns heuristics to each group in order to obtain a downsized set of pre-selected unit-actions. The proposed process receives a collection of parameters that define the partitionings and heuristics, and these parameters can be altered in-game to adapt the agent's strategy in a granular fashion. We have proposed a theoretical definition and demonstrated how it can be implemented as a full RTS agent in µRTS. Experimental results show a significant improvement over state-of-the-art µRTS agents. ParaMCTS will take part in the 4th µRTS competition (to be organized as part of IEEE CoG 2020) under the alias UMSBot.

The proposed action pre-selection implementation, ParaMCTS, is a single possibility among many, and using action pre-selection as a basis for more sophisticated agents is conceivable. Although proposed as a way to lower the dimensionality of the RTS decision space, we believe this technique could easily be adapted to any multi-unit real-time game. Moreover, we believe it could also provide difficulty adjustment through parameter optimization. As future work, we are interested in auto-adapting the heuristics and their parameters on the fly, using opponent modeling, in different environments. We are also looking into applying evolutionary algorithms to evolve and find optimal strategies. Automatically learning new heuristics and unit partitionings in the context of action pre-selection is another interesting direction.

===References===

1. Auer, P., Cesa-Bianchi, N., Fischer, P.: Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning 47(2/3), 235–256 (2002)
2. Balla, R.K., Fern, A.: UCT for Tactical Assault Planning in Real-Time Strategy Games. In: Proceedings of IJCAI'09, pp. 40–45 (2009)
3. Barriga, N.A., Stanescu, M., Besoain, F., Buro, M.: Improving RTS Game AI by Supervised Policy Learning, Tactical Search, and Deep Reinforcement Learning. IEEE Computational Intelligence Magazine 14(3), 8–18 (2019)
4. Barriga, N.A., Stanescu, M., Buro, M.: Puppet Search: Enhancing Scripted Behavior by Look-Ahead Search with Applications to Real-Time Strategy Games. In: AIIDE'15, pp. 9–15. Santa Cruz, California (2015)
5. Browne, C.B., Powley, E., Whitehouse, D., Lucas, S.M., Cowling, P.I., Rohlfshagen, P., Tavener, S., Perez, D., Samothrakis, S., Colton, S.: A Survey of Monte Carlo Tree Search Methods. IEEE Transactions on Computational Intelligence and AI in Games 4(1), 1–49 (2012)
6. Justesen, N., Tillman, B., Togelius, J., Risi, S.: Script- and Cluster-Based UCT for StarCraft. In: 2014 IEEE CIG. Dortmund, Germany (2014)
7. Kocsis, L., Szepesvári, C.: Bandit Based Monte-Carlo Planning. In: Machine Learning: ECML 2006, pp. 282–293. Lecture Notes in Computer Science, Springer Berlin Heidelberg (2006)
8. Lelis, L.H.S.: Stratified Strategy Selection for Unit Control in Real-Time Strategy Games. In: Proceedings of IJCAI'17, pp. 3735–3741. Melbourne, Australia (2017)
9. Mariño, J.R.H., Moraes, R.O., Toledo, C., Lelis, L.H.S.: Evolving Action Abstractions for Real-Time Planning in Extensive-Form Games. In: Proceedings of the AAAI Conference on Artificial Intelligence (2018)
10. Moraes, R.O., Mariño, J.R.H., Lelis, L.H.S., Nascimento, M.A.: Action Abstractions for Combinatorial Multi-Armed Bandit Tree Search. In: Proceedings of the 14th AIIDE. AAAI Publications (2018)
11. Neufeld, X., Mostaghim, S., Perez-Liebana, D.: A Hybrid Planning and Execution Approach Through HTN and MCTS. In: The 3rd Workshop on Integrated Planning, Acting, and Execution, ICAPS'19, pp. 37–45 (2019)
12. Ontañón, S.: The Combinatorial Multi-Armed Bandit Problem and Its Application to Real-Time Strategy Games. In: Proceedings of the 9th AIIDE, pp. 58–64. AAAI Publications (2013)
13. Ontañón, S.: Informed Monte Carlo Tree Search for Real-Time Strategy Games. In: 2016 IEEE CIG. Santorini, Greece (2016)
14. Ontañón, S., Synnaeve, G., Uriarte, A., Richoux, F., Churchill, D., Preuss, M.: A Survey of Real-Time Strategy Game AI Research and Competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games 5(4), 293–311 (2013)
15. Ouessai, A., Salem, M., Mora, A.M.: Online Adversarial Planning in µRTS: A Survey. In: 2019 International Conference on Theoretical and Applicative Aspects of Computer Science (ICTAACS). Skikda, Algeria (2019)
16. Ouessai, A., Salem, M., Mora, A.M.: Improving the Performance of MCTS-Based µRTS Agents Through Move Pruning. In: IEEE CoG. Osaka, Japan (2020)
17. Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, Upper Saddle River, New Jersey (2010)
18. Silva, C.R., Moraes, R.O., Lelis, L.H.S., Gal, K.: Strategy Generation for Multi-Unit Real-Time Games via Voting. IEEE Transactions on Games (2018)
19. Yang, Z., Ontañón, S.: Guiding Monte Carlo Tree Search by Scripts in Real-Time Strategy Games. In: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, vol. 15, pp. 100–106 (2019)