Monte Carlo Tree Search Planning for continuous
action and state spaces
Federico Bianchi1 , Lorenzo Bonanni1 , Alberto Castellini1 and Alessandro Farinelli1
1 University of Verona, Department of Computer Science, Strada Le Grazie 15, 37134, Verona, Italy


Abstract
Sequential decision-making in real-world environments is an important problem in artificial intelligence and robotics. In the last decade reinforcement learning has provided effective solutions in small and simulated environments, but it has also shown limits in large, real-world domains characterized by continuous state and action spaces. In this work, we aim to evaluate some state-of-the-art algorithms based on Monte Carlo Tree Search planning in continuous state/action spaces and to propose a first version of a new algorithm based on action widening. The algorithms are evaluated on a synthetic domain in which the agent must drive a car through a narrow curve, reaching the goal in the shortest possible time while avoiding going off the road. We show that the proposed method outperforms the state-of-the-art techniques.

Keywords
Reinforcement Learning, Monte Carlo Tree Search, Planning in continuous space, Progressive widening




1. Introduction
Planning for sequential decision-making in real-world environments is an important problem
in artificial intelligence and robotics. The interest in this topic has rapidly grown and recently
Reinforcement Learning (RL) [1] methods have been applied to it. Several of these approaches,
however, have proven effective only in small or simulated environments and are unable to scale to
real-world problems. A feature that characterizes several real-world planning problems is the
use of continuous actions and states. Some examples are autonomous driving [2, 3], search
and rescue [4] or dynamic resource allocation [5]. In these domains finding an optimal policy
is often infeasible due to the large or multi-dimensional action space, in which the number
of possible actions is infinite, making the exploration of all available actions impossible.
In particular, large high-dimensional spaces often make space discretization infeasible due
to computational constraints. Even when such discretization is computationally feasible, the
exploratory ability is limited and action accuracy is low.
   Monte Carlo Tree Search (MCTS) [6, 7] based planning is an online method widely used
for planning in Markov Decision Processes (MDPs) on large domains. The methodology has
achieved very good results on discrete state/action domains [8] and partially observable domains
[9]. It was also extended to consider prior knowledge about the environment [10, 11] and to

The 9th Italian Workshop on Artificial Intelligence and Robotics (AIRO 2022)
federico.bianchi@univr.it (F. Bianchi); lorenzo.bonanni@studenti.univr.it (L. Bonanni);
alberto.castellini@univr.it (A. Castellini); alessandro.farinelli@univr.it (A. Farinelli)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
improve the trustworthiness of the policy [12, 13, 14], and it was applied to real-world domains
[15, 16]. However, only a few works focus on the extension of MCTS to continuous state and
action spaces. Some techniques for MDPs were proposed in 2008. In particular, a method called
MCTS Action Progressive Widening (MCTS-APW) that controls the widening of actions for
continuous action spaces was introduced in [17] and theoretically analyzed in [18]. In 2013 a
methodology based on [17] was proposed that extends the widening to both action and state
nodes: MCTS-Double Progressive Widening (MCTS-DPW) controls the widening of the search
tree for both states and actions, balancing the exploration of new states and actions against
the exploitation of visited nodes [19]. In 2018 the idea of [19]
was extended for Partially Observable Markov Decision Processes (POMDPs). The extension
introduces a method called Partially-Observable Monte Carlo Planning Observation Widening
(POMCPOW) that controls the widening of actions, states, and observations [20]. Finally, in
2019 a technique based on Voronoi diagrams [21] was proposed and subsequently extended
to progressive widening for POMDPs in 2021 [22]. This extension uses Voronoi diagrams to
partition the action space into cells whose centroids represent the actions with the highest
state-action values; a global search first identifies the most promising cell, and the best action
is then refined by sampling new actions around its centroid.
   In this work, we aim to i) evaluate the most prominent state-of-the-art methodologies for
MCTS-based planning in continuous state and action spaces, ii) propose a new method that
outperforms state-of-the-art techniques. In particular, we present the results achieved so far
comparing standard MCTS planning [7] with discretized actions, MCTS-APW [17, 19] and
an extended version of MCTS-APW, which we call MCTS-APW2, that performs widening by
sampling new actions in a more efficient way. The methods are compared on a synthetic domain
in which the agent must drive a car through a narrow curve, reaching the goal in the
shortest possible time while avoiding going off the road. We show that the proposed
methodology outperforms the other approaches.


2. Background
2.1. Markov Decision Processes
Markov Decision Processes (MDPs) [23, 24] can formalize a broad range of sequential decision-
making problems. Formally, an MDP models problems in fully observable environments using a
5-tuple 𝑀 = βŸ¨π‘†, 𝐴, 𝑇, 𝑅, π›ΎβŸ©, where 𝑆 is a set of states, 𝐴 is a set of actions, 𝑇 : 𝑆 Γ— 𝐴 β†’ 𝒫(𝑆)
is a stochastic transition function, in which 𝑇 (𝑠, π‘Ž, 𝑠′ ) indicates the probability of landing on
state 𝑠′ ∈ 𝑆 after executing action π‘Ž ∈ 𝐴 in state 𝑠 ∈ 𝑆, 𝑅 : 𝑆 Γ— 𝐴 β†’ R is a reward function
with 𝑅(𝑠, π‘Ž) ∈ [βˆ’π‘…π‘šπ‘Žπ‘₯ , π‘…π‘šπ‘Žπ‘₯ ], and 𝛾 ∈ [0, 1) is a discount factor. When 𝑇 cannot be implicitly
defined, a generative model 𝐺(𝑠, π‘Ž) can be used to stochastically generate a new state 𝑠′ and
reward π‘Ÿ for MDP. The set of stochastic policies for 𝑀 is Ξ  = {πœ‹ : 𝑆 β†’ 𝒫(𝐴)} and the goal
                      βˆ‘οΈ€ πœ‹(𝑠,
is to find a control policy       π‘Ž) = 𝑃 (π‘Ž|𝑠) that maximizes the expected discounted cumulative
reward defined as E[ ∞   𝑑=0 𝛾 𝑑 𝑅(𝑠 , π‘Ž )].
                                    𝑑 𝑑
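
To make the generative-model interface concrete, below is a minimal Python sketch of how an MDP can be exposed to a planner through 𝐺(𝑠, π‘Ž) when 𝑇 is not available in closed form; the class and method names are our own assumptions, not part of the formal definition above.

```python
from typing import Tuple, Any


class GenerativeMDP:
    """Minimal MDP interface for simulation-based planners (hypothetical names)."""

    def __init__(self, gamma: float = 0.99):
        self.gamma = gamma  # discount factor in [0, 1)

    def generative_model(self, state: Any, action: Any) -> Tuple[Any, float]:
        """G(s, a): sample a successor state s' and a reward r = R(s, a)."""
        raise NotImplementedError

    def is_terminal(self, state: Any) -> bool:
        raise NotImplementedError


def discounted_return(rewards, gamma: float) -> float:
    """Objective for one sampled episode: sum_t gamma^t * R(s_t, a_t)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```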
2.2. Monte Carlo Tree Search Planning for MDPs with discrete states/actions
MCTS is a simulation-based search algorithm for online decision-making [25, 7] that works by
incrementally creating a policy tree until some predefined computational budget is reached. The
building of the policy tree consists of alternating layers of state nodes, generated by a generative
model 𝐺, and action nodes, whose values are estimated by the state-action value function 𝑄(𝑠, π‘Ž). Each state node in
the search tree represents a state of the domain, and direct links to child action nodes represent
actions leading to the next states.
   A search iteration starts from the root node and four steps are applied: i) Selection: An
action node is recursively selected to descend through the tree until a node is expandable, i.e., it
represents a non-terminal state and has unvisited children. The most common selection function
for MCTS is the Upper Confidence Tree (UCT) strategy, which descends the tree by selecting at
each node the child that maximizes an upper confidence bound (UCB), thus balancing exploration
and exploitation [26, 27]; ii) Expansion: One (or more) child nodes are added to expand the tree according to the
available actions; iii) Simulation: A simulation is run from the expanded nodes to the leaf nodes
according to the default policy to produce an outcome. The default policy selects sequences of
actions randomly; iv) Backpropagation: The result of the simulation is backpropagated through
the selected nodes to update their statistics, which are the average reward obtained from that
node onwards and the number of times the node has been visited. At the end of simulations, the
action with the maximum Q-value is selected and executed in the environment. This method
assumes that the states and actions are discrete.
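
To illustrate the four steps above, the following Python sketch implements one simplified MCTS iteration with UCT selection for discrete actions; the node structure and the environment interface (actions, generative_model, is_terminal) are assumptions of ours, not the implementation used in the experiments.

```python
import math
import random


class Node:
    """State node; children maps each tried action to its statistics and child node."""
    def __init__(self, state):
        self.state = state
        self.N = 0                       # visits of this state node
        self.children = {}               # action -> {"N": int, "Q": float, "child": Node|None}


def uct_select(node, c):
    """Selection: action maximizing Q(s,a) + c * sqrt(log N(s) / N(s,a))."""
    return max(node.children, key=lambda a: node.children[a]["Q"]
               + c * math.sqrt(math.log(node.N) / node.children[a]["N"]))


def rollout(env, state, depth, gamma):
    """Simulation: follow the random default policy until the depth budget runs out."""
    if depth == 0 or env.is_terminal(state):
        return 0.0
    action = random.choice(env.actions(state))
    next_state, reward = env.generative_model(state, action)
    return reward + gamma * rollout(env, next_state, depth - 1, gamma)


def simulate(env, node, depth, gamma, c):
    """One MCTS iteration: selection/expansion, simulation, backpropagation."""
    if depth == 0 or env.is_terminal(node.state):
        return 0.0
    untried = [a for a in env.actions(node.state) if a not in node.children]
    if untried:                                            # expansion
        action = random.choice(untried)
        node.children[action] = {"N": 0, "Q": 0.0, "child": None}
        next_state, reward = env.generative_model(node.state, action)
        q = reward + gamma * rollout(env, next_state, depth - 1, gamma)
    else:                                                  # selection
        action = uct_select(node, c)
        next_state, reward = env.generative_model(node.state, action)
        if node.children[action]["child"] is None:         # single child node kept per action
            node.children[action]["child"] = Node(next_state)
        q = reward + gamma * simulate(env, node.children[action]["child"],
                                      depth - 1, gamma, c)
    node.N += 1                                            # backpropagation of statistics
    stats = node.children[action]
    stats["N"] += 1
    stats["Q"] += (q - stats["Q"]) / stats["N"]
    return q
```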


3. Methods
We first summarize some of the main methods in the literature for MCTS-based planning in
continuous state/action spaces and then we describe a new technique aiming to outperform
state-of-the-art methods.

3.1. MCTS Planning with discretized actions
We refer here to the standard version of MCTS that discretizes the action space by splitting it
into intervals and maps each interval to a discrete value. This technique poses the problem of
tuning the number of actions for maximizing the performance, which is often complicated in
real-world domains with continuous and multi-dimensional action space. When the number
of possible actions is infinite it is impossible to explore the whole space, and it is often difficult to
achieve a good trade-off between action accuracy and the width of the Monte Carlo tree. The curse
of dimensionality is a problem for this approach because a large number of actions makes the tree
width explode. In other words, the tree gets wider and never develops in depth, obtaining policies
optimized only for the first step and relying on random (rollout) policies in all following steps.
Such policies are far from optimal and very close to random behavior.
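
For illustration, a possible way to build such a discretized action set in Python is sketched below; the ranges are those of the car domain in Section 4.1, the counts match the best configuration found in Section 4.2, and the helper name is ours.

```python
import itertools
import numpy as np


def discretize_actions(n_acc=7, n_steer=7,
                       acc_range=(-5.0, 5.0), steer_range=(-30.0, 30.0)):
    """Split each continuous action dimension into evenly spaced values and
    take the Cartesian product as the discrete action set."""
    accelerations = np.linspace(*acc_range, n_acc)
    steerings = np.linspace(*steer_range, n_steer)
    return list(itertools.product(accelerations, steerings))


# 7 accelerations x 7 steering angles = 49 discrete actions per tree node.
print(len(discretize_actions()))  # 49
```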

3.2. MCTS-APW
MCTS Action Progressive Widening (MCTS-APW) was first proposed in [17, 19]. It explicitly
controls the branching factor of the action nodes of the search tree. MCTS-APW is necessary
when the action space is continuous or so large that it cannot be fully explored. During the search,
the number of action nodes for each state is controlled by two parameters, π‘˜ ∈ ℝ+ and 𝛼 ∈ [0, 1].
In particular, the technique samples a new random action from the action space if
||𝐴(𝑠)|| < π‘˜π‘ (𝑠)^𝛼, where ||𝐴(𝑠)|| is the number of actions already added to the node with
state 𝑠 and 𝑁 (𝑠) is the number of visits of state 𝑠. The strategy used by this algorithm to
generate new actions is to sample uniformly at random from the whole action space. If no new
action has to be added to the current set, the algorithm selects an action using UCT [27].
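
A minimal Python sketch of this widening rule is given below, under our own naming assumptions; actions are drawn uniformly at random from the continuous action space, as in [17, 19], and are assumed to be hashable (e.g., tuples) so they can index the statistics dictionary.

```python
def action_progressive_widening(node, k, alpha, sample_random_action):
    """Attach a new action to the state node only while |A(s)| < k * N(s)^alpha.

    node.actions         : actions already attached to the state node, i.e. A(s)
    node.N               : number of visits N(s) of the state node
    node.stats           : per-action statistics N(s,a), Q(s,a)
    sample_random_action : callable drawing a uniform sample from the action space
    """
    if len(node.actions) < k * (node.N ** alpha):
        a = sample_random_action()
        node.actions.append(a)
        node.stats[a] = {"N": 0, "Q": 0.0}   # initialize statistics of the new action
    # otherwise, an action from the current set A(s) is selected with UCT (not shown)
```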

3.3. MCTS-APW2
We propose an extension of MCTS-APW that replaces the random sampling of new actions
from the whole action space with a new sampling strategy. The first three actions selected by
MCTS-APW2 are always the minimum, median and maximum points of the action space (in
multi-dimensional action spaces the minimum, median, and maximum are obtained by selecting,
respectively, the minimum, median and maximum values for each action dimension). From the
fourth action onwards, with probability πœ– we select the two actions π‘Ž1 , π‘Ž2 with the highest
𝑄-values in the current set of selected actions (this is why the name of the method ends with a "2")
and compute the new action as π‘Ž = mean(π‘Ž1 , π‘Ž2 ); with probability 1 βˆ’ πœ– we sample a new action
uniformly at random from the whole action space. The algorithm is described in more detail in the
Supplementary Material. After a new action is added to the state node, its statistics are initialized
and UCT is used to select actions within the new action set.
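
The sketch below illustrates the MCTS-APW2 sampling strategy just described (see also Algorithm 2 in the Supplementary Material); the action space is assumed to expose component-wise low/high bounds, actions are represented as tuples so they can index the statistics dictionary, and all identifiers are our own assumptions.

```python
import random
import numpy as np


def apw2_new_action(node, low, high, eps):
    """Return the next action to attach to the state node.

    The first three actions are the median (midpoint), minimum and maximum of
    the action space, taken component-wise for multi-dimensional actions.
    Afterwards, with probability eps the new action is the mean of the two
    actions with the highest Q-value; otherwise it is sampled uniformly.
    """
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    n_added = len(node.actions)
    if n_added == 0:
        return tuple((low + high) / 2.0)          # median point of the action space
    if n_added == 1:
        return tuple(low)                         # minimum point
    if n_added == 2:
        return tuple(high)                        # maximum point
    if random.random() < eps:
        a1, a2 = sorted(node.actions, key=lambda a: node.stats[a]["Q"])[-2:]
        return tuple((np.asarray(a1) + np.asarray(a2)) / 2.0)   # mean of the two best actions
    return tuple(np.random.uniform(low, high))    # uniform sample in the whole space
```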


4. Results
This section presents experiments comparing standard MCTS, MCTS-APW and MCTS-APW2.
We first describe the domain, then we introduce the experimental setting and parameter tuning
performed to determine good sets of parameters for each method. Finally, we show the global
performance in terms of average cumulative reward and analyze the best and average trajectories
generated to explain the reasons why MCTS-APW2 outperforms MCTS-APW.

4.1. Domain
We evaluate the algorithms in an environment representing a curved road with a bottleneck in
which the agent has to reach the other side of the bottleneck in the minimum number of steps
(i.e., in the shortest possible time or with the highest possible average speed). The environment
is displayed in Figure 1, where the blue and red lines represent the edges of the road and the
green dashed line is the trajectory executed by the agent. The trajectory starts in 𝑆0 with an
initial speed of 10 m/s and angle 90∘ (0∘ is a horizontal movement to the right), and it ends when
the agent crosses the vertical dashed green line on the right (or when it goes off the road). The state
and action spaces are continuous: the agent can position itself at any point inside the
road and it can control the acceleration π‘Ž ∈ [βˆ’5, 5] and the steering πœ‘ ∈ [βˆ’30∘ , 30∘ ]. The speed
of the vehicle must be in the range 𝑣 ∈ [0, 20]. The reward function returns a score π‘Ÿ = βˆ’π›Ώ/𝑛𝑠𝑑𝑒𝑝𝑠
at each step, where 𝛿 is a term proportional to the distance between the agent and the target
and 𝑛𝑠𝑑𝑒𝑝𝑠 is the number of steps taken up to the current point. The episode is terminated in
three cases: if the agent reaches the target (with reward π‘Ÿ = 10000/𝑛𝑠𝑑𝑒𝑝𝑠 ), if the agent goes off-road
(with reward π‘Ÿ = βˆ’1000), or if it reaches 100 steps (with reward π‘Ÿ = βˆ’1000). The transition
model is a simple dynamical model that allows to move the agent deterministically according to
the current speed, acceleration, and steering.
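
The reward and termination logic just described can be summarized by the following hedged Python sketch; the distance term 𝛿 is a placeholder proportional to the Euclidean distance to the target, and the exact proportionality constant and vehicle dynamics of the simulator are not reproduced here.

```python
import numpy as np

MAX_STEPS = 100


def step_reward(position, target, n_steps, reached_target, off_road):
    """Per-step and terminal rewards of the car domain (Section 4.1).

    Returns (reward, done); delta below is only a placeholder for the
    distance-proportional term used in the paper.
    """
    if reached_target:
        return 10000.0 / n_steps, True            # goal reached: episode ends
    if off_road or n_steps >= MAX_STEPS:
        return -1000.0, True                      # failure: episode ends
    delta = np.linalg.norm(np.asarray(target) - np.asarray(position))
    return -delta / n_steps, False                # regular step
```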

4.2. Parameter optimization
We perform parameter tuning to select the most suitable parameters for each method. Specifi-
cally, we perform a grid search running each combination of parameters for 10 episodes and then
identify the set of parameters that maximize the average discounted reward. For standard MCTS
(called MCTS in the following) the main parameter to tune is the number of intervals in which
each action is discretized. We perform a grid search over each combination of number of acceler-
ations πœ” ∈ [7, 11, 15, 21, 50, 70, 120, 150] and angles πœ‘ ∈ [3, 5, 7, 11, 15]. For MCTS-APW and
MCTS-APW2 we perform a grid search to determine the parameters π‘˜ and 𝛼 that control the
action widening. We tested, in particular, all combinations of π‘˜ ∈ [30, 40, 50, 60, 70, 80, 90] and
𝛼 ∈ [0, 0.1, 0.4, 0.5, 0.6, 0.8, 0.9]. With MCTS-APW2 we also considered all πœ– ∈ [0.4, 0.6, 0.9].
   The best parameters for the methods are the following: i) MCTS: πœ” = 7 and πœ‘ = 7; ii)
MCTS-APW: π‘˜ = 40 and 𝛼 = 0; iii) MCTS-APW2: π‘˜ = 40, 𝛼 = 0 and πœ– = 0.4. In all methods
we use 100 simulations to generate the tree, the exploration factor is 𝑐 = 11 and the discount
factor is 𝛾 = 0.99. After selecting the best parameters for each method we computed their
cumulative reward on 100 episodes using those parameters. Tests are run on two laptops, the
first with processor Intel Core i7 - 6500 CPU 2.50 GHz x 4, RAM 16 GB and operating system
Ubuntu 20.04.5 LTS, the second with processor Intel Core i7-8550U CPU 1.80GHz Γ— 8, RAM 16
GB and operating system Ubuntu 22.04 LTS.
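
The grid search described above can be reproduced, for instance, with a few lines of Python; the sketch below uses the MCTS-APW2 grids reported in the text, while run_episode (returning the discounted reward of one episode for a given configuration) is a hypothetical helper.

```python
import itertools
import numpy as np

K_VALUES = [30, 40, 50, 60, 70, 80, 90]
ALPHA_VALUES = [0, 0.1, 0.4, 0.5, 0.6, 0.8, 0.9]
EPS_VALUES = [0.4, 0.6, 0.9]


def grid_search(run_episode, n_episodes=10):
    """Return the (k, alpha, eps) maximizing the average discounted reward over n_episodes."""
    best_params, best_score = None, -np.inf
    for k, alpha, eps in itertools.product(K_VALUES, ALPHA_VALUES, EPS_VALUES):
        score = np.mean([run_episode(k=k, alpha=alpha, eps=eps)
                         for _ in range(n_episodes)])
        if score > best_score:
            best_params, best_score = (k, alpha, eps), score
    return best_params, best_score
```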

4.3. Performance comparison
Table 1 shows the performance of the three methods using the best parameters. The parameters
used are reported in brackets close to the name of the method in the first column. The second
column shows the average performance (over 100 runs), which is the main measure we use
to evaluate the algorithms. MCTS-APW2 has the highest cumulative reward (πœŒπ‘Žπ‘£π‘” ), namely
48.3, then there is MCTS with βˆ’309.2 and finally MCTS-APW with βˆ’909.8. We notice that the
algorithm that we propose, i.e., MCTS-APW2, provides a large improvement in terms of average
performance. The third and fourth columns of Table 1 show, respectively, the maximum and
minimum reward (over 100 runs). It is interesting to notice that MCTS-APW2 also wins on this
measure, with a value of 1774.7, although in this case MCTS reaches a similar performance, i.e.,
1773.2. The fifth column shows the number of times the agent went off-road (over 100 runs).
MCTS-APW2 does it 42 times, MCTS 66, and MCTS-APW 91. The average reward is of course
influenced by the number of times the agent goes off-road and also by the average reward of
runs in which the agent reaches the goal. The last column of Table 1 shows the number of
actions used on average by each method. It is clear that the two methods based on widening,
namely MCTS-APW and MCTS-APW2, use fewer actions (i.e., 41) than standard MCTS with
discretized actions (which uses 49 actions). This reduction of actions allows MCTS-APW and
MCTS-APW2 to perform more simulations for each action on average, developing the tree deeper
than standard MCTS. MCTS-APW2 is also more precise in the selection of actions, managing to
pass through the bottleneck better than the other methods.

   Method                                     πœŒπ‘Žπ‘£π‘”       πœŒπ‘šπ‘Žπ‘₯       πœŒπ‘šπ‘–π‘›       #offroad     #actions
   MCTS (πœ” = 7, πœ‘ = 7)                       βˆ’309.2     1773.2     βˆ’953.2       66/100         49
   MCTS-APW (π‘˜ = 40, 𝛼 = 0)                  βˆ’809.8     1371.8     βˆ’956.8       91/100         41
   MCTS-APW2 (π‘˜ = 40, 𝛼 = 0, πœ– = 0.4)          48.3     1774.7     βˆ’955         42/100         41
Table 1
Summary of performance of MCTS, MCTS-APW, and MCTS-APW2.




Figure 1: Distributions of reward and analysis of the trajectories in the best and in the average case for
each method.

   Figure 1 shows the distributions of rewards for all the methods. In particular, 58% of the
mass for MCTS-APW2 is positive, which means that in these episodes the agent is able to pass
the curve and reach the target without going off the road. For MCTS and MCTS-APW the mass
of positive rewards decreases to 34% and 9%, respectively. Below the performance distributions,
the bottom part of Figure 1 shows, for each method, the trajectory with maximum reward (first
row) and the trajectory with reward closest to the average reward (second row). For each
scenario, we report the number of steps the method takes to complete the episode and the mean
depth (𝑑) of the tree for each step of the episode. In the best case, MCTS-APW2 and MCTS reach
the target in 6 steps, while MCTS-APW takes 8 steps, but MCTS-APW2 has a deeper search tree
than the other methods, which results in a longer planning horizon. For instance, in the first
step of the trajectory in Figure 1.c.3 MCTS-APW2 computes a wider trajectory than MCTS in
order to pass faster through the bottleneck. In the average case, the number of steps needed to
complete the episode increases, and MCTS-APW2, which has better accuracy in choosing the
actions, is the only method able to compute a trajectory that reaches the goal (see Figure 1.c.3),
even if the performance of the proposed method can still be improved and is far from optimal.
MCTS (see Figure 1.a.3) reaches the bottleneck while reducing its velocity and remains stuck.
Analyzing the trajectories we notice that the discretization of the action space introduced by
MCTS reduces the accuracy of the actions, preventing the robot from passing the bottleneck
(and the curve) in several cases. In MCTS-APW the random sampling of new actions in the
continuous space makes the exploration of the action space often inefficient and yields, on
average, a performance decrease with respect to standard MCTS. The new sampling strategy
proposed in MCTS-APW2 instead manages to solve the problem of action space exploration
and to increase the average return from -309.2 to 48.3.


5. Conclusions and future work
In this study, we consider different MCTS-based planning methods that work on discretized
action spaces and on continuous state/action spaces. We propose a new method that extends the
state-of-the-art technique for action progressive widening in continuous domains, and we show
that the proposed method outperforms the state-of-the-art. Future work will focus on
the improvement of the performance of our method and on extending the analysis to other
progressive widening techniques. Future work will also propose a Python framework that
implements all the algorithms for MCTS-based planning in continuous action/state spaces and
some environments for evaluation.


References
 [1] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, second ed., The MIT
     Press, 2018.
 [2] T. Bandyopadhyay, K. S. Won, E. Frazzoli, D. Hsu, W. S. Lee, D. Rus, Intention-aware
     motion planning, in: Algorithmic foundations of robotics X, Springer, 2013, pp. 475–491.
 [3] H. Gupta, B. Hayes, Z. Sunberg, Intention-aware navigation in crowds with extended-
     space POMDP planning, in: 21st International Conference on Autonomous Agents and
     Multiagent Systems, AAMAS 2022, Auckland, New Zealand, May 9-13, 2022, 2022, pp.
     562–570.
 [4] Y. Liu, G. Nejat, Robotic urban search and rescue: A survey from the control perspective,
     Journal of Intelligent & Robotic Systems 72 (2013) 147–165.
 [5] D. Bertsimas, S. Gupta, G. Lulli, Dynamic resource allocation: A flexible and tractable
     modeling framework, European Journal of Operational Research 236 (2014) 14–26.
 [6] R. Coulom, Efficient selectivity and backup operators in monte-carlo tree search, in: H. J.
     van den Herik, P. Ciancarini, H. H. L. M. J. Donkers (Eds.), Computers and Games, Springer
     Berlin Heidelberg, Berlin, Heidelberg, 2007, pp. 72–83.
 [7] C. Browne, E. J. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen,
     S. Tavener, D. P. Liebana, S. Samothrakis, S. Colton, A survey of monte carlo tree search
     methods, IEEE Transactions on Computational Intelligence and AI in Games 4 (2012) 1–43.
 [8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser,
     I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N. Kalch-
     brenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, D. Hassabis,
     Mastering the game of Go with deep neural networks and tree search, Nature 529 (2016)
     484–489.
 [9] D. Silver, J. Veness, Monte-Carlo planning in large POMDPs, in: Advances in Neural
     Information Processing Systems, NeurIPS 2010, 2010, pp. 2164–2172.
[10] A. Castellini, G. Chalkiadakis, A. Farinelli, Influence of State-Variable Constraints on
     Partially Observable Monte Carlo Planning, in: IJCAI 2019, Macao, China, August 10-16,
     2019, ijcai.org, 2019, pp. 5540–5546.
[11] M. Zuccotto, A. Castellini, A. Farinelli, Learning state-variable relationships for improving
     pomcp performance, in: Proceedings of the 37th ACM/SIGAPP Symposium on Applied
     Computing, SAC ’22, Association for Computing Machinery, New York, NY, USA, 2022, p.
     739–747.
[12] G. Mazzi, A. Castellini, A. Farinelli, Identification of unexpected decisions in partially
     observable monte-carlo planning: A rule-based approach, in: Proceedings of the 20th
     International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), In-
     ternational Foundation for Autonomous Agents and Multiagent Systems, 2021, p. 889–897.
[13] G. Mazzi, A. Castellini, A. Farinelli, Rule-based shielding for partially observable monte-
     carlo planning, in: Proceedings of the International Conference on Automated Planning
     and Scheduling (ICAPS), volume 31, 2021, pp. 243–251.
[14] G. Mazzi, D. Meli, A. Castellini, A. Farinelli, Learning logic specifications for soft policy
     guidance in pomcp, in: Proceedings of the 22nd Conference on Autonomous Agents and
     MultiAgent Systems, 2023, p. in publication.
[15] A. Castellini, E. Marchesini, A. Farinelli, Partially observable monte carlo planning with
     state variable constraints for mobile robot navigation, Eng. Appl. Artif. Intell. 104 (2021)
     104382.
[16] M. Zuccotto, M. Piccinelli, A. Castellini, E. Marchesini, A. Farinelli, Learning state-variable
     relationships in POMCP: A framework for mobile robots, Frontiers Robotics AI 9 (2022).
[17] G. M. J. Chaslot, M. H. Winands, H. J. v. d. Herik, J. W. Uiterwijk, B. Bouzy, Progressive
     strategies for monte-carlo tree search, New Mathematics and Natural Computation 4
     (2008) 343–357.
[18] Y. Wang, J.-Y. Audibert, R. Munos, Algorithms for infinitely many-armed bandits, 2008,
     pp. 1729–1736.
[19] A. Couetoux, Monte carlo tree search for continuous and stochastic sequential decision
     making problems, 2013.
[20] Z. N. Sunberg, M. J. Kochenderfer, Online algorithms for pomdps with continuous state,
     action, and observation spaces, in: Twenty-Eighth International Conference on Automated
     Planning and Scheduling, 2018.
[21] M. H. Lim, C. J. Tomlin, Z. N. Sunberg, Sparse tree search optimality guarantees in pomdps
     with continuous observation spaces, arXiv preprint arXiv:1910.04332 (2019).
[22] M. H. Lim, C. J. Tomlin, Z. N. Sunberg, Voronoi progressive widening: Efficient on-
     line solvers for continuous state, action, and observation pomdps, in: 2021 60th IEEE
     Conference on Decision and Control (CDC), IEEE Press, 2021, p. 4493–4500.
[23] R. Bellman, A markovian decision process, Journal of Mathematics and Mechanics 6 (1957)
     679–684.
[24] M. L. Puterman, Markov decision processes: discrete stochastic dynamic programming,
     John Wiley & Sons, 2014.
[25] G. Chaslot, S. Bakkes, I. Szita, P. Spronck, Monte-carlo tree search: A new framework for
     game ai, in: Proceedings of the Fourth AAAI Conference on Artificial Intelligence and
     Interactive Digital Entertainment, AIIDE’08, AAAI Press, 2008, p. 216–217.
[26] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem,
     Mach. Learn. 47 (2002) 235–256.
[27] L. Kocsis, C. SzepesvΓ‘ri, Bandit based monte-carlo planning, in: Proceedings of the 17th
     European Conference on Machine Learning, ECML’06, Springer-Verlag, 2006, p. 282–293.



A. Supplementary material
In this section, we provide the pseudo-code of the procedures that are common to the
methods we compared, namely PLAN and ROLLOUT. For simplicity, we provide only the
pseudo-code of MCTS-APW2; for the pseudo-code of MCTS-APW please refer to [19].

A.1. Pseudo-code of common procedures and MCTS-APW2

Algorithm 1 Common procedures
 1: procedure Rollout(𝑠, 𝑑)
 2:    if 𝑑 = 0 then
 3:        return 0
 4:    end if
 5:    π‘Ž = πœ‹0 (𝑠)
 6:    𝑠′ , π‘Ÿ = 𝐺(𝑠, π‘Ž)
 7:    return π‘Ÿ + 𝛾 Rollout(𝑠′ , 𝑑 βˆ’ 1)
 8: end procedure
 9:
10: procedure Plan(𝑠)
11:     for 𝑖 = 1, . . . , 𝑛 do
12:         Simulate(𝑠, π‘‘π‘šπ‘Žπ‘₯ )
13:     end for
14:     return argmaxπ‘Ž 𝑄(𝑠, π‘Ž)
15: end procedure
Algorithm 2 MCTS-APW2
 1: procedure NewActionSelection(𝐴(𝑠))
 2:    𝑝 = Unif[0, 1]
 3:    if 𝑝 < πœ– then
 4:        π‘Ž1 , π‘Ž2 = the two actions in 𝐴(𝑠) with the highest 𝑄(𝑠, π‘Ž)
 5:        π‘Ž = mean(π‘Ž1 , π‘Ž2 )
 6:    else
 7:        π‘Ž = randomAction()
 8:    end if
 9:    return π‘Ž
10: end procedure
11:
12: procedure NewActionProgressiveWidening(𝑠)
13:    if ||𝐴(𝑠)|| < π‘˜π‘ (𝑠)^𝛼 then
14:         if 𝑁 (𝑠) = 0 then
15:             π‘Ž = median(𝐴)
16:         else if 𝑁 (𝑠) = 1 then
17:             π‘Ž = min(𝐴)
18:         else if 𝑁 (𝑠) = 2 then
19:             π‘Ž = max(𝐴)
20:         else
21:             π‘Ž = NewActionSelection(𝐴(𝑠))
22:         end if
23:         𝑁 (𝑠, π‘Ž), 𝑄(𝑠, π‘Ž), 𝑉 (𝑠, π‘Ž) = 𝑁0 (𝑠, π‘Ž), 𝑄0 (𝑠, π‘Ž), 𝑉0 (𝑠, π‘Ž)
24:         𝐴(𝑠) = 𝐴(𝑠) βˆͺ {π‘Ž}
25:    end if
26:    return argmaxπ‘Ž 𝑄(𝑠, π‘Ž) + 𝑐 √(log 𝑁 (𝑠) / 𝑁 (𝑠, π‘Ž))
27: end procedure
28:
29: procedure Simulate(𝑠, 𝑑)
30:    if 𝑑 = 0 then
31:          return 0
32:    end if
33:    if 𝑠 βˆ‰ 𝑇 then
34:          𝑇 = 𝑇 βˆͺ {𝑠}
35:          𝑁 (𝑠) = 𝑁0 (𝑠)
36:          return Rollout(𝑠, 𝑑)
37:    end if
38:    π‘Ž = NewActionProgressiveWidening(𝑠)
39:    𝑠′ , π‘Ÿ = 𝐺(𝑠, π‘Ž)
40:    π‘ž = π‘Ÿ + 𝛾 Simulate(𝑠′ , 𝑑 βˆ’ 1)
41:    𝑁 (𝑠) = 𝑁 (𝑠) + 1
42:    𝑁 (𝑠, π‘Ž) = 𝑁 (𝑠, π‘Ž) + 1
43:    𝑄(𝑠, π‘Ž) = 𝑄(𝑠, π‘Ž) + (π‘ž βˆ’ 𝑄(𝑠, π‘Ž))/𝑁 (𝑠, π‘Ž)
44:    return π‘ž
45: end procedure