Monte Carlo planning for mobile robots in large action spaces with velocity obstacles

Lorenzo Bonanni*1, Daniele Meli*1, Alberto Castellini1 and Alessandro Farinelli1

The 10th Italian Workshop on Artificial Intelligence and Robotics – AIRO 2023.
* These authors contributed equally.
lorenzo.bonanni@studenti.univr.it (L. Bonanni); daniele.meli@univr.it (D. Meli); alberto.castellini@univr.it (A. Castellini); alessandro.farinelli@univr.it (A. Farinelli)

Abstract
Motion planning in dynamic environments is a challenging robotic task, requiring collision avoidance and real-time computation. State-of-the-art online methods such as Velocity Obstacles (VO) guarantee safe local planning, while global planning methods based on reinforcement learning or graph discretization are either computationally inefficient or not provably collision-safe. In this paper, we combine Monte Carlo Tree Search (MCTS) with VO to prune unsafe actions (i.e., colliding velocities). In this way, we can plan with very few MCTS simulations even in very large action spaces (60 actions), achieving higher cumulative reward and lower computational time per step than pure MCTS with many simulations. Moreover, our methodology guarantees collision avoidance thanks to action pruning with VO, while pure MCTS does not. The results in this paper pave the way towards the deployment of MCTS planning on real robots and multi-agent decentralized motion planning.

Keywords
Monte Carlo Planning, Velocity Obstacles, Markov Decision Processes, Robot Motion Planning

1. Introduction
Motion planning is a wide research area in robotics, especially mobile robotics. Autonomous navigation in dynamic, possibly uncertain, environments is a non-trivial task, which requires the computation of a safe trajectory towards a predefined goal, while avoiding obstacles and preserving computational efficiency for fast reaction to changes in the environment [1]. State-of-the-art approaches can be mainly categorized into reactive, look-ahead and learning-based planners.

Reactive planners include the Velocity Obstacle (VO) paradigm [2, 3] and artificial potential fields [4, 5]. They compute the motion command at run time, evaluating the current configuration of the environment (e.g., obstacles), hence guaranteeing good performance even in dynamic scenarios. However, they present several limitations, e.g., they may get stuck in local minima in narrow or cluttered maps. On the contrary, look-ahead planners compute trajectories over a finite time horizon, hence evaluating a larger part of the environmental map and achieving higher convergence to the goal, even in complex maps. By exploring actions over a horizon, they can often generate optimal plans according to user-defined criteria. State-of-the-art examples include graph-based planners such as rapidly-exploring random trees [6] and Model Predictive Control (MPC) [7]. However, graph-based planners are typically computationally
inefficient in dynamic scenarios where replanning is needed. Moreover, MPC performs explicit optimization of a cost function, thus requiring an analytical definition which is not always easily available in complex scenarios [8]. Recently, learning-based planners adopting neural networks and reinforcement learning have been proposed [9]; they are excellent solutions for optimal trajectory planning and are robust to uncertainty in the scenario. However, they require intense training resources (both data and time) and do not guarantee generalization to unseen scenarios. Finally, most optimal planners perform cost minimization, hence not guaranteeing safe navigation [10].

In this paper, we approximate the navigation problem as a Markov Decision Process (MDP) and solve it with Monte Carlo Tree Search (MCTS), an optimal look-ahead planner which performs online simulations to compute the best trajectory. However, MCTS requires task-specific heuristics for computational efficiency [11, 12, 13, 14] and typically does not scale to large action spaces. We therefore combine it with the VO paradigm for collision avoidance. In particular, we guarantee safe (i.e., obstacle-free) trajectory generation by pruning colliding actions (i.e., velocities) both in the Monte Carlo tree and in the rollout phase. In this way, we are able to improve the performance of pure MCTS even with very few simulations, while maintaining a large action space (60 actions) which allows a finer approximation of the environment. This is a preliminary but fundamental step towards the deployment of Monte Carlo planning on real robots.

Figure 1: (a) A sketch of the VO paradigm with one static obstacle: $v_{pref} \in CC_i$, thus a new velocity $v \in \mathcal{V}_c$ must be selected. (b) The map of the environment used in this work: the agent has to move from the start point (blue cross) to the goal point (red cross), avoiding static obstacles (black circles).

2. Background
We now introduce the fundamentals of the VO paradigm and Monte Carlo planning for MDPs, which are the basis of the methodology described in this paper.

2.1. Velocity Obstacles (VO)
Consider a robot $R$ which must reach a target $G$ in an environment with $N$ obstacles (an example with one obstacle is shown in Figure 1a). For simplicity, we assume that the robot and the obstacles are spherical, with radii $r_R$ and $r_i$, $i = 1, \dots, N$. At a given time step $t$, the robot is at position $p_R$, while the $i$-th obstacle has position $p_i$ and velocity $v_i$ (all vector quantities). Given the set $\mathcal{V}$ of admissible velocities for robot $R$, the robot would move towards the goal with its preferred velocity $v_{pref} \in \mathcal{V}$, but this may cause collisions with obstacles. The VO paradigm is then used to compute the set of collision-free velocities $\mathcal{V}_c \subseteq \mathcal{V}$. Specifically, for each $i$-th obstacle, we define a collision cone $CC_i$ as:

$$CC_i = \{\, v \in \mathcal{V} \mid \exists\, t' > t \ \text{s.t.}\ p_R(t) + (v - v_i)(t' - t) \in \mathcal{B}(p_i, r_R + r_i) \,\} \qquad (1)$$

where $\mathcal{B}(p_i, r_R + r_i)$ is the ball centered at $p_i$ with radius $r_R + r_i$. We then define $\mathcal{V}_c = \mathcal{V} \setminus \bigcup_{i=1}^{N} CC_i$, namely, the set of admissible velocities lying outside the collision cone of every obstacle.
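As a concrete illustration of Eq. (1), the following Python sketch approximates the collision-cone test by sampling future time instants and uses it to build $\mathcal{V}_c$ from a discrete set of candidate velocities. The function names, the sampling horizon and the data layout are our own illustrative assumptions, not part of the original VO formulation.

```python
import numpy as np

def in_collision_cone(p_R, v, p_i, v_i, r_R, r_i, horizon=5.0, dt=0.1):
    """Approximate test for v in CC_i (Eq. 1): does the position propagated with the
    relative velocity (v - v_i) enter the ball B(p_i, r_R + r_i) within the horizon?"""
    rel_v = np.asarray(v, dtype=float) - np.asarray(v_i, dtype=float)
    for tau in np.arange(dt, horizon, dt):            # sampled future offsets t' - t > 0
        p_future = np.asarray(p_R, dtype=float) + rel_v * tau
        if np.linalg.norm(p_future - np.asarray(p_i, dtype=float)) <= r_R + r_i:
            return True
    return False

def safe_velocities(p_R, candidates, obstacles, r_R):
    """Build V_c: the candidate velocities outside every collision cone.
    `obstacles` is a list of (p_i, v_i, r_i) tuples."""
    return [v for v in candidates
            if not any(in_collision_cone(p_R, v, p_i, v_i, r_R, r_i)
                       for (p_i, v_i, r_i) in obstacles)]
```

An analytical cone test (comparing the direction of the relative velocity with the cone half-angle) would avoid the inner time loop; the sampled version is shown only because it maps directly onto Eq. (1).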
2.2. Markov Decision Processes
A Markov Decision Process (MDP) [15, 16] is a tuple $M = \langle S, A, T, R, \gamma \rangle$, where: $S$ is a finite set of states (e.g., robot and obstacle positions in the VO setting); $A$ is a finite set of actions, each represented by its index, i.e., $A = \{1, \dots, |A|\}$ (e.g., linear velocity and movement direction in the VO setting); $T : S \times A \to \mathcal{P}(S)$ is a stochastic or deterministic transition function (e.g., the deterministic dynamics of the robot in the VO setting), where $\mathcal{P}(E)$ denotes the space of probability distributions over the finite set $E$, so that $T(s, a, s')$ indicates the probability of reaching state $s' \in S$ after executing $a \in A$ in $s \in S$; $R : S \times A \to [-R_{max}, R_{max}]$ is a bounded stochastic reward function (e.g., a function that rewards the robot if it gets close to or reaches the goal while avoiding the obstacles, in the VO setting); and $\gamma \in [0, 1)$ is a discount factor. The set of stochastic policies for $M$ is $\Pi = \{\pi : S \to \mathcal{P}(A)\}$. In the VO setting used in this paper, a policy is a function that suggests the linear velocity and the direction of movement given the current position of the robot and the obstacles. Given an MDP $M$ and a policy $\pi$, we can compute state values $V_M^{\pi}(s)$, $s \in S$, namely, the expected return acquired by $\pi$ from $s$, and action values $Q_M^{\pi}(s, a)$, $s \in S$, $a \in A$, namely, the expected return acquired by $\pi$ when action $a$ is performed from state $s$. To evaluate the performance of a policy $\pi$ in an MDP $M$, i.e., $\rho(\pi, M)$, we compute its expected return (i.e., its value) in the initial state $s_0$, namely, $\rho(\pi, M) = V_M^{\pi}(s_0)$. The goal of MDP solvers [17, 18] is to compute optimal policies, namely, policies having maximal value (i.e., expected return) in all states.

2.3. Monte Carlo Tree Search
MCTS [19, 20] is an online solver, namely, it computes the optimal policy only for the current state of the agent. This feature allows MCTS to scale to large state spaces, which are typical of real-world domains. Given the current state of the agent, MCTS first generates a Monte Carlo tree rooted in that state to estimate, in a sample-efficient way, the Q-values (i.e., action values) for that state. Then, it uses these estimates to select the best action. A number $m \in \mathbb{N}$ of simulations is performed using, at each step, Upper Confidence Bound applied to Trees (UCT) [21, 22] (inside the tree) or a rollout policy (from a leaf to the end of the simulation) to select the action, and the transition model (or an equivalent simulator) to perform the step from one state to the next. Simulations are used to update two node statistics, namely, the average discounted return $Q(s, a)$ obtained selecting action $a$, and the number of times $N(s, a)$ action $a$ was selected from node (state) $s$. UCT extends UCB1 [21] to sequential decisions: it balances exploration and exploitation in the simulation steps performed inside the tree, and finds the optimal action as $m$ tends to infinity. Let $\bar{X}_{a,T_a(t)}$ be the average return of each action $a \in A$ of a node, where $T_a(t)$ is the number of times action $a$ has been selected up to simulation $t$ from that node. UCT selects the action with the best upper confidence bound; in other words, the index of the action selected at the $t$-th visit of a node is

$$I_t = \operatorname*{argmax}_{a \in \{1, \dots, |A|\}} \left( \bar{X}_{a, T_a(t-1)} + 2 C_p \sqrt{\frac{\ln(t-1)}{T_a(t-1)}} \right),$$

with an appropriate constant $C_p > 0$.
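For concreteness, a minimal sketch of this UCT selection rule is given below; the per-node arrays of average returns and visit counts, as well as the default exploration constant, are illustrative assumptions rather than details taken from the paper.

```python
import math

def uct_select(mean_returns, visit_counts, c_p=1.0):
    """UCT action selection at one tree node.

    mean_returns[a] : average discounted return of action a at this node (X_bar)
    visit_counts[a] : number of times action a was selected at this node (T_a)
    """
    total = sum(visit_counts)                        # visits to this node so far (t - 1)
    best_action, best_bound = None, -math.inf
    for a, (x_bar, n_a) in enumerate(zip(mean_returns, visit_counts)):
        if n_a == 0:
            return a                                 # try every action at least once
        bound = x_bar + 2.0 * c_p * math.sqrt(math.log(total) / n_a)
        if bound > best_bound:
            best_action, best_bound = a, bound
    return best_action
```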
When all $m$ simulations have been performed, the action $a$ with maximum average return $\bar{X}_{a,T_a(t)}$ (i.e., Q-value) at the root is executed in the real environment.

3. Methodology
We define the motion planning problem with $N$ obstacles as an MDP with $S = \{\langle p_R, v_R, p_i\ \forall i = 1, \dots, N \rangle\}$ and $A = \mathcal{V}$. Notice that the state space is continuous because the robot can reach all possible positions in the environment. For simplicity, we assume static obstacles ($v_i = 0$) in this paper. The main idea of the proposed methodology is to introduce the VO constraint in the simulation process performed by MCTS to estimate action values. This can improve the efficiency of the simulation process, restricting exploration to $\mathcal{V}_c \subseteq \mathcal{V}$.

In order to solve the MDP with MCTS, we first discretize the continuous action space $\mathcal{V}$. Specifically, we express each $v \in \mathcal{V}$ as a tuple $v = \langle v, \alpha \rangle$, where $v$ is the module of the velocity and $\alpha$ is the heading angle in radians. We assume that the physical constraints of the robot impose a minimum and a maximum velocity module, respectively $v_{min}$ and $v_{max}$, and a maximum angular velocity $\omega_{max}$. Thus, at each time step $t_s$, where the robot has heading angle $\alpha_0$, the action space can be expressed as $A = \{\langle v, \alpha \rangle \mid v_{min} \le v \le v_{max},\ \alpha_0 - \omega_{max} t_s \le \alpha \le \alpha_0 + \omega_{max} t_s\}$. We then obtain the action space for MCTS by discretizing $v$ and $\alpha$ and considering all possible combinations of them. We also force actions of the form $\langle 0, \alpha \rangle$, in order to allow the robot to rotate without moving.

We introduce VO in two phases of MCTS, namely, in the steps performed inside the Monte Carlo tree (where UCT is used to select the action) and in the steps performed outside the Monte Carlo tree (where the rollout policy is used to select the action). The first phase concerns simulation steps taken near the robot's current position, while the second phase concerns simulation steps performed also far from the robot's current position. In this way, not only do we guarantee reactive collision avoidance within few steps of execution, but we also exploit VO when planning on a longer time horizon, improving the quality and safety of the overall trajectory to the goal. We build collision cones as in Eq. (1), but considering only collisions happening within $t_s$ [23]. For computational simplicity, we consider the worst-case scenario where the module of the robot's velocity is $v_{max}$. We can then prune away all discrete angles in the action space which would lead to a collision with any obstacle, reducing the action space considered at each simulation step. If no angle is safe, we compute the cone for $v_{min}$. Since actions with null velocity are forced in $A$, at least one action will always be feasible.

As a rollout policy we use a heuristic that encourages the robot to move in the direction of the goal. Specifically, if the direction between the robot and the goal corresponds to angle $\alpha_G$, we sample safe angles within $\mathcal{V}_c$ with uniform distribution in $[\alpha_G - 1, \alpha_G + 1]$. Since the robot may get stuck in maps with large obstacles on the way to the goal, we implement an $\epsilon$-greedy strategy, selecting a random angle in $\mathcal{V}_c$ with probability $\epsilon \in [0, 1]$, and following the heuristic with probability $1 - \epsilon$.
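The sketch below illustrates how the discretized action space, the VO-based pruning and the goal-directed $\epsilon$-greedy rollout described above could fit together. The discretization granularity, the `collides` predicate (assumed to implement the per-step collision-cone test of Eq. (1)) and all names are illustrative assumptions, not the exact implementation used in the experiments.

```python
import numpy as np

def discretize_actions(alpha_0, v_min, v_max, omega_max, ts, n_v=6, n_alpha=10):
    """Discrete action set A = {<v, alpha>} reachable within one time step ts,
    plus rotate-in-place actions <0, alpha> so that A is never empty after pruning."""
    speeds = np.linspace(v_min, v_max, n_v)
    headings = np.linspace(alpha_0 - omega_max * ts, alpha_0 + omega_max * ts, n_alpha)
    actions = [(v, a) for v in speeds for a in headings]
    actions += [(0.0, a) for a in headings]
    return actions

def prune_with_vo(actions, collides):
    """Keep only actions outside every collision cone; `collides(v, alpha)` is an
    assumed predicate implementing the per-step cone test. Rotate-in-place actions
    (v = 0) are kept unconditionally, so the pruned set is never empty."""
    return [(v, a) for (v, a) in actions if v == 0.0 or not collides(v, a)]

def rollout_action(safe_actions, alpha_goal, eps=0.1, rng=np.random.default_rng()):
    """Goal-directed epsilon-greedy rollout policy: with probability 1 - eps, sample
    a safe action whose heading is within 1 rad of the goal direction alpha_G
    (ignoring angle wrap-around for brevity); otherwise sample any safe action."""
    if rng.random() > eps:
        near_goal = [act for act in safe_actions if abs(act[1] - alpha_goal) <= 1.0]
        if near_goal:
            return near_goal[rng.integers(len(near_goal))]
    return safe_actions[rng.integers(len(safe_actions))]
```

With, for instance, six speed levels and ten headings the grid yields 60 move actions, consistent with the action-space size mentioned above, although the exact discretization used in the experiments is not specified here.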
Figure 2: (a) Mean and standard deviation of the discounted return for different numbers of simulations. (b) Mean and standard deviation of the computational time required by MCTS to compute the action for the next step.

4. Results
Experimental setting. We test our planner in the map shown in Figure 1b. Our goal is to evaluate the advantage of introducing VO constraints in MCTS. Videos with 2 and 4 moving agents (from which we can draw similar conclusions) are available at http://tinyurl.com/airo23bonanni.

Algorithms evaluated. We compare the performance of standard MCTS (VANILLA) with that of MCTS with VO constraints. In particular, we perform an ablation study of our methodology, considering MCTS with VO only in UCT (VO_TREE), MCTS with VO only in rollout (VO_ROLLOUT), and MCTS with VO both in UCT and rollout (VO2). We evaluate each algorithm with a number of simulations ranging in [10, 1000] to understand whether VO constraints improve simulation efficiency and, consequently, allow us to reduce the number of simulations and the related simulation time. To account for the stochasticity of MCTS, we run 10 tests for each configuration.

Results and discussion. In Fig. 2a, we show the return achieved by the 4 algorithms considered in the analysis. The use of VO in the rollout significantly improves the average performance: VO2 (orange line) and VO_ROLLOUT (green line) achieve similar results and both outperform VANILLA (blue line) and VO_TREE (red line). The standard deviation is also lower, thus the behavior is more stable. Importantly, with fewer than 200 simulations there is a significant drop in the performance of VANILLA and VO_TREE, while with VO2 and VO_ROLLOUT the return remains high even with only 10 simulations. In Fig. 2b, we show the computational time per step required by each algorithm. Computing collision cones in the rollout significantly increases the time required per simulation, but the effect of this constraint on performance is also very strong: VO2 and VO_ROLLOUT with only 10 simulations reach the same performance as VANILLA with 200 simulations. Hence, the time per step required by MCTS with VO in the rollout is lower than that of VANILLA at equal performance (VO2 takes about 0.1 s with an average return of -30, while VANILLA requires about 0.8 s with an average return of -43). Moreover, we remark that VO2 guarantees safe collision avoidance, since VO prunes away unsafe actions, while no such guarantee comes with VANILLA.

5. Conclusion
We presented a method for safe motion planning in dynamic environments, based on MCTS and the VO paradigm. MCTS computes a global trajectory towards the goal, limiting the risk of getting stuck in local minima, which affects purely reactive planning algorithms. VO prunes colliding actions, guaranteeing safe obstacle avoidance, differently from common approaches based on reward shaping, and scaling to large action spaces (60 actions). Preliminary results in a simulated environment with static obstacles show that VO significantly improves the performance of MCTS given the same time budget or, equivalently, allows it to reach performance similar to that of standard MCTS with far fewer simulations (and related time per step). Our ablation study evidences the key role of VO in the rollout phase of MCTS. The very low number of required simulations mitigates the complexity of computing collision cones at run time, achieving an average time per step of 0.1 s vs. 0.8 s required by pure MCTS.
In the future, we will assess the performance of our algorithm in multi-agent settings with moving obstacles and more complex maps. Moreover, we plan to implement it on a real robot, further improving the computational time per step by using C instead of Python.

Acknowledgement
This project has received funding from the Italian Ministry for University and Research, under the PON "Ricerca e Innovazione" 2014-2020 (grant agreement No. 40-G-14702-1).

References
[1] M. Mohanan, A. Salgoankar, A survey of robotic motion planning in dynamic environments, Robotics and Autonomous Systems 100 (2018) 171–185.
[2] P. Fiorini, Z. Shiller, Motion planning in dynamic environments using velocity obstacles, The International Journal of Robotics Research 17 (1998) 760–772.
[3] F. Vesentini, N. Piccinelli, R. Muradore, Velocity obstacle-based trajectory planner for anthropomorphic arms, European Journal of Control (2023) 100901.
[4] C. W. Warren, Global path planning using artificial potential fields, in: 1989 IEEE International Conference on Robotics and Automation, IEEE Computer Society, 1989, pp. 316–317.
[5] M. Ginesi, D. Meli, A. Roberti, N. Sansonetto, P. Fiorini, Dynamic movement primitives: Volumetric obstacle avoidance using dynamic potential functions, Journal of Intelligent & Robotic Systems 101 (2021) 1–20.
[6] A. Lukyanenko, D. Soudbakhsh, Probabilistic motion planning for non-Euclidean and multi-vehicle problems, Robotics and Autonomous Systems 168 (2023) 104487.
[7] C. E. Luis, M. Vukosavljev, A. P. Schoellig, Online trajectory generation with distributed model predictive control for multi-robot motion planning, IEEE Robotics and Automation Letters 5 (2020) 604–611.
[8] W. Edwards, G. Tang, G. Mamakoukas, T. Murphey, K. Hauser, Automatic tuning for data-driven model predictive control, in: 2021 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2021, pp. 7379–7385.
[9] A. H. Qureshi, Y. Miao, A. Simeonov, M. C. Yip, Motion planning networks: Bridging the gap between learning-based and classical motion planners, IEEE Transactions on Robotics 37 (2020) 48–66.
[10] D. Kamran, T. D. Simão, Q. Yang, C. T. Ponnambalam, J. Fischer, M. T. Spaan, M. Lauer, A modern perspective on safe automated driving for different traffic dynamics using constrained reinforcement learning, in: 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2022, pp. 4017–4023.
[11] D. Meli, A. Castellini, A. Farinelli, Learning logic specifications for policy guidance in POMDPs: an inductive logic programming approach, Journal of Artificial Intelligence Research 79 (2024) 725–776.
[12] G. Mazzi, D. Meli, A. Castellini, A. Farinelli, Learning logic specifications for soft policy guidance in POMCP, in: Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems, AAMAS '23, IFAAMAS, 2023, pp. 373–381.
[13] G. Mazzi, A. Castellini, A. Farinelli, Risk-aware shielding of Partially Observable Monte Carlo Planning policies, Artificial Intelligence 324 (2023) 103987.
[14] A. Castellini, E. Marchesini, A. Farinelli, Partially Observable Monte Carlo Planning with state variable constraints for mobile robot navigation, Engineering Applications of Artificial Intelligence 104 (2021) 104382.
[15] R. Bellman, A Markovian decision process, Journal of Mathematics and Mechanics 6 (1957) 679–684.
[16] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, 2014.
[17] S. J. Russell, P.
Norvig, Artificial Intelligence: A Modern Approach, Pearson, 2020.
[18] R. Sutton, A. Barto, Reinforcement Learning: An Introduction, 2nd ed., MIT Press, 2018.
[19] G. Chaslot, S. Bakkes, I. Szita, P. Spronck, Monte-Carlo Tree Search: A new framework for game AI, in: Proceedings of the Fourth AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, AIIDE'08, AAAI Press, 2008, pp. 216–217.
[20] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, S. Colton, A survey of Monte Carlo Tree Search methods, IEEE Transactions on Computational Intelligence and AI in Games 4 (2012) 1–43.
[21] P. Auer, N. Cesa-Bianchi, P. Fischer, Finite-time analysis of the multiarmed bandit problem, Machine Learning 47 (2002) 235–256.
[22] L. Kocsis, C. Szepesvári, Bandit based Monte-Carlo planning, in: Machine Learning: ECML 2006, 17th European Conference on Machine Learning, volume 4212 of LNCS, Springer-Verlag, 2006, pp. 282–293.
[23] D. Claes, D. Hennes, K. Tuyls, W. Meeussen, Collision avoidance under bounded localization uncertainty, in: 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2012, pp. 1192–1198.