<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Scalable Safe Policy Improvement for Single and Multi-Agent Systems</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Federico Bianchi</string-name>
          <email>federico.bianchi@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Castellini</string-name>
          <email>alberto.castellini@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Farinelli</string-name>
          <email>alessandro.farinelli@univr.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Workshop</string-name>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Reinforcement Learning, Safe Policy Improvement, Single-agent systems, Multi-agent systems</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Verona</institution>
          ,
          <addr-line>Str. le Grazie, 15, 37134 Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Safe Policy Improvement (SPI) is crucial in domains where reliable decision-making must be achieved with limited environmental interaction, given the high costs and risks involved. Although existing SPI algorithms ensure improved safety over baseline policies, they struggle to scale to large and complex problems. In this work, we discuss new approaches to enhance the scalability and safety of SPI for both single-agent and multi-agent systems. For single-agent scenarios, we introduce MCTS-SPIBB, which combines Monte Carlo Tree Search with Safe Policy Improvement with Baseline Bootstrapping, and SDP-SPIBB, a scalable dynamic programming approach that extends SPI to large domains while preserving safety guarantees. For multi-agent settings, we present Factored Value-MCTS-SPIBB, the first SPI method to address large-scale multi-agent problems effectively. Through theoretical and empirical evaluation, we show that our algorithms scale efficiently and maintain the safety properties of SPI, thus making SPI applicable to complex and large-scale scenarios.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Safe Policy Improvement</kwd>
        <kwd>Single-agent systems</kwd>
        <kwd>Multi-agent systems</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Safety [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is critical for deploying Reinforcement Learning (RL) algorithms in real-world scenarios [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
especially in domains like autonomous driving, healthcare, robotics and environmental monitoring [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4,
5</xref>
        ], where reliable decision-making is essential, and data collection can be risky or expensive [
        <xref ref-type="bibr" rid="ref6 ref7 ref8 ref9">6, 7, 8, 9</xref>
        ].
Safe Policy Improvement [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] is a specialization of offline RL [
        <xref ref-type="bibr" rid="ref11 ref12">11, 12</xref>
        ] that assumes knowledge of a
baseline policy and limited interaction with the environment from which a dataset is collected. It
provides probabilistic guarantees that the new policy’s performance will improve over the baseline,
thereby addressing reliability issues inherent to offline RL, such as distributional shifts and extrapolation
errors [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], that arise when the policy encounters states and actions not well represented in the
training data. SPI methods can be broadly categorized into two main groups based on how they manage
uncertainty in the agent’s states and actions: i) methods that handle uncertainty by reducing the
estimated values of uncertain actions and ii) methods that handle uncertainty by restricting the space of
policies that can be learned. SPIBB [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is a state-of-the-art method for SPI that constrains the space
of learnable policies and extends the optimal policy iteration algorithm by bootstrapping from the
baseline policy in states where the actions have high uncertainty, i.e., states and actions not suficiently
represented in the collected dataset. This strategy efectively limits the search for improved policies to
a region where the model’s estimates are suficiently reliable. However, the computational complexity
of SPIBB and other SPI methods restricts their applicability to real-world problems, largely due to the
complexity of the underlying algorithms, such as policy iteration [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ].
      </p>
      <p>
        In this work, we discuss key contributions to improve SPI scalability in both single-agent and
multi-agent systems. For single-agent systems, we propose Monte Carlo Tree Search SPIBB (MCTS-SPIBB),
which integrates MCTS [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19">16, 17, 18, 19</xref>
        ] with SPIBB for scalable policy computation in large state spaces.
Additionally, we propose SDP-SPIBB, which reduces SPIBB complexity by focusing policy updates
only on relevant state-action subspaces, enabling it to scale to large domains. For multi-agent systems,
we introduce Factored Value Monte Carlo Tree Search SPIBB (FV-MCTS-SPIBB), which leverages
action-value factorization to scale efficiently in state and action spaces whose dimensions grow
exponentially with the number of agents. In this context, we also propose two novel action-selection
strategies, Constrained Max-Plus and Constrained Variable Elimination (Var-El), which guarantee the safety
criteria defined by SPIBB. Additionally, the factorization of the transition model allows the algorithm to
trust a larger number of state-action pairs, improving the baseline policy more effectively. Empirical
evaluations on large-scale problems highlight the effectiveness of these methods in improving policy
performance while maintaining safety guarantees.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Preliminaries</title>
      <p>In this section, we introduce the background for Safe Policy Improvement and Safe Policy Improvement
with Baseline Bootstrapping.</p>
      <sec id="sec-2-1">
        <title>2.1. Safe Policy Improvement</title>
        <p>
          Let an unknown finite Markov Decision Process (MDP) M* = ⟨S, A, P*, R, γ⟩ represent the true
environment, where P* is an unknown transition model and R a known reward function. Π = {π : S → Δ_A}
is the set of stochastic policies, where Δ_A denotes the set of probability distributions over the set of
actions A. Given a policy subset Π′ ⊆ Π, a policy π ∈ Π′ is Π′-optimal for an MDP M when it maximizes its
performance on Π′ [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]: ρ(Π′, M) = max_{π ∈ Π′} ρ(π, M).
        </p>
        <p>
          SPI approaches focus on the offline RL setting, where the algorithm does its best at learning a policy
from a fixed set of experiences. Given a dataset D = ⟨(s_j, a_j, r_j, s′_j)⟩_{j=1,…,|D|} collected by a baseline policy π_0,
let N_D(s, a) denote the number of visits to the state-action pair (s, a) ∈ D. We construct the Maximum
Likelihood Estimator (MLE) M̂ = ⟨S, A, P̂, R, γ⟩ of M* as follows:
P̂(s′ ∣ s, a) = Σ_{j : (s_j = s, a_j = a, s′_j = s′)} 1 / N_D(s, a).
This is common in real-world domains where the transition model must be estimated or inferred
from small amounts of data. The safety of the improvement must be guaranteed; specifically, π_I must
outperform π_0 with an admissible performance loss. A significant approach in this context is the
percentile criterion [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], which aims to improve policy performance while maintaining a high probability
of improvement over a baseline policy π_0. The percentile criterion is defined as
π_I = argmax_{π ∈ Π} E[ρ(π, M) ∣ M ∼ ℙ(⋅ ∣ D)],
subject to the constraint
ℙ(ρ(π_I, M) ≥ ρ(π_0, M) − ζ ∣ M ∼ ℙ(⋅ ∣ D)) ≥ 1 − δ,
where ℙ(⋅ ∣ D) represents the posterior probability distribution of the MDP parameters given the
dataset D, ρ denotes the policy performance, ζ is an approximation parameter (or precision level), and
1 − δ denotes a high confidence level.</p>
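        <p>To make the MLE construction above concrete, the following minimal Python sketch (with illustrative names and data layout of our own choosing, not the original implementation) estimates P̂ from a dataset of (s, a, r, s′) tuples:</p>
        <preformat>
from collections import defaultdict

def mle_transition_model(dataset):
    """Estimate P_hat(s' | s, a) = N_D(s, a, s') / N_D(s, a) from a list of
    (s, a, r, s_next) tuples collected by the baseline policy pi_0."""
    pair_counts = defaultdict(int)    # N_D(s, a)
    triple_counts = defaultdict(int)  # N_D(s, a, s')
    for s, a, _r, s_next in dataset:
        pair_counts[(s, a)] += 1
        triple_counts[(s, a, s_next)] += 1
    # Normalize visit counts into conditional probabilities.
    return {
        (s, a, s_next): n / pair_counts[(s, a)]
        for (s, a, s_next), n in triple_counts.items()
    }
        </preformat>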
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Safe Policy Improvement with Baseline Bootstrapping</title>
        <p>
          The SPIBB [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] algorithm reformulates the percentile criterion and aims to maximize the policy's
performance in the estimated MDP M̂, guaranteeing that the improved policy π_I is ζ-approximately at
least as good as the baseline policy π_0:
max_{π ∈ Π} ρ(π, M̂), s.t. ∀M ∈ Ξ, ρ(π, M) ≥ ρ(π_0, M) − ζ,
within the set of admissible MDPs Ξ:
Ξ(M̂, e) = {M ∣ ∀(s, a) ∈ S × A, ‖P(s, a, ⋅) − P̂(s, a, ⋅)‖_1 ≤ e(s, a)},
where e : S × A → ℝ is an error function depending on δ and D. Based on Theorem 8 from Petrik et al.
[21], SPIBB guarantees that if all state-action pair counts N_D(s, a) meet the condition
N_D(s, a) ≥ N_∧ = (8 V_max² / (ζ² (1 − γ)²)) log(2|S||A|2^|S| / δ),
and M̂ is the Maximum Likelihood Estimation MDP, then with high probability 1 − δ the optimal
policy π* = argmax_{π ∈ Π} ρ(π, M̂) in M̂ is ζ-approximately safe in the true environment M*:
ρ(π*, M*) ≥ ρ(π*, M̂) − ζ ≥ ρ(π_0, M*) − ζ.</p>
        <p>These conditions ensure that performance estimates in M̂ generalize safely to M*. To implement this,
SPIBB splits state-action pairs into two subsets: the bootstrapped set ℬ = {(s, a) : N_D(s, a) &lt; N_∧},
which includes state-action pairs that occur fewer than N_∧ times in D, and the non-bootstrapped set
ℬ̄ = {(s, a) : N_D(s, a) ≥ N_∧}, which includes state-action pairs that occur at least N_∧ times in D.
Hereafter, we assume that the terms baseline policy and behavior policy can be used interchangeably.</p>
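        <p>As a companion to the definitions above, the following sketch (illustrative names; the threshold is a direct transcription of the sufficient condition stated above) computes N_∧ and splits the state-action space into bootstrapped and non-bootstrapped pairs; pairs never observed in D have count 0 and are therefore bootstrapped:</p>
        <preformat>
import math
from collections import defaultdict

def n_wedge(v_max, zeta, gamma, n_states, n_actions, delta):
    """Count threshold N_wedge = 8 V_max^2 / (zeta^2 (1 - gamma)^2)
    * log(2 |S| |A| 2^|S| / delta), written with the log expanded for
    numerical stability."""
    return (8 * v_max ** 2) / (zeta ** 2 * (1 - gamma) ** 2) * (
        math.log(2 * n_states * n_actions / delta) + n_states * math.log(2))

def split_state_action_pairs(dataset, states, actions, threshold):
    """Split S x A into bootstrapped (count below threshold) and
    non-bootstrapped (count at least threshold) pairs."""
    counts = defaultdict(int)
    for s, a, _r, _s_next in dataset:
        counts[(s, a)] += 1
    bootstrapped = {(s, a) for s in states for a in actions
                    if counts[(s, a)] &lt; threshold}
    non_bootstrapped = {(s, a) for s in states for a in actions
                        if counts[(s, a)] >= threshold}
    return bootstrapped, non_bootstrapped
        </preformat>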
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>We developed three scalable SPI algorithms, two for single-agent domains [22] and one for multi-agent
domains [23]. The main ideas and contributions are explained below.</p>
      <sec id="sec-3-1">
        <title>3.1. Safe Policy Improvement for single-agent systems</title>
        <p>
          MCTS-SPIBB. The first algorithm is a Monte Carlo Tree Search [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] extension of SPIBB [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. As
MCTS can approximate optimal policies generated by policy iteration, MCTS-SPIBB can approximate
Π_0-optimal policies generated by SPIBB, starting from a baseline π_0. Since MCTS-SPIBB computes the
policy online and locally, it can scale to larger state spaces than SPIBB.
        </p>
        <p>The core idea of MCTS-SPIBB is to extend UCT [24] while considering the safety constraint on
action selection. This presents several challenges, such as the fact that UCT selects actions based on
Q-values while the safety constraint is on action selection probabilities, and this constraint's impact
accumulates throughout the layers of the Monte Carlo tree. For a given state s in the tree, actions are
divided into two categories: bootstrapped state-action pairs (s, a) ∈ ℬ and non-bootstrapped pairs
(s, a) ∈ ℬ̄. When the simulation reaches state s, a bootstrapped action is selected with probability
p_ℬ = Σ_{a ∈ ℬ(s)} π_0(s, a), where ℬ(s) represents the set of bootstrapped actions for state s, while a
non-bootstrapped action is selected with probability p_ℬ̄ = 1 − p_ℬ. If a bootstrapped action is selected,
it is chosen according to the probability distribution of the baseline policy π_0(s, ⋅). If a non-bootstrapped
action is chosen, it is selected using the UCT strategy, which considers current Q-value estimates and
visit counts, ensuring that the optimal action is chosen given enough simulations. During the rollout
phase, baseline probabilities are applied to bootstrapped actions, while non-bootstrapped actions are
selected uniformly. At the end of the simulations, the estimated Q-values Q(s, a) for the root state s
are used to compute the probabilities of the improved policy π°(s, a) as follows: i) π_0(s, a) if a ∈ ℬ(s),
ii) 1 − p_ℬ if a = argmax_{a′ ∈ ℬ̄(s)} Q(s, a′), and iii) 0 otherwise. The proposed action selection strategy
integrates UCT and baseline probabilities in the MCTS, allowing the generation of improved policies
with probabilistic guarantees on the improvement.</p>
        <p>The complexity of MCTS-SPIBB scales linearly with the number of Monte Carlo simulations n,
namely, it is O(n). This means that the computational effort required by MCTS-SPIBB is directly
proportional to the number of simulations performed. Each simulation in MCTS-SPIBB is used to
estimate the value of different actions by exploring potential future states. As the number of simulations
increases, the accuracy of the action-value estimates improves, resulting in a corresponding linear
increase in the computational complexity.</p>
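        <p>The tree-policy step described above can be summarized by the following sketch (a simplified illustration under our own assumptions about the node data structure, not the actual MCTS-SPIBB code): bootstrapped actions keep the probability mass assigned to them by the baseline, while the remaining mass is allocated by UCT among non-bootstrapped actions.</p>
        <preformat>
import math
import random

def select_action(node, baseline_policy, bootstrapped_actions, c_uct):
    """One simulation step of the safe action selection sketched above.
    node is assumed to expose .state, .actions, .visits and .q."""
    state = node.state
    b_actions = list(bootstrapped_actions(state))            # B(s)
    p_b = sum(baseline_policy(state, a) for a in b_actions)  # mass of B(s) under pi_0

    if b_actions and random.random() &lt; p_b:
        # Bootstrapped branch: sample proportionally to the baseline policy.
        weights = [baseline_policy(state, a) for a in b_actions]
        return random.choices(b_actions, weights=weights)[0]

    # Non-bootstrapped branch: standard UCT over the remaining actions.
    nb_actions = [a for a in node.actions if a not in b_actions]
    total_visits = sum(node.visits[a] for a in nb_actions) + 1

    def uct_score(a):
        if node.visits[a] == 0:
            return float("inf")
        return node.q[a] + c_uct * math.sqrt(math.log(total_visits) / node.visits[a])

    return max(nb_actions, key=uct_score)
        </preformat>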
        <p>Scalable Dynamic Programming SPIBB (SDP-SPIBB). SPIBB uses policy iteration to generate
the improved policy. This algorithm has complexity O(|S|²|A|²) due to the four nested loops over
states, actions, next states, and next actions required to update the Q-function. However, in SPI, the
value updates can be performed only on state-action pairs where MLE transition probabilities are
non-zero, which correspond to state-action pairs observed a sufficient number of times in the dataset
of trajectories. This observation allows us to reduce the complexity of SPIBB from O(|S|²|A|²) to a
term depending only on the dataset size |D|, and in particular on the size of the set of non-bootstrapped
state-action pairs ℬ̄, i.e., the pairs observed at least N_∧ times in the dataset. This change yields a
substantial increase in scalability because the dataset size is usually much smaller than
|S|²|A|² in large domains. However, since the complexity still depends on the dataset size, it can
grow in applications where huge amounts of data are collected over time (e.g., streaming data).</p>
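        <p>A minimal sketch of the restricted value update behind SDP-SPIBB is given below (data layout and names are illustrative assumptions): only non-bootstrapped pairs and successor states with non-zero MLE probability are visited, so the cost of a sweep depends on the observed transitions rather than on |S|²|A|².</p>
        <preformat>
def restricted_q_update(q, p_hat, reward, policy, gamma, non_bootstrapped):
    """One sweep of Q-value updates restricted to observed transitions.
    p_hat maps (s, a) to a dict {s_next: probability} with non-zero entries only;
    policy maps s to a dict {a: probability}."""
    new_q = dict(q)
    for (s, a) in non_bootstrapped:
        expected_next = 0.0
        for s_next, prob in p_hat[(s, a)].items():
            next_policy = policy.get(s_next, {})  # unseen states contribute value 0
            v_next = sum(pi_a * q.get((s_next, a_next), 0.0)
                         for a_next, pi_a in next_policy.items())
            expected_next += prob * v_next
        new_q[(s, a)] = reward[(s, a)] + gamma * expected_next
    return new_q
        </preformat>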
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Safe Policy Improvement for multi-agent systems</title>
        <p>FV-MCTS-SPIBB. Let the Factored Multi-agent MDP (FMMDP) M* = ⟨N, S, {A_i}_{i ∈ N}, P*, R, γ⟩ [25]
represent the true environment, where N is a set of agents and A_i is the set of actions of agent i. A central
behavior policy π_0 is executed in this environment controlling all agents, and a dataset D of trajectories
is collected. Each sample in D consists of a joint state, joint action, joint next state, and joint reward,
denoted as (s, ā, s′, r̄). This dataset is then used to compute the MLE FMMDP M̂ = ⟨N, S, {A_i}_{i ∈ N}, P̂, R, γ⟩,
where the transition model P̂ is factorized according to dependency functions f_i. We define the set of
bootstrapped joint state-action pairs as
ℬ = {(s, ā) ∈ S × Ā ∣ ∃i : N_D(f_i(s, ā)) &lt; N_∧}
and the set of non-bootstrapped joint state-action pairs as
ℬ̄ = {(s, ā) ∈ S × Ā ∣ ∀i : N_D(f_i(s, ā)) ≥ N_∧}.</p>
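        <p>The agent-level counting that underlies these definitions can be sketched as follows (an illustrative check under our own naming assumptions, not the original implementation): a joint pair is bootstrapped as soon as any factored component is under-observed.</p>
        <preformat>
def is_bootstrapped(joint_state, joint_action, scopes, local_counts, n_wedge):
    """scopes[i] extracts the local (state, action) variables of component i;
    local_counts[i] maps local pairs to their counts in the dataset D."""
    for i, scope in enumerate(scopes):
        local_pair = scope(joint_state, joint_action)
        if local_counts[i].get(local_pair, 0) &lt; n_wedge:
            return True   # at least one under-observed component: bootstrap
    return False          # every component sufficiently observed
        </preformat>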
        <p>FV-MCTS-SPIBB is based on the percentile criterion, adapted to FMMDPs, and achieves scalability
in multi-agent systems by leveraging the factorization of the action-value function induced by a
coordination graph (CG) G = ⟨V, ℰ⟩ [26, 27], where each agent is represented as a node i ∈ V and each
pair of agents that needs to coordinate is represented by an edge (i, j) ∈ ℰ. The action-value function is
decomposed as
Q(ā) = Σ_{i ∈ V} Q_i(a_i) + Σ_{(i, j) ∈ ℰ} Q_{ij}(a_i, a_j),
providing an action-value function for each agent and each edge of the graph. This decomposition
greatly reduces the number of joint actions to consider at each state in MCTS. To select among
non-bootstrapped actions, FV-MCTS-SPIBB uses two novel action selection strategies: Constrained Max-Plus
and Constrained Variable Elimination, both of which guarantee that the selected actions are optimal
(despite a large number of possible joint actions) and that the resulting policy safely improves upon the
behavior policy. FV-MCTS-SPIBB extends the Factored SPIBB approach from [28] by considering local
actions a_i associated with the components f_i, rather than using joint actions ā.</p>
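        <p>For illustration, evaluating the factored action-value function above for a single joint action reduces to summing local terms over the nodes and edges of the coordination graph (a small sketch with our own data layout); this is the quantity that Constrained Max-Plus and Constrained Variable Elimination maximize subject to the SPIBB constraint.</p>
        <preformat>
def factored_q_value(joint_action, node_q, edge_q, edges):
    """Evaluate Q(a) = sum_i Q_i(a_i) + sum_{(i,j) in E} Q_ij(a_i, a_j).
    joint_action maps agent -> local action, node_q maps agent -> {action: value},
    edge_q maps (i, j) -> {(a_i, a_j): value}."""
    value = sum(node_q[i][joint_action[i]] for i in node_q)
    value += sum(edge_q[(i, j)][(joint_action[i], joint_action[j])]
                 for (i, j) in edges)
    return value
        </preformat>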
        <p>This modification exploits the factorization of the transition model, allowing state-action counts
to be taken at the agent level, which results in larger counts than other SPIBB variants that rely on joint
actions. It requires the transition function to be factorizable, and the larger counts lead to greater potential
for improving the behavior policy. Flat approaches in the SPI literature count state-action pairs at the joint
state and joint action levels, producing a smaller set of non-bootstrapped joint state-action pairs ℬ̄ compared
to our approach, which counts state-action pairs at local levels. Full details about the FV-MCTS-SPIBB
algorithm can be found in [23]. FV-MCTS-SPIBB's complexity depends on the action selection strategy.
In the case of Max-Plus, the complexity is linear in the size of the CG, and guarantees of convergence to
optimality are provided for acyclic CGs. On cyclic CGs, these guarantees do not hold, but empirically
Max-Plus provides approximately optimal results even on cyclic structures. The complexity of the
Variable Elimination algorithm is exponential in the treewidth, a parameter related to the graph's cyclicity.
The algorithm guarantees convergence for any type of CG, but finding the treewidth of a graph is a
difficult (NP-hard) problem, although it can be easily estimated with Depth-First Search (DFS). Therefore,
it is possible to decide whether to use Max-Plus or Variable Elimination by evaluating the treewidth of the
graph.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Empirical analysis</title>
      <sec id="sec-4-1">
        <title>4.1. Benchmark</title>
        <p>In this section, we present results focusing on the scalability and safety of our proposed methods in the
multi-agent SysAdmin domain [26, 27].</p>
        <p>In the multi-agent SysAdmin domain, each agent controls a machine characterized by two state variables:
a status, which can be good, faulty, or dead, and a load, which can be idle, loaded, or success, both initially
set to good and idle. At each step, agents can activate their machines or do nothing, aiming to achieve
(good, success) states for rewards. Although the coordination graph is static, complexity arises from
reasoning about joint actions and their network-wide efects. Poor coordination can result in suboptimal
outcomes, such as unnecessary simultaneous reboots. As the number of agents n increases, the size
of the state and action spaces grows exponentially. Specifically, |S| = 9ⁿ and
|A| = 2ⁿ, where |S| is the number of possible joint states and |A| is the number of possible joint actions.</p>
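        <p>For instance, with n = 32 agents these formulas give |S| = 9³² ≈ 3 × 10³⁰ joint states and |A| = 2³² ≈ 4 × 10⁹ joint actions, the orders of magnitude referenced in Section 4.3.</p>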
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Experimental overview</title>
        <p>
          Among the SPI methods tested, only MCTS-SPIBB, SDP-SPIBB, and FV-MCTS-SPIBB can handle the
problem, since SPIBB [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] and other SPI approaches [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] cannot effectively scale to such large domains.
Figure 1 provides box-plots of the average return ρ(π, M*) (y-axis) as the number of agents increases from
4 to 32. For FV-MCTS-SPIBB-Max-Plus and FV-MCTS-SPIBB-Var-El, we use the following parameters:
100 simulations, an empirically determined exploration constant set equal to the number of agents,
an MCTS tree depth of 20 steps, γ = 0.9, and 8 iterations of message passing in Constrained
Max-Plus. For MCTS-SPIBB, similar parameters are used, but 10,000 simulations are required, as it does
not leverage model factorization and needs more simulations.
        </p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results on scalability</title>
        <p>With 4 agents, all methods show improvements over the behavior policy π_0 (orange box).
FV-MCTS-SPIBB-Var-El (green box) slightly outperforms FV-MCTS-SPIBB-Max-Plus (red box), with both methods
outperforming MCTS-SPIBB (blue box) and SDP-SPIBB (steel blue). With 8 agents,
FV-MCTS-SPIBB-Max-Plus and FV-MCTS-SPIBB-Var-El provide significant and similar improvements over the behavior
policy, but the performance gap between these methods and MCTS-SPIBB widens (i.e., the FV-MCTS-SPIBB
methods achieve around 22.0, while MCTS-SPIBB and SDP-SPIBB score around 15.0, and the behavior
policy reaches approximately 13.0). With 16 agents, FV-MCTS-SPIBB-Var-El cannot compute actions
within a reasonable time due to the exponential complexity of Var-El, which is tied to the induced
width of the CG and the elimination order. MCTS-SPIBB and SDP-SPIBB show some improvement
over the baseline, but FV-MCTS-SPIBB-Max-Plus outperforms them. With 24 and 32 agents, only
FV-MCTS-SPIBB-Max-Plus can improve performance compared to the behavior policy, as MCTS-SPIBB
and SDP-SPIBB break down due to the exponential number of available actions. This experiment shows that
FV-MCTS-SPIBB-Max-Plus is the only approach able to scale to large multi-agent domains, in which
the state and action spaces become huge because they grow exponentially with the
number of agents (e.g., multi-agent SysAdmin with 32 agents has about 10³⁰ possible joint states and 10⁹
possible joint actions).</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we discuss approaches for scalable, safe policy improvement in both single-agent and
multi-agent systems. We introduced two novel Safe Policy Improvement methods for single-agent
systems targeting large-scale problems. The first method, MCTS-SPIBB, is an MCTS-based approach, and
the second, SDP-SPIBB, is based on dynamic programming. For multi-agent systems, we introduced
FV-MCTS-SPIBB, an extension of MCTS-SPIBB that scales by assuming a factorized transition model and
Q-function. Our empirical evaluation, conducted in a large-scale benchmark domain, shows that both
SDP-SPIBB and MCTS-SPIBB can scale and achieve policy improvement in single-agent scenarios where
other state-of-the-art SPI algorithms cannot work. FV-MCTS-SPIBB outperforms all other algorithms
in multi-agent scenarios. By addressing the computational limitations of current SPI algorithms,
particularly SPIBB methods, this work expands the range of problems that can be safely addressed with
reinforcement learning, contributing to developing more reliable AI systems.</p>
      <p>While these contributions represent significant advancements over state-of-the-art SPI methods, they also
raise important directions for future research. An open challenge in this context is to provide theoretical
guarantees on policy improvement when the policy is approximated by a general function (e.g., a neural
network). In particular, incorporating function approximation techniques, such as linear models or deep
neural networks, into our SPI algorithms may enhance scalability by addressing bottlenecks related
to the space complexity associated with large-scale problems. Furthermore, applying the proposed
methodologies to real-world systems, such as autonomous vehicles or robotic platforms, would provide
valuable insights into their practical utility and limitations.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This paper has been prepared as a part of a collaboration between the University of Verona and Leonardo
Labs, belonging to Leonardo SpA.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>A Comprehensive Survey on Safe Reinforcement Learning</article-title>
          ,
          <source>JMLR</source>
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , Reinforcement Learning: An Introduction, second ed., The MIT Press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cacace</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Caccavale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Finzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Grieco</surname>
          </string-name>
          ,
          <article-title>Combining human guidance and structured task execution during physical human-robot collaboration</article-title>
          ,
          <source>Journal of Intelligent Manufacturing</source>
          <volume>34</volume>
          (
          <year>2022</year>
          )
          <fpage>3053</fpage>
          -
          <lpage>3067</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>De Benedictis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Beraldo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Cortellessa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fracasso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cesta</surname>
          </string-name>
          ,
          <article-title>A transformer-based approach for choosing actions in social robotics</article-title>
          ,
          <source>in: Social Robotics</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>198</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zuccotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Torre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning applications in environmental sustainability: a review</article-title>
          ,
          <source>Artificial Intelligence Review</source>
          <volume>57</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>68</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Meli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Learning logic specifications for policy guidance in POMDPs: an inductive logic programming approach</article-title>
          ,
          <source>Journal of Artificial Intelligence Research (JAIR) 79</source>
          (
          <year>2024</year>
          )
          <fpage>725</fpage>
          -
          <lpage>776</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Cipollone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Exploiting multiple abstractions in episodic RL via reward shaping</article-title>
          ,
          <source>Proceedings AAAI Conference on Artificial Intelligence</source>
          <volume>37</volume>
          (
          <year>2023</year>
          )
          <fpage>7227</fpage>
          -
          <lpage>7234</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Risk-aware shielding of Partially Observable Monte Carlo Planning policies</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>324</volume>
          (
          <year>2023</year>
          )
          <fpage>103987</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Mazzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Active generation of logical rules for POMCP shielding</article-title>
          ,
          <source>in: Proceedings AAMAS</source>
          <year>2022</year>
          , IFAAMAS,
          <year>2022</year>
          , pp.
          <fpage>1696</fpage>
          -
          <lpage>1698</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Scholl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dietrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Otte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Udluft</surname>
          </string-name>
          ,
          <article-title>Safe policy improvement approaches and their limitations</article-title>
          ,
          <source>in: Agents and Artificial Intelligence</source>
          , Springer International Publishing, Cham,
          <year>2022</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>98</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <article-title>Offline reinforcement learning: tutorial, review, and perspectives on open problems</article-title>
          , arXiv preprint arXiv:2005.01643,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Prudencio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Maximo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Colombini</surname>
          </string-name>
          ,
          <article-title>A survey on offline reinforcement learning: Taxonomy, review, and open problems</article-title>
          ,
          <source>IEEE Trans. Neural Networks and Learning Systems</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Meger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Precup</surname>
          </string-name>
          ,
          <article-title>Off-policy deep reinforcement learning without exploration</article-title>
          ,
          <source>in: Proceedings of the 36th International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>2052</fpage>
          -
          <lpage>2062</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Soh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Tucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <article-title>Stabilizing off-policy Q-learning via bootstrapping error reduction</article-title>
          ,
          <source>in: Proceedings of the 33rd Conference on Neural Information Processing Systems (NeurIPS)</source>
          , Curran Ass. Inc.,
          <year>2019</year>
          , pp.
          <fpage>11761</fpage>
          -
          <lpage>11771</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Laroche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Trichelair</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tachet Des Combes</surname>
          </string-name>
          ,
          <article-title>Safe policy improvement with baseline bootstrapping</article-title>
          ,
          <source>in: Proceedings 36th International Conference on Machine Learning (ICML)</source>
          , PMLR,
          <year>2019</year>
          , pp.
          <fpage>3652</fpage>
          -
          <lpage>3661</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Browne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Powley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Whitehouse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. I.</given-names>
            <surname>Cowling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rohlfshagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tavener</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Liebana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Samothrakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Colton</surname>
          </string-name>
          ,
          <article-title>A survey of Monte Carlo tree search methods</article-title>
          ,
          <source>IEEE Transactions on Computational Intelligence and AI in Games</source>
          <volume>4</volume>
          (
          <year>2012</year>
          )
          <fpage>1</fpage>
          -
          <lpage>43</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Chalkiadakis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Influence of State-Variable Constraints on Partially Observable Monte Carlo Planning</article-title>
          ,
          <source>in: Proceedings of the 28th International Joint Conference on Artificial Intelligence</source>
          , IJCAI,
          <source>International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>5540</fpage>
          -
          <lpage>5546</lpage>
          . doi:10.24963/ijcai.2019/769.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zuccotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Piccinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Learning state-variable relationships in pomcp: A framework for mobile robots</article-title>
          ,
          <source>Frontiers in Robotics and AI</source>
          <volume>9</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zuccotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Fusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Castellini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Online model adaptation in Monte Carlo tree search planning</article-title>
          ,
          <source>Optimization and Engineering</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>E.</given-names>
            <surname>Delage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mannor</surname>
          </string-name>
          ,
          <article-title>Percentile optimization for Markov Decision Processes with parameter uncertainty</article-title>
          ,
          <source>Operations Research</source>
          <volume>58</volume>
          (
          <year>2010</year>
          )
          <fpage>203</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[21] M. Petrik, M. Ghavamzadeh, Y. Chow, Safe policy improvement by minimizing robust baseline regret, in: Proceedings of the 30th International Conference on Neural Information Processing Systems (NIPS 2016), Curran Associates Inc., Red Hook, NY, USA, 2016, pp. 2306-2314.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[22] A. Castellini, F. Bianchi, E. Zorzi, T. D. Simão, A. Farinelli, M. T. J. Spaan, Scalable safe policy improvement via Monte Carlo tree search, in: Proceedings of the 40th International Conference on Machine Learning (ICML 2023), PMLR, 2023, pp. 3732-3756.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[23] F. Bianchi, E. Zorzi, A. Castellini, T. D. Simão, M. T. J. Spaan, A. Farinelli, Scalable safe policy improvement for factored multi-agent MDPs, in: Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, 2024, pp. 3952-3973.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[24] L. Kocsis, C. Szepesvári, Bandit based Monte-Carlo planning, in: Proceedings of the 17th European Conference on Machine Learning (ECML 2006), Springer-Verlag, 2006, pp. 282-293.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[25] C. Boutilier, Planning, learning and coordination in multi-agent decision processes, in: Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge (TARK 1996), Morgan Kaufmann Publishers Inc., 1996, pp. 195-210.</mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>[26] S. Choudhury, J. K. Gupta, P. Morales, M. J. Kochenderfer, Scalable anytime planning for multi-agent MDPs, in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS 2021), IFAAMAS, 2021, pp. 341-349.</mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>[27] C. Guestrin, D. Koller, R. Parr, S. Venkataraman, Efficient solution algorithms for factored MDPs, Journal of Artificial Intelligence Research 19 (2003) 399-468.</mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>[28] T. D. Simão, M. T. J. Spaan, Safe policy improvement with baseline bootstrapping in factored environments, in: Proceedings of the AAAI Conference on Artificial Intelligence, AAAI Press, 2019, pp. 4967-4974.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>