<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Ital-IA</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Explaining Reinforcement Learning Policies for Power Grid Operations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Marzari</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesco Leofante</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Enrico Marchesini</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Massachusetts Institute of Technology</institution>
          ,
          <addr-line>Cambridge (MA)</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Verona</institution>
          ,
          <addr-line>Verona</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>5</volume>
      <fpage>23</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>Reinforcement learning (RL) offers significant potential for improving decision-making in power grid operations by enabling adaptive and scalable control through interaction with these complex systems. However, real-world deployment of RL in this domain faces key challenges, including uncertainty in system dynamics, the need to achieve long-term objectives, and strict physical and safety constraints. Moreover, the black-box nature of deep RL models limits interpretability, making it difficult to trust their deployment in safety-critical power grid applications. Overcoming these obstacles requires close collaboration with system operators to develop RL methods that are not only effective but also transparent and reliable. In this work, we present our recent advances in applying RL to power grids and highlight the importance of combining RL algorithms with explainable artificial intelligence techniques to enable safe, interpretable, and trustworthy control solutions for power grid operations.</p>
      </abstract>
      <kwd-group>
        <kwd>Explainable AI</kwd>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Power Grid</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Power grid operations are undergoing rapid transformation to support global decarbonization goals.
This transition demands greater operational flexibility, enhanced reliability, and large-scale integration
of variable renewable energy (VRE) sources. One key strategy enabling this shift and highlighted
by transmission system operators (TSOs) is topology optimization, a cost-effective control method
that dynamically reconfigures grid connectivity to alleviate congestion, handle contingencies (i.e.,
unexpected disruptions), and improve overall system security [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Another category of actions involves
modifying power flows by redispatching or curtailing the output of fossil and renewable generators.
These modifications are often costly, as they disrupt third-party operations and may incur additional
costs. However, traditional optimization solvers based on these interventions often struggle to manage
the growing variability introduced by VRE [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        Reinforcement learning (RL) is gaining traction as a promising approach to automate real-time
control in power systems. It has shown strong performance in complex, sequential decision-making
tasks across domains such as games, robotics, and physics-based environments [
        <xref ref-type="bibr" rid="ref3 ref4 ref5">3, 4, 5</xref>
        ]. Despite these
advances, several fundamental challenges continue to limit the real-world deployment of RL—such as
managing complex system dynamics and inherent uncertainty, achieving long-term objectives, and
adhering to strict physical constraints [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Power grids exemplify many of these issues, which remain
open problems in the RL community. As such, studying realistic power grid tasks through the lens of
RL presents a unique opportunity to drive progress in both critical infrastructure management and RL
research. Yet, development in this area is slowed by the absence of standardized benchmarks that can
guide progress and generate actionable insights for tackling real-world problems.
      </p>
      <p>
        Our current work, RL2Grid [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], introduces the first reinforcement learning benchmark specifically
tailored to realistic power grid operations, developed in collaboration with leading transmission system
operators (TSOs). The benchmark is designed to drive progress in grid control and advance the
development of RL methods by offering a standardized suite of increasingly complex tasks. These tasks
reflect the real-world challenges of power grid management, including the combinatorial complexity of
the action space typical in grid operations. Figure 1 depicts a simplified power grid setup with four
substations connected by transmission lines (edges), where each substation contains buses linked to
two generators and two loads. Generators supply electricity to meet demand, and power is transmitted
across the network, incurring losses due to line resistance. Substations, each comprising multiple buses,
can partially control power routing by switching connections. Every component in this system is
subject to physical constraints: generators have ramp rate limits that restrict sudden changes in output,
and transmission lines have thermal limits, where sustained overloads can lead to damage or forced
disconnection. Thus, advancing RL algorithms on top of RL2Grid to tackle the unique challenges of
power grid operations has the potential to significantly benefit both RL research and deployment.
      </p>
      <p>
        However, successfully deploying RL in these complex physical systems requires close collaboration
with system operators and the development of solutions that are not only effective but also safe,
transparent, and trustworthy. Achieving this is particularly difficult due to the black-box nature of deep
neural networks, which underpin most scalable RL methods. Regarding safety, our ongoing research on
formal and probabilistic verification for deep neural networks (FV) [
        <xref ref-type="bibr" rid="ref10 ref11 ref8 ref9">8, 9, 10, 11, 12</xref>
        ] led us to design novel
safe RL algorithms, detailed in Section 3. On top of this, we aim to make RL models output decisions
that are intelligible (transparent) and thus acceptable to system operators, making explainability a
critical requirement for deployment. Our future research, detailed in Section 4, will focus on explainable
AI (XAI) techniques tailored to RL, with the goal of improving the interpretability of learned policies
[13, 14, 15].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. RL for Power Grid Operations</title>
      <p>
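        We model power grid operation tasks as Markov decision processes (MDPs), each described by a tuple
<inline-formula><tex-math>$(\mathcal{S}, \mathcal{A}, \mathcal{P}, \rho, R, \gamma)$</tex-math></inline-formula>;
<inline-formula><tex-math>$\mathcal{S}$</tex-math></inline-formula> and <inline-formula><tex-math>$\mathcal{A}$</tex-math></inline-formula> are the finite sets of states and actions, respectively,
<inline-formula><tex-math>$\mathcal{P} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0, 1]$</tex-math></inline-formula> is the state transition probability distribution,
<inline-formula><tex-math>$\rho : \mathcal{S} \to [0, 1]$</tex-math></inline-formula> is the initial uniform state distribution,
<inline-formula><tex-math>$R : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$</tex-math></inline-formula> is a reward function, and
<inline-formula><tex-math>$\gamma \in [0, 1)$</tex-math></inline-formula> is the discount factor. In policy optimization algorithms [16], agents
learn a parameterized stochastic policy <inline-formula><tex-math>$\pi_\theta : \mathcal{S} \times \mathcal{A} \to [0, 1]$</tex-math></inline-formula>, modeling the probability of taking an action
<inline-formula><tex-math>$a_t \in \mathcal{A}$</tex-math></inline-formula> in a state <inline-formula><tex-math>$s_t \in \mathcal{S}$</tex-math></inline-formula> at a certain step <inline-formula><tex-math>$t$</tex-math></inline-formula>. The agent gets a reward for its actions, and the goal is to
find the parameters <inline-formula><tex-math>$\theta$</tex-math></inline-formula> that maximize the expected discounted reward
<disp-formula><tex-math>$$J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right],$$</tex-math></disp-formula>
where <inline-formula><tex-math>$\tau := (s_0, a_0, s_1, a_1, \dots)$</tex-math></inline-formula> is a trajectory with
<inline-formula><tex-math>$s_0 \sim \rho(s_0)$</tex-math></inline-formula>, <inline-formula><tex-math>$a_t \sim \pi_\theta(a_t \mid s_t)$</tex-math></inline-formula>, and <inline-formula><tex-math>$s_{t+1} \sim \mathcal{P}(s_{t+1} \mid s_t, a_t)$</tex-math></inline-formula>.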
      </p>
      <p>
        Problem setup. Following this standard MDP formalization, our RL benchmark addresses power grid
operation through two types of actions:
• Topology (discrete actions). Electrical devices (e.g., loads, generators, batteries) are connected to
one of two buses within each substation. These discrete actions involve selecting substations
where a bus-split reconfiguration can help mitigate contingencies (unplanned events that disrupt
normal grid behavior) by altering how components are interconnected.
• Redispatch or curtailment (continuous actions). These continuous actions involve modifying power
flows by redispatching fossil generators or curtailing output from renewable sources to maintain
system stability.
      </p>
      <p>These actions are conditioned on the state of the power grid, a feature vector including active and
reactive power injections, charge levels, and maximum productions.<sup>1</sup></p>
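      <p>As a small illustration of the expected discounted reward that the policy maximizes, the sketch below computes the discounted return of one finite trajectory and averages it over several sampled trajectories. It is not part of RL2Grid: the function names are hypothetical, and trajectories are assumed to be given simply as lists of per-step rewards.</p>
      <preformat><![CDATA[
```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float) -> float:
    """Return sum_t gamma**t * rewards[t] for one finite trajectory."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    total = 0.0
    # Accumulate backwards so that each step needs a single multiply-add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


def estimate_objective(trajectories: Sequence[Sequence[float]], gamma: float) -> float:
    """Monte Carlo estimate of the objective: mean discounted return."""
    returns = [discounted_return(rewards, gamma) for rewards in trajectories]
    return sum(returns) / len(returns)
```
]]></preformat>
      <p>In practice the trajectories would come from rolling out the current policy in the grid environment; a finite horizon truncates the infinite sum in the objective.</p>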
      <p>To encourage the agent to keep the grid operational for as long as possible while minimizing (i)
overloads and disconnections of transmission lines and (ii) economic costs related to redispatching, the
environment returns a reward signal that combines three weighted components:
• survive: A constant value awarded at each step to encourage sustained grid operation.
• overload: Penalizes line overloads and disconnections, and rewards available line capacity based
on the difference between line flows and capacity limits. Disconnected lines incur a fixed penalty.
This term is normalized to [−1, 1].
• cost: Penalizes redispatching or curtailment actions based on deviations from planned dispatch
schedules and energy losses. This component is normalized to [−1, 0].</p>
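      <p>The overall reward is computed as a weighted sum <inline-formula><tex-math>$R = \alpha \cdot \text{survive} + \beta \cdot \text{overload} + \eta \cdot \text{cost}$</tex-math></inline-formula>, where <inline-formula><tex-math>$\alpha$</tex-math></inline-formula>, <inline-formula><tex-math>$\beta$</tex-math></inline-formula>,
and <inline-formula><tex-math>$\eta$</tex-math></inline-formula> are weights tuned in an initial grid search and set to 1.0, 0.5, and 0.5, respectively. The goal is thus to
keep the grid operational over a long horizon (typically, a month of operations divided into 5-minute
intervals, in each of which the agent executes an action).</p>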
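      <p>The weighted sum can be sketched in a few lines. The snippet below is a minimal illustration and not the benchmark's implementation: the names RewardWeights and combined_reward are hypothetical, the default weights are the values 1.0, 0.5, and 0.5 reported in the text, and the range checks mirror the stated normalization of the overload and cost terms.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Weights of the three reward components (values reported in the text)."""
    survive: float = 1.0
    overload: float = 0.5
    cost: float = 0.5


def combined_reward(
    survive: float,
    overload: float,
    cost: float,
    weights: RewardWeights = RewardWeights(),
) -> float:
    """Weighted sum of the survive, overload, and cost components."""
    if not -1.0 <= overload <= 1.0:
        raise ValueError("overload term is expected in [-1, 1]")
    if not -1.0 <= cost <= 0.0:
        raise ValueError("cost term is expected in [-1, 0]")
    return (
        weights.survive * survive
        + weights.overload * overload
        + weights.cost * cost
    )
```
]]></preformat>
      <p>For example, with an assumed survive value of 1.0, an overload term of 0.5, and a cost term of −0.25, the step reward is 1.0 + 0.25 − 0.125 = 1.125.</p>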
      <p>Sample training environment. While the RL2Grid benchmark
offers close to 100 tasks characterized by different power grids,
action spaces, and customizable configurations, in this work we
focus on an explanatory grid with 6 substations. Figure 2 shows this
setup, on which we test well-known RL baselines. The power grid
includes 6 substations (blue circles), 3 loads (yellow triangles), and 4
generators (green pentagons). For simplicity, we only consider the
case where the agent controls power injections and curtailments
using a 6-dimensional continuous action space.</p>
      <p>Figure 2: Explanatory power grid used in our experiments. Legend: Substation, Generator, Load.</p>
      <p>Data collection is performed on Xeon E5-2650 CPU nodes with
64 GB of RAM, using the popular PPO and TRPO algorithms
(widely adopted baselines in the RL community [18]) as well as our recent ε-retrain versions (ε-PPO and ε-TRPO), where agents are penalized
for violating operational limits (more details in the next section). To this end, we also use the cost, an
auxiliary indicator function highlighting operational violations (i.e., agents get a positive cost when
transmission line flows exceed a safety limit of 95% of their operational capacity).</p>
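      <p>The cost signal can be read as a simple indicator over line loadings. The sketch below assumes that each line is summarized by a loading ratio (flow divided by capacity) and that the positive cost equals 1.0; both the function name safety_cost and the unit cost value are illustrative assumptions, while the 95% limit comes from the text.</p>
      <preformat><![CDATA[
```python
from typing import Iterable


def safety_cost(line_loadings: Iterable[float], limit: float = 0.95) -> float:
    """Indicator cost: 1.0 if any line loading ratio exceeds the limit, else 0.0.

    Each entry of line_loadings is assumed to be flow / capacity for one line.
    """
    return 1.0 if any(loading > limit for loading in line_loadings) else 0.0


def episode_cost(loadings_per_step: Iterable[Iterable[float]]) -> float:
    """Accumulated cost of an episode: number of steps with a violation."""
    return sum(safety_cost(step) for step in loadings_per_step)
```
]]></preformat>
      <p>Summing this indicator over an episode gives the accumulated cost that safe RL methods try to keep below a threshold.</p>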
      <p>Figure 3 reports the average return and cost, with standard error as shaded regions, over 50 independent
runs per method. ε-TRPO and ε-PPO are notably safer, more sample efficient, and achieve significantly
higher performance than their baseline counterparts (TRPO and PPO).</p>
    </sec>
    <sec id="sec-3">
      <title>3. Fostering Safety in RL for Power Grids</title>
      <p>
        Our research naturally extends to safe RL methods [19], since ensuring safe grid operations is key to
future deployments of the learned policies. In more detail, safe RL problems are typically modeled using
constrained MDPs [20], where an agent (or multiple agents [21]) aims at maximizing a reward signal
while keeping the accumulation of the previously mentioned cost signals under a desired threshold.
However, these constraints naturally hinder exploration, so agents can fail to learn safe, effective behaviors in
complex environments [22]. On top of these issues, deep neural networks (DNNs), which characterize
RL policies for complex, high-dimensional tasks, are known to be vulnerable to small input variations
[23, 24]. These variations can easily fool a policy into outputting an undesired (and unsafe) action. For these
reasons, FV [
        <xref ref-type="bibr" rid="ref8">8, 25</xref>
        ] tools have arisen to tackle this issue, leveraging state-action relationships (called
safety properties) to provably detect these unsafe input variations. However, applying FV during training
presents many challenges. For example, the safety properties are hand-designed by a system designer,
which may be infeasible in complex tasks, and FV is an NP-complete problem and thus computationally
intractable to apply when training RL policies [26].
        <fn id="fn1"><label>1</label><p>For the sake of clarity and brevity, we refer to RTE France [17] for an exhaustive overview of the MDP formalization, the
state and action spaces, reward, and value ranges.</p></fn>
      </p>
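      <p>For reference, the constrained formulation described above is commonly written as follows. This is a sketch of the standard constrained MDP objective rather than the exact notation of the cited works: the cost function C, the cost return J_C, and the threshold d are symbols assumed here for illustration, while R, γ, and π_θ follow Section 2.</p>
      <preformat><![CDATA[
```latex
\max_{\theta} \; J_R(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\Big]
\quad \text{s.t.} \quad
J_C(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t C(s_t, a_t)\Big] \le d
```
]]></preformat>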
      <p>We tackle the problem of fostering safety at training time by proposing a technique that
collects parts of the state space where the agent is prone to unsafe actions at training time [27].
In power grid operations, and in particular in the redispatch task of Section 2, unsafe actions
translate into critical failures caused by potential cascading failures, system instability, and
generation not meeting the demand. Hence, when training our ε policies for the grid in Figure 2,
we collect the state-action pairs that lead to grid instability for each unsafe action detected.
Inspired by human learning, where consistently repeating tasks enhances the learning
of a particular behavior, we then propose to switch between random initial training
configurations for the grid state distribution, typical of RL methods, and a retraining phase in one of
these collected potentially unsafe states. Crucially, to maintain the convergence properties
of the original algorithm, we balance the exploration between the two scenarios using a
decaying factor ε that scales down to 0 over the training.</p>
      <p>Figure 4: Pareto frontier of reward versus cost for ε-PPOLagr, ε-TRPOLagr, PPOLagr and TRPOLagr at convergence.</p>
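      <p>The switching scheme described above can be sketched as a reset rule. The code below is a simplified illustration under stated assumptions, not the implementation from [27]: it treats ε as the probability of restarting from a collected unsafe state, uses a hypothetical geometric decay schedule, and stores states in a plain list. The class name RetrainReset and its parameters are invented for this example.</p>
      <preformat><![CDATA[
```python
import random
from typing import Any, Callable, List, Tuple


class RetrainReset:
    """Choose episode start states: uniform reset or a collected unsafe state."""

    def __init__(self, epsilon_start: float = 0.5, decay: float = 0.99, seed: int = 0):
        self.epsilon = epsilon_start   # assumed probability of a retraining reset
        self.decay = decay             # assumed geometric decay towards 0
        self.unsafe_states: List[Any] = []
        self.rng = random.Random(seed)

    def record_unsafe(self, state: Any) -> None:
        """Store a state in which the agent took an unsafe action."""
        self.unsafe_states.append(state)

    def sample_initial_state(self, uniform_reset: Callable[[], Any]) -> Tuple[Any, bool]:
        """Return (start state, whether it came from the unsafe buffer)."""
        use_retrain = bool(self.unsafe_states) and self.rng.random() < self.epsilon
        if use_retrain:
            state = self.rng.choice(self.unsafe_states)
        else:
            state = uniform_reset()    # the usual random initial configuration
        self.epsilon *= self.decay     # retraining resets become rarer over time
        return state, use_retrain
```
]]></preformat>
      <p>Because ε shrinks towards 0, late training is dominated by the original initial state distribution, which is the mechanism the text relies on to keep the convergence properties of the base algorithm.</p>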
      <p>The results in Figure 3 already show how RL baselines augmented with this ε-retrain strategy significantly
improve performance and safety. Here, we investigate whether the proposed ε-retrain approach can also
improve the performance of existing safe RL methods. We summarize these additional results in
Figure 4, showing the Pareto frontier of average reward versus average cost at convergence obtained
by the Lagrangian implementations of PPO and TRPO (i.e., PPOLagr and TRPOLagr, respectively), which are
well-known safe RL algorithms [22], and their ε-retrain counterparts. These results highlight how ε retraining
an agent in potentially unstable (and unsafe) power grid configurations helps agents learn to operate the
grid effectively for longer periods of time when compared to existing Lagrangian algorithms. Notably,
our ε retraining approach applied to existing safe RL baselines results in the best trade-off between
average reward and cost, confirming the benefits of our approach in safety-critical contexts.</p>
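      <p>To make the notion of a reward versus cost Pareto frontier concrete, the sketch below filters a set of (reward, cost) points down to the non-dominated ones, where higher reward and lower cost are both preferred. The function name is hypothetical, and any numbers passed to it in an example are placeholders, not results from Figure 4.</p>
      <preformat><![CDATA[
```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, average reward, average cost)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by any other, sorted by increasing cost.

    A point is dominated if another point has reward >= and cost <= it,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for label, reward, cost in points:
        dominated = any(
            other_reward >= reward
            and other_cost <= cost
            and (other_reward > reward or other_cost < cost)
            for _, other_reward, other_cost in points
        )
        if not dominated:
            frontier.append((label, reward, cost))
    return sorted(frontier, key=lambda point: point[2])
```
]]></preformat>
      <p>A method whose point lies on this frontier offers a trade-off that no other evaluated method improves on in both reward and cost at once.</p>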
    </sec>
    <sec id="sec-4">
      <title>4. Explainability: an open challenge</title>
      <p>Despite these advances, RL models still output decisions that are not intelligible, and thus not readily acceptable,
to human operators, making explainability a critical requirement for deployment (see e.g.,
https://post.parliament.uk/research-briefings/post-pn-0735/). As a next step, we aim to co-design
RL-based controllers and XAI methods to build trust and enhance usability in critical infrastructure. In
particular, we will advance (robust) counterfactual explanations (CEs) [28, 29, 15] as an XAI technique
for RL models. CEs will provide actionable insights into RL decisions through rigorous "what-if" analysis,
by clarifying how changes in the input of an RL model impact system performance and by ensuring
robust, interpretable explanations for real-world applications. Crucially, a recent survey highlighted that
current methods for CE generation for RL are limited, thus calling for new developments in this space
[13]. This interdisciplinary research bridges computational advancements with practical deployment
needs, aligning RL optimization with grid operation principles and ensuring transparency and trust
through robust XAI methods. The outcomes will support the development of reliable, explainable, and
scalable solutions for power system operations, accelerating the transition to a fully decarbonized and
resilient energy grid.</p>
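      <p>As a toy illustration of the "what-if" reading of a counterfactual explanation, the sketch below searches a list of candidate states for the one closest (in L1 distance) to the current state for which a policy would choose a different action. It is not one of the methods in [28, 29]: the rule-based toy_policy, the candidate list, and all names are assumptions made purely for this example.</p>
      <preformat><![CDATA[
```python
from typing import Callable, Iterable, Optional, Tuple

State = Tuple[float, ...]


def toy_policy(state: State) -> str:
    """Hypothetical rule-based stand-in for a learned policy."""
    return "curtail" if max(state) > 0.95 else "hold"


def nearest_counterfactual(
    policy: Callable[[State], str],
    state: State,
    candidates: Iterable[State],
) -> Optional[Tuple[State, float]]:
    """Closest candidate whose action differs from the action taken in state."""
    base_action = policy(state)
    best: Optional[Tuple[State, float]] = None
    for candidate in candidates:
        if policy(candidate) == base_action:
            continue  # same decision, so this is not a counterfactual
        distance = sum(abs(a - b) for a, b in zip(state, candidate))  # L1 distance
        if best is None or distance < best[1]:
            best = (candidate, distance)
    return best
```
]]></preformat>
      <p>For the state (0.90, 0.80) and candidates (0.92, 0.80), (0.97, 0.80), and (0.90, 0.99), the function selects (0.97, 0.80): raising the first loading by about 0.07 is the smallest listed change that flips the toy decision from "hold" to "curtail".</p>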
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar
and spelling checks, paraphrasing, and rewording. After using these tools, the authors reviewed and edited
the content as needed and took full responsibility for the publication’s content.</p>
      <p>for deep neural networks, in: Proceedings of the Thirty-Second International Joint Conference on
Artificial Intelligence, 2023, pp. 217–224.</p>
      <p>[12] L. Marzari, D. Corsi, E. Marchesini, A. Farinelli, F. Cicalese, Enumerating safe regions in deep
neural networks with provable probabilistic guarantees, Proceedings of the AAAI Conference on
Artificial Intelligence (2024) 21387–21394.</p>
      <p>[13] J. Gajcin, I. Dusparic, Redefining counterfactual explanations for reinforcement learning: Overview,
challenges and opportunities, ACM Computing Surveys 56 (2024) 1–33.</p>
      <p>[14] J. Jiang, F. Leofante, A. Rago, F. Toni, Robust counterfactual explanations in machine learning: a
survey, in: Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence,
2024, pp. 8086–8094.</p>
      <p>[15] F. Leofante, M. Wicker, Robust Explainable AI, Springer Nature, 2025.</p>
      <p>[16] E. Marchesini, C. Amato, Improving deep policy gradients with value function search, in:
International Conference on Learning Representations (ICLR), 2023. URL: https://openreview.net/forum?id=6qZC7pfenQm.</p>
      <p>[17] RTE France, Dive into grid2op sequential decision process, 2025. URL: https://grid2op.readthedocs.io/en/latest/mdp.html#some-constraints, accessed: 2025-05-8.</p>
      <p>[18] J. Ji, J. Zhou, B. Zhang, J. Dai, X. Pan, R. Sun, W. Huang, Y. Geng, M. Liu, Y. Yang,
OmniSafe: An infrastructure for accelerating safe reinforcement learning research, arXiv preprint
arXiv:2305.09304 (2023).</p>
      <p>[19] J. García, F. Fernández, A comprehensive survey on safe reinforcement learning, in: Journal of
Machine Learning Research (JMLR), 2015.</p>
      <p>[20] E. Altman, Constrained Markov decision processes, CRC Press, 1999.</p>
      <p>[21] A. A. Aydeniz, E. Marchesini, R. Loftin, C. Amato, K. Tumer, Safe entropic agents under team
constraints, in: Proceedings of the 24th International Conference on Autonomous Agents and
Multiagent Systems, International Foundation for Autonomous Agents and Multiagent Systems,
2025, pp. 2411–2413.</p>
      <p>[22] J. Ji, B. Zhang, J. Zhou, X. Pan, W. Huang, R. Sun, Y. Geng, Y. Zhong, J. Dai, Y. Yang, Safety
gymnasium: A unified safe reinforcement learning benchmark, in: Thirty-seventh Conference
on Neural Information Processing Systems Datasets and Benchmarks Track, 2023. URL: https://openreview.net/forum?id=WZmlxIuIGR.</p>
      <p>[23] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing
properties of neural networks, arXiv preprint arXiv:1312.6199 (2013).</p>
      <p>[24] G. Amir, D. Corsi, R. Yerushalmi, L. Marzari, D. Harel, A. Farinelli, G. Katz, Verifying learning-based
robotic navigation systems, in: 29th Int. Conf., TACAS 2023, Springer, 2023, pp. 607–627.</p>
      <p>[25] T. Wei, H. Hu, L. Marzari, K. S. Yun, P. Niu, X. Luo, C. Liu, ModelVerification.jl: A
comprehensive toolbox for formally verifying deep neural networks, in: Computer Aided
Verification - 37th International Conference, CAV 2025, volume 15932 of Lecture Notes in
Computer Science, Springer, 2025, pp. 395–408. doi:10.1007/978-3-031-98679-6_18.</p>
      <p>[26] G. Katz, C. Barrett, D. L. Dill, K. Julian, M. J. Kochenderfer, Reluplex: An efficient SMT solver
for verifying deep neural networks, in: International Conference on Computer Aided Verification,
Springer, 2017, pp. 97–117.</p>
      <p>[27] L. Marzari, P. L. Donti, C. Liu, E. Marchesini, Improving policy optimization via ε-retrain, in:
Proceedings of the 24th International Conference on Autonomous Agents and Multiagent Systems,
2025, pp. 1464–1472.</p>
      <p>[28] L. Marzari, F. Leofante, F. Cicalese, A. Farinelli, Rigorous probabilistic guarantees for robust
counterfactual explanations, in: ECAI 2024, IOS Press, 2024, pp. 1059–1066.</p>
      <p>[29] J. Jiang, L. Marzari, A. Purohit, F. Leofante, RobustX: Robust counterfactual explanations made easy,
in: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence,
IJCAI-25, International Joint Conferences on Artificial Intelligence Organization, 2025, pp. 11067–11071.
doi:10.24963/ijcai.2025/1264.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Donnot</surname>
          </string-name>
          ,
          <article-title>Grid2op-A testbed platform to model sequential decision making in power systems</article-title>
          , https://GitHub.com/rte-france/grid2op
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Marot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Naglic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Barbesant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cremer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stefanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Viebahn</surname>
          </string-name>
          ,
          <article-title>Perspectives on future power system control centers for energy transition</article-title>
          ,
          <source>Journal of Modern Power Systems and Clean Energy</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>328</fpage>
          -
          <lpage>344</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wierstra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Riedmiller</surname>
          </string-name>
          ,
          <article-title>Playing Atari with deep reinforcement learning</article-title>
          ,
          <source>in: Conference on Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Maddison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Guez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sifre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Van Den Driessche</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schrittwieser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Antonoglou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Panneershelvam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lanctot</surname>
          </string-name>
          , et al.,
          <article-title>Mastering the game of Go with deep neural networks and tree search</article-title>
          ,
          <source>Nature</source>
          <volume>529</volume>
          (
          <year>2016</year>
          )
          <fpage>484</fpage>
          -
          <lpage>489</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P. R.</given-names>
            <surname>Wurman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kawamoto</surname>
          </string-name>
          , J. MacGlashan, K. Subramanian,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Walsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Capobianco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Devlic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Eckert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gilpin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kompella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>MacAlpine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Oller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sherstan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Seno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. D.</given-names>
            <surname>Thomure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Aghabozorgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Douglas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Whitehead</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Duerr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Spranger</surname>
          </string-name>
          , H. Kitano,
          <article-title>Outracing champion Gran Turismo drivers with deep reinforcement learning</article-title>
          ,
          <source>Nature</source>
          <volume>602</volume>
          (
          <year>2022</year>
          )
          <fpage>223</fpage>
          -
          <lpage>228</lpage>
          . doi:10.1038/s41586-021-04357-7.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Dulac-Arnold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Mankowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Paduraru</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gowal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hester</surname>
          </string-name>
          ,
          <article-title>Challenges of real-world reinforcement learning: definitions, benchmarks and analysis</article-title>
          ,
          <source>Machine Learning</source>
          <volume>110</volume>
          (
          <year>2021</year>
          )
          <fpage>2419</fpage>
          -
          <lpage>2468</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Donnot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Crozier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dytham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Merz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Schewe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Westerbeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Marot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. L.</given-names>
            <surname>Donti</surname>
          </string-name>
          ,
          <article-title>RL2Grid: Benchmarking reinforcement learning in power grid operations</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2503.23101. arXiv:2503.23101.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Arnon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lazarus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Strong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Kochenderfer</surname>
          </string-name>
          , et al.,
          <article-title>Algorithms for verifying deep neural networks</article-title>
          ,
          <source>Foundations and Trends® in Optimization</source>
          <volume>4</volume>
          (
          <year>2021</year>
          )
          <fpage>244</fpage>
          -
          <lpage>404</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Marzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Amato</surname>
          </string-name>
          ,
          <article-title>Safe deep reinforcement learning by verifying task-level properties</article-title>
          ,
          <source>in: Proceedings of the 2023 International Conference on Autonomous Agents and Multiagent Systems</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1466</fpage>
          -
          <lpage>1475</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Marzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marchesini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>Online safety property collection and refinement for safe deep reinforcement learning in mapless navigation</article-title>
          ,
          <source>in: 2023 IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>7133</fpage>
          -
          <lpage>7139</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>L.</given-names>
            <surname>Marzari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Corsi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Cicalese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farinelli</surname>
          </string-name>
          ,
          <article-title>The #DNN-verification problem: counting unsafe inputs</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>