<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>RLRom: Monitoring and Training Reinforcement Learning Agents using Signal Temporal Logic</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana María Gómez-Ruiz</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexandre Donzé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thao Dang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>VERIMAG</institution>
          ,
          <addr-line>Bâtiment IMAG</addr-line>
          ,
          <institution>Université Grenoble Alpes</institution>
          ,
          <addr-line>150 Pl du Torrent, 38401 Grenoble</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Reinforcement learning (RL) excels at solving complex tasks, but its application presents challenges in interpretability, safety and reliability. This is in part due to the dynamic nature of RL problems, for which timing interactions are crucial. RLRom is a novel tool that integrates Signal Temporal Logic (STL) with common RL frameworks. STL, a formal language for time-dependent properties, enables RLRom to monitor agent behavior against specified requirements, identifying both intended and unexpected behaviors. STL properties can be robustly monitored online, generating quantitative satisfaction values that can serve as new observations or reward components during training. This supports the design, interpretability and development of safe, reliable and efficient RL agents. We demonstrate RLRom's utility using the highway-env environment, showing its ability to influence driving agent behavior, such as speed, safety and rule adherence.</p>
      </abstract>
      <kwd-group>
<kwd>Reinforcement Learning</kwd>
        <kwd>Signal Temporal Logic</kwd>
        <kwd>Monitoring</kwd>
        <kwd>Formal Methods</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a powerful and versatile machine learning method used to teach
artificial systems how to solve complex tasks. The advent of deep learning and the development of
mature RL algorithms have made it applicable to a variety of situations, from advanced robotics to
fine-tuning large language models. However, applying RL to a given problem remains a challenging
endeavor. This is because, as with most modern artificial intelligence (AI) tasks, difficult issues of
interpretability and uncertainty of the results arise during and after the learning process. Moreover,
RL is concerned with dynamical or sequential problems, where timing between events is important,
making it even more difficult to design and debug solutions. As a consequence, several authors have
looked into temporal logics to reason formally about the characteristics of traces, i.e., sequences of
observations of episodes realised by a trained agent.
      </p>
      <p>
        One such logic is Signal Temporal Logic (STL) introduced in [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] to specify properties of signals with
continuous times and values. STL syntax is minimalist, yet quite expressive and intuitive. Atomic
predicates are inequalities over quantities of interest (for example, the speed is greater than 50, the height
less than 5, etc.) and formulas are constructed using standard Boolean operators and temporal operators,
e.g., ♢[0,30] p1 reads “eventually within 30 seconds, p1 is true”, and “if p1 is true then p2 is always true from 5
seconds to 10 seconds” can be written as p1 ⇒ □[5,10] p2. One crucial advantage of STL is that it admits
a quantitative semantics [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] which can be very efficiently evaluated (monitored) even on long traces
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and online [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], i.e., while a trace is still being computed. Sometimes called “robustness”, this
semantics estimates the degree of satisfaction of a formula by a trace not merely as true or false, but as a
number whose amplitude represents how “robust” or fragile the true or false evaluation is. This number
can then be used for various purposes. For instance, by trying to minimize it, one can find a trace that
violates a given requirement, potentially exhibiting a bug [6]. Conversely, by maximizing it, one can
find a controller or policy that robustly satisfies a design requirement [7, 8]. The idea of using RL to train
a policy satisfying STL requirements was then naturally explored in the past decade [9, 10, 11, 12],
though few tools have been implemented.
      </p>
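      <p>To make the quantitative semantics concrete, here is a minimal sketch of discrete-time robustness computations in Python. It is illustrative only, and not the STLRom implementation: the robustness of a predicate x[t] &gt; c is the margin x[t] - c, “eventually” takes a maximum over a time window and “always” a minimum, following [3].</p>
      <preformat>
# Minimal sketch of STL quantitative semantics on a sampled trace.
# Illustrative only; not the STLRom implementation.

def rob_gt(trace, c):
    """Robustness of the predicate x[t] > c at every step."""
    return [x - c for x in trace]

def rob_eventually(rob, a, b):
    """Robustness of ev_[a,b] phi at step t: max over steps t+a..t+b."""
    n = len(rob)
    return [max(rob[t + a : min(t + b + 1, n)] or [float("-inf")])
            for t in range(n)]

def rob_always(rob, a, b):
    """Robustness of alw_[a,b] phi at step t: min over steps t+a..t+b."""
    n = len(rob)
    return [min(rob[t + a : min(t + b + 1, n)] or [float("inf")])
            for t in range(n)]

speed = [45.0, 48.0, 52.0, 51.0, 49.0]
print(rob_eventually(rob_gt(speed, 50.0), 0, 3)[0])  # 2.0: the margin by
# which "eventually within 3 steps, speed > 50" is satisfied at t = 0
      </preformat>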
      <p>In most of the works combining RL and STL, the goal is that of controller synthesis, i.e., the problem
assumes that a set of STL requirements is given and the goal is to train the best policy satisfying them.
However, even if STL is a relatively simple language, specifying the right goals can be difficult, and
such an approach can be prone to the reward hacking problem, wherein the trained agent satisfies the
requirements, but by deploying unexpected and often undesirable strategies.</p>
      <p>Our view is that in practice, STL requirements (and requirements in general, not necessarily expressed
in STL) are part of the design problem as much as the system or its controller itself. Hence, one should be
able to manipulate them in a flexible way, within an iterative process converging towards a satisfactory
design with a clearly and formally defined set of requirements.</p>
      <p>
        For these reasons, we implemented RLRom (https://github.com/decyphir/rlrom), which integrates a rich STL parser and efficient online
monitoring methods (taken from [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]) with popular RL frameworks. Our tool is similar to stl-gym (https://github.com/nphamilton/stl-gym),
based on [13], which allows the reward function to be replaced with the robustness of an STL formula.
RLRom goes beyond it by allowing richer formulas to be written and monitored, as well as by modifying and
augmenting not only the reward function but also the observation vector.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Tool Overview</title>
      <sec id="sec-2-1">
        <title>2.1. Implementation</title>
        <p>RLRom is built on top of three main existing components, namely:
• Gymnasium [14]: a Python library offering a standardized API for developing and
comparing reinforcement learning algorithms by providing a variety of pre-built and customizable
environments. It serves as a widely used platform for RL research and development;
• Stable-Baselines3 [15]: a Python library that provides well-documented implementations of
model-free reinforcement learning algorithms (such as PPO, DQN, SAC, etc.) in PyTorch. It offers
a consistent interface for training and evaluating RL agents;
• STLRom (https://github.com/decyphir/stlrom): a C++ library with Python bindings for robust online monitoring of Signal Temporal
Logic (STL). It enables users to define temporal properties of signals and compute their robustness
values online.</p>
        <p>The main implementation idea is based on the wrapper mechanism of Gymnasium, which takes
an existing RL environment, with its reset() and step() methods and its observation and action
spaces, and returns a new environment that changes one or all of these components. In RLRom, we
implemented an stl_wrapper that does the following (a minimal sketch is given after this list):
1. it adds an stl_driver object from STLRom which can read data and monitor STL formulas;
2. within the step() function, each new observation of the wrapped environment is added to the
stl_driver and the satisfaction of the formulas is updated; the resulting robustness values are then
used to augment the observation and/or influence the reward.</p>
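        <p>The following sketch illustrates this mechanism, assuming a hypothetical STLDriver class in place of the actual STLRom bindings; the real class, method names and signatures may differ:</p>
        <preformat>
# Minimal sketch of the stl_wrapper idea. `STLDriver` and its methods
# (add_sample, get_robustness) are hypothetical stand-ins for STLRom.
import gymnasium as gym
import numpy as np

class STLWrapper(gym.Wrapper):
    def __init__(self, env, stl_driver, reward_weight=1.0):
        super().__init__(env)
        self.stl_driver = stl_driver        # online STL monitor
        self.reward_weight = reward_weight  # weight of robustness in reward

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # 1. Feed the new sample to the monitor, updating satisfaction.
        self.stl_driver.add_sample(obs)
        rob = np.asarray(self.stl_driver.get_robustness())  # one value per formula
        # 2. Augment the observation and shape the reward with robustness.
        new_obs = np.concatenate([obs, rob])
        new_reward = reward + self.reward_weight * float(rob.sum())
        return new_obs, new_reward, terminated, truncated, info
        </preformat>
        <p>In a complete implementation, the wrapper would also extend the environment's observation_space to account for the extra robustness components.</p>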
        <p>The resulting augmented environment can then be used for extended evaluation of a trained agent
(monitoring) or for training. RLRom is designed to be modular and as interpretable as possible.
Modularity is achieved by separating the configurations of the environment, the training algorithm and the specifications.
This is illustrated in Figure 1. Monitoring and visualization are the basis for interpretability. RLRom
implements very flexible monitoring methods, inherited from STLRom, and plotting, as illustrated in
Figure 2.1.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Workflows and features</title>
        <p>When working with a new RL problem, assuming we are given an RL environment implementing a
Gymnasium-compatible API, RLRom is intended to support the following steps, with workflows of
increasing complexity.</p>
        <p>1. (WF1) Testing: RLRom provides a simple way to run multiple episodes while varying the initial seed.
Models can be imported from files or from popular repositories, such as Hugging Face (https://huggingface.com). Eventually, we
plan on implementing more advanced simulation-based testing methods, in the spirit of [16].
2. (WF2) Training: new models can be trained with RLRom interfacing with existing stable-baselines3
algorithms; existing hyper-parameters can be quickly fetched and tried to obtain some baseline
model for the base problem.
3. (WF3) Monitoring: once a baseline model has been obtained, STL specifications can be written
and monitored (see the sketch after this list). This requires deciding which features of the observation space should be monitored
as signals, and which properties should be checked. A typical first step is to formalize the initial
goal of the environment and check that the baseline model satisfies this formalization. E.g., for the
classic cart pole environment, the goal is to keep the pole up, which can be translated into the
simple predicate pole_up := abs(theta[t]) &lt; epsi for some small value of epsi. The
goal specification can then be:
phi_goal_cartpole := ev_[0,400] alw_[0,100] pole_up
that is, before 400 steps, the pole should stay up for at least 100 steps.
4. (WF4) Training with specifications: depending on the intentions and the potential desirable or
undesirable behaviors discovered in (WF3), STL specifications can be incorporated into the model
to solve a problem with more complex objectives.</p>
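        <p>As an illustration of (WF3), the following sketch rolls out a policy on the classic cart pole environment and monitors the goal formula online. The monitoring calls (STLDriver, parse_string, add_sample, get_online_rob) are hypothetical stand-ins for the actual STLRom Python bindings:</p>
        <preformat>
# Sketch of (WF3) on CartPole-v1; the monitoring calls are hypothetical.
import gymnasium as gym

SPEC = """
pole_up := abs(theta[t]) &lt; 0.2
phi_goal_cartpole := ev_[0,400] alw_[0,100] pole_up
"""

env = gym.make("CartPole-v1")
# driver = STLDriver(); driver.parse_string(SPEC)    # hypothetical API
obs, _ = env.reset(seed=0)
for t in range(500):
    action = env.action_space.sample()  # stand-in for model.predict(obs)
    obs, reward, terminated, truncated, _ = env.step(action)
    theta = obs[2]  # pole angle, third component of the observation
    # driver.add_sample([t, theta])                  # feed the monitor
    # rob = driver.get_online_rob("phi_goal_cartpole")
    if terminated or truncated:
        break
        </preformat>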
        <p>(WF1) and (WF2) are standard workflows in the RL community, implemented for example in the
RL-Zoo framework. (WF3) and (WF4) are more unique to RLRom. Although various integrations of
STL specifications into RL problems have been proposed before, our intention is not to propose an
elaborate solution to one well-posed, difficult problem, such as finding the optimal agent satisfying some
complex STL formula, but rather to propose flexible ways to incorporate and exploit STL monitoring in
an RL framework, providing a base for many different implementations of elaborate solutions to
well-posed problems.</p>
        <p>To illustrate this, our framework already allows the following transformations:
• augment the observation space with the robustness of arbitrary formulas, say
new_obs := [old_obs, obs_rob1, obs_rob2, ...]
• add the weighted robustness of arbitrary formulas to the reward:
new_reward := old_reward + α1·rob1 + α2·rob2 + ...</p>
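        <p>A minimal numeric instance of these two transformations, assuming rob1 and rob2 are robustness values already produced by the monitor and alpha1, alpha2 are user-chosen weights:</p>
        <preformat>
import numpy as np

old_obs = np.array([0.1, -0.3])   # observation of the wrapped environment
rob1, rob2 = 0.8, -0.2            # robustness values from the monitor
alpha1, alpha2 = 1.0, 0.5         # user-chosen weights

new_obs = np.concatenate([old_obs, [rob1, rob2]])  # [0.1, -0.3, 0.8, -0.2]
new_reward = 1.0 + alpha1 * rob1 + alpha2 * rob2   # old_reward = 1.0 -> 1.7
        </preformat>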
        <p>We believe that this already enables a wide range of applications, while keeping the transformations
interpretable. However, we are still exploring their theoretical and practical implications in a cautious
way. For instance, since the satisfaction of an STL formula obviously depends on history (the online
robustness of a formula at step t is a function of past samples, not only of the current state), the Markov
assumption can easily be violated by these transformations. Hence, in our early experiments, we started
with zero- or short-horizon formulas to include in observations and/or rewards. We found that, in order
to satisfy a global specification, a simple approach is to identify local sub-formulas or predicates that
monotonically affect the global specification, and to add them to the reward and/or the observation
accordingly. We believe RLRom will make it simpler to validate such intuitions and to derive from them
more systematic approaches.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion and future work</title>
      <p>RLRom's goal is to provide an experimental platform for first testing RL problems against STL
requirements, then augmenting these problems with adjusted, formally defined goals. Basic functionalities such as
writing STL formulas, monitoring, and training with augmented rewards and observations have already
been implemented in a way that provides great flexibility and potential for exploration. This has already led
to promising results in terms of training agents, in a controlled and monitored way, to satisfy certain
behaviors. Future work includes consolidating the existing features and applying them to more RL problems,
as well as exploring the integration and implementation of formal frameworks, in particular
reward machines, not as an alternative but as a complement to STL monitoring.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Acknowledgments</title>
      <p>This work is partially supported by the joint French-Japanese ANR-JST project CyPhAI, the
Auvergne-Rhône-Alpes Region Project DetAI, and by a French state grant managed by the National Research
Agency as part of France 2030, with the reference ANR-23-IACL-0006.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors certify that no Generative AI tool was used during the production of this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          , Reinforcement Learning: An Introduction, second ed., The MIT Press,
          <year>2018</year>
          . URL: http://incompleteideas.net/book/the-book-2nd.html.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] O. Maler, D. Nickovic, Monitoring Temporal Properties of Continuous Signals, in: Y. Lakhnech, S. Yovine (Eds.), Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, Springer, Berlin, Heidelberg, 2004, pp. 152-166. doi:10.1007/978-3-540-30206-3_12.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Donzé, O. Maler, Robust Satisfaction of Temporal Logic over Real-Valued Signals, in: K. Chatterjee, T. A. Henzinger (Eds.), Formal Modeling and Analysis of Timed Systems, Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2010, pp. 92-106. doi:10.1007/978-3-642-15297-9_9.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] A. Donzé, T. Ferrère, O. Maler, Efficient Robust Monitoring for STL, in: N. Sharygina, H. Veith (Eds.), Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings, volume 8044 of Lecture Notes in Computer Science, Springer, 2013, pp. 264-279. URL: https://doi.org/10.1007/978-3-642-39799-8_19. doi:10.1007/978-3-642-39799-8_19.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] J. V. Deshmukh, A. Donzé, S. Ghosh, X. Jin, G. Juniwal, S. A. Seshia, Robust Online Monitoring of Signal Temporal Logic, in: E. Bartocci, R. Majumdar (Eds.), Runtime Verification - 6th International Conference, RV 2015, Vienna, Austria, September 22-25, 2015. Proceedings, volume 9333 of Lecture Notes in Computer Science, Springer, 2015, pp. 55-70. URL: https://doi.org/10.1007/978-3-319-23820-3_4. doi:10.1007/978-3-319-23820-3_4.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] T. Dreossi, A. Donzé, S. A. Seshia, Compositional Falsification of Cyber-Physical Systems with Machine Learning Components, in: C. W. Barrett, M. Davies, T. Kahsai (Eds.), NASA Formal Methods - 9th International Symposium, NFM 2017, Moffett Field, CA, USA, May 16-18, 2017, Proceedings, volume 10227 of Lecture Notes in Computer Science, 2017, pp. 357-372. URL: https://doi.org/10.1007/978-3-319-57288-8_26. doi:10.1007/978-3-319-57288-8_26.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] V. Raman, A. Donzé, M. Maasoumy, R. M. Murray, A. Sangiovanni-Vincentelli, S. A. Seshia, Model predictive control with signal temporal logic specifications, in: 53rd IEEE Conference on Decision and Control, 2014, pp. 81-87. doi:10.1109/CDC.2014.7039363.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] V. Raman, A. Donzé, D. Sadigh, R. M. Murray, S. A. Seshia, Reactive synthesis from signal temporal logic specifications, in: A. Girard, S. Sankaranarayanan (Eds.), Proceedings of the 18th International Conference on Hybrid Systems: Computation and Control, HSCC'15, Seattle, WA, USA, April 14-16, 2015, ACM, 2015, pp. 239-248. doi:10.1145/2728606.2728628.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] X. Li, C.-I. Vasile, C. Belta, Reinforcement Learning With Temporal Logic Rewards, 2017. URL: http://arxiv.org/abs/1612.03471. doi:10.48550/arXiv.1612.03471. arXiv:1612.03471 [cs].</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] H. Venkataraman, D. Aksaray, P. Seiler, Tractable Reinforcement Learning of Signal Temporal Logic Objectives, in: Proceedings of the 2nd Conference on Learning for Dynamics and Control, PMLR, 2020, pp. 308-317. URL: https://proceedings.mlr.press/v120/venkataraman20a.html.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] L. Berducci, E. A. Aguilar, D. Ničković, R. Grosu, Hierarchical Potential-based Reward Shaping from Task Specifications, 2022. URL: http://arxiv.org/abs/2110.02792. doi:10.48550/arXiv.2110.02792. arXiv:2110.02792 [cs].</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] P. Kapoor, K. Mizuta, E. Kang, K. Leung, Stlcg++: A masking approach for differentiable signal temporal logic specification, 2025. URL: https://arxiv.org/abs/2501.04194. arXiv:2501.04194.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] D. Nickovic, T. Yamaguchi, RTAMT: Online Robustness Monitors from STL, 2020. URL: http://arxiv.org/abs/2005.11827. arXiv:2005.11827 [cs].</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] J. Towers, A. Sowinski, L. Happ, A. Gleave, M. Pinto, E. Grefenstette, Gymnasium: An API for Reinforcement Learning Research, Journal of Machine Learning Research 25 (2024) 1-11. URL: http://jmlr.org/papers/v25/24-0010.html.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-Baselines3: Reliable Reinforcement Learning Implementations, Journal of Machine Learning Research 22 (268) (2021) 1-8. URL: https://stable-baselines3.readthedocs.io/.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Donzé, Breach, A Toolbox for Verification and Parameter Synthesis of Hybrid Systems, in: T. Touili, B. Cook, P. B. Jackson (Eds.), Computer Aided Verification, 22nd International Conference, CAV 2010, Edinburgh, UK, July 15-19, 2010. Proceedings, volume 6174 of Lecture Notes in Computer Science, Springer, 2010, pp. 167-170. URL: https://doi.org/10.1007/978-3-642-14295-6_17. doi:10.1007/978-3-642-14295-6_17.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>