<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Workshop on Artificial Intelligence and Formal Verification, Logic, Automata, and Synthesis, September 2021</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Automated Reasoning for Reinforcement Learning Agents in Structured Environments</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Gianola</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Montali</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Papini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Free University of Bozen-Bolzano</institution>
          ,
          <addr-line>Bolzano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Pompeu Fabra</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <volume>22</volume>
      <issue>2021</issue>
      <abstract>
        <p>Designing agents that are both adaptive and trustworthy is a long-standing problem at the intersection of symbolic AI and Machine Learning. In this position paper, we discuss several benefits of combining automated reasoning and reinforcement learning techniques to formally verify agents' behavior in structured environments, both during and after the learning process. These are systems where agents have access to an explicit structure representing what they know about the world. Since we care about both the verifiability and the efficiency of the learning process, we argue why it is crucial to efficiently integrate complex structures into the learning algorithms themselves.</p>
      </abstract>
      <kwd-group>
        <kwd>Reinforcement Learning</kwd>
        <kwd>Automated Reasoning</kwd>
        <kwd>Formal Methods</kwd>
        <kwd>Verification</kwd>
        <kwd>Data-aware Processes</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] has emerged as one of the most important techniques for
equipping agents with learning capabilities used to maximize a reward while operating in an
unknown environment. In this respect, RL embodies one of the main distinctive features of
Machine Learning (ML) approaches: the complexity of the world is not explicitly
represented, but is instead reflected through a number of implicit soft constraints that emerge
from the learning process. This poses a twofold challenge when it comes to hard constraints,
whether related to the environment in which the agent operates or to the behavior that the agent
is expected to exhibit during and/or at the end of the learning phase. Conventional RL-based
approaches struggle to deal with this challenge: being mainly based on statistical inference, they
cannot prevent learning agents from reaching unsafe/undesired/dangerous configurations
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In fact, in each state of a Markov decision process, every action is in principle executable,
with an associated probability. This entails that, especially during learning, every possible
execution is allowed, inducing a huge state space that contains all possible configurations,
including the unsafe ones (or even those that cannot be reached in the real world).
      </p>
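      <p>To make the point concrete, consider the following sketch (a toy MDP of our own, not drawn from the cited literature): under a uniformly exploring policy every action is tried with some probability, so the agent eventually visits the unsafe configuration, whereas a policy restricted to a safe cycle never does.</p>
      <preformat>
```python
import random

# A toy Markov decision process (our own hypothetical example): states 0..3,
# where state 3 is "unsafe". TRANSITIONS maps (state, action) to a probability
# distribution over successor states. Since every action is in principle
# executable, nothing prevents a purely exploring agent from reaching state 3.
TRANSITIONS = {
    (0, "a"): [(0.8, 1), (0.2, 3)],  # risky action: 20% chance of going unsafe
    (0, "b"): [(1.0, 2)],
    (1, "a"): [(1.0, 0)],
    (1, "b"): [(0.9, 2), (0.1, 3)],
    (2, "a"): [(1.0, 0)],
    (2, "b"): [(1.0, 1)],
    (3, "a"): [(1.0, 3)],            # the unsafe state is absorbing
    (3, "b"): [(1.0, 3)],
}
UNSAFE = 3

def rollout(policy, start=0, horizon=50, seed=0):
    """Follow `policy` for `horizon` steps; report whether the unsafe
    state was ever visited."""
    rng = random.Random(seed)
    state = start
    for _ in range(horizon):
        probs, succs = zip(*TRANSITIONS[(state, policy(state))])
        state = rng.choices(succs, weights=probs)[0]
        if state == UNSAFE:
            return True
    return False

def random_policy(state, _rng=random.Random(1)):
    # Uniform exploration, as in the early phase of learning.
    return _rng.choice(["a", "b"])

def safe_policy(state):
    # A policy confined to the deterministic safe cycle 0 -> 2 -> 0.
    return "b" if state == 0 else "a"
```
      </preformat>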
      <p>
        Safety-oriented extensions of the conventional RL framework have consequently been studied,
but they typically achieve safety in a probabilistic sense, that is, guaranteeing that there is a high
probability for the system to be safe [
        <xref ref-type="bibr" rid="ref3 ref4 ref5 ref6">3, 4, 5, 6</xref>
        ], or that safety will hold only in the limit policy
obtained at the end of the learning process [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. In fact, existing symbolic approaches do not
explicitly distinguish between two different, but equally important, temporal dimensions: the
temporal dimension related to the system dynamics under a fixed agent policy, and the temporal
evolution of the policy as a result of the learning process. Nevertheless, in the conventional RL
literature both of the aforementioned dimensions have been considered [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref13 ref14 ref15 ref9">9, 10, 11, 12, 13, 14, 15</xref>
        ].
      </p>
      <p>
        The verification of dynamical systems against some property of interest is traditionally
investigated in the realm of formal methods, from general, domain-independent properties
(such as safety, liveness, and fairness) [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ] to domain-specific properties typically expressed
using temporal logics [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], possibly also predicating over the strategies agents may adopt
[
        <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
        ]. The application of techniques from formal methods and knowledge representation
in an RL-based setting has so far typically revolved around guaranteeing that a single agent
[
        <xref ref-type="bibr" rid="ref21 ref7 ref8">7, 21, 8, 22</xref>
        ] or multiple agents [23] in the limit satisfy temporal specifications of interest. They
do so in a generic context where no assumption is made on the structure of the state space.
      </p>
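      <p>The flavor of these formal-methods guarantees can be illustrated with a minimal sketch (our own example, not a full model checker): an exhaustive search over a finite transition system answers the safety question definitively, returning a counterexample trace whenever a bad state is reachable.</p>
      <preformat>
```python
from collections import deque

# A minimal sketch of the classic safety-verification question from model
# checking: can the system reach a "bad" state at all? Unlike statistical
# estimates, exhaustive search over a finite transition system gives a
# definite answer, together with a counterexample trace when safety fails.
def check_safety(initial, bad, successors):
    """Return a path to a bad state (a counterexample), or None if every
    reachable state is safe."""
    frontier = deque((s, [s]) for s in initial)
    seen = set(initial)
    while frontier:
        state, path = frontier.popleft()
        if state in bad:
            return path
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [nxt]))
    return None

# Hypothetical system: a counter that can be incremented or doubled,
# bounded at 10; "bad" states are those whose value exceeds 7.
def successors(n):
    return [m for m in (n + 1, 2 * n) if 10 >= m]
```
      </preformat>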
      <p>
        Our interest is to study systems where agents come with an explicit structure, for example
relational data representing what the agents know about the world (in the style of [
        <xref ref-type="bibr" rid="ref17">24, 17</xref>
        ]), and
to investigate such systems both during and after the learning process. This is, for example,
relevant for applications like business process management, where the states are process states
and some constraints must be satisfied not only in the limit but also during
learning. This requires incorporating, at once, background knowledge on the process itself
in the form of (safety) constraints that should always be respected, preconditions and effects
dictating how states can evolve into other states (and, conversely, which transitions do not
exist at all), and, finally, goals that should be satisfied in the limit.
      </p>
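      <p>A minimal sketch of what we have in mind (the table, guard, and thresholds are invented for illustration): an action whose precondition queries the relational state, and whose effect is the only way the state may evolve.</p>
      <preformat>
```python
# A sketch (our own toy formalization) of a data-aware, guarded action: the
# precondition queries the relational state and a numeric quantity such as the
# reward accumulated so far, while the effect updates the state as the process
# model dictates. Table contents and thresholds are invented.
orders = [
    {"id": 1, "status": "open", "amount": 40},
    {"id": 2, "status": "open", "amount": 90},
]

def can_approve(order, total_reward):
    # Guard: small orders are always approvable; large ones only once the
    # agent has accumulated enough reward.
    return order["status"] == "open" and (50 >= order["amount"] or total_reward >= 100)

def approve(order):
    # Effect: a transition allowed only when the guard holds; all other
    # transitions simply do not exist.
    order["status"] = "approved"

total_reward = 10
for order in orders:
    if can_approve(order, total_reward):
        approve(order)
```
      </preformat>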
      <p>All in all, our interest is in verifying properties that are significant for learning agents
carrying structure in the aforementioned sense. This is explained in the next section.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Directions and Challenges</title>
      <p>
        Currently, RL algorithms considered in symbolic approaches (e.g., [
        <xref ref-type="bibr" rid="ref21 ref8">21, 8</xref>
        ]) do not exploit the
structural information on the state space that the agents explore. Moreover, in conventional
RL theory only simple structures are taken into consideration (for example, linear Markov Decision
Processes, where the transition probabilities are linear in some given state features [25, 26]).
      </p>
      <p>Structural information can help improve the efficiency of the learning process, but only if it is
correctly incorporated in the learning algorithm. Indeed, learning from complex structured data
with general-purpose algorithms can cause an explosion of the number of states. Designing
learning algorithms with a notion of structure is then crucial to prune the state space. For
example, take a personalized recommendation problem where states are users, actions are
items to recommend, and rewards encode user satisfaction. If we simply enrich states with
user information (age, sex, etc.), the number of possible states explodes. However, if we find a
simpler relationship between user features and system dynamics, and the learning algorithm is
designed to exploit this structure, efficiency can be even better than by considering each user as
an atomic state. For instance, in the special case of a linear relationship, the effective dimension
of the problem goes from the number of states to the (much smaller) number of features [25].</p>
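      <p>The following sketch illustrates this gain (with invented numbers; it is not the algorithm of [25]): fitting the shared linear relationship from a few sampled users lets the learner predict rewards for users it has never observed, something a tabular learner over atomic states cannot do.</p>
      <preformat>
```python
import random

# A sketch of the efficiency gain from linear structure (illustrative numbers):
# user satisfaction depends linearly on a single user feature. A learner that
# fits this shared relationship from a handful of users generalizes to users it
# has never seen, while a tabular learner would need samples for every one of
# the 1000 atomic states.
rng = random.Random(0)
TRUE_W = 0.5
features = {user: float(user % 10) for user in range(1000)}

def observe_reward(user):
    # Noisy reward, linear in the user feature.
    return TRUE_W * features[user] + rng.gauss(0, 0.01)

# Closed-form 1-D least squares from only 20 sampled users:
sample = [(features[u], observe_reward(u)) for u in rng.sample(sorted(features), 20)]
w_hat = sum(x * y for x, y in sample) / sum(x * x for x, _ in sample)

def predict(user):
    # Generalizes to unseen users via the learned structure.
    return w_hat * features[user]
```
      </preformat>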
      <p>On the other hand, in the last decades much effort has been devoted in formal methods to
automatically verifying structured systems that combine processes and data [24, 27]. Specifically,
formal techniques based on automated reasoning have been successfully employed [28, 29]. The
models formalizing these systems are usually able to represent not only the dynamics of evolving
(business) processes, but also the structured data storage that those processes interact with.
Unfortunately, these models are usually ‘static’ in the sense that, once they have been created
by the users, they cannot be modified over time and improved by receiving feedback
from the real environment that they are intended to represent. Moreover, traditional formal
frameworks are not able to natively formalize a learning process like the ones studied in ML.</p>
      <p>A holistic framework for structured systems combining automated reasoning and
reinforcement learning is missing. Obtaining a working combination would bring several benefits: (i)
exploiting automatic analysis that is not affected by statistical uncertainty, as learning agents are
during their exploration; (ii) employing highly efficient tools, such as state-of-the-art model checkers
and SMT solvers, to perform the automatic analysis; (iii) leveraging well-studied formalisms
for expressive structured systems, like those studied for data-aware processes in [28, 30, 31].</p>
      <p>
        In the remainder of the section we focus in more detail on specific problems that arise in the
context of structured RL. Before doing so, we remark on the common challenges emerging in all
such problems. First, the problem of verifying structured systems like data-aware processes
is intrinsically difficult [24, 27], even disregarding the learning aspects. Adding probabilistic
features can only make it more challenging. Second, adding structure to states is a genuinely
novel aspect in RL. From the RL perspective, it is still unclear how to equip learning algorithms
with the ability to model and use complex, structured states and actions (e.g., activities
guided by a business process), and which forms of transition systems consequently emerge.
Safe Structured Reinforcement Learning. The most natural problem to consider is that of
safety verification of a single agent. Safety is particularly important for RL agents that interact
with the real world, since they could harm themselves and their environment [32]. Physical
agents like robots [33] or autonomous cars represent the most prominent example, since they
could directly harm human operators. However, software agents responsible for critical
decisions (e.g., in healthcare [34]) must also be subject to strict safety constraints. A literature on safe
learning has developed within RL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. In verification, this is the problem of determining whether
an agent can reach some undesired configuration. This becomes particularly
challenging in the case of structured agents carrying data, as data bring a dimension of infinity that
is difficult to tame even in the conventional setting where learning is not considered [24, 27].
      </p>
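      <p>One way to obtain safety during learning, in the spirit of the shielding approach of [39], can be sketched as follows (the transition system and unsafe set are our own toy example): a shield computed from the known structure filters out, at every step, any action leading to an unsafe successor.</p>
      <preformat>
```python
import random

# A minimal sketch of shielding in the spirit of [39]: a shield derived from
# known (model-based) transition structure removes, at every step, any action
# whose successor is unsafe, so safety holds *during* learning rather than
# only in the limit. The tiny transition system is our own example.
SUCC = {
    ("s0", "left"): "s1", ("s0", "right"): "bad",
    ("s1", "left"): "s0", ("s1", "right"): "s1",
}
UNSAFE = {"bad"}

def shield(state):
    """Actions from `state` whose successor is provably safe."""
    return [a for (s, a), t in SUCC.items() if s == state and t not in UNSAFE]

def explore(steps=100, seed=0):
    """An exploring agent that only ever picks among shielded actions."""
    rng = random.Random(seed)
    state, trace = "s0", []
    for _ in range(steps):
        state = SUCC[(state, rng.choice(shield(state)))]
        trace.append(state)
    return trace
```
      </preformat>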
      <p>
        In [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8, 23</xref>
        ], the learning process and the temporal specification are introduced separately and
only then combined: this creates a new learning problem where the original reward function
needs to be maximized together with the probability of satisfying the temporal specification.
Differently from these approaches, when dealing with data-aware learning agents, it appears to
us more natural to study the case where the logical specification of the structure and the reward
function are intrinsically connected: this can happen, for example, when the actions that the
agent can perform are guarded by conditions that can query not only the relational database but
also the reward obtained so far. Another natural way of equipping agents with structure is to
consider not only purely non-deterministic actions with some associated probability, but also
temporal/dynamic constraints over their mutual order, as dictated by a business process model.
Incorporating these features in algorithms becomes crucial for efficiency, as discussed below.
RL efficiency. Like most ML problems, RL is affected by the curse of dimensionality: as the
number of variables describing states grows, the number of states explodes combinatorially.
So, even if the sample complexity (the amount of data required to learn an optimal policy) of
classical RL algorithms is polynomial in the number of states, these methods are impractical for
most real problems. The true complexity of any learning problem depends on the structure of
the state space, i.e., on the similarity of states w.r.t. their associated rewards and transitions.
Structure can be exploited to learn with fewer data, recalling that every data sample acts as a
representative for (possibly infinitely) many other data that induce the same behavior. However,
an RL algorithm can only exploit this feature if it can actually incorporate this structure-related
information. The promise of deep RL [35] is to learn the structure along with the decision policy,
entirely from data, by exploiting the representational power of deep neural networks and
the efficiency of first-order optimization methods in high-dimensional settings. Unfortunately,
the theoretical understanding of deep learning is still limited. Recently, RL theory has focused
on simple structures, such as linear ones [25], where it is possible to derive sample complexity
guarantees that do not depend on the number of states. The challenge is to obtain similar results
for more complex structures, such as relational databases and other data-storing structures.
RL-specific Properties. A fundamental problem in RL is the exploration-exploitation
trade-off. To converge to an optimal policy, an agent must sufficiently try the available options (explore)
before settling on what appears to be the best solution (exploit). RL theory largely studies how
to make exploration efficient, i.e., how to limit the amount of time the agent must spend (equivalently,
the data it must collect) in exploratory behavior. However, a sufficient amount of exploration must
always be guaranteed, and this often means the agent must be able to visit all states [36, 37].
The problem of sufficient exploration must be considered both in the modeling of the learning
process (the definition of the state space) and in the design of the learning algorithm (the policies
the agent will execute while learning). Formalizing and verifying such conditions requires going
beyond safety and appealing to more sophisticated properties, e.g., liveness and ergodicity [38].
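The following sketch (a toy example of ours, not a formalization of [36, 37]) illustrates the point: on a small ring of states, an epsilon-greedy agent with a standing probability of random moves eventually visits every state, while its purely greedy counterpart remains trapped in a subset.
      <preformat>
```python
import random

# Checking a sufficient-exploration condition empirically (our own toy
# example): on a ring of six states, epsilon-greedy behavior with epsilon > 0
# keeps a standing chance of random moves and eventually visits every state,
# whereas the purely greedy agent oscillates forever inside {0, 1}.
N = 6  # states 0..5 arranged in a ring

def visited_states(epsilon, steps=5000, seed=0):
    rng = random.Random(seed)
    state, visited = 0, {0}
    for _ in range(steps):
        if epsilon > rng.random():
            move = rng.choice([-1, 1])      # exploratory random move
        else:
            move = 1 if state == 0 else -1  # myopic "greedy" oscillation
        state = (state + move) % N
        visited.add(state)
    return visited
```
      </preformat>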
Quantifying over learning steps. A recurring theme in what has been discussed so far is that
properties of interest sometimes have to be enforced/verified at all times, and sometimes only satisfied
in the limit (i.e., once the agent has completed its learning phase). This is even trickier, as properties
may be used to influence the learning process itself (perturbing the system), or instead be
checked without any intervention on the original system. This problem has been considered in
the literature. E.g., ensuring that an agent is safe not only in the limit (as in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), but also during
learning (as in [39]), is what Amodei et al. [40] call safe exploration in their survey of AI safety
problems. However, a systematic exploration of this space of possibilities is missing. We intend
to explore this space with two key features. On the one hand, we intend to express properties
in a logic that provides two different types of temporal quantification: one quantifying over the
system dynamics when the agent employs a fixed policy; the other over the steps of the learning
process, which in turn represent different refinements of the agent policy. On the other hand,
we want to equip agents with the ability to check such properties, influencing their behavior.</p>
      <p>[22] M. Hasanbeig, D. Kroening, A. Abate, Deep reinforcement learning with temporal logics,
in: Proc. of FORMATS 2020, volume 12288 of LNCS, Springer, 2020, pp. 1–22.
[23] L. Hammond, A. Abate, J. Gutierrez, M. J. Wooldridge, Multi-agent reinforcement learning
with temporal logic specifications, in: Proc. of AAMAS ’21, ACM, 2021, pp. 583–592.
[24] D. Calvanese, G. D. Giacomo, M. Montali, Foundations of data-aware process analysis: a
database theory perspective, in: Proc. of PODS 2013, ACM, 2013, pp. 1–12.
[25] C. Jin, Z. Yang, Z. Wang, M. I. Jordan, Provably efficient reinforcement learning with linear
function approximation, in: Proc. of COLT 2020, volume 125, PMLR, 2020, pp. 2137–2143.
[26] L. F. Yang, M. Wang, Reinforcement learning in feature space: Matrix bandit, kernels, and
regret bound, CoRR abs/1905.10389 (2019).
[27] A. Deutsch, R. Hull, Y. Li, V. Vianu, Automatic verification of database-centric systems,
      </p>
      <p>ACM SIGLOG News 5 (2018) 37–56.
[28] D. Calvanese, S. Ghilardi, A. Gianola, M. Montali, A. Rivkin, SMT-based verification of
data-aware processes: a model-theoretic approach, Math. Struct. Comput. Sci. 30 (2020)
271–313.
[29] D. Calvanese, S. Ghilardi, A. Gianola, M. Montali, A. Rivkin, Model completeness, covers
and superposition, in: Proc. of CADE 27, volume 11716 of LNCS, Springer, 2019, pp.
142–160.
[30] D. Calvanese, S. Ghilardi, A. Gianola, M. Montali, A. Rivkin, Formal modeling and
SMT-based parameterized verification of data-aware BPMN, in: Proc. of BPM 2019, volume
11675 of LNCS, Springer, 2019, pp. 157–175.
[31] S. Ghilardi, A. Gianola, M. Montali, A. Rivkin, Petri nets with parameterised data, in: Proc.</p>
      <p>of BPM 2020, volume 12168 of LNCS, Springer, 2020, pp. 55–74.
[32] G. Dulac-Arnold, D. J. Mankowitz, T. Hester, Challenges of real-world reinforcement
learning, CoRR abs/1904.12901 (2019).
[33] D. Büchler, S. Guist, R. Calandra, V. Berenz, B. Schölkopf, J. Peters, Learning to play table
tennis from scratch using muscular robots, CoRR abs/2006.05935 (2020).
[34] T. Zhu, K. Li, L. Kuang, P. Herrero, P. Georgiou, An insulin bolus advisor for type 1 diabetes
using deep reinforcement learning, Sensors 20 (2020) 5058.
[35] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves,
M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou,
H. King, D. Kumaran, D. Wierstra, S. Legg, D. Hassabis, Human-level control through
deep reinforcement learning, Nat. 518 (2015) 529–533.
[36] T. Jaksch, R. Ortner, P. Auer, Near-optimal regret bounds for reinforcement learning, J.</p>
      <p>Mach. Learn. Res. 11 (2010) 1563–1600.
[37] A. Agarwal, S. M. Kakade, J. D. Lee, G. Mahajan, Optimality and approximation with
policy gradient methods in markov decision processes, in: Proc. of COLT 2020, volume
125, PMLR, 2020, pp. 64–66.
[38] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming,</p>
      <p>Wiley, 1994.
[39] M. Alshiekh, R. Bloem, R. Ehlers, B. Könighofer, S. Niekum, U. Topcu, Safe reinforcement
learning via shielding, in: Proc. of AAAI 2018, AAAI Press, 2018, pp. 2669–2678.
[40] D. Amodei, C. Olah, J. Steinhardt, P. F. Christiano, J. Schulman, D. Mané, Concrete problems
in AI safety, CoRR abs/1606.06565 (2016).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning: An introduction</article-title>
          , MIT press,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>García</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <article-title>A comprehensive survey on safe reinforcement learning</article-title>
          ,
          <source>J. Mach. Learn. Res</source>
          .
          <volume>16</volume>
          (
          <year>2015</year>
          )
          <fpage>1437</fpage>
          -
          <lpage>1480</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dillig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chaudhuri</surname>
          </string-name>
          ,
          <article-title>Neurosymbolic reinforcement learning with formally verified exploration</article-title>
          ,
          <source>in: Proc. of NeurIPS</source>
          <year>2020</year>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T. M.</given-names>
            <surname>Moldovan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Safe exploration in Markov decision processes</article-title>
          ,
          <source>in: Proc. of ICML</source>
          <year>2012</year>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Nachum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Duéñez-Guzmán</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghavamzadeh</surname>
          </string-name>
          ,
          <article-title>A Lyapunov-based approach to safe reinforcement learning</article-title>
          ,
          <source>in: Proc. of NeurIPS</source>
          <year>2018</year>
          ,
          <year>2018</year>
          , pp.
          <fpage>8103</fpage>
          -
          <lpage>8112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Held</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tamar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <article-title>Constrained policy optimization</article-title>
          ,
          <source>in: Proc. of ICML</source>
          <year>2017</year>
          , volume
          <volume>70</volume>
          , PMLR,
          <year>2017</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>31</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanbeig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abate</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kroening</surname>
          </string-name>
          ,
          <article-title>Certified reinforcement learning with logic guidance</article-title>
          , CoRR abs/1902.00778 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1902.00778.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanbeig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kroening</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abate</surname>
          </string-name>
          ,
          <article-title>Towards verifiable and safe model-free reinforcement learning</article-title>
          ,
          <source>in: Proc. of OVERLAY</source>
          <year>2019</year>
          , volume
          <volume>2509</volume>
          , CEUR-WS.org,
          <year>2019</year>
          , p.
          <fpage>1</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V. S.</given-names>
            <surname>Borkar</surname>
          </string-name>
          ,
          <article-title>An actor-critic algorithm for constrained Markov decision processes</article-title>
          ,
          <source>Syst. Control. Lett</source>
          .
          <volume>54</volume>
          (
          <year>2005</year>
          )
          <fpage>207</fpage>
          -
          <lpage>213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Moody</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Saffell</surname>
          </string-name>
          ,
          <article-title>Learning to trade via direct reinforcement</article-title>
          ,
          <source>IEEE Trans. Neural Networks</source>
          <volume>12</volume>
          (
          <year>2001</year>
          )
          <fpage>875</fpage>
          -
          <lpage>889</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Kakade</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Langford</surname>
          </string-name>
          ,
          <article-title>Approximately optimal approximate reinforcement learning</article-title>
          ,
          <source>in: Proc. of ICML</source>
          <year>2002</year>
          , Morgan Kaufmann,
          <year>2002</year>
          , pp.
          <fpage>267</fpage>
          -
          <lpage>274</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Theocharous</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghavamzadeh</surname>
          </string-name>
          ,
          <article-title>High confidence policy improvement</article-title>
          ,
          <source>in: Proc. of ICML</source>
          <year>2015</year>
          , volume
          <volume>37</volume>
          , JMLR.org,
          <year>2015</year>
          , pp.
          <fpage>2380</fpage>
          -
          <lpage>2388</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Levine</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Abbeel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. I.</given-names>
            <surname>Jordan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Moritz</surname>
          </string-name>
          ,
          <article-title>Trust region policy optimization</article-title>
          ,
          <source>in: Proc. of ICML</source>
          , volume
          <volume>37</volume>
          , JMLR.org,
          <year>2015</year>
          , pp.
          <fpage>1889</fpage>
          -
          <lpage>1897</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>F.</given-names>
            <surname>Berkenkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Turchetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Schoellig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krause</surname>
          </string-name>
          ,
          <article-title>Safe model-based reinforcement learning with stability guarantees</article-title>
          ,
          <source>in: Proc. of NeurIPS</source>
          <year>2017</year>
          ,
          <year>2017</year>
          , pp.
          <fpage>908</fpage>
          -
          <lpage>918</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Papini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pirotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Restelli</surname>
          </string-name>
          ,
          <article-title>Smoothing policies and safe policy gradients</article-title>
          , CoRR abs/1905.03231 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Baier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Katoen</surname>
          </string-name>
          ,
          <source>Principles of Model Checking</source>
          , MIT Press,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghilardi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gianola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Montali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rivkin</surname>
          </string-name>
          ,
          <article-title>Verification of data-aware processes: Challenges and opportunities for automated reasoning</article-title>
          ,
          <source>in: Proc. of ARCADE 2019</source>
          , volume
          <volume>311</volume>
          , EPTCS,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>A.</given-names>
            <surname>Pnueli</surname>
          </string-name>
          ,
          <article-title>The temporal logic of programs</article-title>
          ,
          <source>in: Proc. of FOCS '77</source>
          , IEEE Computer Society,
          <year>1977</year>
          , pp.
          <fpage>46</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Alur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Henzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Kupferman</surname>
          </string-name>
          ,
          <article-title>Alternating-time temporal logic</article-title>
          ,
          <source>J. ACM</source>
          <volume>49</volume>
          (
          <year>2002</year>
          )
          <fpage>672</fpage>
          -
          <lpage>713</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>K.</given-names>
            <surname>Chatterjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. A.</given-names>
            <surname>Henzinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Piterman</surname>
          </string-name>
          ,
          <article-title>Strategy logic</article-title>
          ,
          <source>Inf. Comput.</source>
          <volume>208</volume>
          (
          <year>2010</year>
          )
          <fpage>677</fpage>
          -
          <lpage>693</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with LTLf/LDLf restraining specifications</article-title>
          ,
          <source>in: Proc. of ICAPS 2019</source>
          , AAAI Press,
          <year>2019</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>