<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fully Learnable Neural Reward Machines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hazem Dewidar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Umili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>La Sapienza University of Rome</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>26</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions, such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, thanks to the finite and compact nature of automata. Furthermore, we show that integrating Fully Learnable Neural Reward Machines (FLNRM) with DRL outperforms previous approaches based on Recurrent Neural Networks (RNNs).</p>
      </abstract>
      <kwd-group>
        <kwd>Automata Learning</kwd>
        <kwd>Neurosymbolic learning</kwd>
        <kwd>Deep Reinforcement Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The learned automaton structure provides a powerful inductive bias, enabling FLNRM to outperform
standard RNN-based baselines, especially in tasks with complex logical constraints. Our method therefore
retains the general applicability of standard deep RL approaches, while improving performance and
interpretability, taking the best from both automata-based and deep learning-based RL.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Temporal logic formalisms are widely used in Reinforcement Learning (RL) to specify non-Markovian
tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], allowing agents to reason about temporally extended goals and constraints. Much of the
existing literature assumes that: (1) the temporal specification is given, and (2) the boolean propositions
used in the specification are observable in the environment—either perfectly [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref7 ref8 ref9">7, 8, 9, 10, 11, 12</xref>
        ] or with
some noise [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. Many prior approaches relax only assumption (1), by integrating automata
learning within RL agents [
        <xref ref-type="bibr" rid="ref10 ref12 ref9">9, 10, 12</xref>
        ]; or only assumption (2), using neurosymbolic (NeSy) frameworks
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or multi-task RL techniques [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; yet they still rely on one of the two.
      </p>
      <p>
        Notably, recent work [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] learns both automata and latent event triggers from data without
requiring predefined labeling functions or prior temporal knowledge. However, its use of Inductive
Logic Programming (ILP) restricts its applicability to discrete, finite symbolic domains, excluding
environments that provide raw observations, such as images or sensor data. In our approach, we learn the
automaton describing the RL task structure directly from raw experience, without any prior knowledge
or assumptions about the type of observations, which may be high-dimensional and continuous.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>Notation. In this work, we consider sequential data of various types, including both symbolic and
subsymbolic representations. Symbolic sequences are also called traces. Each element in a trace is a
symbol $\sigma$ drawn from a finite alphabet $\Sigma$. We denote sequences using bold notation. For example,
$\boldsymbol{\sigma} = (\sigma^{(1)}, \sigma^{(2)}, \ldots, \sigma^{(T)})$ represents a trace of length $T$.
Each symbolic variable in the sequence can be grounded either categorically or probabilistically. In the
case of categorical grounding, each element of the trace is assigned a symbol from $\Sigma$, denoted simply
as $\sigma^{(t)}$. In the case of probabilistic grounding, each symbolic variable is associated with a
probability distribution over $\Sigma$, represented as a vector $\tilde{\sigma}^{(t)} \in \Delta(\Sigma)$,
where $\Delta(\Sigma)$ denotes the probability simplex defined as
\[ \Delta(\Sigma) = \left\{ \tilde{\sigma} \in \mathbb{R}^{|\Sigma|} \;\middle|\; \tilde{\sigma}_i \ge 0, \; \sum_{i=1}^{|\Sigma|} \tilde{\sigma}_i = 1 \right\}. \]
Accordingly, we distinguish between categorically grounded sequences $\boldsymbol{\sigma}$ and probabilistically
grounded sequences $\tilde{\boldsymbol{\sigma}}$ using the tilde notation. Finally, note that we use superscripts
to indicate time steps in the sequence and subscripts to denote vector components. For instance,
$\tilde{\sigma}^{(t)}_i$ denotes the $i$-th component of the probabilistic grounding of $\sigma$ at time step $t$.</p>
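      <p>The following minimal Python sketch (ours, purely illustrative; the alphabet and the sampler are
hypothetical) makes the two grounding modes concrete: a categorical trace assigns one symbol per step,
while a probabilistic trace assigns a point of the simplex $\Delta(\Sigma)$ per step.</p>
      <preformat>
import numpy as np

# Illustrative alphabet; any finite symbol set works.
SIGMA = ["a", "b", "c"]                 # |Sigma| = 3

# Categorical grounding: one symbol from Sigma per time step.
categorical_trace = ["a", "c", "b"]     # sigma^(1), sigma^(2), sigma^(3)

def random_simplex_point(n, rng):
    """Sample a vector from the probability simplex: entries are non-negative and sum to 1."""
    x = rng.random(n)
    return x / x.sum()

# Probabilistic grounding: one distribution over Sigma per time step.
rng = np.random.default_rng(0)
prob_trace = [random_simplex_point(len(SIGMA), rng) for _ in range(3)]
# prob_trace[t][i] is the i-th component of the grounding at step t+1.
assert all(np.isclose(p.sum(), 1.0) and (p >= 0).all() for p in prob_trace)
      </preformat>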
      <p>
        Non-Markovian Reward Decision Processes. In Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] the agent-environment interaction is generally modeled as a Markov Decision Process (MDP). An MDP
is a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of environment states, $A$ is the set of the agent's
actions, $P : S \times A \times S \to [0, 1]$ is the transition function, $R : S \times A \to \mathbb{R}$
is the reward function, and $\gamma \in [0, 1]$ is the discount factor expressing the preference for
immediate over future reward. In this classical setting, transitions and rewards are assumed to be
Markovian, i.e., they are functions of the current state only. Although this formulation is general enough
to model most decision problems, it has been observed that many natural tasks are non-Markovian [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A decision process can be non-Markovian because Markovianity does not hold on the reward function
$R : (S \times A)^* \to \mathbb{R}$, on the transition function $P : (S \times A)^* \times S \to [0, 1]$,
or on both. In this work we focus on Non-Markovian Reward Decision Processes (NMRDPs) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Reward Machines. Rather than developing new RL algorithms to tackle NMRDPs, research has focused
mainly on how to construct Markovian state representations of NMRDPs. One approach of this kind is
the so-called Reward Machine (RM). RMs are an automata-based representation of non-Markovian
reward functions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Given a finite set of propositions representing abstract properties or events observable in the
environment, a Reward Machine is a tuple $(\Sigma, Q, R, q_0, \delta, \theta, L)$, where $\Sigma$ is the
automaton alphabet, $Q$ is the set of automaton states, $R$ is a finite set of continuous reward values,
$q_0$ is the initial state, $\delta : Q \times \Sigma \to Q$ is the automaton transition function,
$\theta : Q \to R$ is the reward function, and $L : S \to \Sigma$ is the labeling (or symbol grounding)
function, which recognizes symbols in the environment states. Let $\boldsymbol{s} = (s^{(1)}, s^{(2)}, \ldots, s^{(t)})$
be the sequence of states the agent has observed in the environment up to the current time instant $t$.
This is transformed into the sequence of symbols $\boldsymbol{\sigma} = (L(s^{(1)}), L(s^{(2)}), \ldots, L(s^{(t)}))$
by the labeling function. This string of symbols is processed by the Moore machine
$(\Sigma, Q, R, q_0, \delta, \theta)$ so as to produce a history-dependent reward (output) value at time $t$,
$r^{(t)}$, and an automaton state at time $t$, $q^{(t)}$. The reward value can be used to guide the agent
toward the satisfaction of the task expressed by the automaton, while the automaton state can be used to
construct a Markovian state representation. In fact, it was proven that the augmented state
$(s^{(t)}, q^{(t)})$ is a Markovian state representation for the task expressed by the RM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
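      <p>To make the definition concrete, the following small Python sketch (our own hypothetical example,
not taken from the cited works) implements a Reward Machine for the task "reach a, then b": the labeling
function maps environment states to symbols, and the Moore machine consumes the resulting trace while
emitting a reward at each step.</p>
      <preformat>
# A Reward Machine for the task "visit a, then b" (hypothetical example).
# Q = {q0, q1, q2}, Sigma = {a, b, none}; theta outputs reward 1 in q2.
Q0 = "q0"
delta = {                                   # delta : Q x Sigma -> Q
    ("q0", "a"): "q1", ("q0", "b"): "q0", ("q0", "none"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2", ("q1", "none"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2", ("q2", "none"): "q2",
}
theta = {"q0": 0.0, "q1": 0.0, "q2": 1.0}   # theta : Q -> R

def labeling(env_state):
    """L : S -> Sigma; here a trivial lookup on symbolic states."""
    return env_state if env_state in ("a", "b") else "none"

def run_rm(states):
    """Return the per-step automaton state q^(t) and reward r^(t)."""
    q, outputs = Q0, []
    for s in states:
        q = delta[(q, labeling(s))]
        outputs.append((q, theta[q]))
    return outputs

print(run_rm(["x", "a", "x", "b"]))
# [('q0', 0.0), ('q1', 0.0), ('q1', 0.0), ('q2', 1.0)]
      </preformat>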
      <p>
        Neural Reward Machines. Neural Reward Machines (NRMs) are a probabilistic relaxation of standard
Reward Machines, where the Moore machine is represented in matrix form, and input symbols, states,
and rewards are probabilistically grounded. Given a Moore machine $(\Sigma, Q, R, q_0, \delta, \theta)$
representing the task's reward structure, which is assumed to be known, we denote the transition and
output (reward) functions in matrix form as $T \in \mathbb{R}^{|\Sigma| \times |Q| \times |Q|}$ and
$\mathcal{R} \in \mathbb{R}^{|Q| \times |R|}$, respectively. NRMs assume that the labeling function $L$
is unknown and must be approximated by a neural network $\mathrm{sg}$ with trainable parameters
$\theta_{sg}$, which takes an environment state $s \in S$ as input and outputs a probability distribution
over symbols $\tilde{\sigma} \in \Delta(\Sigma)$. The full model is formulated as follows:
\[ \tilde{\sigma}^{(t)} = \mathrm{sg}(s^{(t)}; \theta_{sg}), \qquad
   \tilde{q}^{(t)} = \sum_{i=1}^{|\Sigma|} \tilde{\sigma}^{(t)}_i \, (\tilde{q}^{(t-1)} \cdot T_i), \qquad
   \tilde{r}^{(t)} = \tilde{q}^{(t)} \cdot \mathcal{R} \tag{1} \]
The model is fully continuous and differentiable, allowing its parameters $\theta_{sg}$ to be learned
through gradient-based optimization on input-output target sequences. In particular, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] train the model on episodes $(\boldsymbol{s}, \boldsymbol{r})$ collected from interactions with the environment.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>Fully Learnable Reward Machines. In this paper, we extend NRMs to be fully learnable, and refer to
our model as Fully Learnable Neural Reward Machines (FLNRM). We assume that no prior knowledge
is provided to the model: it must learn an approximation of both the labeling function and the
Moore machine from experience. Since the task's Moore machine specification is unknown, the number
of required states and symbols is also unknown. We initialize the number of symbols to $|\hat{\Sigma}|$ and
the number of states to $|\hat{Q}|$. In contrast, the number of distinct reward values can be inferred
through interaction with the environment, so we assume $|\hat{R}| = |R|$. As a result, $|\hat{\Sigma}|$
and $|\hat{Q}|$ are the only two hyperparameters of our model. The FLNRM model is shown in Figure 1,
and it is formulated as follows:
\[ T = \mathrm{softmax}(\theta_T / \tau), \qquad
   \mathcal{R} = \mathrm{softmax}(\theta_{\mathcal{R}} / \tau), \qquad
   \tilde{\sigma}^{(t)} = \mathrm{softmax}(\mathrm{sg}(s^{(t)}; \theta_{sg}) / \tau), \]
\[ \tilde{q}^{(t)} = \sum_{i=1}^{|\hat{\Sigma}|} \tilde{\sigma}^{(t)}_i \, (\tilde{q}^{(t-1)} \cdot T_i), \qquad
   \tilde{r}^{(t)} = \tilde{q}^{(t)} \cdot \mathcal{R} \tag{2} \]
Our model has three learnable sets of parameters: $\theta_{sg}$, $\theta_T$, and $\theta_{\mathcal{R}}$.
Specifically, $\theta_T \in \mathbb{R}^{|\hat{\Sigma}| \times |\hat{Q}| \times |\hat{Q}|}$ and
$\theta_{\mathcal{R}} \in \mathbb{R}^{|\hat{Q}| \times |R|}$ are matrices with the same dimensions as $T$
and $\mathcal{R}$, respectively. The matrices $T$ and $\mathcal{R}$ are obtained by applying a softmax
activation to the corresponding parameters. This activation ensures that $T$ and $\mathcal{R}$ define
valid probability distributions over the next state and output (unless otherwise specified, the activation
operates over the last dimension of each tensor; here, softmax ensures that each row of each matrix sums
to one). A temperature parameter $\tau \in (0, 1]$ controls the sharpness of the softmax. When $\tau = 1$,
the activation behaves normally; as $\tau$ approaches zero, the softmax approximates an argmax, and the
model behaves increasingly like a deterministic finite state machine rather than a probabilistic one.
Deterministic behavior emerges when all rows of the transition and reward matrices become one-hot
vectors. We apply the same temperature-controlled activation to the symbol grounder network, so as to
smoothly force the grounder to select only one symbol with maximum probability at each time step.</p>
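      <p>A minimal PyTorch sketch of Eq. (2) is given below (ours; the layer sizes and the linear grounder
are illustrative assumptions, not the paper's exact architecture). It shows how the three parameter sets
and the temperature-controlled softmaxes interact in the forward pass.</p>
      <preformat>
import torch
import torch.nn as nn

class FLNRM(nn.Module):
    """Sketch of Eq. (2): a fully learnable probabilistic Moore machine."""

    def __init__(self, obs_dim, n_symbols, n_states, n_rewards, tau=0.5):
        super().__init__()
        self.tau = tau
        self.n_states = n_states
        # Symbol grounder sg(.; theta_sg); a single linear layer here.
        self.sg = nn.Linear(obs_dim, n_symbols)
        # theta_T and theta_R, with the same shapes as T and R.
        self.theta_T = nn.Parameter(torch.randn(n_symbols, n_states, n_states))
        self.theta_R = nn.Parameter(torch.randn(n_states, n_rewards))

    def forward(self, states):
        """states: (T, obs_dim) tensor; returns (T, |R|) reward distributions."""
        T_mat = torch.softmax(self.theta_T / self.tau, dim=-1)  # rows sum to 1
        R_mat = torch.softmax(self.theta_R / self.tau, dim=-1)
        q = torch.zeros(self.n_states)
        q[0] = 1.0                                # start in the initial state
        rewards = []
        for s in states:
            sigma = torch.softmax(self.sg(s) / self.tau, dim=-1)
            # q~(t) = sum_i sigma_i(t) * (q~(t-1) . T_i)
            q = torch.einsum("i,ijk,j->k", sigma, T_mat, q)
            rewards.append(q @ R_mat)             # r~(t) = q~(t) . R
        return torch.stack(rewards)
      </preformat>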
      <p>Integrating FLNRM with deep RL. In this section, we describe how FLNRM is integrated with policy
learning through RL in non-Markovian domains. As in standard RL, we consider an agent interacting
with an unknown environment. At each time step $t$, the agent takes an action $a^{(t)}$, observes the
current state $s^{(t)}$, and receives a reward $r^{(t)}$. The agent's objective is to learn a policy
$\pi : S \to A$ that maximizes the cumulative discounted reward $\sum_{t=0}^{\infty} \gamma^t r^{(t+1)}$.
We assume the reward signal is non-Markovian and can be modeled by a Reward Machine, namely as the
composition of a symbol perception function and a Moore machine. As the agent explores the environment,
we record each episode as a sequence of states $\boldsymbol{s}$ and corresponding rewards $\boldsymbol{r}$.
At regular intervals, we use the collected experience to train the FLNRM parameters by minimizing the
cross-entropy loss between the predicted reward sequence $\tilde{\boldsymbol{r}}$ and the observed
ground-truth rewards $\boldsymbol{r}$. Once the FLNRM has been trained, we use it to construct a
history-dependent state representation that mitigates non-Markovianity. Specifically, we augment each
environment state $s^{(t)}$ with the probabilistically grounded machine state $\tilde{q}^{(t)}$, and learn
the policy over the augmented state space, $\pi : S \times \Delta(\hat{Q}) \to A$. A schema of this
process is shown in Figure 1.</p>
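      <p>The training step can be sketched as follows (ours; the FLNRM module above, the episode buffer, and
the reward-index encoding are assumptions). Each observed reward value is mapped to an index into the
finite set of reward values, so that the predicted distribution $\tilde{r}^{(t)}$ can be matched to the
ground truth with a cross-entropy (negative log-likelihood) objective.</p>
      <preformat>
import torch
import torch.nn.functional as F

def train_flnrm(model, episodes, lr=4e-4, epochs=10):
    """Fit the FLNRM on recorded episodes.

    episodes: list of (states, reward_ids) pairs, where states is a
    (T, obs_dim) tensor and reward_ids is a (T,) tensor of indices into
    the finite set of observed reward values.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for states, reward_ids in episodes:
            pred = model(states)                  # (T, |R|) probabilities
            # Cross-entropy between predicted reward distributions and
            # ground-truth reward indices (log-probabilities, then NLL).
            loss = F.nll_loss(torch.log(pred + 1e-8), reward_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()

# After training, the machine state q~(t) tracked by the model is
# concatenated to s^(t) to form the augmented input of the A2C policy.
      </preformat>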
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        We validate our framework by replicating the experimental setup presented in the NRM paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our
implementation code is available on GitHub. In particular, we focus on navigation environments, where
multiple items are present, and the agent must navigate among them so as to satisfy a specific formula
in Linear Temporal Logic over finite traces (LTLf) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Two environments are designed to illustrate
varying levels of difficulty in symbol grounding: (i) Map Environment – where the state is represented
by a 2D vector indicating the agent’s current (, ) location; (ii) Image Environment – where the
state consists of a 64 × 64 × 3 pixel image depicting the agent within the grid. For each of these two
environments we tested two classes of temporal tasks, focusing on formula patterns commonly used in
non-Markovian reinforcement learning [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ] and denoted as in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]: (i) first class - includes tasks
defined as conjunctions of Visit formulas (the agent must reach some items without a predefined
order) and Seq_Visit formulas (the agent must reach the items in a certain sequence). (ii) second
class - includes tasks defined as conjunctions of Visit, Seq_Visit, and Glob_Avoid formulas (the
agent must always avoid certain items). The complete list of formulas is reported in the Appendix.
      </p>
      <sec id="sec-5-1">
        <title>FLNRM with 30 states</title>
      </sec>
      <sec id="sec-5-2">
        <title>FLNRM with 5 states RNN</title>
        <p>
          Results We compare our method with RNN-based approaches using A2C [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] as RL algorithm, |^ |
equal to the groundtruth number of symbols | | = 5, and |^| equal to 5 and 30 states. Figures 2
show the training rewards obtained in both the image and map environments. For each task and
method, we perform five runs with diferent random seeds. The results indicate that our method
generally outperforms the baseline. Notably, the performance gap is most evident in the second class
of tasks, which include the Global_Avoidance constraint. We attribute this to the strong and frequent
feedback signals these clauses provide: violations trigger immediate and unambiguous negative rewards,
which improve credit assignment and accelerate representation learning. All methods share the same
hyperparameter settings for A2C, as well as for the neural networks used in the policy, value function,
and feature extraction (the latter is only applied in the image environment), which are detailed in the
appendix. The results shows that the number of states will not afect much the quality of the model (the
rewards are almost the same). Also changing the observation function only brings minor variations in
the results. Indeed, for the same LTLf task, the reward trend is similar in both environments, despite
one being based on images and the other on vector observations. This demonstrates that our method
efectively handle diferent types of raw observations without any issues.
        </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Works</title>
      <p>In this paper, we extend NRMs into Fully Learnable NRMs, which learn an automaton representation
of the RL task directly from raw observations and exploit it in real time to accelerate RL performance.
Through extensive experimentation, we show that our method generally surpasses the performance
of deep RL baselines based on RNNs. Our method thus retains the broad applicability of DRL approaches
while improving on their performance, and it remains grounded in symbolic, explainable, logic-based
methods, combining the best of both worlds. One current limitation of our experiments
is the assumption that the ground-truth number of symbols is known, an unrealistic constraint in many
real-world scenarios. In future work, we aim to test the framework with imprecise estimates of the
number of symbols, further widening its applicability.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work of Hazem Dewidar was carried out when he was enrolled in the Italian National Doctorate
on Artificial Intelligence run by Sapienza University of Rome. This work has been partially supported
by PNRR MUR project PE0000013-FAIR.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling checking.
After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Experimental details</title>
      <sec id="sec-9-1">
        <title>A.1. Task Formulas</title>
        <p>We selected 8 formulas as RL tasks, 4 of class 1 and 4 of class 2, detailed in Table 1. The class-1
tasks include F(a) ∧ F(b) ∧ F(c), F(a ∧ F(b)), and F(a ∧ F(b)) ∧ F(c); the class-2 tasks are
F(a) ∧ F(b) ∧ G(¬c), F(a) ∧ F(b) ∧ G(¬c) ∧ G(¬d), F(a ∧ F(b)) ∧ G(¬c), and F(a ∧ F(b)) ∧ G(¬c) ∧ G(¬d).</p>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. Hyperparameters Setting</title>
        <p>The proposed model is designed to learn a variable number of latent states, denoted by $|\hat{Q}|$. In our
experiments, we evaluated performance under two configurations: $|\hat{Q}| = 5$ and $|\hat{Q}| = 30$. The recurrent
neural network (RNN) component was configured as an LSTM with two layers (num_layers = 2) and
an output dimensionality of rnn_outputs = 5. The size of the hidden state in the RNN was set to
rnn_hidden_size = 50. For the Advantage Actor-Critic (A2C) architecture, the hidden layer size
was fixed at hidden_size = 120. The learning rate of the optimizer is set to lr = 0.0004, while the
temperature value used is $\tau = 0.5$.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Recurrent world models facilitate policy evolution</article-title>
          , in: S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          Curran Associates, Inc.,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kapturowski</surname>
          </string-name>
          , G. Ostrovski,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dabney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <article-title>Recurrent experience replay in distributed reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 7th International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2019</year>
          . URL: https://openreview.net/forum?id=r1lyTjAqYX.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          ,
          <article-title>Linear temporal logic and linear dynamic logic on finite traces</article-title>
          ,
          <source>in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13)</source>
          , AAAI Press,
          <year>2013</year>
          , pp.
          <fpage>854</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Umili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Argenziano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Capobianco</surname>
          </string-name>
          , Neural reward machines,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.08677. arXiv:2408.08677.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Topcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. I.</given-names>
            <surname>Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. MacGlashan</surname>
          </string-name>
          ,
          <article-title>Environment-independent task specifications via gltl</article-title>
          ,
          <source>CoRR abs/1704</source>
          .04341 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1704.04341.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Camacho</surname>
          </string-name>
          , R. T. Icarte,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Ltl and beyond: Formal languages for reward function specification in reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)</source>
          ,
          <source>International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6065</fpage>
          -
          <lpage>6073</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications</article-title>
          ,
          <source>in: Proceedings of the International Conference on Automated Planning and Scheduling</source>
          , volume
          <volume>29</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          . URL: https://ojs.aaai.org/index.php/ICAPS/article/view/3549.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brafman</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning with non-markovian rewards</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>3980</fpage>
          -
          <lpage>3987</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/5814. doi:10.1609/aaai.v34i04.5814.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Neider</surname>
          </string-name>
          , U. Topcu,
          <article-title>Active finite reward automaton inference and reinforcement learning using queries and counterexamples, in: Machine Learning and Knowledge Extraction (CD-MAKE)</article-title>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ronca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Licks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <article-title>Markov abstractions for pac reinforcement learning in non-markov decision processes</article-title>
          ,
          <source>in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI</source>
          <year>2022</year>
          ), Vienna, Austria,
          <year>2022</year>
          , pp.
          <fpage>3408</fpage>
          -
          <lpage>3415</lpage>
          . URL: https://doi.org/10.24963/ijcai.2022/473. doi:10.24963/ijcai.2022/473.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Furelos-Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jonsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Broda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>Induction and exploitation of subgoal automata for reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>1031</fpage>
          -
          <lpage>1116</lpage>
          . URL: https://doi.org/10.1613/jair.1.12372. doi:10.1613/jair.1.12372.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning based temporal logic control with maximum probabilistic satisfaction</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>806</fpage>
          -
          <lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Verginis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koprulu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chinchali</surname>
          </string-name>
          , U. Topcu,
          <article-title>Joint learning of reward machines and policies in environments with partially known semantics</article-title>
          ,
          <source>CoRR abs/2204</source>
          .11833 (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2204.11833. doi:10.48550/arXiv.2204.11833.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vaezipoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Noisy symbolic abstractions for deep rl: A case study with reward machines</article-title>
          ,
          <source>CoRR abs/2211</source>
          .10902 (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2211.10902. doi:10.48550/arXiv.2211.10902.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbu</surname>
          </string-name>
          ,
          <article-title>Encoding formulas as deep networks: Reinforcement learning for zeroshot execution of ltl formulas</article-title>
          ,
          <source>in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>5604</fpage>
          -
          <lpage>5610</lpage>
          . URL: https://doi.org/10.1109/IROS45743.2020.9341325. doi:10.1109/IROS45743.2020.9341325.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hyde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Jr</surname>
          </string-name>
          ,
          <article-title>Detecting hidden triggers: Mapping non-markov reward functions to markov, 2024</article-title>
          . URL: https://arxiv.org/abs/2401.11325. arXiv:2401.11325.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: An Introduction</source>
          , 2nd ed., The MIT Press,
          <year>2018</year>
          . URL: http://incompleteideas.net/book/the-book-2nd.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          ,
          <article-title>Linear temporal logic and linear dynamic logic on finite traces</article-title>
          ,
          <source>in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13)</source>
          , AAAI Press,
          <year>2013</year>
          , pp.
          <fpage>854</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vaezipoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Ltl2action: Generalizing LTL instructions for multi-task reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , PMLR, Virtual Event,
          <year>2021</year>
          , pp.
          <fpage>10497</fpage>
          -
          <lpage>10508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Menghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tsigkanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pelliccione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghezzi</surname>
          </string-name>
          , T. Berger,
          <article-title>Specification patterns for robotic missions</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>47</volume>
          (
          <year>2021</year>
          )
          <fpage>2208</fpage>
          -
          <lpage>2224</lpage>
          . URL: https://doi.org/10.1109/TSE.2019.2945329. doi:10.1109/TSE.2019.2945329.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          ,
          <source>CoRR abs/1602</source>
          .01783 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1602.01783. arXiv:1602.01783.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>