<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Fully Learnable Neural Reward Machines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hazem Dewidar</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Umili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>La Sapienza University of Rome</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>26</volume>
      <issue>2025</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Non-Markovian Reinforcement Learning (RL) tasks present significant challenges, as agents must reason over entire trajectories of state-action pairs to make optimal decisions. A common strategy to address this is through symbolic formalisms, such as Linear Temporal Logic (LTL) or automata, which provide a structured way to express temporally extended objectives. However, these approaches often rely on restrictive assumptions, such as the availability of a predefined Symbol Grounding (SG) function mapping raw observations to high-level symbolic representations, or prior knowledge of the temporal task. In this work, we propose a fully learnable version of Neural Reward Machines (NRM), which can learn both the SG function and the automaton end-to-end, removing any reliance on prior knowledge. Our approach is therefore as easily applicable as classic deep RL (DRL) approaches, while being far more explainable, thanks to the finite and compact nature of automata. Furthermore, we show that integrating Fully Learnable Neural Reward Machines (FLNRM) with DRL outperforms previous approaches based on Recurrent Neural Networks (RNNs).</p>
      </abstract>
      <kwd-group>
        <kwd>Automata Learning</kwd>
        <kwd>Neurosymbolic learning</kwd>
        <kwd>Deep Reinforcement Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The learned automaton structure provides a powerful inductive bias, enabling FLNRM to outperform
standard RNN-based baselines, especially in tasks with complex logical constraints. Our method therefore
retains the general applicability of standard deep RL approaches, while improving performance and
interpretability, taking the best from both automata-based and deep learning-based RL.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>
        Temporal logic formalisms are widely used in Reinforcement Learning (RL) to specify non-Markovian
tasks [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], allowing agents to reason about temporally extended goals and constraints. Much of the
existing literature assumes that: (1) the temporal specification is given, and (2) the boolean propositions
used in the specification are observable in the environment—either perfectly [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref7 ref8 ref9">7, 8, 9, 10, 11, 12</xref>
        ] or with
some noise [
        <xref ref-type="bibr" rid="ref13 ref14 ref15">13, 14, 15</xref>
        ]. Many prior approaches relax only assumption (1), by integrating automata
learning within RL agents [
        <xref ref-type="bibr" rid="ref10 ref12 ref9">9, 10, 12</xref>
        ]; or only assumption (2), using neurosymbolic (NeSy) frameworks
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] or multi-task RL techniques [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]; yet they still rely on one of the two.
      </p>
      <p>
        Notably, recent work [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] learns both automata and latent event triggers from data without
requiring predefined labeling functions or prior temporal knowledge. However, its use of Inductive
Logic Programming (ILP) restricts its applicability to discrete, finite symbolic domains, excluding
environments that provide raw observations, such as images or sensor data. In our approach, we learn the
automaton describing the RL task structure directly from raw experience, without any prior knowledge
or assumptions about the type of observations, which may be high-dimensional and continuous.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Background</title>
      <p>Notation. In this work, we consider sequential data of various types, including both symbolic and
subsymbolic representations. Symbolic sequences are also called traces. Each element in a trace is a
symbol $\sigma$ drawn from a finite alphabet $\Sigma$. We denote sequences using bold notation. For example,
$\boldsymbol{\sigma} = (\sigma^{(1)}, \sigma^{(2)}, \ldots, \sigma^{(T)})$ represents a trace of length $T$.
Each symbolic variable in the sequence can be grounded either categorically or probabilistically. In the
case of categorical grounding, each element of the trace is assigned a symbol from $\Sigma$, denoted simply
as $\sigma^{(t)}$. In the case of probabilistic grounding, each symbolic variable is associated with a
probability distribution over $\Sigma$, represented as a vector $\tilde{\sigma}^{(t)} \in \Delta(\Sigma)$,
where $\Delta(\Sigma)$ denotes the probability simplex defined as
\[ \Delta(\Sigma) = \left\{ \tilde{\sigma} \in \mathbb{R}^{|\Sigma|} \;\middle|\; \tilde{\sigma}_i \ge 0, \; \sum_{i=1}^{|\Sigma|} \tilde{\sigma}_i = 1 \right\}. \]
Accordingly, we distinguish between categorically grounded sequences $\boldsymbol{\sigma}$ and probabilistically
grounded sequences $\tilde{\boldsymbol{\sigma}}$ using the tilde notation. Finally, note that we use superscripts
to indicate time steps in the sequence and subscripts to denote vector components. For instance,
$\tilde{\sigma}^{(t)}_i$ denotes the $i$-th component of the probabilistic grounding of $\sigma$ at time step $t$.</p>
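      <p>The following minimal Python sketch (ours, purely illustrative; the alphabet and the sampler are
hypothetical) makes the two grounding modes concrete: a categorical trace assigns one symbol per step,
while a probabilistic trace assigns a point of the simplex $\Delta(\Sigma)$ per step.</p>
      <preformat>
import numpy as np

# Illustrative alphabet; any finite symbol set works.
SIGMA = ["a", "b", "c"]                 # |Sigma| = 3

# Categorical grounding: one symbol from Sigma per time step.
categorical_trace = ["a", "c", "b"]     # sigma^(1), sigma^(2), sigma^(3)

def random_simplex_point(n, rng):
    """Sample a vector from the probability simplex: entries are non-negative and sum to 1."""
    x = rng.random(n)
    return x / x.sum()

# Probabilistic grounding: one distribution over Sigma per time step.
rng = np.random.default_rng(0)
prob_trace = [random_simplex_point(len(SIGMA), rng) for _ in range(3)]
# prob_trace[t][i] is the i-th component of the grounding at step t+1.
assert all(np.isclose(p.sum(), 1.0) and (p >= 0).all() for p in prob_trace)
      </preformat>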
      <p>
        Non-Markovian Reward Decision Processes. In Reinforcement Learning (RL) [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] the agent-environment interaction is generally modeled as a Markov Decision Process (MDP). An MDP
is a tuple $(S, A, P, R, \gamma)$, where $S$ is the set of environment states, $A$ is the set of the agent's
actions, $P : S \times A \times S \to [0, 1]$ is the transition function, $R : S \times A \to \mathbb{R}$
is the reward function, and $\gamma \in [0, 1]$ is the discount factor expressing the preference for
immediate over future reward. In this classical setting, transitions and rewards are assumed to be
Markovian, i.e., they are functions of the current state only. Although this formulation is general enough
to model most decision problems, it has been observed that many natural tasks are non-Markovian [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A decision process can be non-Markovian because Markovianity does not hold on the reward function
$R : (S \times A)^* \to \mathbb{R}$, on the transition function $P : (S \times A)^* \times S \to [0, 1]$,
or on both. In this work we focus on Non-Markovian Reward Decision Processes (NMRDPs) [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ].
      </p>
      <p>
        Reward Machines. Rather than developing new RL algorithms to tackle NMRDPs, research has focused
mainly on how to construct Markovian state representations of NMRDPs. One approach of this kind is
the so-called Reward Machine (RM). RMs are an automata-based representation of non-Markovian
reward functions [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. Given a finite set of propositions representing abstract properties or events observable in the
environment, a Reward Machine is a tuple $(\Sigma, Q, R, q_0, \delta, \theta, L)$, where $\Sigma$ is the
automaton alphabet, $Q$ is the set of automaton states, $R$ is a finite set of continuous reward values,
$q_0$ is the initial state, $\delta : Q \times \Sigma \to Q$ is the automaton transition function,
$\theta : Q \to R$ is the reward function, and $L : S \to \Sigma$ is the labeling (or symbol grounding)
function, which recognizes symbols in the environment states. Let $\boldsymbol{s} = (s^{(1)}, s^{(2)}, \ldots, s^{(t)})$
be the sequence of states the agent has observed in the environment up to the current time instant $t$.
This is transformed into the sequence of symbols $\boldsymbol{\sigma} = (L(s^{(1)}), L(s^{(2)}), \ldots, L(s^{(t)}))$
by the labeling function. This string of symbols is processed by the Moore machine
$(\Sigma, Q, R, q_0, \delta, \theta)$ so as to produce a history-dependent reward (output) value at time $t$,
$r^{(t)}$, and an automaton state at time $t$, $q^{(t)}$. The reward value can be used to guide the agent
toward the satisfaction of the task expressed by the automaton, while the automaton state can be used to
construct a Markovian state representation. In fact, it was proven that the augmented state
$(s^{(t)}, q^{(t)})$ is a Markovian state representation for the task expressed by the RM [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
      </p>
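      <p>To make the definition concrete, the following small Python sketch (our own hypothetical example,
not taken from the cited works) implements a Reward Machine for the task "reach a, then b": the labeling
function maps environment states to symbols, and the Moore machine consumes the resulting trace while
emitting a reward at each step.</p>
      <preformat>
# A Reward Machine for the task "visit a, then b" (hypothetical example).
# Q = {q0, q1, q2}, Sigma = {a, b, none}; theta outputs reward 1 in q2.
Q0 = "q0"
delta = {                                   # delta : Q x Sigma -> Q
    ("q0", "a"): "q1", ("q0", "b"): "q0", ("q0", "none"): "q0",
    ("q1", "a"): "q1", ("q1", "b"): "q2", ("q1", "none"): "q1",
    ("q2", "a"): "q2", ("q2", "b"): "q2", ("q2", "none"): "q2",
}
theta = {"q0": 0.0, "q1": 0.0, "q2": 1.0}   # theta : Q -> R

def labeling(env_state):
    """L : S -> Sigma; here a trivial lookup on symbolic states."""
    return env_state if env_state in ("a", "b") else "none"

def run_rm(states):
    """Return the per-step automaton state q^(t) and reward r^(t)."""
    q, outputs = Q0, []
    for s in states:
        q = delta[(q, labeling(s))]
        outputs.append((q, theta[q]))
    return outputs

print(run_rm(["x", "a", "x", "b"]))
# [('q0', 0.0), ('q1', 0.0), ('q1', 0.0), ('q2', 1.0)]
      </preformat>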
      <p>
        Neural Reward Machines. Neural Reward Machines (NRMs) are a probabilistic relaxation of standard
Reward Machines, where the Moore machine is represented in matrix form, and input symbols, states,
and rewards are probabilistically grounded. Given a Moore machine $(\Sigma, Q, R, q_0, \delta, \theta)$
representing the task's reward structure, which is assumed to be known, we denote the transition and
output (reward) functions in matrix form as $T \in \mathbb{R}^{|\Sigma| \times |Q| \times |Q|}$ and
$\mathcal{R} \in \mathbb{R}^{|Q| \times |R|}$, respectively. NRMs assume that the labeling function $L$
is unknown and must be approximated by a neural network $\mathrm{sg}$ with trainable parameters
$\theta_{sg}$, which takes an environment state $s \in S$ as input and outputs a probability distribution
over symbols $\tilde{\sigma} \in \Delta(\Sigma)$. The full model is formulated as follows:
\[ \tilde{\sigma}^{(t)} = \mathrm{sg}(s^{(t)}; \theta_{sg}), \qquad
   \tilde{q}^{(t)} = \sum_{i=1}^{|\Sigma|} \tilde{\sigma}^{(t)}_i \, (\tilde{q}^{(t-1)} \cdot T_i), \qquad
   \tilde{r}^{(t)} = \tilde{q}^{(t)} \cdot \mathcal{R} \tag{1} \]
The model is fully continuous and differentiable, allowing its parameters $\theta_{sg}$ to be learned
through gradient-based optimization on input-output target sequences. In particular, [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] train the model on episodes $(\boldsymbol{s}, \boldsymbol{r})$ collected from interactions with the environment.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>Fully Learnable Reward Machines. In this paper, we extend NRMs to be fully learnable, and refer to
our model as Fully Learnable Neural Reward Machines (FLNRM). We assume that no prior knowledge
is provided to the model: it must learn an approximation of both the labeling function and the
Moore machine from experience. Since the task's Moore machine specification is unknown, the number
of required states and symbols is also unknown. We initialize the number of symbols to $|\hat{\Sigma}|$ and
the number of states to $|\hat{Q}|$. In contrast, the number of distinct reward values can be inferred
through interaction with the environment, so we assume $|\hat{R}| = |R|$. As a result, $|\hat{\Sigma}|$
and $|\hat{Q}|$ are the only two hyperparameters of our model. The FLNRM model is shown in Figure 1,
and it is formulated as follows:
\[ T = \mathrm{softmax}(\theta_T / \tau), \qquad
   \mathcal{R} = \mathrm{softmax}(\theta_{\mathcal{R}} / \tau), \qquad
   \tilde{\sigma}^{(t)} = \mathrm{softmax}(\mathrm{sg}(s^{(t)}; \theta_{sg}) / \tau), \]
\[ \tilde{q}^{(t)} = \sum_{i=1}^{|\hat{\Sigma}|} \tilde{\sigma}^{(t)}_i \, (\tilde{q}^{(t-1)} \cdot T_i), \qquad
   \tilde{r}^{(t)} = \tilde{q}^{(t)} \cdot \mathcal{R} \tag{2} \]
Our model has three learnable sets of parameters: $\theta_{sg}$, $\theta_T$, and $\theta_{\mathcal{R}}$.
Specifically, $\theta_T \in \mathbb{R}^{|\hat{\Sigma}| \times |\hat{Q}| \times |\hat{Q}|}$ and
$\theta_{\mathcal{R}} \in \mathbb{R}^{|\hat{Q}| \times |R|}$ are matrices with the same dimensions as $T$
and $\mathcal{R}$, respectively. The matrices $T$ and $\mathcal{R}$ are obtained by applying a softmax
activation to the corresponding parameters. This activation ensures that $T$ and $\mathcal{R}$ define
valid probability distributions over the next state and output (unless otherwise specified, the activation
operates over the last dimension of each tensor; here, softmax ensures that each row of each matrix sums
to one). A temperature parameter $\tau \in (0, 1]$ controls the sharpness of the softmax. When $\tau = 1$,
the activation behaves normally; as $\tau$ approaches zero, the softmax approximates an argmax, and the
model behaves increasingly like a deterministic finite state machine rather than a probabilistic one.
Deterministic behavior emerges when all rows of the transition and reward matrices become one-hot
vectors. We apply the same temperature-controlled activation to the symbol grounder network, so as to
smoothly force the grounder to select only one symbol with maximum probability at each time step.</p>
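      <p>A minimal PyTorch sketch of Eq. (2) is given below (ours; the layer sizes and the linear grounder
are illustrative assumptions, not the paper's exact architecture). It shows how the three parameter sets
and the temperature-controlled softmaxes interact in the forward pass.</p>
      <preformat>
import torch
import torch.nn as nn

class FLNRM(nn.Module):
    """Sketch of Eq. (2): a fully learnable probabilistic Moore machine."""

    def __init__(self, obs_dim, n_symbols, n_states, n_rewards, tau=0.5):
        super().__init__()
        self.tau = tau
        self.n_states = n_states
        # Symbol grounder sg(.; theta_sg); a single linear layer here.
        self.sg = nn.Linear(obs_dim, n_symbols)
        # theta_T and theta_R, with the same shapes as T and R.
        self.theta_T = nn.Parameter(torch.randn(n_symbols, n_states, n_states))
        self.theta_R = nn.Parameter(torch.randn(n_states, n_rewards))

    def forward(self, states):
        """states: (T, obs_dim) tensor; returns (T, |R|) reward distributions."""
        T_mat = torch.softmax(self.theta_T / self.tau, dim=-1)  # rows sum to 1
        R_mat = torch.softmax(self.theta_R / self.tau, dim=-1)
        q = torch.zeros(self.n_states)
        q[0] = 1.0                                # start in the initial state
        rewards = []
        for s in states:
            sigma = torch.softmax(self.sg(s) / self.tau, dim=-1)
            # q~(t) = sum_i sigma_i(t) * (q~(t-1) . T_i)
            q = torch.einsum("i,ijk,j->k", sigma, T_mat, q)
            rewards.append(q @ R_mat)             # r~(t) = q~(t) . R
        return torch.stack(rewards)
      </preformat>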
      <p>Integrating FLNRM with deep RL. In this section, we describe how FLNRM is integrated with policy
learning through RL in non-Markovian domains. As in standard RL, we consider an agent interacting
with an unknown environment. At each time step $t$, the agent takes an action $a^{(t)}$, observes the
current state $s^{(t)}$, and receives a reward $r^{(t)}$. The agent's objective is to learn a policy
$\pi : S \to A$ that maximizes the cumulative discounted reward $\sum_{t=0}^{\infty} \gamma^t r^{(t+1)}$.
We assume the reward signal is non-Markovian and can be modeled by a Reward Machine, namely as the
composition of a symbol perception function and a Moore machine. As the agent explores the environment,
we record each episode as a sequence of states $\boldsymbol{s}$ and corresponding rewards $\boldsymbol{r}$.
At regular intervals, we use the collected experience to train the FLNRM parameters by minimizing the
cross-entropy loss between the predicted reward sequence $\tilde{\boldsymbol{r}}$ and the observed
ground-truth rewards $\boldsymbol{r}$. Once the FLNRM has been trained, we use it to construct a
history-dependent state representation that mitigates non-Markovianity. Specifically, we augment each
environment state $s^{(t)}$ with the probabilistically grounded machine state $\tilde{q}^{(t)}$, and learn
the policy over the augmented state space, $\pi : S \times \Delta(\hat{Q}) \to A$. A schema of this
process is shown in Figure 1.</p>
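      <p>The training step can be sketched as follows (ours; the FLNRM module above, the episode buffer, and
the reward-index encoding are assumptions). Each observed reward value is mapped to an index into the
finite set of reward values, so that the predicted distribution $\tilde{r}^{(t)}$ can be matched to the
ground truth with a cross-entropy (negative log-likelihood) objective.</p>
      <preformat>
import torch
import torch.nn.functional as F

def train_flnrm(model, episodes, lr=4e-4, epochs=10):
    """Fit the FLNRM on recorded episodes.

    episodes: list of (states, reward_ids) pairs, where states is a
    (T, obs_dim) tensor and reward_ids is a (T,) tensor of indices into
    the finite set of observed reward values.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for states, reward_ids in episodes:
            pred = model(states)                  # (T, |R|) probabilities
            # Cross-entropy between predicted reward distributions and
            # ground-truth reward indices (log-probabilities, then NLL).
            loss = F.nll_loss(torch.log(pred + 1e-8), reward_ids)
            opt.zero_grad()
            loss.backward()
            opt.step()

# After training, the machine state q~(t) tracked by the model is
# concatenated to s^(t) to form the augmented input of the A2C policy.
      </preformat>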
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>
        We validate our framework by replicating the experimental setup presented in the NRM paper [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Our
implementation code is available on GitHub. In particular, we focus on navigation environments, where
multiple items are present, and the agent must navigate among them so as to satisfy a specific formula
in Linear Temporal Logic over finite traces (LTLf) [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Two environments are designed to illustrate
varying levels of difficulty in symbol grounding: (i) Map Environment – where the state is represented
by a 2D vector indicating the agent’s current (, ) location; (ii) Image Environment – where the
state consists of a 64 × 64 × 3 pixel image depicting the agent within the grid. For each of these two
environments we tested two classes of temporal tasks, focusing on formula patterns commonly used in
non-Markovian reinforcement learning [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ] and denoted as in [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ]: (i) first class - includes tasks
defined as conjunctions of Visit formulas (the agent must reach some items without a predefined
order) and Seq_Visit formulas (the agent must reach the items in a certain sequence). (ii) second
class - includes tasks defined as conjunctions of Visit, Seq_Visit, and Glob_Avoid formulas (the
agent must always avoid certain items). The complete list of formulas is reported in the Appendix.
      </p>
      <sec id="sec-5-1">
        <title>FLNRM with 30 states</title>
      </sec>
      <sec id="sec-5-2">
        <title>FLNRM with 5 states RNN</title>
        <p>
          Results We compare our method with RNN-based approaches using A2C [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] as RL algorithm, |^ |
equal to the groundtruth number of symbols | | = 5, and |^| equal to 5 and 30 states. Figures 2
show the training rewards obtained in both the image and map environments. For each task and
method, we perform five runs with diferent random seeds. The results indicate that our method
generally outperforms the baseline. Notably, the performance gap is most evident in the second class
of tasks, which include the Global_Avoidance constraint. We attribute this to the strong and frequent
feedback signals these clauses provide: violations trigger immediate and unambiguous negative rewards,
which improve credit assignment and accelerate representation learning. All methods share the same
hyperparameter settings for A2C, as well as for the neural networks used in the policy, value function,
and feature extraction (the latter is only applied in the image environment), which are detailed in the
appendix. The results shows that the number of states will not afect much the quality of the model (the
rewards are almost the same). Also changing the observation function only brings minor variations in
the results. Indeed, for the same LTLf task, the reward trend is similar in both environments, despite
one being based on images and the other on vector observations. This demonstrates that our method
efectively handle diferent types of raw observations without any issues.
        </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions and Future Works</title>
      <p>In this paper, we extend NRMs into Fully Learnable NRMs, which learn an automaton representation
of the RL task directly from raw observations and exploit it in real time to accelerate RL performance.
Through extensive experimentation, we show that our method generally surpasses the performance
of deep RL baselines based on RNNs. Our method thus retains the broad applicability of DRL approaches
while improving on their performance, and it remains grounded in symbolic, explainable, logic-based
methods, combining the best of both worlds. One current limitation of our experiments
is the assumption that the ground-truth number of symbols is known, an unrealistic constraint in many
real-world scenarios. In future work, we aim to test the framework with imprecise estimates of the
number of symbols, further widening its applicability.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>The work of Hazem Dewidar was carried out when he was enrolled in the Italian National Doctorate
on Artificial Intelligence run by Sapienza University of Rome. This work has been partially supported
by PNRR MUR project PE0000013-FAIR.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT for grammar and spelling checking.
After using this tool/service, the author(s) reviewed and edited the content as needed and take(s) full
responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Experimental details</title>
      <sec id="sec-9-1">
        <title>A.1. Task Formulas</title>
        <p>We selected 8 formulas as RL tasks, 4 of class 1 and 4 of class 2, detailed in Table 1. The class-1
tasks include F(a) ∧ F(b) ∧ F(c), F(a ∧ F(b)), and F(a ∧ F(b)) ∧ F(c); the class-2 tasks are
F(a) ∧ F(b) ∧ G(¬c), F(a) ∧ F(b) ∧ G(¬c) ∧ G(¬d), F(a ∧ F(b)) ∧ G(¬c), and F(a ∧ F(b)) ∧ G(¬c) ∧ G(¬d).</p>
      </sec>
      <sec id="sec-9-2">
        <title>A.2. Hyperparameters Setting</title>
        <p>The proposed model is designed to learn a variable number of latent states, denoted by $|\hat{Q}|$. In our
experiments, we evaluated performance under two configurations: $|\hat{Q}| = 5$ and $|\hat{Q}| = 30$. The recurrent
neural network (RNN) component was configured as an LSTM with two layers (num_layers = 2) and
an output dimensionality of rnn_outputs = 5. The size of the hidden state in the RNN was set to
rnn_hidden_size = 50. For the Advantage Actor-Critic (A2C) architecture, the hidden layer size
was fixed at hidden_size = 120. The learning rate of the optimizer is set to lr = 0.0004, while the
temperature value used is $\tau = 0.5$.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Ha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          ,
          <article-title>Recurrent world models facilitate policy evolution</article-title>
          , in: S. Bengio,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          , R. Garnett (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          ,
          Curran Associates, Inc.,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kapturowski</surname>
          </string-name>
          , G. Ostrovski,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dabney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Quan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Munos</surname>
          </string-name>
          ,
          <article-title>Recurrent experience replay in distributed reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 7th International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2019</year>
          . URL: https://openreview.net/forum?id=r1lyTjAqYX.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>De Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          ,
          <article-title>Linear temporal logic and linear dynamic logic on finite traces</article-title>
          ,
          <source>in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13)</source>
          , AAAI Press,
          <year>2013</year>
          , pp.
          <fpage>854</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Umili</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Argenziano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Capobianco</surname>
          </string-name>
          , Neural reward machines,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.08677. arXiv:2408.08677.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>M. L.</given-names>
            <surname>Littman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Topcu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L. I.</given-names>
            <surname>Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. MacGlashan</surname>
          </string-name>
          ,
          <article-title>Environment-independent task specifications via gltl</article-title>
          ,
          <source>CoRR abs/1704</source>
          .04341 (
          <year>2017</year>
          ). URL: http://arxiv.org/abs/1704.04341.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Camacho</surname>
          </string-name>
          , R. T. Icarte,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Ltl and beyond: Formal languages for reward function specification in reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)</source>
          ,
          <source>International Joint Conferences on Artificial Intelligence Organization</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>6065</fpage>
          -
          <lpage>6073</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications</article-title>
          ,
          <source>in: Proceedings of the International Conference on Automated Planning and Scheduling</source>
          , volume
          <volume>29</volume>
          ,
          <year>2021</year>
          , pp.
          <fpage>128</fpage>
          -
          <lpage>136</lpage>
          . URL: https://ojs.aaai.org/index.php/ICAPS/article/view/3549.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Gaon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brafman</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning with non-markovian rewards</article-title>
          ,
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>3980</fpage>
          -
          <lpage>3987</lpage>
          . URL: https://ojs.aaai.org/index.php/AAAI/article/view/5814. doi:10.1609/aaai.v34i04.5814.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ojha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Neider</surname>
          </string-name>
          , U. Topcu,
          <article-title>Active finite reward automaton inference and reinforcement learning using queries and counterexamples, in: Machine Learning and Knowledge Extraction (CD-MAKE)</article-title>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>115</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ronca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Licks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <article-title>Markov abstractions for pac reinforcement learning in non-markov decision processes</article-title>
          ,
          <source>in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI</source>
          <year>2022</year>
          ), Vienna, Austria,
          <year>2022</year>
          , pp.
          <fpage>3408</fpage>
          -
          <lpage>3415</lpage>
          . URL: https://doi.org/10.24963/ijcai.2022/473. doi:10.24963/ijcai.2022/473.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Furelos-Blanco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Law</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jonsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Broda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <article-title>Induction and exploitation of subgoal automata for reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>70</volume>
          (
          <year>2021</year>
          )
          <fpage>1031</fpage>
          -
          <lpage>1116</lpage>
          . URL: https://doi.org/10.1613/jair.1.12372. doi:10.1613/jair.1.12372.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <article-title>Reinforcement learning based temporal logic control with maximum probabilistic satisfaction</article-title>
          ,
          <source>in: 2021 IEEE International Conference on Robotics and Automation (ICRA)</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>806</fpage>
          -
          <lpage>812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>C. K.</given-names>
            <surname>Verginis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Koprulu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Chinchali</surname>
          </string-name>
          , U. Topcu,
          <article-title>Joint learning of reward machines and policies in environments with partially known semantics</article-title>
          ,
          <source>CoRR abs/2204</source>
          .11833 (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2204.11833. doi:10.48550/arXiv.2204.11833.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vaezipoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Noisy symbolic abstractions for deep rl: A case study with reward machines</article-title>
          ,
          <source>CoRR abs/2211</source>
          .10902 (
          <year>2022</year>
          ). URL: https://doi.org/10.48550/arXiv.2211.10902. doi:10.48550/arXiv.2211.10902.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Katz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barbu</surname>
          </string-name>
          ,
          <article-title>Encoding formulas as deep networks: Reinforcement learning for zeroshot execution of ltl formulas</article-title>
          ,
          <source>in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)</source>
          <year>2020</year>
          ,
          <year>2020</year>
          , pp.
          <fpage>5604</fpage>
          -
          <lpage>5610</lpage>
          . URL: https://doi.org/10.1109/IROS45743.2020.9341325. doi:10.1109/IROS45743.2020.9341325.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Hyde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. S.</given-names>
            <surname>Jr</surname>
          </string-name>
          ,
          <article-title>Detecting hidden triggers: Mapping non-markov reward functions to markov, 2024</article-title>
          . URL: https://arxiv.org/abs/2401.11325. arXiv:2401.11325.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Sutton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Barto</surname>
          </string-name>
          ,
          <source>Reinforcement Learning: An Introduction</source>
          , 2nd ed., The MIT Press,
          <year>2018</year>
          . URL: http://incompleteideas.net/book/the-book-2nd.html.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Iocchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Favorito</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Patrizi</surname>
          </string-name>
          ,
          <article-title>Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>G. D.</given-names>
            <surname>Giacomo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Y.</given-names>
            <surname>Vardi</surname>
          </string-name>
          ,
          <article-title>Linear temporal logic and linear dynamic logic on finite traces</article-title>
          ,
          <source>in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (IJCAI '13)</source>
          , AAAI Press,
          <year>2013</year>
          , pp.
          <fpage>854</fpage>
          -
          <lpage>860</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Q.</given-names>
            <surname>Klassen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Valenzano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Reward machines: Exploiting reward function structure in reinforcement learning</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>73</volume>
          (
          <year>2022</year>
          )
          <fpage>173</fpage>
          -
          <lpage>208</lpage>
          . doi:10.1613/JAIR.1.12440.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>P.</given-names>
            <surname>Vaezipoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. T.</given-names>
            <surname>Icarte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>McIlraith</surname>
          </string-name>
          ,
          <article-title>Ltl2action: Generalizing LTL instructions for multi-task reinforcement learning</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , PMLR, Virtual Event,
          <year>2021</year>
          , pp.
          <fpage>10497</fpage>
          -
          <lpage>10508</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>C.</given-names>
            <surname>Menghi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tsigkanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pelliccione</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ghezzi</surname>
          </string-name>
          , T. Berger,
          <article-title>Specification patterns for robotic missions</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>47</volume>
          (
          <year>2021</year>
          )
          <fpage>2208</fpage>
          -
          <lpage>2224</lpage>
          . URL: https://doi.org/10.1109/TSE.2019.2945329. doi:10.1109/TSE.2019.2945329.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>V.</given-names>
            <surname>Mnih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Badia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mirza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. P.</given-names>
            <surname>Lillicrap</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Harley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Silver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          ,
          <article-title>Asynchronous methods for deep reinforcement learning</article-title>
          ,
          <source>CoRR abs/1602</source>
          .01783 (
          <year>2016</year>
          ). URL: http://arxiv.org/abs/1602.01783. arXiv:1602.01783.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>