Transfer Learning between non-Markovian RL Tasks through Semantic Representations of Temporal States

Andrea Fanti1,∗, Elena Umili1 and Roberto Capobianco2
1 Sapienza University of Rome
2 Sony AI

Abstract
Reinforcement Learning (RL) faces challenges in transferring knowledge across tasks efficiently. In this work we focus on transferring policies between different temporally extended tasks expressed in Linear Temporal Logic. Existing methodologies either rely on task-specific representations or learn deep-neural-network-based task state representations by exploiting the interaction with the environment. We propose a novel approach, leveraging the semantic similarity of the formulas, to compute transferable task state representations directly from task specifications, offline, and without any learning process. Preliminary experiments on temporally-extended navigation in a grid world domain demonstrate the superiority of our semantic representation over baseline methods. This approach lays the groundwork for lightweight, transferable task state representations based solely on task semantics, offering a promising avenue for efficient RL in temporally-extended tasks without extensive retraining.

Keywords
Non-Markovian Reinforcement Learning, Instruction following, Transfer learning in RL

1. Introduction
Reinforcement Learning (RL) is a powerful Artificial Intelligence paradigm that can solve a great variety of complex decision problems with minimal domain knowledge [1]. In this work, we focus on the age-old problem of using Reinforcement Learning to make an artificial agent follow temporally-extended instructions, such as “Do A and then B”, without training it on every possible task. This setting poses two major challenges to standard RL algorithms: (i) it breaks the so-called Markov property [2], which prevents their application to temporally-extended goals; and (ii) it requires zero-shot transfer learning, i.e., solving unseen tasks without any additional training. Several methodologies have been proposed to apply RL to temporally-extended tasks by exploiting their specification in formal languages [3] or finite state machines [4]. Despite their success in solving single tasks, however, these approaches aggravate the transfer learning problem, as they are based on task state representations that are specific to each task and cannot be transferred to new, unseen tasks. To overcome this limitation, more recent work focuses on exploiting the power of Deep Neural Networks to learn latent temporal state representations using the experience collected in the environment, showing promising results [5, 6]. However, we argue that the correlation between different temporally-extended tasks, when expressed formally through instructions, depends solely on the semantics of the instructions. This correlation can be computed offline through semantic analysis, independently of the environment and of the agent-environment interaction. Therefore, in this work, we explore a radically different approach, in which a transferable task state representation is not learned during model training, but is instead directly computed from the task specification as a vector of real numbers. This computation is based on evaluating the semantic similarity between the task state and the initial state of one or more reference tasks, which form the components of the resulting vector space. Inspired by [7] and [8], we use a kernel function to compute the semantic similarity, and thus refer to the final result as a Kernel Representation.
AAPEI ’24: 1st International Workshop on Adjustable Autonomy and Physical Embodied Intelligence, October 20, 2024, Santiago de Compostela, Spain.
∗ Corresponding author.
fanti@diag.uniroma1.it (A. Fanti); umili@diag.uniroma1.it (E. Umili); Roberto.Capobianco@sony.com (R. Capobianco)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

An immediate benefit of this approach is eliminating the need to employ ad-hoc model architectures, such as the Graph Neural Networks in [5]. On the other hand, an immediate limitation is that computing the exact value of a Kernel Representation over all possible traces is intractable, meaning it must be approximated. To tackle this challenge, we begin by proposing two metrics to assess the quality of an approximated representation without collecting any experience from the environment. Then, we introduce a method to reduce the number of traces needed by sampling them from the characteristic set [9] of the task distribution, rather than at random. We validate these two metrics and Kernel Representations with transfer RL experiments on a simple grid world domain, inspired by the video game Minecraft. In this world, the tasks consist of visiting cells in a suitable order, which is encoded as Linear Temporal Logic formulas evaluated on finite traces (LTLf) [10]. A frame from the environment, along with some example tasks, is shown in Figure 1 (a-b). The results suggest a clear advantage of our Kernel Representation over the non-transfer-aware baseline given by Reward Machines [4]. At the same time, they confirm our hypothesis that our representation quality metrics can give valuable insights into the transfer performance before any interaction with the environment. We believe that this work can represent a solid foundation for more advanced ways to use the semantic similarity between temporal states to obtain lightweight and simple transferable task state representations, directly exploiting the task semantics, without the need to learn such representations from the environment. In other words, this is a first crucial step towards the long-standing challenge of making an artificial agent follow temporally-extended instructions without extensive training on every possible scenario, exploiting the semantics of the language in which the instructions are given, and nothing else.

2. Related Work
Extending classical RL to temporal goals requires augmenting the state of the MDP with some kind of representation of the current progress towards satisfying the goal. The goal itself is most commonly specified in LTL, as in geometric-LTL (G-LTL) [11] and Restraining Bolts [3]. In these cases, the current progress towards the goal is encoded as the state of an automaton equivalent to the LTL formula. Reward Machines (RMs) [12, 4, 13] follow the same principle, but they directly represent tasks as Finite State Automata. In all of the above approaches, however, the representation of the current progress is specific to a particular task, meaning that the learned policies cannot, in general, be reused for other tasks. Early work on semantic-based transfer has focused on allowing transfer when a new task is entirely composed of situations already seen during training [14].
The main limitation of this kind of approach is that it cannot transfer to unseen temporal states that are not exactly equivalent to those seen during training. In contrast, our method maps semantically close temporal states to close points in the semantic space, generalizing to unseen situations. More recent work, instead, focuses on fully transferable task state representations, such that a more general policy can be learned and transferred to entire task distributions. One direction is that of learning independent policies for various subtasks in the environment [15, 16, 17, 18, 19], which are then combined to solve more complex tasks. Another class of approaches is based on exploiting the syntactic structure of the task formula in the design of the agent's Neural Network, either by using it to shape the policy networks [6], or by feeding it directly to a Graph Neural Network to learn a latent embedding [20]. However, none of these approaches makes use of a semantic representation of the task specification to make it transferable. Instead, in contrast to both compositional and NN-based methods, we explore this kind of approach, inspired by [7] and [8], which investigate how to generate semantic-preserving embeddings of logical formulae.

3. Background
Linear Temporal Logic and Deterministic Finite Automata Linear Temporal Logic (LTL) [21] is a language which extends traditional propositional logic with modal operators. With the latter we can specify rules that must hold through time. In this work, we use LTL interpreted over finite traces (LTLf) [10]. This interpretation allows traces that are arbitrarily long, but not infinite, and is adequate for finite-horizon planning or RL problems. Given a set 𝑃 of propositions, the syntax for constructing an LTLf formula 𝜙 is given by

𝜙 ∶= ⊤ | ⊥ | 𝑝 | ¬𝜙 | 𝜙1 ∧ 𝜙2 | 𝑋 𝜙 | 𝜙1 𝑈 𝜙2     (1)

where 𝑝 ∈ 𝑃. We use ⊤ and ⊥ to denote true and false respectively. 𝑋 (Next) and 𝑈 (Until) are temporal operators. Other temporal operators are 𝑁 (Weak Next) and 𝑅 (Release), defined as 𝑁𝜙 ≡ ¬𝑋¬𝜙 and 𝜙1 𝑅 𝜙2 ≡ ¬(¬𝜙1 𝑈 ¬𝜙2); 𝐺 (Globally), 𝐺𝜙 ≡ ⊥ 𝑅 𝜙; and 𝐹 (Eventually), 𝐹𝜙 ≡ ⊤ 𝑈 𝜙. A trace 𝑥𝑝 = 𝑥𝑝^(0), 𝑥𝑝^(1), ... is a sequence of propositional assignments, where 𝑥𝑝^(𝑡) ∈ 2^𝑃 (𝑡 ≥ 0) is the point of 𝑥𝑝 at time 𝑡. Intuitively, 𝑥𝑝^(𝑡) is the set of propositions that are true at instant 𝑡. Additionally, |𝑥𝑝| represents the length of 𝑥𝑝. Since each trace is finite, |𝑥𝑝| < ∞ and 𝑥𝑝 ∈ (2^𝑃)∗. If the propositional symbols in 𝑃 are mutually exclusive, i.e. the domain produces one and only one true symbol at each step, then 𝑥𝑝^(𝑡) ∈ 𝑃 (𝑡 ≥ 0). In this work we assume this second setting, which is known as the Declare assumption [22]. We write 𝑥𝑝 ⊨ 𝜙 when 𝑥𝑝 satisfies the formula 𝜙. We refer the reader to [21] for a formal description of the operators' semantics. Any LTLf formula 𝜙 can be translated into an equivalent Deterministic Finite Automaton (DFA) 𝐴𝜙 = (Σ𝜙, 𝑄𝜙, 𝑞0, 𝛿𝑡,𝜙, 𝐹𝜙) [10], where Σ𝜙 is the automaton alphabet, 𝑄𝜙 is the set of states, 𝑞0 ∈ 𝑄𝜙 is the initial state, 𝛿𝑡,𝜙 ∶ 𝑄𝜙 × Σ𝜙 → 𝑄𝜙 is the transition function and 𝐹𝜙 ⊆ 𝑄𝜙 is the set of final states. Σ𝜙 is equal to 𝑃 or 2^𝑃, depending on whether the Declare assumption is upheld. Thus, in our case, Σ𝜙 = 𝑃. We define the transition function over strings 𝛿∗_{𝑡,𝜙} ∶ 𝑄𝜙 × Σ∗𝜙 → 𝑄𝜙 recursively as

𝛿∗_{𝑡,𝜙}(𝑞, 𝜖) = 𝑞        𝛿∗_{𝑡,𝜙}(𝑞, 𝑝 + 𝑥𝑝) = 𝛿∗_{𝑡,𝜙}(𝛿_{𝑡,𝜙}(𝑞, 𝑝), 𝑥𝑝)     (2)

where 𝑝 ∈ 𝑃 is a symbol, 𝜖 is the empty trace, 𝑥𝑝 ∈ Σ∗𝜙 is a trace, and 𝑝 + 𝑥𝑝 is the concatenation of 𝑝 and 𝑥𝑝.
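As an illustration of Equation 2, the following minimal sketch runs a DFA over a finite trace under the Declare assumption and checks acceptance. The dictionary-based DFA encoding and the function names are hypothetical, chosen only for exposition; they are not taken from the paper's implementation.

```python
# Minimal sketch: a DFA as a plain dictionary and the extended transition
# function delta* of Equation 2. The encoding is hypothetical, chosen only
# to illustrate the recursion.
dfa_example = {
    "initial": 0,
    "accepting": {2},
    # delta: (state, symbol) -> state
    "delta": {(0, "P"): 1, (0, "L"): 0, (1, "L"): 2, (1, "P"): 1,
              (2, "P"): 2, (2, "L"): 2},
}

def delta_star(dfa, state, trace):
    """Apply the transition function recursively over a trace (Equation 2)."""
    if not trace:                      # empty trace epsilon: stay in the current state
        return state
    head, *tail = trace
    return delta_star(dfa, dfa["delta"][(state, head)], tail)

def accepts(dfa, trace):
    """A trace satisfies the formula iff it drives the DFA to a final state."""
    return delta_star(dfa, dfa["initial"], trace) in dfa["accepting"]

# "First P, then L" is accepted; "L alone" is not.
print(accepts(dfa_example, ["P", "L"]))   # True
print(accepts(dfa_example, ["L"]))        # False
```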
Let 𝐿(𝐴𝜙) be the language composed of all the strings accepted by 𝐴𝜙; then 𝑥𝑝 ⊨ 𝜙 iff 𝑥𝑝 ∈ 𝐿(𝐴𝜙).

Non-Markovian Reward Decision Processes and Reward Machines In RL the agent-environment interaction is generally modeled as a Markov Decision Process (MDP) [1]. An MDP is a tuple (𝑆, 𝐴, 𝑡, 𝑟, 𝛾), where 𝑆 is the set of environment states, 𝐴 is the set of agent's actions, 𝑡 ∶ 𝑆 × 𝐴 × 𝑆 → [0, 1] is the transition function, 𝑟 ∶ 𝑆 × 𝐴 → ℝ is the reward function, and 𝛾 ∈ [0, 1] is the discount factor expressing the preference for immediate over future reward. In this classical setting, transitions and rewards are assumed to be Markovian, i.e., they are functions of the current state only. A decision process can be non-Markovian because the Markov property does not hold for the reward function 𝑟 ∶ (𝑆 × 𝐴)∗ → ℝ, for the transition function 𝑡 ∶ (𝑆 × 𝐴)∗ × 𝑆 → [0, 1], or for both. In this work we focus on Non-Markovian Reward Decision Processes (NMRDPs) [23]. Learning an optimal policy in such settings is hard, since the current environment outcome depends on the entire history of state-action pairs the agent has explored from the beginning of the episode. Many works have investigated how to construct Markovian state representations of NMRDPs, such as, for example, Reward Machines (RMs). RMs are an automata-based representation of non-Markovian reward functions [24]. Given a finite set of propositions 𝑃 representing abstract properties or events observable in the environment, RMs specify temporally extended rewards over these propositions while exposing the compositional reward structure to the learning agent. Formally, in this work we assume the reward can be represented as a Reward Machine 𝑅𝑀 = (𝑃, 𝑄, 𝑅, 𝑞0, 𝛿𝑡, 𝛿𝑟, 𝐿), where 𝑃 is the automaton alphabet, 𝑄 is the set of automaton states, 𝑅 is a finite set of continuous reward values, 𝑞0 is the initial state, 𝛿𝑡 ∶ 𝑄 × 𝑃 → 𝑄 is the transition function, 𝛿𝑟 ∶ 𝑄 × 𝑃 → 𝑅 is the reward function, and 𝐿 ∶ 𝑆 → 𝑃 is the labelling function, which recognizes symbols in the environment states. Let 𝑥𝑠 = [𝑠^(1), 𝑠^(2), ..., 𝑠^(𝑡)] be the sequence of states the agent has observed in the environment up to the current time instant 𝑡; we define the labeling function over strings 𝐿∗ ∶ 𝑆∗ → 𝑃∗ as 𝐿∗(𝑥𝑠) = [𝐿(𝑠^(1)), 𝐿(𝑠^(2)), ..., 𝐿(𝑠^(𝑡))]. We denote with 𝛿∗𝑡 and 𝛿∗𝑟 the transition function and the reward function over strings, which are defined recursively, analogously to Equation 2. Given 𝑥𝑠, the RM produces a history-dependent reward value at time 𝑡, 𝑟^(𝑡) = 𝛿∗𝑟(𝑞0, 𝐿∗(𝑥𝑠)), and an automaton state at time 𝑡, 𝑞^(𝑡) = 𝛿∗𝑡(𝑞0, 𝐿∗(𝑥𝑠)). The reward value can be used to guide the agent toward the satisfaction of the task expressed by the automaton, while the automaton state can be used to construct a Markovian state representation. In fact, it was proven that the augmented state (𝑠^(𝑡), 𝑞^(𝑡)) is a Markovian state for the task expressed by the RM [3].
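For concreteness, the sketch below tracks an RM alongside the environment to produce the augmented state (𝑠, 𝑞) and the history-dependent reward at each step. The class, attribute names, and dictionary encoding are hypothetical illustrations, not the authors' code.

```python
class RewardMachine:
    """Minimal RM sketch: transition delta_t, reward delta_r, labeling L."""

    def __init__(self, q0, delta_t, delta_r, labeling):
        self.q0 = q0
        self.delta_t = delta_t          # dict: (q, symbol) -> next RM state
        self.delta_r = delta_r          # dict: (q, symbol) -> reward value
        self.labeling = labeling        # function: environment state -> symbol in P
        self.q = q0

    def reset(self):
        """Return to the initial RM state at the start of an episode."""
        self.q = self.q0

    def step(self, env_state):
        """Consume one environment state; return the augmented Markovian
        state (s, q) together with the reward for the transition taken."""
        symbol = self.labeling(env_state)
        reward = self.delta_r[(self.q, symbol)]
        self.q = self.delta_t[(self.q, symbol)]
        return (env_state, self.q), reward
```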
Semantic Similarity Function for Signal Temporal Logic Formulas Signal Temporal Logic (STL) [25] is an extension of LTL to continuous variables and continuous time. STL formulas are used to monitor properties of trajectories. A trajectory is a function 𝜉 ∶ 𝐼 → 𝐷, with a time domain 𝐼 ⊆ ℝ≥0 and a state space 𝐷 ⊆ ℝ𝑛, for some 𝑛 ∈ ℕ. We define the trajectory space 𝒯 as the set of all possible continuous functions over 𝐷. An atomic predicate 𝑝 of STL is a continuous computable predicate on 𝑥 ∈ ℝ𝑛 of the form 𝑓(𝑥1, ..., 𝑥𝑛) ≥ 0. STL has a syntax similar to that of LTL defined in Equation 1, where the semantics of the temporal operators is only slightly modified to take into account the continuity of time and input variables. In particular, STL admits not only the classic Boolean notion of satisfaction, denoted by 𝑠(𝜙, 𝜉) = 1 if 𝜉 satisfies 𝜙 and 0 otherwise, but also a quantitative one, denoted by 𝜌(𝜙, 𝜉), with −1 ≤ 𝜌(𝜙, 𝜉) ≤ 1. This measures the quantitative level of satisfaction of a formula for a given trajectory, evaluating how robust the satisfaction of 𝜙 is with respect to perturbations in the signal [26]. In [7] the authors define a kernel function approximating the semantic distance between two STL formulas 𝜙 and 𝜓 as:

𝑘(𝜙, 𝜓) = ∫_{𝜉∈𝒯} 𝜌(𝜙, 𝜉) 𝜌(𝜓, 𝜉) 𝑑𝜇0(𝜉)     (3)

Note that the integral is defined over the entire trajectory space 𝒯. Integrating over uncountably many trajectories requires putting a finite measure on them, according to which to integrate. Therefore 𝜇0 is a probability measure over trajectories which prefers “simple” trajectories, where the signals do not change too dramatically or, in other words, the total variation is low. Notice that 𝑘(𝜙, 𝜓) depends on the languages of the two formulas, assigning high similarity to pairs of formulas that are satisfied by the same (or similar) set of trajectories. This similarity function will be our starting point to define a transferable semantic representation state-space for NMRDPs.

4. Method
Given a certain temporally extended task expressed as an LTLf formula 𝜙, we aim to define a new representation of the task's temporal states, 𝛼𝜙(𝑞), with 𝑞 ∈ 𝑄𝜙, which can be transferred to new temporal tasks executed in the same environment. We consider the temporal states of the task to be the states of the automaton obtained by translating 𝜙 into the DFA 𝐴𝜙. In particular, after training the agent on a certain set of training formulas Φ𝑡𝑟𝑎𝑖𝑛, we test it on a set of test tasks Φ𝑡𝑒𝑠𝑡 without acquiring any experience for these new tasks, in a zero-shot transfer learning experiment. More formally, we consider NMRDPs constructed as the combination of a transition system (an MDP without reward) and a reward expressed as an RM, 𝑁𝑀𝑅𝐷𝑃 = (𝑆, 𝐴, 𝑡, 𝑅𝑀𝜙, 𝛾). Let 𝑃 be the finite set of symbols observable in the environment via the labeling function 𝐿, and 𝑅𝑀𝜙 = (𝑃𝜙, 𝑄𝜙, 𝑅 = {0, 1}, 𝑞0 = 0, 𝛿𝑡,𝜙, 𝛿𝑟,𝜙, 𝐿), where

𝛿𝑟,𝜙(𝑞) = 1 if 𝑞 ∈ 𝐹𝜙, and 0 otherwise     (4)

and we assume that 𝑃𝜙 ⊆ 𝑃 ∀𝜙 ∈ Φ𝑡𝑟𝑎𝑖𝑛 ∪ Φ𝑡𝑒𝑠𝑡. Notice that: (i) the labeling function is the same for all reward machines, as it depends on the environment rather than the task; (ii) we assume a binary reward function (𝑅 = {0, 1}) for all tasks; in particular, we assume 𝛿𝑟,𝜙 takes the value 1 in all final states of 𝐴𝜙 and 0 in the other states of the automaton. As is customary in non-Markovian-reward RL [4, 27], we condition the policy on both the raw environment state and the temporal state in order to build a Markovian state representation. We therefore train a policy 𝜋 ∶ 𝑆 × 𝑇 → 𝐴, returning an action for the pair (environment state, temporal state), where 𝑇 ⊆ ℝ𝐵 is the semantic temporal space that we will define in the next section.
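A minimal sketch of how the binary reward of Equation 4 and the augmented policy input can be assembled is given below. The helper names and the dictionary DFA encoding (the same hypothetical encoding used in the Background sketch) are illustrative assumptions; the actual agent is a standard policy network trained on such inputs.

```python
import numpy as np

def task_reward(dfa, q):
    """Binary reward of Equation 4: 1 if q is a final state of A_phi, else 0."""
    return 1.0 if q in dfa["accepting"] else 0.0

def policy_input(env_obs, alpha_q):
    """Markovian policy input: environment observation concatenated with the
    B-dimensional Kernel Representation alpha_phi(q) of the temporal state."""
    return np.concatenate([np.asarray(env_obs, dtype=np.float32),
                           np.asarray(alpha_q, dtype=np.float32)])

# Example with the hypothetical DFA encoding used earlier and B = 3.
dfa = {"initial": 0, "accepting": {2}, "delta": {}}
print(task_reward(dfa, 2))                               # 1.0
print(policy_input([1.0, 0.0], [0.1, 0.8, 0.4]).shape)   # (5,)
```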
4.1. Kernel Representation for LTLf Tasks
We aim to define a semantic metric space for mapping the temporal states of LTLf formulas, where the metric measures the semantic distance between two temporal states. More formally, we denote with 𝐴𝜙,𝑞 the automaton generated by the formula 𝜙 with initial state 𝑞, 𝐴𝜙,𝑞 = (𝑃, 𝑄𝜙, 𝑞0 = 𝑞, 𝛿𝑡,𝜙, 𝐹𝜙), and we denote its language by 𝐿𝜙,𝑞. Let 𝑞 and 𝑞′ be two temporal states generated by two formulas 𝜙 and 𝜙′, respectively (which could also be the same formula). Ideally, we want the distance between the two points in 𝑇, 𝛼𝜙(𝑞) and 𝛼𝜙′(𝑞′), to be as small as possible when the languages 𝐿𝜙,𝑞 and 𝐿𝜙′,𝑞′ are similar:

𝑑𝑖𝑠𝑡(𝛼𝜙(𝑞), 𝛼𝜙′(𝑞′)) ≃ 𝑑𝑖𝑠𝑡(𝐿𝜙,𝑞, 𝐿𝜙′,𝑞′)     (5)

The representation space above could be approximated by constructing an embedding space. However, to avoid computing the distance between all combinations of states generated by all considered formulas, we choose a different solution. Specifically, we construct the representation in terms of some formulas that serve as the base of the metric space. Thus, we select a set of 𝐵 LTLf formulas Φ𝑏𝑎𝑠𝑒 = {𝜙𝑏1, 𝜙𝑏2, ..., 𝜙𝑏𝐵}, and we define the Kernel Representation 𝛼𝜙(𝑞) as the continuous vector with 𝐵 components

𝛼𝜙(𝑞) = [𝑑𝑖𝑠𝑡(𝐿_{𝜙𝑏1,𝑞0}, 𝐿𝜙,𝑞), ..., 𝑑𝑖𝑠𝑡(𝐿_{𝜙𝑏𝐵,𝑞0}, 𝐿𝜙,𝑞)]     (6)

The base formulas are fixed before the training phase and must remain the same during the test phase to ensure transferability. Now, let us define the semantic distance between two languages by adapting the similarity function in Equation 3 to our case:

𝑑𝑖𝑠𝑡(𝐿𝜙,𝑞, 𝐿𝜙′,𝑞′) = 1 − 𝑠𝑖𝑚(𝐿𝜙,𝑞, 𝐿𝜙′,𝑞′)
𝑠𝑖𝑚(𝐿𝜙,𝑞, 𝐿𝜙′,𝑞′) = (1 / |𝑃̄|) ∑_{𝑥𝑝 ∈ 𝑃̄ ⊆ 𝑃∗} 𝟙{(𝑥𝑝 ∈ 𝐿𝜙,𝑞 ∩ 𝐿𝜙′,𝑞′) ∨ (𝑥𝑝 ∉ 𝐿𝜙,𝑞 ∪ 𝐿𝜙′,𝑞′)}     (7)

The similarity between two temporal states 𝑞 and 𝑞′, generated respectively by two formulas 𝜙 and 𝜙′, is higher when more traces in the set 𝑃̄ belong to both or to neither of the languages 𝐿𝜙,𝑞 and 𝐿𝜙′,𝑞′. Notice that the similarity value is 1 if the two languages contain exactly the same traces. Moreover, we replace the integral from Equation 3 with a summation, since in our case the traces are discrete in both value and time. For the same reason, we do not weigh the importance of a trace 𝑥𝑝 in the sum by the probability 𝜇0(𝑥𝑝) preferring low-total-variation traces, because this probability function does not favor any of the digital signals we are interested in. Therefore, we decided to limit the summation to a finite set of traces 𝑃̄ ⊆ 𝑃∗, with |𝑃̄| < ∞, and to weigh each trace uniformly. This simplifies the calculation of the similarity, but also makes the similarity function in Equation 7 only an approximation of the true similarity between the two languages, which is more faithful the more 𝑃̄ resembles 𝑃∗. We investigate in the experiments how different choices of 𝑃̄ influence the performance of the semantic representation for RL transfer. Last but not least, note that while the similarity function in Equation 3 is defined between two formulas, our similarity function is defined between two formula-state pairs. This extension is necessary to apply the representation to non-Markovian RL, where knowing the formula alone is not enough: we want to know, at each time instant, the current temporal state of the task.
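Equations 6 and 7 can be sketched as follows, reusing the hypothetical dictionary DFA encoding from the Background sketch. This is an illustration of the definitions over a finite reference trace set 𝑃̄, not the released implementation.

```python
def run_dfa(dfa, q, trace):
    """Run the automaton A_{phi,q} from state q over a trace; return the reached state."""
    for symbol in trace:
        q = dfa["delta"][(q, symbol)]
    return q

def similarity(dfa_a, q_a, dfa_b, q_b, ref_traces):
    """Approximation of sim in Equation 7: the fraction of reference traces on
    which the two languages agree (both accept, or both reject)."""
    agree = sum(
        (run_dfa(dfa_a, q_a, t) in dfa_a["accepting"])
        == (run_dfa(dfa_b, q_b, t) in dfa_b["accepting"])
        for t in ref_traces
    )
    return agree / len(ref_traces)

def kernel_representation(dfa, q, base_dfas, ref_traces):
    """Equation 6: the B-dimensional vector of distances 1 - sim between the
    temporal state (phi, q) and the initial states of the base automata."""
    return [1.0 - similarity(base, base["initial"], dfa, q, ref_traces)
            for base in base_dfas]
```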
In the next section we will discuss how to practically set the hyperparameters of our representation to better transfer the policy between multiple temporal tasks.

Figure 1: (a) Minecraft-like environment for temporally extended navigation tasks. (b) The automata corresponding to two example tasks. T1: “reach the pickaxe (P) and the lava cell (L) in whatever order”; T2: “go to the pickaxe and after that reach the lava”. Note that state 2 of T1 and state 1 of T2 (colored in green) are semantically equivalent temporal states, as are state 3 of T1 and state 2 of T2 (colored in red). (c) Sums of inter-task variation between test and train tasks for different Kernel Representations. (d) Distribution of the intra-task variation for different Kernel Representations among all sets.

4.2. Representation Hyperparameters
Given the definition of our semantic temporal representation for regular tasks (Equations 6 and 7), notice that it depends on two reference choices: (i) the set of reference automata Φ𝑏𝑎𝑠𝑒, and (ii) the set of reference traces 𝑃̄. The selection of these references can significantly impact the quality of the representation. In our investigation, we explore two settings for the representation, each corresponding to a different strategy for sampling reference automata and traces: (i) a training-formulas-independent representation, and (ii) a training-formulas-dependent representation. In the first setting, we operate without any prior knowledge about the environment or the training tasks. Here, we define the set of reference traces 𝑃̄ as the complete set of traces with a maximum length of 𝑙𝑚𝑎𝑥, denoted as 𝑃^𝑙𝑚𝑎𝑥. Note that the number of traces in 𝑃^𝑙𝑚𝑎𝑥 grows exponentially with the maximum length. Therefore, setting a high value for 𝑙𝑚𝑎𝑥 is computationally impractical. However, this remains the best sampling strategy for simple tasks, which can be captured by short traces. For the selection of reference automata, we opt for a random sampling approach, generating 𝐵 random automata as a base, following a method similar to that proposed by [28] to select non-trivial automata of a target size |𝑄|. This first representation benefits from being completely independent of both the chosen tasks and the environment; however, it can struggle to capture complex tasks, because the reference traces may be too short to capture long time dependencies and/or because random automata may be too far from the languages of the training tasks. In the second setting, we leverage knowledge of at least the training formulas Φ𝑡𝑟𝑎𝑖𝑛 to tailor the representation hyperparameters to better suit the temporal states of interest. Despite this, we do not assume any knowledge about the test tasks or the operating environment. For selecting reference traces in this scenario, we adopt the concept of the characteristic set of a DFA from the literature on automata induction [29]. The characteristic set of an automaton 𝐴, denoted as 𝒞(𝐴), is the minimal set of labeled traces necessary to accurately identify the DFA from traces [9]. In particular, we calculate the characteristic set of each training automaton 𝐴𝜙, with 𝜙 ∈ Φ𝑡𝑟𝑎𝑖𝑛, with the algorithm of [30], and we choose reference traces from these sets. In this way we know that each trace in 𝑃̄ is necessary to capture at least one state of one training automaton. Again, since the union of all the characteristic sets of the training tasks may result in a very large set of traces, to reduce the computational cost we randomly sample from this set a certain number of reference traces, rather than considering them all, so that 𝑃̄ ⊆ ⋃_{𝜙∈Φ𝑡𝑟𝑎𝑖𝑛} 𝒞(𝐴𝜙). Regarding the choice of reference automata, we simply select 𝐵 training task formulas at random, Φ𝑏𝑎𝑠𝑒 ⊆ Φ𝑡𝑟𝑎𝑖𝑛. We will show in the next section how these reference selection strategies impact the learning and transfer process.
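The two reference-selection strategies can be sketched as follows; `characteristic_set` is a placeholder for the characteristic-set algorithm cited above, which is not reproduced here, and the function names are hypothetical.

```python
import itertools
import random

def all_traces_up_to(symbols, l_max):
    """Training-independent P_bar: every trace over P with length in [1, l_max].
    With 5 symbols and l_max = 5 this yields 3905 traces, as reported in Section 5."""
    traces = []
    for length in range(1, l_max + 1):
        traces.extend(itertools.product(symbols, repeat=length))
    return traces

def sample_characteristic_traces(train_dfas, num_traces, characteristic_set, seed=0):
    """Training-dependent P_bar: sample from the union of the characteristic sets
    of the training automata. `characteristic_set` is assumed to return the
    characteristic traces of a DFA (placeholder for the cited algorithm)."""
    pool = set()
    for dfa in train_dfas:
        pool.update(characteristic_set(dfa))
    rng = random.Random(seed)
    return rng.sample(sorted(pool), min(num_traces, len(pool)))
```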
5. Experiments
In this section we report some preliminary experiments validating our framework. We publicly provide our code on GitHub at https://github.com/KRLGroup/kernel-representations-transfer/.

Environment We test our approach on temporally extended navigation tasks. The latter are accomplished in a Minecraft-like environment, similar to those considered by [31] and [32]. The environment consists of a grid world containing different items, like the one depicted in Figure 1 (a). The RM input alphabet is composed of a symbol for each different item, plus a symbol indicating the absence of items (empty-cell). Each symbol is considered true when the agent is in the corresponding item cell and false otherwise. The agent has to learn how to navigate the grid in order to satisfy the temporal constraint specified in the task. Each task is expressed in LTLf and then translated into an automaton using the tool LTL2DFA [33]. The reward given to the agent is 100 when it reaches a final DFA state and 0 otherwise.

Task dataset A total of 11000 formulas were used for the transfer learning experiments, divided into: (i) 9000 for training; (ii) 1000 for validation; (iii) 1000 for testing. All 11000 formulas were sampled without repetition from a distribution of partially-ordered tasks, as described in [5]. These tasks consist of multiple sequences of propositions, where the propositions in each sequence must be solved in order, but the sequences can be solved in parallel. An example of this kind of task (specified informally in English) is: “go to the pickaxe (P), the gem (G), and the door (D) in that order, and reach the lava (L) and the gem (G) in that order”, where one valid solution would be to reach lava, pickaxe, gem, door in that order. The number of possible unique tasks that can be generated by this distribution is in the order of 10^39 [5].

Configurations The experiments were performed using 4 different variants of the Kernel Representation and a baseline representation. These differ only in the way the observation over the current state of the DFA is represented, as follows: (KR5) and (KR32) are generated respectively using 5 and 32 randomly sampled DFAs with 10 states, over all possible traces of length in [1, 5]; (KC5) and (KC32) are generated respectively using 5 and 32 DFAs sampled from the training distribution, over a subset of the union of all the characteristic sets of the training set DFAs; (VRM) is the Vanilla Reward Machines baseline representation, i.e. an integer indicating the current state of the DFA, with 0 always being the initial state. For KC5 and KC32, the set of traces used to compute the Kernel Representation was generated as follows: (i) first, the union of all the characteristic sets of all DFAs in the training set was computed; (ii) from this set, we removed all traces that were subtraces of others already contained in the set; (iii) we truncated this set to the first 𝑁 traces such that the sum of their lengths was as close as possible to 20000 characters. This resulted in a set of 𝑁 = 1611 characteristic traces of total length 20008. Note that the other two representations, KR5 and KR32, are computed using trace sets of much greater size, but similar total length. In fact, the number of all possible traces with length in [1, 5] is 5^1 + 5^2 + ⋯ + 5^5 = 3905, with their total length being 5^1 + 2⋅5^2 + ⋯ + 5⋅5^5 = 18555.
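One possible reading of the trace post-processing for KC5/KC32 (steps (ii) and (iii) above) is sketched below; treating a "subtrace" as a contiguous subsequence and greedily filling the roughly 20000-character budget are both assumptions made for illustration, not necessarily the exact procedure used.

```python
def is_subtrace(short, long):
    """True if `short` occurs as a contiguous subsequence of `long`."""
    n, m = len(short), len(long)
    return any(long[i:i + n] == short for i in range(m - n + 1))

def build_kc_trace_set(characteristic_traces, length_budget=20000):
    """Drop traces that are subtraces of longer ones, then keep a prefix of the
    remaining traces whose total length stays within the character budget."""
    traces = sorted(set(characteristic_traces), key=len, reverse=True)
    kept = []
    for t in traces:
        if not any(is_subtrace(t, other) for other in kept):
            kept.append(t)
    selected, total = [], 0
    for t in kept:
        if total + len(t) > length_budget:
            break
        selected.append(t)
        total += len(t)
    return selected, total
```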
5.1. Results
Offline Analysis of Kernel Representations Before delving into the RL and transfer experiments, in this section we analyze the four generated Kernel Representations and their potential informative power. We believe this is especially useful to compare the different hyperparameters used to generate them. In particular, we investigate two desirable properties of approximate Kernel Representations: intra-task variation, i.e. different states of the same DFA should be represented by different vectors, since they have different semantics; and inter-task variation, i.e. states in the test DFAs that have semantics similar to states in a train DFA should be represented by close vectors. While these hold for an ideal Kernel Representation, there is no guarantee for approximations, which motivates the need for this kind of analysis. Finally, notice that ideally we desire representations with both high intra-task variation and low inter-task variation.

Table 1
Percentage of solved formulas and discounted return on the test and training sets, and number of training steps needed to finish the training curriculum. Results are averaged over 5 runs.

Conf.   % Solved (train)   Disc. ret. (train)   % Solved (test)   Disc. ret. (test)   Train Steps (M)
KR5     93.1 ± 1.5         30.1 ± 1.2           92.3 ± 2.2        28.9 ± 1.7          20.2 ± 3.4
KR32    94.3 ± 1.3         31.1 ± 1.4           94.1 ± 1.9        30.1 ± 1.6          22.9 ± 3.5
KC5     93.9 ± 1.9         30.6 ± 1.4           93.9 ± 2.3        29.6 ± 1.6          14.9 ± 0.8
KC32    92.3 ± 2.0         29.6 ± 1.2           92.2 ± 2.3        28.6 ± 1.3          24.9 ± 3.5
VRM     90.5 ± 0.0         28.7 ± 0.0           90.4 ± 0.0        27.5 ± 0.0          32.4 ± 3.6

In our experiments, we define the intra-task variation of a task 𝐴𝜙 as the relative standard deviation (RSD) among the representations of all its states. We also define the inter-task variation of a task 𝐴𝜙 with respect to a set of 𝑘 tasks Γ = {𝐴𝜙1, ..., 𝐴𝜙𝑘} as the sum of the minimum distances between the representations of its states and the representations of all states of the DFAs in Γ, namely as ∑_{𝑞∈𝑄𝜙} min_{𝑞′ ∈ ⋃_{𝑖=1..𝑘} 𝑄𝜙𝑖} 𝑑𝑖𝑠𝑡(𝛼𝜙(𝑞), 𝛼𝜙𝑖(𝑞′)). Results are reported in Figure 1 (c) and (d). These highlight 3 different cases: low variations (KR5), balanced (KR32 and KC5), and high variations (KC32). Thus, representations with extremely good intra-task variation sacrifice inter-task variation, and vice versa. However, extremely low inter-task variation is not beneficial when intra-task variation is also low (KR5), since this could mean that semantically different temporal states collapse into vectors that are too close. Vice versa, an extremely high intra-task variation is not beneficial when inter-task variation is also high (KC32), since it could mean that semantically close temporal states are mapped to distant vectors, hindering transfer learning. Thus, these results suggest that the best representations are those that strike a balance between intra- and inter-task variation (KR32 and KC5), such that they do not sacrifice either of the two properties.
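The two offline metrics can be computed directly from the representation vectors, for example as in the sketch below. It assumes each task is given as the list of Kernel Representation vectors of its DFA states, reads the RSD as the standard deviation of all vector components over their mean, and uses the Euclidean distance between vectors; these are illustrative assumptions, not necessarily the exact conventions used in the experiments.

```python
import numpy as np

def intra_task_variation(task_reps):
    """One reading of the RSD among a task's state representations:
    std of all vector components divided by the (absolute) mean component."""
    reps = np.asarray(task_reps, dtype=np.float64)       # shape: (num_states, B)
    mean = np.abs(reps.mean())
    return float(reps.std() / mean) if mean > 0 else 0.0

def inter_task_variation(test_task_reps, train_tasks_reps):
    """Sum, over the states of a test task, of the minimum (here Euclidean)
    distance to any state representation of the training tasks."""
    train = np.concatenate([np.asarray(r, dtype=np.float64) for r in train_tasks_reps])
    total = 0.0
    for q_rep in np.asarray(test_task_reps, dtype=np.float64):
        total += float(np.min(np.linalg.norm(train - q_rep, axis=1)))
    return total
```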
Transfer Learning Performance To evaluate the transfer learning performance, we train a PPO [34] agent on the training set for each of the configurations above, recording the policy with the best validation performance during training. We use a curriculum-learning training scheme, sorting the tasks in the training set by the number of states of the corresponding DFA as a proxy of task difficulty. Moreover, we train the agent on the same task until either it is solved or a maximum number of episodes has been seen by PPO, before moving on to the next one. After training, we evaluate the zero-shot transfer performance of the policy on the test tasks. Table 1 reports the percentage of training and test tasks solved, the discounted return obtained, and the number of training steps needed to finish the curriculum. KR32 obtains the best results in all 3 performance metrics, with KC5 being a close second, and VRM always obtains the worst results. This suggests that the careful choice of traces and reference automata allowed KC5 to reach results comparable with KR32, a representation that uses a similar total trace length and ∼6 times the number of reference automata, again highlighting the importance of these aspects. Moreover, KC5 took the least amount of training steps to finish the curriculum among all representations, less than half of that required by the VRM baseline. This suggests that a careful choice of traces and references is also highly beneficial for sample efficiency. Finally, these results confirm the findings of the offline representation analysis, in that the representations with extreme variations (KR5 and KC32) obtain worse performance than the balanced representations (KC5 and KR32).

6. Conclusions and Future Work
In this paper we introduce a novel Kernel Representation for the temporal states of LTL tasks, to address the challenge of policy transfer in temporally-extended Reinforcement Learning. We show that conditioning the agent policy on the semantic temporal space is beneficial for transferring the policy to new LTL instructions, even if this space is computed offline and does not leverage any learning module. We investigate how the choice of reference traces and automata impacts the learning and transfer performance and the quality of the representation, defining two metrics (intra- and inter-task variation) to analyze the representation offline. We speculate that representations that strike a balance between these two metrics have better chances of successful transfer to unseen tasks, and confirm this hypothesis in our RL experiments. In the future we would like to develop a strategy to select reference and training tasks so as to maximize the probability of transfer to a broader set of test formulas.

Acknowledgments
This work has been partially supported by PNRR MUR project PE0000013-FAIR.

Declaration on Generative AI
The author(s) have not employed any Generative AI tools.

References
[1] R. S. Sutton, A. G. Barto, Reinforcement Learning: An Introduction, second ed., The MIT Press, 2018. URL: http://incompleteideas.net/book/the-book-2nd.html.
[2] M. L. Littman, U. Topcu, J. Fu, C. L. I. Jr., M. Wen, J. MacGlashan, Environment-independent task specifications via GLTL, CoRR abs/1704.04341 (2017). URL: http://arxiv.org/abs/1704.04341. arXiv:1704.04341.
[3] G. De Giacomo, L. Iocchi, M. Favorito, F. Patrizi, Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications, Proceedings of the International Conference on Automated Planning and Scheduling 29 (2021) 128–136. URL: https://ojs.aaai.org/index.php/ICAPS/article/view/3549.
[4] R. T. Icarte, T. Q. Klassen, R. A. Valenzano, S. A. McIlraith, Reward machines: Exploiting reward function structure in reinforcement learning, J. Artif. Intell. Res. 73 (2022) 173–208. URL: https://doi.org/10.1613/jair.1.12440. doi:10.1613/JAIR.1.12440.
[5] P. Vaezipoor, A. C. Li, R. T. Icarte, S. A. McIlraith, Ltl2action: Generalizing LTL instructions for multi-task RL, in: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 2021, pp. 10497–10508. URL: http://proceedings.mlr.press/v139/vaezipoor21a.html.
[6] Y.-L. Kuo, B. Katz, A. Barbu, Encoding formulas as deep networks: Reinforcement learning for zero-shot execution of ltl formulas, in: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2020, pp. 5604–5610.
[7] L. Bortolussi, G. M. Gallo, J. Kretínský, L. Nenzi, Learning model checking and the kernel trick for signal temporal logic on stochastic processes, in: Tools and Algorithms for the Construction and Analysis of Systems - 28th International Conference, TACAS 2022, Held as Part of the European Joint Conferences on Theory and Practice of Software, ETAPS 2022, Munich, Germany, April 2-7, 2022, Proceedings, Part I, 2022, pp. 281–300. URL: https://doi.org/10.1007/978-3-030-99524-9_15. doi:10.1007/978-3-030-99524-9_15.
[8] G. Saveri, L. Bortolussi, Towards invertible semantic-preserving embeddings of logical formulae, 2023. arXiv:2305.03143.
[9] E. M. Gold, Complexity of automaton identification from given data, Information and Control 37 (1978) 302–320. URL: https://www.sciencedirect.com/science/article/pii/S0019995878905624. doi:10.1016/S0019-9958(78)90562-4.
[10] G. De Giacomo, M. Y. Vardi, Linear temporal logic and linear dynamic logic on finite traces, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, IJCAI ’13, AAAI Press, 2013, pp. 854–860.
[11] M. L. Littman, U. Topcu, J. Fu, C. Isbell, M. Wen, J. MacGlashan, Environment-independent task specifications via gltl, arXiv preprint arXiv:1704.04341 (2017).
[12] A. Camacho, R. T. Icarte, T. Q. Klassen, R. A. Valenzano, S. A. McIlraith, Ltl and beyond: Formal languages for reward function specification in reinforcement learning, in: IJCAI, volume 19, 2019, pp. 6065–6073.
[13] R. T. Icarte, T. Klassen, R. Valenzano, S. McIlraith, Using reward machines for high-level task specification and decomposition in reinforcement learning, in: International Conference on Machine Learning, PMLR, 2018, pp. 2107–2116.
[14] R. Toro Icarte, T. Q. Klassen, R. Valenzano, S. A. McIlraith, Teaching multiple tasks to an RL agent using LTL, in: Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems (AAMAS), 2018, pp. 452–461.
[15] B. G. León, M. Shanahan, F. Belardinelli, Systematic generalisation through task temporal logic and deep reinforcement learning, Adaptive and Learning Agents Workshop at International Conference on Autonomous Agents and Multiagent Systems (2022).
[16] B. G. León, M. Shanahan, F. Belardinelli, In a nutshell, the human asked for this: Latent goals for following temporal specifications, in: International Conference on Learning Representations (ICLR), 2022.
[17] B. Araki, X. Li, K. Vodrahalli, J. Decastro, M. Fry, D. Rus, The logical options framework, in: M. Meila, T. Zhang (Eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, PMLR, 2021, pp. 307–317. URL: https://proceedings.mlr.press/v139/araki21a.html.
[18] J. Andreas, D. Klein, S. Levine, Modular multitask reinforcement learning with policy sketches, in: International Conference on Machine Learning, PMLR, 2017, pp. 166–175.
[19] J. X. Liu, A. Shah, E. Rosen, G. Konidaris, S. Tellex, Skill transfer for temporally-extended task specifications, 2023. arXiv:2206.05096.
[20] P. Vaezipoor, A. C. Li, R. A. T. Icarte, S. A. Mcilraith, LTL2action: Generalizing LTL instructions for multi-task RL, in: International Conference on Machine Learning, PMLR, 2021, pp. 10497–10508.
[21] A. Pnueli, The temporal logic of programs, in: 18th Annual Symposium on Foundations of Computer Science, Providence, Rhode Island, USA, 31 October - 1 November 1977, IEEE Computer Society, 1977, pp. 46–57. URL: https://doi.org/10.1109/SFCS.1977.32. doi:10.1109/SFCS.1977.32.
[22] G. De Giacomo, R. De Masellis, M. Montali, Reasoning on ltl on finite traces: Insensitivity to infiniteness, Proceedings of the AAAI Conference on Artificial Intelligence 28 (2014). URL: https://ojs.aaai.org/index.php/AAAI/article/view/8872. doi:10.1609/aaai.v28i1.8872.
[23] F. Bacchus, C. Boutilier, A. Grove, Rewarding behaviors, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), Portland, OR, 1996, pp. 1160–1167.
[24] A. Camacho, R. Toro Icarte, T. Q. Klassen, R. Valenzano, S. A. McIlraith, Ltl and beyond: Formal languages for reward function specification in reinforcement learning, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 6065–6073. URL: https://doi.org/10.24963/ijcai.2019/840. doi:10.24963/ijcai.2019/840.
[25] O. Maler, D. Nickovic, Monitoring temporal properties of continuous signals, in: Formal Techniques, Modelling and Analysis of Timed and Fault-Tolerant Systems, Joint International Conferences on Formal Modelling and Analysis of Timed Systems, FORMATS 2004 and Formal Techniques in Real-Time and Fault-Tolerant Systems, FTRTFT 2004, Grenoble, France, September 22-24, 2004, Proceedings, 2004, pp. 152–166. URL: https://doi.org/10.1007/978-3-540-30206-3_12. doi:10.1007/978-3-540-30206-3_12.
[26] A. Donzé, T. Ferrère, O. Maler, Efficient robust monitoring for STL, in: Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013, Proceedings, 2013, pp. 264–279. URL: https://doi.org/10.1007/978-3-642-39799-8_19. doi:10.1007/978-3-642-39799-8_19.
[27] G. D. Giacomo, L. Iocchi, M. Favorito, F. Patrizi, Foundations for restraining bolts: Reinforcement learning with ltlf/ldlf restraining specifications, 2019. arXiv:1807.06333.
[28] I. Zakirzyanov, A. A. Shalyto, V. I. Ulyantsev, Finding all minimum-size dfa consistent with given examples: Sat-based approach, in: SEFM Workshops, 2017.
[29] C. de la Higuera, A bibliographical study of grammatical inference, Pattern Recognition 38 (2005) 1332–1348. URL: https://www.sciencedirect.com/science/article/pii/S0031320305000221. doi:10.1016/j.patcog.2005.01.003.
[30] P. García, D. López, M. Vázquez de Parga, Polynomial characteristic sets for dfa identification, Theoretical Computer Science 448 (2012) 41–46. URL: https://www.sciencedirect.com/science/article/pii/S0304397512004070. doi:10.1016/j.tcs.2012.04.042.
[31] J. Corazza, I. Gavran, D. Neider, Reinforcement learning with stochastic reward machines, Proceedings of the AAAI Conference on Artificial Intelligence 36 (2022) 6429–6436. URL: https://ojs.aaai.org/index.php/AAAI/article/view/20594. doi:10.1609/aaai.v36i6.20594.
[32] A. C. Li, Z. Chen, P. Vaezipoor, T. Q. Klassen, R. T. Icarte, S. A. McIlraith, Noisy symbolic abstractions for deep RL: A case study with reward machines, CoRR abs/2211.10902 (2022). URL: https://doi.org/10.48550/arXiv.2211.10902. doi:10.48550/arXiv.2211.10902. arXiv:2211.10902.
[33] F. Fuggitti, Ltlf2dfa, 2019. doi:10.5281/zenodo.3888410.
[34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O.
Klimov, Proximal policy optimization algorithms, ArXiv abs/1707.06347 (2017). URL: https://api.semanticscholar.org/CorpusID:28695052.