Enhancing Deep Sequence Generation with Logical
                                Temporal Knowledge
                                Elena Umili1,∗ , Gabriel Paludo Licks1 and Fabio Patrizi1
                                1
                                    Sapienza University of Rome


                                              Abstract
                                              Despite significant advancements in deep learning for sequence forecasting, neural models are typically
                                              trained only on data, and the incorporation of high-level prior logical knowledge in their training is still
                                              an hard challenge. This limitation hinders the exploitation of background knowledge, such as common
                                              sense or domain-specific information, in predictive tasks performed by neural networks. In this work,
                                              we propose a principled approach to integrate prior knowledge in Linear Temporal Logic over finite
                                              traces (LTL𝑓 ) into deep autoregressive models for multistep symbolic sequence generation (i.e., suffix
                                              prediction) at training time. Our method involves representing logical knowledge through continuous
                                              probabilistic relaxations and employing a differentiable schedule for sampling the next symbol from
                                              the network. We test our approach on synthetic datasets based on background knowledge in Declare,
                                              inspired by Business Process Management (BPM) applications. The results demonstrate that our method
                                              consistently improves the performance of the neural predictor, achieving lower Damerau-Levenshtein
                                              (DL) distances from target sequences and higher satisfaction rates of the logical knowledge compared to
                                              models trained solely on data.

                                              Keywords
                                              Suffix prediction, Neurosymbolic AI, Deep learning with logical knowledge, Linear Temporal Logic


                                1. Introduction
                                This paper addresses the problem of suffix prediction by exploiting both data and prior logic
                                temporal knowledge. The task of suffix prediction is particularly important in Business Process
                                Management (BPM) for forecasting the future of a process trace, enabling resource allocation,
                                and the anticipation of future steps in a process based on historical data.
                                   Recently, there has been significant interest in employing deep learning techniques for suffix
                                prediction in BPM [1], involving the use of Recurrent Neural Networks (RNNs) [2], Transformers
                                [3], and Deep Reinforcement Learning algorithms [4]. Despite the significant advancements
                                in machine learning and deep learning, some studies show how deep models can surprisingly
                                fail to satisfy even the most basic logical constraints [5] [6] derived from commonsense or
                                additional domain-knowledge. This occurs because these models are trained solely on data,
                                with the integration of logical knowledge into the training process remaining an open challenge.
                                Due to this, while in many application fields, such as BPM, both data and logical knowledge
                                about the process are often available, the latter remain typically unused. This work explores a

                                PMAI@ECAI24: International ECAI Workshop on Process Management in the AI era, October 19, 2024, Santiago De
                                Compostela, Spain
                                ∗
                                    Corresponding author.
                                Envelope-Open umili@diag.uniroma1.it (E. Umili); licks@diag.uniroma1.it (G. P. Licks); patrizi@diag.uniroma1.it (F. Patrizi)
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
novel approach, aimed at bridging this gap, by taking advantage of prior knowledge expressed
in Linear Temporal Logic over finite traces (LTL𝑓 ), when training a deep generative model for
suffix prediction.
   Very few methods incorporate logical knowledge into deep sequence generation [6, 7].
Specifically, STLnet [6] employs specifications in Signal Temporal Logic on continuous temporal
sequences for applications other than BPM, however, the attempt to apply it to our domains
produced poor results. On the other hand, Di Francescomarino et al. [7] apply their method
to suffix prediction in BPM but use background knowledge only at test time, employing a
variant of beam search that also takes into account the temporal formula modeling some process
properties. Thus, the proposed method does not fully integrate background knowledge into
the training process but, rather, a smart way to query the pre-trained network, considering the
LTL𝑓 specification. Interestingly, though, this approach can be seamlessly combined with ours,
to further maximize constraint satisfaction.
   Our contribution is a method to integrate the knowledge of certain process properties,
expressed in LTL over finite traces (LTL𝑓 ), into the training process of a sequence neural
predictor. This enables us to leverage both these sources of information –data and background
LTL𝑓 knowledge– at training time. Specifically, we achieve this by defining a differentiable
counterpart of the LTL𝑓 knowledge and leveraging a differentiable sampling process known as
Gumbel-Softmax trick [8]. By utilizing these two elements, we define a logical loss function
that can be used in conjunction with any other loss employed by the predictor. This ensures
that the network learns to generate traces similar to those seen in the training dataset and to
satisfy the given temporal specifications at the same time. Through preliminary experiments on
synthetic data, we show that incorporating our loss function at training-time results, for RNN
models, in predicted suffixes that feature both a lower Damerau-Levenshtein (DL) distance [9]
from the target suffixes and a higher rate of satisfaction of the LTL𝑓 constraints.
   We observe that, although grounded here on a typical BPM task, our approach can be applied
to any scenario involving multi-step symbolic sequence generation through deep learning
models, thus making the main results of this work potentially of general interest to the wider
audience focusing on Machine Learning and, in particular, neurosymbolic AI.


2. Related Work
Recently, there has been significant interest in employing deep Neural Networks (NN) in
predictive monitoring of business processes, for tasks such as next activity prediction, suffix
prediction, and attribute prediction [1]. Despite significant advances in the field, nearly all works
rely on training these models solely on data without utilizing any formal prior knowledge about
the process. They mainly focus on two aspects: (i) enhancing the neural model, ranging from
RNNs [10, 11, 7, 12], Convolutional NN (CNN) [13], Generative Adversarial Networks (GANs) [14,
12], Autoencoders [12], and Transformers [12]; and (ii) wisely choosing the sampling technique
to query the network at test time to generate the suffix, mostly using greedy search [10], random
search [11], or beam search [7], and more recently, policies trained with Reinforcement Learning
(RL) [15, 4]. Among all these works, only one exploits prior process knowledge [7], expressed as
a set of LTL formulas, but it uses this knowledge only at test time, modifying the beam search
sampling algorithm to select potentially compliant traces with the background knowledge.
   In this work, we take a radically different approach by introducing a principled way to
integrate background knowledge in LTL𝑓 with a deep NN model for suffix prediction at training
time. This is based on defining a logical loss that can be combined with the loss of any
autoregressive neural model and any sampling technique at test time, that draws inspiration
from the literature in Neurosymbolic (NeSy) AI [16]. In this field, many prior works focus on
exploiting temporal logical knowledge in deep learning tasks, but none have been used for
multi-step symbolic sequence generation. T-leaf [17] creates a semantic embedding space to
represent both formulas and traces and uses it in tasks such as sequential action recognition
and imitation learning, which do not involve multi-step prediction. [18] modifies Logic Tensor
Networks (LTN) [19] to integrate LTL𝑓 background knowledge in image sequence classification
tasks rather than generative tasks. STLnet [6] adopts a student-teacher training scheme where
the student network proposes a suffix based on the data, that is corrected by the teacher network
to satisfy the formula. This work uses Signal Temporal Logic (STL) formulas and is applied to
continuous trajectories rather than discrete traces. Our attempts to apply it to discrete data
and LTL𝑓 formulas translated into STL yielded poor results, as the resulting STL formulas were
extremely challenging for the framework to handle.
   Our work is the first to integrate temporal knowledge in the generation of multi-step symbolic
sequences. It is based on encoding LTL𝑓 formulas using a matrix representation that we
previously used for very different tasks –such as learning RL policies for non-Markovian tasks
[20] and inducing automata from a set of labeled traces [21]– that we adapt here for use in the
generative task of suffix prediction.


3. Background and Notation
Linear Temporal Logic and Deterministic Finite Automata Linear Temporal Logic (LTL)
[22] is a language which extends traditional propositional logic with modal operators. With the
latter we can specify rules that must hold through time. In this work we use LTL interpreted
over finite traces (LTL𝑓 ) [23], which model finite, but length-unbounded process executions,
and is thus adequate for finite-horizon problems. Given a finite set 𝑃 of atomic propositions,
the set of LTL𝑓 formulas 𝜙 is inductively defined as follows:

                              𝜙 ∶∶= ⊤ ∣ ⊥ ∣ 𝑝 ∣ ¬𝜙 ∣ 𝜙 ∧ 𝜙 ∣ 𝑋 𝜙 ∣ 𝜙𝑈 𝜙,                            (1)

where 𝑝 ∈ 𝑃. We use ⊤ and ⊥ to denote true and false respectively. 𝑋 (Strong Next) and 𝑈
(Until) are temporal operators. Other temporal operators are: 𝑁 (Weak Next) and 𝑅 (Release)
respectively, defined as 𝑁 𝜙 ≡ ¬𝑋 ¬𝜙 and 𝜙1 𝑅𝜙2 ≡ ¬(¬𝜙1 𝑈 ¬𝜙2 ); 𝐺 (globally) 𝐺𝜙 ≡ ⊥𝑅𝜙 and 𝐹
(eventually) 𝐹 𝜙 ≡ ⊤𝑈 𝜙. A trace 𝑥 = [𝑥 (1) , 𝑥 (2) , … , 𝑥 (𝑙) ] is a sequence of propositional assign-
ments to the propositions in 𝑃, where 𝑥 (𝑡) ⊆ 𝑃 is the set of all and only propositions that are true
at instant 𝑡. Additionally, |𝑥| = 𝑙 represents the length of 𝑥. Since every trace is finite, |𝑥| < ∞ and
𝑥 ∈ (2𝑃 )∗ . If the propositional symbols in 𝑃 are all mutually exclusive, e.g. the domain produces
one and only one symbol true at each step, we have that 𝑥 (𝑡) ∈ 𝑃. As customary in BPM, we
make this assumption, known as the Declare assumption [24]. By 𝑥 ⊨ 𝜙 we denote that the trace
𝑥 satisfies the LTL𝑓 formula 𝜙. We refer the reader to [23] for a formal description of the LTL𝑓
semantics. Any LTL𝑓 formula 𝜙 can be translated into a Deterministic Finite Automaton (DFA)
[23] 𝐴𝜙 = (Σ, 𝑄, 𝑞0 , 𝛿, 𝐹 ), with Σ the automaton alphabet, 𝑄 the finite set of states, 𝑞0 ∈ 𝑄 the
initial state, 𝛿 ∶ 𝑄 × Σ → 𝑄 the transition function and 𝐹 ⊆ 𝑄 the set of final states, s.t. for a
trace 𝑥 ∈ 2𝑃 , we have that 𝑥 ∈ 𝐿(𝐴𝜙 ) iff 𝑥 ⊨ 𝜙, where 𝐿(𝐴𝜙 ) denotes the language (of words)
accepted by 𝐴𝜙 . Depending on whether the Declare assumption holds or not, Σ can be 𝑃 or 2𝑃 ,
respectively. In our case, the former holds.

Deep Autoregressive Models and Suffix Prediction Deep autoregressive models are a
class of deep learning models that automatically predict the next component in a sequence by
using the previous elements in the sequence as inputs. These models can be applied to both
continuous and categorical (symbolic) data, finding applications in various generative AI tasks
such as Natural Language Processing (NLP) and Large Language Models (LLM) [25, 26], image
synthesis [27, 28], and time-series prediction [29]. They encompass deep architectures such as
RNNs and Transformers and, in general, any neural model capable of estimating the probability

                                 𝑝(𝑥 (𝑡) ∣ 𝑥 (1) , 𝑥 (2) , … , 𝑥 (𝑡−1) ) = 𝑝(𝑥 (𝑡) ∣ 𝑥 <𝑡 )               (2)

where 𝑥 = [𝑥 (1) , 𝑥 (2) , … , 𝑥 (𝑙) ] is a sequence of data. Its probability can be calculated as
                                                          𝑙
                                             𝑝(𝑥) = ∏ 𝑝(𝑥 (𝑖) ∣ 𝑥 (<𝑖) )                                  (3)
                                                        𝑖=1

In suffix prediction in BPM, given a subsequence (or prefix) of past events 𝑝𝑡 = [𝑒 (1) , 𝑒 (2) , … , 𝑒 (𝑡) ]
that the process has produced up to the current time step 𝑡, with 𝑒 (𝑖) in a finite set of activities
𝒜, we aim to complete the trace by generating the sequence of future events, called also suffix
𝑠𝑡 = [𝑒 ̃(𝑡+1) , 𝑒 ̃(𝑡+2) , … ]. Note that we denote ground-truth events observed in the domain as 𝑒
and predicted events as 𝑒,̃ that may in general deviate from actual future events. This task can
also be accomplished using autoregressive models, by choosing at each step the most probable
next event according to the neural network, concatenating it with the prefix, and continuing
to predict the next event in this manner until a special end-of-trace symbol is predicted or the
trace has reached a maximum number of steps 𝑇. We denote the trace obtained in this way as
𝜎̃ = 𝑝𝑡 + 𝑠𝑡 , with:

               𝑒 ̃(𝑘) = argmax 𝑝(𝑒 ̃(𝑘) = 𝑎 ∣ 𝑒 (1) , … , 𝑒 (𝑡) , 𝑒 ̃(𝑡+1) , … , 𝑒 ̃(𝑘−1) )   𝑡<𝑘≤𝑇       (4)
                           𝑎∈𝒜

This method of generating the suffix is a greedy search strategy and generally may not produce
the most probable suffix, e.g., the trace that maximizes the probability in Equation 3. Other
non-optimal sampling strategies commonly used for this task include Beam Search, Random
Sampling, and Temperature Sampling [4].


4. Method
Our method assumes an autoregressive neural model to estimate the probability of the next
event given a trace of previous events (Equation 2). For explanation purposes, we consider a
simple RNN-based next-activity predictor, where:

                                           ℎ(𝑡) = 𝑓 (𝑒 (𝑡) , ℎ(𝑡−1) ; 𝜃𝑓 )
                                                                                                      (5)
                                           𝑦 (𝑡) = 𝑔(ℎ(𝑡) ; 𝜃𝑔 )

where 𝑒 (𝑡) ∈ [0, 1]|𝒜 | is the event of the trace at time 𝑡 encoded as one-hot vector. ℎ(𝑡) ∈ ℝ𝑑ℎ is the
hidden state of the RNN, a continuous vector representation of past events in the trace having
size 𝑑ℎ . 𝑓 ∶ [0, 1]|𝒜 | × ℝ𝑑ℎ → ℝ𝑑ℎ is a parametric recursive function calculating the current
hidden state ℎ(𝑡) given the previous one ℎ(𝑡−1) and the current event 𝑒 (𝑡) as input. We denote its
learnable parameters as 𝜃𝑓 . Finally 𝑔 ∶ ℝ𝑑ℎ → [0, 1]|𝒜 | is the learnable function, with parameters
𝜃𝑔 , mapping the hidden state to the probability vector of the next event 𝑦 (𝑡) . Function 𝑓 and 𝑔
can be implemented by an RNN and a multi-layer perceptron (MLP), respectively, as shown in
Figure 1. The model parameters are usually trained with a supervised loss 𝐿𝒟 evaluated on a
dataset of 𝑁 ground truth traces 𝒟 = {𝜎1 , 𝜎2 , … , 𝜎𝑁 } recorded observing the process. The loss
formulation for a trace 𝜎𝑘 ∈ 𝒟 of length 𝑙 is as follows:
                                                  𝑙
                                               1
                        𝐿𝒟 (𝜎𝑘 , 𝜃𝑓 , 𝜃𝑔 ) =     ∑ crossentropy(𝑔(ℎ(𝜎𝑘≤𝑡 )), 𝜎𝑘𝑡+1 )                  (6)
                                               𝑙 𝑡=1

This loss teaches the network to produce the next symbol in the trace to mimic as closely as
possible the data contained in the dataset. In our method, this loss is combined with a logic loss
𝐿𝜙 , which imposes the satisfaction of prior knowledge in the following way

                                           𝐿 = 𝛼𝐿𝒟 + (1 − 𝛼)𝐿𝜙                                        (7)

where 𝛼 is a constant chosen between 0 and 1 balancing the influence of each loss on the training
process.

4.1. Logic Loss Calculation
We assume known certain process properties that need to be enforced at training time. Such
properties, which constitute the background (or prior) knowledge we have about the process,
are expressed as an LTL𝑓 formula 𝜙 and constitute the basis for defining the logic loss 𝐿𝜙 .
   Essentially, 𝐿𝜙 estimates the probability that the suffix generated by the RNN satisfies 𝜙 and
is enforced to be as close as possible to 1. Observe that, while the supervised loss in Equation
6 takes into account only the next-step prediction, the satisfaction of 𝜙 must be checked over
the entire predicted trace, which is produced by querying the network multiple times. This
is because compliance of the trace to 𝜙 can be evaluated only one the complete trace. Indeed,
the fact that the predicted partial trace 𝜎̃ satisfies (or violates) 𝜙 before termination does not
preclude the entire predicted trace from violating (or satisfying) 𝜙.
   Additionally, notice that the next-activity predictor is trained with perfectly one-hot (or
symbolic) data and produces a (continuous) probability vector as output, which may be very
different from a one-hot vector. Therefore, we cannot directly feed the network with the
probabilities predicted in previous steps, as this may generate unexpected outputs. Instead,
we need to sample from these probabilities, in order to produce one-hot vectors again. At
Figure 1: Logic loss computation by using a differentiable procedure for both suffix generation and
formula evaluation. The violet arrows represents connections that are interested by backpropagation of
the loss, and the elements highlighted with a red border are the modules whose parameters are affected
by the logic loss.


the same time, this sampling process needs to guarantee differentiability, so as to ensure
backpropagation’s applicability.
  The calculation of the logic loss is exemplified in the following steps:
   1. given a prefix 𝑝𝑡 , generate a suffix 𝑠𝑡 that is at the same time: (i) very probable according
      to the network; (ii) differentiable; and (iii) at least nearly symbolic, i.e., as close as possible
      to a one-hot vector;
   2. calculate the probability that the predicted trace 𝜎̃ = 𝑝𝑡 + 𝑠𝑡 satisfies 𝜙 through a differen-
      tiable procedure;
   3. enforce maximization of the obtained probability.

   Given the differentiability of both the knowledge evaluator and the sampling process, the
loss can be backpropagated through the suffix and influence the parameters 𝜃𝑓 and 𝜃𝑔 of the
next activity predictor.

Differentiable Suffix Generation To sample the next activity 𝑒 ̃(𝑡+1) from the probability
distribution returned by the neural network 𝑦 (𝑡) , while ensuring differentiability of the output
sample, we use the Gumbel-Softmax reparameterization trick [8, 30]:
                                                         log(𝑦 (𝑡) ) + 𝐺
                                  𝑒 ̃(𝑡+1) = softmax (                   )                            (8)
                                                                𝜏
where 𝐺 is a random sample drawn from the Gumbel distribution, and 𝜏 is a temperature
parameter used to interpolate between discrete one-hot-encoded categorical distributions and
continuous categorical probabilities. When 𝜏 → 0, the output samples tend to be one-hot, while
they tend to the continuous probabilities in 𝑦 (𝑡) when 𝜏 = 1.
Differentiable LTL𝑓 Evaluation In order to evaluate the LTL𝑓 specification in a differentiable
way, we represent 𝜙 through a matrix representation that is generally used to encode Probabilistic
Finite Automata (PFA) [21, 31, 20]. We first translate 𝜙 into the equivalent DFA 𝐴𝜙 [32] and
simplify the automaton with the Declare assumption. We then represent the automaton 𝐴𝜙 =
(𝑃, 𝑄, 𝑞0 , 𝛿, 𝐹 ) as an input vector 𝑣0 ∈ [0, 1]1×|𝑄| , a transition matrix 𝑀𝛿 ∈ [0, 1]|𝑃|×|𝑄|×|𝑄| , and
a finality vector 𝑣𝐹 ∈ [0, 1]|𝑄|×1 . The input vector 𝑣0 takes the value 1 at index 𝑞 if 𝑞 = 𝑞0 ,
and 0 otherwise. The matrix 𝑀𝛿 takes the value 1 at index (𝑝, 𝑞, 𝑞 ′ ) if 𝛿(𝑞, 𝑝) = 𝑞 ′ , and 0
otherwise. We denote 𝑀𝛿 [𝑝] ∈ [0, 1]|𝑄|×|𝑄| as the 2D transition matrix for symbol 𝑝. The output
vector 𝑣𝐹 has a 1 at position 𝑞 if 𝑞 ∈ 𝐹, and 0 otherwise. Given a string 𝑥 = [𝑥 (1) 𝑥 (2) … 𝑥 (𝑙) ],
                       (0) (1)     (𝑙)
we denote 𝑞𝑝 = [𝑞𝑝 , 𝑞𝑝 , … , 𝑞𝑝 ] as the sequence of probabilities to visit a certain state, and
        (0)   (1)    (𝑙)                                                                 (𝑡)
𝑎𝑝 = [𝑎𝑝 , 𝑎𝑝 , … , 𝑎𝑝 ] as the probabilities that the sequence is accepted. Here, 𝑞𝑝 ∈ [0, 1]1×|𝑄|
is a row vector containing at position 𝑞 the probability of being in state 𝑞 at time 𝑡, while
 (𝑡)
𝑎𝑝 ∈ [0, 1] is a scalar representing the probability that the sequence is accepted at time 𝑡.

                                      (0)
                                     𝑞𝑝 = 𝑣0
                                      (𝑡)  (𝑡−1)
                                     𝑞𝑝 = 𝑞𝑝     × 𝑀𝛿 [𝑥 (𝑡) ]       ∀𝑡 > 0                          (9)
                                      (𝑡)  (𝑡)
                                     𝑎𝑝 = 𝑞𝑝 × 𝑣𝐹

Here, × denotes the inner product. Therefore, the probability of a string being accepted is the
                                                                  (𝑙)
probability of being in a final state in the last computed state 𝑎𝑝 . Note that 𝑥 (𝑡) ∈ 𝑃 is an integer,
while the predicted symbol sampled by the network 𝑒 ̃(𝑡) (Equation 8) is a probability vector,
which tends to approximate a one-hot vector but is not guaranteed to be composed only of
zeros and ones and may contain some noise. For this reason, we adapt Equation 9 to our case
by adding the expectation computation over the next automaton state.
                                       |𝑃|
                               (𝑡)           (𝑡−1)
                              𝑞𝑝 = ∑ (𝑞𝑝             × 𝑀𝛿 [𝑝])𝑒 ̃(𝑡) [𝑝]   ∀𝑡 > 0
                                   𝑝=1                                                              (10)
                               (𝑡)  (𝑡)
                              𝑎𝑝 = 𝑞𝑝 × 𝑣𝐹

Enforcing Compliance with the Formula In summary, given a prefix 𝑝𝑡 = [𝑒 (1) , 𝑒 (2) , … , 𝑒 (𝑡) ],
we calculate the suffix 𝑠𝑡 = [𝑒 ̃(𝑡+1) , … , 𝑒 ̃(𝑇 ) ] up to a time-step 𝑇 > 𝑡 by iterating through the
following steps at each time step: (i) producing the RNN probability for the next symbol (Eq. 5),
(ii) differentiably sampling the next activity from this probability (Eq. 8), and (iii) concatenating
the next activity to the current trace. The trace generated in this way, 𝜎̃ = 𝑝 + 𝑠, is processed by
the differentiable automaton (Eq. 10), which produces an acceptance probability at the last step,
  (𝑇 )
𝑎𝑝 . We enforce this probability to stay close to 1 by minimizing the loss

                                                              (𝑇 )
                                             𝐿𝜙 = − log(𝑎𝑝 ).                                       (11)

The entire process is illustrated in Figure 1. The figure shows how the loss is backpropagated
through both the suffix prediction and knowledge evaluation processes, affecting the parameters
of the learnable functions 𝑓 and 𝑔 of the next activity predictor.
5. Experiments
In this section, we describe the setup for our method’s empirical evaluation and discuss the
results achieved. The experiments are reproducible using our implementation at https://github.
com/whitemech/suffix-prediction-pmai2024.

5.1. Experimental Setup
We aim to evaluate our method on a diverse set of formulas with varying levels of difficulty. To
achieve this, we construct random formulas by combining Declare constraints [33] as follows:
(i) we set the maximum number of conjuncts 𝐶 and the maximum number of disjuncts 𝐷;
(ii) we generate a random formula 𝜙 as the disjunction of up to 𝐷 disjuncts, each of them is
the conjunction of up to 𝐶 randomly selected Declare constraints, possibly negated, with each
constraint having random symbols from 𝒜; (iii) we evaluate 𝜙 on a set of random strings and
select the formula for the experiment only if the satisfaction rate is between 10% and 90%, in
order to eliminate formulas that are too obvious or too difficult.
    For each random formula, we construct a dataset 𝒟 composed of 𝑁 traces of length 𝑙 that are
compliant with the formula. We then split the dataset into 90% for training and 10% for testing.
We generate 25 dataset-formula pairs by setting 𝐷 = 𝐶 = 5, 𝒜 = {𝑎, 𝑏, 𝑐}, 𝑁 = 1000, and 𝑙 = 20.
These pairs are used to evaluate our framework Additionally, each experiment’s training and
testing are performed on different prefix lengths |𝑝| ∈ 5, 10, 15 to evaluate the model’s ability
to predict shorter and longer suffixes. During testing, the suffix prediction stops when the
model outputs the end-of-trace symbol or when the predicted suffix reaches a maximum of
𝑇 = 2 × 𝑙 = 40 timesteps. Each experiment is repeated 5 times with different seeds to ensure
robustness and reliability of the results.

5.2. Empirical Results
Our empirical evaluation compares the performance of four models:
    • RNN (random): An RNN trained on the training data, queried with random sampling.
      The next activity is selected by randomly sampling according to the probability distribu-
      tion predicted by the RNN output layer.

    • RNN+LTL (random): An RNN trained with the combined loss in Eq 7, where 𝐿𝜙 is
      calculated using the random formula associated with the dataset, and queried with
      random sampling.

    • RNN (greedy): An RNN trained only with data, with suffixes generated using greedy
      sampling (Eq. 4), always taking the most probable next event predicted by the output
      layer. This model is equivalent to the one proposed in [2].

    • RNN+LTL (greedy): Similar to the previous model but with the addition of LTL back-
      ground knowledge during training.
  We instantiate all models as LSTMs with two layers and 100 neurons in each layer (as in
[10]), a batch size of 64, and an Adam optimizer with a learning rate of 5 × 10−4 . The models
                                  Accuracy % by prefix length                                                                   Satisfiability % by prefix length                                                         DL distance by prefix length
              50                                                                                           100                                                                                        30
                   RNN (random)   RNN+LTL (random)    RNN (greedy)   RNN+LTL (greedy)                            RNN (random)     RNN+LTL (random)    RNN (greedy)   RNN+LTL (greedy)                      RNN (random)   RNN+LTL (random)    RNN (greedy)   RNN+LTL (greedy)


              40                                                                                           80                                                                                         25

                                                                                                                                                                                                      20
              30                                                                                           60


                                                                                        Satisfiability %


                                                                                                                                                                                        DL distance
 Accuracy %


                                                                                                                                                                                                      15
              20                                                                                           40
                                                                                                                                                                                                      10

              10                                                                                           20
                                                                                                                                                                                                      5

              0                                                                                             0                                                                                         0
                         5                      10                      15                                             5                        10                      15                                       5                      10                      15
                                           Prefix length                                                                                   Prefix length                                                                           Prefix length


              (a) Next activity pred. acc. %                                                                     (b) Satisfaction rate %                                                                             (c) DL distance

Figure 2: Empirical results.


that incorporate LTL background knowledge use 𝛼 set to 0.6 (Eq. 7), 𝜏 equal to 0.5 (Eq. 8), and 𝑇
equal to 2 × 𝑙 = 40 timesteps (Eq. 11).
   The plots in Figure 2 show the average and standard deviation of the results obtained from
running experiments on all 25 datasets. The evaluated metrics are the model’s test accuracy
on the next activity prediction (Figure 2a), the satisfaction rate of test suffixes (Figure 2b), and
the Damerau-Levenshtein (DL) distance (Figure 2c) of the predicted traces against the ground
truth traces. The empirical results show that the integration of background knowledge does not
affect the accuracy of next activity prediction, indicating that the predictor retains its capability
to mimic training data when combined with logical knowledge. However, RNN+LTL models
consistently achieve the best performance in suffix prediction. In terms of the satisfiability
percentage metric, utilizing LTL background knowledge consistently results in an improved
satisfaction rate over using only data, regardless of the sampling technique employed. Varying
prefix lengths display no significant impact on the satisfaction rate, implying that the models
are equally capable of sampling traces that satisfy the formulas for any input prefix length. The
DL distance comparison further supports the superiority of models that utilize LTL background
knowledge, showing a lower distance between the sampled traces and the ground truth traces
for both sampling strategies used. The prefix length, however, does impact the DL distance,
with shorter prefix inputs leading to a higher distance between the predicted and ground truth
traces. In general, random sampling shows better performance compared to greedy sampling in
terms of both metrics. Additionally, the DL distance results of the random sampling methods
exhibit a more stable standard deviation than those of the greedy sampling methods. Overall,
our empirical evaluation supports the hypothesis that integrating background knowledge into
the models consistently results in improved performance.


6. Conclusions and Future Work
In conclusion, we present a novel NeSy framework for incorporating background knowledge in
LTL𝑓 into the training process of a deep autoregressive model for multistep symbolic sequence
generation (i.e., suffix prediction). Our method involves: (i) translating the LTL𝑓 formula into
a DFA, (ii) representing the DFA probabilistically through a neural framework, and (iii) softly
enforcing DFA acceptance on differentiable suffixes obtained via Gumbel-Softmax sampling.
This process allows us to define a logical loss that smoothly guides the predicted suffixes to
satisfy the given knowledge, which can be integrated with other loss functions of the predictor.
Experimental results on synthetic datasets demonstrate that applying our loss function to RNN
models consistently produces suffixes with both lower DL distances and higher rates of LTL𝑓
knowledge satisfaction, regardless of the sampling technique used for querying the network.
In future work, we plan to apply our framework to realistic BPM datasets and more advanced
neural predictors, such as GANs and Transformers.


Acknowledgments
This work has been partially supported by PNRR MUR project PE0000013-FAIR , the Sapienza
Project MARLeN, the EU H2020 project AIPlan4EU (No. 101016442), the ERC-ADG WhiteMech
(No. 834228), and the PRIN project RIPER (No. 20203FFYLK).


References
 [1] E. Rama-Maneiro, J. C. Vidal, M. Lama, Deep learning for predictive business process
     monitoring: Review and benchmark, IEEE Transactions on Services Computing 16 (2020)
     739–756.
 [2] N. Tax, I. Verenich, M. L. Rosa, M. Dumas, Predictive business process monitoring
     with LSTM neural networks, in: Advanced Information Systems Engineering - 29th
     International Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017, Proceed-
     ings, 2017, pp. 477–492. URL: https://doi.org/10.1007/978-3-319-59536-8_30. doi:10.1007/
     978- 3- 319- 59536- 8\_30 .
 [3] G. Rivera Lazo, R. Ñanculef, Multi-attribute transformers for sequence prediction in
     business process management, in: Discovery Science: 25th International Conference,
     DS 2022, Montpellier, France, October 10–12, 2022, Proceedings, Springer-Verlag, Berlin,
     Heidelberg, 2022, p. 184–194. doi:10.1007/978- 3- 031- 18840- 4_14 .
 [4] E. Rama-Maneiro, F. Patrizi, J. C. Vidal, M. Lama, Towards learning the optimal sampling
     strategy for suffix prediction in predictive monitoring, in: Proc. of CAISE 2024, 2024, p. To
     Appear.
 [5] E. Giunchiglia, M. C. Stoian, S. Khan, F. Cuzzolin, T. Lukasiewicz, ROAD-R: the autonomous
     driving dataset with logical requirements, Mach. Learn. 112 (2023) 3261–3291. doi:10.
     1007/S10994- 023- 06322- Z .
 [6] M. Ma, J. Gao, L. Feng, J. Stankovic, Stlnet: Signal temporal logic enforced multivariate
     recurrent neural networks, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin
     (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates,
     Inc., 2020, pp. 14604–14614.
 [7] C. Di Francescomarino, F. M. Maggi, G. Petrucci, A. Yeshchenko, An eye into the future:
     Leveraging a-priori knowledge in predictive business process monitoring, in: Business Pro-
     cess Management - 15th International Conference, BPM 2017, Barcelona, Spain, September
     10-15, 2017, Proceedings, 2017, pp. 252–268. doi:10.1007/978- 3- 319- 65000- 5\_15 .
 [8] E. Jang, S. Gu, B. Poole, Categorical reparameterization with gumbel-softmax, in: 5th
     International Conference on Learning Representations, ICLR 2017, Toulon, France, April
     24-26, 2017, Conference Track Proceedings, 2017.
 [9] F. J. Damerau, A technique for computer detection and correction of spelling errors,
     Commun. ACM 7 (1964) 171–176. doi:10.1145/363958.363994 .
[10] N. Tax, I. Verenich, M. L. Rosa, M. Dumas, Predictive business process monitoring with
     LSTM neural networks, in: Advanced Information Systems Engineering - 29th Interna-
     tional Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017, Proceedings, 2017, pp.
     477–492. doi:10.1007/978- 3- 319- 59536- 8\_30 .
[11] J. Evermann, J. Rehse, P. Fettke, Predicting process behaviour using deep learning, Decis.
     Support Syst. 100 (2017) 129–140. URL: https://doi.org/10.1016/j.dss.2017.04.003. doi:10.
     1016/J.DSS.2017.04.003 .
[12] I. Ketykó, F. Mannhardt, M. Hassani, B. F. van Dongen, What averages do not tell -
     predicting real life processes with sequential deep learning, CoRR abs/2110.10225 (2021).
     URL: https://arxiv.org/abs/2110.10225. arXiv:2110.10225 .
[13] N. D. Mauro, A. Appice, T. M. A. Basile, Activity prediction of business process in-
     stances with inception CNN models, in: AI*IA 2019 - Advances in Artificial Intelli-
     gence - XVIIIth International Conference of the Italian Association for Artificial In-
     telligence, Rende, Italy, November 19-22, 2019, Proceedings, 2019, pp. 348–361. URL:
     https://doi.org/10.1007/978-3-030-35166-3_25. doi:10.1007/978- 3- 030- 35166- 3\_25 .
[14] F. Taymouri, M. L. Rosa, S. M. Erfani, A deep adversarial model for suffix and remaining time
     prediction of event sequences, in: Proceedings of the 2021 SIAM International Conference
     on Data Mining, SDM 2021, Virtual Event, April 29 - May 1, 2021, 2021, pp. 522–530. URL:
     https://doi.org/10.1137/1.9781611976700.59. doi:10.1137/1.9781611976700.59 .
[15] A. Chiorrini, C. Diamantini, A. Mircoli, D. Potena, A preliminary study on the application of
     reinforcement learning for predictive process monitoring, in: Process Mining Workshops
     - ICPM 2020 International Workshops, Padua, Italy, October 5-8, 2020, Revised Selected
     Papers, 2020, pp. 124–135. doi:10.1007/978- 3- 030- 72693- 5\_10 .
[16] T. R. Besold, A. S. d’Avila Garcez, S. Bader, H. Bowman, P. M. Domingos, P. Hitzler,
     K. Kühnberger, L. C. Lamb, P. M. V. Lima, L. de Penning, G. Pinkas, H. Poon, G. Zaverucha,
     Neural-symbolic learning and reasoning: A survey and interpretation, in: Neuro-Symbolic
     Artificial Intelligence: The State of the Art, 2021, pp. 1–51. doi:10.3233/FAIA210348 .
[17] Y. Xie, F. Zhou, H. Soh, Embedding symbolic temporal knowledge into deep se-
     quential models, CoRR abs/2101.11981 (2021). URL: https://arxiv.org/abs/2101.11981.
     arXiv:2101.11981 .
[18] E. Umili, R. Capobianco, G. D. Giacomo, Grounding ltlf specifications in image sequences,
     in: Proceedings of the 20th International Conference on Principles of Knowledge Repre-
     sentation and Reasoning, KR 2023, Rhodes, Greece, September 2-8, 2023, 2023, pp. 668–678.
     URL: https://doi.org/10.24963/kr.2023/65. doi:10.24963/KR.2023/65 .
[19] S. Badreddine, A. d’Avila Garcez, L. Serafini, M. Spranger, Logic tensor networks, Artificial
     Intelligence 303 (2022) 103649. URL: https://www.sciencedirect.com/science/article/pii/
     S0004370221002009. doi:https://doi.org/10.1016/j.artint.2021.103649 .
[20] E. Umili, F. Argenziano, A. Barbin, R. Capobianco, Visual reward machines, in: Proceedings
     of the 17th International Workshop on Neural-Symbolic Learning and Reasoning, La
     Certosa di Pontignano, Siena, Italy, July 3-5, 2023, 2023, pp. 255–267. URL: https://ceur-ws.
     org/Vol-3432/paper23.pdf.
[21] E. Umili, R. Capobianco, DeepDFA: a transparent neural network design for dfa induction,
     2023. doi:10.13140/RG.2.2.25449.98401 .
[22] A. Pnueli, The temporal logic of programs, in: 18th Annual Symposium on Foundations
     of Computer Science, Providence, Rhode Island, USA, 31 October - 1 November 1977,
     IEEE Computer Society, 1977, pp. 46–57. URL: https://doi.org/10.1109/SFCS.1977.32. doi:10.
     1109/SFCS.1977.32 .
[23] G. De Giacomo, M. Y. Vardi, Linear temporal logic and linear dynamic logic on finite
     traces, in: Proceedings of the Twenty-Third International Joint Conference on Artificial
     Intelligence, IJCAI ’13, AAAI Press, 2013, p. 854–860.
[24] G. De Giacomo, R. De Masellis, M. Montali, Reasoning on ltl on finite traces: Insensitivity to
     infiniteness, Proceedings of the AAAI Conference on Artificial Intelligence 28 (2014). URL:
     https://ojs.aaai.org/index.php/AAAI/article/view/8872. doi:10.1609/aaai.v28i1.8872 .
[25] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
     E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, Llama: Open and
     efficient foundation language models, CoRR abs/2302.13971 (2023). URL: https://doi.org/
     10.48550/arXiv.2302.13971. doi:10.48550/ARXIV.2302.13971 . arXiv:2302.13971 .
[26] OpenAI, Gpt-4 technical report, 2024. arXiv:2303.08774 .
[27] A. van den Oord, N. Kalchbrenner, K. Kavukcuoglu, Pixel recurrent neural networks, in:
     M. F. Balcan, K. Q. Weinberger (Eds.), Proceedings of The 33rd International Conference
     on Machine Learning, volume 48 of Proceedings of Machine Learning Research, PMLR, New
     York, New York, USA, 2016, pp. 1747–1756.
[28] T. Salimans, A. Karpathy, X. Chen, D. P. Kingma, Pixelcnn++: A pixelcnn implementation
     with discretized logistic mixture likelihood and other modifications, in: ICLR, 2017.
[29] A. Casolaro, V. Capone, G. Iannuzzo, F. Camastra, Deep learning for time series forecast-
     ing: Advances and open problems, Information 14 (2023). URL: https://www.mdpi.com/
     2078-2489/14/11/598. doi:10.3390/info14110598 .
[30] C. J. Maddison, A. Mnih, Y. W. Teh, The concrete distribution: A continuous relaxation of
     discrete random variables, in: 5th International Conference on Learning Representations,
     ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, 2017. URL:
     https://openreview.net/forum?id=S1jE5L5gl.
[31] E. Umili, Neurosymbolic integration of linear temporal logic in non symbolic do-
     mains, in: Multi-Agent Systems - 20th European Conference, EUMAS 2023, Naples,
     Italy, September 14-15, 2023, Proceedings, 2023, pp. 521–527. URL: https://doi.org/10.1007/
     978-3-031-43264-4_41. doi:10.1007/978- 3- 031- 43264- 4\_41 .
[32] F. Fuggitti, Ltlf2dfa, 2019. doi:10.5281/zenodo.3888410 .
[33] G. De Giacomo, M. Dumas, F. M. Maggi, M. Montali, Declarative process modeling in
     bpmn, in: J. Zdravkovic, M. Kirikova, P. Johannesson (Eds.), Advanced Information Systems
     Engineering, Springer International Publishing, Cham, 2015, pp. 84–100.