=Paper=
{{Paper
|id=Vol-3741/paper17
|storemode=property
|title=Enhancing Next Activity Prediction with Adversarial Training of Vision Transformers
|pdfUrl=https://ceur-ws.org/Vol-3741/paper17.pdf
|volume=Vol-3741
|authors=Vincenzo Pasquadibisceglie,Annalisa Appice,Giovanna Castellano,Donato Malerba
|dblpUrl=https://dblp.org/rec/conf/sebd/Pasquadibisceglie24
}}
==Enhancing Next Activity Prediction with Adversarial Training of Vision Transformers==
Vincenzo Pasquadibisceglie 1,2,*, Annalisa Appice 1,2, Giovanna Castellano 1,2 and Donato Malerba 1,2

1 University of Bari Aldo Moro, Bari, Italy
2 Consorzio Interuniversitario Nazionale per l'Informatica - CINI, Bari, Italy

SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
Emails: vincenzo.pasquadibisceglieds@uniba.it (V. Pasquadibisceglie), annalisa.appice@uniba.it (A. Appice), giovanna.castellano@uniba.it (G. Castellano), donato.malerba@uniba.it (D. Malerba)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract

Predicting the subsequent activity in the ongoing execution (trace) of a business process is a crucial task in Predictive Process Monitoring (PPM). This capability enables analysts to intervene proactively and prevent undesirable behaviors. This paper presents a PPM approach that integrates adversarial training with Vision Transformers (ViTs) to enhance the accuracy of predicting the next activity in a running process trace. The approach takes into account the multi-view information that may be captured in a process trace, treating the different views as distinct patches of an image. Attention modules are employed to reveal explainable information about the different views of a business process and the trace events that could influence the prediction. Additionally, to mitigate overfitting and improve accuracy, we investigate the impact of adversarial ViT training. Experiments conducted on various benchmark event logs demonstrate the effectiveness of the proposed approach compared to several state-of-the-art PPM techniques. Notably, the explanations obtained through attention modules yield valuable insights.

Keywords: Predictive process monitoring, Next activity prediction, Deep learning, Multi-view learning, Adversarial training, Vision transformers, Attention, XAI, Computer vision

1. Introduction

Predictive Process Monitoring (PPM) is a burgeoning field focused on enhancing business process efficiency and effectiveness through predictive analytics. By analyzing historical raw event data, PPM systems can identify patterns and trends, providing valuable insights into the key factors contributing to process inefficiencies and bottlenecks.

The use of deep learning in predictive modeling has become increasingly popular in PPM systems, reflecting the broader trend of deep learning's success across various domains. Specifically, several deep neural networks, such as Long Short-Term Memory (LSTM) networks [1], [2], [3], [4], Convolutional Neural Networks (CNNs) [5], [6], Generative Adversarial Networks (GANs) [7], and Autoencoders [8], have recently contributed to improving the accuracy of PPM systems. This is due to their ability to learn accurate deep neural models, which in turn enable proactive and corrective actions to enhance process performance and mitigate risks. While the primary focus of PPM systems remains on delivering accurate predictions of future states of running traces, there is a growing preference for predictive models that are easier to explain in PPM applications.
Recent studies [9, 10, 11, 12] have explored the application of existing eXplainable AI (XAI) methods to elucidate opaque PPM models. However, the issue of model explainability in the context of deep learning-based PPM systems remains under-explored. We recently introduced JARVIS [13] (Joining Adversarial tRaining with VISion transformers in next-activity prediction), a method that combines Vision Transformers (ViTs) and adversarial training to achieve a balance between model accuracy and explainability. Specifically, the model's explainability is enhanced by the adoption of a ViT, a deep neural architecture comprising multiple self-attention layers. An attention layer in deep learning is a component that enables a neural network to concentrate on specific parts of the input data when making predictions or decisions. It is inspired by the human visual system, which can selectively focus on different parts of an image to understand it better. Therefore, multiple self-attention layers can provide an explanation of the model's behavior in terms of the most informative inputs. Adversarial training [14] is also employed to improve accuracy by incorporating perturbed (i.e., adversarial) inputs into the training process, thereby mitigating overfitting and enhancing generalization.

The paper is organized as follows. Preliminary concepts are reported in Section 2, while the JARVIS approach is described in Section 3. The experimental setup and the results of the evaluation of the proposed approach are illustrated in Section 4. Finally, Section 5 recalls the purpose of our research, draws conclusions, and illustrates possible future developments.

2. Preliminary concepts

A trace is a record of a business process that shows the stages of its execution through a sequence of events. An event is a complex entity characterized by two essential components: the activity and the timestamp (indicating when the activity occurred). Additionally, events may possess optional characteristics, such as the resource responsible for the activity or the cost involved in completing it. Consequently, each event is accompanied by two mandatory views, representing the activity and the timestamp, as well as 𝑚 optional views corresponding to other event characteristics. Let 𝒜 be the set of all activity names, 𝒮 be the set of all trace identifiers, 𝒯 be the set of all timestamps, and 𝒱𝑗 be the set of all names in the 𝑗-th view, where 1 ≤ 𝑗 ≤ 𝑚.

Definition 1 (Event). Given the event universe ℰ = 𝒮 × 𝒜 × 𝒯 × 𝒱1 × . . . × 𝒱𝑚, an event 𝑒 ∈ ℰ is a tuple 𝑒 = (𝜎, 𝑎, 𝑡, 𝑣1, . . . , 𝑣𝑚) that represents the occurrence of activity 𝑎 in trace 𝜎 at timestamp 𝑡 with characteristics 𝑣1, 𝑣2, . . . , 𝑣𝑚. Let us introduce the functions 𝜋𝒮 : ℰ → 𝒮 such that 𝜋𝒮(𝑒) = 𝜎, 𝜋𝒜 : ℰ → 𝒜 such that 𝜋𝒜(𝑒) = 𝑎, 𝜋𝒯 : ℰ → 𝒯 such that 𝜋𝒯(𝑒) = 𝑡, and 𝜋𝒱𝑗 : ℰ → 𝒱𝑗 such that 𝜋𝒱𝑗(𝑒) = 𝑣𝑗, where 𝑗 = 1, . . . , 𝑚.

Definition 2 (Trace). Let ℰ* denote the set of all possible sequences on ℰ. A trace 𝜎 is a sequence 𝜎 = ⟨𝑒1, 𝑒2, . . . , 𝑒𝑛⟩ ∈ ℰ* such that: (1) ∀𝑖 = 1, . . . , 𝑛, ∃𝑒𝑖 ∈ ℰ such that 𝜎(𝑖) = 𝑒𝑖 and 𝜋𝒮(𝑒𝑖) = 𝜎, and (2) ∀𝑖 = 1, . . . , 𝑛 − 1, 𝜋𝒯(𝑒𝑖) ≤ 𝜋𝒯(𝑒𝑖+1).

Definition 3 (Event log). Let ℬ(ℰ*) denote the set of all multisets over ℰ*. An event log ℒ ∈ ℬ(ℰ*) is a multiset of traces.

Definition 4 (Prefix trace). A prefix trace 𝜎^𝑘 = ⟨𝑒1, 𝑒2, . . . , 𝑒𝑘⟩ is the sub-sequence of a trace 𝜎 starting from the beginning of 𝜎, with 1 ≤ 𝑘 = |𝜎^𝑘| < |𝜎|.
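As an illustration of the data model formalized above (not the authors' implementation; all names are ours), the following minimal Python sketch models events and extracts the prefix traces of a trace together with their next-activity labels, anticipating the labeling formalized in Definition 5 below.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Event:
    trace_id: str        # sigma: identifier of the trace the event belongs to
    activity: str        # a: mandatory activity view
    timestamp: float     # t: mandatory timestamp view
    views: Dict[str, str]  # optional views v_1, ..., v_m (e.g. resource, cost)

# A trace is a timestamp-ordered sequence of events sharing the same trace_id.
Trace = List[Event]

def labeled_prefixes(trace: Trace) -> List[Tuple[List[Event], str]]:
    """Extract every prefix sigma^k (1 <= k < |sigma|) together with its
    next activity pi_A(e_{k+1})."""
    return [(trace[:k], trace[k].activity) for k in range(1, len(trace))]
```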
A trace is a complete (i.e., started and ended) process instance, while a prefix trace is a process instance in execution (also called a running trace). The activity 𝜋𝒜(𝑒𝑘+1) = 𝑎𝑘+1 corresponds to the next activity of 𝜎^𝑘, i.e., 𝑛𝑒𝑥𝑡(𝜎^𝑘) = 𝜋𝒜(𝑒𝑘+1) with 𝑒𝑘+1 = 𝜎(𝑘 + 1).

Definition 5 (Multiset of labeled prefix traces). Let ℒ ∈ ℬ(ℰ*) be an event log; 𝒫 ∈ ℬ(ℰ* × 𝒜) is the multiset of all prefix traces extracted from the traces recorded in ℒ. Each prefix trace is labeled with the next activity associated with the prefix sequence in the corresponding trace, so that 𝒫 = [(𝜎^𝑘, 𝜋𝒜(𝑒𝑘+1)) | 𝜎 ∈ ℒ ∧ 1 ≤ 𝑘 < |𝜎|].

Definition 6 (Single-view representation of a labeled prefix trace multiset). Let 𝒱 be a view (either mandatory, i.e., 𝒱 = 𝒜 or 𝒱 = 𝒯, or optional, i.e., 𝒱 = 𝒱𝑗 with 𝑗 = 1, . . . , 𝑚), and let Π𝒱 : ℰ* → 𝒱* be a function such that Π𝒱(𝜎^𝑘) = Π𝒱(⟨𝑒1, 𝑒2, . . . , 𝑒𝑘⟩) = ⟨𝜋𝒱(𝑒1), 𝜋𝒱(𝑒2), . . . , 𝜋𝒱(𝑒𝑘)⟩. 𝒫𝒱 denotes the multiset of the labeled prefix traces of 𝒫 as they are represented in the view 𝒱, that is, 𝒫𝒱 = {(Π𝒱(𝜎^𝑘), 𝑎𝑘+1) | (𝜎^𝑘, 𝜋𝒜(𝑒𝑘+1)) ∈ 𝒫}.

Given the prefix 𝜎^𝑘 of a longer trace 𝜎 for which we do not know the events in the rest of the sequence ⟨𝑒𝑘+1, 𝑒𝑘+2, . . . , 𝑒𝑛⟩, we can resort to machine learning techniques to learn a function 𝐻 : ℰ* → 𝒜 from a labeled prefix trace multiset 𝒫, such that 𝐻(𝜎^𝑘) predicts the expected next activity 𝑎𝑘+1. More specifically, we frame the next-activity prediction task as a multi-class classification problem. In this study, we represent the labeled multiset 𝒫 as a collection of color image patches that are given as input to a ViT architecture [15]. This approach allows us to leverage the ViT's self-attention mechanism to capture complex relationships between different parts of the input data. Moreover, we can simultaneously enhance the model's explainability, as the self-attention mechanism enables the model to focus on the most informative inputs.

3. JARVIS

The main phases of the JARVIS approach are described in the following. Specifically, Section 3.1 illustrates how the labeled prefix trace multiset 𝒫 is extracted from the event log ℒ and transformed into a set ℐ of multi-patch color images. Section 3.2 describes how ℐ is fed into a ViT architecture that is trained with adversarial training to estimate the parameters of a next-activity prediction function 𝐻. Finally, Section 3.3 illustrates the extraction of the attention maps.

3.1. Multi-patch image encoding

This phase takes the event log ℒ as input and creates the multiset of multi-patch color images ℐ as output. According to the multi-view formulation introduced in Section 2, every event recorded in ℒ is a complex entity whose representation takes into account two mandatory characteristics (activity 𝒜 and timestamp 𝒯) and 𝑚 optional characteristics (𝒱1, 𝒱2, . . . , 𝒱𝑚). The timestamp of an event is transformed into the time in seconds elapsed since the beginning of the trace. In this study, every numerical characteristic is converted into a categorical one by resorting to the equal-frequency discretization algorithm. The number of discretization bins of a numeric characteristic is set equal to the average number of distinct categories in the original categorical views of the event log. This ensures that the granularity of the discretized variables is consistent with that of the other original categorical variables. After this step, the event log ℒ contains all multi-view information in categorical format.
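The discretization step can be sketched as follows (our illustration, not the authors' code; the dataframe layout and column names are hypothetical). Each numeric view is binned into equal-frequency intervals, with the number of bins set to the average number of distinct categories across the original categorical views.

```python
import pandas as pd

def discretize_numeric_views(log: pd.DataFrame, numeric_cols, categorical_cols) -> pd.DataFrame:
    """Equal-frequency discretization of numeric event attributes.

    The number of bins is the average number of distinct categories observed in
    the original categorical views, so that the discretized views have a
    granularity comparable with the categorical ones."""
    n_bins = int(round(sum(log[c].nunique() for c in categorical_cols) / len(categorical_cols)))
    for col in numeric_cols:
        # qcut builds bins containing (approximately) the same number of events
        log[col] = pd.qcut(log[col], q=n_bins, duplicates="drop").astype(str)
    return log
```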
We denote by V the final set of 𝑚 + 2 categorical views that characterize the events recorded in the pre-processed event log ℒ. Subsequently, the multiset 𝒫 is created by extracting prefix traces from ℒ and labeling them with the next activity. As prefix traces in ℒ may vary in length, we employ a combination of padding and windowing techniques to ensure uniformity of the length of the prefix traces in 𝒫. Let 𝐴𝑉𝐺𝜎 be the average length of the traces in ℒ; padding is used with a window length equal to 𝐴𝑉𝐺𝜎, as in [3]. Prefix traces shorter than 𝐴𝑉𝐺𝜎 are standardized by adding dummy events, while prefix traces longer than 𝐴𝑉𝐺𝜎 are standardized by retaining only the most recent 𝐴𝑉𝐺𝜎 events. After this step, 𝒫 comprises labeled prefix traces with fixed size equal to 𝐴𝑉𝐺𝜎.

The Continuous Bag-of-Words (CBOW) architecture of the Word2Vec scheme [16] is then used to transform the categorical representation of a prefix trace into a bidimensional, numeric embedding representation. For each view 𝒱 ∈ V, a CBOW neural network, denoted by 𝐶𝐵𝑂𝑊𝒱, is trained to convert each element of a single-view sequence Π𝒱(𝜎^𝑘) ∈ 𝒫𝒱 into an 𝐴𝑉𝐺𝜎-sized numerical vector, so that Π𝒱(𝜎^𝑘) as a whole is converted into a bidimensional, numeric embedding P𝒱 ∈ ℝ^(𝐴𝑉𝐺𝜎×𝐴𝑉𝐺𝜎) with size 𝐴𝑉𝐺𝜎 × 𝐴𝑉𝐺𝜎.

Finally, for each labeled prefix trace (𝜎^𝑘, 𝑎𝑘+1) ∈ 𝒫, the list of its multi-view, bidimensional, numeric embeddings P𝒜, P𝒯, P𝒱1, . . . , P𝒱𝑚, generated for Π𝒜(𝜎^𝑘), Π𝒯(𝜎^𝑘), Π𝒱1(𝜎^𝑘), . . . , Π𝒱𝑚(𝜎^𝑘), respectively, is converted into the imagery color patches P^rgb_𝒜, P^rgb_𝒯, P^rgb_𝒱1, . . . , P^rgb_𝒱𝑚 by mapping the numeric values of the bidimensional embeddings into RGB pixels. In particular, every imagery color patch P^rgb ∈ ℝ^(𝐴𝑉𝐺𝜎×𝐴𝑉𝐺𝜎×3) records the embedding of a prefix trace with respect to a view into a numerical tensor with size 𝐴𝑉𝐺𝜎 × 𝐴𝑉𝐺𝜎 × 3. Let P be a bidimensional, numeric embedding; each numeric value 𝑣 ∈ P is converted into an RGB pixel 𝑣^rgb ∈ P^rgb by resorting to the RGB-like encoding function adopted in [5]. The 𝑚 + 2 color patches of a prefix trace are distributed into a patch grid with size ⌈√(𝑚 + 2)⌉ × ⌈√(𝑚 + 2)⌉, from left to right and from top to bottom. Notice that every cell of the patch grid records a patch with size 𝐴𝑉𝐺𝜎 × 𝐴𝑉𝐺𝜎 × 3. In this way, we produce the color image of a prefix trace, that is, a tensor with size (⌈√(𝑚 + 2)⌉ · 𝐴𝑉𝐺𝜎) × (⌈√(𝑚 + 2)⌉ · 𝐴𝑉𝐺𝜎) × 3. The generated multi-patch images are labeled as the corresponding prefix traces and added to the labeled image multiset ℐ.
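The patch-grid layout just described can be sketched as follows. This is a minimal illustration assuming the per-view RGB patches have already been computed from the CBOW embeddings; the function and variable names are ours, not the authors' implementation.

```python
import math
import numpy as np

def assemble_multi_patch_image(patches):
    """Lay out the m+2 per-view RGB patches (each AVG_sigma x AVG_sigma x 3) on a
    ceil(sqrt(m+2)) x ceil(sqrt(m+2)) grid, left to right and top to bottom.
    Unused grid cells are left as zero (black) padding."""
    side = math.ceil(math.sqrt(len(patches)))
    patch_size = patches[0].shape[0]
    image = np.zeros((side * patch_size, side * patch_size, 3), dtype=patches[0].dtype)
    for idx, patch in enumerate(patches):
        row, col = divmod(idx, side)
        image[row * patch_size:(row + 1) * patch_size,
              col * patch_size:(col + 1) * patch_size] = patch
    return image

# Example: 9 views (activity, timestamp and m = 7 optional views), AVG_sigma = 20
# patches = [np.random.rand(20, 20, 3) for _ in range(9)]
# img = assemble_multi_patch_image(patches)   # 60 x 60 x 3 multi-patch image
```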
The parameters of the ViT architecture are estimated through an adversarial training strategy, described next.

3.2. Adversarial Training

In this study, we use the popular state-of-the-art Fast Gradient Sign Method (FGSM) [17] to generate adversarial images. It is a white-box, gradient-based algorithm that computes a perturbation of an input image that increases the loss of a pre-trained neural model, so that re-training on the perturbed images makes the model's decisions less overfitted on specific classes. The pre-trained model is the ViT architecture described above, with parameters initially estimated on the original labeled images of ℐ. The FGSM algorithm is based on the gradient formula 𝑔(I) = ∇_I 𝐽(𝜃, I, 𝑦), where ∇_I denotes the gradient computed with respect to the imagery sample I, and 𝐽(𝜃, I, 𝑦) denotes the loss function of the ViT neural model initially trained on the original training set ℐ.

In theory, FGSM determines the minimum perturbation 𝜖 to add to a training image I to create an adversarial sample that maximizes the loss function. According to this theory, given an input perturbation value 𝜖, for each labeled image (I, 𝑦) ∈ ℐ, a new image (I_adv, 𝑦) ∈ ℐ_adv can be generated such that I_adv = I + 𝜖 · sign(𝑔(I)). Once ℐ_adv has been generated, the parameters of the ViT architecture are finally estimated from the adversarially augmented training set ℐ ∪ ℐ_adv.
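The FGSM step above can be sketched as follows. This is a minimal illustration in PyTorch, assuming the ViT is a standard classifier trained with a cross-entropy loss; it is not the authors' implementation, and the clamping of pixel values to [0, 1] is our assumption about the valid pixel range.

```python
import torch
import torch.nn.functional as F

def fgsm_adversarial_images(model, images, labels, epsilon):
    """Generate adversarial copies I_adv = I + epsilon * sign(grad_I J(theta, I, y))
    for a batch of multi-patch images, following the FGSM update above."""
    model.eval()
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)  # J(theta, I, y)
    loss.backward()                                # g(I) = grad_I J(theta, I, y)
    adv = images + epsilon * images.grad.sign()
    return adv.clamp(0.0, 1.0).detach()            # keep pixels in a valid range (assumption)

# The ViT parameters are then re-estimated on the union of original and adversarial images.
```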
3.3. Extracting maps of attention

Once the ViT parameters have been estimated, the ViT model is used to decide on the next activity of any prefix trace. The Attention Rollout method [18] is used to extract the map of attention of the decision of the ViT model on a single sample. Then, we derive a quantitative indicator of the importance of events within patches by exploiting the lightness information of the attention maps: the lighter a pixel in the attention map, the higher the effect of the corresponding pixel information enclosed in the prefix trace image on the ViT decision. However, the generated attention maps are represented in the RGB color space, which operates on three channels (red, green, and blue) and does not provide information about lightness. Hence, we transform the RGB representation of the attention maps into the LAB color space, which operates on three different channels: the color lightness (L), the green-red color component (A), and the blue-yellow color component (B). The transformation from the RGB space to the LAB space is performed as follows [19]: 𝐿 = 0.2126 · 𝑅 + 0.7152 · 𝐺 + 0.0722 · 𝐵, 𝐴 = 1.4749 · (0.2213 · 𝑅 − 0.3390 · 𝐺 + 0.1177 · 𝐵) + 128, and 𝐵 = 0.6245 · (0.1949 · 𝑅 + 0.6057 · 𝐺 − 0.8006 · 𝐵) + 128.

4. Experiments

Section 4.1 describes the event logs used for evaluating the accuracy and explainability of JARVIS and the experimental set-up. Section 4.2 describes the accuracy results, while Section 4.3 describes the explanation results.

4.1. Event logs and experimental set-up

We analyzed eight real-life event logs available from the 4TU Centre for Research (https://data.4tu.nl/portal). For each event log we performed a temporal split, dividing the log into training and testing traces. To achieve this, we sorted the traces of each event log by their starting timestamps. The first two-thirds of the sorted traces were chosen for training the predictive model, while the remaining one-third was reserved for evaluating the model's performance on unseen data.

4.2. Accuracy performance analysis

We evaluated the performance of JARVIS against the methods outlined in [3], [5], [10], [20] and [21]. All methods, with the exception of [3], were originally tested by their respective authors on specific views of the traces. Specifically, [5] and [10] were experimented with activity, resource, and timestamp information, [20] was experimented with activity and timestamp information, and [21] was experimented with activity information. To provide a fair comparison, we ran these related methods by accounting for all the views recorded in the considered event logs. Since the authors of the related methods made their code available, we were able to run all the compared algorithms in the same experimental setting, thus performing a safe comparison. We analyze the macro FScore and the macro GMean performances achieved. Both the macro FScore and the macro GMean are well-known multi-class classification metrics commonly used in imbalanced domains.

Table 1 collects the macro FScore and the macro GMean of both the considered related methods and JARVIS. These results show that JARVIS achieves the highest FScore and GMean in five out of eight event logs, being the runner-up method in one out of eight event logs. In addition, JARVIS outperforms the two related methods using an imagery encoding strategy [5, 20] on all event logs except BPI12W, and it always outperforms the related method using a Transformer [21]. It also outperforms the related method using attention modules [10], except for the macro FScore in BPI12W and both the macro FScore and the macro GMean in BPI13I.

Table 1: Comparison between JARVIS and the related methods defined in [3], [5], [10], [20] and [21]: macro FScore and macro GMean.

Macro FScore:

| Event log | JARVIS | [3] | [5] | [10] | [20] | [21] |
|---|---|---|---|---|---|---|
| BPI12W | 0.667 | 0.737 | 0.692 | 0.673 | 0.673 | 0.661 |
| BPI12WC | 0.705 | 0.685 | 0.661 | 0.675 | 0.645 | 0.668 |
| BPI12C | 0.644 | 0.654 | 0.642 | 0.638 | 0.643 | 0.624 |
| BPI13P | 0.414 | 0.320 | 0.336 | 0.408 | 0.228 | 0.405 |
| BPI13I | 0.387 | 0.405 | 0.295 | 0.407 | 0.363 | 0.380 |
| Receipt | 0.525 | 0.455 | 0.409 | 0.471 | 0.302 | 0.383 |
| BPI17O | 0.720 | 0.714 | 0.705 | 0.691 | 0.718 | 0.712 |
| BPI20R | 0.491 | 0.450 | 0.483 | 0.455 | 0.432 | 0.481 |

Macro GMean:

| Event log | JARVIS | [3] | [5] | [10] | [20] | [21] |
|---|---|---|---|---|---|---|
| BPI12W | 0.820 | 0.847 | 0.828 | 0.792 | 0.819 | 0.825 |
| BPI12WC | 0.812 | 0.798 | 0.778 | 0.792 | 0.780 | 0.787 |
| BPI12C | 0.786 | 0.792 | 0.782 | 0.785 | 0.781 | 0.781 |
| BPI13P | 0.595 | 0.533 | 0.546 | 0.594 | 0.472 | 0.593 |
| BPI13I | 0.615 | 0.626 | 0.534 | 0.626 | 0.594 | 0.603 |
| Receipt | 0.733 | 0.676 | 0.646 | 0.702 | 0.563 | 0.620 |
| BPI17O | 0.846 | 0.833 | 0.830 | 0.815 | 0.835 | 0.831 |
| BPI20R | 0.699 | 0.660 | 0.691 | 0.664 | 0.643 | 0.683 |

Figure 1: (a) Maps of attention of two prefix traces of BPI13P, which are correctly labeled with "a2" ("Accepted-In Progress") and "a4" ("Completed-Closed"), respectively. The maps are shown in the luminosity channel of the LAB space. The numbers identify the names of the views in the event log: 1-"activity", 2-"resource", 3-"timestamp", 4-"impact", 5-"org country", 6-"org group", 7-"org role", 8-"product", 9-"resource country". (b) Patch lightness measured for each view of BPI13P in the two maps of attention:

| View | Left map ("a2") | Right map ("a4") |
|---|---|---|
| activity | 91.56 | 120.00 |
| resource | 10.50 | 11.88 |
| timestamp | 15.81 | 19.43 |
| impact | 69.56 | 45.00 |
| org country | 12.19 | 39.81 |
| org group | 23.06 | 100.25 |
| org role | 20.94 | 36.31 |
| product | 5.94 | 95.81 |
| resource country | 13.88 | 28.38 |

4.3. Explanation analysis

This analysis aimed to explore how the intrinsic explanations enclosed in the attention maps generated through the ViT model may provide useful insights to explain model decisions. For example, Figure 1a shows the lightness channel of the attention maps extracted from the ViT model trained by JARVIS on two prefix traces of BPI13P. These prefix traces were correctly labeled with the next activities "a2" ("Accepted-In Progress") and "a4" ("Completed-Closed"), respectively. Figure 1b reports the local patch lightness measured for each view in the maps of attention shown in Figure 1a. These results show that the patch associated with "activity" conveys the most relevant information for recognizing both "Accepted-In Progress" and "Completed-Closed" as the next activities of the two sample prefix traces. However, "impact" and "org group" are the second and third most important views for the decision on the next activity "Accepted-In Progress", while "org group" and "product" are the second and third most important views for the decision on the next activity "Completed-Closed". Significantly, the "product" view, which ranks among the top three for predicting the next activity "Completed-Closed", holds less importance when predicting the next activity "Accepted-In Progress". This analysis underscores the notion that distinct views may carry varying degrees of significance depending on the specific decision being made.
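A per-view patch lightness of the kind reported in Figure 1b could be computed as sketched below (our illustration, not the authors' code): the rolled-out attention map is converted to the lightness channel with the RGB-to-LAB formula from Section 3.3 and averaged over each patch region of the grid. The grid ordering and the function names are assumptions on our part.

```python
import math
import numpy as np

def patch_lightness(attention_map_rgb, n_views, patch_size):
    """Average lightness (L channel) of each view's patch in an RGB attention map,
    using L = 0.2126*R + 0.7152*G + 0.0722*B as in Section 3.3."""
    r, g, b = attention_map_rgb[..., 0], attention_map_rgb[..., 1], attention_map_rgb[..., 2]
    lightness = 0.2126 * r + 0.7152 * g + 0.0722 * b
    side = math.ceil(math.sqrt(n_views))
    scores = {}
    for idx in range(n_views):
        row, col = divmod(idx, side)
        patch = lightness[row * patch_size:(row + 1) * patch_size,
                          col * patch_size:(col + 1) * patch_size]
        scores[idx] = float(patch.mean())  # higher lightness -> stronger influence on the decision
    return scores
```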
Finally, we examine the global effect of the different views by accounting for the patch lightness computed for each view and averaged over all the prefix traces of the training set. Figure 2 shows the heatmap of the average patch lightness computed on the training set of each event log of this study. This map shows which views have the highest global effect on ViT decisions.

Figure 2: Heatmap of the global patch lightness computed for all views (axis X) in all event logs (axis Y). "×" denotes that the view reported on axis X is missing in the event log reported on axis Y.

As expected, the activity information is globally the most important information for ViT decisions in all the event logs. However, this explanation information shows that the "product" information is globally among the top-three ranked views in BPI13P, whereas the "number of terms" and "action" information are globally among the top-three ranked views in BPI17O. These findings lend support to the decision to develop a multi-view approach that does not rely solely on the standard views (activity, timestamp and resource). They demonstrate that the type of information most valuable for predicting the next activity of a running trace may depend on the type of process of which the trace is an execution.

5. Conclusion

This paper introduces a Predictive Process Monitoring (PPM) approach designed to forecast the subsequent activity in a sequence of events. The method relies on an image-based representation of multiple views of the event sequence. It employs a ViT architecture, which utilizes self-attention modules to assign attention values to pairs of image patches, thereby capturing relationships between different views of the process data. Moreover, the self-attention modules allow for the integration of explainability into the model's structure by providing insights into the specific views and events that influenced the predictions. The proposed approach is assessed on various event logs, and the results illustrate its accuracy and the advantages of the attention mechanism's explanatory capabilities.

6. Acknowledgments

Vincenzo Pasquadibisceglie, Giovanna Castellano and Donato Malerba are partially supported by the project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007), under the NRRP MUR program funded by the NextGenerationEU. Annalisa Appice is partially supported by the project SERICS (PE00000014) under the NRRP MUR National Recovery and Resilience Plan funded by the European Union - NextGenerationEU.

References

[1] N. Tax, I. Verenich, M. La Rosa, M. Dumas, Predictive business process monitoring with LSTM neural networks, in: International Conference on Advanced Information Systems Engineering, CAiSE 2017, LNCS, Springer, 2017, pp. 477–492.

[2] M. Camargo, M. Dumas, O. González-Rojas, Learning accurate LSTM models of business processes, in: Business Process Management: 17th International Conference, BPM 2019, Vienna, Austria, September 1–6, 2019, Proceedings 17, Springer, 2019, pp. 286–302.
[3] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, A multi-view deep learning approach for predictive business process monitoring, IEEE Transactions on Services Computing 15 (2022) 2382–2395. doi:10.1109/TSC.2021.3051771.

[4] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Darwin: An online deep learning approach to handle concept drifts in predictive process monitoring, Engineering Applications of Artificial Intelligence 123 (2023) 106461. URL: https://www.sciencedirect.com/science/article/pii/S0952197623006450. doi:10.1016/j.engappai.2023.106461.

[5] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Predictive process mining meets computer vision, in: Business Process Management Forum, BPM 2020, volume 392 of LNBIP, Springer, 2020, pp. 176–192.

[6] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, G. Modugno, ORANGE: Outcome-oriented predictive process monitoring based on image encoding and CNNs, IEEE Access 8 (2020) 184073–184086. doi:10.1109/ACCESS.2020.3029323.

[7] F. Taymouri, M. L. Rosa, S. M. Erfani, Z. D. Bozorgi, I. Verenich, Predictive business process monitoring via generative adversarial nets: The case of next event prediction, in: 18th Int. Conf. on Business Process Management, BPM 2020, LNCS, Springer, 2020, pp. 237–256.

[8] N. Mehdiyev, J. Evermann, P. Fettke, A novel business process prediction model using a deep learning method, Business & Information Systems Engineering 62 (2018) 143–157. doi:10.1007/s12599-018-0551-3.

[9] R. Galanti, et al., Explainable predictive process monitoring, arXiv preprint arXiv:2008.01807 (2020).

[10] B. Wickramanayake, Z. He, C. Ouyang, C. Moreira, Y. Xu, R. Sindhgatta, Building interpretable models for business process prediction using shared and specialised attention mechanisms, Knowledge-Based Systems 248 (2022) 1–22. doi:10.1016/j.knosys.2022.108773.

[11] R. Galanti, M. de Leoni, M. Monaro, N. Navarin, A. Marazzi, B. Di Stasi, S. Maldera, An explainable decision support system for predictive process analytics, Engineering Applications of Artificial Intelligence 120 (2023) 105904. doi:10.1016/j.engappai.2023.105904.

[12] V. Pasquadibisceglie, A. Appice, G. Ieva, D. Malerba, TSUNAMI - an explainable PPM approach for customer churn prediction in evolving retail data environments, Journal of Intelligent Information Systems (2023). URL: https://doi.org/10.1007/s10844-023-00838-5. doi:10.1007/s10844-023-00838-5.

[13] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, JARVIS: Joining adversarial training with vision transformers in next-activity prediction, IEEE Transactions on Services Computing (2023) 1–14. doi:10.1109/TSC.2023.3331020.

[14] T. Bai, J. Luo, J. Zhao, B. Wen, Q. Wang, Recent advances in adversarial training for adversarial robustness, in: 30th International Joint Conference on Artificial Intelligence, IJCAI 2021, 2021, pp. 4312–4321.

[15] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition at scale, in: 9th Int. Conf. on Learning Representations, ICLR 2021, 2021.

[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in vector space, in: 1st Int. Conf. on Learning Representations, ICLR 2013, 2013.

[17] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in: 3rd International Conference on Learning Representations, ICLR 2015, 2015, pp. 1–11.
[18] S. Abnar, W. H. Zuidema, Quantifying attention flow in transformers, in: 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Association for Computational Linguistics, 2020, pp. 4190–4197.

[19] N. Nader, F. E.-Z. El-Gamal, M. Elmogy, Enhanced kinship verification analysis based on color and texture handcrafted techniques, Research Square (2022). doi:10.21203/rs.3.rs-2139523/v1.

[20] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Using convolutional neural networks for predictive process analytics, in: 1st International Conference on Process Mining, ICPM 2019, IEEE, 2019, pp. 129–136. doi:10.1109/ICPM.2019.00028.

[21] Z. A. Bukhsh, A. Saeed, R. M. Dijkman, ProcessTransformer: Predictive business process monitoring with transformer network, CoRR abs/2104.00721 (2021).