                                Enhancing Next Activity Prediction with Adversarial
                                Training of Vision Transformers
                                Vincenzo Pasquadibisceglie1,2,*,† , Annalisa Appice1,2 , Giovanna Castellano1,2 and
                                Donato Malerba1,2
1 University of Bari Aldo Moro, Bari, Italy
2 Consorzio Interuniversitario Nazionale per l’Informatica - CINI, Bari, Italy



                                               Abstract
                                               Predicting the subsequent activity in the ongoing execution (trace) of a business process is a crucial
                                               task in Predictive Process Monitoring (PPM). This capability enables analysts to intervene proactively
                                               and prevent undesirable behaviors. This paper presents a PPM approach that integrates adversarial
                                               training with Vision Transformers (ViTs) to enhance the accuracy of predicting the next activity in a
running process trace. This approach takes into account the multi-view information that may be captured in a process trace, treating each view as a distinct patch of an image. Attention modules are employed to reveal
                                               explainable information about the different views of a business process and the trace events that could
                                               influence the prediction. Additionally, to mitigate overfitting and improve accuracy, we investigate the
                                               impact of adversarial ViT training. Experiments conducted on various benchmark event logs demonstrate
                                               the effectiveness of the proposed approach compared to several state-of-the-art PPM techniques. Notably,
                                               the explanations obtained through attention modules yield valuable insights.

                                               Keywords
                                               Predictive process monitoring, Next activity prediction, Deep learning, Multi-view learning, Adversarial
                                               training, Vision transformers, Attention, XAI, Computer vision




                                1. Introduction
                                Predictive Process Monitoring (PPM) is a burgeoning field focused on enhancing business
                                process efficiency and effectiveness through predictive analytics. By analyzing historical raw
                                event data, PPM systems can identify patterns and trends, providing valuable insights into the
                                key factors contributing to process inefficiencies and bottlenecks.
                                   The use of deep learning in predictive modeling has become increasingly popular in PPM
                                systems, reflecting the broader trend of deep learning’s success across various domains. Specifi-
                                cally, several deep neural networks, such as Long Short-Term Memory (LSTM) networks [1],
                                [2], [3], [4], Convolutional Neural Networks (CNNs) [5], [6], Generative Adversarial Networks
                                (GANs) [7], and Autoencoders [8], have recently contributed to improving the accuracy of PPM

                                SEBD 2024: 32nd Symposium on Advanced Database Systems, June 23-26, 2024, Villasimius, Sardinia, Italy
* Corresponding author.
Email: vincenzo.pasquadibisceglieds@uniba.it (V. Pasquadibisceglie); annalisa.appice@uniba.it (A. Appice); giovanna.castellano@uniba.it (G. Castellano); donato.malerba@uniba.it (D. Malerba)
ORCID: 0000-0002-7273-3882 (V. Pasquadibisceglie); 0000-0001-9840-844X (A. Appice); 0000-0002-6489-8628 (G. Castellano); 0000-0001-8432-4608 (D. Malerba)
                                             © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).




systems. This is due to their ability to learn accurate deep neural models, which in turn enable
proactive and corrective actions to enhance process performance and mitigate risks.
    While the primary focus of PPM systems remains on delivering accurate predictions of future
states of running traces, there is a growing preference for predictive models that are easier
to explain in PPM applications. Recent studies [9, 10, 11, 12] have explored the application of
existing eXplainable AI (XAI) methods to elucidate opaque PPM models. However, the issue of
model explainability in the context of deep learning-based PPM systems remains under-explored.
We recently introduced JARVIS [13] (Joining Adversarial tRaining with VISion transformers in next-activity prediction), a method that combines Vision Transformers (ViTs) and adversarial training to achieve a balance between model accuracy and explainability.
Specifically, the model’s explainability is enhanced by the adoption of a ViT, a deep neural
architecture comprising multiple self-attention layers. An attention layer in deep learning is a
component that enables a neural network to concentrate on specific parts of the input data when
making predictions or decisions. It is inspired by the human visual system, which can selectively
focus on different parts of an image to understand it better. Therefore, multiple self-attention layers can provide an explanation of the model’s behavior in terms of the most informative inputs.
Adversarial training [14] is also employed to improve accuracy by incorporating perturbed
(i.e., adversarial) inputs into the training process, thereby mitigating overfitting and enhancing
generalization.
    The paper is organized as follows. Preliminary concepts are reported in Section 2, while
the JARVIS approach is described in Section 3. The experimental setup and the results of the
evaluation of the proposed approach are illustrated in Section 4. Finally, Section 5 recalls the
purpose of our research, draws conclusions, and illustrates possible future developments.


2. Preliminary concepts
A trace is a record of a business process that shows the stages of its execution through a sequence
of events. An event is a complex entity characterized by two essential components: the activity
and the timestamp (indicating when the activity occurred). Additionally, events may possess
optional characteristics, such as the resource responsible for the activity or the cost involved in
completing it. Consequently, each event is accompanied by two mandatory views, representing
the activity and the timestamp, as well as 𝑚 optional views corresponding to other event
characteristics. Let 𝒜 be the set of all activity names, 𝒮 be the set of all trace identifiers, 𝒯 be
the set of all timestamps, and 𝒱𝑗 be the set of all names in the 𝑗-th view, where 1 ≤ 𝑗 ≤ 𝑚 .

Definition 1 (Event). Given the event universe ℰ = 𝒮 × 𝒜 × 𝒯 × 𝒱1 × . . . × 𝒱𝑚 , an event
𝑒 ∈ ℰ is a tuple 𝑒 = (𝜎, 𝑎, 𝑡, 𝑣1 , . . . , 𝑣𝑚 ) that represents the occurrence of activity 𝑎 in trace 𝜎 at
timestamp 𝑡 with characteristics 𝑣1 , 𝑣2 , . . . , 𝑣𝑚 .

  Let us introduce the functions: 𝜋𝒮 : ℰ ↦→ 𝒮 such that 𝜋𝒮 (𝑒) = 𝜎, 𝜋𝒜 : ℰ ↦→ 𝒜 such that
𝜋𝒜 (𝑒) = 𝑎, 𝜋𝒯 : ℰ ↦→ 𝒯 such that 𝜋𝒯 (𝑒) = 𝑡 and 𝜋𝒱𝑗 : ℰ ↦→ 𝒱𝑗 such that 𝜋𝒱𝑗 (𝑒) = 𝑣𝑗 , where
𝑗 = 1, . . . , 𝑚.
Definition 2 (Trace). Let ℰ * denote the set of all possible sequences on ℰ. A trace 𝜎 is a sequence
𝜎 = ⟨𝑒1 , 𝑒2 , . . . , 𝑒𝑛 ⟩ ∈ ℰ * such that: (1) ∀𝑖 = 1, . . . , 𝑛, ∃𝑒𝑖 ∈ ℰ such that 𝜎(𝑖) = 𝑒𝑖 and 𝜋𝒮 (𝑒𝑖 ) = 𝜎,
and (2) ∀𝑖 = 1, . . . , 𝑛 − 1, 𝜋𝒯 (𝑒𝑖 ) ≤ 𝜋𝒯 (𝑒𝑖+1 ).

Definition 3 (Event log). Let ℬ(ℰ * ) denote the set of all multisets over ℰ * . An event log ℒ ⊆
ℬ(ℰ * ) is a multiset of traces.

Definition 4 (Prefix trace). A prefix trace 𝜎 𝑘 = ⟨𝑒1 , 𝑒2 , . . . , 𝑒𝑘 ⟩ is the sub-sequence of a trace
𝜎 starting from the beginning of the trace 𝜎 with 1 ≤ 𝑘 = |𝜎 𝑘 | < |𝜎|.

   A trace is a complete (i.e., started and ended) process instance, while a prefix trace is a process
instance in execution (also called running trace). The activity 𝜋𝒜 (𝑒𝑘+1 ) = 𝑎𝑘+1 corresponds to
the next-activity of 𝜎 𝑘 , i.e., 𝑛𝑒𝑥𝑡(𝜎 𝑘 ) = 𝜋𝒜 (𝑒𝑘+1 ) with 𝑒𝑘+1 = 𝜎(𝑘 + 1).

Definition 5 (Multiset of labeled prefix traces). Let ℒ ⊆ ℬ(ℰ * ) be an event log. 𝒫 ⊆ ℬ(ℰ * × 𝒜) is the multiset of all prefix traces extracted from the traces recorded in ℒ, each labeled with the next activity observed in the corresponding trace, so that 𝒫 = [(𝜎 𝑘 , 𝜋𝒜 (𝑒𝑘+1 )) | 𝜎 ∈ ℒ ∧ 1 ≤ 𝑘 < |𝜎|].

Definition 6 (Single-view representation of a labeled prefix trace multiset). Let 𝒱 be a view (either mandatory, i.e., 𝒱 = 𝒜 or 𝒱 = 𝒯 , or optional, i.e., 𝒱 = 𝒱𝑗 with 𝑗 = 1, . . . , 𝑚), and let Π𝒱 : ℰ * ↦→ 𝒱 * be a function such that Π𝒱 (𝜎 𝑘 ) = Π𝒱 (⟨𝑒1 , 𝑒2 , . . . , 𝑒𝑘 ⟩) = ⟨𝜋𝒱 (𝑒1 ), 𝜋𝒱 (𝑒2 ), . . . , 𝜋𝒱 (𝑒𝑘 )⟩. 𝒫𝒱 denotes the multiset of the labeled prefix traces of 𝒫 as they are represented in the view 𝒱, that is, 𝒫𝒱 = [(Π𝒱 (𝜎 𝑘 ), 𝑎𝑘+1 ) | (𝜎 𝑘 , 𝜋𝒜 (𝑒𝑘+1 )) ∈ 𝒫].

Given the prefix 𝜎 𝑘 of a longer trace 𝜎 for which we do not know the events in the rest
of the sequence ⟨𝑒𝑘+1 , 𝑒𝑘+2 , . . . , 𝑒𝑛 ⟩, we can resort to machine learning techniques to learn
a function 𝐻 : 𝒜* ↦→ 𝒜, from a labeled prefix trace multiset 𝒫, such that 𝐻(𝜎 𝑘 ) predicts the
expected next-activity 𝑎𝑘+1 . More specifically, we frame the next-activity prediction task as a
multi-class classification problem.
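For illustration, the following minimal Python sketch derives the labeled prefix traces of Definition 5, under the simplifying assumption that each trace is given as a chronologically ordered list of activity names; it is a sketch of the formal definition, not the actual JARVIS implementation.

    # Minimal sketch: build the labeled prefix traces of Definition 5.
    # Assumption: each trace is a chronologically ordered list of activity names.
    def extract_labeled_prefixes(traces):
        labeled_prefixes = []
        for trace in traces:
            for k in range(1, len(trace)):       # 1 <= k < |sigma|
                prefix = trace[:k]               # sigma^k = <e_1, ..., e_k>
                next_activity = trace[k]         # pi_A(e_{k+1})
                labeled_prefixes.append((prefix, next_activity))
        return labeled_prefixes

    # Example: one trace with four activities yields three labeled prefix traces.
    print(extract_labeled_prefixes([["submit", "review", "approve", "archive"]]))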
   In this study, we represent the labeled multiset 𝒫 as a collection of color image patches
that are given as input to a ViT architecture [15]. This approach allows us to leverage ViT’s
self-attention mechanism to capture complex relationships between different parts of the input
data. Moreover, we can simultaneously enhance the model’s explainability, as the self-attention
mechanism enables the model to focus on the most informative inputs.


3. JARVIS
The main phases of the JARVIS approach are described in the following. Specifically, Section
3.1 illustrates how the labeled prefix trace multiset 𝒫 is extracted from the event log ℒ and
transformed into a set ℐ of multi-patch color images. Section 3.2 describes how ℐ is fed into a ViT
architecture that is trained with adversarial training to estimate parameters of a next-activity
prediction function 𝐻. Finally, Section 3.3 illustrates the extraction of the attention maps.
3.1. Multi-patch image encoding
This phase takes the event log ℒ as input and creates the multiset of multi-patch color images
ℐ as output.
   According to the multi-view formulation introduced in Section 2, every event recorded in
ℒ is a complex entity whose representation takes into account two mandatory characteristics
(activity 𝒜 and timestamp 𝒯 ) and 𝑚 optional characteristics (𝒱1 , 𝒱2 , . . . , 𝒱𝑚 ), respectively. The
timestamp information associated with an event is transformed into the time in seconds elapsed since the beginning of the trace. In this study, every numerical characteristic is converted into a
categorical one by resorting to the equal-frequency discretization algorithm. The number of
discretization bins of a numeric characteristic is set equal to the average number of distinct
categories in the original categorical views of the event log. This ensures that the granularity
of the discretized variables is consistent with that of the other original categorical variables.
After this step, the event log ℒ contains all multi-view information in the categorical format.
We denote by V the final set of 𝑚 + 2 categorical views that characterize the events recorded in the
pre-processed event log ℒ.
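As an illustration of this pre-processing step, the following Python sketch applies equal-frequency discretization with pandas; the DataFrame layout and column names are illustrative assumptions, not the actual implementation.

    # Sketch of the equal-frequency discretization step (assumed pandas DataFrame layout,
    # one row per event; column names are illustrative only).
    import pandas as pd

    def discretize_numeric_views(log, numeric_cols, categorical_cols):
        # Number of bins = average number of distinct categories in the
        # original categorical views of the event log.
        n_bins = int(round(sum(log[c].nunique() for c in categorical_cols) / len(categorical_cols)))
        for col in numeric_cols:
            # pd.qcut builds bins containing (approximately) the same number of events.
            log[col] = pd.qcut(log[col], q=n_bins, duplicates="drop").astype(str)
        return log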
   Subsequently, the multiset 𝒫 is created by extracting traces from ℒ and labeling them with
the next activity. As prefix traces in ℒ may vary in length, we employ a combination of padding
and windowing techniques to ensure uniformity in the length of the prefix traces in 𝒫. Let
𝐴𝑉 𝐺𝜎 be the average length of all the traces in ℒ; padding is then applied with a window length
equal to 𝐴𝑉 𝐺𝜎 , as in [3]. Prefix traces with length less than 𝐴𝑉 𝐺𝜎 are standardized by adding
dummy events. Prefix traces with length greater than 𝐴𝑉 𝐺𝜎 are standardized by retaining only
the most recent 𝐴𝑉 𝐺𝜎 events. After this step, 𝒫 comprises labeled prefix traces having fixed
size equal to 𝐴𝑉 𝐺𝜎 .
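A minimal sketch of the padding/windowing step is given below; whether dummy events are prepended or appended is an implementation detail that is assumed here, not prescribed by the text.

    # Sketch of the padding/windowing step: every prefix trace is forced to length
    # AVG_sigma. "PAD" denotes a dummy event (padding side is an assumption).
    def standardize_length(prefix, avg_len, pad_token="PAD"):
        if len(prefix) >= avg_len:
            return prefix[-avg_len:]                            # keep the most recent AVG_sigma events
        return [pad_token] * (avg_len - len(prefix)) + prefix   # add dummy events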
   The Continuous-Bag-of-Words (CBOW) architecture of the Word2Vec scheme [16] is then
used to transform the categorical representation of a prefix trace into a bidimensional, numeric
embedding representation. For each view 𝒱 ∈ V, a CBOW neural network, denoted by 𝐶𝐵𝑂𝑊𝒱 , is trained to convert each element of a single-view sequence Π𝒱 (𝜎 𝑘 ) ∈ 𝒫𝒱 into an 𝐴𝑉 𝐺𝜎 -sized numerical vector. In this way, Π𝒱 (𝜎 𝑘 ) is converted into a bidimensional, numeric embedding P𝒱 ∈ R^(𝐴𝑉 𝐺𝜎 ×𝐴𝑉 𝐺𝜎 ), i.e., a matrix with size 𝐴𝑉 𝐺𝜎 × 𝐴𝑉 𝐺𝜎 .
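The per-view CBOW embedding can be sketched with gensim as one possible backend, setting vector_size to 𝐴𝑉 𝐺𝜎 so that a length-𝐴𝑉 𝐺𝜎 sequence becomes an 𝐴𝑉 𝐺𝜎 × 𝐴𝑉 𝐺𝜎 matrix; hyper-parameters such as window and epochs are illustrative assumptions.

    # Sketch of the per-view CBOW (Word2Vec) embedding, using gensim.
    import numpy as np
    from gensim.models import Word2Vec

    def view_embeddings(view_sequences, avg_len):
        # sg=0 selects the Continuous-Bag-of-Words architecture of Word2Vec.
        cbow = Word2Vec(sentences=view_sequences, vector_size=avg_len,
                        sg=0, min_count=1, window=3, epochs=30)
        # Stack the AVG_sigma-dimensional vector of each token: a sequence of length
        # AVG_sigma becomes an AVG_sigma x AVG_sigma matrix.
        return [np.vstack([cbow.wv[token] for token in seq]) for seq in view_sequences]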
   Finally, for each labeled prefix trace (𝜎 𝑘 , 𝑎𝑘+1 ) ∈ 𝒫, the list of its multi-view, bidimensional, numeric embeddings P𝒜 , P𝒯 , P𝒱1 , . . . , P𝒱𝑚 , generated for Π𝒜 (𝜎 𝑘 ), Π𝒯 (𝜎 𝑘 ), Π𝒱1 (𝜎 𝑘 ), . . . , Π𝒱𝑚 (𝜎 𝑘 ), respectively, is converted into the imagery color patches P^rgb_𝒜 , P^rgb_𝒯 , P^rgb_𝒱1 , . . . , P^rgb_𝒱𝑚 by mapping the numeric values of the bidimensional embeddings into RGB pixels. In particular, every imagery color patch P^rgb ∈ R^(𝐴𝑉 𝐺𝜎 × 𝐴𝑉 𝐺𝜎 × 3) records the embedding of a prefix trace with respect to a view as a numerical tensor with size 𝐴𝑉 𝐺𝜎 × 𝐴𝑉 𝐺𝜎 × 3. Let P be a bidimensional, numeric embedding: each numeric value 𝑣 ∈ P is converted into an RGB pixel 𝑣^rgb ∈ P^rgb by resorting to the RGB-like encoding function adopted in [5]. The 𝑚 + 2 color patches of a prefix trace are distributed into a patch grid with size ⌈√(𝑚 + 2)⌉ × ⌈√(𝑚 + 2)⌉, from left to right and from top to bottom. Notice that every cell of the patch grid records a patch with size 𝐴𝑉 𝐺𝜎 × 𝐴𝑉 𝐺𝜎 × 3. In this way, we produce the color image of a prefix trace, that is, a tensor with size (⌈√(𝑚 + 2)⌉ · 𝐴𝑉 𝐺𝜎 ) × (⌈√(𝑚 + 2)⌉ · 𝐴𝑉 𝐺𝜎 ) × 3.
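The patch-grid assembly can be sketched as follows; the RGB-like encoding of [5] is not reproduced here, and a plain min-max scaling replicated over the three channels is used only as a stand-in to show the grid layout.

    # Sketch of the multi-patch image assembly (the stand-in RGB encoding is an assumption).
    import math
    import numpy as np

    def to_rgb_patch(embedding):                       # embedding: AVG_sigma x AVG_sigma floats
        rng = embedding.max() - embedding.min()
        scaled = (embedding - embedding.min()) / (rng + 1e-9)
        return np.repeat((scaled * 255)[..., None], 3, axis=2).astype(np.uint8)

    def build_image(patches, avg_len):                 # patches: list of m+2 AVG x AVG x 3 arrays
        side = math.ceil(math.sqrt(len(patches)))      # grid side: ceil(sqrt(m + 2))
        image = np.zeros((side * avg_len, side * avg_len, 3), dtype=np.uint8)
        for idx, patch in enumerate(patches):          # left to right, top to bottom
            row, col = divmod(idx, side)
            image[row*avg_len:(row+1)*avg_len, col*avg_len:(col+1)*avg_len] = patch
        return image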
   The generated multi-patch images are labeled with the next activity of the corresponding prefix traces and added to the labeled image multiset ℐ. The parameters of the ViT architecture are then estimated through
the adversarial training strategy.

3.2. Adversarial Training
In this study, we use the popular state-of-the-art Fast Gradient Sign Method (FGSM) [17] to
generate adversarial images. It is a white-box, gradient-based algorithm that determines the perturbation to apply to an input image in order to make the decisions of a pre-trained neural model less overfitted on a specific class. The pre-trained model is the ViT architecture described above
with parameters initially estimated on the original labeled images of ℐ. The FGSM algorithm is
based on the gradient formula: 𝑔(I) = ∇I 𝐽(𝜃, I, 𝑦), where ∇I denotes the gradient computed
with respect to the imagery sample I, and 𝐽(𝜃, I, 𝑦) denotes the loss function of the ViT neural
model initially trained on the original training set ℐ. In theory, FGSM determines the perturbation, bounded in magnitude by 𝜖, to add to a training image I to create an adversarial sample that maximizes the loss
function. According to this theory, given an input perturbation value 𝜖, for each labeled image
(I, 𝑦) ∈ ℐ, a new image (I𝑎𝑑𝑣 , 𝑦) ∈ ℐ𝑎𝑑𝑣 can be generated such that I𝑎𝑑𝑣 = I + 𝜖 · 𝑠𝑖𝑔𝑛(𝑔(I)).
   As ℐ𝑎𝑑𝑣 is generated, parameters of the ViT architecture are finally estimated from the
adversarially-augmented training set ℐ ∪ ℐ𝑎𝑑𝑣 .
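The FGSM step can be sketched in PyTorch as follows, assuming `model` is the ViT pre-trained on the original images of ℐ and `criterion` is its loss function 𝐽; this is a sketch of the standard FGSM formula, not the actual training code.

    # Sketch of FGSM adversarial image generation (PyTorch).
    import torch

    def fgsm_adversarial(model, criterion, image, label, epsilon):
        image = image.clone().detach().requires_grad_(True)
        loss = criterion(model(image), label)              # J(theta, I, y)
        loss.backward()                                    # g(I) = grad_I J(theta, I, y)
        adv_image = image + epsilon * image.grad.sign()    # I_adv = I + eps * sign(g(I))
        return adv_image.detach()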

3.3. Extracting maps of attention
Once the ViT parameters have been estimated, the ViT model is used to decide on the next-
activity of any prefix trace. The Attention Rollout method [18] is used to extract the map of
attention of the decision of the ViT model on a single sample. Then, we derive a quantitative
indicator of the importance of events within patches by exploiting the lightness information
of attention maps. The lighter the pixel in the attention map, the higher the effect of the pixel
information enclosed in the image of the prefix trace on the ViT decision. However, the generated attention maps are represented in the RGB color space, which operates on three channels (red, green, and blue) and does not directly provide lightness information. Hence, we transform the RGB
representation of attention maps into the LAB color space, which operates on three different channels: the lightness (L), the green-to-red axis (A), and the blue-to-yellow axis (B). The transformation from the RGB space to the LAB space is performed
as follows [19]: 𝐿 = 0.2126 · 𝑅 + 0.7152 · 𝐺 + 0.0722 · 𝐵, 𝐴 = 1.4749(0.2213 · 𝑅 − 0.3390 ·
𝐺 + 0.1177 · 𝐵) + 128, and 𝐵 = 0.6245(0.1949 · 𝑅 + 0.6057 · 𝐺 − 0.8006 · 𝐵) + 128.
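The lightness-based importance score can be sketched as follows, assuming the attention map produced by Attention Rollout is already available as an RGB image of the same size as the prefix-trace image; only the L channel of the conversion above is used.

    # Sketch of the per-view (per-patch) lightness score of an RGB attention map.
    import numpy as np

    def patch_lightness(attention_rgb, avg_len, n_patches):
        r, g, b = attention_rgb[..., 0], attention_rgb[..., 1], attention_rgb[..., 2]
        lightness = 0.2126 * r + 0.7152 * g + 0.0722 * b    # L channel of the RGB-to-LAB conversion
        side = attention_rgb.shape[0] // avg_len
        scores = []
        for idx in range(n_patches):                         # same left-to-right, top-to-bottom order
            row, col = divmod(idx, side)
            patch = lightness[row*avg_len:(row+1)*avg_len, col*avg_len:(col+1)*avg_len]
            scores.append(float(patch.mean()))
        return scores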


4. Experiments
Section 4.1 describes the event logs used for evaluating the accuracy and explainability of
JARVIS and the experimental set-up. Section 4.2 describes the accuracy results, while Section
4.3 describes the explanation results.
4.1. Event logs and experimental set-up
We analyzed eight real-life event logs available from the 4TU Centre for Research Data.1 For each event
log we performed a temporal split, dividing the log into training and testing traces. To achieve
this, we sorted the traces of each event log by their starting timestamps. The first two-thirds of
the sorted traces were chosen for training the predictive model, while the remaining one-third
was reserved for evaluating the model’s performance on unseen data.
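The temporal split can be sketched as follows, assuming the event log is loaded as a pandas DataFrame with illustrative "case_id" and "timestamp" columns (names are assumptions).

    # Sketch of the temporal split: traces are ordered by their start timestamp and the
    # first two-thirds are used for training.
    import pandas as pd

    def temporal_split(log, train_fraction=2/3):
        start_times = log.groupby("case_id")["timestamp"].min().sort_values()
        ordered_cases = start_times.index.tolist()
        cut = int(len(ordered_cases) * train_fraction)
        train_cases = set(ordered_cases[:cut])
        return log[log["case_id"].isin(train_cases)], log[~log["case_id"].isin(train_cases)]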

4.2. Accuracy performance analysis
We evaluated the performance of JARVIS against the methods outlined in [3], [5], [10], [20]
and [21]. All methods, with the exception of [3], were originally evaluated by their respective authors on specific views of traces. Specifically, [5] and [10] were evaluated with activity, resource, and timestamp information, [20] with activity and timestamp information, and [21] with activity information only. To provide a fair comparison, we ran these related methods by accounting for all the views recorded in the considered event logs. Since the authors of these methods made their code available, we were able to run all the compared algorithms in the same experimental setting, thus ensuring a sound comparison. We analyze the macro FScore and the macro GMean performances achieved.
Both the macro FScore and the macro GMean are well-known multi-class classification metrics
commonly used in imbalanced domains. Table 1 collects the macro FScore and the macro GMean
of both the considered related methods and JARVIS. These results show that JARVIS achieves
the highest FScore and GMean in five out of eight event logs, being the runner-up method in one
out of eight event logs. In addition, JARVIS outperforms the two related methods that use an imagery encoding strategy [5, 20] on all event logs except BPI12W, and it consistently outperforms the related method based on a Transformer [21]. It also outperforms the related method using attention modules [10] in all cases except for the macro FScore on BPI12W and for both the macro FScore and the macro GMean on BPI13I.
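For reference, the two metrics can be computed as sketched below; the macro GMean is computed here as the average over classes of the geometric mean of recall and specificity, which is one common definition and may differ from the exact variant used in the experiments.

    # Sketch of the evaluation metrics: macro FScore (scikit-learn) and a macro GMean.
    import numpy as np
    from sklearn.metrics import f1_score

    def macro_metrics(y_true, y_pred, classes):
        macro_fscore = f1_score(y_true, y_pred, average="macro")
        gmeans = []
        for c in classes:
            tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
            fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
            fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
            tn = len(y_true) - tp - fn - fp
            recall = tp / (tp + fn) if (tp + fn) else 0.0
            specificity = tn / (tn + fp) if (tn + fp) else 0.0
            gmeans.append(np.sqrt(recall * specificity))
        return macro_fscore, float(np.mean(gmeans))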

Table 1
Comparison between JARVIS and related methods defined in [3], [5], [10], [20] and [21] : macro FScore
and macro GMean. The best results are in bold, while the runner-up results are underlined.
                                      FScore                                             GMean
     Eventlog
                  JARVIS     [3]     [5]       [10]   [20]    [21]    JARVIS    [3]     [5]    [10]    [20]    [21]
      BPI12W       0.667    0.737   0.692   0.673     0.673   0.661   0.820    0.847   0.828   0.792   0.819   0.825
     BPI12WC       0.705    0.685   0.661   0.675     0.645   0.668   0.812    0.798   0.778   0.792   0.780   0.787
      BPI12C       0.644    0.654   0.642   0.638     0.643   0.624   0.786    0.792   0.782   0.785   0.781   0.781
      BPI13P       0.414    0.320   0.336   0.408     0.228   0.405   0.595    0.533   0.546   0.594   0.472   0.593
       BPI13I      0.387    0.405   0.295   0.407     0.363   0.380   0.615    0.626   0.534   0.626   0.594   0.603
      Receipt      0.525    0.455   0.409   0.471     0.302   0.383   0.733    0.676   0.646   0.702   0.563   0.620
      BPI17O       0.720    0.714   0.705   0.691     0.718   0.712   0.846    0.833   0.830   0.815   0.835   0.831
      BPI20R       0.491    0.450   0.483   0.455     0.432   0.481   0.699    0.660   0.691   0.664   0.643   0.683

1
    https://data.4tu.nl/portal
(a) [Maps of attention of two prefix traces of BPI13P, shown as images in the luminosity channel of the LAB space.]

(b) Patch lightness per view:

    View                 Left map (“a2”)    Right map (“a4”)
    activity                  91.56              120.00
    resource                  10.50               11.88
    timestamp                 15.81               19.43
    impact                    69.56               45.00
    org country               12.19               39.81
    org group                 23.06              100.25
    org role                  20.94               36.31
    product                    5.94               95.81
    resource country          13.88               28.38

Figure 1: (a) Maps of attention of two prefix traces of BPI13P, which are correctly labeled with “a2” (“Accepted-In Progress”) and “a4” (“Completed-Closed”), respectively. The maps are shown in the luminosity channel of the LAB space. The numbers identify the names of the views in the event log: 1-“activity”, 2-“resource”, 3-“timestamp”, 4-“impact”, 5-“org country”, 6-“org group”, 7-“org role”, 8-“product”, 9-“resource country”. (b) Patch lightness measured for each view of BPI13P in the two maps of attention.




4.3. Explanation analysis
This analysis aimed to explore how intrinsic explanations enclosed in the attention maps
generated through the ViT model may provide useful insights to explain model decisions.
For example, Figure 1a shows the lightness channel of the attention maps extracted from
the ViT model trained by JARVIS on two prefix traces of BPI13P. These prefix traces were
correctly labeled with the next-activity “a2” (“Accepted-In Progress”) and “a4” (“Completed-
Closed”), respectively. Figure 1b reports the local patch lightness measured for each view in
the maps of attention shown in Figure 1a. These results show that the patch associated with
“activity" conveys the most relevant information for recognizing both “Accepted-In Progress”
and “Completed-Closed” as the next-activities of the two sample prefix traces. However, “impact"
and “org group" are the second and third most important views for the decision on the next-
activity “Accepted-In Progress”, while “org group" and “product" are the second and third most
important views for the decision on the next-activity “Completed-Closed”. Significantly, the
“product" view, which ranks among the top three for predicting the next activity “Completed-
Closed”, holds less importance when predicting the next activity “Accepted-In Progress”. This
analysis underscores the notion that distinct views may carry varying degrees of significance
depending on the specific decision being made.
   Finally, we examine the global effect of the different views by considering the patch lightness computed for each view and averaged over all the prefix traces of the training set. Figure 2 shows the heatmap of the average patch lightness computed on the training set of every event log in this study. This map shows which views have the highest global effect on ViT decisions. As expected,
the activity information is globally the most important information for ViT decisions in all
the event logs. However, these explanations also show that the “product” information
is globally in the top-three ranked views in BPI13P, whereas “number of terms” and “action”
information is globally among the top-three ranked views in BPI17O. These findings support the decision to develop a multi-view approach that does not rely solely on the standard views (activity, timestamp, and resource). They demonstrate that the type of information most valuable for predicting the next activity of a running trace may depend on the kind of process of which the trace is an execution.

Figure 2: Heatmap of the global patch lightness computed for all views (axis X) in all event logs (axis Y). “×” denotes that the view reported on axis X is missing in the event log reported on axis Y.


5. Conclusion
This paper introduces a Predictive Process Monitoring (PPM) approach designed to forecast the
subsequent activity in a sequence of events. The method employs an image-based representation of multiple views of the event sequence and a ViT architecture, which utilizes self-
attention modules to assign attention values to pairs of image patches, thereby capturing
relationships between different views of the process data. Moreover, the self-attention modules
allow for the integration of explainability into the model’s structure by providing insights into
specific views and events that influenced the predictions. The proposed approach is assessed
using various event logs, and the results illustrate its accuracy and the advantages of the
attention mechanism’s explanatory capabilities.


6. Acknowledgments
Vincenzo Pasquadibisceglie, Giovanna Castellano and Donato Malerba are partially sup-
ported by the project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP
H97G22000210007), under the NRRP MUR program funded by the NextGenerationEU. Annalisa
Appice is partially supported by project SERICS (PE00000014) under the NRRP MUR National
Recovery and Resilience Plan funded by the European Union - NextGenerationEU.
References
 [1] N. Tax, I. Verenich, M. La Rosa, M. Dumas, Predictive business process monitoring with
     LSTM neural networks, in: International Conference on Advanced Information Systems
     Engineering, CAISE 2017, LNCS, Springer, 2017, pp. 477–492.
 [2] M. Camargo, M. Dumas, O. González-Rojas, Learning accurate lstm models of business
     processes, in: Business Process Management: 17th International Conference, BPM 2019,
     Vienna, Austria, September 1–6, 2019, Proceedings 17, Springer, 2019, pp. 286–302.
 [3] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, A multi-view deep learning
     approach for predictive business process monitoring, IEEE Transactions on Services
     Computing 15 (2022) 2382–2395. doi:10.1109/TSC.2021.3051771.
 [4] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Darwin: An online deep learning
     approach to handle concept drifts in predictive process monitoring, Engineering Appli-
     cations of Artificial Intelligence 123 (2023) 106461. URL: https://www.sciencedirect.com/
     science/article/pii/S0952197623006450. doi:https://doi.org/10.1016/j.engappai.
     2023.106461.
 [5] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Predictive process mining
     meets computer vision, in: Business Process Management Forum, BPM 2020, volume 392
     of LNBIP, Springer, 2020, pp. 176–192.
 [6] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, G. Modugno, ORANGE:
     outcome-oriented predictive process monitoring based on image encoding and cnns, IEEE
     Access 8 (2020) 184073–184086. doi:10.1109/ACCESS.2020.3029323.
 [7] F. Taymouri, M. L. Rosa, S. M. Erfani, Z. D. Bozorgi, I. Verenich, Predictive business process
     monitoring via generative adversarial nets: The case of next event prediction, in: 18th Int.
     Conf. on Business Process Man., BPM 2020, LNCS, Springer, 2020, pp. 237–256.
 [8] N. Mehdiyev, J. Evermann, P. Fettke, A novel business process prediction model using a
     deep learning method, Business & Information Systems Engineering 62 (2018) 143–157.
     doi:10.1007/s12599-018-0551-3.
 [9] R. Galanti, et al., Explainable predictive process monitoring, arXiv preprint arXiv:2008.01807
     (2020).
[10] B. Wickramanayake, Z. He, C. Ouyang, C. Moreira, Y. Xu, R. Sindhgatta, Building inter-
     pretable models for business process prediction using shared and specialised attention
     mechanisms, Knowledge-Based Systems 248 (2022) 1–22. doi:https://doi.org/10.
     1016/j.knosys.2022.108773.
[11] R. Galanti, M. de Leoni, M. Monaro, N. Navarin, A. Marazzi, B. Di Stasi, S. Maldera, An
     explainable decision support system for predictive process analytics, Engineering Appli-
     cations of Artificial Intelligence 120 (2023) 105904. doi:https://doi.org/10.1016/j.
     engappai.2023.105904.
[12] V. Pasquadibisceglie, A. Appice, G. Ieva, D. Malerba, Tsunami - an explainable ppm
     approach for customer churn prediction in evolving retail data environments, Journal of
     Intelligent Information Systems (2023). URL: https://doi.org/10.1007/s10844-023-00838-5.
     doi:10.1007/s10844-023-00838-5.
[13] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Jarvis: Joining adversarial
     training with vision transformers in next-activity prediction, IEEE Transactions on Services
     Computing (2023) 1–14. doi:10.1109/TSC.2023.3331020.
[14] T. Bai, J. Luo, J. Zhao, B. Wen, Q. Wang, Recent advances in adversarial training for
     adversarial robustness, in: 30th International Joint Conference on Artificial Intelligence,
     IJCAI 2021, 2021, pp. 4312–4321.
[15] A. Dosovitskiy, et al., An image is worth 16x16 words: Transformers for image recognition
     at scale, in: 9th Int. Conf. on Learning Representations, ICLR 2021, 2021.
[16] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word representations in
     vector space, in: 1st Int. Conf. on Learning Representations, ICLR 2013, 2013.
[17] I. J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, in:
     3rd International Conference on Learning Representations, ICLR 2015, 2015, pp. 1–11.
[18] S. Abnar, W. H. Zuidema, Quantifying attention flow in transformers, in: 58th Annual
     Meeting of the Association for Computational Linguistics, ACL 2020, Association for
     Computational Linguistics, 2020, pp. 4190–4197.
[19] N. Nader, F. E.-Z. El-Gamal, M. Elmogy, Enhanced kinship verification analysis based on color
     and texture handcrafted techniques, Research Square (2022). doi:https://doi.org/10.
     21203/rs.3.rs-2139523/v1.
[20] V. Pasquadibisceglie, A. Appice, G. Castellano, D. Malerba, Using convolutional neural
     networks for predictive process analytics, in: 1st International Conference on Process
     Mining, ICPM 2019, IEEE, 2019, pp. 129–136. doi:10.1109/ICPM.2019.00028.
[21] Z. A. Bukhsh, A. Saeed, R. M. Dijkman, Processtransformer: Predictive business process
     monitoring with transformer network, CoRR abs/2104.00721 (2021).