=Paper=
{{Paper
|id=Vol-3098/dc_209
|storemode=property
|title=Automating the Design of Process Mining
Pipelines through Meta-Learning (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3098/dc_209.pdf
|volume=Vol-3098
|authors=Gabriel Marques Tavares
|dblpUrl=https://dblp.org/rec/conf/icpm/Tavares21
}}
==Automating the Design of Process Mining
Pipelines through Meta-Learning (Extended Abstract)==
<pdf width="1500px">https://ceur-ws.org/Vol-3098/dc_209.pdf</pdf>
<pre>
Automating the Design of Process Mining Pipelines
   Through Meta-Learning (Extended Abstract)
                                                    Gabriel Marques Tavares
                                              Università degli Studi di Milano, Italy
                                                    gabriel.tavares@unimi.it


   With more than twenty years of history, Process Mining           strategy to approach the automation problem. MtL is the
(PM) techniques have now achieved the maturity level to             process of learning from the application of various learning
cover the entire stack of the data science pipeline, from           algorithms on different data, thus, solving the algorithm selec-
raw data to decisions [1]. To prepare process discovery or          tion problem by recommending the algorithm that produces
conformance checking, event logs can be extracted, lifted,          the best performance for a particular data set. Given the
cleaned, segmented, profiled, encoded. To support decisions         results of multiple configurations observed during the training
PM models and metrics foster predictions and optimization           process, an MtL procedure recommends configurations for
procedures. Machine Learning (ML) algorithms are often              new tasks. We introduce a general framework where the
integrated into PM pipelines to support, among others, noise or     observed configurations and tasks are abstract objects that
anomaly detection, clustering, feature selection, classification,   can be instantiated according to the specific scenario one
and regression tasks. A consequence of this growth in the           wants to study. Instantiating the framework means deciding
variety of tools available is that designing a PM pipeline is       descriptors, hyperparameter tuning, algorithms, and quality
becoming complex. Identifying the best pipeline of techniques to    assessment metrics to be used in the MtL procedure. Figure
achieve the best results given a specific task and a specific       1 shows the abstract (non-instantiated) MtL-based framework
event log is challenging even for experts. The spectrum of          applied to PM.
algorithms and concepts is larger and larger while the number          As observed in Figure 1, a set of event logs is required as in-
of parameterizations among interacting algorithms is combina-       put to the framework. The more heterogeneous the data set, the
torial. To deal with this trend, this thesis is aimed at studying   better because different process behaviors are explored. Failing
PM pipelines to verify which steps can be automated.                to create a representative group may negatively impact the
   1) Research Question: The general research question of           framework’s performance. The Meta-Feature Extraction step
this work is: can the design of PM pipelines be automated?          aims at obtaining event log features, known as meta-features
Answering this question implies studying the relations be-          according to MtL terminology. The challenge is to correctly
tween the algorithms composing a pipeline to verify if specific     capture log characteristics using a representative set of meta-
combinations are more effective than others. For example,           features, capable of describing the process behavior from com-
event log characteristics may guide the choice of the appro-        plementary perspectives. Furthermore, the feature extraction
priate discovery technique since high-level log descriptors can     operation should have a low computational cost, otherwise,
support the identification of the best discovery algorithm [2].     extracting meta-features would be more costly than testing all
In particular, it will be of interest to verify which conditions,   possible meta-targets. To this extent, based on information the-
e.g. log complexity or noise level, and requirements, e.g.          ory, statistical and PM feature extraction literature, we explore
real-time response or quality measures, impact the optimal          meta-features capturing activity, trace, and log descriptors.
pipelining of PM techniques.                                        The meta-features cover several complementary perspectives,
   Clearly, our work will not be able to exhaustively address       such as central tendency, statistical dispersion, probability
all the possible relations between all possible PM techniques       distribution shape, log structuredness, and variability, among
or design requirements. However, we aim at introducing a            others. The Meta-Target Definition step is the most volatile
novel methodology for studying these relationships, investi-        in the abstract framework as its details depend on the task
gating some scenarios in a comprehensive way and using              being studied. For instance, consider the problem of selecting
a reproducible method. A software framework for executing           a process discovery technique, a set of discovery techniques is
experimental analysis will also be delivered.                       applied to the event logs and, given a ranking function, the best
   A practical side of our work is providing recommendations        discovery algorithm is selected. In this scenario, the ranking
on the effective design of PM pipelines that can be exploited in    function could simply be a metric capturing the produced
training programs for specialists or in recommender systems.        model quality. The ranking function can also be a result of
   2) Research Design: We investigate a solution based on           aggregated metrics using average or even user-based weights.
Automated Machine Learning (AutoML), which allows non-              During the instantiation of the framework, one must consider
experts to achieve satisfactory results and experts to optimize     the available techniques for a given problem and a respective
their tasks. For that, we propose a Meta-learning (MtL)             ranking function able to rate the output of these techniques.

 Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
 International (CC BY 4.0).
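
The meta-target definition step described above can be illustrated with a minimal sketch of a ranking function that aggregates several quality metrics by a plain or user-weighted average. The function name rank_techniques, the technique labels, and all scores are hypothetical placeholders chosen for illustration; they are not taken from the experiments reported in this paper.

```python
from typing import Dict, Optional


def rank_techniques(scores: Dict[str, Dict[str, float]],
                    weights: Optional[Dict[str, float]] = None) -> str:
    """Return the technique with the highest weighted mean of its metrics.

    scores maps technique name -> {metric name -> value in [0, 1]}.
    weights maps metric name -> user-defined weight; None means plain average.
    """
    best_name, best_value = None, float("-inf")
    for name, metrics in scores.items():
        # Fall back to equal weights when the user supplies none.
        w = weights if weights is not None else {m: 1.0 for m in metrics}
        total = sum(w.get(m, 0.0) for m in metrics)
        if total == 0:
            raise ValueError("weights must give positive mass to some metric")
        value = sum(w.get(m, 0.0) * v for m, v in metrics.items()) / total
        if value > best_value:
            best_name, best_value = name, value
    return best_name


# Hypothetical quality scores of two techniques on a single event log.
scores = {
    "technique_1": {"fitness": 0.90, "precision": 0.60},
    "technique_2": {"fitness": 0.80, "precision": 0.80},
}

print(rank_techniques(scores))                                       # plain average
print(rank_techniques(scores, {"fitness": 3.0, "precision": 1.0}))   # fitness-heavy
```

With equal weights the second technique is ranked best, while a user who weights fitness three times as heavily as precision obtains the first one, which shows how the meta-target of an event log depends on the chosen ranking function.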
[Figure 1. Components shown: Event Logs; Meta-Feature Extraction (Log, Information Theory, Statistical); Meta-Target Definition (Technique 1, Technique 2, ..., Technique N; Ranking function; Best Ranked); Meta-Database (Meta-Features, Meta-Instances); Meta-Learner (Machine Learning); Meta-Model; Recommendation.]

                                                      Fig. 1. Overview of the proposed framework.


This way, the meta-target definition step selects the appropriate                        tions, important efforts must be spent to verify the solidity
technique for each event log. The results of the two preceding                           of the experimental design. Many aspects must be validated:
steps (meta-features and meta-targets) are joined in the Meta-                           (i) the representativeness of the event logs sample, (ii) the
Database phase, resulting in a data set similar to traditional                           sensitivity of meta-features, (iii) the relevance of the algorithm
ML applications, that is, a set of features describing instances                         selected, and (iv) the hyperparameter tuning. We plan to consolidate
and their associated labels. Hence, a Meta-Learner (e.g. a                               the experiments already executed by extending the set of data
traditional supervised algorithm) can be fed using the meta-                             and algorithms used and providing a solid validation of the
database. The meta-learner maps the relationships between                                generalization power we can obtain in each scenario. The
meta-features and meta-targets and produces a Meta-Model.                                availability/unavailability of executable software implementing
Given a new event log, the meta-feature extraction takes place                           the latest results in the literature will impose some limitations
to capture the process behavior and the meta-features are                                without invalidating the generality of the approach.
submitted to the meta-model, which can, in turn, recommend                                  An inherent challenge is that PM applications might be at-
the most suitable technique for that event log.                                          tached to certain purposes, ranging from abstract goals such as
   3) Initial Results: The experiments to evaluate the frame-                            having an overview of the process to clearly defined questions
work are direct instantiations of the abstract model applied to                          such as identifying the delay root-cause for a sublog. The more
different PM tasks. The first problem we studied is anomaly                              specific questions may be assessed by quality metrics when
detection [3], which is traditionally performed by conformance                           well defined while more general goals are more subjective.
checking techniques. In this scenario, we aim at enhancing                               We aim at instantiating the framework for a range of PM tasks
anomaly detection performance by using encodings. Thus,                                  in order to test its adaptability to narrowed scenarios.
the meta-targets are encoding techniques and the ranking                                    5) Relation to the State of the Art: The flexibility of our
function measuring detection accuracy is F-score. Encoding                               framework allows the instantiation of a wide range of PM
techniques are associated with meta-features extracted from                              tasks. For each task we study, respective literature exists to
the event logs, forming the meta-database. In this scenario,                             compare with. Nevertheless, our framework can be compared
the framework averages an F-score of 0.73.                                               (with restrictions) with recommender systems. For example,
   A second envisioned application is the task of process                                in [5], the authors use a portfolio-based algorithm selection
discovery [2]. Here the framework automates the selection                                strategy to recommend process discovery algorithms. How-
of the optimal process discovery algorithm considering the                               ever, the literature often lacks generalization as recommenders
meta-features extracted from the event logs. For that, a set                             are designed for specific tasks, limiting their applicability to
of discovery techniques must be chosen, and the meta-target                              certain problems.
definition step ranks the techniques based on model quality
                                                                                                                          REFERENCES
metrics, such as fitness and precision. In this scenario, the
framework averages an F-score of 0.91.                                                   [1] W. M. P. van der Aalst, “Responsible data science: Using event data in
                                                                                             a “people friendly” manner,” in Enterprise Information Systems. Cham:
   Recent experiments contemplated the clustering task, which                                Springer International Publishing, 2017, pp. 3–28.
can support variant analysis and serve as a preprocessing step                           [2] S. Barbon Jr., P. Ceravolo, E. Damiani, and G. M. Tavares, “Using
for other PM tasks. The framework reached 0.69 F-score in                                    meta-learning to recommend process discovery methods,” 2021. [Online].
                                                                                             Available: https://arxiv.org/abs/2103.12874
recommending both encoding and clustering algorithms for a                               [3] G. M. Tavares and S. Barbon Jr, “Process mining encoding via meta-
given event log [4]. Note that according to our research goal                                learning for an enhanced anomaly detection,” in New Trends in Database
the accuracy of the framework is just an intermediate goal.                                  and Information Systems. Cham: Springer International Publishing, 2021,
                                                                                             pp. 157–168.
The final aim is to verify which design options in PM can be                             [4] S. Barbon Jr., P. Ceravolo, E. Damiani, and G. M. Tavares, “Selecting
automated because driven by regular and measurable factors,                                  optimal trace clustering pipelines with automl,” 2021. [Online]. Available:
and which are intrinsically contextual or subjective.                                        https://arxiv.org/abs/2109.00635
                                                                                         [5] J. Ribeiro, J. Carmona, M. Mısır, and M. Sebag, “A recommender
   4) Planned Activities and Challenges: If the ambition is                                  system for process discovery,” in Business Process Management. Cham:
obtaining solid conclusions about the relationship between the                               Springer International Publishing, 2014, pp. 67–83.
components of a PM pipeline, to offer reliable recommenda-

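
The framework steps described in the Research Design section (meta-feature extraction, meta-database, meta-learner, recommendation) can be sketched end to end as follows. This is a minimal illustration under stated assumptions: event logs are reduced to lists of activity sequences, only four simple meta-features are computed, and a one-nearest-neighbour rule stands in for the meta-learner, which the text only describes as a traditional supervised algorithm. All function names, logs, and technique labels are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence, Tuple

Trace = Sequence[str]          # one trace = ordered activity labels
MetaRow = Tuple[Tuple[float, ...], str]


def extract_meta_features(log: List[Trace]) -> Tuple[float, ...]:
    """Describe a log by cheap trace-, activity- and log-level statistics."""
    lengths = [len(trace) for trace in log]
    n_activities = len({activity for trace in log for activity in trace})
    variant_ratio = len({tuple(trace) for trace in log}) / len(log)
    return (
        float(mean(lengths)),   # central tendency of trace length
        pstdev(lengths),        # dispersion of trace length
        float(n_activities),    # number of distinct activities
        variant_ratio,          # distinct variants per trace (variability)
    )


def build_meta_database(logs: List[List[Trace]],
                        meta_targets: List[str]) -> List[MetaRow]:
    """Join meta-features with the best-ranked technique of each log."""
    return [(extract_meta_features(log), target)
            for log, target in zip(logs, meta_targets)]


def recommend(meta_db: List[MetaRow], new_log: List[Trace]) -> str:
    """Stand-in meta-model: return the meta-target of the closest known log.

    No feature scaling is applied, which a real instantiation would need.
    """
    query = extract_meta_features(new_log)
    _, target = min(meta_db, key=lambda row: math.dist(row[0], query))
    return target


# Two hypothetical training logs and their (assumed) best-ranked techniques.
structured_log = [["a", "b", "c"]] * 4
noisy_log = [["a", "c", "b", "d"], ["a", "b"], ["d", "a", "c", "b", "e"], ["a"]]
meta_db = build_meta_database([structured_log, noisy_log],
                              ["technique_1", "technique_2"])

new_log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"]]
print(recommend(meta_db, new_log))
```

The new log resembles the structured training log far more than the noisy one, so the sketch recommends the technique associated with the former. In a real instantiation the meta-targets would come from the meta-target definition step rather than being assigned by hand.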

</pre>