Automating the Design of Process Mining Pipelines Through Meta-Learning (Extended Abstract)
Gabriel Marques Tavares
Università degli Studi di Milano, Italy
gabriel.tavares@unimi.it

Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

With more than twenty years of history, Process Mining (PM) techniques have now achieved the maturity level to cover the entire stack of the data science pipeline, from raw data to decisions [1]. To prepare process discovery or conformance checking, event logs can be extracted, lifted, cleaned, segmented, profiled, and encoded. To support decisions, PM models and metrics foster predictions and optimization procedures. Machine Learning (ML) algorithms are often integrated into PM pipelines to support, among others, noise or anomaly detection, clustering, feature selection, classification, and regression tasks. A consequence of this growth in the variety of available tools is that designing a PM pipeline is becoming complex. Identifying the best pipeline of techniques to achieve the best results given a specific task and a specific event log is challenging even for experts. The spectrum of algorithms and concepts keeps growing, while the number of parameterizations among interacting algorithms is combinatorial. To deal with this trend, this thesis studies PM pipelines to verify which steps can be automated.

1) Research Question: The general research question of this work is: can the design of PM pipelines be automated? Answering this question implies studying the relations between the algorithms composing a pipeline to verify whether specific combinations are more effective than others. For example, event log characteristics may guide the choice of the appropriate discovery technique, since high-level log descriptors can support the identification of the best discovery algorithm [2]. In particular, it will be of interest to verify which conditions (e.g. log complexity or noise level) and requirements (e.g. real-time response or quality measures) impact the optimal pipelining of PM techniques.

Clearly, our work will not be able to exhaustively address all the possible relations between all possible PM techniques or design requirements. However, we aim at introducing a novel methodology for studying these relationships, investigating some scenarios in a comprehensive way and using a reproducible method. A software framework for executing experimental analysis will also be delivered.

A practical side of our work is providing recommendations on the effective design of PM pipelines that can be exploited in training programs for specialists or in recommender systems.

2) Research Design: We investigate a solution based on Automated Machine Learning (AutoML), which allows non-experts to achieve satisfactory results and experts to optimize their tasks. For that, we propose a Meta-learning (MtL) strategy to approach the automation problem. MtL is the process of learning from the application of various learning algorithms on different data, thus solving the algorithm selection problem by recommending the algorithm that produces the best performance for a particular data set. Given the results of multiple configurations observed during the training process, an MtL procedure recommends configurations for new tasks. We introduce a general framework where the observed configurations and tasks are abstract objects that can be instantiated according to the specific scenario one wants to study. Instantiating the framework means deciding the descriptors, hyperparameter tuning, algorithms, and quality assessment metrics to be used in the MtL procedure. Figure 1 shows the abstract (non-instantiated) MtL-based framework applied to PM.

As observed in Figure 1, a set of event logs is required as input to the framework. The more heterogeneous the data set, the better, because different process behaviors are explored. Failing to create a representative group may negatively impact the framework's performance. The Meta-Feature Extraction step aims at obtaining event log features, known as meta-features in MtL terminology. The challenge is to correctly capture log characteristics using a representative set of meta-features, capable of describing the process behavior from complementary perspectives. Furthermore, the feature extraction operation should have a low computational cost; otherwise, extracting meta-features would be more costly than testing all possible meta-targets. To this end, based on the information theory, statistical, and PM feature extraction literature, we explore meta-features capturing activity, trace, and log descriptors. The meta-features cover several complementary perspectives, such as central tendency, statistical dispersion, probability distribution shape, log structuredness, and variability, among others. The Meta-Target Definition step is the most volatile in the abstract framework, as its details depend on the task being studied. For instance, consider the problem of selecting a process discovery technique: a set of discovery techniques is applied to the event logs and, given a ranking function, the best discovery algorithm is selected. In this scenario, the ranking function could simply be a metric capturing the quality of the produced model. The ranking function can also be the result of aggregating metrics using an average or even user-defined weights. During the instantiation of the framework, one must consider the available techniques for a given problem and a respective ranking function able to rate the output of these techniques.
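As an illustration of the two steps just described, the following minimal Python sketch computes a few cheap log descriptors and selects a meta-target with a weighted ranking function. It is not the implementation used in the experiments: the list-of-traces log representation, the function names (`extract_meta_features`, `rank_score`, `define_meta_target`), the chosen descriptors, and the quality values in the usage example are all assumptions made for the example.

```python
import math
import statistics
from collections import Counter

# Assumption: an event log is a non-empty list of traces,
# and each trace is a list of activity names.
EventLog = list[list[str]]


def extract_meta_features(log: EventLog) -> dict[str, float]:
    """Compute a few inexpensive activity-, trace-, and log-level descriptors."""
    trace_lengths = [len(trace) for trace in log]
    variants = Counter(tuple(trace) for trace in log)
    activities = Counter(act for trace in log for act in trace)
    total_events = sum(activities.values())
    # Shannon entropy (in bits) of the activity frequency distribution.
    entropy = -sum(
        (n / total_events) * math.log2(n / total_events)
        for n in activities.values()
    )
    return {
        "n_traces": float(len(log)),
        "n_activities": float(len(activities)),
        "trace_len_mean": statistics.mean(trace_lengths),     # central tendency
        "trace_len_stdev": statistics.pstdev(trace_lengths),  # dispersion
        "variant_ratio": len(variants) / len(log),            # variability
        "activity_entropy": entropy,                          # information theory
    }


def rank_score(metrics: dict[str, float], weights: dict[str, float]) -> float:
    """Ranking function: weighted average of quality metrics (higher is better)."""
    total_weight = sum(weights.values())
    return sum(weights[name] * metrics[name] for name in weights) / total_weight


def define_meta_target(
    quality: dict[str, dict[str, float]], weights: dict[str, float]
) -> str:
    """Return the technique whose output obtains the highest ranking score."""
    return max(quality, key=lambda tech: rank_score(quality[tech], weights))


if __name__ == "__main__":
    log = [["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"], ["a", "d"]]
    # Hypothetical quality metrics, as if measured after applying each technique.
    quality = {
        "technique_1": {"fitness": 0.9, "precision": 0.6},
        "technique_2": {"fitness": 0.8, "precision": 0.8},
    }
    print(extract_meta_features(log))
    print(define_meta_target(quality, {"fitness": 1.0, "precision": 1.0}))
```

With equal weights the sketch selects `technique_2`; weighting fitness three times as much as precision selects `technique_1` instead, which shows how user-defined weights can change the resulting meta-target.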
[Figure: Event Logs; Meta-Feature Extraction (log, information theory, and statistical meta-features); Meta-Target Definition (Technique 1, Technique 2, ..., Technique N; ranking function; best-ranked technique); Meta-Database (meta-instances); Meta-Learner (machine learning); Meta-Model; Recommendation.]
Fig. 1. Overview of the proposed framework.

This way, the meta-target definition step selects the appropriate technique for each event log. The results of the two preceding steps (meta-features and meta-targets) are joined in the Meta-Database phase, resulting in a data set similar to those of traditional ML applications, that is, a set of features describing instances and their associated labels. Hence, a Meta-Learner (e.g. a traditional supervised algorithm) can be fed using the meta-database. The meta-learner maps the relationships between meta-features and meta-targets and produces a Meta-Model. Given a new event log, the meta-feature extraction takes place to capture the process behavior, and the meta-features are submitted to the meta-model, which can, in turn, recommend the most suitable technique for that event log.

3) Initial Results: The experiments to evaluate the framework are direct instantiations of the abstract model applied to different PM tasks. The first problem we studied is anomaly detection [3], which is traditionally performed by conformance checking techniques. In this scenario, we aim at enhancing anomaly detection performance by using encodings. Thus, the meta-targets are encoding techniques, and the ranking function measuring detection accuracy is the F-score. Encoding techniques are associated with meta-features extracted from the event logs, forming the meta-database. In this scenario, the framework achieves an average F-score of 0.73.

A second envisioned application is the task of process discovery [2]. Here the framework automates the selection of the optimal process discovery algorithm considering the meta-features extracted from the event logs. For that, a set of discovery techniques must be chosen, and the meta-target definition step ranks the techniques based on model quality metrics, such as fitness and precision. In this scenario, the framework achieves an average F-score of 0.91.

Recent experiments contemplated the clustering task, which can support variant analysis and serve as a preprocessing step for other PM tasks. The framework reached an F-score of 0.69 in recommending both encoding and clustering algorithms for a given event log [4]. Note that, according to our research goal, the accuracy of the framework is just an intermediate goal. The final aim is to verify which design options in PM can be automated because they are driven by regular and measurable factors, and which are intrinsically contextual or subjective.

4) Planned Activities and Challenges: If the ambition is to obtain solid conclusions about the relationship between the components of a PM pipeline, and thus to offer reliable recommendations, significant effort must be spent to verify the solidity of the experimental design. Many aspects must be validated: (i) the representativeness of the event log sample, (ii) the sensitivity of the meta-features, (iii) the relevance of the selected algorithms, and (iv) the hyperparameter tuning. We plan to consolidate the experiments already executed by extending the set of data and algorithms used and providing a solid validation of the generalization power we can obtain in each scenario. The availability or unavailability of executable software implementing the latest results in the literature will impose some limitations on us, without invalidating the generality of the approach.

An inherent challenge is that PM applications might be attached to certain purposes, ranging from abstract goals, such as having an overview of the process, to clearly defined questions, such as identifying the root cause of a delay for a sublog. The more specific questions may be assessed by quality metrics when well defined, while more general goals are more subjective. We aim to instantiate the framework for a range of PM tasks in order to test its adaptability to narrower scenarios.

5) Relation to the State of the Art: The flexibility of our framework allows its instantiation for a wide range of PM tasks. For each task we study, there is a respective body of literature to compare with. Nevertheless, our framework can be compared (with restrictions) with recommender systems. For example, in [5], the authors use a portfolio-based algorithm selection strategy to recommend process discovery algorithms. However, the literature often lacks generalization, as recommenders are designed for specific tasks, limiting their applicability to certain problems.

REFERENCES

[1] W. M. P. van der Aalst, “Responsible data science: Using event data in a “people friendly” manner,” in Enterprise Information Systems. Cham: Springer International Publishing, 2017, pp. 3–28.
[2] S. Barbon Jr., P. Ceravolo, E. Damiani, and G. M. Tavares, “Using meta-learning to recommend process discovery methods,” 2021. [Online]. Available: https://arxiv.org/abs/2103.12874
[3] G. M. Tavares and S. Barbon Jr., “Process mining encoding via meta-learning for an enhanced anomaly detection,” in New Trends in Database and Information Systems. Cham: Springer International Publishing, 2021, pp. 157–168.
[4] S. Barbon Jr., P. Ceravolo, E. Damiani, and G. M. Tavares, “Selecting optimal trace clustering pipelines with automl,” 2021. [Online]. Available: https://arxiv.org/abs/2109.00635
[5] J. Ribeiro, J. Carmona, M. Mısır, and M. Sebag, “A recommender system for process discovery,” in Business Process Management. Cham: Springer International Publishing, 2014, pp. 67–83.