<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automating the Design of Process Mining Pipelines Through Meta-Learning (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gabriel Marques Tavares</string-name>
          <email>gabriel.tavares@unimi.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Università degli Studi di Milano</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>With more than twenty years of history, Process Mining (PM) techniques have now achieved the maturity level needed to cover the entire stack of the data science pipeline, from raw data to decisions [1]. To prepare for process discovery or conformance checking, event logs can be extracted, lifted, cleaned, segmented, profiled, and encoded. To support decisions, PM models and metrics foster predictions and optimization procedures. Machine Learning (ML) algorithms are often integrated into PM pipelines to support, among others, noise or anomaly detection, clustering, feature selection, classification, and regression tasks. A consequence of this growth in the variety of available tools is that designing a PM pipeline is becoming complex. Identifying the best pipeline of techniques to achieve the best results for a specific task and a specific event log is challenging even for experts. The spectrum of algorithms and concepts grows ever larger, while the number of parameterizations among interacting algorithms is combinatorial. To deal with this trend, this thesis aims at studying PM pipelines to verify which steps can be automated. 1) Research Question: The general research question of this work is: can the design of PM pipelines be automated? Answering this question implies studying the relations between the algorithms composing a pipeline to verify whether specific combinations are more effective than others. For example, event log characteristics may guide the choice of the appropriate discovery technique, since high-level log descriptors can support the identification of the best discovery algorithm [2]. In particular, it will be of interest to verify which conditions, e.g. log complexity or noise level, and requirements, e.g. real-time response or quality measures, impact the optimal pipelining of PM techniques. Clearly, our work will not be able to exhaustively address all the possible relations between all possible PM techniques or design requirements. 
However, we aim at introducing a novel methodology for studying these relationships, investigating some scenarios in a comprehensive way and using a reproducible method. A software framework for executing the experimental analysis will also be delivered. A practical side of our work is providing recommendations on the effective design of PM pipelines that can be exploited in training programs for specialists or in recommender systems. 2) Research Design: We investigate a solution based on Automated Machine Learning (AutoML), which allows non-experts to achieve satisfactory results and experts to optimize their tasks. For that, we propose a Meta-learning (MtL) strategy.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title/>
      <p>We propose a Meta-learning (MtL) strategy to approach the automation problem. MtL is the
process of learning from the application of various learning
algorithms to different data sets, thus solving the algorithm
selection problem by recommending the algorithm that achieves
the best performance on a particular data set. Given the
results of multiple configurations observed during the training
process, an MtL procedure recommends configurations for
new tasks. We introduce a general framework where the
observed configurations and tasks are abstract objects that
can be instantiated according to the specific scenario one
wants to study. Instantiating the framework means deciding
descriptors, hyperparameter tuning, algorithms, and quality
assessment metrics to be used in the MtL procedure. Figure
1 shows the abstract (non-instantiated) MtL-based framework
applied to PM.</p>
      <p>As observed in Figure 1, a set of event logs is required as
input to the framework. The more heterogeneous the data set, the
better, because a wider range of process behaviors is explored. Failing
to create a representative set may negatively impact the
framework’s performance. The Meta-Feature Extraction step
aims at obtaining event log features, known as meta-features
according to MtL terminology. The challenge is to correctly
capture log characteristics using a representative set of
meta-features, capable of describing the process behavior from
complementary perspectives. Furthermore, the feature extraction
operation should have a low computational cost, otherwise,
extracting meta-features would be more costly than testing all
possible meta-targets. To this end, drawing on the information
theory, statistics, and PM feature extraction literature, we explore
meta-features capturing activity, trace, and log descriptors.
The meta-features cover several complementary perspectives,
such as central tendency, statistical dispersion, probability
distribution shape, log structuredness, and variability, among
others. The Meta-Target Definition step is the most volatile
in the abstract framework as its details depend on the task
being studied. For instance, consider the problem of selecting
a process discovery technique: a set of discovery techniques is
applied to the event logs and, given a ranking function, the best
discovery algorithm is selected. In this scenario, the ranking
function could simply be a metric capturing the quality of the
produced model. The ranking function can also aggregate several
metrics using averages or even user-defined weights.
During the instantiation of the framework, one must consider
the available techniques for a given problem and a respective
ranking function able to rate the output of these techniques.</p>
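      <p>For illustration, the meta-feature extraction step can be sketched as follows. The log representation (a list of traces, each a list of activity names) and the specific descriptors are simplifying assumptions for this sketch, not the exact set used in this work; they only exemplify the central-tendency, dispersion, variability, and distribution-shape perspectives mentioned above.</p>
      <preformat>
```python
import math
from collections import Counter

def extract_meta_features(log):
    """Compute a few illustrative meta-features from an event log.

    `log` is a list of traces; each trace is a list of activity names.
    The descriptors below are examples only: trace-length statistics
    (central tendency and dispersion), number of control-flow variants
    (variability), and activity entropy (distribution shape).
    """
    lengths = [len(trace) for trace in log]
    n_events = sum(lengths)
    # Central tendency and statistical dispersion of trace lengths
    mean_len = n_events / len(log)
    var_len = sum((l - mean_len) ** 2 for l in lengths) / len(log)
    # Variability: number of distinct control-flow variants
    n_variants = len({tuple(trace) for trace in log})
    # Distribution shape: Shannon entropy of activity frequencies
    counts = Counter(a for trace in log for a in trace)
    entropy = -sum((c / n_events) * math.log2(c / n_events)
                   for c in counts.values())
    return {
        "n_traces": len(log),
        "n_events": n_events,
        "mean_trace_length": mean_len,
        "var_trace_length": var_len,
        "n_variants": n_variants,
        "activity_entropy": entropy,
    }
```
      </preformat>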
      <p>[Figure 1: the abstract MtL-based framework: Meta-Feature Extraction (information theory, statistical, and PM descriptors) feeds the Meta-Database together with the Meta-Target Definition step (techniques 1…N rated by a ranking function); a Machine Learning meta-learner then produces the Meta-Model used for recommendation.]</p>
      <p>
This way, the meta-target definition step selects the appropriate
technique for each event log. The results of the two preceding
steps (meta-features and meta-targets) are joined in the
Meta-Database phase, resulting in a data set similar to those of traditional
ML applications, that is, a set of features describing instances
and their associated labels. Hence, a Meta-Learner (e.g. a
traditional supervised algorithm) can be fed with the
meta-database. The meta-learner maps the relationships between
meta-features and meta-targets and produces a Meta-Model.
Given a new event log, the meta-feature extraction takes place
to capture the process behavior and the meta-features are
submitted to the meta-model, which can, in turn, recommend
the most suitable technique for that event log.</p>
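      <p>A minimal sketch of the meta-database construction and recommendation loop follows, assuming meta-features have already been extracted. The nearest-neighbour meta-learner is only a stand-in for whatever supervised algorithm is actually instantiated (e.g. a random forest), and all identifiers are hypothetical.</p>
      <preformat>
```python
def build_meta_database(meta_features, meta_targets):
    """Join meta-features and meta-targets into (features, label)
    pairs, mirroring the Meta-Database phase described above."""
    return [(meta_features[log_id], meta_targets[log_id])
            for log_id in meta_features]

def recommend(meta_database, new_log_features):
    """A 1-nearest-neighbour meta-learner: recommend the meta-target
    of the training log whose meta-features are closest (squared
    Euclidean distance) to the new log's meta-features."""
    def distance(x):
        return sum((x[k] - new_log_features[k]) ** 2 for k in x)
    _, target = min(meta_database, key=lambda pair: distance(pair[0]))
    return target
```
      </preformat>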
      <p>
        3) Initial Results: The experiments to evaluate the
framework are direct instantiations of the abstract model applied to
different PM tasks. The first problem we studied is anomaly
detection [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], which is traditionally performed by conformance
checking techniques. In this scenario, we aim at enhancing
anomaly detection performance by using encodings. Thus,
the meta-targets are encoding techniques and the ranking
function is the F-score measuring detection accuracy. Encoding
techniques are associated with meta-features extracted from
the event logs, forming the meta-database. In this scenario,
the framework achieves an average F-score of 0.73.
      </p>
      <p>
        A second envisioned application is the task of process
discovery [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Here the framework automates the selection
of the optimal process discovery algorithm considering the
meta-features extracted from the event logs. For that, a set
of discovery techniques must be chosen, and the meta-target
definition step ranks the techniques based on model quality
metrics, such as fitness and precision. In this scenario, the
framework achieves an average F-score of 0.91.
      </p>
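      <p>The ranking function for the discovery scenario can be illustrated by the usual harmonic aggregation of fitness and precision into a single score. The candidate algorithm names and the scores below are purely hypothetical, for illustration only.</p>
      <preformat>
```python
def f_measure(fitness, precision):
    """Harmonic mean of fitness and precision, a common way to
    aggregate the two model-quality metrics into a single rank."""
    if fitness + precision == 0:
        return 0.0
    return 2 * fitness * precision / (fitness + precision)

def best_discovery_algorithm(scores):
    """Meta-target definition: rank candidate discovery algorithms
    by aggregated model quality and return the best one.
    `scores` maps an algorithm name to a (fitness, precision) pair."""
    return max(scores, key=lambda alg: f_measure(*scores[alg]))
```
      </preformat>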
      <p>
        Recent experiments addressed the clustering task, which
can support variant analysis and serve as a preprocessing step
for other PM tasks. The framework reached 0.69 F-score in
recommending both encoding and clustering algorithms for a
given event log [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Note that, in line with our research question,
the accuracy of the framework is only an intermediate objective.
The final aim is to verify which design options in PM can be
automated because they are driven by regular and measurable factors,
and which are intrinsically contextual or subjective.
      </p>
      <p>4) Planned Activities and Challenges: If the ambition is
to obtain solid conclusions about the relationships between the
components of a PM pipeline, and to offer reliable
recommendations, substantial effort must be spent verifying the soundness
of the experimental design. Many aspects must be validated:
(i) the representativeness of the event log sample, (ii) the
sensitivity of the meta-features, (iii) the relevance of the algorithms
selected, and (iv) the hyperparameter tuning. We plan to consolidate
the experiments already executed by extending the set of data
and algorithms used and providing a solid validation of the
generalization power we can obtain in each scenario. The
availability or unavailability of executable software implementing
the latest results from the literature will impose some limitations on us
without invalidating the generality of the approach.</p>
      <p>An inherent challenge is that PM applications may be
tied to specific purposes, ranging from abstract goals such as
gaining an overview of the process to clearly defined questions
such as identifying the root cause of delays in a sublog. The more
specific questions can be assessed by quality metrics when these are
well defined, while more general goals remain subjective.
We aim to instantiate the framework for a range of PM tasks
in order to test its adaptability to narrower scenarios.</p>
      <p>
        5) Relation to the State of the Art: The flexibility of our
framework allows the instantiation of a wide range of PM
tasks. For each task we study, task-specific literature exists for
comparison. Nevertheless, our framework can be compared
(with restrictions) with recommender systems. For example,
in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], the authors use a portfolio-based algorithm selection
strategy to recommend process discovery algorithms.
However, the literature often lacks generalization as recommenders
are designed for specific tasks, limiting their applicability to
certain problems.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          , “
          <article-title>Responsible data science: Using event data in a “people friendly” manner</article-title>
          ,” in
          <source>Enterprise Information Systems</source>
          . Cham: Springer International Publishing,
          <year>2017</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>28</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbon Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Tavares</surname>
          </string-name>
          , “
          <article-title>Using meta-learning to recommend process discovery methods</article-title>
          ,”
          <year>2021</year>
          . [Online]. Available: https://arxiv.org/abs/2103.12874
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Tavares</surname>
          </string-name>
          and
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbon Jr.</surname>
          </string-name>
          , “
          <article-title>Process mining encoding via meta-learning for an enhanced anomaly detection</article-title>
          ,” in
          <source>New Trends in Database and Information Systems</source>
          . Cham: Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>157</fpage>
          -
          <lpage>168</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Barbon Jr.</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          , and
          <string-name>
            <given-names>G. M.</given-names>
            <surname>Tavares</surname>
          </string-name>
          , “
          <article-title>Selecting optimal trace clustering pipelines with AutoML</article-title>
          ,”
          <year>2021</year>
          . [Online]. Available: https://arxiv.org/abs/2109.00635
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ribeiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mısır</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M.</given-names>
            <surname>Sebag</surname>
          </string-name>
          , “
          <article-title>A recommender system for process discovery</article-title>
          ,” in
          <source>Business Process Management</source>
          . Cham: Springer International Publishing,
          <year>2014</year>
          , pp.
          <fpage>67</fpage>
          -
          <lpage>83</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>