<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>iGEDI: interactive Generating Event Data with Intentional Features</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Maldonado</string-name>
          <email>maldonado@dbs.ifi.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sai Anirudh Aryasomayajula</string-name>
          <email>anirudhsai027@gmail.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian M. M. Frey</string-name>
          <email>christian.frey@utn.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thomas Seidl</string-name>
          <email>seidl@dbs.ifi.lmu.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Event Data Generation, Optimization, Event Log Features, Benchmarking</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>GNU/Linux, MacOS</institution>
          ,
          <addr-line>Microsoft Windows</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Munich Center for Machine Learning Munich</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Technology Nuremberg</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>18</lpage>
      <abstract>
        <p>Process mining solutions aim to improve performance, save resources, and address bottlenecks in organizations. However, success depends on data quality and availability, and existing analyses often lack diverse data for rigorous testing. To overcome this, we propose an interactive web application tool, extending the GEDI Python framework, which creates event datasets that meet specific (meta-)features. It provides diverse benchmark event data by exploring new regions within the feature space, enhancing the range and quality of process mining analyses. This tool improves evaluation quality and helps uncover correlations between meta-features and metrics, ultimately enhancing solution effectiveness.</p>
      </abstract>
      <kwd-group>
        <kwd>Event Data Generation</kwd>
        <kwd>Optimization</kwd>
        <kwd>Event Log Features</kwd>
        <kwd>Benchmarking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>Tool metadata: version 1.0; language: Python; download/demo URL: https://github.com/lmu-dbs/gedi/archive/refs/heads/demo_icpm24.zip; documentation URL: https://github.com/lmu-dbs/gedi/blob/demo-icpm24/README.md; source code repository: https://github.com/lmu-dbs/gedi/tree/demo-icpm24; screencast video: https://youtu.be/9iQhaYwyQ9E.</p>
      <p>
        Benchmark event data (ED) with comprehensive, intentionally chosen feature characteristics, together with knowledge of how those features relate to evaluation metrics, lets process miners evaluate methods more efficiently and reliably. However, diverse data is often unavailable, which limits how thoroughly novel methods can be evaluated [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Existing tools, such as PURPLE [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and Declare4Py [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ][
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], assist in generating event logs based on specific properties, but they are often constrained to basic features like trace length and the number of variants. To address this gap, we introduce an interactive online tool that integrates GEDI (Generating Event Data with Intentional Features) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], a framework that offers a broad range of properties, from statistical measures to entropy-based characteristics.
      </p>
      <p>
        Our tool, interactive Generating Event Data with Intentional Features (iGEDI), lets users create event data tailored to their specific needs and objectives. By supporting customization and integration, it makes testing process mining methods more efficient and enables researchers to explore connections between event data characteristics and evaluation metrics. In academic research, it is crucial to train and evaluate methods on diverse datasets to improve robustness and generalization. Maldonado et al. [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] demonstrate that evaluation metrics in process discovery are interrelated when models are trained on real-world benchmarks versus enriched data settings. Our tool also allows testing on synthetic datasets that mimic characteristics of inaccessible test data, such as those restricted by GDPR.
      </p>
    </sec>
    <sec id="sec-2b">
      <title>2. iGEDI’s Main Features</title>
      <p>
        iGEDI, shown in fig. 1, is an interactive web application available both as an online service (https://huggingface.co/spaces/andreamalhera/igedi) and as a locally executable program (https://pypi.org/project/gedi/). It allows users to create configuration files and subsequently generate event data based on the framework “Generating Event Data with Intentional Features” (GEDI) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. GEDI employs (meta-)features, which numerically describe event log properties, to generate ED that have specific desired values. Supported event data features are presented in the Feature Extraction From Event Data (FEEED) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] framework and include statistics as well as more complex relationships between ED elements. Specifically, the feature types currently supported by iGEDI comprise simple summary statistics, entropies [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and epa-based measures [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], covering the cardinality of traces/variants, trace length, variants, and (start/end) activities. For detailed feature descriptions and default settings for realistic bounds, we refer to our repository (https://github.com/lmu-dbs/gedi/tree/demo-icpm24).
      </p>
      <p>
        Defined feature values are handled as targets in a hyperparameter optimization (HPO) problem. As proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], GEDI embeds the Process Tree and Log Generator (PTLG) proposed by Jouck and Depaire [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and iteratively optimizes the parameters of PTLG such that the features of newly generated ED align with the intended feature values, i.e., the targets. The parameters of the embedded generator module are optimized by Bayesian Optimization (BO). Intuitively, BO iteratively selects and evaluates promising parameters, aiming to minimize an objective function. Formally, GEDI’s objective function minimizes the distance in feature space between an array of desired feature values and an array of the generated ED’s feature values. Hence, by leveraging GEDI, users can reproduce single event logs based on their desired feature values, or examine a grid of event logs by considering a hyperrectangle (grid) of specific feature value combinations.
      </p>
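      <p>The optimization loop described above can be illustrated with a minimal Python sketch. It assumes a Euclidean distance as the objective, and it uses a hypothetical toy generator and plain random search as simple stand-ins for PTLG and Bayesian optimization. None of the function names below are part of GEDI’s API.</p>
      <preformat>
```python
import math
import random


def feature_distance(targets, observed):
    """Euclidean distance between target and observed feature values."""
    return math.sqrt(sum((observed[name] - value) ** 2
                         for name, value in targets.items()))


def toy_generator(params):
    """Hypothetical stand-in for generating ED and extracting its features.

    Maps two generator parameters in [0, 1] directly to two feature values.
    """
    return {"rmcv": params["p1"], "ense": 1.0 - params["p2"]}


def random_search(targets, n_trials=200, seed=0):
    """Plain random search, used here only as a simple stand-in for BO."""
    rng = random.Random(seed)
    best_params, best_dist = None, math.inf
    for _ in range(n_trials):
        params = {"p1": rng.random(), "p2": rng.random()}
        dist = feature_distance(targets, toy_generator(params))
        if best_dist > dist:
            best_params, best_dist = params, dist
    return best_params, best_dist


if __name__ == "__main__":
    params, dist = random_search({"rmcv": 0.376, "ense": 0.112})
    print(params, round(dist, 3))
```
      </preformat>
      <p>In this sketch, the remaining distance of the best trial plays the role of the target distance: the smaller it is, the closer the generated features are to the intended values.</p>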
      <p>Alongside implementing our architecture, iGEDI assists users throughout the specification
process, automatically generates configuration files defining the feature space, and enables
them to deploy GEDI either locally or as an interactive web application. Using the online web
app, the user can directly download the generated event logs.</p>
      <p>
        Next, we describe iGEDI’s two input options for creating one or multiple event logs at once. iGEDI supports manual input as well as input from a file. The supported file formats are event logs with a ‘.xes’ extension and ‘.csv’ files. For an event log, users can select features of interest, and the generated event log is optimized to closely match the feature values of the input event log. For the ‘.csv’ option, the file should contain at least one feature column, named according to FEEED’s [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] features, and a ‘log’ column containing the name of the target event log. Each row therefore represents one desired feature combination. Table 1 shows a possible example of such a ‘.csv’ file. It lists the feature values for ratio most common variant (rmcv) and epa-based normalized sequence entropy (ense), as in [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], of three publicly available datasets (https://www.tf-pm.org/competitions-awards/bpi-challenge), namely BPIC15f4 [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], RTFMP [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and HD [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. While rmcv compares the frequency of the most common variant to the overall number of traces in an event log, the intuition behind ense [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is to measure the variability/predictability of sequences captured by the event log, considering their prefixes. A low ense indicates a process where most cases follow similar paths, and a high value indicates a complex or highly variable process with many different paths.
      </p>
      <table-wrap>
        <label>Table 1</label>
        <caption>
          <p>Feature values for three real ED.</p>
        </caption>
        <table>
          <thead>
            <tr><th>log</th><th>rmcv</th><th>ense</th></tr>
          </thead>
          <tbody>
            <tr><td>BPIC15f4</td><td>0.003</td><td>0.604</td></tr>
            <tr><td>RTFMP</td><td>0.376</td><td>0.112</td></tr>
            <tr><td>HD</td><td>0.517</td><td>0.254</td></tr>
          </tbody>
        </table>
      </table-wrap>
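      <p>As a rough illustration of the rmcv definition given above, the following Python sketch computes the share of traces that follow the most frequent variant. The function name and the list-of-activity-lists representation of an event log are assumptions made for this example; FEEED’s actual implementation may differ.</p>
      <preformat>
```python
from collections import Counter


def ratio_most_common_variant(traces):
    """Share of traces that follow the single most frequent variant."""
    if not traces:
        raise ValueError("the event log contains no traces")
    # A variant is the ordered sequence of activities of a trace.
    variant_counts = Counter(tuple(trace) for trace in traces)
    most_common_count = variant_counts.most_common(1)[0][1]
    return most_common_count / len(traces)


if __name__ == "__main__":
    log = [["a", "b", "c"], ["a", "b", "c"], ["a", "c", "b"], ["a", "b", "c"]]
    print(ratio_most_common_variant(log))  # 3 of 4 traces share one variant
```
      </preformat>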
      <p>Moreover, independently of the input option, the user can choose to generate point targets
or a multidimensional grid of targets lying within a finite hyperrectangle:</p>
      <p>Point targets mode (as seen in fig. 1) aims to reproduce ED by directly targeting specified feature values. In manual mode, the user defines a specific target value for each selected feature for one generation experiment. Manual input requires semantic knowledge about the selected features in order to choose values within sensible feature ranges. In contrast, inputting a table (“From CSV” option) and choosing the point target option generates one event log per row, targeting the respective values of the listed features. To reproduce the ED listed in table 1 in terms of the two selected features, iGEDI produces three sets of targets for ED generation: [{rmcv: 0.003, ense: 0.604}, {rmcv: 0.376, ense: 0.112}, {rmcv: 0.517, ense: 0.254}], reproducing BPIC15f4, RTFMP and HD, respectively. Using this option, we generated ED and measured their Euclidean similarity to the respective targets, as shown in fig. 2.</p>
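      <p>The mapping from a ‘.csv’ file to point targets can be sketched as follows. The snippet writes the values of table 1 as CSV text and turns each row into one target dictionary. The function name and the in-memory CSV string are assumptions for illustration and do not reflect how iGEDI parses its input internally.</p>
      <preformat>
```python
import csv
import io

# Table 1 written as a '.csv' file: a 'log' column plus one column per feature.
TABLE_1_CSV = """log,rmcv,ense
BPIC15f4,0.003,0.604
RTFMP,0.376,0.112
HD,0.517,0.254
"""


def point_targets(csv_text):
    """Return one {feature: target value} mapping per row, keyed by log name."""
    targets = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop("log")
        targets[name] = {feature: float(value) for feature, value in row.items()}
    return targets


if __name__ == "__main__":
    for name, target in point_targets(TABLE_1_CSV).items():
        print(name, target)
```
      </preformat>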
    </sec>
    <sec id="sec-3">
      <title>3. Tool maturity</title>
      <p>
        The quality of logs generated by GEDI in terms of feasibility, representativeness, and usefulness for benchmarking process mining tasks has been evaluated extensively in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], an in-depth analysis of inter-feature relations in a generated grid setting is also discussed. Figure 3 depicts the target distance between generated event logs and their respective targets on a color scale. It contains 121 feature combinations created by the range option, where both features vary between 0.0 and 1.0 with a step size of 0.1, as in the example presented above. The lighter (darker) the color, the closer (further away) the measured feature values of the generated ED are to their respective targets. The combination of rmcv and ense shows, as an example, a bright bottom left side and a dark top corner. By definition, a high value of rmcv indicates that the most common variant is highly frequent in the event log, so many cases follow a similar path, which corresponds to a low ense value. In contrast, a high ense value indicates high variability in the event log’s paths, which constrains the most common path to a low frequency, resulting in low rmcv values. For this reason, combinations of simultaneously high values for both rmcv and ense are infeasible, as depicted in fig. 3. Therefore, the target distance of generated features indicates the level of feasibility of that particular feature value combination. Subsequently, further analysis of the relations between feature values and metrics for a specific task, e.g., process discovery, can be performed by benchmarking on highly feasible logs from the generated ED collection.
      </p>
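      <p>The grid of 121 feature combinations mentioned above follows from 11 evenly spaced values per feature. A minimal Python sketch of such a grid enumeration is given below. The function name and its parameters are assumptions for illustration and are not taken from GEDI’s configuration format.</p>
      <preformat>
```python
from itertools import product


def feature_grid(features, start=0.0, stop=1.0, step=0.1):
    """Enumerate every combination of evenly spaced target values."""
    n_steps = round((stop - start) / step)
    # Rounding avoids floating-point drift such as 0.30000000000000004.
    values = [round(start + i * step, 10) for i in range(n_steps + 1)]
    return [dict(zip(features, combo))
            for combo in product(values, repeat=len(features))]


if __name__ == "__main__":
    grid = feature_grid(["rmcv", "ense"])
    print(len(grid))  # 11 values per feature, 11 * 11 = 121 targets
    print(grid[0], grid[-1])
```
      </preformat>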
      <p>Overall, our tool iGEDI enhances existing log generation tools by offering improved functionality and expanded features. It facilitates understanding the relationship between feature sets and evaluation metrics, aiding in the creation of tailored methods for specific tasks. It also supports model pretraining on diverse datasets, enhancing generalization. For testing, iGEDI can replicate feature-based behavior of real-world data, enabling reproducible benchmarking and exploration of feature-metric relations. However, the framework’s effectiveness is sensitive to feature selection, with increased complexity potentially leading to infeasible solutions during hyperparameter optimization.</p>
      <p>Figure 3: Target similarity between grid-generated ED and their targets.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Screencast and Website</title>
      <p>iGEDI is available as an online service at https://huggingface.co/spaces/andreamalhera/gedi. The source code, examples, artifacts generated during the experiments, and a user guide are available at https://github.com/lmu-dbs/gedi/tree/demo-icpm24. For a short hands-on experience, we refer to our screencast video available at https://youtu.be/9iQhaYwyQ9E.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Depaire</surname>
          </string-name>
          , M. de Leoni,
          <string-name>
            <surname>W. M. P. van der Aalst</surname>
          </string-name>
          ,
          <article-title>An integrated framework for process discovery algorithm evaluation</article-title>
          ,
          <year>2018</year>
          . arXiv:1806.07222.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Jouck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Depaire</surname>
          </string-name>
          ,
          <article-title>Generating artificial data for empirical analysis of control-flow discovery algorithms</article-title>
          ,
          <source>Business &amp; Information Systems Engineering</source>
          <volume>61</volume>
          (
          <year>2019</year>
          )
          <fpage>695</fpage>
          -
          <lpage>712</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Re</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tiezzi</surname>
          </string-name>
          ,
          <article-title>Purple: a purpose-guided log generator</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donadello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Riva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Maggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shikhizada</surname>
          </string-name>
          ,
          <article-title>Declare4py: A python library for declarative process mining</article-title>
          , CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>121</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Donadello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Maggi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Riva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <article-title>Asp-based log generation with purposes in declare4py</article-title>
          , in: J. M. E. M. van der Werf, C. Cabanillas, F. Leotta, L. Genga (Eds.),
          <source>Doctoral Consortium and Demo Track 2023 at the International Conference on Process Mining 2023 co-located with the 5th International Conference on Process Mining (ICPM 2023), Rome, Italy, October 27, 2023</source>
          , volume
          <volume>3648</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Frey</surname>
          </string-name>
          , G. Tavares,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rehwald</surname>
          </string-name>
          , T. Seidl,
          <article-title>GEDI: generating event data with intentional features for benchmarking process mining</article-title>
          , to be published in BPM 2024, Krakow, Poland, Sep 01-06 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maldonado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marques</surname>
          </string-name>
          <string-name>
            <surname>Tavares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Oyamada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          , T. Seidl,
          <article-title>FEEED: feature extraction from event data</article-title>
          , in: J. M. E. M. van der Werf, C. Cabanillas, F. Leotta, L. Genga (Eds.),
          <source>Doctoral Consortium and Demo Track 2023 at the International Conference on Process Mining 2023 co-located with the 5th International Conference on Process Mining (ICPM 2023), Rome, Italy, October 27, 2023</source>
          , volume
          <volume>3648</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C. O.</given-names>
            <surname>Back</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Debois</surname>
          </string-name>
          , T. Slaats,
          <article-title>Entropy as a measure of log variability</article-title>
          ,
          <source>Journal on Data Semantics</source>
          <volume>8</volume>
          (
          <year>2019</year>
          )
          <fpage>129</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Augusto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Vidgof</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Wurm</surname>
          </string-name>
          ,
          <article-title>The connection between process complexity of event sequences and models discovered by process mining</article-title>
          ,
          <source>Information Sciences</source>
          <volume>598</volume>
          (
          <year>2022</year>
          )
          <fpage>196</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>B. F. van Dongen</surname>
          </string-name>
          ,
          <article-title>Bpi challenge 2015 dataset</article-title>
          , https://data.4tu.nl/articles/dataset/BPI_Challenge_2015/12689204,
          <year>2015</year>
          . Eindhoven University of Technology.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Boulevard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kropf</surname>
          </string-name>
          , S. van der Meer, T. De Laet,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rozinat</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. F. van Dongen</surname>
          </string-name>
          ,
          <article-title>Bpi challenge 2017 road traffic fee management dataset</article-title>
          , https://data.4tu.nl/articles/dataset/BPI_Challenge_2017_Road_Traffic_Fee_Management/12689357,
          <year>2017</year>
          . Business Process Intelligence Challenge.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polato</surname>
          </string-name>
          ,
          <article-title>Dataset belonging to the help desk log of an italian company</article-title>
          ,
          <year>2017</year>
          . URL: https://data.4tu.nl/articles/_/12675977/1.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>