<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CDLG: A Tool for the Generation of Event Logs with Concept Drifts</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Justus Grimm</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alexander Kraus</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han van der Aa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>92</fpage>
      <lpage>96</lpage>
      <abstract>
        <p>Process mining targets the analysis of data recorded during the execution of business processes. When such data covers a period in which a process underwent changes, an event log is subject to concept drifts. Given that such drifts can greatly affect the quality of insights obtained from such logs, numerous approaches have been established to detect concept drifts. However, assessing and comparing the quality of these approaches is hard, due to the lack of appropriate data. Recognizing this, we present CDLG, a powerful and highly flexible tool for the generation of event logs with known concept drifts.</p>
      </abstract>
      <kwd-group>
        <kwd>Process mining</kwd>
        <kwd>Concept drifts</kwd>
        <kwd>Event log generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Business processes are subject to frequent changes due to the dynamic environments in which
they are executed [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Analyzing data recorded during the execution of such changing processes
results in the presence of concept drifts, i.e., situations in which an event log contains data
from different versions of a process. Recognizing the detrimental impact that such concept
drifts can have on obtained process mining results, a broad range of detection approaches has
been developed [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        To assess and compare the quality of such approaches, appropriate evaluation data is required
in the form of event logs that are subject to concept drifts and for which the details of these
drifts are available as a gold standard. This is hardly the case for publicly available real-world
event logs, for which there are only a few instances of (partially) known drifts, such as the
help desk log [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] that was studied by various authors [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. When it comes to synthetic event
logs, there are no reference collections available, whereas existing generation tools provide
only minimal support for the generation of event logs that contain drifts [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and do not provide
information in terms of a gold standard.
      </p>
      <p>To address this gap, this paper presents the Concept Drift Log Generator (CDLG), a tool that
allows users to generate synthetic event logs with concept drifts and a corresponding gold
standard. CDLG supports a wide range of options when it comes to log generation, allowing
users to obtain logs of varying complexity (in terms of trace length, control-flow constructs, log
size, noise, etc.) and insert one or more concept drifts of different types (i.e., sudden, gradual,
recurring, and incremental) into them. CDLG provides users with the means to obtain these
logs through three execution modes, trading off customization versus speed. In particular, users
can generate logs by following an interactive terminal script, by providing parameters through
a text file, or by generating an entire log collection based on desired high-level settings.</p>
      <p>In the remainder of this paper, Section 2 describes the functionality of the CDLG tool in more
detail, whereas Section 3 discusses its maturity. The tool itself, a corresponding Python package,
a tutorial document, and an instruction video can be accessed through our repository.<sup>1</sup></p>
    </sec>
    <sec id="sec-2">
      <title>2. The CDLG Tool</title>
      <p>This section describes the CDLG tool in terms of its scope (Section 2.1) and the three execution
modes that it provides to users (Section 2.2).</p>
      <p>2.1. Scope</p>
      <p>
        Our CDLG tool allows users to generate event logs that contain the primary kinds of
control-flow-based concept drifts and provides users with considerable control over other aspects, such
as model complexity, log size, and noise. As discussed in Section 2.2, users can customize all
these aspects if desired or instead choose to go with default or automatically generated options.
      </p>
      <p>
        Concept drift types. CDLG can generate event logs that contain drift scenarios with the four
drift types visualized in Figure 1, i.e., sudden, incremental, gradual, and recurring drifts, as
defined, for example, by Bose et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>[Figure 1: the four drift types, shown in panels titled "Sudden drift", "Incremental drift", "Gradual drift", and "Recurring drift"; each panel plots the process version (vertical axis) over time (horizontal axis).]</p>
      <sec id="sec-2-7">
        <p>Event log generation. CDLG generates event logs with concept drifts by playing out a process
tree for each version of a process required for a given drift scenario. As seen in Figure 1,
this means that the generation of a single sudden, gradual, or recurring drift requires two
process trees, whereas an incremental drift requires three or more different trees. Depending
on the employed execution mode (see Section 2.2), the necessary process trees can be explicitly
provided as input by the user or automatically generated.</p>
        <p>Since sudden, incremental, and recurring drifts involve the non-overlapping execution of
process versions, CDLG generates sub-logs for each part of the drift scenario, according to the
corresponding process tree, and then concatenates these sub-logs.</p>
        <sec id="sec-2-7-1">
          <title><sup>1</sup> https://gitlab.uni-mannheim.de/processanalytics/cdlg_tool</title>
          <p>For instance, a sudden drift from version 1 to version 2 that occurs after three quarters of a desired event log with 1000 traces will
consist of 750 traces of version 1, followed by 250 traces of version 2. For incremental and recurring drifts,
the tool generates as many sub-logs as necessary to capture the drift scenario (e.g., five for the
incremental and four for the recurring drift in Figure 1).</p>
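          <p>The sub-log concatenation described above can be illustrated with a minimal Python sketch. It is not taken from CDLG's code base: the function names sublog_sizes and concatenate_sublogs, as well as the representation of a trace as a tuple of activity labels, are assumptions made purely for illustration.</p>
          <preformat><![CDATA[
```python
from typing import List, Sequence, Tuple

# Hypothetical trace representation: a tuple of activity labels.
Trace = Tuple[str, ...]


def sublog_sizes(total_traces: int, change_points: Sequence[float]) -> List[int]:
    """Split total_traces into consecutive sub-log sizes.

    change_points are relative positions in (0, 1), in ascending order,
    at which the process switches to the next version.
    """
    if any(not 0.0 < p < 1.0 for p in change_points):
        raise ValueError("change points must lie strictly between 0 and 1")
    if list(change_points) != sorted(change_points):
        raise ValueError("change points must be in ascending order")
    # Trace indices at which a new sub-log starts, plus both ends of the log.
    bounds = [0] + [round(p * total_traces) for p in change_points] + [total_traces]
    return [end - start for start, end in zip(bounds, bounds[1:])]


def concatenate_sublogs(sublogs: Sequence[Sequence[Trace]]) -> List[Trace]:
    """Join sub-logs in order, so that each drift lies at a sub-log boundary."""
    return [trace for sublog in sublogs for trace in sublog]


# Sudden drift after three quarters of a 1000-trace log.
print(sublog_sizes(1000, [0.75]))  # [750, 250]
```
]]></preformat>
          <p>Under the same assumptions, a scenario with several change points, such as the recurring drift in Figure 1, would simply pass more than one change point and alternate the process tree used for each sub-log.</p>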
          <p>For gradual drifts, which contain a period of time in which two process versions overlap,
we generate sub-logs such that their combination contains a period with traces from both version 1 and
version 2. Our tool allows for this transition to occur in a linear fashion, where traces of version 2 gradually
become more common, as well as in an exponential fashion, in which this transition occurs
more rapidly.</p>
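          <p>A transition window of this kind could be sampled as in the following sketch. The text above only states that a linear and an exponential transition exist; the concrete weighting curves, the rate constant, and the function names new_version_probability and transition_versions are assumptions for illustration and may differ from what CDLG does internally.</p>
          <preformat><![CDATA[
```python
import math
import random
from typing import List


def new_version_probability(position: float, mode: str = "linear") -> float:
    """Probability of drawing a trace from the new process version.

    position is the relative position inside the transition window, in [0, 1].
    Both weighting curves are illustrative assumptions.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    if mode == "linear":
        return position
    if mode == "exponential":
        # Rises faster than the linear curve early in the window;
        # the rate constant 5.0 is an arbitrary choice.
        return (1.0 - math.exp(-5.0 * position)) / (1.0 - math.exp(-5.0))
    raise ValueError(f"unknown mode: {mode}")


def transition_versions(n_traces: int, mode: str, seed: int = 0) -> List[int]:
    """Return, for each trace in the window, the version (1 or 2) it is drawn from."""
    rng = random.Random(seed)
    versions = []
    for i in range(n_traces):
        position = i / (n_traces - 1) if n_traces > 1 else 1.0
        probability = new_version_probability(position, mode)
        versions.append(2 if rng.random() < probability else 1)
    return versions
```
]]></preformat>
          <p>In this sketch, the first trace of the window always stems from version 1 and the last from version 2; in between, the share of version 2 traces grows according to the chosen curve.</p>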
          <p>Multi-drift scenarios. CDLG also supports the generation of event logs with multiple,
consecutive drifts. In this case, several of the aforementioned drift scenarios are concatenated, where
the last process version of one scenario represents the first version used in the next.</p>
          <p>Noise insertion. Finally, CDLG supports the insertion of noise into an event log, which can be
used to assess the robustness of concept drift detection approaches. We allow for noisy traces
to be inserted that are similar to the traces in the provided process trees, as well as for traces
that are wholly randomly generated (over the same set of activities as the process trees).</p>
          <p>Gold standard attribute. Having information on the characteristics of the concept drift(s) in
an event log is crucial for the proper evaluation of detection approaches. Therefore, given an
inserted drift, CDLG records information on its drift type, the moment at which it occurred, as
well as the control-flow changes that occurred in the process. This is all stored as an event log
attribute in the resulting XES file, as shown in Figure 2.</p>
          <p>2.2. Execution Modes</p>
          <p>CDLG allows users to generate event logs in various manners, generally trading off ease of use
(and speed) against the degree of customization that is possible.</p>
          <p>Terminal mode. This mode, accessed through start_generator_terminal.py, provides
the highest degree of customization to a user. As detailed in the tutorial document (see Section 1),
this mode starts by allowing users to define the process trees necessary for a drift scenario (either
provided as input or automatically generated according to desired characteristics), and define
aspects such as the desired log size. Then, users can define the desired drift scenario between the
established process versions in terms of the drift type and change moment(s). Finally, users can
decide to insert noise and, if desired, add another drift to the log. Note that the terminal mode
provides default settings for all options, which means that users have considerable freedom
when establishing drift scenarios, but can skip customization of certain aspects if desired.</p>
          <p>Parameter-file modes. For users who simply want to obtain event logs with
concept drifts to evaluate a detection approach, CDLG provides two options. Users
can generate individual logs according to parameters set in a text file by calling the
generate_log_drift_type_from_doc.py functions (one for each type of drift). Due to
the randomization provided by CDLG, such as the automated generation of process trees of
various complexity, each execution of these methods will result in a unique event log.</p>
          <p>[Figure 3: (a) Drift map. (b) Drift chart w/o noise. (c) Drift chart with noise.]</p>
          <p>Users can also execute generate_collection_of_logs.py to generate an entire
collection of a user-defined number of event logs with concept drifts. This method provides slightly
fewer customization options than the methods for individual logs, but can provide users with a large
number of event logs suitable for the automated assessment of the accuracy of a concept drift
detection approach.</p>
          <p>Python package. Finally, we provide the core functionality of CDLG as part of a Python package,
which can be installed using pip.<sup>2</sup> Specifically, this package provides methods for the generation
of event logs with each of the four kinds of concept drifts, the insertion of noise in such event
logs, and the automated generation of process trees to be used as a basis for log generation. For
more details on the methods and their parameters, we refer to the corresponding README file.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Tool Maturity</title>
      <p>
        The core of CDLG was developed as a bachelor project, of which the resulting thesis [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]—
available in the project’s repository—provides details on the conceptual algorithms underlying
CDLG’s functionality.
      </p>
      <p>
        Evaluation. As part of this thesis project, CDLG was evaluated by using it to generate various
event logs with concept drifts and feeding these event logs as input to the Visual Drift Detection
(VDD) tool [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], a state-of-the-art concept drift detection approach and visualizer. The evaluation
showed the suitability of the CDLG logs to assess the potential of detection approaches to
recognize concept drifts in different scenarios and under different circumstances, for instance
by comparing event logs with and without noise. A visualization obtained using VDD for such a
case is shown in Figure 3.
      </p>
      <sec id="sec-3-1">
        <title><sup>2</sup> https://gitlab.uni-mannheim.de/processanalytics/cdlg-package</title>
        <p>The figures reveal that VDD was able to accurately detect a recurring
drift in event logs without and with noise, though the significance of the results in the latter case
was lower (Figure 3c).</p>
        <p>Future work. We plan to continue the development of CDLG in the future.</p>
        <p>
          On the technical side, we aim to extend the current functionality by enabling CDLG to
automatically establish intermediary steps for incremental drifts between two provided process
trees and to provide users with more control over the noise-insertion procedure, e.g., by ensuring
that noise inserted into processes with meaningful event labels makes semantic sense [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>On a conceptual level, a key extension that we envision includes support for multi-perspective
drifts, e.g., where the behavior of a process changes with respect to factors such as the execution
time or arrival rate, distributions over data attributes, the resource perspective, and combinations
of such aspects together with control-flow changes. Furthermore, we aim to provide support
for the generation of multi-order drifts, which are drift scenarios in which multiple drifts occur
simultaneously, such as a process that is subject to both seasonal and monthly recurring changes.
Finally, we also intend to extend CDLG by providing methods for the automated computation
of evaluation results of a detection approach for a set of CDLG-generated event logs, providing
measures such as precision, recall, and detection delay.</p>
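        <p>To make these measures concrete, the sketch below computes precision, recall, and mean detection delay from gold-standard and detected change points. This functionality is not part of CDLG; the function name evaluate_detection, the use of trace indices as change points, and the matching rule (a detection counts as correct if it lies at most a given number of traces after an unmatched gold-standard change point) are assumptions chosen for illustration.</p>
        <preformat><![CDATA[
```python
from typing import Dict, List, Sequence


def evaluate_detection(
    gold: Sequence[int], detected: Sequence[int], tolerance: int
) -> Dict[str, float]:
    """Compare detected change points against gold-standard change points.

    Change points are trace indices. Each gold point is matched to the
    earliest unmatched detection in [gold, gold + tolerance] (assumed rule).
    """
    remaining = sorted(detected)
    delays: List[int] = []
    for g in sorted(gold):
        match = next((d for d in remaining if g <= d <= g + tolerance), None)
        if match is not None:
            remaining.remove(match)  # each detection is matched at most once
            delays.append(match - g)
    true_positives = len(delays)
    precision = true_positives / len(detected) if detected else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    mean_delay = sum(delays) / true_positives if true_positives else float("nan")
    return {"precision": precision, "recall": recall, "mean_delay": mean_delay}


# One gold drift at trace 750; the detector reports drifts at traces 760 and 900.
print(evaluate_detection([750], [760, 900], tolerance=50))
# {'precision': 0.5, 'recall': 1.0, 'mean_delay': 10.0}
```
]]></preformat>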
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R. J. C.</given-names>
            <surname>Bose</surname>
          </string-name>
          ,
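          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Žliobaitė</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pechenizkiy</surname>
          </string-name>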
          ,
          <article-title>Dealing with concept drifts in process mining</article-title>
          ,
          <source>IEEE Trans Neural Netw Learn Syst</source>
          <volume>25</volume>
          (
          <year>2013</year>
          )
          <fpage>154</fpage>
          -
          <lpage>171</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elkhawaga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abuelkheir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. I.</given-names>
            <surname>Barakat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Riad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reichert</surname>
          </string-name>
          ,
          <article-title>Conda-pm-a systematic review and framework for concept drift analysis in process mining</article-title>
          ,
          <source>Algorithms</source>
          <volume>13</volume>
          (
          <year>2020</year>
          )
          <fpage>161</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polato</surname>
          </string-name>
          ,
          <article-title>Dataset belonging to the help desk log of an italian company</article-title>
          ,
          <year>2017</year>
          . doi:10.4121/uuid:0c60edf1-6f83-4e75-9367-4c63b3e9d5bb.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ostovar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Maaradji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>ter Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <article-title>Detecting drift from event streams of unpredictable business processes</article-title>
          ,
          <source>in: ER</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>330</fpage>
          -
          <lpage>346</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yeshchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Di Ciccio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polyvyanyy</surname>
          </string-name>
          ,
          <article-title>Comprehensive process drift detection with visual analytics</article-title>
          ,
          <source>in: ER</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>119</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <article-title>PLG2: Multiperspective process randomization with online and offline simulations</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>1789</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grimm</surname>
          </string-name>
          ,
          <article-title>CDLG: Generation of event logs with concept drifts</article-title>
          ,
          <source>Bachelor's Thesis</source>
          , University of Mannheim,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
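          <string-name>
            <given-names>H.</given-names>
            <surname>van der Aa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rebmann</surname>
          </string-name>
          ,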
          <string-name>
            <given-names>H.</given-names>
            <surname>Leopold</surname>
          </string-name>
          ,
          <article-title>Natural language-based detection of semantic execution anomalies in event logs</article-title>
          ,
          <source>Information Systems</source>
          <volume>102</volume>
          (
          <year>2021</year>
          )
          <fpage>101824</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>