<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Demos and Resources</journal-title>
        <!-- Author e-mail addresses: david.jilg@dfki.de (D. Jilg); grueger@uni-trier.de (J. Grüger); tobias.geyer@dfki.de (T. Geyer); bergmann@uni-trier.de (R. Bergmann) -->
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>DALG: The Data Aware Event Log Generator</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>David Jilg</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joscha Grüger</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tobias Geyer</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ralph Bergmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence and Intelligent Information Systems, University of Trier</institution>
          ,
          <addr-line>54296 Trier</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>German Research Center for Artificial Intelligence (DFKI), Branch University of Trier</institution>
          ,
          <addr-line>54296 Trier</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Data and process mining techniques can be applied in many areas to gain valuable insights, but access to real-world process data is severely limited. Research, and especially the development of new methods, nevertheless depends on a sufficient basis of realistic data. Synthetic data of adequate quality can be a solution to this problem. The SAMPLE [1] approach addresses it by generating multi-perspective synthetic event logs that make sense on a semantic level. In this paper, we present the tool DALG: The Data Aware Event Log Generator, which allows users to generate synthetic event logs using the SAMPLE approach.</p>
      </abstract>
      <kwd-group>
        <kwd>Event Log Generation</kwd>
        <kwd>Synthetic Data</kwd>
        <kwd>Process Mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>application and allows users to define semantic attribute descriptions for a given data Petri net (DPN)
and to generate event logs based on them.</p>
      <p>In the following, Sect. 2 describes how DALG differs from existing tools for synthetic
data generation by highlighting its innovative characteristics. Subsequently, Sect. 4 evaluates
the maturity of the tool before we draw a conclusion in Sect. 5.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Innovation and characteristics</title>
      <p>
        DALG lets users generate synthetic event logs from existing data Petri nets;
control-flow-only Petri nets are supported as well. The initial step involves loading a data Petri net modeled
in the Petri Net Markup Language (PNML) format [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The user can then configure a wide
variety of simulation parameters and supply additional information about the loaded
model. Because the number of configuration options makes the setup complex,
the tool provides several features that assist the user during configuration. After the user
starts the simulation, its status is reported, and the event logs are
exported in the eXtensible Event Stream (XES) format [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] once the simulation has finished.
      </p>
      <p>
        Semantic information. In developing the SAMPLE approach, it was determined that a data
Petri net simply does not describe a process accurately enough and, therefore, does not contain
enough semantic information about the process to produce realistic synthetic data in the data
perspective without using additional information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Therefore, SAMPLE was introduced as an
approach for the complementary semantic description of the data perspective in the generation
of synthetic event logs. DALG implements and extends the SAMPLE approach and enables
the semantic description of variables and transitions in a data Petri net. Via a user interface
or a configuration file, the tool allows users to define, among other things, intervals, distribution
functions, value ranges, and dependencies for variables. In contrast to existing tools, DALG
uses this additional semantic information about the properties of variables and transitions to
generate more realistic values, for example by generating values only inside the given value
ranges of numerical variables. This enables the generation of semantically meaningful data
in the data perspective.
      </p>
      <p>• Dependencies: To generate realistic values for the variables in a
data Petri net, the dependencies between variables have to be considered. For example,
consider a process model describing the process of diagnosing and treating
cancer patients. If this model has two variables describing a patient’s gender and cancer
type, then these two variables can be interdependent, as certain cancer types are exclusive
to male patients (e.g., prostate cancer) or female patients (e.g., ovarian cancer).
DALG lets the user express these dependencies using pairs of logical expressions and
value range restrictions. The previous example could be partially described
by adding the following dependencies to the variable describing the cancer type:</p>
      <p>'gender == "female"' =&gt; (!=,'prostate cancer')</p>
      <p>'gender == "male"' =&gt; (!=,'ovarian cancer')</p>
      <p>These two dependencies specify that the variable describing the cancer type cannot be
"prostate cancer" if the patient is female and cannot be "ovarian cancer" if the patient is male.</p>
      <p>• Distribution: For numeric variables, DALG lets users define distribution functions for
the generated values. In the cancer treatment example, if age is included in the process model,
its distribution can be specified. This ensures realistic value distributions in the synthetic
event logs.</p>
      <p>• Values: A data Petri net modeled in PNML does not specify what values should be
written when a transition writes to a variable. Therefore, DALG allows the user to specify,
for each variable, a list of values that is used when generating values for that variable.
Additionally, each value can be given a weight that affects how likely the value is to be
picked when a value for the variable is needed. For numeric variables, it is also possible
to specify a value range instead of individual values.</p>
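      <p>The interplay of dependencies, distributions, and weighted values described in the list above can be illustrated with a minimal Python sketch. It is not taken from DALG's source code: the names CANCER_TYPES, DEPENDENCIES, allowed_values, sample_categorical, and sample_numeric, the weights, and the normal distribution for the age are all assumptions chosen for illustration. Dependencies are modeled as a condition on the values generated so far, paired with an operator and a restricted value, mirroring the two example dependencies.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

# Hypothetical semantic description of the variable "cancer_type":
# candidate values with weights, plus dependencies given as
# (condition on already generated values, operator, restricted value).
CANCER_TYPES = {"melanoma": 5.0, "prostate cancer": 2.0, "ovarian cancer": 1.0}
DEPENDENCIES = [
    (lambda state: state["gender"] == "female", "!=", "prostate cancer"),
    (lambda state: state["gender"] == "male", "!=", "ovarian cancer"),
]


def allowed_values(weighted_values, dependencies, state):
    """Drop every value excluded by a dependency whose condition holds."""
    allowed = dict(weighted_values)
    for condition, operator, value in dependencies:
        if condition(state):
            if operator == "!=":
                allowed.pop(value, None)
            elif operator == "==":
                allowed = {k: w for k, w in allowed.items() if k == value}
    return allowed


def sample_categorical(rng, weighted_values, dependencies, state):
    """Pick one allowed value; a larger weight makes a value more likely."""
    allowed = allowed_values(weighted_values, dependencies, state)
    values = list(allowed)
    weights = [allowed[v] for v in values]
    return rng.choices(values, weights=weights, k=1)[0]


def sample_numeric(rng, mean, std_dev, minimum, maximum):
    """Draw from a normal distribution, redrawing until the value is in range."""
    while True:
        value = rng.gauss(mean, std_dev)
        if minimum <= value <= maximum:
            return value


rng = random.Random(42)  # fixed seed so the run is repeatable
patients = []
for _ in range(1000):
    state = {"gender": rng.choice(["female", "male"])}
    state["age"] = round(sample_numeric(rng, 60, 15, 18, 100))
    state["cancer_type"] = sample_categorical(rng, CANCER_TYPES, DEPENDENCIES, state)
    patients.append(state)
```
]]></preformat>
      <p>In this sketch the gender is generated first so that the dependency conditions can be evaluated before the cancer type is drawn. A real generator would have to order variables according to their dependencies, which the sketch leaves out.</p>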
      <p>Simulation Configuration. DALG offers comprehensive control over event log generation. The
simulation setup lets the user choose a generation mode for the control-flow perspective, such as
randomized traces or experimental complete play-outs. Further configuration options cover the number
of traces, the trace length range, loop handling, duplicates, non-conforming traces, and timestamps.</p>
      <p>Transition Configuration. The transition configuration lets the user set individual weights
for the transitions and mark them as invisible. In addition, the tool extends the SAMPLE
approach with configurable time constraints for individual transitions and thus addresses
a limitation identified in the evaluation of the SAMPLE approach.</p>
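      <p>How transition weights and invisible transitions can shape a randomized play-out is sketched below in Python. This is an illustration under stated assumptions rather than DALG's implementation: the small net in TRANSITIONS, its weights, and the helper names enabled, fire, generate_trace, and generate_log are hypothetical, and the data perspective and timestamps are omitted.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import random

# Hypothetical control-flow-only net: each transition consumes one token from
# every input place and produces one token in every output place.
TRANSITIONS = {
    "register": {"inputs": ["start"], "outputs": ["p1"], "weight": 1.0, "invisible": False},
    "examine": {"inputs": ["p1"], "outputs": ["p2"], "weight": 3.0, "invisible": False},
    "skip": {"inputs": ["p1"], "outputs": ["p2"], "weight": 1.0, "invisible": True},
    "treat": {"inputs": ["p2"], "outputs": ["end"], "weight": 1.0, "invisible": False},
}
INITIAL_MARKING = {"start": 1}
FINAL_PLACE = "end"


def enabled(marking):
    """Return the names of all transitions whose input places hold a token."""
    return [
        name
        for name, t in TRANSITIONS.items()
        if all(marking.get(place, 0) > 0 for place in t["inputs"])
    ]


def fire(marking, name):
    """Return the marking that results from firing the given transition."""
    new_marking = dict(marking)
    for place in TRANSITIONS[name]["inputs"]:
        new_marking[place] -= 1
    for place in TRANSITIONS[name]["outputs"]:
        new_marking[place] = new_marking.get(place, 0) + 1
    return new_marking


def generate_trace(rng, max_length=20):
    """Play out the net once, choosing among enabled transitions by weight."""
    marking, trace = dict(INITIAL_MARKING), []
    while marking.get(FINAL_PLACE, 0) == 0 and len(trace) < max_length:
        candidates = enabled(marking)
        if not candidates:
            break  # deadlock: no transition can fire
        weights = [TRANSITIONS[name]["weight"] for name in candidates]
        chosen = rng.choices(candidates, weights=weights, k=1)[0]
        marking = fire(marking, chosen)
        if not TRANSITIONS[chosen]["invisible"]:
            trace.append(chosen)  # invisible transitions leave no event
    return trace


def generate_log(seed, number_of_traces):
    """Generate a list of traces from one seeded random number generator."""
    rng = random.Random(seed)
    return [generate_trace(rng) for _ in range(number_of_traces)]


log = generate_log(seed=7, number_of_traces=100)
```
]]></preformat>
      <p>With the weights assumed here, the visible transition "examine" is chosen three times as often as the invisible "skip", so most traces contain three events and the remaining ones contain two.</p>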
      <p>Usability. DALG prioritizes user-friendliness so that researchers from any domain can
generate synthetic data for their studies without programming expertise. For this
purpose, a graphical user interface based on the Qt 6 framework (https://www.qt.io/product/qt6) was developed,
which gives the user convenient access to all configuration options. Tooltips aid
the user during configuration, and an automated analysis of the process model generates a preliminary
configuration that serves as the foundation for the semantic definition.</p>
      <p>Furthermore, before each simulation the system checks for invalid models or configurations,
such as variable intervals whose minimum exceeds the maximum, and alerts the user.
In addition, users can export their configurations to a JSON (https://www.json.org/json-en.html) file for future use or sharing,
which enhances reproducibility. Results are also reproducible because users can define a seed
for the pseudo-random number generator that drives all simulation decisions.</p>
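      <p>The validation and export steps described above might look roughly as follows in Python. The configuration layout, the key names, and the functions validate, export_config, and import_config are assumptions made for this sketch; DALG's actual JSON schema and checks may differ.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import json
import tempfile
from pathlib import Path

# Hypothetical configuration layout; DALG's real JSON schema may differ.
config = {
    "seed": 1234,
    "number_of_traces": 100,
    "variables": {
        "age": {"type": "numeric", "min": 18, "max": 100},
        "cancer_type": {"type": "categorical", "values": {"melanoma": 1.0}},
    },
}


def validate(config):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    if config.get("number_of_traces", 0) < 1:
        problems.append("number_of_traces must be at least 1")
    for name, spec in config.get("variables", {}).items():
        if spec.get("type") == "numeric" and spec["min"] > spec["max"]:
            problems.append(f"variable '{name}': min {spec['min']} exceeds max {spec['max']}")
        if spec.get("type") == "categorical" and not spec.get("values"):
            problems.append(f"variable '{name}': no values defined")
    return problems


def export_config(config, path):
    """Write the configuration to a JSON file for later reuse or sharing."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)


def import_config(path):
    """Read a configuration back from a JSON file."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Round trip through a temporary file: the restored configuration,
# including the seed, matches the exported one.
with tempfile.TemporaryDirectory() as directory:
    path = Path(directory) / "dalg_config.json"
    export_config(config, path)
    restored = import_config(path)
```
]]></preformat>
      <p>Storing the seed in the same file as the other settings means that a shared configuration is enough to regenerate the same event log, provided the generator draws all of its random decisions from that one seeded source.</p>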
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation</title>
      <p>
        The SAMPLE approach implemented in DALG is described and evaluated in [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. During
the approach’s evaluation, the need for a semantic description of temporal information was
uncovered, and this description was then implemented in DALG. The application was also presented to
and tested by process mining experts in interviews. These interviews revealed the need for additional
features, such as the specification of semantic information regarding the time aspect of transitions,
which have since been added. DALG itself was tested using functional testing
and synthetic DPNs specially prepared to exercise all of DALG's features. Additionally, synthetic
event logs based on real-world process models were generated with DALG and evaluated by
experts from the models’ domains. For example, DALG was applied to a model describing
the treatment of melanoma, and doctors from the University Hospital Muenster evaluated the
synthetic treatment traces. They found the synthetic data mostly realistic but identified some
unrealistic properties. All of these unrealistic aspects were traced to
inaccurate semantic information supplied to DALG, since the medical professionals
were not available to configure it. Because Petri nets defined in PNML come in many different
shapes and sizes, DALG was also tested with publicly available data Petri nets. Scientific data
repositories such as 4TU.ResearchData were searched to acquire DPNs. The
features of these DPNs were then identified and checked against DALG to
ensure broad compatibility with existing DPNs. The usability of DALG was evaluated in
user studies. One limitation found during the evaluation is the great effort required to configure
the semantic information: this step is very time-consuming and error-prone. However, it was
also found that the amount of semantic information cannot be reduced if the goal is to generate
realistic data. One way to mitigate this problem could be to source the semantic
information, at least partially, automatically from external structures such as ontologies.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Tool maturity</title>
      <p>DALG is a fully functional standalone tool whose development has been completed. Additionally,
the correctness of the implementation of the SAMPLE approach has been evaluated. The tool
and its source code are available for free use and further development under the GNU General Public
License, version 3, on GitHub. The GitHub repository also includes a tutorial document and a video
showcase. DALG can be installed with the installer provided on GitHub and supports
Windows and Linux-based operating systems. Additionally, a user manual is provided to guide
users through configuring the semantic information and running DALG.</p>
      <p>In summary, DALG is a fully functional stand-alone tool that is ready to be used by researchers
to generate synthetic multi-perspective event logs across many domains.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper presents DALG, a stand-alone implementation of the SAMPLE approach that enables
the generation of multi-perspective event logs with a realistic data perspective. The tool is aimed
at experts in the field of Business Process Management who do not have access to suitable
event logs or are struggling to acquire relevant ones. With DALG, such users can continue
their work with reliable data. Because the generated logs conform to the process model, the tool
can also be used to generate event logs for evaluations.</p>
      <p>The presented tool is available for download and can be extended. In future research, the
tool will be extended with an auto-configurator that learns
the configuration from event logs. This should especially address privacy aspects of event
log availability and significantly simplify the configuration. In addition, the user interface is to
be expanded and revised to make it easier to use. Since it is currently only possible to generate
compliant traces, an extension for the controlled generation of non-compliant traces should
also be developed and integrated.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grüger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Geyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jilg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bergmann</surname>
          </string-name>
          ,
          <article-title>SAMPLE: A semantic approach for multi-perspective event log generation</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Montali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Senderovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weidlich</surname>
          </string-name>
          (Eds.),
          <source>Process Mining Workshops</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>328</fpage>
          -
          <lpage>340</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wienberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Duvigneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schumacher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Köhler-Bußmeier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Rölke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Valk</surname>
          </string-name>
          ,
          <article-title>An extensible editor and simulation engine for Petri nets: Renew</article-title>
          , volume
          <volume>3099</volume>
          ,
          <year>2004</year>
          , pp.
          <fpage>484</fpage>
          -
          <lpage>493</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B. N.</given-names>
            <surname>Yahya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Khosiawan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. A. D.</given-names>
            <surname>Do</surname>
          </string-name>
          ,
          <article-title>RT-PLG: Real time process log generator</article-title>
          , volume
          <volume>431</volume>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ackermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schönig</surname>
          </string-name>
          ,
          <article-title>MuDePS: Multi-perspective declarative process simulation</article-title>
          , in:
          <source>International Conference on Business Process Management</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kindler</surname>
          </string-name>
          ,
          <source>The Petri Net Markup Language</source>
          , Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2003</year>
          , pp.
          <fpage>124</fpage>
          -
          <lpage>144</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <article-title>IEEE standard for eXtensible Event Stream (XES) for achieving interoperability in event logs and event streams</article-title>
          ,
          <source>IEEE Std 1849-2016</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>