<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Distributed Event Factory: A Tool for Generating Event Streams on Distributed Data Sources</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Hendrik Reiter</string-name>
          <email>hendrik.reiter@email.uni-kiel.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christian Imenkamp</string-name>
          <email>christian.imenkamp@uni-bayreuth.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agnes Koschmider</string-name>
          <email>agnes.koschmider@uni-bayreuth.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wilhelm Hasselbring</string-name>
          <email>hasselbring@email.uni-kiel.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Christian-Albrechts-University Kiel</institution>
          ,
          <addr-line>Christian-Albrechts-Platz 4, 24118 Kiel</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Bayreuth</institution>
          ,
          <addr-line>Universitätsstraße 30, 95447 Bayreuth</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In real-life applications, data sources are often distributed. In a smart factory, data is generated by spatially distributed sensors. Distributed process mining algorithms may exploit this data locality by processing data where it is generated. The Distributed Event Factory is a tool to evaluate distributed process mining algorithms under (best-effort) realistic conditions. It generates synthetic event streams that consider the distributed nature of the data sources. In particular, we can evaluate the scalability of such algorithms by increasing the volume and velocity of the generated events. Additionally, other external factors, such as the temporal behavior of events and varying load profiles, can be configured. Using the example of a smart factory, we demonstrate the tool's capabilities.</p>
      </abstract>
      <kwd-group>
        <kwd>Event Log Generator</kwd>
        <kwd>Stream Process Mining</kwd>
        <kwd>Distributed Process Mining</kwd>
        <kwd>Distributed Computing</kwd>
        <kwd>Markov Chain</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Process mining is a discipline dedicated to discovering and monitoring real-world processes.
Streaming process mining [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] aims to deliver the results of process mining algorithms as soon as
data is generated. The emerging field of distributed process mining further considers the spatial
context of data. Distributed process mining algorithms (e.g., [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]) do not only process data
when it is generated but also at the location where it originates. By processing data directly at
its source, there is no need to send it to a central instance for processing. In a smart factory
with multiple production facilities, the use of distributed process mining algorithms might be
beneficial. Data is initially processed individually in each factory, and only the relevant data
required for process analysis is forwarded to other facilities. By applying the principle of data
sparsity, latencies can be minimized, privacy preserved, and network costs reduced.
      </p>
      <p>
        What is lacking for the efficient development of distributed process mining algorithms is a tool
that generates data while addressing the aspects of real-time processing and data distribution.
Furthermore, it would be beneficial if this tool could also consider other issues that occur in the
real world, such as noise, varying volume and velocity of data generation, and the delayed arrival
of data. In process mining, there are various tools that deal with the generation of synthetic data.
These either focus on privacy [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], concept drift [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or sensor events [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. PLG2 [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] comes closest
to these requirements. It can generate messages in real time. Additionally, parameters such
as noise and various model complexities can be configured. However, data distribution is not
sufficiently considered. Furthermore, sending rates cannot be adjusted arbitrarily, and temporal
dependencies of data cannot be configured. An alternative approach is to start from the
properties of event logs and extend them with (data) distribution. This requires incorporating
additional properties into event logs, such as a reference to the location. Additionally, event logs
are finite and cannot be used to simulate the potentially infinite data streams of online process
mining.
      </p>
      <p>In this paper, we present the Distributed Event Factory, which fulfills the aforementioned
requirements for distributed data generation. In Section 2, we present the basic concepts of the
tool and show how it can be configured. In Section 3, using the example of a smart factory,
we demonstrate that the Distributed Event Factory can generate semantically meaningful data.
Section 4 discusses the maturity of the tool. Section 5 summarizes the work and outlines future
work.</p>
    </sec>
    <sec id="sec-3">
      <title>2. Distributed Event Factory</title>
      <p>
        The Distributed Event Factory (DEF) is built upon a Markov chain. A Markov chain is a graph
that models stochastic processes in which the probabilities of the next state transition depend
only on the current state and not on previous states. Markov chains are commonly used to simulate processes [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
and for load generation [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. A Markov chain lets us simulate real-world settings, as required for our
purpose. DEF generates data according to the XES format [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]
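with an event defined by its case id, activity name, and a timestamp. Moreover, the event data
source is added to the event. Formally, we describe the tool as follows:
      </p>
      <p>Definition 2.1 (Event). Let <inline-formula><tex-math>$\mathcal{C}$</tex-math></inline-formula> be the set of case ids, <inline-formula><tex-math>$\mathcal{A}$</tex-math></inline-formula> the set of all activities, and <inline-formula><tex-math>$\mathcal{D}$</tex-math></inline-formula> the
set of data sources. We consider <inline-formula><tex-math>$t \in \mathbb{N}$</tex-math></inline-formula> as a timestamp, and define <inline-formula><tex-math>$\mathcal{F}$</tex-math></inline-formula> as the set
of random functions <inline-formula><tex-math>$f : () \to \mathbb{N}$</tex-math></inline-formula> that generate a process duration. The events are drawn from the set
        <disp-formula><tex-math>\[ \mathcal{E} = \mathcal{C} \times \mathcal{A} \times \mathbb{N} \times \mathcal{D}. \]</tex-math></disp-formula>
      </p>
      <p>Definition 2.2 (Data Source Topology). Let the data source topology be represented as a graph
<inline-formula><tex-math>$G = (V, E)$</tex-math></inline-formula>, where the vertices <inline-formula><tex-math>$V$</tex-math></inline-formula> correspond to data sources, each associated with a name
<inline-formula><tex-math>$d \in \mathcal{D}$</tex-math></inline-formula>. Each edge in <inline-formula><tex-math>$E$</tex-math></inline-formula> has a probability inscription (<inline-formula><tex-math>$p \in \mathbb{R}^{+}_{\leq 1}$</tex-math></inline-formula>) that refers to the data source
where the process continues, a duration function <inline-formula><tex-math>$f \in \mathcal{F}$</tex-math></inline-formula>, and an activity <inline-formula><tex-math>$a \in \mathcal{A}$</tex-math></inline-formula>. Events are
generated by a random instantiation of the graph. Transitions are activated according to the
probability inscriptions of the edges and the duration functions between vertices. The sum of the probabilities of
all outgoing edges per node has to be 1.</p>
      <p>Definition 2.3 (Distributed Event Stream). Let <inline-formula><tex-math>$S_d$</tex-math></inline-formula> be the event stream of data source <inline-formula><tex-math>$d \in \mathcal{D}$</tex-math></inline-formula>.
Furthermore, let <inline-formula><tex-math>$(a, f)$</tex-math></inline-formula> be the activity and duration function of the transition, <inline-formula><tex-math>$t \in \mathbb{N}$</tex-math></inline-formula> the last
tracked timestamp, and <inline-formula><tex-math>$c \in \mathcal{C}$</tex-math></inline-formula> the current case id. Then the generated event is appended to the
event stream of data source <inline-formula><tex-math>$d$</tex-math></inline-formula>: <inline-formula><tex-math>$S_d \cdot \langle (c, a, t + f(), d) \rangle$</tex-math></inline-formula>.</p>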
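      <p>Definitions 2.2 and 2.3 can be read as a random walk over the data source graph. The following Python sketch illustrates that reading only; it is not the tool's implementation. All names (Edge, Topology, run_case) and the example sources and activities are assumptions made for this illustration. A data source without outgoing edges is simply left out of the edge map and ends the case.</p>
      <preformat>
```python
import random
from dataclasses import dataclass, field
from typing import Callable

# All names below are hypothetical; this is not the API of the Distributed Event Factory.


@dataclass(frozen=True)
class Edge:
    probability: float           # probability inscription of the edge
    target: str                  # data source where the process continues
    activity: str                # activity emitted when the edge is taken
    duration: Callable[[], int]  # duration function of the transition


@dataclass
class Topology:
    edges: dict[str, list[Edge]]                                   # outgoing edges per data source
    streams: dict[str, list[tuple]] = field(default_factory=dict)  # one event stream per data source

    def validate(self) -> None:
        # The outgoing probabilities of every listed data source must sum to 1.
        for source, outgoing in self.edges.items():
            total = sum(edge.probability for edge in outgoing)
            if abs(total - 1.0) > 1e-9:
                raise ValueError(f"probabilities of {source!r} sum to {total}")

    def run_case(self, case_id: str, start: str, rng: random.Random, max_steps: int = 100) -> None:
        source, timestamp = start, 0
        for _ in range(max_steps):
            outgoing = self.edges.get(source, [])
            if not outgoing:  # no outgoing edge: the case ends here
                return
            edge = rng.choices(outgoing, weights=[e.probability for e in outgoing])[0]
            timestamp += edge.duration()
            # The current data source appends (case id, activity, timestamp, data source)
            # to its own stream; nothing is written to a central log.
            self.streams.setdefault(source, []).append((case_id, edge.activity, timestamp, source))
            source = edge.target


topology = Topology(edges={
    "GoodsReception": [
        Edge(0.8, "Storage", "Store", lambda: 5),
        Edge(0.2, "GoodsReception", "Reject", lambda: 1),
    ],
    "Storage": [Edge(1.0, "Shipping", "Pick", lambda: 3)],
})
topology.validate()
topology.run_case("case-1", "GoodsReception", random.Random(42))
```
      </preformat>
      <p>In this sketch, each entry of topology.streams plays the role of one per-source event stream, and the timestamp of each event is the last tracked timestamp plus the sampled duration.</p>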
      <p>In the following, we summarize how data distribution, varying sending rates, and the inclusion
of temporal aspects have been implemented in the Distributed Event Factory.</p>
      <p>Data Distribution. Every data source writes its own event stream, so data is distributed
and stored decentrally. Each data source records the event of the
outgoing edge in its own event log. By default, the tool provides three data sinks that emit events in
real time: the console, a GUI, and a Kafka broker. Additionally, custom data sinks can be
defined that directly invoke a distributed process mining algorithm.</p>
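      <p>To illustrate what a custom data sink could look like, the following Python sketch defines a minimal sink interface and a sink that feeds events directly into a toy streaming miner. The interface (DataSink, send) and the directly-follows counter are assumptions for illustration and a stand-in for a real distributed process mining algorithm; the tool's actual sink classes may differ.</p>
      <preformat>
```python
from abc import ABC, abstractmethod

# Hypothetical interface; the tool's actual sink classes may look different.
Event = tuple[str, str, int, str]  # (case id, activity, timestamp, data source)


class DataSink(ABC):
    @abstractmethod
    def send(self, event: Event) -> None:
        """Receive one event as soon as it is generated."""


class ConsoleSink(DataSink):
    def send(self, event: Event) -> None:
        print(event)


class DirectlyFollowsSink(DataSink):
    """Toy stand-in for a mining algorithm: counts directly-follows pairs per case."""

    def __init__(self) -> None:
        self.last_activity: dict[str, str] = {}
        self.counts: dict[tuple[str, str], int] = {}

    def send(self, event: Event) -> None:
        case_id, activity, _timestamp, _source = event
        previous = self.last_activity.get(case_id)
        if previous is not None:
            pair = (previous, activity)
            self.counts[pair] = self.counts.get(pair, 0) + 1
        self.last_activity[case_id] = activity


sink = DirectlyFollowsSink()
for event in [
    ("c1", "Receive", 0, "warehouse"),
    ("c1", "Store", 5, "warehouse"),
    ("c1", "Pick", 9, "warehouse"),
]:
    sink.send(event)
```
      </preformat>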
      <p>Data Volume and Velocity. The Distributed Event Factory accepts arbitrary functions
that define the speed at which the simulation runs. This allows control over the volume
and velocity of the data flow. A constant and a gradually increasing load profile have been implemented.
However, these can also be overridden with user-defined functions.</p>
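      <p>A load profile of this kind can be thought of as a function from the simulation step to a target sending rate. The Python sketch below shows a constant and a linearly increasing profile under that assumption. The function names and the choice of events per second as the unit are hypothetical and not taken from the tool.</p>
      <preformat>
```python
from typing import Callable

# Hypothetical names. A load profile maps the simulation step to a target rate in events per second.
LoadProfile = Callable[[int], float]


def constant_load(rate: float) -> LoadProfile:
    return lambda step: rate


def gradually_increasing_load(start_rate: float, increment: float) -> LoadProfile:
    # The rate grows linearly with the step index.
    return lambda step: start_rate + increment * step


def pause_before_next_event(profile: LoadProfile, step: int) -> float:
    """Seconds to wait before emitting the next event at the given step."""
    rate = profile(step)
    if not rate > 0:
        raise ValueError("the load profile must return a positive rate")
    return 1.0 / rate


ramp = gradually_increasing_load(start_rate=10.0, increment=5.0)
pauses = [pause_before_next_event(ramp, step) for step in range(3)]  # rates 10, 15, 20 events/s
```
      </preformat>
      <p>Any other callable with the same shape, for example a periodic or bursty profile, could be substituted in this sketch without changing the emitting loop.</p>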
      <p>Process Execution Time. Each edge in the Markov chain is assigned a stochastic function
that determines how long the process takes to execute. In practice, not every process
has the same duration, and this can be modeled accordingly. Data may also arrive
at the processing components late or out of order. This can be simulated by assigning
a negative duration. The Distributed Event Factory provides functions that
model constant, uniformly distributed, or normally distributed execution times. These can also
be configured and overridden by the user.</p>
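      <p>The three kinds of duration function, and the effect of a negative duration, can be sketched in Python as follows. The helper names and signatures are assumptions for illustration, not the tool's API; only the standard library random module is used.</p>
      <preformat>
```python
import random
from typing import Callable

# Hypothetical helpers; names and signatures are assumptions, not the tool's API.
Duration = Callable[[], int]


def constant(value: int) -> Duration:
    return lambda: value


def uniform(low: int, high: int, rng: random.Random) -> Duration:
    return lambda: rng.randint(low, high)  # both bounds are inclusive


def normal(mean: float, stddev: float, rng: random.Random) -> Duration:
    return lambda: round(rng.gauss(mean, stddev))


rng = random.Random(7)
transport = uniform(2, 6, rng)
late_arrival = constant(-3)  # negative duration: the resulting timestamp lies before the previous one

timestamp = 100
timestamps = []
for duration in (transport, late_arrival, normal(10, 2, rng)):
    timestamp += duration()
    timestamps.append(timestamp)
```
      </preformat>
      <p>In this sketch the second timestamp is smaller than the first, which is how an out-of-order event would appear to a downstream consumer.</p>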
      <p>The Distributed Event Factory is implemented in Python and publicly available on GitHub. It
is configured using a YAML file.</p>
    </sec>
    <sec id="sec-4">
      <title>3. Case Study: Smart Factory</title>
      <p>Let us consider a warehouse and a factory. Both are run in different locations and follow
particular processes (see Figure 4). (i) The activities of the warehouse include receiving goods,
storage, picking, packing, shipping, managing inventory, and handling returns. (ii) The activities
of the factory process are receiving goods, material preparation, assembly line setup, assembly,
quality control, packaging, and shipping. Note that the factory requires the material
provided by the warehouse. Hence, the warehouse and the factory processes depend on each
other. Any delays (e.g., inventory shortages, shipping delays) can significantly impact the
factory’s operations. Thus, an interorganizational simulation (i.e., over distributed locations) is
required.</p>
      <p>The locations in the Distributed Event Factory can be modeled as groups of data sources.
Each group represents a specific operational area. For example, the process of receiving goods
can be defined as a data source. The group id for this process would be "warehouse". In contrast,
the assembly line setup would have "factory" as the group id.</p>
      <p>Each data source can emit different values depending on its configuration. For instance, the
goods reception produces activities like Reject, Pass To Production, or Store. These values allow
for tailored responses based on specific operational needs.</p>
      <p>Thus, by defining data sources and their corresponding group IDs, the Distributed Event Factory
can organize data sources by location.</p>
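      <p>The grouping described above can be pictured as a simple mapping from group id to data sources. The Python sketch below is a hypothetical model only: the DataSource class and its field names are assumptions, and the activities listed for Packing and AssemblyLineSetup are invented placeholders. Only the goods reception activities and the two group ids come from the case study.</p>
      <preformat>
```python
from collections import defaultdict
from dataclasses import dataclass


# Hypothetical model of a configured data source; the field names are assumptions.
@dataclass(frozen=True)
class DataSource:
    name: str
    group_id: str                # location the data source belongs to
    activities: tuple[str, ...]  # values the data source can emit


sources = [
    DataSource("GoodsReception", "warehouse", ("Reject", "Pass To Production", "Store")),
    DataSource("Packing", "warehouse", ("Pack",)),                 # placeholder activity
    DataSource("AssemblyLineSetup", "factory", ("Setup Line",)),   # placeholder activity
]


def by_location(data_sources: list[DataSource]) -> dict[str, list[str]]:
    """Group data source names by their group id, keeping the input order."""
    groups: dict[str, list[str]] = defaultdict(list)
    for source in data_sources:
        groups[source.group_id].append(source.name)
    return dict(groups)


locations = by_location(sources)
```
      </preformat>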
    </sec>
    <sec id="sec-5">
      <title>4. Maturity</title>
      <p>
        The Distributed Event Factory is currently at the prototype stage. The requirements we set
have been implemented; however, thorough evaluations that validate the
different configurations of DEF are still outstanding. In the future, we plan to conduct these evaluations and
to advance DEF. We plan to implement a user interface that initially visualizes
the distributed process and is later extended for interaction. Based on our testing, DEF
currently supports the generation of approximately 50,000 simulation steps per second, which
makes it suitable for load testing. The number varies with the selected data sink. Currently, data
is generated in compliance with the XES format. An extension to object-centric event logs (OCEL) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
or to IoT data formats such as NICE [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] is planned. Furthermore, concept drift could also be
implemented.
      </p>
    </sec>
    <sec id="sec-6">
      <title>5. Conclusion</title>
      <p>This paper introduced the Distributed Event Factory, a tool addressing data generation for
distributed process mining. The tool relies on Markov chains and incorporates a spatial component
into process mining. It allows configuring properties of distributed data, including temporal
dependencies, the location of processes, and data volume and velocity. Future
work will focus on advancing the tool’s maturity. Additionally, paradigms such as object-centric
process mining and drifting data will be integrated. In this way, the Distributed Event Factory
allows evaluating different data settings for distributed process mining.</p>
      <p>This work was funded by the Deutsche Forschungsgemeinschaft (DFG), grant 496119880.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <article-title>Streaming process mining</article-title>
          ,
          <source>in: Process Mining Handbook</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>349</fpage>
          -
          <lpage>372</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-031-08848-3_11</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Andersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rathje</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Landsiedel</surname>
          </string-name>
          ,
          <article-title>Edgealpha: Distributed process discovery at the data sources</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.03426. arXiv:2405.03426.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Kaczmarek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <article-title>Conceptualizing a log generator for privacy-aware event logs</article-title>
          ,
          <source>EMISA Forum 41</source>
          (
          <year>2021</year>
          )
          <fpage>39</fpage>
          -
          <lpage>40</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Grimm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kraus</surname>
          </string-name>
          , H. van der Aa,
          <article-title>CDLG: a tool for the generation of event logs with concept drifts</article-title>
          ,
          <source>in: International Conference on Business Process Management</source>
          ,
          <year>2022</year>
          . URL: https://ceur-ws.org/Vol-3216/paper_241.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <source>Generating Synthetic Sensor Event Logs for Process Mining</source>
          , Springer International Publishing,
          <year>2022</year>
          , pp.
          <fpage>130</fpage>
          -
          <lpage>137</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-031-07481-3_15</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Burattin</surname>
          </string-name>
          ,
          <article-title>PLG2: Multiperspective processes randomization and simulation for online and offline settings</article-title>
          ,
          <year>2015</year>
          . doi:
          <pub-id pub-id-type="doi">10.48550/ARXIV.1506.08415</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chib</surname>
          </string-name>
          , E. Greenberg,
          <article-title>Markov chain Monte Carlo simulation methods in econometrics</article-title>
          ,
          <source>Econometric theory 12</source>
          (
          <year>1996</year>
          )
          <fpage>409</fpage>
          -
          <lpage>431</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.</given-names>
            <surname>Vögele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van Hoorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hasselbring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Krcmar</surname>
          </string-name>
          ,
          <article-title>WESSBAS: extraction of probabilistic workload specifications for load testing and performance prediction-a model-driven approach for session-based application systems</article-title>
          ,
          <source>Software &amp; Systems Modeling</source>
          <volume>17</volume>
          (
          <year>2018</year>
          )
          <fpage>443</fpage>
          -
          <lpage>477</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s10270-016-0566-5</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>van Hoorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Vögele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Hasselbring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Krcmar</surname>
          </string-name>
          ,
          <article-title>Automatic extraction of probabilistic workload specifications for load testing session-based application systems</article-title>
          ,
          <source>EAI Endorsed Transactions on Self-Adaptive Systems</source>
          <volume>15</volume>
          (
          <year>2015</year>
          ). doi:
          <pub-id pub-id-type="doi">10.4108/icst.valuetools.2014.258171</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. C.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>XES tools</article-title>
          ,
          <source>in: 22nd International Conference on Advanced Information Systems Engineering (CAiSE 2010)</source>
          , CEUR-WS.org,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Object-centric process mining: Dealing with divergence and convergence in event data</article-title>
          ,
          <source>in: Software Engineering and Formal Methods</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>25</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-030-30446-1_1</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Veneruso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Leotta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mecella</surname>
          </string-name>
          , E. Serral,
          <article-title>NICE: The Native IoT-Centric Event Log Model for Process Mining</article-title>
          , in: Process Mining Workshops, Springer,
          <year>2024</year>
          , pp.
          <fpage>32</fpage>
          -
          <lpage>44</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/978-3-031-56107-8_3</pub-id>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>