<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>CauseCheck: A Tool for Simulating Deviations in Event Logs with Known Root Causes</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Frederik Hake</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Schneider</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikolaos Theofanopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Camila Gonzalez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Poey Sie Chuah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michael Grohs</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jana-Rebecca Rehse</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Mannheim</institution>
          ,
          <addr-line>L15 1-6, 68161 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Conformance checking compares process executions in an event log to a process model to detect where and how the executions deviate from the model. However, these techniques are not able to explain why deviations occur, i.e., what the root causes of deviations are. In this vein, root cause analysis techniques have been proposed, but their suitability for conformance deviations is uncertain due to a lack of appropriate evaluation data. To address this gap, this paper presents CauseCheck, a tool that simulates event logs with deviations from a process model for which the causes are known. In particular, the user defines such deviations and assigns corresponding root causes in the form of trace and event attributes. Thus, the logs can be used to show the ability to re-discover the known root causes for deviations.</p>
      </abstract>
      <kwd-group>
        <kwd>Process Mining</kwd>
        <kwd>Conformance Checking</kwd>
        <kwd>Root Cause Analysis</kwd>
        <kwd>Event Log Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-2">
      <title>1. Introduction</title>
      <p>
        Conformance checking compares a process model with the process executions recorded in an event
log [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In recent years, multiple conformance checking techniques have been developed,
such as rule checking, token-based replay, and alignments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As output, the techniques often
quantify the degree of conformance, so-called fitness. Some also provide more detailed insights.
For example, alignments identify which events have been inserted or skipped [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Existing conformance checking techniques can identify how and where process
executions deviate, but they do not provide any insight into why a deviation occurred [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
Having these insights into why such deviations occur could help process managers to prevent
deviations. Consequently, it would be desirable to support managers by deriving causes for the
deviations directly from event data, i.e., derive which attributes causally increase the likelihood
of deviations [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Although some techniques aim to unravel root causes for problems (e.g.,
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]), they have not been shown to detect such causes for conformance deviations.
      </p>
      <p>
        To assess and compare the quality of approaches that derive root causes for deviations,
appropriate evaluation data is required in the form of event logs that contain deviations from a
process model for which the root causes are available as a ground truth. This is not the case for
publicly available real-life event logs, for which such a ground truth of root causes for deviations
does not exist. That is a common problem when evaluating root cause analysis techniques, not
only when analysing causes of deviations. Thus, evaluations are often based on illustrations
on these real-life logs rather than comparisons of techniques’ capabilities to a ground truth
[
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. This lack of appropriate evaluation data can be countered by using simulated data with
a ground truth, which has not yet been done for conformance deviations.
      </p>
      <p>
        To address this gap, we present the CauseCheck tool for simulating deviations from a process
model in event logs for which the root cause is known. Given a process model as input that
captures the intended process behavior, the tool simulates an event log and synthetically injects
different types of deviations that can occur in a process. In particular, it allows users to use
deviation types based on five patterns commonly used to characterize deviations [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: inserted,
skipped, repeated, replaced, and swapped activities (or sequences thereof). These deviations are
assigned root causes in the form of trace and event attributes. For that, the user first defines
these trace and event attributes. Then, whenever a particular attribute has a particular value, a
deviation occurs with a user-defined likelihood. For example, one potential cause-deviation
pair could be “whenever the bank is equal to Bank A, activity Z is skipped in 50% of the traces
although it is required according to the model”. Further, users can define noise levels, i.e.,
random occurrences of deviations that are not attributed to a root cause. The tool returns an
event log that contains deviations and the root causes within the trace and event attributes.
      </p>
    </sec>
    <sec id="sec-3">
      <title>2. The CauseCheck Tool</title>
      <p>The CauseCheck tool is available at https://github.com/FrederikHake/CauseCheck. The
repository contains the source code, instructions on how to run the tool, and further
documentation. A demo video is available at
https://github.com/FrederikHake/CauseCheck/blob/main/Demo%20CauseCheck.mp4.</p>
      <sec id="sec-3-1">
        <title>2.1. Functional Components</title>
        <p>As illustrated in Fig. 1, setting up an evaluation experiment with CauseCheck requires a
process model as input. This process model describes the to-be behavior, and a playout of it is
synthetically modified to contain the deviations and their causes. First, the user defines general
characteristics of the desired event log as well as decision point probabilities within the process
model. After that, all event and trace attributes are created. Subsequently, the user
defines which deviations from the process model occur and by which trace and event attributes
they are caused. Finally, the user has the option to include a level of noise. As output, the tool
generates an XES file with the simulated deviations and causes. In the following, we present
the steps of the tool in more detail, using a loan application process as a running example.</p>
        <p>General Characteristics. After providing a process model that captures the intended
behavior in BPMN or PNML format, the user is asked to specify the time frame
of the event log, as illustrated in Fig. 2. Further, the size of the event log has to be defined.</p>
        <p>Decision Point Probabilities. Process models often include decision points at which only one of
multiple paths is followed. In some processes, certain paths are less common
than others, e.g., the cancellation of a loan application is less common than its eventual payout.
To account for these cases and define the intended behavior in detail, CauseCheck requires the
user to assign decision point probabilities to all XOR-choices in the process model. As shown in
Fig. 3, the tool displays the process model as a Petri net in the upper part of the
screen so that decision points can be identified. The user then defines a probability for each path. By
default, the simulation assumes an equal likelihood for all possible paths.</p>
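        <p>The role of decision point probabilities in a playout can be illustrated with a minimal sketch. It assumes a toy encoding of XOR-choices as a Python dictionary and does not reproduce the tool's actual Petri net handling; the names <monospace>DECISION_POINTS</monospace> and <monospace>choose_branch</monospace> as well as the branch labels and the 90/10 split are hypothetical. A decision point without user-defined probabilities falls back to an equal likelihood of all paths.</p>
        <preformat><![CDATA[
```python
import random

# Hypothetical XOR decision points of a loan process.
# A dict maps each outgoing path to its probability; a plain list of paths
# means "no probabilities defined", i.e. all paths are equally likely.
DECISION_POINTS = {
    "after_assessment": {"pay_out": 0.9, "cancel": 0.1},
    "after_submission": ["pre_accept", "decline"],
}


def choose_branch(point, rng):
    """Pick exactly one outgoing path of an XOR decision point."""
    paths = DECISION_POINTS[point]
    if isinstance(paths, list):
        # Default behavior: equal likelihood of all possible paths.
        return rng.choice(paths)
    if abs(sum(paths.values()) - 1.0) > 1e-9:
        raise ValueError(f"probabilities at {point!r} must sum to 1")
    labels = list(paths)
    return rng.choices(labels, weights=[paths[p] for p in labels], k=1)[0]


rng = random.Random(42)
picks = [choose_branch("after_assessment", rng) for _ in range(10_000)]
cancel_share = picks.count("cancel") / len(picks)
print(f"share of cancelled applications: {cancel_share:.3f}")
```
]]></preformat>
        <p>With enough simulated traces, the share of cancelled applications approaches the configured probability of the rarer path.</p>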
        <p>Trace &amp; Event Attributes. The third page of the application prompts the user to define
trace and event attributes, which can later be selected as root causes for deviations, starting
with the event attributes. The user can select attributes of the types numerical, categorical,
and time-related from a drop-down menu. For every event attribute, the user defines which
attribute values are possible and how likely each value is for each activity the event is
associated with. Only the timestamp attribute is mandatory; it must be selected before the
user can proceed to the next page of the application. After all event attributes have been
defined, the tool shows a summary of them. The user is then prompted to define trace attributes
in the same way; this page is optional. Again, the user selects a trace attribute type from the
drop-down menu and defines all possible attribute values and their corresponding likelihoods.
For instance, Fig. 4 shows the trace attribute “Bank” with the three possible values “Bank A”,
“Bank B”, and “Bank C”.</p>
        <p>Figure 4: Trace &amp; Event Attributes</p>
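        <p>As a rough illustration of how such attribute definitions could be turned into data, the following sketch samples one trace attribute and one activity-dependent event attribute. The dictionary layout, the helper names, the attribute “Resource”, and the one-minute spacing of timestamps are assumptions made for this example; only the attribute “Bank” with its three values comes from the running example, and the split between “Bank B” and “Bank C” is assumed.</p>
        <preformat><![CDATA[
```python
import random
from datetime import datetime, timedelta

# Trace attribute of the running example (value -> likelihood).
TRACE_ATTRIBUTES = {"Bank": {"Bank A": 0.5, "Bank B": 0.3, "Bank C": 0.2}}

# Hypothetical categorical event attribute; the distribution of its values
# depends on the activity the event is associated with.
EVENT_ATTRIBUTES = {
    "Resource": {
        "A_SUBMITTED": {"online form": 1.0},
        "A_ACCEPTED": {"clerk": 0.7, "manager": 0.3},
    }
}


def sample(distribution, rng):
    """Draw one value from a {value: likelihood} mapping."""
    values = list(distribution)
    return rng.choices(values, weights=list(distribution.values()), k=1)[0]


def simulate_trace(activities, start, rng):
    """Attach sampled trace and event attributes to a sequence of activities."""
    trace_attrs = {name: sample(dist, rng) for name, dist in TRACE_ATTRIBUTES.items()}
    events = []
    for i, activity in enumerate(activities):
        # The timestamp is the only mandatory event attribute.
        event = {"concept:name": activity,
                 "time:timestamp": start + timedelta(minutes=i)}
        for name, per_activity in EVENT_ATTRIBUTES.items():
            if activity in per_activity:
                event[name] = sample(per_activity[activity], rng)
        events.append(event)
    return {"attributes": trace_attrs, "events": events}


rng = random.Random(1)
start = datetime(2024, 1, 1, 9, 0)
traces = [simulate_trace(["A_SUBMITTED", "A_ACCEPTED"], start, rng) for _ in range(5_000)]
bank_a_share = sum(t["attributes"]["Bank"] == "Bank A" for t in traces) / len(traces)
print(f"share of traces with Bank A: {bank_a_share:.3f}")
```
]]></preformat>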
        <p>
          Deviations &amp; Causes. In the next step, the user defines the deviations and the corresponding
ground truth of causes. For that, users can choose from five commonly used deviation types [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]:
(1) inserted: an activity (or a sequence) is executed in addition to the intended model behavior
at a point in the trace, which is defined by the user
(2) skipped: an activity (or a sequence) is not executed although required in the model
(3) repeated: an activity (or a sequence) is wrongly re-executed after it has been previously
executed in accordance with the model
(4) replaced: an activity (or a sequence) is executed instead of another activity (or a sequence)
(5) swapped: two activities (or sequences) are performed in the wrong order
        </p>
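        <p>The five deviation types listed above can be read as simple operations on a trace. The sketch below treats a trace as a Python list of activity names; the function names and the choice to repeat a sequence directly after its first occurrence are assumptions for illustration, not the tool's implementation. The example trace uses activity names of the 12A process.</p>
        <preformat><![CDATA[
```python
def _find(trace, activities):
    """Return the index of the first occurrence of a contiguous activity sequence."""
    n = len(activities)
    for i in range(len(trace) - n + 1):
        if trace[i:i + n] == list(activities):
            return i
    raise ValueError(f"{activities} does not occur in the trace")


def insert(trace, position, activities):
    """inserted: extra activities are executed at a user-defined position."""
    return trace[:position] + list(activities) + trace[position:]


def skip(trace, activities):
    """skipped: a required activity (or sequence) is not executed."""
    i = _find(trace, activities)
    return trace[:i] + trace[i + len(activities):]


def repeat(trace, activities):
    """repeated: the sequence is executed again (here: directly after itself)."""
    end = _find(trace, activities) + len(activities)
    return trace[:end] + list(activities) + trace[end:]


def replace(trace, old, new):
    """replaced: another activity (or sequence) is executed instead."""
    i = _find(trace, old)
    return trace[:i] + list(new) + trace[i + len(old):]


def swap(trace, first, second):
    """swapped: two non-overlapping sequences are performed in the wrong order."""
    i, j = _find(trace, first), _find(trace, second)
    if i > j:
        i, j, first, second = j, i, second, first
    if i + len(first) > j:
        raise ValueError("sequences overlap")
    return (trace[:i] + list(second) + trace[i + len(first):j]
            + list(first) + trace[j + len(second):])


MODEL_TRACE = ["A_SUBMITTED", "A_PARTLYSUBMITTED", "A_PREACCEPTED", "A_ACCEPTED",
               "A_FINALIZED", "A_APPROVED", "A_REGISTERED", "A_ACTIVATED"]

skipped = skip(MODEL_TRACE, ["A_PARTLYSUBMITTED"])
swapped = swap(MODEL_TRACE, ["A_ACCEPTED"], ["A_FINALIZED"])
print(skipped)
print(swapped)
```
]]></preformat>
        <p>Each function returns a new list and leaves the conforming trace untouched, so the same playout can serve as the basis for both a deviating and a deviation-free log.</p>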
        <p>For each deviation, the user enters a unique identifier and the activity (or sequence) it
refers to. For example, the user can specify that activity “A_partly submitted” is skipped.
Then, one or multiple causes of the deviation are defined. For that, users can select any trace
or event attribute and choose a particular value as the cause. Further, a likelihood of deviation
occurrence is required. For example, as illustrated in Fig. 5, the user can specify that
“A_partly submitted” is skipped with a likelihood of 50% whenever the trace attribute “Bank” is
equal to “Bank A”. This likelihood corresponds to the causal effect of the attribute value
“Bank A” on the skip. After specifying these details, the user adds the new deviation. This can
be repeated for any number of deviations.</p>
        <p>Figure 5: Deviations &amp; Causes</p>
        <p>Noise. In the final step, the user can add noise to the event log. Noise is defined as
random occurrences of deviations with no associated cause. The user can add general noise (i.e.,
random occurrence of an undefined deviation), type-specific noise (i.e., random occurrence of a
deviation of a given type, such as skipped), and deviation-specific noise (i.e., random occurrence
of a previously defined deviation). For each option, the level of noise is assigned as a probability.</p>
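        <p>The interplay of a cause-deviation pair and noise can be sketched as follows. The rule format, the 50/50 split between two banks, the noise level of 2%, and the way caused and noisy deviations are combined are assumptions for illustration; the source does not specify how the tool implements this step. Every trace keeps a label stating whether its deviation stems from the defined cause or from noise, which plays the role of the ground truth.</p>
        <preformat><![CDATA[
```python
import random

MODEL_TRACE = ["A_SUBMITTED", "A_PARTLYSUBMITTED", "A_PREACCEPTED"]

# Hypothetical encoding of the cause-deviation pair from the running example.
DEVIATION = {
    "id": "skip_partly_submitted",
    "type": "skipped",
    "activity": "A_PARTLYSUBMITTED",
    "cause": {"attribute": "Bank", "value": "Bank A"},
    "likelihood": 0.5,
}
NOISE_LEVEL = 0.02  # assumed deviation-specific noise probability


def simulate_log(n_traces, rng):
    log = []
    for _ in range(n_traces):
        bank = rng.choice(["Bank A", "Bank B"])  # assumed equal shares
        cause_active = bank == DEVIATION["cause"]["value"]
        # The deviation occurs with the defined likelihood if the cause is present ...
        caused = cause_active and rng.random() < DEVIATION["likelihood"]
        # ... and otherwise only as noise, independent of any attribute.
        noise = (not caused) and rng.random() < NOISE_LEVEL
        events = list(MODEL_TRACE)
        if caused or noise:
            events.remove(DEVIATION["activity"])
        label = "caused" if caused else "noise" if noise else "none"
        log.append({"Bank": bank, "events": events, "label": label})
    return log


def skip_rate(log, bank):
    """Share of traces of one bank in which the activity is missing."""
    traces = [t for t in log if t["Bank"] == bank]
    return sum(DEVIATION["activity"] not in t["events"] for t in traces) / len(traces)


log = simulate_log(20_000, random.Random(7))
rate_a = skip_rate(log, "Bank A")
rate_b = skip_rate(log, "Bank B")
print(f"skip rate for Bank A: {rate_a:.3f}, for Bank B: {rate_b:.3f}")
```
]]></preformat>
        <p>Under these assumptions, the skip rate should be close to 0.51 for “Bank A” (0.5 from the cause plus noise on the remaining half) and close to 0.02 for “Bank B”. A root cause analysis technique applied to such a log would be expected to attribute the difference to the attribute value “Bank A”.</p>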
        <p>On the last screen of the application, the user can download the simulated event log
with all its deviations and causes. Both the event log containing the defined deviations
and a deviation-free log are available for download.</p>
      </sec>
      <sec id="sec-3-2">
        <title>2.2. Tool Architecture</title>
        <p>
          The CauseCheck tool features a Python-based back-end and a React-based front-end that
communicate through request and response mechanisms. The back-end uses Flask-Session
so that a session remains available on the back-end even if the front-end is closed during use. Further,
PM4Py [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] handles the process model and its playout. The front-end, built with TypeScript,
incorporates the Material UI React component library for flexibility and easy customization.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>3. Maturity</title>
      <p>
        We applied the tool to process models of the BPI Challenges 2012 (sub-process with A_
activities only; 12A) and 2020 (International Declarations; Int.) obtained from [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. 12A is rather
straightforward with only 10 activities whereas Int. is more complex with 34 activities and a
potential loop. Based on these models, we simulated 32 logs in total, 16 logs each for 12A and
Int. In particular, we inserted either 5, 10, 15, or 20 different deviations into logs with either
100, 1,000, 10,000, or 100,000 traces. Executing these 4 × 4 combinations led to 16 logs per
model, which we uploaded to our repository. The execution times in seconds for all 32 logs in Tab. 1
indicate reasonable computational efficiency. The times are not influenced by the number
of deviations and scale approximately linearly with the number of traces. For Int., execution
takes substantially longer due to the more complex process model but is still reasonable at 2 hours.
      </p>
      <p>To show the functionality of CauseCheck, consider Fig. 6. It illustrates the occurrences of a
skip of the first activity in the 12A log within the simulation of 1,000 traces. This activity should
occur in every trace but is synthetically skipped in 50% of the traces associated with Bank A.
Since only 50% of the traces are associated with Bank A, the deviation should on average occur
in 25% of the traces. In our simulation, the skip occurs in 233 of 1,000 traces, which, allowing
for the randomness of the sampling, is consistent with the expected number of deviating
traces. Further, consider Fig. 7, which shows the alignment of a different deviating trace.
Concretely, A_FINALIZED and A_ACCEPTED are swapped with each other, visible as a log
and model move on A_FINALIZED with a synchronous move on A_ACCEPTED in between.</p>
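      <p>Figure 7 (alignment of the deviating trace; ≫ marks a move without a counterpart).
Log: A_SUBMITTED, A_PARTLYSUBMITTED, A_PREACCEPTED, A_FINALIZED, A_ACCEPTED, ≫, A_APPROVED, A_REGISTERED, A_ACTIVATED.
Model: A_SUBMITTED, A_PARTLYSUBMITTED, A_PREACCEPTED, ≫, A_ACCEPTED, A_FINALIZED, A_APPROVED, A_REGISTERED, A_ACTIVATED.</p>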
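      <p>The plausibility of the 233 observed skips in the Fig. 6 example can be checked with a short calculation. The sketch assumes that traces are simulated independently and that no noise is added, so that the number of skips follows a binomial distribution; the variable names are chosen for this example only.</p>
      <preformat><![CDATA[
```python
import math

n_traces = 1_000
p_bank_a = 0.5         # share of traces associated with Bank A
p_skip_given_a = 0.5   # likelihood of the skip when Bank equals Bank A
observed = 233         # skips observed in the simulated log

p_skip = p_bank_a * p_skip_given_a                   # overall skip probability
expected = n_traces * p_skip                         # expected number of skips
std = math.sqrt(n_traces * p_skip * (1 - p_skip))    # binomial standard deviation
z = (observed - expected) / std                      # distance in standard deviations

print(f"expected {expected:.0f}, std {std:.1f}, z-score {z:.2f}")
```
]]></preformat>
      <p>The expected count is 250 with a standard deviation of about 13.7, so the observed 233 skips lie roughly 1.2 standard deviations below the expectation, well within ordinary sampling variation.</p>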
    </sec>
    <sec id="sec-5">
      <title>4. Conclusion</title>
      <p>We presented CauseCheck, a tool for generating synthetic
event logs with conformance deviations for which the
ground truth of causes is known. It aims to provide a
realistic simulation by assigning probabilities to decision
points and incorporating noise. This allows researchers
to evaluate techniques that aim to uncover root causes for
conformance deviations. In particular, the output of root
cause analysis techniques can be compared to the ground
truth, quantifying whether the correct causes for the
deviations are detected. In the future, we want to analyze the
capabilities of existing techniques to re-discover deviation causes
and potentially propose our own solution for this task.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,
          <source>Conformance Checking - Relating Processes and Models</source>
          , Springer,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grohs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Rehse</surname>
          </string-name>
          ,
          <article-title>Attribute-based conformance diagnosis: Correlating trace attributes with process conformance</article-title>
          ,
          <source>in: ICPM Workshops</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>203</fpage>
          -
          <lpage>215</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>Qafari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Feature recommendation for structural equation model discovery in process mining</article-title>
          ,
          <source>Prog Artif Intell</source>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>25</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Bozorgi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Teinemaa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>La Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polyvyanyy</surname>
          </string-name>
          ,
          <article-title>Process mining meets causal machine learning: Discovering causal rules from event logs</article-title>
          ,
          <source>in: ICPM</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hosseinpour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jans</surname>
          </string-name>
          ,
          <article-title>Auditors' categorization of process deviations</article-title>
          ,
          <source>Journal of Information Systems</source>
          <volume>38</volume>
          (
          <year>2024</year>
          )
          <fpage>67</fpage>
          -
          <lpage>89</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>van Zelst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process mining for Python (PM4Py): Bridging the gap between process- and data science</article-title>
          ,
          <source>ICPM Demos</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Grohs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pfeiffer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Rehse</surname>
          </string-name>
          ,
          <article-title>Business process deviation prediction: Predicting nonconforming process behavior</article-title>
          ,
          <source>in: ICPM</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>113</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>