<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Measuring Generalization of Process Models Discovered from Event Logs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anandi Karunaratne</string-name>
          <email>anandik@student.unimelb.edu.au</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>The University of Melbourne</institution>
          ,
          <addr-line>Victoria 3010</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Generalization is a critical yet under-explored quality criterion for discovered process models in process mining. This research will enhance the understanding of generalization by examining the impact of event log characteristics on generalization estimations, developing measures applicable in a wide range of scenarios, and analyzing these measures.</p>
        <p>As the world becomes increasingly digital, many organizations rely on information systems, which record event logs containing traces of executed processes captured as sequences of performed actions. Process mining uses these event logs to study and improve the systems [1].</p>
      </abstract>
      <kwd-group>
        <kwd>generalization</kwd>
        <kwd>process models</kwd>
        <kwd>process mining</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and Motivation</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Research Questions and Project Roadmap</title>
      <p>This research will follow the Design Science Research Methodology [15]. It will start with a literature
review, followed by objective refinement, technique design, and evaluation using synthetic and
real-world datasets. Findings will be shared through publications, presentations, and the final thesis.</p>
      <p>We plan to address the following research questions:</p>
      <p>RQ1: How do event log characteristics impact the estimation of generalization of a discovered
process model?</p>
      <p>RQ2: How can the bootstrap framework be used to estimate the generalization of models discovered from
event logs?</p>
      <p>RQ3: What useful properties do generalization estimators based on the bootstrap framework possess?</p>
      <p>The subsequent sections detail each research question and present a plan to address it.</p>
      <sec id="sec-2-1">
        <title>2.1. RQ1: Impact of Log Characteristics</title>
        <p>Generalization, by definition, requires the study of unseen system behavior. When the system behavior
is unknown, existing generalization estimation methods use event logs to estimate the unseen system
behavior. Therefore, it is reasonable to assume that the effectiveness of this approach depends on
characteristics of log quality, such as representativeness and noise.</p>
        <p>Log representativeness reflects how well the log captures the true system behavior. A highly
representative log provides a comprehensive view of the system’s executions. In such cases, the log itself
might be a good indicator of the system behavior, while complex estimation techniques may offer
little additional value and could introduce unnecessary assumptions. Conversely, a less
representative log may miss important process information, leading to incomplete or biased insights
and necessitating effective estimation methods to infer the true underlying system behavior.</p>
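        <p>As a minimal illustration of this notion, the sketch below computes one possible proxy for representativeness: the share of distinct system trace variants that occur in a log. It assumes that the ground-truth variants are known, as in a controlled experiment with a synthetic system. The function name variant_coverage and the toy data are hypothetical and do not correspond to an established measure.</p>
        <preformat preformat-type="code">
```python
def variant_coverage(log, system_variants):
    """Fraction of the system's distinct trace variants that appear in the log.

    Assumption: the full set of system variants is known (synthetic setting).
    """
    observed = {tuple(trace) for trace in log}
    system = {tuple(trace) for trace in system_variants}
    if not system:
        raise ValueError("system_variants must not be empty")
    return len(observed.intersection(system)) / len(system)


# Toy ground-truth system with three variants; the log records only two of them.
system_variants = [("a", "b", "c"), ("a", "c", "b"), ("a", "d")]
log = [("a", "b", "c"), ("a", "b", "c"), ("a", "d")]

coverage = variant_coverage(log, system_variants)
print(f"variant coverage: {coverage:.2f}")
```
        </preformat>
        <p>A value close to 1 suggests that the log already reflects most of the system behavior, whereas lower values indicate that an estimator must infer behavior the log has not recorded.</p>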
        <p>Noise in event logs refers to inaccuracies or inconsistencies in the recorded data, such as incorrect,
missing, or misordered events. The presence of noise can distort our understanding of the system and
affect the reliability of generalization estimates. Too much noise could render any estimation attempt
ineffective, while moderate levels of noise might require careful preprocessing or robust generalization
estimation techniques. In the absence of noise, simple estimation methods might suffice, as the data
accurately reflects the process execution.</p>
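        <p>To study the effect of noise in a controlled way, noise can be injected into a clean log at a chosen rate. The sketch below shows one simple way to do this, assuming that noise amounts to either dropping one event or swapping two adjacent events in a trace. The function add_noise and its parameters are illustrative assumptions rather than an established noise model.</p>
        <preformat preformat-type="code">
```python
import random


def add_noise(log, noise_level, seed=0):
    """Return a copy of the log in which roughly a noise_level share of traces is distorted.

    Assumed noise model: a selected trace either loses one random event
    or has two adjacent events swapped.
    """
    rng = random.Random(seed)
    noisy_log = []
    for trace in log:
        events = list(trace)
        # Keep the trace unchanged with probability 1 - noise_level.
        if rng.random() >= noise_level:
            noisy_log.append(tuple(events))
            continue
        kind = rng.choice(["drop", "swap"])
        if kind == "drop" and len(events) >= 1:
            del events[rng.randrange(len(events))]
        elif kind == "swap" and len(events) >= 2:
            i = rng.randrange(len(events) - 1)
            events[i], events[i + 1] = events[i + 1], events[i]
        noisy_log.append(tuple(events))
    return noisy_log


clean_log = [("a", "b", "c"), ("a", "c", "b"), ("a", "b", "d")]
noisy_log = add_noise(clean_log, noise_level=1.0, seed=42)
print(noisy_log)
```
        </preformat>
        <p>Repeating a generalization estimation on logs produced with increasing noise_level values gives a simple way to observe how the estimates degrade.</p>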
        <p>These observations suggest that the approach to estimating generalization should be adaptive,
taking into account the specific qualities of the event log at hand. This research will investigate the impact of different
log quality levels on generalization estimation, develop techniques to assess log quality in relation to
generalization estimation, and explore adaptive generalization estimation methods that consider event
log characteristics.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. RQ2: Enhancing Bootstrap Generalization</title>
        <p>
          The research conducted to answer this question will aim to expand the class of systems for which one can
reliably estimate generalization using the bootstrap framework. The existing bootstrap generalization
estimation methods assume that the system can be described as a directly-follows graph [13]. We
will design and evaluate new event log sampling methods and bootstrap framework configurations
that allow efficient and effective estimation of generalization over more expressive generative systems,
such as those that can be captured using various subclasses of Petri net systems, including free-choice
and extended-free-choice systems [16]. To sample event logs, we will apply two data generation approaches:
block bootstrapping [17] and Sequence Generative Adversarial Networks (SGANs) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. To maximize
the effectiveness of the designed log sampling mechanisms, we will conduct empirical studies with
ground truth systems to understand which bootstrap configurations, for instance, the quantity and size of
log samples, yield more accurate generalization estimations.
        </p>
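        <p>To make the role of the bootstrap configuration concrete, the sketch below resamples traces with replacement, discovers a directly-follows graph from each sample, and checks how many traces left out of the sample the discovered graph can replay. This is a deliberately simplified out-of-bag variant written for illustration. It is not the estimator of [13], and all names (discover_dfg, fits, oob_generalization) and the toy log are assumptions.</p>
        <preformat preformat-type="code">
```python
import random

START, END = "START", "END"  # artificial markers for trace start and end


def discover_dfg(log):
    """Collect the directly-follows pairs observed in the log."""
    edges = set()
    for trace in log:
        path = [START, *trace, END]
        edges.update(zip(path, path[1:]))
    return edges


def fits(trace, edges):
    """A trace fits if every directly-follows step is an edge of the graph."""
    path = [START, *trace, END]
    return all(pair in edges for pair in zip(path, path[1:]))


def oob_generalization(log, num_samples=100, sample_size=None, seed=0):
    """Average share of held-out traces replayable by graphs discovered from samples.

    num_samples and sample_size correspond to the quantity and size of log samples.
    Returns None if no sample leaves any trace out.
    """
    rng = random.Random(seed)
    size = sample_size or len(log)
    scores = []
    for _ in range(num_samples):
        drawn = [rng.randrange(len(log)) for _ in range(size)]
        sample = [log[i] for i in drawn]
        held_out = [log[i] for i in set(range(len(log))) - set(drawn)]
        if not held_out:
            continue  # nothing unseen to evaluate on for this sample
        edges = discover_dfg(sample)
        accepted = sum(fits(trace, edges) for trace in held_out)
        scores.append(accepted / len(held_out))
    return sum(scores) / len(scores) if scores else None


toy_log = [("a", "b", "c"), ("a", "c", "b"), ("a", "b", "c"), ("a", "d"), ("a", "b", "c")]
print(oob_generalization(toy_log, num_samples=200))
```
        </preformat>
        <p>Varying num_samples and sample_size in such a sketch, and comparing the result with a known ground truth system, is one way to explore which configurations give more accurate estimates.</p>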
      </sec>
      <sec id="sec-2-3">
        <title>2.3. RQ3: Evaluating Bootstrap Generalization</title>
        <p>
          To answer this research question, we will study properties satisfied by the existing and new
generalization estimators grounded in the bootstrap framework. First, we will evaluate whether our generalization
estimators satisfy the desired properties discussed in the literature [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Then, we will apply mathematical
modeling and analysis methods to identify additional interesting properties the bootstrap generalization
estimators satisfy. By doing so, we will aim to understand whether these estimators are reliable and
meaningful for assessing the quality of models discovered from event logs. Consequently, we will
compile a list of essential properties that generalization measures and estimators should possess, thereby
supporting the process mining community in establishing standards and best practices for evaluating
process models.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Conclusion</title>
      <p>This research aims to design and evaluate new, effective ways to estimate the generalization of process
models discovered from event logs recorded by information systems. Specifically, using the bootstrap
generalization estimation framework [13], we will investigate how event log characteristics affect
generalization estimations, extend the framework to allow reliable generalization estimations for a
wide class of systems, and study the properties of generalization estimators grounded in the bootstrap
framework. These advancements will enhance the understanding of generalization, a critical but often
overlooked quality criterion of discovered process models.</p>
    </sec>
    <sec id="sec-4">
      <title>Acknowledgments</title>
      <p>This PhD project is supervised by Artem Polyvyanyy and Alistair Moffat from the University of
Melbourne.</p>
      <p>[11] A. F. Syring, N. Tax, W. M. P. van der Aalst, Evaluating conformance measures in process mining
using conformance propositions, Trans. Petri Nets and Other Models of Conc. XIV (2019) 192–221.</p>
      <p>[12] G. Janssenswillen, N. Donders, T. Jouck, B. Depaire, A comparative study of existing quality
measures for process discovery, Information Systems (2017) 1–15.</p>
      <p>[13] A. Polyvyanyy, A. Moffat, L. García-Bañuelos, Bootstrapping generalization of process models
discovered from event data, in: Int. Conf. Adv. Inf. Sys. Eng., 2022, pp. 36–54.</p>
      <p>[14] B. Efron, R. J. Tibshirani, An Introduction to the Bootstrap, Springer, 1993.</p>
      <p>[15] A. R. Hevner, S. T. March, J. Park, S. Ram, Design science in information systems research, MIS Q.
(2004) 75–105.</p>
      <p>[16] T. Murata, Petri nets: Properties, analysis and applications, Proceedings of the IEEE (1989)
541–580.</p>
      <p>[17] S. N. Lahiri, Theoretical comparisons of block bootstrap methods, The Annals of Statistics (1999)
386–404.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <source>Process Mining - Data Science in Action</source>
          , 2 ed., Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process discovery: An introduction</article-title>
          , in: Process Mining, Springer,
          <year>2011</year>
          , pp.
          <fpage>125</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process mining: A 360 degree overview</article-title>
          , in: Process Mining Handbook, Springer,
          <year>2022</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J. C. A. M.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity</article-title>
          ,
          <source>Int. J. Coop. Inf. Sys.</source>
          <volume>23</volume>
          (
          <year>2014</year>
          )
          <fpage>1440001:1</fpage>
          -
          <lpage>1440001:39</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adriansyah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <article-title>Replaying history on process models for conformance checking and performance analysis</article-title>
          ,
          <source>Wiley Interdisc. Reviews: Data Min. and Know. Disc.</source>
          <volume>2</volume>
          (
          <year>2012</year>
          )
          <fpage>182</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S. K. L. M.</given-names>
            <surname>vanden Broucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Weerdt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanthienen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Baesens</surname>
          </string-name>
          ,
          <article-title>Determining process model precision and generalization with weighted artificial negative events</article-title>
          ,
          <source>IEEE Trans. Know. and Data Eng.</source>
          <volume>26</volume>
          (
          <year>2014</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1889</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Carmona</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chatain</surname>
          </string-name>
          ,
          <article-title>A unified approach for measuring precision and generalization based on anti-alignments</article-title>
          , in:
          <source>Int. Conf. Bus. Proc. Manag.</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>39</fpage>
          -
          <lpage>56</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J.</given-names>
            <surname>Theis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Darabi</surname>
          </string-name>
          ,
          <article-title>Adversarial system variant approximation to quantify process model generalization</article-title>
          ,
          <source>IEEE Access</source>
          <volume>8</volume>
          (
          <year>2020</year>
          )
          <fpage>194410</fpage>
          -
          <lpage>194427</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Relating process models and event logs - 21 conformance propositions</article-title>
          , in:
          <source>Algorithms &amp; Theories for the Analysis of Event Data</source>
          , CEUR-WS.org,
          <year>2018</year>
          , pp.
          <fpage>56</fpage>
          -
          <lpage>74</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <article-title>Flexible evolutionary algorithms for mining structured process models</article-title>
          ,
          <source>PhD Thesis</source>
          , Technische Universiteit Eindhoven (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>