<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lumigi: Shining Light on Your Process Data (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lotte Vugs</string-name>
          <email>lotte@wavespi.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maarten van Asseldonk</string-name>
          <email>maarten@wavespi.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Niek van Son</string-name>
          <email>niek@wavespi.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Waves Process Intelligence</institution>
          ,
          <addr-line>Eindhoven</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Process mining techniques use event logs to discover process models, analyze performance, check compliance, and predict process outcomes. Since an event log is the key input for process mining, its quality is of paramount importance for the value obtained from any process mining analysis. However, data quality issues can arise while preparing event logs, e.g., inaccurate timestamps or imprecise activity names. To ensure that the insights obtained from a process mining analysis are accurate, the quality of the process data must therefore be validated. Yet few structured approaches are available to analyze process data quality. In this paper, we present Lumigi, a freely available, stand-alone tool developed to fill the gap between the need for a structured approach to validate process data quality and the tools available to business users.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        Process mining consists of a set of methods, tools and
techniques to discover process models, analyze performance, check
compliance and compare variants using event data recorded in
information systems [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The key input for process mining
is called an event log. In order to conduct a process mining
analysis, event data needs to be collected and transformed
into an event log. The quality of the event log is of utmost
importance to derive maximum value from any process mining
analysis. In the process of creating the event log, various
data quality issues can arise, like inaccurate timestamps or
imprecise activity names [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Although various data quality issues can reside in the event
log, there are few practical guidelines for business users to
assess the quality of an event log in a structured way. Without
such guidelines, a process mining analyst is forced to rely
on personal experience and trial and error to find data
quality issues. Failing to find these issues may result in
conclusions that are misleading or even flat-out wrong.</p>
      <p>This paper introduces Lumigi, a freely available,
standalone tool that helps users to detect data quality issues for
process mining through a structured approach. Lumigi is
designed as a complementary tool to existing process mining
tools, to support process miners in the final stages of data
transformation and the first stages of the process analysis.</p>
      <p>The remainder of this paper is structured as follows. In
Section II, we describe the related work. In Section III, we
outline the components of the tool. Section IV contains the
conclusion of the paper. In Section V, we outline our next steps.
Lastly, in Section VI, we thank those that supported us in creating
this tool.</p>
      <fig id="fig-1">
        <label>Fig. 1.</label>
        <caption>
          <p>Screenshot of the overview screen, depicting the different perspectives Lumigi features to analyze process data quality.</p>
        </caption>
      </fig>
    </sec>
    <sec id="sec-2">
      <title>II. RELATED WORK</title>
      <p>
        For a general overview of the state-of-the-art of data quality
in process mining, we refer the reader to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Furthermore, to
the best of our knowledge, the only other tool focused on
data quality assessment in process mining is the R package
“DaQAPO”, now part of bupaR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Lumigi’s key distinguishing feature is its focus on business users.
It is based on state-of-the-art research, combined with
experience from practice. To give its users a first grasp of the data
quality, Lumigi features the quality dimensions Completeness,
Timeliness and Complexity, for which we adopted the
corresponding definitions presented in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Furthermore, for
Batching, Tangling Activities, and Similarity, inspiration for
Lumigi is drawn from [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the concept of data
imperfection patterns is introduced, together with a list of such
patterns. In [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], some of these imperfection
patterns are defined in more detail using pseudocode. Which
data patterns can be found with Lumigi, and how, is outlined
in Lumigi’s documentation: www.lumigi.io/documentation.
      </p>
    </sec>
    <sec id="sec-3">
      <title>III. TOOL OUTLINE</title>
      <p>Lumigi is a freely available tool, aimed at business users.
It is a stand-alone process mining tool focused on offering a
structured approach to analyze process data quality. It is
complemented with documentation that explains the
different metrics with examples and tips, and that lists possible root
causes for the data quality issues found, offering
a framework for identifying such root causes.
A screencast demonstrating the tool is available.<sup>1</sup></p>
      <p>Among the perspectives Lumigi offers, Tangling Activities covers
parallel activities and activities with a lot of predecessors and
successors (called Flower Activities). With Similarity, the analyst
can search for synonymous activity names.</p>
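      <p>To make the Similarity perspective more concrete, the sketch below shows one possible way to flag candidate synonymous activity names. It is an illustration only and not Lumigi's implementation: the use of string similarity from Python's standard difflib module, the threshold of 0.9, the helper names, and the example labels are all assumptions made for this example. True synonyms with unrelated spellings would not be found this way.</p>
      <preformat><![CDATA[
```python
from difflib import SequenceMatcher
from itertools import combinations

# Invented activity labels; three of them are spelling variants of one activity.
ACTIVITY_NAMES = [
    "Approve order",
    "APPROVE ORDER",
    "Aprove order",
    "Ship order",
    "Create invoice",
]


def normalise(name):
    """Lower-case a label and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def similar_pairs(names, threshold=0.9):
    """Return label pairs whose normalised similarity ratio meets the threshold."""
    pairs = []
    for first, second in combinations(names, 2):
        score = SequenceMatcher(None, normalise(first), normalise(second)).ratio()
        if score >= threshold:
            pairs.append((first, second, score))
    return pairs


candidates = similar_pairs(ACTIVITY_NAMES)
for first, second, score in candidates:
    print(f"{first!r} ~ {second!r} (ratio {score:.2f})")
```
]]></preformat>
      <p>The pairs found are only candidates; whether two labels really denote the same activity remains a judgment for the analyst.</p>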
      <sec id="sec-3-1">
        <title>A. Input Configuration</title>
        <p>Lumigi uses a CSV file as input, after which the user is
asked to specify which column represents the case identifier,
the activity name, the timestamp, and, optionally, the resource.</p>
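        <p>As an illustration of this configuration step, the following Python sketch reads a CSV file and maps user-chosen columns to the four roles named above. It is a minimal sketch under stated assumptions and does not reflect how Lumigi is implemented: the column headers, the sample rows, the timestamp format, and the function name load_event_log are all hypothetical.</p>
        <preformat><![CDATA[
```python
import csv
import io
from datetime import datetime

# Hypothetical mapping from role (as named in the text) to CSV column header.
COLUMN_ROLES = {
    "case": "order_id",
    "activity": "step",
    "timestamp": "completed_at",
    "resource": "employee",  # optional role; may be left out of the mapping
}

# Invented sample data standing in for the user's CSV file.
SAMPLE_CSV = """order_id,step,completed_at,employee
1001,Create order,2021-03-01 09:00:00,Ann
1001,Approve order,2021-03-01 09:30:00,Bob
1002,Create order,2021-03-01 10:15:00,Ann
"""


def load_event_log(handle, roles, timestamp_format="%Y-%m-%d %H:%M:%S"):
    """Read CSV rows and return events keyed by role instead of column header."""
    reader = csv.DictReader(handle)
    fieldnames = reader.fieldnames or []
    missing = [column for column in roles.values() if column not in fieldnames]
    if missing:
        raise ValueError(f"columns not found in CSV: {missing}")
    events = []
    for row in reader:
        event = {role: row[column] for role, column in roles.items()}
        # Parse the timestamp so later checks can compare moments in time.
        event["timestamp"] = datetime.strptime(event["timestamp"], timestamp_format)
        events.append(event)
    return events


events = load_event_log(io.StringIO(SAMPLE_CSV), COLUMN_ROLES)
print(len(events), "events loaded")
```
]]></preformat>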
      </sec>
      <sec id="sec-3-2">
        <title>B. Tool Overview</title>
        <p>After configuration, all metrics are calculated, after which
the user is directed to the overview screen. This screen
summarizes the different perspectives that Lumigi features to
analyze process data quality. A screenshot of the overview
screen is shown in Figure 1. The menu can be used to navigate
to a specific feature.</p>
        <p>First, the data quality dimensions Completeness, Timeliness,
and Complexity can be used to get a general grasp of the data.
Completeness can be described as having all the event data that
is necessary for the task at hand. This is analyzed by assessing
the fraction of missing values in each column and
the timestamp granularity. Timeliness measures how
current the data is, and whether it falls within the expected time frame.
In Lumigi, this is assessed by visualizing the number of events
over time. Complexity focuses on the structuredness of the
data. At the activity level, Lumigi features, among other things,
the total number of activities, the set of start activities, and the
set of end activities. At the variant level, Lumigi shows the
number of variants, the number of variants that occur only
once, and a plot depicting the fraction of cases per variant.
The three quality dimensions featured by Lumigi are
designed to surface relatively easy-to-find outliers before moving
on to more complex data quality patterns. Examples of these
easier-to-find outliers are columns with a lot of missing data,
timestamp outliers, and incorrect start or end activities.</p>
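        <p>The metrics described above can be approximated with a few lines of Python. The sketch below is illustrative only and is not taken from Lumigi: the toy event log, the tuple layout, the use of an empty string for a missing value, and all function names are assumptions. It computes the fraction of missing values per column (Completeness), the number of events per day (the data behind a Timeliness plot), and the activity-level and variant-level figures (Complexity).</p>
        <preformat><![CDATA[
```python
from collections import Counter, defaultdict

# Invented toy log: (case, activity, timestamp, resource); "" marks a missing value.
EVENTS = [
    ("c1", "Create", "2021-03-01T09:00", "Ann"),
    ("c1", "Approve", "2021-03-01T09:30", ""),
    ("c1", "Ship", "2021-03-02T08:00", "Bob"),
    ("c2", "Create", "2021-03-01T10:00", "Ann"),
    ("c2", "Approve", "2021-03-01T11:00", ""),
    ("c2", "Ship", "2021-03-03T08:00", "Bob"),
    ("c3", "Create", "2021-03-02T09:00", "Ann"),
    ("c3", "Ship", "2021-03-02T12:00", "Bob"),
]
COLUMNS = ("case", "activity", "timestamp", "resource")


def missing_fraction(events):
    """Completeness: fraction of empty values per column."""
    total = len(events)
    return {
        name: sum(1 for event in events if not event[index]) / total
        for index, name in enumerate(COLUMNS)
    }


def events_per_day(events):
    """Timeliness: event counts per calendar day (first 10 characters of the timestamp)."""
    return Counter(event[2][:10] for event in events)


def traces(events):
    """Activities per case in timestamp order (same-format ISO strings sort correctly)."""
    per_case = defaultdict(list)
    for case, activity, _, _ in sorted(events, key=lambda event: event[2]):
        per_case[case].append(activity)
    return {case: tuple(activities) for case, activities in per_case.items()}


def complexity(events):
    """Complexity: activity-level and variant-level summary figures."""
    case_traces = traces(events)
    variants = Counter(case_traces.values())
    return {
        "activities": {event[1] for event in events},
        "start_activities": {trace[0] for trace in case_traces.values()},
        "end_activities": {trace[-1] for trace in case_traces.values()},
        "variant_count": len(variants),
        "single_case_variants": sum(1 for n in variants.values() if n == 1),
        "case_fraction_per_variant": {
            variant: n / len(case_traces) for variant, n in variants.items()
        },
    }


stats = complexity(EVENTS)
print(missing_fraction(EVENTS))
print(dict(events_per_day(EVENTS)))
print(stats["variant_count"], "variants,", stats["single_case_variants"], "occurring once")
```
]]></preformat>
        <p>In this toy log, the resource column stands out because a quarter of its values are missing, which is the kind of easy-to-find outlier the three dimensions are meant to expose.</p>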
        <p>
          Second, more complex patterns are analyzed to find data
quality issues. With Batching, we look at events that are
recorded at (almost) the same moment in time. Using their
domain knowledge, the analyst can then reflect on whether the
batching is intended, or whether there is an underlying data quality
issue that manifests itself through batching behavior. Here, we
distinguish between-case batching and within-case batching.
In academic literature, these concepts are better known as
inter-case batching and intra-case batching [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. However, we
changed the terminology, as we felt it was understood better
by the business users evaluating our tool.
        </p>
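        <p>A simple way to picture the two kinds of batching is to pair up events whose timestamps lie within a small tolerance of each other, and to label a pair as within-case when both events belong to the same case and as between-case otherwise. The Python sketch below follows this reading. It is an assumption-laden illustration, not Lumigi's metric: the one-second tolerance, the pairwise comparison, the function name batched_pairs, and the toy events are all invented for this example.</p>
        <preformat><![CDATA[
```python
from datetime import datetime, timedelta
from itertools import combinations

# Assumed threshold for "(almost) the same moment in time".
TOLERANCE = timedelta(seconds=1)

# Invented toy events: (case, activity, timestamp).
EVENTS = [
    ("c1", "Register", datetime(2021, 3, 1, 9, 0, 0)),
    ("c1", "Check", datetime(2021, 3, 1, 9, 0, 0)),      # same case, same second
    ("c2", "Register", datetime(2021, 3, 1, 17, 0, 0)),
    ("c3", "Register", datetime(2021, 3, 1, 17, 0, 1)),  # other case, one second later
    ("c2", "Check", datetime(2021, 3, 2, 11, 0, 0)),
]


def batched_pairs(events, tolerance=TOLERANCE):
    """Split near-simultaneous event pairs into within-case and between-case lists.

    Every pair is compared, which is quadratic in the number of events and
    therefore only suitable for small logs.
    """
    within_case, between_case = [], []
    ordered = sorted(events, key=lambda event: event[2])
    for first, second in combinations(ordered, 2):
        if second[2] - first[2] > tolerance:
            continue
        target = within_case if first[0] == second[0] else between_case
        target.append((first, second))
    return within_case, between_case


within, between = batched_pairs(EVENTS)
print(len(within), "within-case pair(s),", len(between), "between-case pair(s)")
```
]]></preformat>
        <p>Whether a flagged pair reflects intended batching or a data quality issue cannot be decided by such a check; that judgment still relies on the analyst's domain knowledge.</p>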
        <p>With Tangling Activities, the analyst can identify activities
that are suspected of causing tangles in process graphs, which make
process discovery analysis more difficult. With this knowledge
in mind, an analyst can decide to exclude some of these
tangling activities in the first stage of the analysis, to structure
the process graph. For this, metrics are designed to find
parallel activities and Flower Activities.<fn id="fn-1"><label>1</label><p>A screencast demonstrating Lumigi: https://www.youtube.com/watch?v=I_eD36HlZHs</p></fn></p>
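        <p>One way to look for activities with a lot of predecessors and successors is to count, for every activity, the distinct activities that directly precede and directly follow it. The Python sketch below does this on a toy set of traces. It is a hedged illustration rather than Lumigi's actual metric: the traces, the function names, and the cut-off of three distinct neighbours on each side are assumptions.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict

# Invented traces: case identifier -> ordered list of activities.
TRACES = {
    "c1": ["Create", "Add note", "Approve", "Add note", "Ship"],
    "c2": ["Create", "Approve", "Add note", "Ship"],
    "c3": ["Add note", "Create", "Approve", "Ship", "Add note"],
}


def neighbour_counts(traces):
    """Map each activity to (distinct direct predecessors, distinct direct successors)."""
    predecessors, successors = defaultdict(set), defaultdict(set)
    for activities in traces.values():
        for current, following in zip(activities, activities[1:]):
            successors[current].add(following)
            predecessors[following].add(current)
    names = {activity for activities in traces.values() for activity in activities}
    return {name: (len(predecessors[name]), len(successors[name])) for name in names}


def flower_candidates(traces, min_neighbours=3):
    """Activities with at least min_neighbours distinct predecessors and successors."""
    counts = neighbour_counts(traces)
    return sorted(
        name
        for name, (n_pred, n_succ) in counts.items()
        if n_pred >= min_neighbours and n_succ >= min_neighbours
    )


counts = neighbour_counts(TRACES)
print(counts)
print("candidates:", flower_candidates(TRACES))
```
]]></preformat>
        <p>In the toy traces, only "Add note" can occur before and after every other activity, so it is the one candidate an analyst might consider excluding in the first stage of the analysis.</p>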
      </sec>
    </sec>
    <sec id="sec-conclusion">
      <title>IV. CONCLUSION</title>
      <p>The quality of process data is of utmost importance to derive
maximum value from a process mining analysis. However, as
process miners working in practice, we found few structured
approaches to analyze it. To fill this gap, this paper introduces
Lumigi, a freely available, stand-alone tool focused on process
data quality.</p>
    </sec>
    <sec id="sec-4">
      <title>V. NEXT STEPS</title>
      <p>Process data quality is a developing research field. Lumigi
is a first attempt to raise awareness about process data quality,
but it is certainly not without flaws. We keep a list of open
opportunities on our website, using the feedback we have
received from our users. For the most up-to-date list of
limitations of Lumigi, we refer the reader to our website:
www.lumigi.io.</p>
      <p>One of the limitations frequently mentioned is that Lumigi
lacks functionality to repair the event log. In our evaluations,
business users emphasized that such repairs should be performed as
close to the source as possible. We are therefore currently expanding our
focus to data transformation, and exploring ways to leverage
community knowledge to create high-quality event logs.</p>
    </sec>
    <sec id="sec-5">
      <title>VI. ACKNOWLEDGEMENT</title>
      <p>
        The main source of inspiration for developing Lumigi was
[
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Furthermore, the authors would like to thank the members
of the Business Process Management Group of the Queensland
University of Technology for their early-stage feedback, in
particular Robert Andrews, Arthur ter Hofstede, and Michael
Adams. We would also like to thank all process
miners who provided valuable feedback during the beta
release. The list is too long to include here, and
we would rather not risk forgetting to name some
of you, so we hope it suffices to thank you in this fashion.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <source>Process mining: Data science in action</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Suriadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>ter Hofstede</surname>
          </string-name>
          , and
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Wynn</surname>
          </string-name>
          , “
          <article-title>Event log imperfection patterns for process mining: Towards a systematic approach to cleaning event logs</article-title>
          ,”
          <source>Information Systems</source>
          , vol.
          <volume>64</volume>
          , pp.
          <fpage>132</fpage>
          -
          <lpage>150</lpage>
          ,
          <year>2017</year>
          . [Online]. Available: http://dx.doi.org/10.1016/j.is.2016.07.011
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Martin</surname>
          </string-name>
          , “Data Quality in Process Mining,” in Interactive Process Mining in Healthcare, C. Fernandez-Llatas, Ed. Cham: Springer International Publishing,
          <year>2021</year>
          , pp.
          <fpage>53</fpage>
          -
          <lpage>79</lpage>
          . [Online]. Available: https://doi.org/10.1007/978-3-030-53993-1_5
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] --, “
          <article-title>daqapo: Data Quality Assessment for Process-Oriented Data</article-title>
          ,”
          <year>2020</year>
          . [Online]. Available: https://cran.r-project.org/package=daqapo
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Verhulst</surname>
          </string-name>
          , “
          <article-title>Evaluating quality of event data within event logs: an extensible framework</article-title>
          ,”
          <source>Master Thesis</source>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. T.</given-names>
            <surname>Wynn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vallmuur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. H.</given-names>
            <surname>ter Hofstede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bosley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elcock</surname>
          </string-name>
          , and
          <string-name>
            <given-names>S.</given-names>
            <surname>Rashford</surname>
          </string-name>
          , “
          <article-title>Leveraging data quality to better prepare for process mining: An approach illustrated through analysing road trauma pre-hospital retrieval and transport processes in Queensland</article-title>
          ,”
          <source>International Journal of Environmental Research and Public Health</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>7</issue>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>