<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FilterTree: a Repeatable Branching XES Editor (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sander J.J. Leemans</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>RWTH</institution>
          ,
          <addr-line>Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>70</fpage>
      <lpage>74</lpage>
      <abstract>
        <p>A large fraction of process mining effort is spent on event data preparation: the step between data extraction and the subsequent analysis using process mining tools. Data preparation may be repetitive and is typically performed in a trial-and-error way. In this paper, we introduce the FilterTree XES and CSV editing tool, which supports the programmatic chaining of XES and CSV filters and thereby makes event data preparation repeatable. The FilterTree tool is platform-independent and open source.</p>
      </abstract>
      <kwd-group>
        <kwd>process mining</kwd>
        <kwd>event log filtering</kwd>
        <kwd>XES editor</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Common folklore in process mining holds that, of the time spent on process mining projects, 80%
goes to preparing event data and only 20% to analysis. Massaging event data into
an event log is a trial-and-error process, which may involve selecting activity columns, selecting
case columns, altering data types, combining columns, parsing timestamps, filtering, addressing
data quality issues, selecting activities, computing aggregate columns, etc. The repetitiveness
of this process is captured by several process mining methodologies, such as [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]: even in the
final phases of analysis, the data preparation may have to change, for instance after the discovery
of data quality issues or after the analysis questions have been adjusted, thus requiring a slightly
different view on the event data.
      </p>
      <p>In this paper, we propose a tool to import CSV or XES files and to edit XES event logs by
means of filters. A filter reads an XES (or CSV) file from disk and writes an adjusted XES file
to disk. The user organises filters into a filter tree, which specifies the filters with their
parameters. In a filter tree, most filters are sequential, that is, each is applied to the result of its
predecessor. It is also possible to branch the filter chain, such that a single filter may have more than
one subsequent filter.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Significance, Innovations &amp; Main Features</title>
      <p>
        Every process mining project starts with extracting event data. Except in straightforward
“projects” where public data is used in a standard process mining technique, event data
preparation is a necessary next step before the actual process analysis can commence. Many existing
tools can perform event data preparation and apply filters to an event log, and a few tools can
perform edit operations on XES logs [
        <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5">2, 3, 4, 5</xref>
        ]. To the best of our knowledge, there is no tool
that combines all of the following:
• XES-based, with support for CSV. Importing CSV files is critical, but advanced process
mining operations require XES concepts, such as trace attributes, log attributes, summing
trace outcomes, etc.
• Repeatable. The same set of filters can be applied to a new event log, thereby repeating
the analysis without manual filtering steps; slightly changing how an event log is prepared
likewise requires easy repeatability.
• Branchable. In process mining projects, several different perspectives may be necessary
to answer the analysis questions. These perspectives may require different event logs.
Branching allows parts of chains of filters to be re-used.
• Disk-based and file-manager friendly. A common problem is CSV or XES
files that are too big to visualise or to load into process mining tools. While filtering, logs
should be handled disk-to-disk, so as to support any log that fits on disk.
• Offline. An offline tool provides privacy and confidentiality, and does not need to upload
datasets.
• Extensible. There is always another, unsupported, filter, so the tool needs to be easily
extensible.
      </p>
      <p>The FilterTree tool aims to satisfy all of these properties.</p>
      <sec id="sec-2-1">
        <title>2.1. Notable Plug-ins</title>
        <p>CSV files can be either row-based or column-based; FilterTree has a plug-in for both. In a
row-based CSV file, each row represents an event. For the filter CSV to XES, the only
parameter necessary is the name(s) of the column(s) of the CSV file that contain the trace
identifier, that is, the column(s) that tell us which case each event belongs to. This column or
combination of columns is copied to the trace level as its concept:name.</p>
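        <p>The core of this row-based conversion can be sketched as follows. FilterTree itself is a Java tool; this is an illustrative Python sketch (function and column names are made up), showing how rows are grouped by their case-identifier column(s), with the combined value serving as the trace's concept:name.</p>

```python
import csv
import io
from collections import defaultdict

def csv_rows_to_traces(csv_text, case_columns):
    """Group event rows by their case-identifier column(s); the combined
    value identifies the trace, mirroring the idea behind CSV to XES."""
    traces = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # a combination of columns is joined into one trace identifier
        case_id = "|".join(row[c] for c in case_columns)
        traces[case_id].append(row)
    return dict(traces)

example = "case,activity\n1,register\n2,register\n1,pay\n"
traces = csv_rows_to_traces(example, ["case"])
```

        <p>Here, each key of the result would become the concept:name of a trace, and each grouped row one of its events.</p>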
        <p>In a column-based CSV file, each row represents a trace, and the columns contain timestamps
indicating when the activity belonging to that column was executed. This data structure is
often encountered in healthcare settings, where it is used in standard forms that log treatment
steps. The plug-in CSV to XES - trace per row converts such a file
into an XES event log, in which each row becomes a trace, every cell that holds a timestamp is
converted to an event (with the concept:name being the name of the column), and every cell that
does not parse as a timestamp becomes a trace attribute. This plug-in optionally takes a list of
Java-based timestamp formats, such as yyyy-M-d H:mm:ss.SSS. Both CSV plug-ins
set some default log attributes and attempt to guess the data type of each cell as accurately as
possible.</p>
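        <p>The trace-per-row idea can be sketched as follows; again a Python illustration rather than FilterTree's actual Java implementation, and note that FilterTree takes Java-style timestamp formats (e.g. yyyy-M-d H:mm:ss.SSS), whereas this sketch uses Python's strptime syntax. The row contents are made up.</p>

```python
from datetime import datetime

def row_to_trace(row, fmt="%Y-%m-%d %H:%M:%S"):
    """Every cell that parses as a timestamp becomes an event named after
    its column; every other cell becomes a trace attribute."""
    events, attributes = [], {}
    for column, cell in row.items():
        try:
            when = datetime.strptime(cell, fmt)
            events.append({"concept:name": column, "time:timestamp": when})
        except ValueError:
            attributes[column] = cell
    return {"attributes": attributes, "events": events}

row = {"patient": "p1",
       "intake": "2023-01-05 09:00:00",
       "surgery": "2023-01-07 13:30:00"}
trace = row_to_trace(row)
```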
        <p>Using the map events and map traces plug-ins, a particular event/trace attribute can be
transformed using a provided map (in a separate CSV file), and written as another event/trace
attribute.</p>
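        <p>Conceptually, such a mapping filter amounts to a dictionary lookup per event. A minimal Python sketch of the idea (attribute names and mapping contents are illustrative, not FilterTree's actual parameters):</p>

```python
import csv
import io

def map_attribute(events, mapping_csv, source, target):
    """Look up each event's source attribute in a two-column CSV mapping
    and write the result as the target attribute."""
    mapping = dict(csv.reader(io.StringIO(mapping_csv)))
    for event in events:
        if event.get(source) in mapping:
            event[target] = mapping[event[source]]
    return events

events = map_attribute(
    [{"concept:name": "reg"}, {"concept:name": "pay"}],
    "reg,Register\npay,Payment\n",
    "concept:name", "activity:long",
)
```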
        <p>The add start events filter copies each event, copies a chosen trace timestamp attribute
to time:timestamp of the new event, sets lifecycle:transition to start, and adds a
corresponding concept:instance to both events. The plug-in sort events sorts the events
based on time:timestamp.</p>
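        <p>The combination of these two filters can be sketched as follows, assuming the trace representation used above. This is an illustrative Python sketch, and the assumption that the original event becomes the complete event of the pair is ours, not stated by the tool's documentation.</p>

```python
from datetime import datetime

def add_start_events(trace, start_attribute):
    """For each event, emit a start copy whose time:timestamp comes from a
    chosen trace attribute, link the pair via concept:instance, then sort
    all events by time:timestamp (as sort events does)."""
    events = []
    for i, event in enumerate(trace["events"]):
        start = {**event,
                 "time:timestamp": trace["attributes"][start_attribute],
                 "lifecycle:transition": "start",
                 "concept:instance": str(i)}
        complete = {**event,
                    "lifecycle:transition": "complete",
                    "concept:instance": str(i)}
        events += [start, complete]
    events.sort(key=lambda e: e["time:timestamp"])
    return events

trace = {"attributes": {"admitted": datetime(2023, 1, 1)},
         "events": [{"concept:name": "surgery",
                     "time:timestamp": datetime(2023, 1, 7)}]}
events = add_start_events(trace, "admitted")
```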
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Usage</title>
      <sec id="sec-3-1">
        <title>3.1. File Format</title>
        <p>A filter tree is represented in a simple text file (with the .ftree extension). Comment lines start
with %. The first line of this file names the event log to import, and each line thereafter contains
one filter. On such a line, the name of the filter comes first, followed by the bar symbol |,
followed by the parameters necessary for the filter, separated by spaces. If a parameter contains
a space, it must be enclosed in double quotes. Indenting a filter line starts a new branch.</p>
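        <p>Following these rules, a filter tree file could look as below. This is a hypothetical example: the filenames and parameters are made up for illustration (the filter names are those of Section 2.1; see the tool's website for the actual parameters), and the indented line reflects one possible reading of the branching rule.</p>

```
% illustrative .ftree file; filenames and parameters are made up
log.csv
CSV to XES | "patient id"
sort events
    map events | codes.csv
```

        <p>Here log.csv is imported, converted to XES using the quoted column as trace identifier, and sorted; the indented map events line would start a new branch off the sorted log.</p>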
      </sec>
      <sec id="sec-3-2">
        <title>3.2. User Interface</title>
        <p>The user interface shows a filter tree file with syntax highlighting and auto-completion; a
screenshot is shown in Figure 1. When the user changes something by typing, the filters are
automatically (re-)computed as necessary after a small timeout. The resulting logs, including
all intermediate steps, are kept in a managed sub-folder; NB: the tool removes irrelevant files
from this sub-folder. In the sub-folder, the final results of all branches are kept under consistent
and predictable filenames, for compatibility with file management systems.</p>
        <p>The user can enable a visualisation of the last log of the last branch; however, please note that
this will attempt to load the log into memory, hence this option is not enabled by default.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Download</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Maturity</title>
      <p>
        FilterTree is platform-independent (Java) and available under a GPL license from https://leemans.
ch/filtertree. A full list of supported filters is included on this website, as well as a screencast
demoing the tool. The source code is available at https://svn.win.tue.nl/repos/prom/Packages/
SanderLeemans/FilterTree/. An empty text file can be used to start the editor.
The FilterTree tool has been used in several of our own projects, including on private healthcare
data (Figure 1), historical bureaucrat career data, road traffic fine collection data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], etc. These
settings ranged from simple (e.g. lifting a few event attributes to trace level) to complex (see
Figure 1). In the latter case, the initial log (a 260 MB CSV file) was too complex to be of use directly
in ProM or in any other process mining tool we were allowed to try, given the data confidentiality
and the semi-commercial nature of the data. The FilterTree tool allowed us to transform the log
from CSV into XES and to filter it down to a manageable sub-view, which could be analysed in
standard process mining tools.
      </p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Preparing event data for analysis remains a rather ill-supported task, especially in settings with
repeated small changes, large and complex event logs, data quality issues, or changing analysis
questions. In this paper, we proposed a tool to edit XES and CSV files by means of filters. The
FilterTree tool is repeatable, as it keeps a full filter chain specification; it supports branching in
the chain to allow multiple chains to share the same initial filters.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. L. van Eck, X. Lu, S. J. J. Leemans, W. M. P. van der Aalst, PM²: a process mining project methodology, in: Advanced Information Systems Engineering - 27th International Conference, CAiSE 2015, Stockholm, Sweden, June 8-12, 2015, Proceedings, volume 9097 of Lecture Notes in Computer Science, Springer, 2015, pp. 297-313. doi:10.1007/978-3-319-19069-3_19.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Berti, S. J. van Zelst, W. van der Aalst, Process mining for Python (PM4Py): bridging the gap between process- and data science, arXiv preprint arXiv:1905.06169 (2019).</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Polyvyanyy, Process query language, in: A. Polyvyanyy (Ed.), Process Querying Methods, Springer, 2022, pp. 313-341. doi:10.1007/978-3-030-92875-9_11.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] B. F. van Dongen, A. K. A. de Medeiros, H. M. W. Verbeek, A. J. M. M. Weijters, W. M. P. van der Aalst, The ProM framework: a new era in process mining tool support, in: Applications and Theory of Petri Nets 2005, 26th International Conference, ICATPN 2005, Miami, USA, June 20-25, 2005, Proceedings, volume 3536 of Lecture Notes in Computer Science, Springer, 2005, pp. 444-454. doi:10.1007/11494744_25.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] H. M. W. Verbeek, J. C. A. M. Buijs, B. F. van Dongen, W. M. P. van der Aalst, XES, XESame, and ProM 6, in: Information Systems Evolution - CAiSE Forum 2010, Hammamet, Tunisia, June 7-9, 2010, Selected Extended Papers, volume 72 of Lecture Notes in Business Information Processing, Springer, 2010, pp. 60-75. doi:10.1007/978-3-642-17722-4_5.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] S. J. J. Leemans, S. Shabaninejad, K. Goel, H. Khosravi, S. W. Sadiq, M. T. Wynn, Identifying cohorts: recommending drill-downs based on differences in behaviour for process mining, in: G. Dobbie, U. Frank, G. Kappel, S. W. Liddle, H. C. Mayr (Eds.), Conceptual Modeling - 39th International Conference, ER 2020, Vienna, Austria, November 3-6, 2020, Proceedings, volume 12400 of Lecture Notes in Computer Science, Springer, 2020, pp. 92-102. doi:10.1007/978-3-030-62522-1_7.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>