<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Relational XES: Data Management for Process Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>B.F. van Dongen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sh. Shabani</string-name>
          <email>S.Shabaninejad@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, Eindhoven University of Technology</institution>
          ,
          <country>The Netherlands. B.F.v.Dongen</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Information systems log data during the execution of business processes in so called “event logs”. Process mining aims to improve business processes by extracting knowledge from event logs. Currently, the de-facto standard for storing and managing event data, XES, is tailored towards sequential access of this data. Handling more and more data in process mining applications is an important challenge and there is a need for standardized ways of storing and processing event data in the large. In this paper, we first discuss several solutions to address the “big data” problem in process mining. We present a new framework for dealing with large event logs using a relational data model which is backwards compatible with XES. This framework, called Relational XES, provides buffered, random access to events resulting in a reduction of memory usage and we present experiments with existing process mining applications to show how this framework trades memory for CPU time.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Over the last few years, the amount of historical execution data stored by large-scale
information systems has grown exponentially. This data is typically stored for analysis
purposes, for example to enable auditing, process improvement or process mining. The
evolution of information system into such large-scale systems increases the need for
business process analysis techniques to also handle data at such a large scale.</p>
      <p>Most process mining techniques that exist today focus on the analysis of so-called
event logs with the purpose to discover, monitor and improve business processes.
Traditionally, process mining techniques assume that it is possible to sequentially record
events into traces such that each event refers to an activity (i.e., a well-defined step in
the process) and is related to a particular case (i.e., a process instance). Event logs may
store additional information such as the resource (i.e., person or device) executing or
initiating an activity, the timestamp of an event, or data elements recorded with an event
(e.g., the size of an order).</p>
      <p>This classic sequential view on a log data has a few important downsides. First, each
event is assumed to be relevant for only one trace in the context of one process. In real
life this is not necessarily the case, i.e. events may be relevant in the context of many
different traces belonging to different processes. Letting an event appear in many traces
requires a significant amount of duplication. Second, the tree structure of an event log
is great for sequential access to the log, but not suitable for random access, while most
techniques actually use a random access paradigm to access events in the log. Third,
the current rise of decomposition and distribution based techniques for process mining
requires easy filtering of event logs both vertically, i.e. distributing cases over different
logs and horizontally, i.e. distributing events over different logs.</p>
      <p>In this paper, we present an architecture that addresses the issues above while it
maintains backwards compatibility with existing process mining techniques. Our
architecture uses a database to store the event log, allowing for events to appear in multiple
traces, for events to be accessed in a random-access fashion and for efficient filtering.</p>
      <p>The remainder of this paper is structured as follows. In Section 2 we discuss related
work. Then, in Section 3, we present our framework for storing and managing event
data. In Section 4 we compare our framework with the de-facto standard for managing
event data and we conclude the paper in Section 5.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Process mining is a research discipline that provides techniques to discover, monitor
and improve processes based on event data. It is beyond the scope of this paper to give
a full introduction to process mining, but we refer to [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] for such an overview.
      </p>
      <p>
        Most process mining techniques today have been assuming that event data is
available in the form of sequential traces of ordered events that are typically ordered in time.
More recently however, process mining researchers have come to realize that events are
typically related to multiple processes being executed in parallel while synchronizing
on some activities [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4–6</xref>
        ]. Furthermore, research is emerging on the analysis of streams
of events where the notion of a trace is not predefined but may change over time. [
        <xref ref-type="bibr" rid="ref1 ref3">1, 3</xref>
        ]
      </p>
      <p>
        Since the start of research on process mining, attempts have been made on
standardizing the way in which event data is to be stored. This has lead to a number of
semi-standards, such as MXML [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] which was a simple XML format for audit trails of
process aware information systems and the recent XES format [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ]. The XES
format is supported by both academic tools such as ProM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] as well as industrial tools
such as Disco [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Many real-life datasets have been made public in the XES xml format
      </p>
      <p>
        OpenXES is a reference implementation for dealing with event data in the XES
format. This reference implementation consists of a number of interfaces as well as a
number of concrete implementations. These interfaces allow for both sequential and
random, read and write access to event data. Over the years, this implementation has
proven to be very successful in the open source process mining framework ProM [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
Most implementations of OpenXES keep entire logs in memory, while others use
internal databases. However, all implementations use XML as their serialization format.
      </p>
      <p>
        It is not the first time that databases are used to store event logs. XESAme [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] is a
log-import framework that allows events from other systems to be converted to a XES
file. Internally, this framework uses a temporary database to store event data, but then
this database is not exposed to the user and the contents are always serialized to the
XES XML format.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Relational XES (RXES)</title>
      <p>
        XES is an open standard for storing and managing event data. For the purpose of storing
event data, a standardized, extensible storage format was developed, of which the
definition is shown in Figure 1. In XES, each event, trace and log is annotated with typed or
untyped attributes which are given semantics through so-called extensions. For
example, the activity an event refers to is stored as a literal attribute with key “concept:name”.
This key is defined in the standard “Concept Extension” [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        One of the fundamental assumptions of XES is that each event belongs to exactly
one trace and occurred in the context of exactly one log as shown in Figure 1. In practice
however, this is often not the case [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Therefore, in this paper, we present the RXES framework that lifts this assumption.
Instead of considering an event to have occurred in the context of a particular trace, we
consider a trace to be a collection of events and a log to be a collection of such traces,
but events may occur in many different traces and traces may appear in multiple logs.
This allows for backwards compatibility with traditional mining techniques that rely on
the “single trace” view, while also allowing for more advanced techniques to consider
multiple views at once without the need for duplicating these events.</p>
      <p>In Figure 2 we show an ER diagram for the RXES framework. The framework uses
a scheme to store events in a database that is largely based on the UML class diagram
used for XES but separates the contents of events and traces from their appearance in
traces and logs respectively, thus requiring significantly less data duplication. In the
remainder of this section, we discuss the main differences between XES and RXES.
3.1</p>
      <sec id="sec-3-1">
        <title>Representation of events and traces</title>
        <p>In RXES, logs, traces and events are represented by tables with ids. The actual state of
an XTEvent is dependent on values of its attributes (accessed through the XAttributable
class). Similarly, an XTrace is nothing more than an ordered list of events (represented
by the composition) which carries state through its attributes and can only exist in the
context of a log. Therefore, there is no need to have content in the tables representing
traces and events in RXES.</p>
        <p>Events occurring in the context of a particular trace are represented by the trace_has_event
table which identifies occurrences of events rather than the events themselves and
similarly trace occurrences are represented by the log_has_trace table. As events can
appear in multiple traces at different locations and traces can appear in multiple logs at
different locations, we keep an order number indicated by sequence that is specific
to the occurrence of an event in a trace or a trace in a log.</p>
        <p>An event appearing in multiple traces can in RXES be represented by a single entry
in the event table, but the schema is flexible enough to allow for duplication of the event
for each occurrence if desired.
3.2</p>
      </sec>
      <sec id="sec-3-2">
        <title>Representation of Attributes</title>
        <p>
          In XES, attributes are represented by a single class XAttribute. Each attribute is
composed of a single key and is typed through one of its subclasses (Boolean,
ConFig. 1: UML class diagram of the XES standard [
          <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
          ]
tainer : : : Timestamp). The XAttribute class may carry attributes itself (i.e.
metaattributes). In RXES, attributes are separated from values to reduce data duplication and
to enforce uniqueness of an attribute’s type. The latter is technically not a requirement
in XES, but it is considered good practice not to use the same attribute with more types.
To cover meta attributes the attribute table includes a parent child relationship
expressing that attributes have attributes. The values of the meta attributes are stored in the
event has attribute table, hence the definition of a meta attribute that is
contained in more than one event attribute with a different value is stored for each value.
This is a design choice to avoid having the relate events occurrences with attributes.
        </p>
        <p>As shown in Figure 1, event logs may contain a number of global attributes. The
notion of global events refers to the idea that a log can specify that a particular attribute
is present on all traces (or events) contained within it, i.e. the attribute is defined to
be globally present. This allows for process mining techniques to verify whether a log
satisfies certain input conditions, such as the presence of timestamps. A global attribute
also specifies a default value which can be used for example when adding new traces or
events to a log or in case an attribute declared to be global is nonetheless not present.
In RXES, we use two binary attribute in the log_has_attribute table to indicate
if an attribute is global. To cover the concept of global attribute it has been enforced by
the schema that there should not be two global attributes in one log with the same key
but a different value.
3.3</p>
      </sec>
      <sec id="sec-3-3">
        <title>Representation of Classifiers and Extensions</title>
        <p>Another feature of XES is the availability of so-called classifiers and extensions which
provide semantics to events in a log. A classifier in XES is composed of a collection
of attribute keys. If a classifier is included in a log, it provides insights into the way
individual events should be translated into business activities (or event classes in XES
terms). The idea is that such classifiers provide a starting point for process mining
techniques to reason on the correct level of abstraction. In practice, classifiers are rarely
contained in a log and are often added by the process mining technique. Therefore, in
our database, we represent the attribute keys of classifiers simply by text. As a general
rule, a classifier should only refer to keys of event global attributes, but this is not
enforced by XES nor by RXES.</p>
        <p>
          The last concept of XES that is supported by RXES are the so-called extensions.
The extension mechanism allows users to give semantics to attributes. These extensions
specify specific keys for attributes which have to be interpreted in a standardized way.
For example, the concept extension defined by [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] defines attributes concept:name
of type String which stores a generally understood name for any type hierarchy
element. [...] e.g. the name of the executed activity represented by the event[
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In RXES,
extensions are stored in a separate table, and attributes are connected to extensions using
foreign keys.
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4 Identification of attributes</title>
        <p>The attribute table uses auto-generated IDs attached to each attribute to connect
attributes of the same type using SQL queries. By using an id as a primary key rather
than other fields, we allow for recognition of identical attributes which is quite useful
for decomposition, filtering and importing. When encountering an attribute that exists
in the whole log, we may add a reference to the existing id in the database using an
inmemory cache of existing attributes. However, if an attribute is not in cache, we add the
attribute using a fresh id. The consolidation of identical attributes can be done offline.
4</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Benefits of RelationalXES over OpenXES</title>
      <p>RelationalXES provides a full implementation of all OpenXES interfaces using the
database as a backend. As a result the framework is fully backward compatible and
provides a number of benefits over OpenXES. First, keeping events in the database and
only retrieving minimal amounts of data per each request, reduces the memory usage
of process mining techniques significantly. Second, through SQL-based filtering and
database views, decomposition algorithms no longer require event data to be duplicated
in memory, and third, events can exist outside of the context of a trace or a log, allowing
RXES to store event data from streams and to utilize trace identification techniques.</p>
      <p>RXES has been implemented in Java ,and for the experiments in this paper, MySQL
has been used as a database system. However, the design of the application allows
for any relational database systems. The framework keeps only the data items that are
directly related to the log entity plus list of ids of traces in the memory and later loads
the actual data if it is necessary, i.e. if it is requested by a process mining technique.
RXES is implemented as a package in the ProM framework and is available as an
opensource implementation. It can be downloaded from http://www.processmining.org/.</p>
      <p>Currently OpenXES has multiple implementations included in ProM. In this
section, we show the difference between our new technique and existing implementations
by evaluating memory and CPU usage of processing different sizes of logs.</p>
      <p>
        To investigate memory and CPU usage we used 30 sample event-logs. For that,
we used a public, real-life dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] as a base. This dataset contains 13,087 traces
with 262,200 events. In total, there are 1,082,719 attributes contained in these events.
To investigate the behavior of the system, we extend the size of this log in different
dimensions, i.e., we increased the number of events and attributes of the base log up
to 10 times . We used a binary search technique to find the least amount of memory
needed to process each event log. The resulting memory use for each log and each
implementation is shown in Figure 3(a) and Figure 3(b).
      </p>
      <p>Clearly, the default OpenXES implementation (using Java’s Collection Framework)
has the highest memory allocation. The existing MapDB implementation (in package
XESLite) drops the usage by 90%, while our technique uses 90% less than that. For
RXES, these results include the memory needed by the Database Management System.</p>
      <p>Figures 3(c) and 3(d) show a comparison for the time needed to access all events in
the event logs. To evaluate time, a program that requests all elements of the event log has
been executed 10 times and an average time is computed. The amount of memory given
to the Java virtual machine is sufficient to meet the demands of the Default OpenXES
implementation. Clearly, there is a direct relation between memory usage and speed of
access. As Figure 3(c) shows, the default implementation is the fastest method simply
because the event log is loaded into memory completely. The figures further show that
(a) Memory use vs. number of events
(b) Memory use vs. number of attributes
b
M
s
1;500
1;000
500</p>
      <p>RXES is slower than MapDB which uses an in-memory database, which is mostly
because for this experiment we used minimum size of the trace buffer which only keeps
one trace at a time. Hence, the execution time is result of performing many SQL queries
over a TCP/IP connection to the DBMS.</p>
      <p>In general, if the size of the event log is small compared to the available memory,
MapDBDisk keeps a good balance between memory and time usage. However,
compared to RXES, MapDB doesn’t allow for distribution and replication, while it is vital
in cases of using decomposition and parallelism to handle the big data.</p>
      <p>The ability use of RXES to process large event logs with readily available process
mining techniques is essential for the adoption of our framework in practice. However,
our framework is capable of performing tasks directly on the database, such as event
log filtering.
5</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion and Future Work</title>
      <p>In this paper, we presented RelationalXES. This framework is a generalization of the
de-facto standard XES for storing and managing event data in process mining. Where
all existing implementations of XES uses the strict notion of containment for events in
traces and traces in logs, our framework is much more flexible and allows for events to
appear in more than one trace and for traces to appear in more than one log. That makes
it possible to have different views on the same event log. Furthermore, the database
schema used in RXES allows for a significant reduction of duplication by storing
frequently occurring attributes only once rather than repeating them for every occurrence.</p>
      <p>In the paper, we presented the framework in detail and we discuss the underlying
database schema. Furthermore, we show the reduction of memory use by process
mining techniques using RXES.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Andrea</given-names>
            <surname>Burattin</surname>
          </string-name>
          , Alessandro Sperduti, and
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Heuristics miners for streaming event data</article-title>
          .
          <source>CoRR, abs/1212.6383</source>
          ,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Christian</surname>
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Gu</surname>
          </string-name>
          <article-title>¨nther and Anne Rozinat</article-title>
          . Disco:
          <article-title>Discover your processes</article-title>
          .
          <source>In Niels Lohmann and Simon Moser</source>
          , editors,
          <source>BPM (Demos)</source>
          , volume
          <volume>940</volume>
          <source>of CEUR Workshop Proceedings</source>
          , pages
          <fpage>40</fpage>
          -
          <lpage>44</lpage>
          . CEUR-WS.org,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Fabrizio</given-names>
            <surname>Maria</surname>
          </string-name>
          <string-name>
            <surname>Maggi</surname>
          </string-name>
          , Andrea Burattin, Marta Cimitile, and
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Sperduti</surname>
          </string-name>
          .
          <article-title>Online process discovery to detect concept drifts in ltl-based declarative process models</article-title>
          . In Robert Meersman et.al.s, editor,
          <source>OTM Conferences</source>
          , volume
          <volume>8185</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>94</fpage>
          -
          <lpage>111</lpage>
          . Springer,
          <year>2013</year>
          . ISBN 978-3-
          <fpage>642</fpage>
          -41029-1.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Erik</surname>
            <given-names>H. J.</given-names>
          </string-name>
          <string-name>
            <surname>Nooijen</surname>
          </string-name>
          , Boudewijn F. van
          <string-name>
            <surname>Dongen</surname>
            ,
            <given-names>and Dirk</given-names>
          </string-name>
          <string-name>
            <surname>Fahland</surname>
          </string-name>
          .
          <article-title>Automatic discovery of data-centric and artifact-centric processes</article-title>
          .
          <source>In Marcello La Rosa and Pnina Soffer</source>
          , editors,
          <source>Business Process Management Workshops</source>
          , volume
          <volume>132</volume>
          <source>of Lecture Notes in Business Information Processing</source>
          , pages
          <fpage>316</fpage>
          -
          <lpage>327</lpage>
          . Springer,
          <year>2012</year>
          . ISBN 978-3-
          <fpage>642</fpage>
          -36284-2.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Viara</given-names>
            <surname>Popova</surname>
          </string-name>
          and
          <string-name>
            <given-names>Marlon</given-names>
            <surname>Dumas</surname>
          </string-name>
          .
          <article-title>Discovering unbounded synchronization conditions in artifact-centric process models</article-title>
          .
          <source>In Niels Lohmann</source>
          , Minseok Song, and Petia Wohed, editors,
          <source>Business Process Management Workshops</source>
          , volume
          <volume>171</volume>
          <source>of Lecture Notes in Business Information Processing</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>40</lpage>
          . Springer,
          <year>2013</year>
          . ISBN 978-3-
          <fpage>319</fpage>
          -06256-3.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Viara</given-names>
            <surname>Popova</surname>
          </string-name>
          , Dirk Fahland, and
          <string-name>
            <given-names>Marlon</given-names>
            <surname>Dumas</surname>
          </string-name>
          .
          <article-title>Artifact lifecycle discovery</article-title>
          .
          <source>CoRR, abs/1303.2554</source>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>Wil</surname>
            <given-names>M. P. van der Aalst. Process</given-names>
          </string-name>
          <string-name>
            <surname>Mining - Discovery</surname>
          </string-name>
          , Conformance and Enhancement of Business Processes. Springer,
          <year>2011</year>
          . ISBN 978-3-
          <fpage>642</fpage>
          -19344-6.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>B.F.; van Dongen. Bpi challenge</surname>
          </string-name>
          <year>2012</year>
          ,
          <year>2012</year>
          . URL http://dx.doi.org/ 10.4121/uuid:
          <fpage>3926db30</fpage>
          -f712
          <string-name>
            <surname>-</surname>
          </string-name>
          4394
          <string-name>
            <surname>-</surname>
          </string-name>
          aebc-75976070e91f.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <surname>Boudewijn</surname>
            <given-names>F. van Dongen</given-names>
          </string-name>
          and
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Emit: A process mining tool</article-title>
          . In Jordi Cortadella and Wolfgang Reisig, editors,
          <source>ICATPN</source>
          , volume
          <volume>3099</volume>
          of Lecture Notes in Computer Science, pages
          <fpage>454</fpage>
          -
          <lpage>463</lpage>
          . Springer,
          <year>2004</year>
          . ISBN 3-540-22236-7.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H. M. W.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          and
          <string-name>
            <surname>Christian W. Gu</surname>
          </string-name>
          <article-title>¨nther. XES standard definition 2.0</article-title>
          .
          <string-name>
            <surname>Technical</surname>
            <given-names>report</given-names>
          </string-name>
          , BPMcenter.org,
          <year>July 2014</year>
          . URL http://bpmcenter.org/ wp-content/uploads/reports/2014/BPM-14-09.pdf.
          <source>BPM Center Report BPM-14-09.</source>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>H. M. W. Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <surname>Joos C. A. M. Buijs</surname>
          </string-name>
          , Boudewijn F. van
          <string-name>
            <surname>Dongen</surname>
          </string-name>
          , and
          <string-name>
            <surname>Wil</surname>
            <given-names>M. P. van der Aalst. XES</given-names>
          </string-name>
          , XESame, and
          <article-title>ProM 6</article-title>
          . In Pnina Soffer and Erik Proper, editors,
          <source>CAiSE Forum</source>
          , volume
          <volume>72</volume>
          <source>of Lecture Notes in Business Information Processing</source>
          , pages
          <fpage>60</fpage>
          -
          <lpage>75</lpage>
          . Springer,
          <year>2010</year>
          . ISBN 978-3-
          <fpage>642</fpage>
          -17721-7.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>