<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DB-XES: Enabling Process Discovery in the Large</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alifah Syamsiyah</string-name>
          <email>A.Syamsiyah@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Boudewijn F. van Dongen</string-name>
          <email>B.F.v.Dongen@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wil M.P. van der Aalst</string-name>
          <email>W.M.P.v.d.Aalst@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eindhoven University of Technology</institution>
          ,
          <addr-line>Eindhoven</addr-line>
          ,
          <country country="NL">the Netherlands</country>
        </aff>
      </contrib-group>
      <fpage>63</fpage>
      <lpage>77</lpage>
      <abstract>
        <p>Dealing with the abundance of event data is one of the main process discovery challenges. Current process discovery techniques are able to e ciently handle imported event log les that t in the computer's memory. Once data les get bigger, scalability quickly drops since the speed required to access the data becomes a limiting factor. This paper proposes a new technique based on relational database technology as a solution for scalable process discovery. A relational database is used both for storing event data (i.e. we move the location of the data) and for pre-processing the event data (i.e. we move some computations from analysis-time to insertion-time). To this end, we rst introduce DB-XES as a database schema which resembles the standard XES structure, we provide a transparent way to access event data stored in DB-XES, and we show how this greatly improves on the memory requirements of a stateof-the-art process discovery technique. Secondly, we show how to move the computation of intermediate data structures, such as the directly follows relation, to the database engine, to reduce the time required during process discovery. The work presented in this paper is implemented in ProM tool, and a range of experiments demonstrates the feasibility of our approach.</p>
      </abstract>
      <kwd-group>
        <kwd>process discovery</kwd>
        <kwd>process mining</kwd>
        <kwd>big event data</kwd>
        <kwd>relational database</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Process mining is a research discipline that sits between machine learning and
data mining on the one hand and process modeling and analysis on the other
hand. The goal of process mining is to turn event data into insights and actions
in order to improve processes [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. One of the main perspectives o ered by
process mining is process discovery, a technique that takes an event log and
produces a model without using any a-priori information. Given the abundance
of event data, the challenge is to enable process mining in the large. Any sampling
technique would lead to statistically valid results on mainstream behavior, but
would not lead to insights into the exceptional behavior, which is typically the
goal of process mining.
      </p>
      <p>In the traditional setting of process discovery, event data is read from an
event log le and a process model describing the recorded behavior is produced,
event data
intermediate structure
(e.g., directly fol ows relation)
step 1
step 2</p>
      <p>process mining tool
process model
(e.g., Petri net or BPMN model)
as depicted in Figure 1(a). In between, there is a so-called intermediate structure,
which is an abstraction of event data in a structured way, e.g. the directly
follows relation, a pre x-automaton, etc. To build such an intermediate structure,
process mining tools load the event log in memory and build the intermediate
structure in the tool, hence the analysis is bound by the memory needed to store
both the event log and the immediate structure in memory. Furthermore, the
time needed for the analysis includes the time needed to convert the log to the
intermediate structure.</p>
      <p>
        To increase the scalability, relational databases have been proposed for
storing event data [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], as depicted in Figure 1(b), i.e. the event log le is replaced
by a database. In [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] a database schema was introduced to store event data and
experiments showed the reduction in memory use. A connection is established
from the database to process mining tools to access the event data on demand
using the standard interfaces for dealing with event logs, i.e. OpenXES [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Since
no longer the entire event log is to be read in memory, the memory
consumption of the process mining analysis will be shown to be reduced signi cantly as
now only the intermediate structure needs to be stored. However, this memory
reduction comes at a cost of analysis time since access to the database is several
orders of magnitude slower than access to an in-memory event log while building
the intermediate structure for further analysis.
      </p>
      <p>Therefore, we present a third solution, called DB-XES, where we not only
move the location of the event data, but also the location of such intermediate
structures. In order to do so, we move the computation of intermediate structures
from analysis time to insertion time, as depicted in Figure 1(c). In other words,
each intermediate structure is kept up-to-date for each insertion of a new event
of a trace in the database. In this paper we present the general idea and a
concrete instantiation using the intermediate structure of a state-of-the-art process
discovery technique. We show that the proposed solution saves both memory
and time during process analysis.</p>
      <p>The remainder of this paper is organized as follows. In Section 2, we discuss
some related work. In Section 3, we present the database schema for DB-XES.
In Section 4, we extend DB-XES with the notion of intermediate structure. In
Section 5 we show how a well-known intermediate structure can be computed
inside the database. Then, in Section 6, we present experiments using the
Inductive Miner. These show signi cant performance gains. Finally, we conclude
and discuss the future work in Section 7.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        One of the rst tools to extract event data from a database was XESame [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]. In
XESame users can interactively select data from the database and then match it
with XES elements. However, the database is only considered as a storage place
of data as no direct access to the database is provided.
      </p>
      <p>
        Similar to XESame, in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] a technique is presented where data stored in
databases is serialized into an XES le. The data is accessed with the help of
two ontologies, namely a domain ontology and an event ontology. Besides that,
the work also provided on-demand access to the data in the database using query
unfolding and rewriting techniques in Ontology Based Data Access [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However,
the performance issues make this approach unsuitable for large databases.
      </p>
      <p>Some commercial tools, such as Celonis1 and Minit2, also incorporate features
to extract event data from a database. The extraction can be done extremely
fast, however, its architecture has several downsides. First, it is not generic since
it requires a transformation to a very speci c schema, e.g. a table containing
information about case identi er, activity name, and timestamp. Second, it cannot
handle huge event data which exceed computer's memory due to the fact that
the transformation is done inside the memory. Moreover, since no direct access
to the database is provided, some updates in the database will lead to restarting
of the whole process in order to get the desired model.</p>
      <p>
        Building on the idea of direct access to the database, in [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], RXES was
introduced before as the relational representation of XES and it is was shown that
RXES uses less memory compared to the le-based OpenXES and MapDB XES
1 http://www.celonis.de/en/
2 http://www.minitlabs.com/
Lite implementations. However, its application to a real process mining algorithm
was not investigated and the time-performance analysis was not included.
      </p>
      <p>
        In [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ], the performance of multidimensional process mining (MPM) is
improved using relational databases techniques. It presented the underlying
relational concepts of PMCube, a data-warehouse-based approach for MPM. It
introduced generic query patterns which map OLAP queries to SQL to push the
operations to the database management systems. This way, MPM may
benet from the comprehensive optimization techniques provided by state-of-the-art
database management systems. The experiments reported in the paper showed
that PMCube procides a signi cantly better perfromance than PMC, the
stateof-the-art implementation of the Process Cubes approach.
      </p>
      <p>
        The use of database in process mining gives signi cance not only to the
procedural process mining, but also declarative process mining. The work in [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]
introduced an SQL-based declarative process mining approach that analyses
event log data stored in relational databases. It deals with existing issues in
declarative process mining, namely the performance issues and expressiveness
limitation to a speci c set of constraints. By leveraging database performance
technology, the mining procedure in SQLMiner can be done fast. Furthermore,
SQL queries provide exibility in writing constraints and it can be customized
easily to cover process perspective beyond control ow.
      </p>
      <p>
        Apart from using databases, some other techniques for handling big data
in process mining have been proposed [
        <xref ref-type="bibr" rid="ref10 ref12 ref2">2, 10, 12</xref>
        ], two of them are decomposing
event logs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and streaming process mining [
        <xref ref-type="bibr" rid="ref18 ref7">7, 18</xref>
        ]. In decomposition, a large
process mining problem is broken down into smaller problems focusing on a
restricted set of activities. Process mining techniques are applied separately in
each small problem which then they are combined to get an overall result. This
approach deals with exponential complexity in the number of activities of most
process mining algorithms [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]. Whereas in streaming process mining, it provides
online-fashioned process mining where the event data is freshly produced, i.e. it
does not restrict to only process the historical data as in traditional process
mining. Both approaches however require severe changes to the algorithms used
for analysis and they are therefore not directly applicable to existing process
mining techniques.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>DB-XES as Event Data Storage</title>
      <p>
        In the eld of process mining, event logs are typically considered to be structured
according to the XES standard [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Based on this standard, we create a relational
representation for event logs, which we called DB-XES. We select relational
databases rather than any other type of databases, e.g. NoSQL [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ], because
of the need to be able to slice and dice data in di erent ways. An e-commerce
system, for example, may need to be analyzed using many views. One view
can be de ned based on customer order, other view may also be de ned based
on delivery, etc. Some NoSQL databases, such as key-value store databases,
document databases, or column-oriented databases, are suitable for the data
log_has_trace
log_id VARCHAR(50)
trace_id VARCHAR(50)
sequence INT(11)
      </p>
      <p>Indexes
trace
id VARCHAR(50)
attr_id VARCHAR(50)
Indexes
event
id VARCHAR(50)
attr_id VARCHAR(50)
event_col_id VARCHAR(50)</p>
      <p>Indexes
trace_has_event
trace_id VARCHAR(50)
event_id VARCHAR(50)
sequence INT(11)
Indexes
Triggers
event_collection
id VARCHAR(50)
name VARCHAR(50)
Indexes
log
id VARCHAR(50)
attr_id VARCHAR(50)
Indexes
log_has_global
log_id VARCHAR(50)
attr_id VARCHAR(50)
scope VARCHAR(50)
Indexes
which can be aggregated, but have di culties supporting multiple perspectives
at the same time. Besides, relational databases are more mature than NoSQL
databases with respect to database features, such as trigger operations.</p>
      <p>Figure 2 shows the basic database schema of DB-XES. The XES main
elements are represented in tables log, trace, event, and attribute. The relation
between these elements are stored in tables log has trace and trace has event.
Furthermore, classi er and extension information related to a log can be
accessed through tables log has classi er and log has extension. Global attributes
are maintained in the table log has global. In order to store the source of event
data, we introduce the event collection table.</p>
      <p>
        OpenXES is a Java-based reference implementation of the XES standard for
storing and managing event log data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. OpenXES is a collection of interfaces
and corresponding implementations tailored towards accessing XES les. In
consequence of moving event data from XES les to DB-XES, we need to implement
some Java classes in OpenXES. Having the new version of OpenXES, it allows
for any process mining techniques capable of handling OpenXES data to be used
on DB-XES data. The implementation is distributed within the DBXes package
in ProM (https://svn.win.tue.nl/repos/prom/Packages/DBXes/Trunk/).
      </p>
      <p>The general idea is to create SQL queries to get the event data for
instantiating the Java objects. Access to the event data in the database is de ned
for each element of XES, therefore we provide on demand access. We de ne a
log, a trace, and an event based on a string identi er and an instance of class
Connection in Java. The identi er is retrieved from a value under column id in
log, trace, and event table respectively. Whereas the instance of class Connection
should refer to the database where we store the event data. Upon initialization
of the database connection, the list of available identi ers is retrieved from the
database and stored in memory using global variables.
4</p>
    </sec>
    <sec id="sec-4">
      <title>Extending DB-XES with Intermediate Structures</title>
      <p>In the analysis, process mining rarely uses event data itself, rather it processes an
abstraction of event data called an intermediate structure. This section discusses
the extension of DB-XES with intermediate structures. First, we brie y explain
about several types of intermediate structures in process mining, then we present
a highly used intermediate structure we implemented in DB-XES as an example.</p>
      <p>
        There are many existing intermediate structures in process mining, such as
the eventually follows relation, no co-occurrence relation [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], handover of work
relation [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ], and pre x-closed languages in region theory [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. Each intermediate
structure has its own functions and characteristics. Some intermediate structures
are robust to ltering, hence we may get di erent views on the processes by
ltering the event data without recalculation of the intermediate structure like
eventually follows relation, but some require full recomputation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Mostly
intermediate structures can be computed by reading the event data in a single
pass over the events, but some are more complex to be computed. In general the
size of intermediate structure is much smaller than the size of the log [
        <xref ref-type="bibr" rid="ref14 ref4 ref5">4, 5, 14</xref>
        ],
but some intermediate structures are bigger than the log [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]. In the following
we brie y introduce some examples of intermediate structures.
      </p>
      <p>{ The directly follows relation (a &gt; b) contains information that a is directly
followed by b in the context of a trace. This relation is not robust to ltering.
Once ltering happens, the relation must be recalculated. Suppose that a is
directly followed by b, i.e. a &gt; b, and b is directly followed by c, i.e. b &gt; c. If
we lter b, now a is directly followed by c, hence a new relation a &gt; c holds.
{ The eventually follows relation (V (a; b)) is the transitive closure of the
directly follows relation: a is followed by b somewhere in the trace. Suppose
that a is eventually followed by b, i.e. V (a; b), and a is eventually followed by
c, i.e. V (a; c). If we lter b, a is still followed by c somewhere in the trace, i.e.</p>
      <p>V (a; c) still holds. Therefore, eventually follows relation is robust to ltering.
{ The no co-occurrence relation (R(a; b)) counts the occurrences of a with no
co-occurring b in the trace. For example, a occurs four times with no
cooccurring b, i.e. R(a; b) = 4, and a occurs three times with no co-occurring
c, i.e. R(a; c) = 3. If we lter b, it does not e ect the occurrence of a with no
c, i.e. R(a; c) = 3 still holds. Therefore, no co-occurrence relation is robust
to ltering.
{ The handover of work relation between individual a and b (H(a; b)) exists if
there are two subsequent activities where the rst is completed by a and the
second by b. This is also an example of non-robust intermediate structure for</p>
      <p>log_has_dfr
log_id VARCHAR(50)
classifier_id VARCHAR(50)
dfr_id VARCHAR(50)</p>
      <p>Indexes</p>
      <p>
        While many intermediate structures can be identi ed when studying process
mining techniques, we currently focus on the Directly Follows Relation (DFR).
DFR is used in many process mining algorithms, including the most widely used
process discovery techniques, i.e. Inductive Miner [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. In the following we discuss
how DB-XES is extended by a DFR table.
Directly Follows Relation (DFR) contains information about the frequency with
which one event class directly follows another event class in the context of a
trace. Following the de nition in [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], DFR is de ned as follows.
De nition 1 (Event Log). Let E be a set of events. An event log L E is a
set of event sequences (called traces) such that each event appears precisely once
in precisely one trace.
      </p>
      <p>De nition 2 (Event Attributes and Classi ers). Let E be a set of events
and let A be a set of attribute names.</p>
      <p>{ For any event e 2 E and name a 2 A: #a(e) is the value of attribute a for
event e. #a(e) = ? if there is no value.
{ Any subset C fa1; a2; :::; ang A is a classi er, i.e., an ordered set of
attributes. We de ne: #c(e) = (#a1 (e); #a2 (e); :::; #an (e)).
{ In the context of an event log there is a default classi er DC A for which
we de ne the shorthand of event class e = #DC (e).</p>
      <sec id="sec-4-1">
        <title>De nition 3 (Directly Follows Relation (DFR)). Let L E be an event</title>
        <p>log. x is directly followed by y, denoted x &gt; y, if and only if there is a trace
= he1; e2; :::; eni 2 L and 1 i &lt; n such that ei = x and ei+1 = y.</p>
        <p>Translated to DB-XES, table dfr consists of three important columns next
to the id of the table, namely eventclass1 which indicates the rst event class in
directly follows relation, eventclass2 for the second event class, and freq which
indicates how often an event class is directly followed by another event class.
Figure 3 shows the position of table dfr in DB-XES. As DFR is de ned on the
event classes based on a classi er, every instance in table dfr is linked to an
instance of table classi er in the log.</p>
        <p>De nition 4 (Table dfr). Let L E be an event log, X = fe j e 2 Eg is the
set of event classes. dfr 2 X X 9 N where:
{ dom(dfr) = f(x; y) 2 X X j x &gt; yg
{ dfr(x; y) = Phe1;:::;eni2L jfi 2 f1; :::; n
1g j ei = x ^ ei+1 = ygj</p>
        <p>As mentioned before, the design choice to incorporate DFR as the
intermediate structure is due to fact that DFR is used in the state-of-the-art process
discovery algorithm. However, DB-XES can be extended into other intermediate
structures such as eventually follows relations and no co-occurrence relations.
5</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>DFR Pre-Computation in DB-XES</title>
      <p>Typically, process mining algorithms build an intermediate structure in memory
while going through the event log in a single pass (as depicted in Figure 1(a)).
However, this approach will not be feasible to handle huge event log whose size
exceeds the computer memory. Moving the location of the event data from a le
to a database as depicted in Figure 1(b) increases the scalability of process
mining as the computer memory no longer needs to contain the event data. However,
the ping-pong communication between the database and process mining tools is
time consuming. Therefore, in this section, we show how DFR is pre-computed
in DB-XES (Figure 1(c)). Particularly, we show how common processing tasks
can be moved both in time and location, i.e. we show how to store intermediate
structures in DB-XES and we show how these structures can be updated while
inserting the data rather than when doing the process mining task. This paper
focuses on a particular intermediate structure, namely the DFR, but the work
trivially extends to other intermediate structures, as long as they can be kept
up-to-date during insertion of event data in the database.</p>
      <p>As mentioned in Section 4, the table dfr in Figure 3 is the table in DB-XES
which stores DFR values, furthermore, the table log has dfr stores the context in
1
2
3
4
5
6
7
8
9
10
11
12
13
which the DFR exists, i.e. it links the DFR values to a speci c log and classi er
combination. The dfr table is responsive to update operations, particularly when
users insert new events to the log. In the following we discuss how the dfr table
is created and updated in DB-XES.
5.1</p>
      <sec id="sec-5-1">
        <title>Creating Table dfr in DB-XES</title>
        <p>Suppose that there exists two entries in the trace has event table with trace id
, event id's ei and ei+1 and sequence's i and i + 1. The rst event ei is linked
to an attribute with value a and the second event is linked to an attribute
with value b while the log has a classi er based on attribute . In DB-XES, we
store the frequency of each pair a &gt; b in the database rather than letting the
discovery algorithm build in it on-demand and in-memory. In other words, the
directly follows relation is precomputed and the values can be retrieved directly
by a process mining algorithm when needed.</p>
        <p>To create table dfr, we run three SQL queries. The rst query is to obtain
pairs of directly follows relations. For instance, if an event class a is directly
followed by an event class b and this happens 100 times in the log, then there
will be a row in table dfr with value (dfr1, a, b, 100), assuming the id is dfr1.
Furthermore, the second and third queries are to extract start and end event
classes. We create an arti cial start (&gt;) and end (?) event for each process
instance. For example, if there are 200 cases where a happens as the start event
class, there will be a row in dfr with values (dfr1, &gt;, a, 200). Similarly, if b is
the end event class for 150 cases, there will be a row in dfr with values (dfr1, b,
?, 150).</p>
        <p>Technically, the SQL query contains big joins between tables trace has event,
event, attribute, log has trace, log has classi er, and classi er. Such joins are
needed to get pairs of event classes whose events belong to the same trace in
the same log which has some classi ers. The SQL query mentioned below is a
simpli ed query to obtain pairs of directly follows relations. To improve
understandability, we use placeholders (&lt; ::: &gt;) to abstract some details. Basically
they are trivial join conditions or selection conditions to interesting columns.
S E L E C T id , e v e n t C l a s s 1 , e v e n t C l a s s 2 , c o u n t (*) as freq
FROM (</p>
        <p>S E L E C T &lt;... &gt;
FROM (</p>
        <p>S E L E C T &lt;... &gt;
FROM t r a c e _ h a s _ e v e n t as t1</p>
        <p>I N N E R JOIN t r a c e _ h a s _ e v e n t as t2</p>
        <p>ON t1 . t r a c e _ i d = t2 . t r a c e _ i d
/* Here is to get c o n s e c u t i v e e v e n t s */</p>
        <p>W H E R E t1 . s e q u e n c e = t2 . s e q u e n c e - 1
) as t e m p t a b l e ,
a t t r i b u t e as a1 , a t t r i b u t e as a2 ,
e v e n t as event1 , e v e n t as event2 ,
14
15
16
17
18
19</p>
        <p>We start with a self join in table trace has event (line 6-8) to get pairs of
two events which belong to the same trace. Then we lter to pairs whose events
happen consecutively, i.e. the sequence of an event is preceded by the other (line
10). The next step is obtaining the attribute values of these events. The attribute
values are grouped based on the classi er in the log (line 16-17). This grouping
is essential if the classi er is built from a combination of several attributes, for
example a classi er based on the activity name and lifecycle. After grouping, we
get a multiset of pairs of event classes. Finally, the same pairs are grouped and
counted to have the frequency of how often they appeared in the log (line 1, 19).
Rows in table dfr are automatically updated whenever users insert a new event
through a trigger operation on table trace has event which is aware of an insert
command. Here we consider two scenarios: (1) a newly inserted event belongs
to a new trace in a log for which a dfr table exists and (2) a newly inserted
event belongs to an existing trace in such a log. We assume such insertion is
well-ordered, i.e. an event is not inserted at an arbitrary position.</p>
        <p>Suppose that we have a very small log L = [ha; bi], where we assume a and
b refer to the event class of the two events in L determined by a classi er cl for
which an entry (L, cl, dfr1) exists in the log has dfr table. This log only contains
one trace (say 1) with two events that correspond to two event classes, namely a
and b. If we add to L a new event with a new event class c to a new trace di erent
from 1 then such an event is considered as in the rst scenario. However, if we
add c to 1 then it is considered as the second scenario.</p>
        <p>In the rst scenario, we update the start and end frequency of the inserted
event type. In our example above, the rows in table dfr containing (dfr1, &gt;, c,
f ) and (dfr1, c, ?, f ) will be updated as (dfr1, &gt;, c, f + 1) and (dfr1, c, ?, f
+ 1) with f is the frequency value. If there is no such rows, (dfr1, &gt;, c, 1) and
(dfr1, c, ?, 1) will be inserted.</p>
        <p>In the second scenario, we update the end frequency of the last event class
before the newly inserted event class, and add the frequency of the pair of those
two. Referring to our example, row (dfr1, b, ?, f ) is updated to (dfr1, b, ?,
f - 1). If there exists row (dfr1, c, ?, f ), it is updated to (dfr1, c, ?, f + 1),
otherwise (dfr1, c, ?, 1) is inserted. Furthermore, if (dfr1, b, c, f ) exists in table
dfr, it is updated as (dfr1, b, c, f + 1), otherwise (dfr1, b, c, 1) is inserted.</p>
        <p>By storing the intermediate structure in the database and updating this
structure when events are inserted, we move a signi cant amount of computation
time to the database rather than to the process analysis tool. This allows for
faster analysis with virtually no limits on the size of the event log as we show in
the next section.
6</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Experiments</title>
      <p>
        In this section we show the in uence of moving both the event data and the
directly follows table to the database on the memory use and time consumption
of Inductive Miner [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Next to the traditional in-memory processing of event
logs (Figure 1(a)), we consider two scenarios in DB-XES: (1) DB-XES without
DFR where the intermediate result is computed during the discovery (Figure
1(b)) and (2) DB-XES with DFR where the intermediate result is pre-computed
in the database (Figure 1(c)). We show that the latter provide scalability with
respect to data size and even improves time spent on actual analysis.
      </p>
      <p>As the basis for the experiments, we use an event log from a real company
which contains 29,640 traces, 2,453,386 events, 54 di erent event classes and
17,262,635 attributes. Then we extend this log in two dimensions, i.e. we
increase (1) the number of event classes and (2) the number of traces, events and
attributes. We extend the log by inserting copies of the original event log data
with some modi cations in the identi er, task name, and timestamp. In both
cases, we keep the other dimension xed in order to get a clear picture of the
in uence of each dimension separately on both memory use and CPU time. This
experiment was executed on the machine with processor Intel(R) Core(TM)
i74700MQ and 16GB of RAM.
6.1</p>
      <sec id="sec-6-1">
        <title>Memory Use</title>
        <p>In Figure 4(a), we show the in uence of increasing the number of event classes
on the memory use of the Inductive Miner. The Inductive Miner makes a linear
pass over the event log in order to build an object storing the direct succession
relation in memory. In theory, the direct succession relation is quadratic in the
number of event classes, but as only actual pairs of event classes with more than
one occurrence are stored and the relation is sparse, the memory consumption
scales linearly in the number of event classes as shown by the trendlines. It is
clear that the memory use of DB-XES is consistently lower than XES. This is
easily explained as there is no need to store the event log in memory. The fact
that DB-XES with DFR uses more memory than DB-XES without DFR is due
to the memory overhead of querying the database for the entire DFR table at
once.</p>
        <p>In Figure 4(b), we present the in uence of increasing the number of events,
traces and attributes while keeping the number of event classes constant. In this
case, normal XES quickly uses more memory than the machine has while both
DB-XES implementations show no increase in memory use with growing data
and the overall memory use is less than 50 MB. This is expected as the memory
consumption of the Inductive Miner varies with the number of event classes only,
i.e. the higher frequency values in the dfr table do not in uence the memory use.
1;500
8;000
) 6;000
B
M
(
ry4;000
o
m
e</p>
        <p>M2;000
60;000
50;000
)s 40;000
m
(e
im 30;000
T
U
PC 20;000</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2 CPU Time</title>
        <p>We also investigated the in uence of accessing the database to the CPU time
needed by the analysis, i.e. we measure the time spent to run the Inductive
Miner. In Figure 5(a), we show the in uence of the number of event classes on the
CPU time. When switching from XES les to DB-XES without DFR, the time
needed to do the analysis increases considerably. This is easily explained by the
overhead introduced in Java by initiating the query every time to access an event.
However, when using DB-XES with DFR, the time needed by the Inductive
Miner decreases, i.e. it is faster to obtain the dfr table from the database than
to compute it in memory.</p>
        <p>This e ect is even greater when we increase the number of traces, events
and attributes rather than the number of event classes as shown in Figure 5(b).
DB-XES with DFR shows a constant CPU time use, while normal XES shows a
steep linear increase in time use before running out of memory. DB-XES without
DFR also requires linear time, but is several orders of magnitude slower
(DBXES without DFR is drawn against the right-hand side axis).</p>
        <p>In this section, we have proven that the use of relational databases in
process mining, i.e DB-XES, provide scalability in terms of memory use. However,
accessing DB-XES directly by retrieving event data elements on demand and
computing intermediate structures in ProM is expensive in terms of processing
time. Therefore, we presented DB-XES with DFR where we moved the
computation of the intermediate structure to the database. This solution provides
scalability in both memory and time.</p>
        <p>We have implemented this solution as a ProM plug-in which connects
DBXES and Inductive Miner algorithm. We name the plug-in as Database Inductive
Miner and it is distributed within the DatabaseInductiveMiner package (https:
//svn.win.tue.nl/repos/prom/Packages/DatabaseInductiveMiner/Trunk/).
7</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Conclusion and Future Work</title>
      <p>This paper focuses on the issue of scalability in terms of both memory use and
CPU use in process discovery. We introduce a relational database schema called
DB-XES to store event data and we show how directly follows relation can be
stored in the same database and be kept up-to-date when inserting new events
into the database.</p>
      <p>Using experiments on real-life data we show that storing event data in
DBXES not only leads to a signi cant reduction in memory use of the process
mining tool, but can even speed up the analysis if the pre-processing is done in
the right way in the database upon insertion of the event data.</p>
      <p>For the experiments we used the Inductive Miner, which is a state-of-the-art
process discovery technique. However, the work trivially extends to other process
discovery techniques, as long as we can identify an intermediate structure used
by the technique which can be updated when inserting new events into the
database.</p>
      <p>The work presented in this paper is implemented in ProM. The plug-in paves
a way to access pre-computed DFR stored in DB-XES. These DFR values are
then retrieved and processed by Inductive Miner algorithm.</p>
      <p>For future work, we plan to implement also the event removal and
intermediate structures which robust to ltering. The intermediate structures will be
kept live under both insertion and deletion of events where possible.
Furthermore, we aim to further improve the performance through query optimization
and indexing.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>W.M.P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Decomposing Petri Nets for Process Mining: A Generic Approach</article-title>
          .
          <source>Distributed and Parallel Databases</source>
          ,
          <volume>31</volume>
          (
          <issue>4</issue>
          ):
          <volume>471</volume>
          {
          <fpage>507</fpage>
          ,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>A.</given-names>
            <surname>Azzini</surname>
          </string-name>
          and
          <string-name>
            <given-names>P.</given-names>
            <surname>Ceravolo</surname>
          </string-name>
          .
          <article-title>Consistent process mining over big data triple stores</article-title>
          .
          <source>In 2013 IEEE International Congress on Big Data</source>
          , pages
          <volume>54</volume>
          {
          <fpage>61</fpage>
          ,
          <year>June 2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3. Diego Calvanese, Marco Montali, Alifah Syamsiyah, and
          <string-name>
            <surname>Wil MP van der Aalst</surname>
          </string-name>
          .
          <article-title>Ontology-driven extraction of event logs from relational databases</article-title>
          .
          <source>In Business Process Intelligence</source>
          <year>2015</year>
          .
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Di</surname>
          </string-name>
          <string-name>
            <surname>Ciccio</surname>
          </string-name>
          , Fabrizio Maria Maggi, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Mendling</surname>
          </string-name>
          .
          <article-title>E cient discovery of target-branched declare constraints</article-title>
          .
          <source>Information Systems</source>
          ,
          <volume>56</volume>
          :
          <fpage>258</fpage>
          {
          <fpage>283</fpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Claudio</given-names>
            <surname>Di</surname>
          </string-name>
          Ciccio and
          <string-name>
            <given-names>Massimo</given-names>
            <surname>Mecella</surname>
          </string-name>
          .
          <article-title>On the discovery of declarative control ows for artful processes</article-title>
          .
          <source>ACM Trans. Manage. Inf. Syst.</source>
          ,
          <volume>5</volume>
          (
          <issue>4</issue>
          ):
          <volume>24</volume>
          :1{
          <fpage>24</fpage>
          :
          <fpage>37</fpage>
          ,
          <year>January 2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>C.W.</given-names>
            <surname>Gu</surname>
          </string-name>
          <article-title>nther</article-title>
          . XES Standard De nition.
          <source>www.xes-standard.org</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Sergio</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <surname>Sebastiaan J. van Zelst</surname>
          </string-name>
          ,
          <article-title>Joaqu n Ezpeleta,</article-title>
          and
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Handling big(ger) logs: Connecting prom 6 to apache hadoop</article-title>
          .
          <source>In BPM Demo Session</source>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Sander</surname>
            <given-names>J. J.</given-names>
          </string-name>
          <string-name>
            <surname>Leemans</surname>
          </string-name>
          , Dirk Fahland, and
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Discovering block-structured process models from event logs - A constructive approach</article-title>
          .
          <source>In PETRI NETS</source>
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Antonella</given-names>
            <surname>Poggi</surname>
          </string-name>
          , Domenico Lembo, Diego Calvanese, Giuseppe De Giacomo, Maurizio Lenzerini, and
          <article-title>Riccardo Rosati</article-title>
          .
          <source>Journal on data semantics x. chapter Linking Data to Ontologies</source>
          , pages
          <volume>133</volume>
          {
          <fpage>173</fpage>
          . Springer-Verlag, Berlin, Heidelberg,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hicham</surname>
            <given-names>Reguieg</given-names>
          </string-name>
          , Boualem Benatallah,
          <string-name>
            <surname>Hamid R. Motahari Nezhad</surname>
            , and
            <given-names>Farouk</given-names>
          </string-name>
          <string-name>
            <surname>Toumani</surname>
          </string-name>
          .
          <article-title>Event correlation analytics: Scaling process mining using mapreduceaware event correlation discovery techniques</article-title>
          .
          <source>IEEE Trans. Services Computing</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <volume>847</volume>
          {
          <fpage>860</fpage>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11. Stefan Schonig, Andreas Rogge-Solti, Cristina Cabanillas, Stefan Jablonski, and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Mendling</surname>
          </string-name>
          .
          <article-title>E cient and Customisable Declarative Process Mining with SQL</article-title>
          , pages
          <volume>290</volume>
          {
          <fpage>305</fpage>
          . Springer International Publishing, Cham,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12. W. v. d. Aalst and
          <string-name>
            <given-names>E.</given-names>
            <surname>Damiani</surname>
          </string-name>
          .
          <article-title>Processes meet big data: Connecting data science with process science</article-title>
          .
          <source>IEEE Transactions on Services Computing</source>
          ,
          <volume>8</volume>
          (
          <issue>6</issue>
          ):
          <volume>810</volume>
          {
          <fpage>819</fpage>
          ,
          <string-name>
            <surname>Nov</surname>
          </string-name>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Distributed process discovery and conformance checking</article-title>
          .
          <source>In FASE</source>
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          , Hajo A.
          <string-name>
            <surname>Reijers</surname>
            , and
            <given-names>Minseok</given-names>
          </string-name>
          <string-name>
            <surname>Song</surname>
          </string-name>
          .
          <article-title>Discovering social networks from event logs</article-title>
          .
          <source>Computer Supported Cooperative Work (CSCW)</source>
          ,
          <volume>14</volume>
          (
          <issue>6</issue>
          ):
          <volume>549</volume>
          {
          <fpage>593</fpage>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>W.M.P. van der Aalst</surname>
          </string-name>
          .
          <source>Process Mining: Data Science in Action</source>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>J. M. E. M. van der Werf</surname>
            ,
            <given-names>B. F. van Dongen</given-names>
          </string-name>
          ,
          <string-name>
            <given-names>C. A. J.</given-names>
            <surname>Hurkens</surname>
          </string-name>
          ,
          <article-title>and</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Serebrenik</surname>
          </string-name>
          .
          <source>Process Discovery Using Integer Linear Programming</source>
          , pages
          <volume>368</volume>
          {
          <fpage>387</fpage>
          . Springer Berlin Heidelberg, Berlin, Heidelberg,
          <year>2008</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Boudewijn</surname>
            <given-names>F. van Dongen</given-names>
          </string-name>
          and
          <string-name>
            <given-names>Shiva</given-names>
            <surname>Shabani</surname>
          </string-name>
          .
          <article-title>Relational XES: data management for process mining</article-title>
          .
          <source>In CAiSE</source>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Sebastiaan</surname>
            <given-names>J. van Zelst</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Boudewijn F. van Dongen</surname>
          </string-name>
          , and
          <string-name>
            <surname>Wil M. P. van der Aalst</surname>
          </string-name>
          .
          <article-title>Know what you stream: Generating event streams from CPN models in prom 6</article-title>
          .
          <source>In BPM Demo Session</source>
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19. Meenu Dave Vatika Sharma.
          <article-title>Sql and nosql databases</article-title>
          .
          <source>International Journal of Advanced Research in Computer Science and Software Engineering</source>
          ,
          <volume>2</volume>
          (
          <issue>8</issue>
          ):
          <volume>20</volume>
          {
          <fpage>27</fpage>
          ,
          <year>August 2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>H.M.W. Verbeek</surname>
            ,
            <given-names>J.C.A.M.</given-names>
          </string-name>
          <string-name>
            <surname>Buijs</surname>
            ,
            <given-names>B.F. van Dongen</given-names>
          </string-name>
          , and
          <string-name>
            <surname>W.M.P. van der Aalst. XES</surname>
          </string-name>
          , XESame, and
          <article-title>ProM 6</article-title>
          . In P. So er and E. Proper, editors,
          <source>Information Systems Evolution</source>
          , volume
          <volume>72</volume>
          , pages
          <fpage>60</fpage>
          {
          <fpage>75</fpage>
          ,
          <year>2010</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <given-names>Thomas</given-names>
            <surname>Vogelgesang</surname>
          </string-name>
          and
          <article-title>H-Jurgen Appelrath. A relational data warehouse for multidimensional process mining</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>