<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Eventifier: Extracting Process Execution Logs from Operational Databases</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Carlos Rodríguez</string-name>
          <email>crodriguez@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Robert Engel</string-name>
          <email>engel@ec.tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Galena Kostoska</string-name>
          <email>kostoska@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Florian Daniel</string-name>
          <email>daniel@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Casati</string-name>
          <email>casati@disi.unitn.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Aimar</string-name>
          <email>maimar@opera21.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Opera21 Group SpA</institution>
          ,
          <addr-line>Rovereto (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>Via Sommarive 5, I-38123, Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Vienna University of Technology Institute of Software Technology and Interactive Systems</institution>
        </aff>
      </contrib-group>
      <abstract>
        <p>This demo introduces Eventifier, a tool that helps reconstruct an event log from the operational databases upon which process instances have been executed. The purpose of reconstructing such an event log is to discover process models from it; hence, the tool targets researchers and practitioners interested in process mining. The demo aims to convey to participants both the conceptual and the practical implications of identifying and extracting process execution events from such databases in order to reconstruct ready-to-use event logs for process discovery.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Process discovery is the task of deriving a process model from process
execution data that are typically stored in event logs, which in turn are generated
by information systems that support the process execution [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Most of the
approaches available in the state of the art assume the existence of an event log,
where each event is assumed to have information, such as a process name,
activity name, execution timestamp, event type (e.g., start or end), and process
instance ID. In practice, most companies do not really have such an event log,
either because they do not have a business process engine that is able to
generate such logs or, if they do, the engine supports only parts of the process, e.g.,
because parts of the process are supported by legacy systems. In the second case,
it may also happen that the engine does not generate an event log that can be
used for process discovery, e.g., if the log contains only events regarding errors
in the system.
      </p>
      <p>
        The information stored in an event log commonly provides a very narrow
and focused view on the overall data produced by a process during its execution
(e.g., focusing on errors for recovery, or on control flow decisions and actors for
auditing). Typically, however, an information system also stores the full data
produced by a process inside its operational databases (ODs; also known
as production databases), where these data comprise process progression data,
process state data, business data produced throughout the process, data related
to the regular operations of an organization, as well as their related business
facts and objects [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. ODs therefore store more and richer data than event logs,
but they blur different aspects of the data and neglect the event-based nature of process
executions. For this reason, process discovery starts from event logs.
      </p>
      <p>
        With this demo, we approach the problem of producing process execution
events in a fundamentally different context, i.e., one in which we do not
have access to the information system running the process (hence we cannot
instrument it) and in which the only way of obtaining process execution events is
to derive them from the OD of the information system after the actual process
execution. We call this activity eventification of the OD, and we perform it
with the help of our tool, Eventifier. For the rest of the paper, we assume that
the OD is a relational database [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Significance to the BPM field. Much attention has been paid so far to the
problems of representing event logs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], event correlation [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and process discovery
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], while the problem of how to produce good events has been neglected by
research. As explained above, Eventifier addresses an important issue in the
field of process mining by providing an application that helps both researchers
and practitioners working in the field.
      </p>
      <p>Eventification of the Operational Database</p>
      <p>Let's start by giving some preliminary definitions. An event log can be seen as a
sequence of events <inline-formula><tex-math>E = [e_1, e_2, \ldots, e_m]</tex-math></inline-formula>, where <inline-formula><tex-math>e_i = \langle \mathit{id}, \mathit{tname}, \mathit{pname}, \mathit{piid}, \mathit{ts}, \mathit{pl} \rangle</tex-math></inline-formula>
is an event of a process instance, with id being the identifier of the event, tname
being the name of the task the event is associated with, pname being the name
of the process type, piid being the process instance identifier, ts being the
timestamp of the event, and pl being the payload of the event. Thus, an event log
stores traces of process executions as atomic events that represent process
progression information and that may carry business data in their payload.</p>
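      <p>As an illustration only, the event structure defined above could be represented as follows. This is a minimal sketch and not the data model used by the tool: the class name Event and the helper traces are hypothetical, and the sketch assumes that piid and ts have already been filled for every event passed to traces.</p>
      <preformat>
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """One event e_i = (id, tname, pname, piid, ts, pl) as defined in the text."""
    id: int
    tname: str                      # name of the task the event belongs to
    pname: str                      # name of the process type
    piid: Optional[str] = None      # process instance id, unknown before correlation
    ts: Optional[datetime] = None   # timestamp, unknown before ordering
    pl: Dict[str, Any] = field(default_factory=dict)  # payload (attribute-value pairs)


def traces(log: List[Event]) -> Dict[str, List[Event]]:
    """Group a fully populated event log by piid and sort each trace by timestamp."""
    grouped: Dict[str, List[Event]] = {}
    for e in log:
        grouped.setdefault(e.piid, []).append(e)
    for events in grouped.values():
        events.sort(key=lambda e: e.ts)
    return grouped
```
      </preformat>
      <p>Keeping piid and ts optional mirrors the fact that they are only filled in by the later correlation and ordering steps.</p>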
      <p>Reconstructing an event log E with events e<sub>i</sub> means deciding when to infer
the existence of an event from the data in the OD and filling each of the attributes
of the event structure with meaningful values. These values either stem from
the data in the OD or are provided by a domain expert. Specifically,
for the id attribute, assigning an identifier to an event means recognizing the
existence of the event. Given that we do not have real events in the OD but only
other, indirect evidence of their occurrence, there is no "correct" or "original"
event identifier to be discovered. The question here is what we consider evidence
of an event. Similarly, in the case of tname, without the concept of task in the
applications of the information system, there is no explicit task naming that can
be discovered from the data. Thus, we need to find a way to label the boxes that
will represent tasks in the discovered model. The value for the attribute pname
(the process name) can only be obtained from the domain expert, who knows which
process she is trying to discover. The process instance identifier (piid) is
needed to group events into process instances. The piid is derived by means of
event correlation based on the values of the attributes of the identified events.
The attribute ts is needed to order events chronologically, which is a requirement
for process discovery. Therefore, we need to find evidence in the OD that helps us
determine the ordering of events. Finally, the goal of choosing a payload pl
for the purpose of eventification is not to reconstruct the complete business data
that can be associated with a given task or event, but rather to support
the correlation of events into process instances. We can get these data from the
rows that originate the events.</p>
      <p>We call the assignment of values to id, pname and tname the identification
of an event, to ts the ordering of events, to pl data association, and to piid
correlation. These four activities together constitute the eventification process,
which is supported by heuristics in the form of eventification patterns:</p>
      <p>Event identification patterns. These patterns help in the identification of
events from the OD. In these patterns, we assume that the existence of a row in
a relation R indicates the presence of an event. We express these patterns as a
function:</p>
      <p><disp-formula><tex-math>\mathit{identify}(R, \mathit{pname}, \mathit{tname}) \rightarrow e_0 = \langle \mathit{id}, \mathit{pname}, \_, \mathit{tname}, \_, t \rangle</tex-math></disp-formula>
where pname and tname are defined by the domain expert, t is the tuple
in R that originated e<sub>0</sub>, and the positions marked "_" (piid and ts) are not yet assigned. Concretely, we rely on the following three patterns for
the identification of events:</p>
      <p>– Single row, single event pattern (Figure 1(a)). In this pattern, each row in
a relation R indicates the existence of an event. R can be obtained with a
simple SQL query such as:</p>
      <preformat>SELECT * FROM r1, r2, ..., rn
WHERE [JOIN conditions for r1, r2, ..., rn];</preformat>
      <p>– Single row, multiple event pattern (Figure 1(b)). A tuple in R can be evidence
of more than one event, such as when different values of the
attributes A<sub>i</sub> of R indicate different potential events. In this case, the relation
R is built by applying filtering conditions in the WHERE clause so as to keep
only the target events:</p>
      <preformat>SELECT * FROM r1, r2, ..., rn
WHERE [JOIN conditions for r1, r2, ..., rn]
  AND [filtering conditions for the target event, e.g., r2.dispatched = 'yes'];</preformat>
      <p>– Multiple row, single event pattern (Figure 1(c)). Multiple rows in a relation
R indicate the presence of a single event. This last pattern is useful, for
instance, when we deal with a denormalized relation that mixes data at
different granularities, e.g., when in a single tuple we find both the header
of an invoice and the item sold. The SQL for R has the following form:</p>
      <preformat>SELECT DISTINCT A1, A2, ..., Ak FROM r1, r2, ..., rn
WHERE [JOIN conditions for r1, r2, ..., rn];</preformat>
      <p>where the attributes A<sub>i</sub> should be the higher-granularity attributes that
would be typically used in a GROUP BY SQL statement.</p>
      <p>[Figure 1: example relations for (a) the single row, single event pattern, (b) the single row, multiple event pattern, and (c) the multiple row, single event pattern.]</p>
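      <p>The three identification patterns can be tried out on a toy database. The following is a hedged sketch using Python's built-in sqlite3 module; the tables orders and invoice_lines, their columns, the process and task names, and the function identify are all hypothetical and chosen only to mirror the patterns described in the text.</p>
      <preformat>
```python
import itertools
import sqlite3

# Hypothetical toy OD; table and column names are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, dispatched TEXT, delivered TEXT);
    INSERT INTO orders VALUES (1, 'yes', 'no'), (2, 'yes', 'yes');
    CREATE TABLE invoice_lines (invoice_id INTEGER, customer TEXT, item_id INTEGER);
    INSERT INTO invoice_lines VALUES (10, 'acme', 1), (10, 'acme', 2), (11, 'zeta', 3);
""")

_next_id = itertools.count(1)  # generated ids; the OD holds no "original" event ids


def identify(conn, query, pname, tname):
    """Turn every row of the relation R returned by `query` into an event e0."""
    cur = conn.execute(query)
    cols = [c[0] for c in cur.description]
    return [
        {"id": next(_next_id), "pname": pname, "piid": None, "tname": tname,
         "ts": None, "t": dict(zip(cols, row))}
        for row in cur.fetchall()
    ]


# Single row, single event: each order row is one event.
created = identify(conn, "SELECT * FROM orders", "order handling", "create order")

# Single row, multiple events: filter on the attribute that signals the target event.
dispatched = identify(conn, "SELECT * FROM orders WHERE dispatched = 'yes'",
                      "order handling", "dispatch")
delivered = identify(conn, "SELECT * FROM orders WHERE delivered = 'yes'",
                     "order handling", "deliver")

# Multiple rows, single event: project on the coarse-grained (header) attributes.
invoiced = identify(conn, "SELECT DISTINCT invoice_id, customer FROM invoice_lines",
                    "order handling", "invoice")
```
      </preformat>
      <p>In this sketch the same orders row yields both a dispatch and a deliver event, while the two invoice_lines rows of invoice 10 collapse into a single invoice event.</p>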
      <p>Event ordering pattern. The event ordering pattern aims at deriving the
ordering of events from time-related information associated with the records stored
in the OD, and is represented as:</p>
      <p><disp-formula><tex-math>\mathit{order}(e_0) \rightarrow e_1 = \langle \mathit{id}, \mathit{pname}, \_, \mathit{tname}, \mathit{ts}, t \rangle</tex-math></disp-formula>
where e<sub>1</sub> is the result of attaching a timestamp value to ts (the piid position, marked "_", is still unassigned), and ts is taken from the
projection of all timestamp or date attributes of the tuple t of e<sub>0</sub> generated by the previous
pattern. If only one timestamp can be found, it is used straightaway. If several
candidate timestamps are found, the domain expert chooses the one that best
represents the execution time of the task.</p>
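      <p>A minimal sketch of the ordering step follows. It assumes that events are plain dictionaries whose key "t" holds the originating row and that timestamp attributes have already been parsed into datetime values; the function name order follows the notation of the text, while the parameter ts_attribute (standing in for the domain expert's choice) and the sample attribute names are hypothetical.</p>
      <preformat>
```python
from datetime import datetime


def order(e0, ts_attribute=None):
    """Attach a timestamp to e0, taken from a timestamp attribute of its row t."""
    # Candidate timestamps are all datetime-valued attributes of the row.
    candidates = {k: v for k, v in e0["t"].items() if isinstance(v, datetime)}
    if not candidates:
        raise ValueError("no timestamp attribute found in the originating row")
    if ts_attribute is None:
        if len(candidates) != 1:
            raise ValueError("several candidate timestamps: the domain expert must choose one")
        ts_attribute = next(iter(candidates))  # the only candidate is used straightaway
    return {**e0, "ts": candidates[ts_attribute]}


e0 = {"id": 1, "pname": "order handling", "piid": None, "tname": "dispatch", "ts": None,
      "t": {"order_id": 7, "created_at": datetime(2012, 3, 1, 9, 0),
            "dispatched_at": datetime(2012, 3, 2, 14, 30)}}
e1 = order(e0, ts_attribute="dispatched_at")
```
      </preformat>
      <p>Raising an error when several candidates exist and none is chosen reflects the semi-automated nature of the step: the choice is left to the domain expert rather than guessed.</p>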
      <p>Data association pattern. The data association pattern selects which
data to assign to pl. In the above patterns, we have so far simply carried over
the complete row t as the payload of the event, while here we select which
attributes of t are really relevant. Our assumption is that all
necessary data are already present in t, that is, we do not need to consult any
additional tables of the OD to fill pl with meaningful data. Thus, in the event
identification step, the necessary tables are joined, and t contains all potentially
relevant data items. The data association pattern is represented as:
<disp-formula><tex-math>\mathit{getdata}(e_1) \rightarrow e_2 = \langle \mathit{id}, \mathit{pname}, \_, \mathit{tname}, \mathit{ts}, \mathit{pl} \rangle</tex-math></disp-formula>
where e<sub>1</sub> is as defined before, pl is the new payload computed by projecting
attributes from t, and the piid position, marked "_", is still unassigned. If the domain
expert has no knowledge about the OD, the heuristic we apply is to copy into pl all attributes of t, except
timestamps and auto-increment attributes, which by design cannot be used for
correlation. The domain expert can of course also choose manually which attributes
to include and which to exclude.</p>
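      <p>The default heuristic can be sketched as follows. This is an illustrative sketch under stated assumptions: events are dictionaries with the originating row under the key "t", timestamps are date or datetime values, and the set of auto-increment attributes is supplied by the caller (for example, read from the schema). The parameters auto_increment and keep are hypothetical names.</p>
      <preformat>
```python
from datetime import date, datetime


def getdata(e1, auto_increment=(), keep=None):
    """Compute the payload pl by projecting attributes of the originating row t."""
    t = e1["t"]
    if keep is not None:
        # Manual choice by the domain expert.
        pl = {k: t[k] for k in keep}
    else:
        # Default heuristic: drop timestamps and auto-increment attributes.
        pl = {k: v for k, v in t.items()
              if k not in auto_increment and not isinstance(v, (date, datetime))}
    e2 = {k: v for k, v in e1.items() if k != "t"}
    e2["pl"] = pl
    return e2


e1 = {"id": 1, "pname": "order handling", "piid": None, "tname": "dispatch",
      "ts": datetime(2012, 3, 2),
      "t": {"row_id": 55, "order_id": 7, "customer": "acme",
            "dispatched_at": datetime(2012, 3, 2)}}
e2 = getdata(e1, auto_increment={"row_id"})
```
      </preformat>
      <p>Here the payload of e2 retains only order_id and customer, the attributes that remain candidates for correlation.</p>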
      <p>Event correlation patterns. Finally, we are ready to correlate events and
to compute the piid of the identified events. The goal of event correlation is to
group events into process instances, which are the basis for process discovery. As
explained above, we assume that, after associating the final payloads with events, all
information we need to correlate events is present in the payload pl of the events
in the form of attribute-value pairs. In practice, correlating events into traces
means discovering a function over the attributes of pl that tells
whether an event belongs to a given process instance, identified by the output piid of
the function. We represent this step as follows:</p>
      <p><disp-formula><tex-math>\mathit{correlate}(e_2) \rightarrow e = \langle \mathit{id}, \mathit{pname}, \mathit{piid}, \mathit{tname}, \mathit{ts}, \mathit{pl} \rangle</tex-math></disp-formula>
where e<sub>2</sub> is as defined above and e is the final version of the event discovered
from the OD, with the attribute piid filled with a suitable identifier of the process
instance the event belongs to.</p>
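      <p>The simplest kind of correlation function is key-based: events whose payloads agree on a chosen set of attributes belong to the same process instance. The sketch below illustrates only this special case and is not a description of the Event Correlator's actual algorithm; the dictionary layout of events, the attribute name order_id and the generated identifiers of the form "pi-1" are assumptions.</p>
      <preformat>
```python
def correlate(events, attributes):
    """Assign a piid to each event from the values of the given payload attributes."""
    piids = {}        # maps a tuple of attribute values to a generated piid
    correlated = []
    for e2 in events:
        key = tuple(e2["pl"].get(a) for a in attributes)
        # Reuse the piid of an already seen key, otherwise generate the next one.
        piid = piids.setdefault(key, "pi-%d" % (len(piids) + 1))
        correlated.append({**e2, "piid": piid})
    return correlated


events = [
    {"id": 1, "pname": "order handling", "piid": None, "tname": "create", "ts": 1,
     "pl": {"order_id": 7}},
    {"id": 2, "pname": "order handling", "piid": None, "tname": "create", "ts": 2,
     "pl": {"order_id": 8}},
    {"id": 3, "pname": "order handling", "piid": None, "tname": "dispatch", "ts": 3,
     "pl": {"order_id": 7}},
]
log = correlate(events, ["order_id"])
```
      </preformat>
      <p>In this example the first and third events share order_id 7 and therefore end up in the same process instance.</p>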
    </sec>
    <sec id="sec-2">
      <title>The Eventifier Environment</title>
      <p>Figure 2 provides an architectural view of the resulting approach to
eventification, which is a semi-automated process that requires the collaboration of a
domain expert having some basic knowledge of the OD to be eventified. First,
the domain expert identifies events in the OD, orders them, and associates data
with them. All these activities are supported by the so-called Event Extractor,
which assists the domain expert in an interactive and iterative fashion. The
result of this first step is a set of events, which are, however, not yet correlated.
Correlation is assisted by a dedicated Event Correlator, which again helps the
domain expert to interactively identify the best attributes and conditions to
reconstruct process traces. The result of the whole process is an event log that is
ready for process discovery.</p>
      <p>
        The Eventifier is implemented as an integrated platform that includes the
components for eventification, correlation and process discovery. These
components allow domain experts to interactively apply patterns and to navigate
end-to-end from the OD to the discovered process model and back. Since our aim
is not to make contributions to process discovery, we use existing process
discovery algorithms implemented as plugins for the popular process mining suite
ProM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. All components are implemented as Java desktop applications using
standard libraries such as Swing. The implementation of the Event Correlator
is partly based upon a software tool originally developed for the correlation of
EDI messages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For the creation of XES-conformant event logs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that are
used in the interface to process discovery in ProM, we employ the OpenXES
libraries (http://www.xes-standard.org/openxes/start). Figure 3 shows the
screenshots of the Event Extractor and Correlator components.
      </p>
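      <p>To give an idea of what an XES-conformant log looks like, the following sketch serializes correlated events with Python's standard xml.etree.ElementTree module. It is only an illustration and does not use OpenXES, the Java library employed by the tool; the function name to_xes and the dictionary layout of events are assumptions, and only the standard XES keys concept:name and time:timestamp are written.</p>
      <preformat>
```python
import xml.etree.ElementTree as ET
from datetime import datetime


def to_xes(events):
    """Serialize correlated events as a minimal XES document (one trace per piid)."""
    log = ET.Element("log", {"xes.version": "1.0"})
    traces = {}
    # Events are written in chronological order, grouped into one trace per piid.
    for e in sorted(events, key=lambda e: e["ts"]):
        trace = traces.get(e["piid"])
        if trace is None:
            trace = ET.SubElement(log, "trace")
            ET.SubElement(trace, "string", {"key": "concept:name", "value": str(e["piid"])})
            traces[e["piid"]] = trace
        ev = ET.SubElement(trace, "event")
        ET.SubElement(ev, "string", {"key": "concept:name", "value": e["tname"]})
        ET.SubElement(ev, "date", {"key": "time:timestamp", "value": e["ts"].isoformat()})
    return ET.tostring(log, encoding="unicode")


events = [
    {"tname": "dispatch", "piid": "pi-1", "ts": datetime(2012, 3, 2)},
    {"tname": "create", "piid": "pi-1", "ts": datetime(2012, 3, 1)},
    {"tname": "create", "piid": "pi-2", "ts": datetime(2012, 3, 3)},
]
xes_document = to_xes(events)
```
      </preformat>
      <p>A complete log would normally also declare the XES extensions it uses and carry the payload attributes; these are omitted here for brevity.</p>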
      <p>[Figure 2 diagram. Components shown: Domain Expert, Eventification Patterns, Eventification Rules, Operational DB, Event Extractor, Event Log DB, Correlation Rules, Event Correlator, Correlated Event Log DB; edge labels: defines, uses.]</p>
      <p>Fig. 2. Overview of the database eventification prototype and approach.</p>
    </sec>
    <sec id="sec-3">
      <title>Demo scenario</title>
      <p>A demo video of our eventification tool in action can be found at
http://sites.google.com/site/dbeventification. The demo takes the form
of a screencast and illustrates the main features of our tool, using as its scenario
the case of an Italian logistics company for refrigerated goods. In this video, we
show the two main tasks of our approach as outlined in Figure 2, and we
also show the final outcome in terms of the process model discovered from the
reconstructed event log.</p>
      <p>Acknowledgements. This work was supported by the Ianus project funded by
the Province of Trento (Italy) and Opera21 Group and by the Vienna Science
and Technology Fund (WWTF) through project ICT10-010.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>R.</given-names>
            <surname>Engel</surname>
          </string-name>
          , W. van der Aalst, M. Zapletal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pichler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>H.</given-names>
            <surname>Werthner</surname>
          </string-name>
          .
          <article-title>Mining Inter-organizational Business Process Models from EDI Messages: A Case Study from the Automotive Sector</article-title>
          .
          <source>In 24th Int. Conf. on Advanced Information Systems Engineering (CAiSE</source>
          <year>2012</year>
          ),
          <source>LNCS 7328</source>
          , pp.
          <fpage>222</fpage>
          -
          <lpage>237</lpage>
          . Springer,
          <year>2012</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>R.</given-names>
            <surname>Kimball</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Ross</surname>
          </string-name>
          .
          <article-title>The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling</article-title>
          . Wiley,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>H.</given-names>
            <surname>Motahari-Nezhad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Saint-Paul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Casati</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>Benatallah</surname>
          </string-name>
          .
          <article-title>Event Correlation for Process Discovery from Web Service Interaction Logs</article-title>
          .
          <source>VLDB Journal</source>
          ,
          <volume>20</volume>
          (
          <issue>3</issue>
          ):
          <fpage>417</fpage>
          –
          <lpage>444</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>R.</given-names>
            <surname>Ramakrishnan</surname>
          </string-name>
          and
          <string-name>
            <given-names>J.</given-names>
            <surname>Gehrke</surname>
          </string-name>
          .
          <source>Database Management Systems. McGraw-Hill</source>
          ,
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. W. van der Aalst.
          <source>Process Mining: Discovery, Conformance and Enhancement of Business Processes</source>
          . Springer,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>H.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buijs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          .
          <article-title>XES, XESame, and ProM 6</article-title>
          .
          <source>In Information Systems Evolution</source>
          , volume
          <volume>72</volume>
          , pages
          <fpage>60</fpage>
          –
          <lpage>75</lpage>
          .
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>