Eventifier: Extracting Process Execution Logs
             from Operational Databases

 Carlos Rodrı́guez1 , Robert Engel2 , Galena Kostoska1 , Florian Daniel1 , Fabio
                          Casati1 , and Marco Aimar3
                               1
                                  University of Trento,
                    Via Sommarive 5, I-38123, Povo (TN), Italy
             {crodriguez,kostoska,daniel,casati}@disi.unitn.it
                         2
                           Vienna University of Technology
             Institute of Software Technology and Interactive Systems
                              engel@ec.tuwien.ac.at
                               3
                                  Opera21 Group SpA,
                                 Rovereto (TN), Italy
                                 maimar@opera21.it


      Abstract. This demo introduces Eventifier, a tool that helps in recon-
      structing an event log from operational databases upon which process
      instances have been executed. The purpose of reconstructing such event
      log is that of discovering process models out of it, and, hence, the tool
      targets researches and practitioners interested in process mining. The
      aim of this demo is to convey to the participants both the conceptual
      and practical implications of identifying and extracting process execution
      events from such databases for reconstructing ready-to-use event logs for
      process discovery.


1    Introduction
Process discovery is the task of deriving a process model from process exe-
cution data that are typically stored in event logs, which in turn are generated
by information systems that support the process execution [5]. Most of the ap-
proaches available in the state of the art assume the existence of an event log,
where each event is assumed to have information, such as a process name, ac-
tivity name, execution timestamp, event type (e.g., start or end), and process
instance ID. In practice, most companies do not really have such an event log,
either because they do not have a business process engine that is able to gener-
ate such logs or, if they do, the engine supports only parts of the process, e.g.,
because parts of the process are supported by legacy systems. In the second case,
it may also happen that the engine does not generate an event log that can be
used for process discovery, e.g., if the log contains only events regarding errors
in the system.
    The information stored in an event log commonly provides a very narrow
and focused view on the overall data produced by a process during its execution
(e.g., focusing on errors for recovery or control flow decisions and actors for
2       C. Rodrı́guez et al.

auditing). Typically, however, an information system also stores the full data
produced by a process inside its operational databases (OD) (also known
as production databases), where these data comprise process progression data,
process state data, business data produced throughout the process, data related
to the regular operations of an organization, as well as their related business
facts and objects [2]. ODs therefore store more and richer data than event logs,
but blur different aspects of data and neglect the event-based nature of process
executions. For this reason, process discovery starts from event logs.
    With this demo, we approach the problem of producing process execution
events in a fundamentally different context, i.e., in a context where we do not
have access to the information system running the process (hence we cannot
instrument it) and where the only way of obtaining process execution events is
deriving them from the OD of the information system after the actual process
execution. We call this activity eventification of the OD and we perform it
with the help of our tool Eventifier. For the rest of the paper, we assume that
the OD is a relational database [4].

Significance to the BPM field. Much attention has been paid so far to the
problems of representing event logs [6], event correlation [3] and process discovery
[5], while the problem of how to produce good events has been neglected by
research. As explained above, Eventifier approaches an important issue in the
field of process mining by providing an application that will help both researches
and practitioners working in the field.


2    Eventification of the Operational Database

Let’s start by giving some preliminary definitions. An event log can be seen as a
sequence of events E = [e1 , e2 , ..., em ], where ei = hid, tname, pname, piid, ts, pli
is an event of a process instance, with id being the identifier of the event, tname
being the name of the task the event is associated with, pname being the name
of the process type, piid being the process instance identifier, ts being the times-
tamp of the event, and pl being the payload of the event. Thus, an event log
stores traces of process executions as atomic events that represent process pro-
gression information and that may carry business data in their payload.
    Reconstructing an event log E with events ei means deciding when to infer
the existence of an event from the data in the OD and filling each of the attributes
of the event structure with meaningful values. These values either stem from
the data in the OD or they may be provided by a domain expert. Specifically,
for the id attribute, assigning an identifier to an event means recognizing the
existence of the event. Given that we do not have real events in the OD but
other, indirect evidence of their occurrence, there is no “correct” or “original”
event identifier to be discovered. The question here is what we consider evidence
of an event. Similarly, in the case of tname, without the concept of task in the
applications of the information system, there is no explicit task naming that can
be discovered from the data. Thus, we need to find a way to label the boxes that
             Extracting Process Execution Logs from Operational Databases          3

will represent tasks in the discovered model. The value for the attribute pname
(the process name) we can only get from the domain expert, who knows which
process she is trying to discover. Then, the process instance identifier (piid) is
needed to group events into process instances. The piid is derived by means of
event correlation based on the values of the attributes of the identified events.
The attribute ts is needed to order events chronologically, which is a requirement
for process discovery. Therefore, we need to find evidences in the OD that help us
in determining the ordering of events. Finally, the goal of choosing a payload pl
for the purpose of eventification is not to reconstruct the complete business data
that can be associated with a given task or event, but rather that of supporting
the correlation of events into process instances. We can get this data from the
rows that originate the events.
    We call the assignment of values to id, pname and tname the identification
of an event, to ts the ordering of events, to pl data association, and to piid
correlation. These four activities together constitute the eventification process,
and it is helped by heuristics in the form of eventification patterns:

Event identification patterns. These patterns help in the identification of
events from the OD. In these patterns, we assume that the existence of a row in
a relation R indicates the presence of an event. We express these patterns as a
function:
    identif y(R, pname, tname) → e0 = hid, pname, −, tname, −, ti
where pname and tname are defined by the domain expert, and t is the tuple
in R that originated e0 . In concrete, we rely on the following three patterns for
the identification of events:

 – Single row, single event pattern (Figure 1(a)). In this pattern, each row in
   a relation R indicates the existence of an event. R can be obtained with a
   simple SQL query as:
     SELECT * FROM r1 , r2 , ..., rn
     WHERE [JOIN conditions for r1 , r2 , ..., rn ];
 – Single row, multiple event pattern (Figure 1(b)). A tuple in R can evidence
   the existence of more than one event, such as when different values of the
   attributes Ai of R indicate different potential events. In this case, the relation
   R is built by applying filtering conditions in the WHERE clause so as to keep
   only the target events:
     SELECT * FROM r1 , r2 , ..., rn
     WHERE [JOIN conditions for r1 , r2 , ..., rn ]
     AND [filtering conditions for the target event, e.g., r2 .dispatched = yes];
 – Multiple row, single event pattern (Figure 1(c)). Multiple rows in a relation
   R indicate the presence of a single event. This last pattern is useful, for
   instance, when we deal with a denormalized relation that mixes data at
   different granularities, e.g., when in a single tuple we find both the header
   of an invoice and the item sold. The SQL for R has the following form,
     SELECT DISTINCT A1 , A2 , ..., Ak FROM r1 , r2 , ..., rn
     WHERE [JOIN conditions for r1 , r2 , ..., rn ] ;
4           C. Rodrı́guez et al.
      (a)                                (b)                                                (c)

A1 orderID   ... A   n        A1 dispatched delivered   ... A   n
                                                                                     A1 orderID itemID   ... A   n
                         e1                                         e 1 [dispatch]   xx 1
xx   xx          xx           xx    yes         no         xx                                      1        xx
xx   xx          xx      e2                                         e 2 [dispatch]   xx 1          2        xx       e 1 [invoice]
                              xx    yes        yes         xx                        xx 1          3        xx
xx   xx          xx           xx     no         no         xx       e3 [deliver]     xx 3
                         e3                                                                        1        xx


Fig. 1. Types of event identification patterns: (a) single row, single event, (b) single
row, multiple events, and (c) multiple row, single event pattern

    where the attributes Ai should be the higher granularity attributes that
    would be typically used in a GROUP BY, SQL statement.


Event ordering pattern. The event ordering pattern aims at deriving the
ordering of events from time-related information associated to the records stored
in the OD, and is represented as:
    order(e0 ) → e1 = hid, pname, −, tname, ts, ti
where e1 is the result of attaching a timestamp value to ts, and ts is the pro-
jection of all timestamp or date attributes of e0 .t generated by the previous
pattern. If only one timestamp can be found, it is used straightaway. If there are
more possible timestamps in pl, the domain expert chooses the one that best
represents the execution time of the task.

Data association pattern. The data association pattern aims to select which
data to assign to pl. In the above patterns, we have so far simply carried over
the complete row t as payload of the event, while here we aim to select which
attributes out of the ones in t are really relevant. Our assumption is that all
necessary data is already present inside t, that is, we do not need to consult any
additional tables of the OD to fill pl with meaningful data. Thus, in the event
identification step, the necessary tables are joined, and t contains all potentially
relevant data items. The data association pattern is represented as:
    getdata(e1 ) → e2 = hid, pname, −, tname, ts, pli
where e1 is as defined before, and pl is the new payload computed by projecting
attributes from t. In absence of any knowledge about the OD by the domain
expert, the heuristic we apply is to copy into pl all attributes of t, except times-
tamps and auto-increment attributes, which by design cannot be used for corre-
lation. The domain expert can of course also choose manually which attributes
to include and which to exclude.

Event correlation patterns. Eventually, we are ready to correlate events and
to compute the piid of the identified events. The goal of event correlation is to
group events into process instances, which are the basis for process discovery. As
explained above, we assume that after associating the final payloads to events all
information we need to correlate events is present in the payload pl of the events
in the form of attribute-value pairs. In practice, correlating events into traces
means discovering the mathematical function over the attributes of pl that tells
             Extracting Process Execution Logs from Operational Databases                                        5

if an event belongs to a given process instance, identified by the output piid of
the function. We represent this step as follows:
    correlate(e2 ) → e = hid, pname, piid, tname, ts, pli
where e2 is as defined above and e is the final version of the discovered event
from the OD with the attribute piid filled with a suitable identifier of the process
instance the event belongs to.


3    The Eventifier Environment

Figure 2 provides an architectural view on the resulting approach to eventifi-
cation, which is a semi-automated process that requires the collaboration of a
domain expert having some basic knowledge of the OD to be eventified. First,
the domain expert identifies events in the OD, orders them, and associates data
with them. All these activities are supported the the so-called Event Extractor,
which supports the domain expert in an interactive and iterative fashion. The
result of this first step is a set of events, which are however not yet correlated.
Correlation is assisted via a dedicated Event Correlator, which again helps the
domain expert to interactively identify the best attributes and conditions to re-
construct process traces. The result of the whole process is an event log that is
ready for process discovery.
    The Eventifier is implemented as an integrated platform that includes the
components for eventification, correlation and process discovery. These compo-
nents allow domain experts to interactively apply patterns and to navigate end-
to-end from the OD to the discovered process model and back. Since our aim
is not to make contributions on process discovery, we use existing process dis-
covery algorithms implemented as plugins for the popular process mining suite
ProM [6]. All components are implemented as Java desktop applications using
standard libraries such as Swing. The implementation of the Event Correlator
is partly based upon a software tool originally developed for the correlation of
EDI messages [1]. For the creation of XES-conformant event logs [6] that are
used in the interface to process discovery in ProM, we employ the OpenXES li-
braries (http://www.xes-standard.org/openxes/start). Figure 3 shows the
screenshots of the Event Extractor and Correlator components.

                               Domain Expert                       Domain
                                                                   Expert             Correlation
                  Eventification              Eventification                            Rules
                    Patterns            defines Rules                       defines


                                       uses                             uses


                                    Event                                    Event                  Correlated
              Operational                              Event Log
                                   Extractor                                Correlator              Event Log
                 DB                                       DB
                                                                                                       DB


      Fig. 2. Overview of the database eventification prototype and approach.
6          C. Rodrı́guez et al.
                      Event	
 correlator
          Event	
 
          extractor


    Fig. 3. Screenshots of the components of our integrated platform for eventification.

4      Demo scenario
A demo video of our eventification tool in action can be found at the website
http://sites.google.com/site/dbeventification. The demo is in the form
of a screencast and illustrates the main features of our tool using as scenario
the case of an Italian logistics company for refrigerated goods. In this video we
clearly show the two main tasks of our approach as outlined in Figure 2 and we
also show the final outcome in terms of the process model discovered from the
reconstructed event log.
Acknowledgements. This work was supported by the Ianus project funded by
the Province of Trento (Italy) and Opera21 Group and by the Vienna Science
and Technology Fund (WWTF) through project ICT10-010.

References
1. R. Engel, W. van der Aalst, M. Zapletal, C. Pichler, and H. Werthner. Mining
   Inter-organizational Business Process Models from EDI Messages: A Case Study
   from the Automotive Sector. In 24th Int. Conf. on Advanced Information Systems
   Engineering (CAiSE 2012), LNCS 7328, pp.222-237. Springer, 2012.
2. R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to
   Dimensional Modeling. Wiley, 2002.
3. H. Montahari-Nezhad, R. Saint-Paul, F. Casati, and B. Benatallah. Event Corre-
   lation for Process Discovery from Web Service Interaction Logs. VLDB Journal,
   20(3):417–444, 2011.
4. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill,
   2007.
5. W. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of
   Business Processes. Springer, 2011.
6. H. Verbeek, J. Buijs, B. van Dongen, and W. van der Aalst. XES, XESame, and
   ProM 6. In Information Systems Evolution, volume 72, pages 60–75. 2011.