
Eventifier: Extracting Process Execution Logs from Operational Databases

Carlos Rodríguez

crodriguez@disi.unitn.it 1

Robert Engel

engel@ec.tuwien.ac.at 2

Galena Kostoska

kostoska@disi.unitn.it 1

Florian Daniel

daniel@disi.unitn.it 1

Fabio Casati

casati@disi.unitn.it 1

Marco Aimar

maimar@opera21.it 0

0 Opera21 Group SpA, Rovereto (TN), Italy
1 University of Trento, Via Sommarive 5, I-38123, Povo (TN), Italy
2 Vienna University of Technology, Institute of Software Technology and Interactive Systems

This demo introduces Eventifier, a tool that helps reconstruct an event log from operational databases upon which process instances have been executed. The purpose of reconstructing such an event log is to discover process models from it; hence, the tool targets researchers and practitioners interested in process mining. The aim of this demo is to convey to the participants both the conceptual and practical implications of identifying and extracting process execution events from such databases in order to reconstruct ready-to-use event logs for process discovery.

Process discovery is the task of deriving a process model from process execution data that are typically stored in event logs, which in turn are generated by the information systems that support the process execution [5]. Most state-of-the-art approaches assume the existence of an event log in which each event carries information such as a process name, activity name, execution timestamp, event type (e.g., start or end), and process instance ID. In practice, most companies do not have such an event log, either because they do not have a business process engine that is able to generate one or, if they do, because the engine supports only parts of the process, e.g., because other parts are supported by legacy systems. In the second case, it may also happen that the engine does not generate an event log that can be used for process discovery, e.g., if the log contains only events regarding errors in the system.

The information stored in an event log commonly provides a very narrow and focused view on the overall data produced by a process during its execution (e.g., focusing on errors for recovery, or on control flow decisions and actors for auditing). Typically, however, an information system also stores the full data produced by a process inside its operational databases (OD), also known as production databases. These data comprise process progression data, process state data, business data produced throughout the process, and data related to the regular operations of an organization, as well as their related business facts and objects [2]. ODs therefore store more and richer data than event logs, but blur different aspects of data and neglect the event-based nature of process executions. For this reason, process discovery starts from event logs.

With this demo, we approach the problem of producing process execution events in a fundamentally different context, i.e., a context where we do not have access to the information system running the process (hence we cannot instrument it) and where the only way of obtaining process execution events is to derive them from the OD of the information system after the actual process execution. We call this activity eventification of the OD, and we perform it with the help of our tool Eventifier. For the rest of the paper, we assume that the OD is a relational database [4].

Significance to the BPM field. Much attention has been paid so far to the problems of representing event logs [6], event correlation [3] and process discovery [5], while the problem of how to produce good events has been neglected by research. As explained above, Eventifier addresses an important issue in the field of process mining by providing an application that helps both researchers and practitioners working in the field.

Eventification of the Operational Database

Let us start by giving some preliminary definitions. An event log can be seen as a sequence of events $E = [e_1, e_2, \ldots, e_m]$, where $e_i = \langle id, tname, pname, piid, ts, pl \rangle$ is an event of a process instance, with id being the identifier of the event, tname being the name of the task the event is associated with, pname being the name of the process type, piid being the process instance identifier, ts being the timestamp of the event, and pl being the payload of the event. Thus, an event log stores traces of process executions as atomic events that represent process progression information and that may carry business data in their payload.
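To make the event structure concrete, the following is a minimal sketch of how such an event could be represented in code. The class name `Event`, the type alias `EventLog`, and the sample values are illustrative assumptions and are not taken from the Eventifier implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class Event:
    """One event e_i = <id, tname, pname, piid, ts, pl> (hypothetical representation)."""
    id: int                         # identifier of the event
    tname: str                      # name of the task the event is associated with
    pname: str                      # name of the process type
    piid: Optional[str] = None      # process instance identifier (assigned by correlation)
    ts: Optional[datetime] = None   # timestamp of the event (assigned by ordering)
    pl: Dict[str, Any] = field(default_factory=dict)  # payload (business data)


# An event log is a sequence of events.
EventLog = List[Event]

# Made-up sample data for illustration only.
log: EventLog = [
    Event(2, "dispatch", "order handling", "o-17", datetime(2012, 3, 2, 14, 30), {"orderID": 17}),
    Event(1, "invoice", "order handling", "o-17", datetime(2012, 3, 1, 9, 0), {"orderID": 17}),
]

# The events of one process instance, ordered by timestamp, form a trace.
trace = sorted((e for e in log if e.piid == "o-17"), key=lambda e: e.ts)
print([e.tname for e in trace])
```

The attributes piid and ts are optional in this sketch because, during eventification, they are filled in only by the later correlation and ordering steps.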

Reconstructing an event log E with events $e_i$ means deciding when to infer the existence of an event from the data in the OD and filling each of the attributes of the event structure with meaningful values. These values either stem from the data in the OD or are provided by a domain expert. Specifically, for the id attribute, assigning an identifier to an event means recognizing the existence of the event. Given that we do not have real events in the OD but only other, indirect evidence of their occurrence, there is no "correct" or "original" event identifier to be discovered. The question here is what we consider evidence of an event. Similarly, in the case of tname, without the concept of task in the applications of the information system, there is no explicit task naming that can be discovered from the data. Thus, we need to find a way to label the boxes that will represent tasks in the discovered model. The value for the attribute pname (the process name) can only come from the domain expert, who knows which process she is trying to discover. The process instance identifier (piid) is needed to group events into process instances. The piid is derived by means of event correlation based on the values of the attributes of the identified events. The attribute ts is needed to order events chronologically, which is a requirement for process discovery. Therefore, we need to find evidence in the OD that helps us determine the ordering of events. Finally, the goal of choosing a payload pl for the purpose of eventification is not to reconstruct the complete business data that can be associated with a given task or event, but rather to support the correlation of events into process instances. We can get this data from the rows that originate the events.

We call the assignment of values to id, pname and tname the identification of an event, to ts the ordering of events, to pl data association, and to piid correlation. These four activities together constitute the eventification process, which is supported by heuristics in the form of eventification patterns:

Event identification patterns. These patterns help in the identification of events from the OD. In these patterns, we assume that the existence of a row in a relation R indicates the presence of an event. We express these patterns as a function:

$identify(R, pname, tname) \rightarrow e_0 = \langle id, pname, \bot, tname, \bot, t \rangle$

where pname and tname are defined by the domain expert, t is the tuple in R that originated $e_0$, and $\bot$ marks an attribute (here piid and ts) that has not been assigned yet. Concretely, we rely on the following three patterns for the identification of events:

- Single row, single event pattern (Figure 1(a)). In this pattern, each row in a relation R indicates the existence of an event. R can be obtained with a simple SQL query such as:

    SELECT * FROM r1, r2, ..., rn
    WHERE [JOIN conditions for r1, r2, ..., rn];

- Single row, multiple event pattern (Figure 1(b)). A tuple in R can be evidence of more than one event, such as when different values of the attributes $A_i$ of R indicate different potential events. In this case, the relation R is built by applying filtering conditions in the WHERE clause so as to keep only the target events:

    SELECT * FROM r1, r2, ..., rn
    WHERE [JOIN conditions for r1, r2, ..., rn]
    AND [filtering conditions for the target event, e.g., r2.dispatched = 'yes'];

- Multiple row, single event pattern (Figure 1(c)). Multiple rows in a relation R indicate the presence of a single event. This last pattern is useful, for instance, when we deal with a denormalized relation that mixes data at different granularities, e.g., when in a single tuple we find both the header of an invoice and the item sold. The SQL for R has the following form:

    SELECT DISTINCT A1, A2, ..., Ak FROM r1, r2, ..., rn
    WHERE [JOIN conditions for r1, r2, ..., rn];

  where the attributes $A_i$ should be the higher-granularity attributes that would typically be used in a SQL GROUP BY clause.

[Figure 1. Event identification patterns: (a) single row, single event; (b) single row, multiple events (attributes dispatched and delivered); (c) multiple rows, single event (attributes orderID and itemID).]
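The three identification patterns can be tried out on a toy database. The sketch below uses Python's built-in sqlite3 module; the tables `orders` and `invoice_items`, their contents, the task names, and the helper `identify` are hypothetical and only mirror the order and invoice examples mentioned in the text. Unassigned attributes are represented by `None`.

```python
import itertools
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE orders (orderID INTEGER, dispatched TEXT, delivered TEXT);
    INSERT INTO orders VALUES (1, 'yes', 'no'), (2, 'yes', 'yes'), (3, 'no', 'no');
    -- Denormalized relation: the invoice header is repeated for every item sold.
    CREATE TABLE invoice_items (invoiceID INTEGER, orderID INTEGER, itemID INTEGER);
    INSERT INTO invoice_items VALUES (10, 1, 100), (10, 1, 101), (11, 2, 100);
""")

_next_id = itertools.count(1)  # event identifiers are simply generated, not discovered


def identify(sql, pname, tname):
    """Turn every row of the relation R computed by `sql` into an event e0."""
    cur = con.execute(sql)
    columns = [c[0] for c in cur.description]
    return [
        {"id": next(_next_id), "pname": pname, "piid": None,
         "tname": tname, "ts": None, "t": dict(zip(columns, row))}
        for row in cur.fetchall()
    ]


pname = "order handling"
# (a) Single row, single event: every order row is one "create order" event.
created = identify("SELECT * FROM orders", pname, "create order")
# (b) Single row, multiple events: one filter per target event on the same rows.
dispatched = identify("SELECT * FROM orders WHERE dispatched = 'yes'", pname, "dispatch")
delivered = identify("SELECT * FROM orders WHERE delivered = 'yes'", pname, "deliver")
# (c) Multiple rows, single event: DISTINCT on the coarser-grained attributes.
invoiced = identify("SELECT DISTINCT invoiceID, orderID FROM invoice_items", pname, "invoice")

print(len(created), len(dispatched), len(delivered), len(invoiced))
```

In pattern (c), the two item rows of invoice 10 collapse into a single invoice event because only the header attributes are projected.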

Event ordering pattern. The event ordering pattern aims at deriving the ordering of events from time-related information associated with the records stored in the OD, and is represented as:

$order(e_0) \rightarrow e_1 = \langle id, pname, \bot, tname, ts, t \rangle$

where $e_1$ is the result of attaching a timestamp value to ts (piid remains unassigned, written $\bot$), and the candidate values for ts are the projection of all timestamp or date attributes of $e_0.t$, the tuple produced by the previous pattern. If only one timestamp can be found, it is used straightaway. If there are several candidate timestamps in t, the domain expert chooses the one that best represents the execution time of the task.
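A minimal sketch of the ordering step follows, under the assumption that events are dictionaries carrying the originating row in `t` and that timestamp attributes are already typed as `datetime` values. The function names and the `choose` callback (standing in for the domain expert's decision) are hypothetical.

```python
from datetime import datetime


def timestamp_candidates(t):
    """Project the timestamp attributes of the originating row t."""
    return [attr for attr, value in t.items() if isinstance(value, datetime)]


def order(e0, choose=None):
    """Attach a timestamp to e0; `choose` plays the role of the domain expert."""
    candidates = timestamp_candidates(e0["t"])
    if not candidates:
        raise ValueError("no timestamp attribute found in the originating row")
    if len(candidates) == 1:
        attr = candidates[0]          # a single timestamp is used straightaway
    elif choose is not None:
        attr = choose(candidates)     # several candidates: the expert decides
    else:
        raise ValueError(f"ambiguous timestamps, expert choice needed: {candidates}")
    return {**e0, "ts": e0["t"][attr]}


# Made-up events for illustration only.
dispatch = {"id": 1, "pname": "order handling", "piid": None, "tname": "dispatch", "ts": None,
            "t": {"orderID": 1, "dispatchedAt": datetime(2012, 3, 2, 14, 30)}}
invoice = {"id": 2, "pname": "order handling", "piid": None, "tname": "invoice", "ts": None,
           "t": {"orderID": 1, "issuedAt": datetime(2012, 3, 1, 9, 0),
                 "paidAt": datetime(2012, 3, 9, 16, 0)}}

ordered = sorted([order(dispatch), order(invoice, choose=lambda c: "issuedAt")],
                 key=lambda e: e["ts"])
print([e["tname"] for e in ordered])
```

Raising an error on ambiguous timestamps, rather than silently picking one, reflects the semi-automated nature of the approach: the choice is left to the expert.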

Data association pattern. The data association pattern aims to select which data to assign to pl. In the above patterns, we have so far simply carried over the complete row t as the payload of the event, while here we aim to select which of the attributes in t are really relevant. Our assumption is that all necessary data is already present inside t, that is, we do not need to consult any additional tables of the OD to fill pl with meaningful data. Thus, in the event identification step, the necessary tables are joined, and t contains all potentially relevant data items. The data association pattern is represented as:

$getdata(e_1) \rightarrow e_2 = \langle id, pname, \bot, tname, ts, pl \rangle$

where $e_1$ is as defined before, piid is still unassigned ($\bot$), and pl is the new payload computed by projecting attributes from t. In the absence of any knowledge about the OD on the part of the domain expert, the heuristic we apply is to copy into pl all attributes of t except timestamps and auto-increment attributes, which by design cannot be used for correlation. The domain expert can of course also choose manually which attributes to include and which to exclude.
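The default heuristic can be sketched as a simple projection. In this illustration, the set of auto-increment attributes is passed in explicitly (an assumption: in practice it would have to come from the schema or from the expert), and the optional `keep` argument stands in for the expert's manual selection.

```python
from datetime import date, datetime


def getdata(e1, auto_increment=frozenset(), keep=None):
    """Replace the full row t of e1 by a payload pl holding only relevant attributes."""
    t = e1["t"]
    if keep is not None:
        # Manual choice by the domain expert.
        pl = {attr: t[attr] for attr in keep}
    else:
        # Heuristic: drop timestamps and auto-increment attributes.
        pl = {attr: value for attr, value in t.items()
              if attr not in auto_increment and not isinstance(value, (date, datetime))}
    e2 = {key: value for key, value in e1.items() if key != "t"}
    e2["pl"] = pl
    return e2


# Made-up event for illustration only.
e1 = {"id": 1, "pname": "order handling", "piid": None, "tname": "dispatch",
      "ts": datetime(2012, 3, 2, 14, 30),
      "t": {"rowID": 42, "orderID": 1, "customer": "ACME",
            "dispatchedAt": datetime(2012, 3, 2, 14, 30)}}

e2 = getdata(e1, auto_increment={"rowID"})
print(e2["pl"])
```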

Event correlation patterns. Finally, we are ready to correlate events and to compute the piid of the identified events. The goal of event correlation is to group events into process instances, which are the basis for process discovery. As explained above, we assume that, after associating the final payloads with events, all the information we need to correlate events is present in the payload pl of the events in the form of attribute-value pairs. In practice, correlating events into traces means discovering the mathematical function over the attributes of pl that tells whether an event belongs to a given process instance, identified by the output piid of the function. We represent this step as follows:

$correlate(e_2) \rightarrow e = \langle id, pname, piid, tname, ts, pl \rangle$

where $e_2$ is as defined above and e is the final version of the event discovered from the OD, with the attribute piid filled with a suitable identifier of the process instance the event belongs to.
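As an illustration of the simplest possible correlation function, the sketch below assumes that events sharing the same value of one payload attribute belong to the same process instance. This key-based rule, the attribute name `orderID`, and the helper names are assumptions made for the example; the correlation functions discovered with the Event Correlator may be more elaborate.

```python
from collections import defaultdict


def correlate(events, key):
    """Assign a piid to each event from the value of payload attribute `key`."""
    return [{**e, "piid": f"{key}={e['pl'][key]}"} for e in events]


def traces(events):
    """Group correlated events into chronologically ordered traces of task names."""
    grouped = defaultdict(list)
    for e in sorted(events, key=lambda e: e["ts"]):
        grouped[e["piid"]].append(e["tname"])
    return dict(grouped)


# Made-up events; ISO 8601 strings sort chronologically.
events = [
    {"id": 1, "pname": "order handling", "piid": None, "tname": "dispatch",
     "ts": "2012-03-02T14:30:00", "pl": {"orderID": 1}},
    {"id": 2, "pname": "order handling", "piid": None, "tname": "invoice",
     "ts": "2012-03-01T09:00:00", "pl": {"orderID": 1}},
    {"id": 3, "pname": "order handling", "piid": None, "tname": "invoice",
     "ts": "2012-03-03T10:00:00", "pl": {"orderID": 2}},
]

log = correlate(events, key="orderID")
print(traces(log))
```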

The Eventifier Environment

Figure 2 provides an architectural view of the resulting approach to eventification, which is a semi-automated process that requires the collaboration of a domain expert having some basic knowledge of the OD to be eventified. First, the domain expert identifies events in the OD, orders them, and associates data with them. All these activities are supported by the so-called Event Extractor, which assists the domain expert in an interactive and iterative fashion. The result of this first step is a set of events, which are, however, not yet correlated. Correlation is assisted by a dedicated Event Correlator, which again helps the domain expert to interactively identify the best attributes and conditions to reconstruct process traces. The result of the whole process is an event log that is ready for process discovery.

The Eventifier is implemented as an integrated platform that includes the components for eventification, correlation and process discovery. These components allow domain experts to interactively apply patterns and to navigate end-to-end from the OD to the discovered process model and back. Since our aim is not to make contributions to process discovery, we use existing process discovery algorithms implemented as plugins for the popular process mining suite ProM [6]. All components are implemented as Java desktop applications using standard libraries such as Swing. The implementation of the Event Correlator is partly based upon a software tool originally developed for the correlation of EDI messages [1]. For the creation of XES-conformant event logs [6] that are used in the interface to process discovery in ProM, we employ the OpenXES libraries (http://www.xes-standard.org/openxes/start). Figure 3 shows screenshots of the Event Extractor and Event Correlator components.
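To give an idea of what the resulting log looks like, the following sketch writes correlated events as a bare-bones XES-style XML document using only Python's standard library. It is not the OpenXES API and not the tool's code; it only uses the basic XES elements (log, trace, event) and the standard attribute keys concept:name and time:timestamp, and it omits the extension declarations that a fully conformant log would normally include. The function name `to_xes` and the sample events are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_xes(events):
    """Serialize correlated events into a minimal XES-style XML string."""
    log = ET.Element("log", {"xes.version": "1.0"})
    traces = {}
    for e in sorted(events, key=lambda e: e["ts"]):
        trace = traces.get(e["piid"])
        if trace is None:
            # One trace per process instance, named after its piid.
            trace = ET.SubElement(log, "trace")
            ET.SubElement(trace, "string", {"key": "concept:name", "value": e["piid"]})
            traces[e["piid"]] = trace
        event = ET.SubElement(trace, "event")
        ET.SubElement(event, "string", {"key": "concept:name", "value": e["tname"]})
        ET.SubElement(event, "date", {"key": "time:timestamp", "value": e["ts"]})
    return ET.tostring(log, encoding="unicode")


# Made-up correlated events for illustration only.
events = [
    {"piid": "order-1", "tname": "dispatch", "ts": "2012-03-02T14:30:00+01:00"},
    {"piid": "order-1", "tname": "invoice", "ts": "2012-03-01T09:00:00+01:00"},
    {"piid": "order-2", "tname": "invoice", "ts": "2012-03-03T10:00:00+01:00"},
]

xes = to_xes(events)
print(xes)
```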

[Figure 2 diagram: guided by the eventification patterns, the domain expert defines eventification rules; the Event Extractor uses these rules to derive an event log from the operational database; the domain expert then defines correlation rules, which the Event Correlator uses to produce the correlated event log.]

Fig. 2. Overview of the database eventification prototype and approach.

Demo scenario

A demo video of our eventification tool in action can be found at the website http://sites.google.com/site/dbeventification. The demo takes the form of a screencast and illustrates the main features of our tool using as a scenario the case of an Italian logistics company for refrigerated goods. In this video, we show the two main tasks of our approach as outlined in Figure 2, as well as the final outcome in terms of the process model discovered from the reconstructed event log.

Acknowledgements. This work was supported by the Ianus project funded by the Province of Trento (Italy) and Opera21 Group and by the Vienna Science and Technology Fund (WWTF) through project ICT10-010.

References

1. R. Engel, W. van der Aalst, M. Zapletal, C. Pichler, and H. Werthner. Mining Inter-organizational Business Process Models from EDI Messages: A Case Study from the Automotive Sector. In 24th Int. Conf. on Advanced Information Systems Engineering (CAiSE 2012), LNCS 7328, pp. 222-237. Springer, 2012.

2. R. Kimball and M. Ross. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling. Wiley, 2002.

3. H. R. Motahari-Nezhad, R. Saint-Paul, F. Casati, and B. Benatallah. Event Correlation for Process Discovery from Web Service Interaction Logs. VLDB Journal, 20(3):417-444, 2011.

4. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, 2007.

5. W. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, 2011.

6. H. M. W. Verbeek, J. C. A. M. Buijs, B. van Dongen, and W. van der Aalst. XES, XESame, and ProM 6. In Information Systems Evolution, volume 72, pages 60-75. 2011.