The onprom Toolchain for Extracting Business Process Logs using Ontology-based Data Access

Diego Calvanese, Tahir Emre Kalayci, Marco Montali, and Ario Santoso

KRDB Research Centre for Knowledge and Data
Free University of Bozen-Bolzano, Bolzano, Italy
{calvanese,tkalayci,montali,santoso}@inf.unibz.it

Abstract. Process mining techniques require the input data to be explicitly structured in the form of an event log. Unfortunately, in many real world settings, such event logs are not explicitly given, but they are implicitly stored in legacy information systems. Therefore, to enable process mining, there is a need to support the data preparation and the log extraction from legacy information systems. The onprom tool-chain aims at supporting users in the semi-automatic extraction of event logs from a legacy information system, reflecting different process-related views on the same data, and consequently facilitating multi-perspective process mining. The tool-chain is based on the ontology-based data access paradigm, and consists of three components, namely UML editor, annotation editor, and log extractor. Each component can be used either as a plug-in for the extensible process mining framework ProM, or within an integrated toolkit. The produced logs are fully compliant with the XES standard.

Keywords: data preparation for process mining, log extraction, event logs, XES format, ontology-based data access, conceptual modeling

1 Introduction

The emerging area of process mining has become increasingly popular both in academia and in industry to resolve the mismatch between process models and event data generated by process executions [1,2]. Process mining is a collection of techniques that combine model-based and data-oriented analysis in a synergic way to obtain useful insights on how business processes are executed in a real organizational environment.
Through process mining, decision makers can discover process models from data, compare expected and actual behaviors, and enrich models with information obtained from their execution.

The applicability of process mining depends on two crucial factors [2,5,6]: (i) the availability of high-quality event data, and of event logs containing correct and complete event data about which cases have been executed, which events occurred for each case, and when they took place; (ii) the representation of such data in a format that is understandable by process mining algorithms, such as the XML-based IEEE eXtensible Event Stream (XES) standard [8]. However, in many real world settings, the enterprise exploits functionalities offered by more general enterprise systems such as ERP, CRM, SCM, and other business suites. In addition, such systems are typically configured for the specific needs of the company, and connected to domain-specific and other legacy information systems. In such a complex common setting, event data are not explicitly given, but have instead to be reconstructed by reconciling these different data sources and reflecting the perspective of interest. This is why there is a pressing need for techniques that are able to support data and process analysts in the data preparation phase, and in particular in the extraction of event data from legacy information systems [1].

Fig. 1. The onprom methodology and its four phases [6,5]

Recently, we have proposed the onprom methodology [5,6,7] to support data and process analysts in the conceptual identification of event data, answering questions like: (i) Which are relevant concepts and relations? (ii) How do such concepts/relations map to the underlying information system?
(iii) Which concepts/relations relate to the notion of case, event, and event attributes? Our methodology is backed up by a tool-chain that, once the aforementioned questions are answered, automatically extracts an event log conforming to the chosen perspective, and obtained by inspecting the data residing at the sources. This is done borrowing techniques from intelligent data management, and in particular by exploiting the ontology-based data access (OBDA) paradigm and tools [3,4,10]. A detailed account of the onprom framework is provided in [5].

2 The onprom methodology and tool-chain

The onprom methodology, illustrated in Figure 1, comprises four main phases. We assume the existence of a legacy information system I = ⟨R, D⟩, with schema R and a set D of facts about the domain of interest. We consider the typical case where the information system is a relational database, hence R is a relational schema, and D is a relational database instance structured according to R.

The first phase deals with the definition of an OBDA model towards understanding the data conceptually. In particular, the OBDA model serves a two-fold purpose [4,10]: (i) To provide a conceptual representation of the domain of interest, in terms of relevant concepts and relations that are understandable by domain experts. This is given in terms of a conceptual data schema T; (ii) To explicitly link I to T, by means of a declarative mapping specification M, consequently allowing for querying I by using the vocabulary provided by T. In onprom, we employ UML class diagrams as a concrete language for representing T, and we rely on their logic-based encoding [4] in terms of the standard ontology language OWL 2 QL [9]. M consists of a set of logical assertions that map SQL queries over schema R to facts over T.

Once the OBDA system is in place, onprom allows one to abstract away the information system.
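To make the role of M more concrete, the following minimal sketch applies a single mapping assertion to a toy relational instance and lists the resulting facts over T. The orders table, the IRI templates, and the materialize helper are hypothetical and not part of onprom. The sketch materializes the facts only for illustration, whereas OBDA systems such as ontop [3] typically answer queries over T by rewriting them into SQL instead of materializing facts.

```python
import sqlite3

# Hypothetical legacy schema R and instance D (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, created_at TEXT);
    INSERT INTO orders VALUES (1, 'alice', '2017-03-01T10:00:00');
    INSERT INTO orders VALUES (2, 'bob',   '2017-03-02T09:30:00');
""")

# A toy mapping specification M: each assertion pairs an SQL query over R
# with templates that build facts (triples) over the vocabulary of T.
MAPPINGS = [
    {
        "source": "SELECT id, customer, created_at FROM orders",
        "target": [
            (":order/{id}", "rdf:type", ":Order"),
            (":order/{id}", ":placedBy", ":customer/{customer}"),
            (":order/{id}", ":creationTime", "{created_at}"),
        ],
    },
]


def materialize(conn, mappings):
    """Apply every mapping assertion and return the resulting set of facts."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    facts = set()
    for mapping in mappings:
        for row in conn.execute(mapping["source"]):
            values = dict(row)
            for subj, pred, obj in mapping["target"]:
                facts.add((subj.format(**values), pred, obj.format(**values)))
    return facts


facts = materialize(conn, MAPPINGS)
for fact in sorted(facts):
    print(fact)
```

Each row returned by the SQL query yields one group of facts, so a query phrased over the vocabulary of T (for example, all instances of :Order) never needs to mention the orders table.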
The analyst who is responsible for the data extraction can thus directly focus on T, using the concepts and relations contained therein so as to concretely formulate which perspective has to be taken towards process mining. More specifically, this amounts to enriching T with annotations L, each creating an implicit link between T and the core portion of the XES event log schema E. Each annotation expresses one of the following aspects: (i) definition of a case, indicating which class provides the basis to identify case objects, and which conditions have to be satisfied by instances of the selected class so as to classify them as case objects; (ii) definition of an event, indicating which class provides the basis to identify occurrences of such an event; (iii) definition of an event attribute, indicating which navigational route has to be followed within the diagram so as to fetch the value for such an attribute given an instance of the corresponding event.

When all inputs are provided, event logs are extracted automatically by using the facilities provided by the OBDA paradigm, and in particular the query and mapping transformation techniques described in [7,5].

The onprom tool-chain supports the various phases of the methodology. It is open source and can be downloaded from http://onprom.inf.unibz.it. It is available as stand-alone software, or as a set of plug-ins running inside the ProM process mining framework¹ using built-in data exchange mechanisms. It consists of the following components: (i) a UML Editor to model the conceptual data schema; (ii) an Annotation Editor to enrich the conceptual data schema with event-data annotations; (iii) a Log Extractor that extracts a XES event log from a given relational information system, exploiting the event-data annotations.
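The three kinds of annotations introduced earlier in this section (case, event, and event attribute) can be pictured as plain records over a class diagram. The sketch below is only an illustration under stated assumptions: the toy schema, the record types, and the resolve_path helper are hypothetical and do not reflect the JSON format used by the onprom editors. The key time:timestamp is the standard XES key for event timestamps.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Toy conceptual schema T (hypothetical):
# class -> {property name: target class or datatype}.
SCHEMA: Dict[str, Dict[str, str]] = {
    "Order": {"placedBy": "Customer", "creationTime": "xsd:dateTime"},
    "Customer": {"name": "xsd:string"},
}


@dataclass
class CaseAnnotation:
    case_class: str                  # class providing the case objects
    condition: Optional[str] = None  # optional filter on its instances


@dataclass
class AttributeAnnotation:
    key: str         # name of the event attribute
    path: List[str]  # navigational route starting at the event class


@dataclass
class EventAnnotation:
    event_class: str  # class providing the event occurrences
    attributes: List[AttributeAnnotation] = field(default_factory=list)


def resolve_path(schema, start, path):
    """Follow a navigational route and return the class or datatype reached."""
    current = start
    for step in path:
        if step not in schema.get(current, {}):
            raise KeyError(f"{current} has no property {step!r}")
        current = schema[current][step]
    return current


case = CaseAnnotation(case_class="Order")
event = EventAnnotation(
    event_class="Order",
    attributes=[
        AttributeAnnotation("time:timestamp", ["creationTime"]),
        AttributeAnnotation("customer", ["placedBy", "name"]),
    ],
)

for attr in event.attributes:
    print(attr.key, "->", resolve_path(SCHEMA, event.event_class, attr.path))
```

Resolving each route against the schema is a simple way to check that an attribute annotation ends at a data value and not at a class.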
Currently, the definition of the mapping specification is not covered within the onprom tool-chain, and we assume that it is realized manually or using third-party tools.

UML Editor. The UML Editor provides two main functionalities: modeling of a UML class diagram, and import/export from/to OWL 2 QL, leveraging the correspondence described in [4,5]. Currently, the editor makes some simplifying assumptions: (i) no support for completeness of UML generalization hierarchies (in line with the OBDA approach [4]); (ii) support for binary associations only, in line with Semantic Web languages; (iii) multiplicities in associations (resp., features) are restricted to be either 0 or 1, which allows one to express functionality and mandatory participation; (iv) no support for IS-A between associations.

The UML class diagrams produced with the UML Editor can be saved in a proprietary JSON format for further processing and as input for the Annotation Editor. They can also be exported as a standard OWL 2 QL ontology. The graphical layout information, which is not part of the OWL 2 QL language, is maintained in the form of OWL 2 annotations, thus resulting in an ontology fully compliant with the W3C standard. The editor also supports loading an existing conceptual data schema, if it is available as an OWL 2 QL ontology file. A UML Editor loaded with a conceptual data schema example is shown in the right part of the onprom toolkit screenshot in Figure 2.

¹ http://www.promtools.org

Fig. 2. The onprom toolkit screenshot showing five loaded objects and two open editors

Annotation Editor. The Annotation Editor supports analysts in the specification of event-data annotations on top of a UML class diagram. An annotated conceptual data schema loaded in the Annotation Editor is shown in the left part of Figure 2. To simplify the annotation task, the editor supports the following advanced operations.
First, it allows one to choose properties and paths using navigational selections over the diagram via mouse-click operations. Second, it takes multiplicities on associations and attributes into account; when the user is selecting properties of the case and of events (in particular the timestamp), the editor enables only navigation paths along functional associations, thus guaranteeing that a unique data value is reached. The annotations are automatically translated into corresponding SPARQL queries by the editor. The annotated conceptual data schema can be exported using a proprietary JSON format, which can then be imported by the Log Extractor.

Log Extractor. The last component of the tool-chain implements the mapping synthesis technique as well as the XES log extraction functionalities, by leveraging the state-of-the-art OBDA system ontop² [3] and the OpenXES APIs³. To this end, the Log Extractor takes as input: (i) a conceptual data schema (generated via the UML Editor or represented as an OWL 2 QL file), (ii) an OBDA mapping specification, linking the conceptual data schema to the underlying relational database, and (iii) event-data annotations (created using the Annotation Editor). Given these inputs, the extractor works in two steps: (i) it automatically creates a new mapping specification, which establishes a direct correspondence between the database and the XES event log schema, and (ii) it uses this mapping specification to extract a XES event log from the database. The extracted XES event log can be used with all tools that support the XES standard.

Experiments. We have evaluated the tool on a machine with an Intel Core i5 2.4 GHz processor, 8 GB of RAM, and Postgres as the DBMS. One of the experiments shows that it takes ~11 minutes to produce a XES event log with 298 224 traces, 2 117 466 events, and 6 352 398 event attributes (~450 MB for the whole XES log).
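To illustrate the shape of the output produced by the extraction step, the following minimal sketch groups (case, activity, timestamp) rows into traces and serializes them as XES-style XML. It uses only the Python standard library and is not the OpenXES-based implementation of the Log Extractor; the sample rows and the build_xes helper are hypothetical. The keys concept:name and time:timestamp are the standard XES keys for names and timestamps; a complete XES log would additionally declare the corresponding extensions, which the sketch omits.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical query answers: (case id, activity name, ISO 8601 timestamp).
ROWS = [
    ("order-1", "create order", "2017-03-01T10:00:00"),
    ("order-2", "create order", "2017-03-02T09:30:00"),
    ("order-1", "ship order", "2017-03-03T15:45:00"),
]


def build_xes(rows):
    """Group events by case and build a minimal XES-style element tree."""
    traces = defaultdict(list)
    for case_id, activity, timestamp in rows:
        traces[case_id].append((timestamp, activity))

    log = ET.Element("log", {"xes.version": "1.0"})
    for case_id in sorted(traces):
        trace = ET.SubElement(log, "trace")
        ET.SubElement(trace, "string", {"key": "concept:name", "value": case_id})
        # Timestamps share one ISO 8601 layout, so string order is time order.
        for timestamp, activity in sorted(traces[case_id]):
            event = ET.SubElement(trace, "event")
            ET.SubElement(event, "string", {"key": "concept:name", "value": activity})
            ET.SubElement(event, "date", {"key": "time:timestamp", "value": timestamp})
    return log


log = build_xes(ROWS)
print(ET.tostring(log, encoding="unicode"))
```

Each trace carries its case identifier as an attribute, and each event carries exactly the attributes selected for it, which mirrors the case, event, and event-attribute structure that the annotations describe.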
² http://ontop.inf.unibz.it
³ http://www.xes-standard.org

3 The Demo

In our demo, we illustrate how to successfully employ the onprom tool-chain to tackle the challenging problem of data preparation for process mining on top of legacy databases. Specifically, (i) we start from a legacy relational database, (ii) use the UML Editor to conceptually model the data of interest, (iii) exploit the ontop mapping editor for Protégé⁴ to link the resulting conceptual schema to legacy data via declarative mappings, and (iv) use the Annotation Editor to equip the schema with declarative annotations that indicate where cases, events, and their attributes are “located”. In this way, we provide the necessary input to the Log Extractor.

The screencasts of our demo are available at http://onprom.inf.unibz.it/index.php/screencasts. As mentioned, the onprom toolchain is open source and can be downloaded from http://onprom.inf.unibz.it.

⁴ http://protege.stanford.edu

Acknowledgements. This research has been partially supported by the Euregio IPN12 KAOS (Knowledge-Aware Operational Support) project, which is funded by the “European Region Tyrol-South Tyrol-Trentino” (EGTC) under the first call for basic research projects, and by the UNIBZ internal project OnProm (ONtology-driven PROcess Mining).

References

1. van der Aalst, W., et al.: Process mining manifesto. In: Proc. of the Business Process Management Int. Workshops. LNBIP, vol. 99, pp. 169–194. Springer (2012)
2. van der Aalst, W.M.P.: Process Mining - Data Science in Action. Springer, 2nd edn. (2016)
3. Calvanese, D., Cogrel, B., Komla-Ebri, S., Kontchakov, R., Lanti, D., Rezk, M., Rodriguez-Muro, M., Xiao, G.: Ontop: Answering SPARQL queries over relational databases. Semantic Web J. 8(3), 471–487 (2017)
4. Calvanese, D., De Giacomo, G., Lembo, D., Lenzerini, M., Poggi, A., Rodriguez-Muro, M., Rosati, R.: Ontologies and databases: The DL-Lite approach. In: RW 2009 Tutorial Lectures. LNCS, vol. 5689, pp. 255–356. Springer (2009)
5. Calvanese, D., Kalayci, T.E., Montali, M., Santoso, A.: OBDA for log extraction in process mining. In: RW 2017 Tutorial Lectures. LNCS, vol. 10370, pp. 292–345. Springer (2017)
6. Calvanese, D., Kalayci, T.E., Montali, M., Tinella, S.: Ontology-based data access for extracting event logs from legacy data: The onprom tool and methodology. In: 20th Int. Conf. on Business Information Systems. LNBIP, vol. 288, pp. 220–236. Springer (2017)
7. Calvanese, D., Montali, M., Syamsiyah, A., van der Aalst, W.M.P.: Ontology-driven extraction of event logs from relational databases. In: Proc. of the 11th Int. Workshop on Business Process Intelligence. LNBIP, vol. 256, pp. 140–153. Springer (2016)
8. IEEE Computational Intelligence Society: IEEE Standard for eXtensible Event Stream (XES) for achieving interoperability in event logs and event streams. IEEE Std 1849-2016 (2016)
9. Motik, B., Cuenca Grau, B., Horrocks, I., Wu, Z., Fokoue, A., Lutz, C.: OWL 2 Web Ontology Language profiles (second edition). W3C Recommendation, W3C (Dec 2012), available at http://www.w3.org/TR/owl2-profiles/
10. Poggi, A., Lembo, D., Calvanese, D., De Giacomo, G., Lenzerini, M., Rosati, R.: Linking data to ontologies. J. on Data Semantics X, 133–173 (2008)