A Scalable Database for the Storage of Object-Centric Event Logs (Extended Abstract)

Alessandro Berti

a.berti@pads.rwth-aachen.de

Anahita Farhang Ghahfarokhi

Gyunam Park

Wil M.P. van der Aalst

wvdaalst@pads.rwth-aachen.de

Process and Data Science Department, RWTH Aachen University, Lehrstuhl für Informatik 9, 52074 Aachen, Germany

2021

Abstract: Object-centric process mining provides a set of techniques for the analysis of event data in which events are associated with several objects. To store Object-Centric Event Logs (OCELs), the JSON-OCEL and XML-OCEL formats have recently been proposed. However, the proposed implementations of OCEL are file-based. This means that the entire file needs to be parsed in order to apply process mining techniques, such as the discovery of object-centric process models. In this paper, we propose a database storage for the OCEL format using the MongoDB document database. Since documents in MongoDB are equivalent to JSON objects, the current JSON implementation of the standard can be translated straightforwardly into a set of MongoDB collections.

Index Terms: Object-Centric Process Mining; Object-Centric Event Log; Database Support; MongoDB

I. SIGNIFICANCE OF THE TOOL

OCEL (http://www.ocel-standard.org/1.0/specification.pdf) has been proposed to model the structure of object-centric event logs [1]. Implementations of the format have been made available for the JSON and XML file formats, and tool support is available for the Python and Java languages. In all of these, the event log is stored in a JSON/XML file that the tools/libraries load into memory. The need to load the whole log into memory makes it difficult to manage large amounts of object-centric event data, since memory is a limited resource.

With this paper, a novel implementation of the format is proposed, based on the MongoDB document database. Documents can be imported into MongoDB starting from JSON objects. Hence, the JSON-OCEL implementation can be translated easily to MongoDB. Moreover, MongoDB can mix in-memory and on-disk computations to provide efficient data science pipelines. Other advantages of MongoDB that we exploit are: the fine-grained support for indexes (e.g., multikey), which makes ad-hoc querying faster; the fine-grained support for aggregations (e.g., grouping), which makes it possible to move some of the computations to the database level; and the support for replication, which provides redundancy and increases data availability (https://docs.mongodb.com/manual/replication/). Graph databases have been assessed previously for the storage of object-centric event data [2], [3], [4], but the direct translation of the OCEL specification to a graph database is more challenging (even if object-centric event logs can, in general, be uploaded to graph databases, as shown in https://doi.org/10.5281/zenodo.3865221). Columnar storages have also been used [5], [6], with the limitation that they work for basic column types but do not provide comprehensive support for JSON and advanced data types.

II. MAIN FEATURES OF THE TOOL

The implementation of the schema to host the elements of the OCEL standard follows from the implementation of JSON-OCEL (http://www.ocel-standard.org/1.0/specification.pdf). Fig. 1 shows how the translation of the different entities is possible.

[Fig. 1: Translation of the JSON-OCEL entities into MongoDB collections.

JSON-OCEL:
    "ocel:events": {... "ev1": {"ocel:activity": "A", "ocel:timestamp": "2020-01-01T00:00:00", "ocel:omap": [...], "ocel:vmap": {...}} ...}
    "ocel:objects": {... "obj1": {"ocel:type": "Order", "ocel:ovmap": {...}} ...}
    "ocel:global-event": {...}
    "ocel:global-object": {...}
    "ocel:global-log": {...}

MongoDB:
    ocel:events collection: ... {"ocel:id": "ev1", "ocel:activity": "A", "ocel:timestamp": Date(1970,01,01,00,00,00), "ocel:omap": [...], "ocel:vmap": {...}} ...
    ocel:objects collection: ... {"ocel:id": "obj1", "ocel:type": "Order", "ocel:vmap": {...}} ...
    ocel:others collection]

In Fig. 1, some fields are colored in red, meaning that an index has been applied to them to speed up some queries. In particular, the identifier, the activity, and the object map (multikey index) are indexed for the events, while the identifier and the type are indexed for the objects. The tool can ingest logs in the JSON/XML-OCEL formats and export the contents of the MongoDB implementation to JSON/XML-OCEL. Moreover, some essential object-centric process mining operations have been implemented at the MongoDB level (retrieving the lifecycle of the objects, providing statistics on the number of events and of unique and total objects, counting the events per activity and the objects per type, ...) to reduce the data exchange with the database and to use the aggregation features of MongoDB. These are illustrated later in this extended abstract.
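The translation shown in Fig. 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the tool: the function name `jsonocel_to_documents` is hypothetical, the layout of the documents in the ocel:others collection (one document per global section, with `key` and `value` fields) is an assumption, and timestamps are assumed to be plain ISO-8601 strings without a time zone suffix.

```python
from datetime import datetime


def jsonocel_to_documents(log):
    """Split a JSON-OCEL dictionary into lists of documents, one list per collection.

    Hypothetical helper: the name and the layout of "ocel:others" are assumptions.
    """
    events = []
    for event_id, event in log.get("ocel:events", {}).items():
        doc = dict(event)                 # keeps activity, omap and vmap as they are
        doc["ocel:id"] = event_id         # the dictionary key becomes a regular field
        # ISO-8601 string -> datetime, which MongoDB drivers store as a date value
        doc["ocel:timestamp"] = datetime.fromisoformat(event["ocel:timestamp"])
        events.append(doc)

    objects = []
    for object_id, obj in log.get("ocel:objects", {}).items():
        doc = dict(obj)
        doc["ocel:id"] = object_id
        objects.append(doc)

    # Assumption: every remaining top-level section becomes one document.
    others = [
        {"key": key, "value": value}
        for key, value in log.items()
        if key not in ("ocel:events", "ocel:objects")
    ]
    return {"ocel:events": events, "ocel:objects": objects, "ocel:others": others}


# With pymongo and a running server (not executed here), the documents could be
# stored and indexed roughly as follows:
#   db["ocel:events"].insert_many(docs["ocel:events"])
#   db["ocel:events"].create_index("ocel:omap")   # multikey, since omap is an array
```

Keeping the conversion separate from the database calls makes it possible to check the mapping without a MongoDB instance.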

III. USAGE OF THE TOOL

The provided tool is based on the Python language and supports all existing OCEL implementations (JSON, XML, and MongoDB). The tool is available at https://github.com/OCEL-standard/ocel-support. In particular, example scripts for the usage of the MongoDB interface are available in the folder examples/mongodb. First, the connection string and the database name can be set in the script commons.py. The script exporting.py loads an existing JSON/XML-OCEL file into the MongoDB database, while the script importing.py saves the object-centric event log to a JSON/XML-OCEL file. Other scripts perform computations on object-centric event logs: obj_centr_dfg.py provides a routine for computing the directly-follows graph for each object type of the log; activities_stats.py and ot_stats.py provide some basic statistics for the activities (number of events and objects) and the object types (number of objects per type) of the event log; times_between_activities.py provides some statistics on the time elapsed between a pair of activities of the log (regardless of the object type).
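To give an idea of what the statistics scripts compute, the sketch below states the two counts (events per activity, objects per type) as MongoDB aggregation pipelines and as an equivalent client-side function. It is an assumption-laden illustration rather than the content of activities_stats.py or ot_stats.py: the constant names and the helper `count_by` are hypothetical, and the collection names follow Fig. 1.

```python
from collections import Counter

# Pipelines built from the standard $group and $sum operators (names are hypothetical).
EVENTS_PER_ACTIVITY = [{"$group": {"_id": "$ocel:activity", "count": {"$sum": 1}}}]
OBJECTS_PER_TYPE = [{"$group": {"_id": "$ocel:type", "count": {"$sum": 1}}}]


def count_by(documents, field):
    """Client-side equivalent of the pipelines above: count documents per field value."""
    return dict(Counter(doc[field] for doc in documents))


# With pymongo and a running server (not executed here):
#   list(db["ocel:events"].aggregate(EVENTS_PER_ACTIVITY))
#   list(db["ocel:objects"].aggregate(OBJECTS_PER_TYPE))
```

Running the pipeline in the database returns only one small document per activity or type, whereas `count_by` needs all documents on the client; the latter is mainly useful for cross-checking on small logs.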

MongoDB offers a powerful aggregation framework that makes it possible to perform significant object-centric process mining operations directly at the database level. As an example of a crucial object-centric process mining operation, we show an aggregation that is useful for the computation of the multi-directly-follows graph (mDFG), namely finding the events that belong to the lifecycle of an object. First, the ocel:omap attribute (list of related objects) is unwound, so the same event is replicated for each of its related objects. Then, a grouping operation on the unwound ocel:omap attribute collects the activities of the events related to the same object.

    events_collection.aggregate([
        {"$unwind": "$ocel:omap"},
        {"$group": {"_id": "$ocel:omap", "lifecycle": {"$push": "$ocel:activity"}}}
    ], allowDiskUse=True)

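The output of the aggregation can be used to calculate the directly-follows graph for the objects of a specific type. Each output document contains the identifier of an object (_id) and the list of the activities of the events related to that object (lifecycle).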
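As a hedged sketch of this last step, the function below counts directly-follows pairs from documents with an `_id` field (object identifier) and a `lifecycle` field (list of activities), as produced by the grouping stage. The function name, the `object_types` mapping (object identifier to type), and the sample data in the usage note are hypothetical. The sketch also assumes that each lifecycle is already in timestamp order; one way to obtain that would be a `$sort` stage on ocel:timestamp placed before the grouping, which is an addition of this sketch and not part of the aggregation shown in the text.

```python
from collections import Counter


def directly_follows(lifecycles, object_types=None, object_type=None):
    """Count (source, target) activity pairs over the given object lifecycles.

    lifecycles   : iterable of {"_id": object id, "lifecycle": [activities...]}
    object_types : optional mapping object id -> object type (hypothetical input)
    object_type  : if given, only objects of this type are considered
    """
    object_types = object_types or {}
    dfg = Counter()
    for doc in lifecycles:
        if object_type is not None and object_types.get(doc["_id"]) != object_type:
            continue  # skip objects of other (or unknown) types
        activities = doc["lifecycle"]
        # Pair each activity with its immediate successor in the lifecycle
        for source, target in zip(activities, activities[1:]):
            dfg[(source, target)] += 1
    return dfg
```

For instance, calling `directly_follows(result, types, "Order")` on the aggregation result would yield the edge frequencies of the directly-follows graph restricted to objects of type Order, given a `types` mapping read from the objects collection.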

IV. MATURITY OF THE TOOL

The prototype tool, available at https://github.com/OCEL-standard/ocel-support, has not yet been used in any real-life case study. We analyzed the scalability of the MongoDB implementation. All the experiments have been conducted on a notebook with an I7-7500U CPU, 16 GB of RAM, and an SSD drive. Table I reports the results obtained on logs of different sizes. The binary compression that MongoDB uses to store the documents saves a significant amount of disk space in the storage of the log. We can also see that the index, which is necessary to speed up the computations, occupies a significant amount of space compared to the size of the collection. In the computation of mDFGs, we can see that MongoDB mixes in-memory calculations with on-disk ones, especially when the amount of memory needed is higher than the amount of memory available. Compared to an in-memory approach, where the entire JSON object is loaded into memory, the computation of the object-centric directly-follows graph takes significantly more time. However, the amount of memory required to hold the JSON is also considerably higher than the memory requirements of MongoDB. Our workstation ran out of memory trying to ingest an event log with 6.8 M events, while MongoDB can manage bigger logs, as our experiments show. A video displaying the ingestion of an object-centric event log in MongoDB, and the execution of some computations, is available at https://www.youtube.com/watch?v=vDd5CASy1Y0.

V. ACKNOWLEDGMENTS

We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy, EXC-2023 Internet of Production, 390621612.

[1] A. F. Ghahfarokhi, G. Park, A. Berti, and W. M. van der Aalst, "OCEL: A standard for object-centric event logs," in European Conference on Advances in Databases and Information Systems. Springer, 2021, pp. 169-175.

[2] S. Esser and D. Fahland, "Multi-dimensional event data in graph databases," Journal on Data Semantics, pp. 1-33, 2021.

[3] A. Jalali, "Graph-based process mining," arXiv preprint arXiv:2007.09352, 2020.

[4] S. Esser and D. Fahland, "Storing and querying multi-dimensional process event logs using graph databases," in International Conference on Business Process Management. Springer, 2019, pp. 632-644.

[5] Wang and Kogan, "Cloud-based in-memory columnar database architecture for continuous audit analytics," Journal of Information Systems, vol. 34, no. 2, pp. 87-107, 2020.

[6] A. Berti and W. M. van der Aalst, "Extracting multiple viewpoint models from relational databases," in 8th International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA). Springer International Publishing, 2018, pp. 24-51.