<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Scalable Database for the Storage of Object-Centric Event Logs (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Berti</string-name>
          <email>a.berti@pads.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anahita Farhang Ghahfarokhi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gyunam Park</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wil M.P. van der Aalst</string-name>
          <email>wvdaalst@pads.rwth-aachen.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Process and Data Science Department, RWTH Aachen University</institution>
          ,
          <addr-line>Lehrstuhl für Informatik 9, 52074 Aachen</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>Object-centric process mining provides a set of techniques for the analysis of event data where events are associated with several objects. To store Object-Centric Event Logs (OCELs), the JSON-OCEL and XML-OCEL formats have recently been proposed. However, the proposed implementations of OCEL are file-based. This means that the entire file needs to be parsed in order to apply process mining techniques, such as the discovery of object-centric process models. In this paper, we propose a database storage for the OCEL format using the MongoDB document database. Since documents in MongoDB are equivalent to JSON objects, the current JSON implementation of the standard can be translated straightforwardly into a series of MongoDB collections.</p>
        <p>Index Terms: Object-Centric Process Mining; Object-Centric Event Log; Database Support; MongoDB</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. SIGNIFICANCE OF THE TOOL</title>
      <p>
        The OCEL standard (http://www.ocel-standard.org/1.0/specification.pdf)
has been proposed to model the structure of object-centric
event logs [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Implementations of the format have been made
available for the JSON and XML file formats, and tool support is
provided for the Python and Java languages. In all of these, the
event log is stored in a JSON/XML file that the tools/libraries
load entirely into memory. This requirement makes it difficult to
manage large amounts of object-centric event data, since memory
is a limited resource. In this paper, we propose a novel
implementation of the format based on the MongoDB document
database. Documents can be imported into MongoDB starting from
JSON objects. Hence, the JSON-OCEL implementation can be
translated easily to MongoDB. Moreover, MongoDB can mix in-memory
and on-disk computations to provide efficient data science
pipelines. Other advantages of MongoDB that we exploit
are: the fine-grained support for indexes (e.g., multikey indexes),
which makes ad-hoc querying faster; the fine-grained support
for aggregations (e.g., grouping), which allows moving some
of the computations to the database level; and the support for
replication, which provides redundancy and increases data
availability (https://docs.mongodb.com/manual/replication/).
Graph databases have been assessed previously for the
storage of object-centric event data [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], but the direct
translation of the OCEL specification into a graph database
is more challenging, even though object-centric event logs can, in
general, be uploaded to graph databases, as shown in
https://doi.org/10.5281/zenodo.3865221. Columnar storages have also been used
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], with the limitation that they work for basic column
types but do not provide comprehensive support for JSON and
advanced data types.
      </p>
    </sec>
    <sec id="sec-2">
      <title>II. MAIN FEATURES OF THE TOOL</title>
      <p>The implementation of the schema to host the elements of
the OCEL standard follows from the JSON-OCEL implementation
(http://www.ocel-standard.org/1.0/specification.pdf). Fig. 1
shows how the different entities are translated.</p>
      <p>[Fig. 1: Translation of JSON-OCEL into MongoDB collections. Each entry of
“ocel:events” (e.g., “ev1”, with “ocel:activity”, “ocel:timestamp”, “ocel:omap”, and
“ocel:vmap”) becomes a document of the ocel:events collection, with the key stored
as “ocel:id” and the timestamp stored as a date. Each entry of “ocel:objects”
(e.g., “obj1”, with “ocel:type” and its attribute map) becomes a document of the
ocel:objects collection, with the key stored as “ocel:id”. The “ocel:global-event”,
“ocel:global-object”, and “ocel:global-log” entries are stored in the ocel:others
collection.]</p>
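      <p>The translation in Fig. 1 can be sketched in a few lines of Python. The
following is a minimal illustration, not the code of the tool: the function name
ocel_to_collections, the layout of the documents placed in ocel:others, and the
choice to keep the JSON-OCEL field name ocel:ovmap for the object attributes are
assumptions made for this example.</p>
      <preformat>
```python
from datetime import datetime


def ocel_to_collections(log):
    """Split a JSON-OCEL dictionary into one document list per collection.

    Hypothetical helper: names and the "ocel:others" layout are assumptions.
    """
    events = []
    for event_id, event in log.get("ocel:events", {}).items():
        events.append({
            "ocel:id": event_id,  # the dictionary key becomes a field
            "ocel:activity": event["ocel:activity"],
            # ISO 8601 string to datetime, which drivers store as a date
            "ocel:timestamp": datetime.fromisoformat(event["ocel:timestamp"]),
            "ocel:omap": list(event.get("ocel:omap", [])),
            "ocel:vmap": dict(event.get("ocel:vmap", {})),
        })

    objects = []
    for object_id, obj in log.get("ocel:objects", {}).items():
        objects.append({
            "ocel:id": object_id,
            "ocel:type": obj["ocel:type"],
            "ocel:ovmap": dict(obj.get("ocel:ovmap", {})),
        })

    # Every remaining top-level entry (the global sections) goes to "others".
    others = [
        {"ocel:id": key, "value": value}
        for key, value in log.items()
        if key not in ("ocel:events", "ocel:objects")
    ]
    return {
        "ocel:events": events,
        "ocel:objects": objects,
        "ocel:others": others,
    }
```
      </preformat>
      <p>Each returned list could then be handed to a bulk insert of the
corresponding collection; that step is left out here because it depends on the
database driver in use.</p>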
      <p>Some fields are colored in red, meaning that an index
has been applied to them to optimize the execution of
some queries. In particular, the identifier, the activity, and the
object map (multikey index) are indexed for the
events, whereas the identifier and the type are indexed
for the objects. The tool permits ingesting logs in
the JSON/XML-OCEL formats and exporting the contents of the
MongoDB implementation to JSON/XML-OCEL. Moreover,
some essential object-centric process mining operations have
been implemented at the MongoDB level (retrieving the
lifecycle of the objects; providing statistics on the number of events
and of unique and total objects; counting the events per activity and
the objects per type; ...) to reduce the data exchange with the
database and to exploit the aggregation features of MongoDB. These
are illustrated later in this extended abstract.</p>
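      <p>As a hedged sketch of how such indexes could be declared from Python, the
function below calls create_index on a pymongo-style database handle. The function
name create_ocel_indexes and the collection names ocel:events and ocel:objects
(taken from Fig. 1) are assumptions; the tool may set up its indexes differently.</p>
      <preformat>
```python
def create_ocel_indexes(db):
    """Declare the indexes described in the text on a database handle.

    `db` is assumed to behave like a pymongo Database: indexing it by name
    returns a collection that offers create_index(keys).
    """
    events = db["ocel:events"]
    events.create_index("ocel:id")
    events.create_index("ocel:activity")
    # ocel:omap holds an array, so MongoDB treats this index as multikey.
    events.create_index("ocel:omap")

    objects = db["ocel:objects"]
    objects.create_index("ocel:id")
    objects.create_index("ocel:type")
```
      </preformat>
      <p>With pymongo, a suitable handle could be obtained as
MongoClient(connection_string)[database_name], where both arguments are
placeholders for the settings of the local installation.</p>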
    </sec>
    <sec id="sec-3">
      <title>III. USAGE OF THE TOOL</title>
      <p>The provided tool is based on the Python language and
supports all existing OCEL implementations (JSON, XML,
and MongoDB). The tool is available at
https://github.com/OCEL-standard/ocel-support. In particular,
example scripts for the usage of the MongoDB interface are
available in the folder examples/mongodb. First, the
connection string and the database name can be set in the
script commons.py. The script exporting.py loads
an existing JSON/XML-OCEL file into the MongoDB database,
while the script importing.py saves the object-centric
event log to a JSON/XML-OCEL file. Other scripts perform
computations on object-centric event logs:
obj_centr_dfg.py provides a routine for the computation of
the directly-follows graph for each object type of the log;
activities_stats.py and ot_stats.py provide some basic
statistics for the activities (number of events and objects)
and the object types (number of objects per type) of the
event log; and
times_between_activities.py provides some statistics on
the time elapsed between pairs of activities of the
log (regardless of the object type).</p>
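      <p>To make the expected results of the statistics scripts concrete, the
following in-memory sketch computes the same kind of figures over lists of event
and object documents. It is an illustration only: the function names and the
layout of the returned dictionaries are assumptions, and the tool is described as
performing such operations at the MongoDB level rather than in Python.</p>
      <preformat>
```python
from collections import Counter


def activity_stats(events):
    """Per activity: number of events, unique and total related objects."""
    stats = {}
    for event in events:
        related = event.get("ocel:omap", [])
        entry = stats.setdefault(
            event["ocel:activity"],
            {"events": 0, "total_objects": 0, "unique": set()},
        )
        entry["events"] += 1
        entry["total_objects"] += len(related)
        entry["unique"].update(related)
    return {
        activity: {
            "events": entry["events"],
            "unique_objects": len(entry["unique"]),
            "total_objects": entry["total_objects"],
        }
        for activity, entry in stats.items()
    }


def object_type_stats(objects):
    """Per object type: number of objects of that type."""
    return dict(Counter(obj["ocel:type"] for obj in objects))
```
      </preformat>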
      <p>MongoDB offers a powerful aggregation framework that
permits performing significant object-centric process mining
operations directly at the database level. As an example of a
crucial object-centric process mining operation, we show an
aggregation that is useful for the computation of the
multi-directly-follows graph (mDFG): finding the events that belong to the
lifecycle of an object. First, the ocel:omap attribute (list of
related objects) is unrolled ($unwind), so that the same event is replicated for
each of its related objects. Then, a grouping operation ($group) on
the unrolled ocel:omap attribute collects the
activities of the events related to the same object.
<preformat>events_collection.aggregate(
    [{"$unwind": "$ocel:omap"},
     {"$group": {"_id": "$ocel:omap",
                 "lifecycle": {"$push": "$ocel:activity"}}}],
    allowDiskUse=True)</preformat></p>
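      <p>For readers unfamiliar with the two stages, the following pure-Python
sketch mimics what the aggregation computes on a list of event documents. The
function name lifecycles is an assumption, and the code only imitates the
semantics of $unwind and $group; it does not contact a database.</p>
      <preformat>
```python
def lifecycles(events):
    """Imitate $unwind on ocel:omap followed by $group with $push."""
    grouped = {}
    for event in events:
        # $unwind: the event is considered once per related object;
        # events with a missing or empty ocel:omap contribute nothing.
        for object_id in event.get("ocel:omap", []):
            # $group on the object id, $push of the activity name.
            grouped.setdefault(object_id, []).append(event["ocel:activity"])
    return [
        {"_id": object_id, "lifecycle": activities}
        for object_id, activities in grouped.items()
    ]
```
      </preformat>
      <p>Activities are collected in the order in which the events are visited,
so the events would have to be ordered by ocel:timestamp beforehand if each
lifecycle is meant to be in temporal order.</p>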
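      <p>The output of the aggregation contains one document per object,
pairing the object identifier (_id) with the list of activities of its
related events (lifecycle). It can be used to calculate the
directly-follows graph for the objects of a specific type.</p>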
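      <p>As a sketch of that last step, the function below counts
directly-follows pairs over documents that carry an _id and a lifecycle field,
as produced by the aggregation. The function name, the object_types lookup
(object identifier to type), and the assumption that each lifecycle is already
in temporal order are choices made for this example, not details taken from the
tool.</p>
      <preformat>
```python
from collections import Counter


def directly_follows_graph(lifecycle_docs, object_types, wanted_type):
    """Count directly-follows pairs over the lifecycles of one object type."""
    dfg = Counter()
    for doc in lifecycle_docs:
        # Keep only the objects of the requested type.
        if object_types.get(doc["_id"]) != wanted_type:
            continue
        activities = doc["lifecycle"]
        # Two consecutive activities of the same lifecycle form one edge.
        for source, target in zip(activities, activities[1:]):
            dfg[(source, target)] += 1
    return dict(dfg)
```
      </preformat>
      <p>Repeating the call once per object type gives one directly-follows
graph per type.</p>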
    </sec>
    <sec id="sec-4">
      <title>IV. MATURITY OF THE TOOL</title>
      <p>The prototypal tool, available at
https://github.com/OCEL-standard/ocel-support, has not been used in any
real-life case study. We analyzed the scalability of the
MongoDB implementation. All the experiments have been
conducted on a notebook with an i7-7500U CPU, 16 GB of
RAM, and an SSD. Table I reports the results
obtained on logs of different sizes. The binary compression
used by MongoDB to store the documents saves
a significant amount of disk space in the storage of the
log. We can also see that the index, which is necessary to
increase the speed of the computations, occupies a significant
amount of space compared to the size of the collection.
In the computation of mDFGs, MongoDB
mixes in-memory calculations with on-disk ones, especially
if the amount of memory needed is higher than the amount
of memory available. Compared to an in-memory approach,
where the entire JSON object is loaded into memory,
the computation of the object-centric directly-follows graph
with MongoDB takes significantly more time. However, the amount of memory
required to hold the JSON object is considerably higher than the
memory requirements of MongoDB: our workstation ran
out of memory trying to ingest an event log with 6.8
million events, while MongoDB can manage bigger logs, as our
experiments show. A video displaying the ingestion of an
object-centric event log into MongoDB, and the execution of
some computations, is available at
https://www.youtube.com/watch?v=vDd5CASy1Y0.</p>
    </sec>
    <sec id="sec-5">
      <title>V. ACKNOWLEDGMENTS</title>
      <p>We thank the Alexander von Humboldt (AvH) Stiftung for
supporting our research. Funded by the Deutsche
Forschungsgemeinschaft (DFG, German Research Foundation) under
Germany’s Excellence Strategy, EXC-2023 Internet of Production,
390621612.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A. F.</given-names>
            <surname>Ghahfarokhi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          , and
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          , “
          <article-title>OCEL: A standard for object-centric event logs</article-title>
          ,” in
          <source>European Conference on Advances in Databases and Information Systems</source>
          . Springer,
          <year>2021</year>
          , pp.
          <fpage>169</fpage>
          -
          <lpage>175</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Esser</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Fahland</surname>
          </string-name>
          , “
          <article-title>Multi-dimensional event data in graph databases</article-title>
          ,”
          <source>Journal on Data Semantics</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>33</lpage>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Jalali</surname>
          </string-name>
          , “
          <article-title>Graph-based process mining</article-title>
          ,” arXiv preprint arXiv:2007.09352,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Esser</surname>
          </string-name>
          and
          <string-name>
            <given-names>D.</given-names>
            <surname>Fahland</surname>
          </string-name>
          , “
          <article-title>Storing and querying multi-dimensional process event logs using graph databases</article-title>
          ,” in International Conference on Business Process Management. Springer,
          <year>2019</year>
          , pp.
          <fpage>632</fpage>
          -
          <lpage>644</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          and
          <string-name>
            <given-names>A.</given-names>
            <surname>Kogan</surname>
          </string-name>
          , “
          <article-title>Cloud-based in-memory columnar database architecture for continuous audit analytics</article-title>
          ,”
          <source>Journal of Information Systems</source>
          , vol.
          <volume>34</volume>
          , no.
          <issue>2</issue>
          , pp.
          <fpage>87</fpage>
          -
          <lpage>107</lpage>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          and
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          , “
          <article-title>Extracting multiple viewpoint models from relational databases</article-title>
          ,” in
          <source>8th International Symposium on Data-Driven Process Discovery and Analysis (SIMPDA)</source>
          . Springer International Publishing,
          <year>2018</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>51</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>