<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>A. Costa); selinyagmureroglu@gmail.com (S. Y. Eroglu); kerstin.andree@tum.de (K. Andree);
luise.pufahl@tum.de (L. Pufahl)</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>A Collection of Publicly Available Event Logs Enhanced by Metadata</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Ana Costa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Selin Y. Eroglu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kerstin Andree</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luise Pufahl</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Technical University of Munich, School of Computation, Information and Technology</institution>
          ,
          <addr-line>Heilbronn</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0001</lpage>
      <abstract>
        <p>With the rising importance of process mining and business process management research, access to suitable event logs is critical for developing and evaluating research artifacts. However, publicly available event data currently lacks metadata, which makes it difficult for researchers to identify event logs relevant to their research objectives. We address this gap by introducing a metadata structure for event logs and describing a collection of publicly available event data. We analyzed 98 event logs and categorized them based on 37 criteria relevant to process mining research. The resulting collection, containing these logs and their categorization, is provided and examined in two use cases.</p>
      </abstract>
      <kwd-group>
        <kwd>Event data</kwd>
        <kwd>Publicly Available Event logs</kwd>
        <kwd>Collection of Event Logs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Event logs are fundamental for discovering, monitoring, and improving business processes [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], and the
extraction of valuable information is achieved using process mining techniques [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. When developing
new process mining artifacts, publicly available, real-world event logs from various domains are crucial,
not only for identifying relevant requirements but also for evaluating the artifacts in realistic settings.
Over the years, the process mining community has compiled a rich collection of event logs. These
include datasets published through the Business Process Intelligence (BPI) Challenges (e.g., [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]), logs
shared as supplementary material to research papers, and logs extracted from public sources, such
as MIMIC-IV [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or the Ethereum blockchain [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, these logs are distributed across various
platforms and are often shared with limited metadata. As a result, effectively utilizing them requires
significant manual effort. Researchers must conduct time-consuming preliminary assessments to
determine whether a log fits their needs, due to the lack of structured descriptions and inconsistent
metadata annotations [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        This paper presents a curated collection of publicly available event logs that have been enriched
with an enhanced metadata structure. The structure was developed based on requirements formulated
in interviews with process mining researchers and subsequently validated and refined in a second
round of interviews and a focus group discussion. Based on this metadata schema, pre-selected event
logs were assessed and annotated in detail. For event log selection, we followed a systematic review
methodology adapted from the PRISMA guidelines [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. Datasets from 4TU, Kaggle, UC Irvine, and IEEE
were assessed against predefined inclusion and exclusion criteria. Inclusion criteria required public
accessibility, compatibility with common process mining formats (e.g., XES, CSV), and the presence
of mandatory event attributes. Datasets were excluded if they exceeded 600 MB, required payment or
preprocessing, or lacked English documentation. Researchers can now explore the collection using
search and filter capabilities and download logs that match their specific requirements.
      </p>
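      <p>The inclusion and exclusion criteria above can be read as a simple predicate over candidate datasets. The following Python sketch is illustrative only: the Candidate record, its field names, and the is_eligible helper are assumptions introduced here and are not part of the published resource. Only the criteria themselves (public access, XES or CSV format, mandatory event attributes, the 600 MB limit, no payment or preprocessing, English documentation) are taken from the text, and the three mandatory attributes are assumed to be the usual case identifier, activity, and timestamp.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass

MAX_SIZE_MB = 600                      # size limit stated in the text
ACCEPTED_FORMATS = {"XES", "CSV"}      # formats named in the text
# Case identifier, activity, and timestamp are the usual mandatory event
# attributes in process mining; these attribute names are hypothetical.
MANDATORY_ATTRIBUTES = {"case_id", "activity", "timestamp"}


@dataclass
class Candidate:
    """Hypothetical description of a dataset found on a repository."""
    name: str
    file_format: str
    size_mb: float
    publicly_accessible: bool
    requires_payment: bool
    requires_preprocessing: bool
    english_documentation: bool
    attributes: frozenset


def is_eligible(c: Candidate) -> bool:
    """Return True if the candidate meets all inclusion and no exclusion criteria."""
    included = (
        c.publicly_accessible
        and c.file_format.upper() in ACCEPTED_FORMATS
        and MANDATORY_ATTRIBUTES <= c.attributes
    )
    excluded = (
        c.size_mb > MAX_SIZE_MB
        or c.requires_payment
        or c.requires_preprocessing
        or not c.english_documentation
    )
    return included and not excluded


# Invented candidates used only to exercise the predicate.
ok = Candidate("example_log", "xes", 12.5, True, False, False, True,
               frozenset({"case_id", "activity", "timestamp", "resource"}))
too_large = Candidate("huge_log", "csv", 750.0, True, False, False, True,
                      frozenset({"case_id", "activity", "timestamp"}))
print(is_eligible(ok), is_eligible(too_large))
```
]]></preformat>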
      <p>The remainder of this paper is structured as follows: Section 2 introduces the event log resource and
metadata structure; Section 3 discusses its preliminary usage; and Section 4 outlines future applications
and directions for this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Description of the Resource</title>
      <p>
        The resource contains a set of 98 publicly available event logs, whose metadata was enriched based on
domains and data features relevant to process mining research. The metadata structure for event logs
includes 37 attributes with (1) basic process information, (2) domain context, (3) academic context, (4)
data characteristics, and (5) resource context. The pre-selected event logs contain the three mandatory
attributes of process mining [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and can be accessed via 4TU, Kaggle, UC Irvine, or IEEE. They are
provided in either eXtensible Event Stream (XES) or comma-separated values (CSV) format. The
collection excludes data files that are larger than 600 MB, have restricted or paid access, or are not in English.
Figure 1 shows the metadata structure with the attributes that give context to the event logs, along with a
short description of each.
      </p>
      <p>We developed the metadata structure for event logs through the following procedure: First, we
conducted five structured interviews with experts to identify relevant metadata attributes from a
research perspective. The experts shared their requirements and highlighted current challenges in
process mining research stemming from the limited availability and accessibility of event log data.
We synthesized common requirements from the interviews and organized them into attributes and
corresponding attribute dimensions. To validate our extraction, we followed a two-step approach: (1) we
confirmed with the same experts in a second interview round that all their requirements were addressed,
and (2) we conducted a focus group discussion with three additional process mining researchers to
reflect on the terminology and completeness of the structure. Based on the finalized set of metadata
attributes, we assessed each selected event log and compiled the information into the final metadata
table. The following subsections present a short description of the metadata attributes and logs included
in the collection.</p>
      <p>The metadata structure for event logs offers filtering capabilities based on fundamental information
about the event logs, such as their names, publication years, dataset size, format of the log, and summary
statistics, including the number of cases, variants, events, and activities. The final collection includes
event logs ranging from 2010 to 2026, with 28 logs published or updated in 2017. The dataset comprises
61 event logs in XES format and 37 in CSV format, with file sizes ranging from 1.1 KB to 551 MB. The
highest number of recorded cases is 251,734, and the number of variants ranges from 1 to 22,632.</p>
      <p>Case duration provides information on the time between the start and end of the event log traces,
given as mean and median. This makes it possible to identify event logs that describe short- or
long-running business processes. The case durations in our collection range from hours to months.
The additional available start and end timestamps indicate that events are typically recorded shortly
before the year of publication of the event log; however, there are some outliers in timestamps that
exhibit start dates such as 1948 or 1970. Other criteria, such as the additional attributes (e.g., costs,
status, descriptions) and resource availability, vary across format types. XES logs contain up to 16
non-mandatory attributes, and resources are available in half of the event logs.</p>
      <p>The domain context classifies the real-world domain or industry from which the event log originates
(e.g., healthcare, manufacturing, finance) and contextualizes the recorded process with information
such as whether it is a main process or a sub-process and whether the log is real-world or synthetic; it also offers a brief process
description. The most frequent domain applications for real-world event logs are public administration,
IT service management, and healthcare, while synthetic logs are provided mostly for process mining
applications and administrative processes. Furthermore, the academic context encompasses information
related to the research environment, including the DOI, publication time, and type.</p>
      <sec id="sec-2-1">
        <title>2.2. Data Characteristics and Resource Context.</title>
        <p>The data characteristics cover factors that describe the activity labels, information on the top
five variants (e.g., number of cases in the five most frequent variants, mean duration), and factors that
affect data analysis, such as potential performance issues, the prominent exhibited behavior of the process,
and dependencies between activities. Almost all logs have activity labels, and the labels vary from
interpretable names to randomized characters requiring additional context (33% of the logs name their
activities with a letter of the alphabet, and the others with a verb describing the action being performed).</p>
        <p>The top five variants are described in terms of case coverage and mean duration. This makes it
possible to analyze whether a process is standardized or whether the event log exhibits more varied behavior, in which case the
case coverage is low. The attribute reports the number of cases, the percentage of case coverage
relative to the total number of cases, and the mean duration when only the five most frequent
variants are considered. Potential issues are also recorded, since we want to offer the possibility to
distinguish processes with sequential behavior from event logs containing more flexible behavior. With this
attribute, it is possible to filter logs that exhibit sequential behavior, behavior with parallel structures,
or more flexible behavior. Furthermore, primary dependency shows loops between activities or
indications of activities that depend on one another.</p>
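        <p>To make the top-five-variants attribute concrete, the following Python sketch computes the number of covered cases, the case coverage percentage, and the mean duration for the k most frequent variants of a small toy log. The function name, the input shape (a mapping from case identifier to a trace and a duration in hours), and the toy data are assumptions made for illustration; the sketch is not the procedure used to build the collection.</p>
        <preformat><![CDATA[
```python
from collections import Counter
from statistics import mean


def top_variant_summary(cases, k=5):
    """Summarize the k most frequent variants of an event log.

    `cases` maps a case id to a pair (trace, duration_hours), where the trace
    is a tuple of activity labels. This input shape is an assumption.
    The log must contain at least one case.
    """
    variant_counts = Counter(trace for trace, _ in cases.values())
    top_variants = {variant for variant, _ in variant_counts.most_common(k)}
    durations = [d for trace, d in cases.values() if trace in top_variants]
    covered = len(durations)
    return {
        "cases_in_top_variants": covered,
        "case_coverage_pct": 100.0 * covered / len(cases),
        "mean_duration_hours": mean(durations),
    }


# Invented toy log with three variants over five cases.
toy_log = {
    "c1": (("register", "check", "approve"), 4.0),
    "c2": (("register", "check", "approve"), 6.0),
    "c3": (("register", "check", "reject"), 2.0),
    "c4": (("register", "approve"), 8.0),
    "c5": (("register", "check", "reject"), 3.0),
}
summary = top_variant_summary(toy_log, k=2)
print(summary)
```
]]></preformat>
        <p>A low coverage percentage for the most frequent variants would point to a less standardized process, which is the distinction the attribute is meant to support.</p>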
        <p>Finally, resource context provides detailed information regarding resources. Although 55% of
the logs contain resources, the number of resources ranges from 0 to 1440 across all logs. While some resources are
represented numerically, others contain the organization name or profession. The attribute resource
type indicates how the resources for each log are named. Furthermore, the type indication parameter
shows two or three examples of how the resources are named in each log.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Preliminary Analysis</title>
      <p>The collection of event logs has been applied in two distinct use cases, a project in the field of process
discovery and a project in process prediction, which are demonstrated in this section. Each use case is
presented with a description of its specific objectives and associated requirements. For each requirement,
we discuss its coverage by the proposed framework and name the exact filtering option. This is followed
by an explanation of how the proposed resource was utilized, along with a summary of the outcomes,
specifically whether relevant and use case-specific event log data could be identified.</p>
      <sec id="sec-3-1">
        <title>3.1. Use Case: Process Discovery and Feature Extraction.</title>
        <p>This project focuses on the discovery of relevant features of decision tasks in processes. The research
team aimed to analyze specific attributes of activities that contribute to or precede particular decisions.
While the project was situated within the context of financial processes (R1.1), its scope also extends to
processes that involve decision points more generally (R1.2). A central objective was to understand which
aspects of process execution are taken into account when decisions are made. Event log documentation
was considered a key requirement (R1.3). Related publications ensure clarity regarding the structure,
semantics, and context of the data, thereby supporting a more accurate interpretation of the
decision-related features within the processes. Table 1 summarizes the requirements and shows the corresponding
filtering applied to the categorization framework.</p>
        <p>R1.1 is the contextual requirement and is covered by the categorization framework. The attribute
Domain Application provides filtering functionality with regard to the domain of the process. It was
set to Financial and Banking, since both domains are relevant to the overall requirement of having
process data of financial processes. The number of decision points within processes (R1.2) is not covered
by the categorization framework. Even though the number of variants is given for every event log, it
is not clear whether these variants are due to decision points or other behavior patterns, such
as parallelization. R1.3, however, is covered. The academic context offers possibilities to filter for
publication-based event logs or event logs with which a publication can be associated. This filtering option
was set to not match N/A, which ensures that the filtered datasets are well documented. In total, the
applied filtering resulted in six event logs, each fulfilling the requirements and objectives of the project.</p>
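        <p>The two filters applied in this use case can be sketched in a few lines of Python. The record layout, the field names domain_application and publication_type, and the example rows are hypothetical and only mirror the attributes named in the text; the actual column names in the published metadata table may differ.</p>
        <preformat><![CDATA[
```python
TARGET_DOMAINS = {"Financial", "Banking"}   # values used for R1.1 in the text
NOT_AVAILABLE = "N/A"                       # marker excluded for R1.3


def select_documented_financial_logs(records):
    """Keep logs from the target domains that have an associated publication."""
    return [
        r for r in records
        if r["domain_application"] in TARGET_DOMAINS
        and r["publication_type"] != NOT_AVAILABLE
    ]


# Invented metadata rows; they do not describe real logs from the collection.
metadata = [
    {"name": "log_a", "domain_application": "Financial", "publication_type": "Conference paper"},
    {"name": "log_b", "domain_application": "Banking", "publication_type": "N/A"},
    {"name": "log_c", "domain_application": "Healthcare", "publication_type": "Journal article"},
    {"name": "log_d", "domain_application": "Banking", "publication_type": "Data set"},
]
selected = [r["name"] for r in select_documented_financial_logs(metadata)]
print(selected)
```
]]></preformat>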
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Use Case: Event Log Sampling for Next Activity Prediction.</title>
        <p>This project focused on finding suitable samples of event logs for training a next-step activity prediction
model. For that, it is necessary to foresee undesired execution of activities (R2.1) and to obtain a
considerable representation of the process through frequent traces (R2.2). The distribution of data
attributes should be pre-analyzed, e.g., by computing the frequency of categorical data values or the
mean of numerical attributes (R2.3). With that, the traces of each variant can be sorted, with priority given
to the traces of each variant that have more resources. Finally, a sampling function is applied that returns
traces with higher priorities. Table 2 shows the corresponding requirements and filtering criteria.</p>
        <p>All requirements are covered by the attributes of the metadata and could additionally be addressed with
different filter settings. R2.1, for example, was fulfilled by filtering out sequential or parallel behavior
using the prominent exhibited behavior attribute, but using the primary dependency filter to recognize loops
or indications of dependency between activities could also be relevant for this requirement. To
obtain a representative trace frequency (R2.2), the number of variants was filtered to between 100 and 5000,
but other filters could have been included, such as the percentage of case coverage and the mean
duration of the top five variants. R2.3 was fulfilled by filtering the mean case duration to a specific
desired time range, but it could also have been covered by computing the frequency of categorical data,
since the activity label description shows the three most frequent activities of each log. As a final result,
the framework covered all requirements and yielded 12 candidate event logs.</p>
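        <p>As a rough illustration of how the three filter settings combine, the Python sketch below applies a behavior exclusion, a variant-count range, and a mean-case-duration range to invented metadata rows. The field names and behavior labels are hypothetical, the bounds are treated as inclusive by assumption, and the duration range is a placeholder because the text does not state the range that was used.</p>
        <preformat><![CDATA[
```python
EXCLUDED_BEHAVIOR = {"Sequential", "Parallel"}  # hypothetical category labels


def select_for_prediction(records, variants=(100, 5000), duration_days=(1.0, 30.0)):
    """Apply the three filters described for R2.1, R2.2, and R2.3.

    The variant bounds come from the text; the duration range is a placeholder.
    Both ranges are treated as inclusive, which is an assumption.
    """
    v_low, v_high = variants
    d_low, d_high = duration_days
    return [
        r for r in records
        if r["prominent_behavior"] not in EXCLUDED_BEHAVIOR        # R2.1
        and v_low <= r["num_variants"] <= v_high                   # R2.2
        and d_low <= r["mean_case_duration_days"] <= d_high        # R2.3
    ]


# Invented metadata rows; they do not describe real logs from the collection.
metadata = [
    {"name": "log_a", "prominent_behavior": "Flexible", "num_variants": 250, "mean_case_duration_days": 12.0},
    {"name": "log_b", "prominent_behavior": "Sequential", "num_variants": 300, "mean_case_duration_days": 5.0},
    {"name": "log_c", "prominent_behavior": "Flexible", "num_variants": 40, "mean_case_duration_days": 3.0},
    {"name": "log_d", "prominent_behavior": "Flexible", "num_variants": 4800, "mean_case_duration_days": 90.0},
    {"name": "log_e", "prominent_behavior": "Flexible", "num_variants": 5000, "mean_case_duration_days": 30.0},
]
chosen = [r["name"] for r in select_for_prediction(metadata)]
print(chosen)
```
]]></preformat>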
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Possible Usage and Outlook</title>
      <p>A collection of 98 publicly available event logs categorized by 37 metadata attributes serves as a valuable
resource for researchers in the field of process mining. The collection of event logs enhanced by metadata
is available on Zenodo (https://zenodo.org/records/16268743) and is licensed under the Creative Commons Attribution 4.0 International license (https://creativecommons.org/licenses/by/4.0/deed.en).
The availability and license of each log are provided together with the resource, and all logs are free
to copy and redistribute for research purposes. The ability to filter logs based on domains or data
features accelerates the search process, supports reproducibility, and ensures efficient selection of logs
for research. By organizing event logs according to key characteristics, the metadata structure offers
a clear overview that facilitates informed dataset selection. We encourage researchers to extend the
framework by continuously adding new logs along with relevant metadata, fostering a growing and
structured repository. The provided resource thus democratizes process data handling by increasing
data accessibility: event logs can be searched and found efficiently and purposefully.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>The authors have not employed any Generative AI tools.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>W. M. P. Van Der Aalst</surname>
          </string-name>
          , Process Mining: Data Science in Action, Springer Publishing Company, Incorporated,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>De Weerdt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De Backer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Vanthienen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Baesens</surname>
          </string-name>
          ,
          <article-title>A multi-dimensional quality assessment of state-of-the-art process discovery algorithms using real-life event logs</article-title>
          ,
          <source>IS</source>
          <volume>37</volume>
          (
          <year>2012</year>
          )
          <fpage>654</fpage>
          -
          <lpage>676</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <source>BPI Challenge 2019</source>
          ,
          <year>2019</year>
          . URL: https://doi.org/10.4121/UUID:D06AFF4B-79F0-45E6-8EC8-E19730C248F1, data set
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Cremerius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pufahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Klessascheck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weske</surname>
          </string-name>
          ,
          <article-title>Event log generation in MIMIC-IV research paper</article-title>
          , in: Process Mining Workshops - ICPM, Bozen-Bolzano, Italy, Springer,
          <year>2022</year>
          , pp.
          <fpage>302</fpage>
          -
          <lpage>314</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H. D.</given-names>
            <surname>Bandara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bockrath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hobeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Klinkmüller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Pufahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rebesky</surname>
          </string-name>
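          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>Event logs of Ethereum-based applications</article-title>
          , in: BPM'21, Rome, Italy,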
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Aalst</surname>
          </string-name>
          ,
          <article-title>A generic approach to extract object-centric event data from databases supporting SAP ERP</article-title>
          ,
          <source>Journal of Intelligent Information Systems</source>
          <volume>61</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D.</given-names>
            <surname>Moher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Liberati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetzlaff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. G.</given-names>
            <surname>Altman</surname>
          </string-name>
          ,
          <article-title>Preferred reporting items for systematic reviews and meta-analyses: the PRISMA statement</article-title>
          ,
          <source>BMJ</source>
          <volume>339</volume>
          (
          <year>2009</year>
          )
          <fpage>b2535</fpage>
          -
          <lpage>b2535</lpage>
          . doi:10.1136/bmj.b2535.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <surname>W. M. van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process mining: a 360 degree overview</article-title>
          , in: J.
          <string-name>
            <surname>C. Wil M. P. van der Aalst</surname>
          </string-name>
          (Ed.),
          <source>Process Mining Handbook</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>