<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>PyStack't: Real-Life Data for Object-Centric Process Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Lien Bosmans</string-name>
          <email>lienbosmans@live.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jari Peeperkorn</string-name>
          <email>jari.peeperkorn@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johannes De Smedt</string-name>
          <email>johannes.desmedt@kuleuven.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Research Center for Information Systems Engineering (LIRIS), KU Leuven</institution>
          ,
          <addr-line>Leuven</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>The availability of representative event logs is a prerequisite for algorithmic design and evaluation of novel (object-centric) process mining techniques. This work presents PyStack't, a Python package that supports data preparation for object-centric process mining. It provides predefined data transformations that extract process data from publicly available APIs (GitHub) and export it to different OCED formats (OCEL 2.0, EKG). In addition, it includes summary statistics and interactive graph visualizations for data exploration. By tailoring to newcomers in the field and focusing on integrations with other open-source tools, this contribution aims to strengthen the emerging OCPM (tool) ecosystem.</p>
      </abstract>
      <kwd-group>
        <kwd>object-centric process mining</kwd>
        <kwd>event logs</kwd>
        <kwd>event log pre-processing</kwd>
        <kwd>process visualisation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>CEUR Workshop Proceedings (ceur-ws.org)</p>
      <p>PyStack’t primarily targets newcomers to object-centric process
mining, such as students or new practitioners. However, we hope that the permissive license also
motivates more experienced people to adapt the tool to their specific needs.</p>
      <sec id="sec-1-1">
        <title>Motivating the study of collaborative software development with OCPM</title>
        <p>Open-source software projects that reach a certain maturity and size are often developed and maintained
by a small core team together with a broad group of contributors. This community effort is supported by
various processes. Some of those are described in publicly available contribution guidelines.<sup>2</sup> However, a
large part remains invisible, buried deep in the activity logs of numerous issues, pull requests, and releases.</p>
        <p>We believe this data offers an interesting source for studying collaborative processes with process
mining. Examples include, but are not limited to, the resolution time for reported bugs, the consistency and quality of the
review process, and the retention of contributors and how their contributions to the project (e.g., bug reports,
documentation improvements, code development) evolve over time. We consider OCPM a well-aligned
choice because of the connections between issues (bug reports and requests for new or improved
functionality), pull requests (submitting new contributions for review), the core team of maintainers, the larger
community of contributors, the code inside the repository, and possible dependencies on other software.</p>
        <p>We expect that some lessons from successful open-source projects could be transferred to business
environments, since the collaboration dynamic of a small group of payroll employees supported by
various contract developers can resemble that of an open-source community project.</p>
        <p>Currently, PyStack’t can only extract activity data from GitHub code repositories. Choosing GitHub
as a first data source is motivated by multiple reasons: its API is well documented, its issue tracker creates
sufficient digital traces to be studied with object-centric process mining, and GitHub hosts a wide
variety of substantial open-source code repositories.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Features</title>
      <p>PyStack’t is published on PyPI: https://pypi.org/project/pystackt/. The features of the current version
0.1.0 can be divided into three categories, as visualized in Figure 1. A video that demonstrates the different
functionalities is available at https://youtu.be/AS8wI90wRM8.</p>
      <p><sup>2</sup> An example is the pandas contributing guide (https://pandas.pydata.org/docs/dev/development/contributing.html).</p>
      <sec id="sec-2-1">
        <title>Data Extraction</title>
        <list list-type="bullet">
          <list-item>
            <p><monospace>get_github_log</monospace>: Extracts activity data linked to a code repository using the GitHub API. Includes
predefined data mappings from multiple API responses<sup>3</sup> to the Stack’t relational schema, an object-centric
event data format. Output is stored in a DuckDB<sup>4</sup> database file.</p>
          </list-item>
        </list>
      </sec>
      <sec>
        <title>Data Export</title>
        <list list-type="bullet">
          <list-item>
            <p><monospace>export_to_ocel2</monospace>: Maps object-centric event data to the OCEL 2.0 format [<xref ref-type="bibr" rid="ref2">2</xref>]. The result is stored
in a SQLite database file compatible with tools such as Ocelot<sup>5</sup> and OCPQ [<xref ref-type="bibr" rid="ref3">3</xref>].</p>
          </list-item>
          <list-item>
            <p><monospace>export_to_promg</monospace>: Generates a folder structure consisting of CSV and JSON files that can be ingested
by PromG [<xref ref-type="bibr" rid="ref4">4</xref>] to build an event knowledge graph.</p>
          </list-item>
        </list>
      </sec>
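      <p>Because the OCEL 2.0 export is a regular SQLite file, it can be inspected with Python's standard library alone. The following minimal sketch counts events and objects per type. It does not call PyStack’t; the table and column names (<monospace>event</monospace>, <monospace>object</monospace>, <monospace>event_object</monospace>, <monospace>ocel_id</monospace>, <monospace>ocel_type</monospace>) follow our reading of the OCEL 2.0 specification [<xref ref-type="bibr" rid="ref2">2</xref>] and should be verified against an actual export. The toy event and object types are hypothetical stand-ins, not types produced by <monospace>get_github_log</monospace>.</p>
      <preformat>
```python
import sqlite3


def count_rows_per_type(connection, table):
    """Return a dict mapping ocel_type to the number of rows in 'event' or 'object'."""
    if table not in ("event", "object"):
        raise ValueError("table must be 'event' or 'object'")
    query = (
        f"SELECT ocel_type, COUNT(*) FROM {table} "
        "GROUP BY ocel_type ORDER BY ocel_type"
    )
    return dict(connection.execute(query).fetchall())


def build_toy_log():
    """Create a tiny in-memory stand-in for an exported OCEL 2.0 SQLite file.

    Assumption: only the core tables are modelled; type names are made up.
    """
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE event (ocel_id TEXT PRIMARY KEY, ocel_type TEXT);
        CREATE TABLE object (ocel_id TEXT PRIMARY KEY, ocel_type TEXT);
        CREATE TABLE event_object (
            ocel_event_id TEXT, ocel_object_id TEXT, ocel_qualifier TEXT
        );
        INSERT INTO event VALUES
            ('e1', 'created'), ('e2', 'commented'), ('e3', 'commented');
        INSERT INTO object VALUES ('o1', 'issue'), ('o2', 'user');
        INSERT INTO event_object VALUES
            ('e1', 'o1', 'created issue'), ('e1', 'o2', 'created by'),
            ('e2', 'o1', 'commented on'), ('e3', 'o1', 'commented on');
        """
    )
    return connection


connection = build_toy_log()
event_counts = count_rows_per_type(connection, "event")
object_counts = count_rows_per_type(connection, "object")
print(event_counts)
print(object_counts)
```
      </preformat>
      <p>To inspect a real export, <monospace>build_toy_log()</monospace> would be replaced by <monospace>sqlite3.connect(path)</monospace>, where <monospace>path</monospace> points to the SQLite file written by <monospace>export_to_ocel2</monospace>.</p>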
      <sec id="sec-2-2">
        <title>Data Exploration</title>
        <list list-type="bullet">
          <list-item>
            <p>PyStack’t offers a local interactive data visualization app. Users can view and interact with event traces
for any combination of objects, with a filter on the included event and object types.</p>
          </list-item>
          <list-item>
            <p><monospace>create_statistics_views</monospace>: Supports initial analysis with predefined views that contain summary
statistics.</p>
          </list-item>
        </list>
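        <p>The idea of viewing event traces for a selection of objects with a filter on event types can be illustrated with plain Python. This is a minimal sketch and not the app's implementation: the function <monospace>traces_for_objects</monospace>, the data layout, and all identifiers in the toy data are hypothetical. A filter on object types would work analogously.</p>
        <preformat>
```python
from collections import defaultdict


def traces_for_objects(events, relations, selected_objects, event_types=None):
    """Return, per selected object, its related events ordered by timestamp.

    events: dict mapping event_id to (timestamp, event_type)
    relations: iterable of (event_id, object_id) pairs
    selected_objects: set of object ids to include
    event_types: optional set of event types to keep; None keeps all
    """
    collected = defaultdict(list)
    for event_id, object_id in relations:
        if object_id not in selected_objects:
            continue
        timestamp, event_type = events[event_id]
        if event_types is not None and event_type not in event_types:
            continue
        collected[object_id].append((timestamp, event_type, event_id))
    # ISO 8601 timestamps sort chronologically as plain strings.
    return {
        object_id: [(etype, eid) for _, etype, eid in sorted(items)]
        for object_id, items in collected.items()
    }


# Hypothetical toy data: one issue touched by two users.
events = {
    "e1": ("2025-02-01T09:00", "created"),
    "e2": ("2025-02-01T10:00", "commented"),
    "e3": ("2025-02-02T08:30", "closed"),
}
relations = [
    ("e1", "issue-1"), ("e1", "user-a"),
    ("e2", "issue-1"), ("e2", "user-b"),
    ("e3", "issue-1"), ("e3", "user-a"),
]
result = traces_for_objects(
    events, relations, {"issue-1", "user-a"}, event_types={"created", "closed"}
)
print(result)
```
        </preformat>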
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Stability and Coverage</title>
      <table-wrap>
        <table>
          <thead>
            <tr><th>Feature</th><th>Stability</th><th>Scope</th></tr>
          </thead>
          <tbody>
            <tr><td>Extract OCED from GitHub repository</td><td>Reliable: includes error handling, tested for large datasets, no known bugs</td><td>Supports all GitHub repos, limited customization</td></tr>
            <tr><td>Export to OCEL 2.0</td><td>Reliable: validated compatibility with other tools, tested for large datasets, no known bugs</td><td>Any process data in Stack’t relational schema can be exported</td></tr>
            <tr><td>Export to PromG</td><td>Experimental: first version, output requires user validation</td><td>Any process data in Stack’t relational schema can be exported</td></tr>
            <tr><td>Generate summary statistics</td><td>Reliable: tested for large datasets, no known bugs</td><td>Only includes basic statistics</td></tr>
            <tr><td>Interactive data visualization</td><td>Usable for small to medium sized datasets</td><td>Attributes are not yet included, limited customization</td></tr>
          </tbody>
        </table>
      </table-wrap>
    </sec>
    <sec id="sec-4">
      <title>4. Documentation</title>
      <p>Extensive documentation is hosted at: https://lienbosmans.github.io/pystackt/. Each feature is described
on a separate page, including:</p>
      <list list-type="bullet">
        <list-item><p>an example code snippet;</p></list-item>
        <list-item><p>descriptions of input parameters and expected function behavior;</p></list-item>
        <list-item><p>additional instructions, e.g., how to generate a GitHub access token or view data stored in a DuckDB
database file;</p></list-item>
        <list-item><p>an overview of extracted data, including descriptions of event/object types, relations, and attributes (if
applicable);</p></list-item>
        <list-item><p>links to relevant information, such as GitHub data policies.</p></list-item>
      </list>
      <p><sup>3</sup> Examples of such API responses can be found at https://api.github.com/repos/LienBosmans/stack-t/issues/33, https://api.github.com/repos/LienBosmans/stack-t/issues/33/timeline and https://api.github.com/user/6475031.</p>
      <p><sup>4</sup> https://duckdb.org/</p>
      <p><sup>5</sup> https://ocelot.pm/</p>
    </sec>
    <sec id="sec-5">
      <title>5. Use Case</title>
      <p>To demonstrate PyStack’t, the pandas repository (github.com/pandas-dev/pandas) was used as a data
source. During the data extraction, intermittent save functionality mitigated the risk of forced system
restarts and GitHub API outages. The activity data of 57,806 GitHub issues could be extracted. Two
issues were skipped due to a 404 status message, indicated by a warning message in the log.</p>
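      <p>The combination of intermittent saving and skipping of missing issues can be sketched as a checkpointed loop. This is a generic illustration and not PyStack’t's implementation: the function names, the <monospace>NotFoundError</monospace> exception, and the JSON checkpoint file are assumptions made for this example (PyStack’t itself stores its output in a DuckDB file), and <monospace>fake_fetch</monospace> replaces real GitHub API calls.</p>
      <preformat>
```python
import json
import logging
import tempfile
from pathlib import Path


class NotFoundError(Exception):
    """Hypothetical stand-in for an HTTP 404 response."""


def extract_with_checkpoints(issue_numbers, fetch_issue, checkpoint_path, save_every=2):
    """Fetch issues one by one, saving progress so a restart can resume."""
    path = Path(checkpoint_path)
    if path.exists():
        state = json.loads(path.read_text())
    else:
        state = {"done": {}, "skipped": []}
    for position, number in enumerate(issue_numbers, start=1):
        key = str(number)  # JSON object keys are always strings
        if key in state["done"] or number in state["skipped"]:
            continue  # already handled in an earlier run
        try:
            state["done"][key] = fetch_issue(number)
        except NotFoundError:
            logging.warning("Issue %s returned 404, skipping", number)
            state["skipped"].append(number)
        if position % save_every == 0:
            path.write_text(json.dumps(state))  # intermittent save
    path.write_text(json.dumps(state))  # final save
    return state


def fake_fetch(number):
    """Simulated API call: issue 3 does not exist."""
    if number == 3:
        raise NotFoundError(number)
    return {"number": number, "title": f"issue {number}"}


with tempfile.TemporaryDirectory() as folder:
    checkpoint = Path(folder) / "checkpoint.json"
    first_run = extract_with_checkpoints([1, 2, 3, 4], fake_fetch, checkpoint)

    calls = []

    def counting_fetch(number):
        calls.append(number)
        return fake_fetch(number)

    # A second run over the same issues resumes from the checkpoint.
    second_run = extract_with_checkpoints([1, 2, 3, 4], counting_fetch, checkpoint)
```
      </preformat>
      <p>In this sketch the second run issues no fetch calls at all, because every issue is already recorded as either done or skipped.</p>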
      <p>The output is a DuckDB database file containing 1,151,801 events (37 types) with 370,529 event
attribute values (37 attributes), 253,857 objects (4 types) with 763,455 object attribute values (15
attributes), 2,484,082 event-to-object relations, and 68,796 object-to-object relations.</p>
      <p>To generate a smaller dataset for additional testing, the pm4py repository
(github.com/process-intelligence-solutions/pm4py) was used. Activity data for all 523 issues could be
extracted. The output file contains 3,919 events (21 types) with 1,559 event attribute values (21 attributes),
1,673 objects (4 types) with 5,107 object attribute values (15 attributes), 8,685 event-to-object relations,
and 559 object-to-object relations.</p>
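      <p>Combining these issue counts with the approximate extraction run times reported in Section 5.1 gives a rough sense of throughput. The short calculation below is a back-of-the-envelope estimate, not a measurement from the paper. It assumes GitHub's documented limit of 5,000 REST API requests per hour for authenticated users and assumes that this limit was the binding constraint during the pandas extraction.</p>
      <preformat>
```python
# Assumption: primary rate limit for authenticated GitHub REST API requests.
GITHUB_HOURLY_LIMIT = 5000


def issues_per_hour(issue_count, hours, minutes):
    """Average number of issues processed per hour of run time."""
    return issue_count / (hours + minutes / 60)


pandas_rate = issues_per_hour(57806, 29, 10)  # 29 hours, 10 minutes
pm4py_rate = issues_per_hour(523, 0, 10)      # 10 minutes

# If the hourly limit was fully used, this is the request budget per issue.
requests_per_issue_budget = GITHUB_HOURLY_LIMIT / pandas_rate

print(round(pandas_rate), round(pm4py_rate), round(requests_per_issue_budget, 1))
```
      </preformat>
      <p>Under these assumptions the pandas extraction averaged roughly 2,000 issues per hour, which leaves a budget of about 2.5 requests per issue. The much shorter pm4py extraction ran at a higher hourly rate, which would be expected if it finished before the hourly request quota was exhausted.</p>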
      <sec id="sec-5-1">
        <title>5.1. Approximate run times</title>
        <table-wrap>
          <table>
            <thead>
              <tr><th>Function</th><th>Approximate run time<sup>6</sup></th><th>Comment</th></tr>
            </thead>
            <tbody>
              <tr><td>get_github_log</td><td>29 hours, 10 minutes (pandas), 10 minutes (pm4py)</td><td>Limited by GitHub API rate limits. Outputs DuckDB file of 87.7 MB (pandas), 3.5 MB (pm4py).</td></tr>
              <tr><td>export_to_ocel2</td><td>20 seconds (pandas), 5 seconds (pm4py)</td><td>Outputs SQLite file of 227 MB (pandas), 1 MB (pm4py). Ocelot accepts pm4py but fails with Out of Memory error for pandas. OCPQ can load both.</td></tr>
              <tr><td>export_to_promg</td><td>22 seconds (pandas), 3 seconds (pm4py)</td><td>Outputs folder structure of 268 MB (pandas), 1 MB (pm4py).</td></tr>
              <tr><td>create_statistics_views</td><td>&lt; 1 second (both)</td><td>Data size does not affect the creation of a database view.</td></tr>
              <tr><td>prepare_graph_data</td><td>10 seconds (pandas), &lt; 1 second (pm4py)</td><td>Needed once before running visualization app.</td></tr>
              <tr><td>start_visualization_app</td><td>5 seconds initial load time (pm4py)</td><td>App freezes when attempting to load pandas dataset.</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Interactive data exploration</title>
        <p>The application generates interactive graph visualizations for the selected objects. Objects can be
searched, sorted and selected in the table at the top. Users can opt to only include a subset of event types
and object types using the check boxes on the left. A detailed description of all components is available
in the documentation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>This work presents PyStack’t, a Python package that supports data preparation for object-centric process
mining. We demonstrated its ability to generate novel OCED logs in different formats by extracting
activity data from the GitHub repositories of pandas and pm4py. An interactive application for data
exploration was presented as well. Given the need for more real-life datasets, we believe this to be a
valuable addition to the (open-source) OCPM tool ecosystem.</p>
      <p><sup>6</sup> Measured on a laptop with Intel(R) Core(TM) i7-8565U processor and 16 GB RAM.</p>
      <p><bold>Maturity.</bold> PyStack’t is a relatively new Python package, first released in February 2025, that can reliably
create OCED logs with over a million events. Not all features have the same level of maturity; a detailed
overview can be found in Section 3.</p>
      <p><bold>Future Roadmap.</bold> We are motivated to extend PyStack’t with additional tool integrations and
improved support for creating real-life OCED datasets. Concretely, we are working on the following features:</p>
      <list list-type="bullet">
        <list-item><p>Improved PromG integration.</p></list-item>
        <list-item><p>Functionality to manipulate datasets, e.g., create a filtered dataset, combine different datasets, rename
types.</p></list-item>
        <list-item><p>A more responsive and user-friendly UI for interactive visualizations.</p></list-item>
        <list-item><p>Research into additional data sources to include.</p></list-item>
      </list>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, ChatGPT was used to generate a list of writing prompts. After
using this service, the authors answered these prompts and combined the replies into a first draft.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Bosmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peeperkorn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goossens</surname>
          </string-name>
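          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lugaresi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Smedt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>De Weerdt</surname>
          </string-name>
          ,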
          <article-title>Dynamic and scalable data preparation for object-centric process mining</article-title>
          ,
          <source>arXiv preprint arXiv:2410.00596</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2410.00596.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Koren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Knopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Liß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. T. G.</given-names>
            <surname>Unterberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , et al.,
          <article-title>OCEL (object-centric event log) 2.0 specification</article-title>
          ,
          <source>arXiv preprint arXiv:2403.01975</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Küsters</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>OCPQ: Object-centric process querying and constraints</article-title>
          , in: International Conference on Research Challenges in Information Science, Springer,
          <year>2025</year>
          , pp.
          <fpage>383</fpage>
          -
          <lpage>400</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Swevels</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Klijn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fahland</surname>
          </string-name>
          ,
          <article-title>Object-centric process mining (and more) using a graph-based approach with PromG</article-title>
          ,
          <source>in: ICPM Doctoral Consortium/Demo</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>