<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Resilient Distributed P/T Net Simulators⋆</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Laif-Oke Clasen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrick Leonhardt</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Leven Wichelmann</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Hamburg, Faculty of Mathematics, Informatics and Natural Sciences, Department of Informatics</institution>
        </aff>
      </contrib-group>
      <fpage>60</fpage>
      <lpage>80</lpage>
      <abstract>
        <p>A distributed simulation of P/T nets requires partitioning the overall model into modules that run on multiple simulators. Failures in distributed systems can compromise consistency, resulting in incorrect outcomes. Ensuring resilience in such simulations is essential for maintaining correctness, especially in long-running executions. This research develops a concept for resilient simulators in distributed P/T net simulations. A prototyping approach grounded in constructivist principles enables the detection and recovery of failures in P/T net simulations. The evaluation follows a summative ex-post methodology, applying a criteria-based assessment to validate effectiveness. Experimental validation demonstrates the feasibility of the proposed concept of resilience mechanisms for distributed P/T net simulations. The system maintains simulation consistency despite failures by integrating state-saving and recovery techniques. The results confirm improved fault tolerance and reliability in distributed P/T net simulations. The introduced concept for resilient P/T net simulations enhances the robustness of distributed simulations. Preventing inconsistencies ensures accurate analysis and reliable execution over extended periods. The findings contribute to the development of fault-tolerant simulation frameworks, supporting more reliable distributed computing environments.</p>
      </abstract>
      <kwd-group>
        <kwd>Resilience</kwd>
        <kwd>Failure Detection</kwd>
        <kwd>Failure Recovery</kwd>
        <kwd>Distributed Simulation</kwd>
        <kwd>P/T Nets</kwd>
        <kwd>P/T Nets with Synchronous Channels</kwd>
        <kwd>Event Streaming</kwd>
        <kwd>Container</kwd>
        <kwd>Container Orchestration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The distribution of complex simulation models across several independent computing nodes is becoming
increasingly important in software engineering, particularly system simulation. [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ] The ability to
simulate models in a distributed manner offers significant advantages in terms of model simulation
scalability; however, it also risks inconsistencies and data loss due to errors in individual components.
Against this background, research into resilient mechanisms in distributed simulation environments is
becoming increasingly important, as it directly addresses the reliability and accuracy of such simulation
results. [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]
      </p>
      <p>The present work lies at the intersection of distributed systems and the simulation of Petri nets,
with a focus on resilience, i.e., the ability to recover system states after faults occur.
Especially in application areas such as process automation, workflow management, or complex,
long-running simulations of critical systems, simulations must continue to run resiliently and consistently
despite errors. Despite considerable progress in the distribution and scaling of P/T net simulations,
there are still significant research gaps regarding suitable strategies for fault detection, state saving, and
systematic recovery from faults. The development of resilient simulators is not only a technical challenge
but also opens up new methodological perspectives for the reliability of distributed simulations. The
state-saving and recovery mechanisms developed and validated in this work are expected to enhance
fault tolerance and provide a comprehensive understanding of the causes and effects of inconsistencies
in distributed Petri net simulations.</p>
      <p>Therefore, the central research question of this work is: How can distributed simulators of P/T
nets be designed to guarantee consistent simulation results even in the event of system failures? This
research question is based on the hypothesis that implementing structured statefulness procedures and
liveness-based monitoring strategies can significantly increase the resilience of distributed simulations
of P/T nets, allowing simulators to continue operating consistently despite failures.</p>
      <p>
        This contribution adopts a constructivist approach to addressing the research question, developing
a prototype simulator grounded in resilient design principles. [
        <xref ref-type="bibr" rid="ref7 ref8 ref9">7, 8, 9</xref>
        ] As a starting point, we use
the distributed P/T net simulation [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] in the Petri net editor, simulator and verifier Renew<sup>1</sup> [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. In
doing so, concepts for state saving and recovery after failures are developed, and
their effectiveness is experimentally tested. This evaluation involves a criteria-based assessment to
objectively assess the effectiveness and robustness of the developed concepts.
      </p>
      <p>Within the Foundations (Section 2), the topics of Renew (Section 2.1), Distributed P/T Nets
(Section 2.2), Failures (Section 2.3), Resilience (Section 2.4), and Kubernetes (Section 2.5) are systematically
introduced. Subsequently, the Problem Description (Section 3) and the design of the Distributed System
(Section 4) are presented. The ensuing section details the prototypes developed during this work,
specifically Detecting Failures of Simulators (Section 5) and Recovering Failures of Simulators –
DPTNResiliency (Section 6). The overall system is then evaluated using a classical case study in computer
science, the Producer-Storage-Consumer scenario (Section 7). A critical discussion (Section 8) of the
advantages, disadvantages, and limitations of the proposed concept follows. Finally, the article concludes
with an overview of Related Work (Section 9) and the Conclusion (Section 10).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Foundations</title>
      <p>
        This section introduces the concepts and technologies relevant to this paper. First, we
present Renew, a Java-based multi-formalism editor and simulator, especially for reference nets (2.1),
and introduce our set-up for distributed P/T nets (2.2). Then, we turn to the different possible
failure types, including their classification, detection, and recovery (2.3). We finish with a definition of
resilience (2.4) and an overview of the relevant terms and concepts within the Kubernetes environment
(2.5).
      </p>
      <sec>
        <title>2.1. Renew</title>
        <p>
          Renew [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] is an open-source software tool for modeling, analyzing, and simulating various types
of Petri nets, with a particular focus on distributed P/T nets (Section 2.2). It was developed by the
Algorithms, Randomization, and Theory (ART) research group, formerly Theoretical Foundations of
Computer Science (TGI), at the University of Hamburg.
        </p>
        <p>
          Renew is implemented in Java 17 [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] and built using Gradle 8.4 [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], ensuring robustness and
platform independence. Its software architecture is based on a modular plugin system, as described by
Duvigneau [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Its modularity and maintainability have recently been enhanced by adopting the Java
Platform Module System (JPMS) [
          <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
          ].
        </p>
        <p>
          For each supported Petri net variant, Renew provides a dedicated formalism plugin, the most
prominent being the reference net formalism according to Kummer [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Another plugin relevant to
this contribution is the cloud-native plugin [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], which can expose HTTP endpoints for Renew via Java
Spring [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ]. Through these HTTP endpoints, Renew simulations can be initiated and controlled.
        </p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Distributed P/T Nets</title>
        <p>
          The distributed P/T nets used in this contribution are the same as in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. In this context, they build
on the formal definition of [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and extend it with the informal extension of distributed synchronous
channels. At simulation time, multiple instances of these distributed P/T nets are
executed within a single simulator, and multiple simulators run side by side.
        </p>
        <p>The informal extension of distributed synchronous channels ensures that P/T nets can communicate
with each other by implementing rendezvous synchronization.</p>
        <sec id="sec-2-1-1">
          <title>1 Reference Net Workshop can be downloaded directly from its official website: http://renew.de.</title>
          <p>Transitions are labeled with signatures so that they can synchronize if they have the corresponding
signature. These labeled transitions with matching signatures can only fire together.</p>
          <p>
            The signature of a synchronous channel, which is used here, consists of type, identifier, and parameter.
There are two different types: the downlink and uplink, which can be denoted as  and  ,
respectively. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] Downlinks are the active or calling part and uplinks are the passive or called part.
The identifier usually describes the name of the channel or a relation, whereas the parameters are used
to exchange information between the synchronizing transitions.
          </p>
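          <p>The matching rule described above can be illustrated with a minimal Python sketch. It is not taken from Renew: the names <monospace>ChannelType</monospace>, <monospace>Signature</monospace>, and <monospace>can_synchronize</monospace> are hypothetical, and comparing the number of parameters (arity) is a simplifying assumption about when two signatures correspond.</p>
          <preformat>
```python
from dataclasses import dataclass
from enum import Enum


class ChannelType(Enum):
    DOWNLINK = "downlink"  # active, calling part
    UPLINK = "uplink"      # passive, called part


@dataclass(frozen=True)
class Signature:
    channel_type: ChannelType
    identifier: str  # name of the channel
    arity: int       # number of parameters (assumed matching criterion)


def can_synchronize(a: Signature, b: Signature) -> bool:
    """True if one side is a downlink, the other an uplink,
    and both agree on identifier and number of parameters."""
    types = {a.channel_type, b.channel_type}
    opposite = types == {ChannelType.DOWNLINK, ChannelType.UPLINK}
    return opposite and a.identifier == b.identifier and a.arity == b.arity


# Hypothetical channels of a producer and a storage net.
put_call = Signature(ChannelType.DOWNLINK, "put", 1)
put_offer = Signature(ChannelType.UPLINK, "put", 1)
get_offer = Signature(ChannelType.UPLINK, "get", 1)

print(can_synchronize(put_call, put_offer))
print(can_synchronize(put_call, get_offer))
```
          </preformat>
          <p>In this sketch, two uplinks never synchronize with each other, which mirrors the distinction between the calling and the called part.</p>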
          <p>
            Each of the distributed P/T nets can be considered an individual module in the sense of [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ]. Such modules have a left and a right interface. In the
context of distributed P/T nets, the left and right interfaces only contain distributed synchronous
channels.
          </p>
          <p>
            The overall system architecture for the distributed P/T nets [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] comprises multiple simulators, an
event-based communication medium (Kafka), and a synchronization service. The distributed P/T nets
are statically allocated to the available simulators.
          </p>
          <p>
            The communication between P/T nets across simulator boundaries is facilitated through distributed
synchronous channels. The event-based communication medium Apache Kafka is employed for this
purpose. Apache Kafka is an open-source, distributed event-streaming platform designed to deliver
scalability and high performance [
            <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
            ]. It is extensively utilized in distributed systems for real-time
data processing and transmission. Event streaming refers to the continuous processing of data as
discrete, immutable events, each annotated with a timestamp and sequence number. Such events can be
persistently stored and subsequently reused, enabling efficient analysis and processing. Kafka provides
persistence, high throughput, real-time processing capabilities, and support for diverse architectures
and programming languages [24, p. 6f]. The decoupling of producers and consumers promotes the
development of loosely coupled system architectures, establishing Kafka as a scalable and robust solution
for modern distributed systems, particularly when deployed with high-availability configurations.
          </p>
          <p>
            An illustrative example is provided by Clasen et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], who describe a classic IT scenario, the
producer-storage-consumer model, which is visualized in Figure 1. In this example, the producer,
consumer, and storage components are distributed across different simulators. The producers and
consumers act as active components, whereas the storage operates as a reactive component, featuring
only distributed uplinks and lacking downlinks.
          </p>
          <p>Figure 1, panel (a): Producer and Consumer Net</p>
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.3. Failures</title>
        <p>
          In every computer system, failures will inevitably occur at some point [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. This applies
in particular to distributed simulation systems. These errors may have different causes and origins.
        </p>
        <p>
          Internal or in-process errors occur within an application and can be classified into different types.
Logical errors [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] occur when incorrect instructions are present in the application’s source code,
resulting in issues such as erroneous calculations or infinite loops. These represent implementation
faults caused by flawed algorithms or incorrect control flows.
        </p>
        <p>
          Closely related are semantic errors, a term often used synonymously. Semantic errors [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] affect
the interpretation of data or interactions: although the code functions correctly on a technical level, the
meaning of the results or processes deviates from the specification.
        </p>
        <p>
          In addition to the mentioned types of errors that result in faulty program behavior, runtime errors may
also occur despite a "correct" program implementation. For example, excessive nested calls of a recursive
function typically lead to a stack overflow, where the call stack of a program exceeds the allocated
memory space. If a program continuously consumes memory without releasing it correctly, this leads
to a memory leak. Both stack overflows and memory leaks fall into the category of memory-related
errors. [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ]
        </p>
        <p>
          Problems in the parallel execution of (sub-)processes can cause synchronization errors, such as
deadlocks or race conditions. A deadlock occurs when a cyclic waiting situation arises between the
involved processes, with each waiting for the release of a system resource that is exclusively held by
another. A race condition, on the other hand, is an unintended fault in which the final result
depends on the relative timing of multiple operations. [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]
        </p>
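        <p>The cyclic waiting situation that characterizes a deadlock can be made concrete as a cycle in a wait-for graph. The following Python sketch is an illustration only and not part of the presented system; the function name <monospace>has_deadlock</monospace> and the process names are hypothetical.</p>
        <preformat>
```python
def has_deadlock(wait_for: dict[str, set[str]]) -> bool:
    """Detect a cycle in a wait-for graph using depth-first search.

    wait_for maps each process to the processes that hold a resource
    it is currently waiting for.
    """
    visiting: set[str] = set()  # processes on the current DFS path
    done: set[str] = set()      # processes known not to reach a cycle

    def visit(process: str) -> bool:
        if process in done:
            return False
        if process in visiting:
            return True  # back edge found: cyclic waiting
        visiting.add(process)
        for other in wait_for.get(process, set()):
            if visit(other):
                return True
        visiting.discard(process)
        done.add(process)
        return False

    return any(visit(process) for process in list(wait_for))


# A waits for B, B waits for C, C waits for A: nobody can proceed.
cyclic = {"A": {"B"}, "B": {"C"}, "C": {"A"}}
# Same chain, but C waits for nothing and can release its resource.
acyclic = {"A": {"B"}, "B": {"C"}, "C": set()}

print(has_deadlock(cyclic), has_deadlock(acyclic))
```
        </preformat>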
        <p>
          We can group logical and semantic errors, as well as synchronization and memory errors, into a
broader category of software errors to fit into the categorization of Schroeder et al. [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ]. They also
mention other types of errors that are not directly software-related, which we look at next.
        </p>
        <p>When disruptions occur during data transmission or exchange between distributed components, they
are referred to as network or communication errors. These can result from issues such as packet loss or
connection failures, leading to inconsistencies in the state of distributed systems. Such inconsistencies
can adversely affect the synchronization and correctness of the simulation.</p>
        <p>System and hardware errors originate from physical defects or failures in the components of the
distributed system, such as the CPU or memory. These errors are often difficult to predict and can cause
abrupt system crashes or faulty data processing.</p>
        <p>In addition to internal errors, there are also external error types that depend on external factors.
Undesired user inputs or environmental influences can impair the correctness of the simulation or even
disrupt the intended functionality of the entire system.</p>
        <p>In the context of our contribution, a relevant structure of error types emerges, as shown in Figure 2.
We distinguish between internal and external errors that may affect our distributed simulation. Among
internal errors, we further differentiate between software-based errors (such as logical, semantic,
synchronization-specific, and memory-related errors) and infrastructure-based errors, which depend
on the system architecture. The latter includes communication and hardware errors.</p>
        <p>To ensure the robustness and reliability of a distributed simulation, the early detection of occurring
failures is essential. These errors can occur in both deterministic and non-deterministic ways, which
complicates their identification and reproducibility. Typical deterministic errors include logical flaws or
faulty algorithms, whereas synchronization errors, such as deadlocks and race conditions, are considered
non-deterministic.</p>
        <p>
          The first method for detecting failures is through static analysis. Although they primarily serve
compile-time checking, formal analysis techniques can detect potential runtime errors before execution.
Static analysis tools such as SonarGraph [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] or FindBugs [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ] can flag unused variables, incorrect
assignments, or potential null references. Many modern IDEs already include basic static analysis tools
with various features as standard.
        </p>
        <p>
          On the other hand, there are dynamic methods to detect failures at runtime. This involves monitoring
the system during execution. Techniques such as assertion checking, instrumentation, or the use
of debugging tools like Valgrind [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ], the GNU Debugger [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], or AddressSanitizer [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ] enable the
detection of memory errors, invalid memory accesses, or synchronization issues. Particularly in
productive, complex, and automated systems or simulations, continuous and comprehensive monitoring
plays a crucial role. It is the foundation and, therefore, essential for ensuring the availability, reliability,
and performance of a system.
        </p>
        <p>
          Monitoring technologies such as Prometheus [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] offer suitable solutions by collecting a wide range
of metrics and visualizing them. In most cases, an alert manager is included for triggering alerts in
the event of anomalies. The open-source monitoring framework Kieker is also worth mentioning as a
valuable tool for the runtime monitoring of software-based systems. It incorporates the aforementioned
capabilities and is particularly useful for analyzing performance, architectural behavior, and failures in
distributed applications.[
          <xref ref-type="bibr" rid="ref36">36</xref>
          ]
        </p>
        <p>
          Since we cannot entirely prevent failures, we must at least design our systems to tolerate or recover
from them when they occur. It is worth noting that, in some situations, it can be better to tolerate small
failures by not addressing them if an automated recovery system might do more harm than good [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ].
        </p>
        <p>
          In general, we can categorize fault tolerance methods into two main groups: reactive and proactive
methods. Static analysis methods act as a proactive method. Proactive methods take action preemptively
to try and limit failures as much as possible, while reactive methods rely on failure detection and start
acting when a failure has occurred. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] For this reason, to make a system truly failure-tolerant, one
needs to implement at least one reactive method.
        </p>
        <p>
          Since proactive methods cannot entirely prevent failures, we do not elaborate on them further in this
paper. More information can be found, e.g., in [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ].
        </p>
        <p>
          An obvious reactive method is checkpointing, sometimes also referred to as checkpoint-restart. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ]
Here, each system component regularly stores its state on some form of persistent, highly available
storage. If a component fails (e.g., due to a crash), it can be restarted by a failure detection system and
automatically load the latest checkpoint to continue execution from that point.
        </p>
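        <p>A minimal Python sketch of checkpoint-restart is given below. It is an illustration under stated assumptions rather than the mechanism used by the simulators: a local temporary directory stands in for the persistent, highly available storage, the state is a small JSON document, and the names <monospace>save_checkpoint</monospace> and <monospace>load_checkpoint</monospace> are hypothetical.</p>
        <preformat>
```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write the state to a temporary file, then rename it over the
    checkpoint, so a crash never leaves a half-written checkpoint."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(state, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)


def load_checkpoint(path: Path, initial: dict) -> dict:
    """Return the latest checkpoint, or the initial state if none exists."""
    if path.exists():
        return json.loads(path.read_text())
    return initial


# A component performs three steps and checkpoints after each one.
workdir = Path(tempfile.mkdtemp())
checkpoint = workdir / "component.json"

state = load_checkpoint(checkpoint, {"step": 0})
for _ in range(3):
    state["step"] += 1
    save_checkpoint(checkpoint, state)

del state  # simulate a crash that loses the in-memory state
restored = load_checkpoint(checkpoint, {"step": 0})
print(restored)
```
        </preformat>
        <p>Writing to a temporary file and renaming it is one common way to keep the latest checkpoint intact if the component fails while saving.</p>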
        <p>
          Another reactive method, which can also be considered a hybrid method, is replication. Here,
individual components can be replicated, potentially even on different physical machines, resulting in
multiple instances of each component. If one instance fails, a replica can take its place to ensure smooth
execution, and the original instance gets restarted to serve as another replica. [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.4. Resilience</title>
        <p>
          According to Laprie et al. [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ], the term resilient has been used mainly as a synonym of fault-tolerant for
many years in the field of computer science. A resilient, fault-tolerant, or robust system should be able
to deliver its service, even in some circumstances that are not part of its typical mode of operation. We
use the term "some" here since we have already seen in Section 2.3 that it is impossible to completely
prevent failures, including those that are not tolerable by any software system (like a simultaneous
hardware failure in all machines).
        </p>
        <p>
          For this paper, we use the definition from Pradhan et al. [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ], which defines a resilient system as
a system that "includes efficient techniques for [...] ensuring its correct operation [...], even in the
presence of faults and failures [...]". Additionally, it requires the resilience mechanism of a system to be
autonomous, as human interaction is slow and introduces an additional opportunity for errors.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.5. Kubernetes</title>
        <p>
          Kubernetes [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] is an open-source software for container orchestration developed by Google. It is
commonly used to manage, harden and scale each individual component of deployed applications. It
allows for easy creation of clusters made of multiple physical computers (nodes).
        </p>
        <p>On each node, a service called kubelet acts as the "primary node agent" for Kubernetes and manages
everything that Kubernetes runs on that node. If, for example, an application crashes on a specific node,
the kubelets of all nodes are collectively responsible for restarting it.</p>
        <p>Additionally, various resource types allow applications to scale automatically according to demand,
making it suitable for all kinds of applications and workloads.</p>
        <p>
          Containerisation can be defined as the act of bundling a software application and all necessary
dependencies and system libraries into a single container. [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ] There are various technologies for
containerising applications like Docker [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] or Podman [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ]. Under the hood, they all implement the
specifications of the Open Container Initiative [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ] (OCI), making them mostly interchangeable at
runtime.
        </p>
        <p>A key property of containers is that they are stateless by default. That means every time a container
is started, it has no recollection of past instances of itself or other state.</p>
        <p>
          A Pod is the smallest deployable entity in Kubernetes and consists of one or more containers [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ].
Just like containers, Pods are not persistent by default; if a Pod dies and is restarted, a new replica of
the Pod is created, without any memory of previous instances of the Pod.
        </p>
        <p>
          Container lifecycle hooks are part of the OCI runtime specification [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ] and widely used when
working with containers. They are used to intercept specific events in the container lifecycle, for
example the container starting or stopping.
        </p>
        <p>
          Kubernetes offers a few hooks (more commonly called probes) related to Pod lifecycles, most
importantly liveness probes. As the name suggests, liveness probes allow Kubernetes to check that our Pods
are still active and have not failed or crashed [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]. If a liveness probe fails too often, Kubernetes will
treat the corresponding Pod as failed, and automatically restart it. This makes them an important tool
when developing any kind of failure-resistant application.
        </p>
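        <p>The phrase "fails too often" can be illustrated with a small Python sketch that counts consecutive failed probes against a threshold, in the spirit of the <monospace>failureThreshold</monospace> setting of Kubernetes probes. The function <monospace>restarts_triggered</monospace> is hypothetical and only models the counting logic; it does not reproduce the behavior of the kubelet in detail.</p>
        <preformat>
```python
def restarts_triggered(probe_results: list[bool], failure_threshold: int = 3) -> int:
    """Count the restarts caused by runs of consecutive failed probes.

    probe_results holds one entry per probe period (True = probe succeeded).
    """
    restarts = 0
    consecutive_failures = 0
    for succeeded in probe_results:
        if succeeded:
            consecutive_failures = 0  # any success resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the restarted instance starts fresh
    return restarts


# Isolated failures are tolerated; three failures in a row trigger a restart.
print(restarts_triggered([True, False, False, True, False]))
print(restarts_triggered([True, False, False, False, True]))
```
        </preformat>
        <p>Requiring several consecutive failures avoids restarting a simulator because of a single slow or lost probe response.</p>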
        <p>In Kubernetes, one usually does not create Pods directly. Instead, there are a number of
resource types that create and manage a set of Pods, each with its own use cases, advantages, and
disadvantages.</p>
        <p>A StatefulSet owns a set of Pods and maintains a unique identity for each one, as well as an ordering
over all its Pods. The Pod identity it provides includes a network address, (if configured) a dedicated
storage mount, a name, and an index label. If any Pod that belongs to a StatefulSet fails, for example
by crashing or because the associated liveness probe fails, a new Pod will be created with the same
identity as the failed Pod.</p>
        <p>Some applications do require Pods to have some form of persistent storage, or memory, even between
restarts. For this purpose, Kubernetes implements the concept of Persistent Volumes (PVs) and Persistent
Volume Claims (PVCs).</p>
        <p>A PV is a resource in a cluster that provides a piece of storage. Its lifecycle is independent of
Pods. If a Pod needs persistent storage, it can request some via a PVC. In this case, the Pod
can only start once a fitting PV is bound to its PVC. The volume is then mounted into the file system
of the corresponding container(s).</p>
        <p>PVs can either be created manually by cluster administrators, or by a StorageClass that is configured
to automatically provision PVs when a PVC requests storage from it.</p>
        <p>
          Ceph is a mature distributed file system that is developed for performance, scalability and reliability
[
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. Rook [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ] is a cloud native Kubernetes deployment of Ceph [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] that allows developers to focus
on configuring Kubernetes resources, while silently managing the Ceph file system on all nodes in the
background. It does so by providing StorageClasses for various purposes that automatically provision
PVs as needed.
        </p>
        <p>
          The Cloud Native Computing Foundation [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ], an offshoot of the Linux Foundation [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ], lists Rook as
one of only two graduated technologies in the context of Cloud Native Storage. [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ] Graduated projects
are "[...] considered stable, widely adopted, and production ready, attracting thousands of contributors".
This makes Rook a great choice for managing storage in a Kubernetes cluster.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Description</title>
      <p>All simulators collaboratively execute a single simulation. To this end, each simulator is capable of
concurrently executing multiple distributed P/T nets. The global state of the simulation comprises the
complete set of distributed P/T nets along with their respective markings. The local state of a simulator
is defined as the subset of distributed P/T nets it executes, including the corresponding markings. Thus,
each simulator is responsible for a partition of the overall system and manages the associated segment
of the simulation state.</p>
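      <p>The relationship between local and global state can be sketched in Python as follows. The simulator, net, and place names are hypothetical, and representing a marking as a multiset of tokens per place is an assumption made for illustration; the sketch only shows that the global state is the union of disjoint local states.</p>
      <preformat>
```python
from collections import Counter

# Local state of each simulator: net name -> marking (place -> token count).
local_states = {
    "simulator-0": {"producer": Counter({"ready": 1})},
    "simulator-1": {"storage": Counter({"items": 2, "free": 3})},
    "simulator-2": {"consumer": Counter({"idle": 1})},
}


def global_state(local_states: dict) -> dict:
    """Merge all local states; every net must belong to exactly one simulator."""
    merged = {}
    for simulator, nets in local_states.items():
        for net, marking in nets.items():
            if net in merged:
                raise ValueError(f"net {net!r} is allocated to more than one simulator")
            merged[net] = marking
    return merged


state = global_state(local_states)
print(sorted(state))
```
      </preformat>
      <p>If one simulator fails, exactly its entries are missing from the merged state, which is why each simulator has to be able to restore its own partition.</p>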
      <p>In the context of complex systems or processes, simulations often run for extended periods of time.
Prolonged runtimes increase the probability of individual component failures. In the worst-case scenario,
such failures can result in inconsistencies that necessitate a complete restart of the simulation. Since
failures may stem from a variety of causes—most notably hardware faults—they cannot be entirely
precluded.</p>
      <p>It is, therefore, imperative to develop robust mechanisms for detecting and mitigating failures
in distributed simulations. Failures are detected through continuous monitoring of the simulators’
operational status. Once a failure is identified, the affected simulator is restarted on a functional
computing node.</p>
      <p>To ensure correct recovery, appropriate techniques must be employed to reconstruct the simulator’s
state, as each simulator retains a distinct portion of the global state. This is achieved by periodically
persisting the simulator’s state under predefined conditions to highly available and durable storage.
In the event of a failure, the simulator can resume execution from the most recent consistent state,
ensuring continuity of the simulation.</p>
      <p>The concept introduced in this work is validated in the context of distributed simulation of P/T nets.
Given the distributed nature of the simulation, the overall system is inherently decentralized. The
first objective of this work is the conceptual design and technical realization of the distributed system
(Section 4).</p>
      <p>A subsequent objective involves developing mechanisms for fault detection in simulators (Section 5).
Since simulators may be inaccessible in the event of failure, direct fault detection within the simulators
is infeasible. Consequently, fault detection must be implemented as part of the surrounding distributed
system infrastructure.</p>
      <p>Recovery techniques must be tailored to the specific architectural design of the simulators, which
are themselves responsible for ensuring their resilience (Section 6). Simulators must be capable of
persistently storing their state at semantically meaningful intervals and resuming execution from that
state following a failure.</p>
      <p>Finally, the proposed concept is evaluated within a representative scenario (Section 7), followed by a
critical discussion of its advantages, limitations, and potential drawbacks (Section 8).</p>
    </sec>
    <sec id="sec-4">
      <title>4. Distributed System</title>
      <p>The distributed system comprises simulation components, a highly available and distributed memory
architecture, and the overarching distributed environment. A core requirement of this environment
is the ability to detect failures in the participating simulation components and initiate appropriate
recovery measures, such as restarting a failed simulator instance on an alternative computational node.</p>
      <p>The simulation components operate within the context of distributed simulations of P/T nets. This
setup necessitates the presence of the simulation components themselves and a reliable communication
medium through which simulators can interact and coordinate during distributed execution.</p>
      <p>Because the P/T nets are simulated in a distributed manner, the system requires at least two
simulators operating concurrently. The proposed recovery mechanism additionally requires a distributed,
highly available, and persistent storage system for maintaining the simulator states. As long as a simulator’s
state is preserved within this storage, it remains accessible even after a failure, enabling effective
recovery and continuation of the simulation process.</p>
      <p>[Figure: architecture of the distributed system. Labels: distributed environment; physical machines; simulator 1 to simulator n, each executing a P/T net with distributed synchronous channel(s); synchronisation service; communication medium; distributed storage.]</p>
      <p>
        The container orchestration platform Kubernetes [
        <xref ref-type="bibr" rid="ref43">43</xref>
        ] (Section 2.5) is employed to implement the
distributed environment. This choice necessitates that all system components be encapsulated as
containers. At the same time, Kubernetes’ built-in fault detection mechanisms, such as liveness probes,
can be leveraged to monitor and maintain system integrity.
      </p>
      <p>
        The distributed architecture (Section 2.5) further necessitates a storage system that is not only
distributed and persistent but also highly available to maintain the simulators’ states reliably. To this
end, the open-source software Rook [
        <xref ref-type="bibr" rid="ref52">52</xref>
        ] is utilized as a cloud-native storage orchestrator within the
Kubernetes environment. Rook builds upon the mature and widely adopted storage solution Ceph [
        <xref ref-type="bibr" rid="ref57">57</xref>
        ],
which is extensively used in production environments.
      </p>
      <p>While there is no predefined upper limit on the total number of nodes, a minimum number of nodes is
essential for the distributed storage system to function correctly. Specifically, a minimum of three nodes
is required to ensure high availability and consistency. Operating with only two nodes introduces the
risk of a split-brain scenario, whereas a single-node configuration constitutes a single point of failure.</p>
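      <p>The node-count argument can be illustrated with majority-quorum arithmetic. The following sketch assumes a simple majority rule, in which more than half of the nodes must agree before the cluster proceeds. It illustrates the general principle only and is not a description of the internal logic of Ceph; the function names are hypothetical.</p>
      <preformat>
```python
def quorum_size(nodes):
    """Smallest number of nodes that forms a strict majority."""
    return nodes // 2 + 1


def tolerated_failures(nodes):
    """Number of nodes that may fail while a majority remains reachable."""
    return nodes - quorum_size(nodes)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(n, quorum_size(n), tolerated_failures(n))
```
      </preformat>
      <p>Under this assumption, clusters of one and of two nodes both tolerate zero failures, and two nodes that lose contact with each other cannot decide which side should continue. Three nodes is the smallest configuration in which a majority survives the loss of one node.</p>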
      <p>The simulation components are developed in the context of distributed P/T nets (Section 2.2). The
simulators must incorporate effective failure recovery mechanisms to support fault tolerance,
necessitating a design emphasizing extensibility. The simulator Renew (Section 2.1) is particularly well suited for
this purpose, as it not only facilitates the distributed execution of P/T nets but also features a modular
plugin architecture that supports straightforward extension.</p>
      <p>Moreover, Renew requires a dedicated synchronization service to coordinate the distributed
simulation of P/T nets (Section 2.2). This service is responsible for determining which distributed synchronous
transitions may fire together. For this coordination, the synchronization service employs Renew’s
unification algorithm.</p>
      <p>Finally, an event-driven communication infrastructure (Section 2.2) is essential for inter-simulator
messaging. To meet this requirement, Kafka is integrated into the Renew simulation framework as the
communication medium. Kafka itself is deployed in a highly available configuration so that the
communication medium is resilient as well.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Detecting Failures of Simulators</title>
      <p>Reliable detection of simulator failures is a fundamental prerequisite for ensuring the overall resilience
and correctness of the simulation framework. This section outlines the systematic approach adopted
to address this challenge, beginning with a detailed analysis of the requirements (Section 5.1) that such a
failure detection mechanism must satisfy. Based on these requirements, we then provide a specification
(Section 5.2) of the detection logic, followed by a discussion of the design (Section 5.3) that guided the
development of the solution. The corresponding implementation (Section 5.4) is subsequently described.
Finally, the effectiveness of the proposed approach is assessed through a comprehensive evaluation
(Section 5.5).</p>
      <sec id="sec-5-1">
        <title>5.1. Requirements</title>
        <p>To keep the simulation of a distributed P/T net as correct and error-free as possible, as many
types of errors as possible must be detected efficiently while the simulation is running.
Detected errors can then be handled by the recovery mechanisms of the following
prototype (Section 6), thereby ensuring the resilience of the simulators.</p>
        <p>For this purpose, various error detection methods described in Section 2.3 are employed, with dynamic
runtime analysis, as well as monitoring and logging, playing a key role in identifying potential runtime
errors within and between individual simulators. The central requirement for this prototype is that
failures of a simulator can be detected.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Specification</title>
        <p>Besides possible logical and semantic errors—which we already try to detect and resolve within our
quality assurance process—the most critical errors to identify are fail-stop errors. These are the kinds of
errors that can halt the entire simulation and potentially compromise its results.</p>
        <p>A mechanism is therefore required that can identify fail-stop errors. For this purpose, the liveness, i.e.,
the availability, of the simulator is checked at short, regular intervals.</p>
        <p>Particular attention is given to infrastructure-related faults. Thus, if communication errors occur
during the simulation—i.e., errors in data transmission between the distributed components caused by,
for example, connection losses or packet loss—they must be detected and reported. Similarly, if system
or hardware faults arise due to processor or memory failures or malfunctions, these errors must also
be identified for recovery. External errors, such as those caused by force majeure or by
specific user inputs, are acknowledged but are not considered in our context of distributed simulations.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Design</title>
        <p>In order to detect errors in a simulator, we need an infrastructure that can check the availability of the
simulators and, if necessary, start new simulators on other nodes. For this purpose, a container-based
infrastructure is built that can recognize failing nodes and control a container’s lifecycle.</p>
        <p>In addition, a corresponding HTTP endpoint is required in Renew to check the availability of the
simulator. The CloudNative plugin is used to provide this endpoint.</p>
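        <p>The contract between such an endpoint and a periodic availability check can be sketched as follows. This is a minimal illustration using only the Python standard library, not the CloudNative plugin itself. The path <monospace>/health</monospace>, the handler class, and the function names are assumptions made for this example, and in the actual system the probing is performed by the container infrastructure rather than by user code.</p>
        <preformat>
```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers 200 on the hypothetical /health path and 404 elsewhere."""

    def do_GET(self):
        status = 200 if self.path == "/health" else 404
        self.send_response(status)
        self.end_headers()

    def log_message(self, *args):
        pass  # keep the example quiet


def is_alive(url, timeout=1.0):
    """Probe once: True only if the endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        return False


def demo():
    """Probe a running endpoint, stop it, and probe again."""
    server = HTTPServer(("127.0.0.1", 0), HealthHandler)  # port 0: any free port
    url = "http://127.0.0.1:%d/health" % server.server_address[1]
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    before = is_alive(url)   # the stand-in simulator is up
    server.shutdown()        # emulate a fail-stop of the simulator
    server.server_close()
    thread.join()
    after = is_alive(url)    # the endpoint is no longer reachable
    return before, after


if __name__ == "__main__":
    print(demo())
```
        </preformat>
        <p>A probe of this kind only observes whether the endpoint answers. A simulator that still answers but no longer makes progress would pass the check.</p>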
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Implementation</title>
        <p>
          We implement our containers with Docker [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ] and the container orchestration system using
Kubernetes [
          <xref ref-type="bibr" rid="ref43">43</xref>
          ] (Section 2.5). For this reason, we can utilize Kubernetes liveness probes to detect the liveness
of our simulators.
        </p>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Evaluation</title>
        <p>With this implementation, we can detect failures of physical machines, as Kubernetes detects node
failures and reschedules Pods on healthy nodes. Additionally, we can detect failures of the simulator
Pods directly if they disturb the availability of the HTTP endpoint of the CloudNative plugin.</p>
        <p>This means we can detect fail-stop errors using this method, including crashes, node failures, network
errors, and other similar issues. However, silent errors, such as deadlocks in the Renew-internal
simulation thread pool, would remain undetected.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Recovering Failures of Simulators - DPTNResiliency</title>
      <p>This section presents the DPTNResiliency prototype, which addresses the challenge of recovering from
simulator failures within distributed simulation environments. We begin by outlining the requirements
(Section 6.1) that guide the development of a resilient recovery mechanism. Based on these requirements,
we then provide a specification (Section 6.2) of the failure recovery behavior. This is followed by a
detailed description of the design (Section 6.3) that shapes the structure and coordination logic of
DPTNResiliency. Subsequently, we describe the concrete implementation (Section 6.4) of the proposed
mechanism within our simulation framework. Finally, the section concludes with a thorough evaluation
(Section 6.5) of this prototype.</p>
      <sec id="sec-6-1">
        <title>6.1. Requirements</title>
        <p>Our overarching objective is to develop a fully resilient DPTN simulation. However, the present
contribution explicitly addresses the resilience of the simulators themselves.</p>
        <p>In this context, it is imperative that fail-stop failures affecting individual simulators do not compromise
the overall functionality or correctness of the distributed system. To this end, resilience must be ensured
through a reactive failure recovery mechanism, as proactive strategies alone are insufficient to eliminate
the occurrence of fail-stop failures.</p>
        <p>The prototype developed in this work is designed to facilitate reactive recovery from such
simulator failures. Specifically, when a simulator process crashes or experiences a fail-stop event, it must
be automatically restarted without impairing the operational integrity of the distributed simulation.
While a minimal delay associated with the recovery process is unavoidable, it remains functionally
inconsequential to the system as a whole.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Specification</title>
        <p>In accordance with the requirements outlined in Section 6.1, a reactive recovery mechanism is necessary
to mitigate the effects of fail-stop events. Given that the distributed P/T nets simulated within the
Renew framework are serializable, we adopt a checkpoint-based recovery strategy. This approach
entails periodically saving and, if required, reloading the simulation state within Renew.</p>
        <p>Each checkpoint must encapsulate the complete state of the simulation, including both the internal
markings of all distributed P/T Net (DPTN) instances and the communication state with the
synchronization service. This includes metadata on distributed transitions—such as whether a firing request
has been issued—and a consistent record of which communication events have been processed up to
the checkpoint.</p>
        <p>To ensure the durability of checkpoints beyond the occurrence of a fail-stop event, these must be
stored in a fault-tolerant, highly available storage system. This storage must not reside within a single
container or be bound to a single physical node. Reliance on a single container is inherently fragile,
as container failure leads to complete data loss. Similarly, tying checkpoint persistence to a single
node is inadequate since a node-level failure would result in the irrevocable loss of all checkpoint data.
Consequently, a resilient, distributed storage solution is imperative—one that can withstand partial
system failures without compromising checkpoint integrity.</p>
        <p>
          Our distributed simulation system can be viewed as a distributed database system that executes
distributed transactions, in the sense that each simulator processes events coming in from the event
broker. We can design our system in a way similar to how database systems manage transactions [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ],
making sure to uphold the ACID properties that serve as foundational guarantees [
          <xref ref-type="bibr" rid="ref59 ref60">59, 60</xref>
          ]. The
distributed nature of our system also means we are subject to the constraints of the CAP theorem [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ].
By implementing ACID, we guarantee Consistency and Partition Tolerance at the cost of
Availability, since a distributed system cannot guarantee all three properties at once when a network
partition occurs. Sacrificing availability means that a simulator may remain in a recovery loop for some
time so that the other properties are
upheld.
        </p>
        <p>Upon initialization, a simulator will check if it finds an existing checkpoint, in which case it must
assume a previous failure and recover from that checkpoint. Using the list of events from the event
broker as a log, it can recover from the failure by replaying uncommitted events on top of the latest
checkpoint. In order for this to work, we must only commit events (i.e., mark them as processed) once a
checkpoint has been created that includes the implications of the events.</p>
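        <p>The ordering rule described above, checkpoint first and commit second, can be sketched with a small in-memory model. The classes below are hypothetical stand-ins: a list with a committed offset plays the role of the event broker, a dictionary plays the role of durable checkpoint storage, and a marking is reduced to token counts per place. The sketch covers only a failure that occurs before the checkpoint is written; it does not model a failure between checkpoint and commit.</p>
        <preformat>
```python
class EventLog:
    """In-memory stand-in for the broker: events plus a committed offset."""

    def __init__(self, events):
        self.events = list(events)
        self.committed = 0  # number of events marked as processed

    def uncommitted(self):
        return self.events[self.committed:]


class Simulator:
    """Toy simulator whose whole state is a marking (place to token count)."""

    def __init__(self, log, store):
        self.log = log
        self.store = store  # dict standing in for durable checkpoint storage
        # On start: resume from a checkpoint if one exists, else start empty.
        self.marking = dict(store.get("checkpoint", {}))

    def apply(self, event):
        place, delta = event
        self.marking[place] = self.marking.get(place, 0) + delta

    def process_next(self, crash_before_checkpoint=False):
        event = self.log.uncommitted()[0]
        self.apply(event)
        if crash_before_checkpoint:
            raise RuntimeError("emulated fail-stop")
        self.store["checkpoint"] = dict(self.marking)  # 1. persist the state
        self.log.committed += 1                        # 2. only then commit


def run_with_crash():
    """Process three events with one emulated failure in between."""
    log = EventLog([("buffer", 1), ("buffer", 1), ("buffer", -1)])
    store = {}
    sim = Simulator(log, store)
    sim.process_next()
    try:
        sim.process_next(crash_before_checkpoint=True)
    except RuntimeError:
        pass  # the in-memory marking of the crashed instance is lost
    sim = Simulator(log, store)  # restart: restores the last checkpoint
    while log.uncommitted():
        sim.process_next()       # replay everything that was not committed
    return sim.marking


if __name__ == "__main__":
    print(run_with_crash())
```
        </preformat>
        <p>Because the second event was never committed, the restarted instance replays it on top of the restored marking and reaches the same final marking as a run without a failure.</p>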
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Design</title>
        <p>
          To facilitate resilient simulations, we developed a dedicated Renew plugin named DPTNResiliency.
This plugin relies on other essential Renew components—specifically, the Simulator and GUI
plugins—to enable functionalities such as saving and loading simulation states. It is employed by the
DPTNFormalism plugin, introduced in [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], to execute distributed P/T net simulations.
        </p>
        <p>The DPTNResiliency plugin introduces the console command startResilientSimulation,
which initiates the simulation of a specified distributed P/T net. If a checkpoint is available, the
simulation automatically resumes from the most recent one. This command serves as the entry point
for determining the current simulation state when launching a simulator instance.</p>
        <p>Furthermore, the event-streaming mechanism within the DPTNFormalism plugin has been extended
to notify the DPTNResiliency plugin after the successful processing of each event. This notification
mechanism is essential for persisting and tracking the accurate communication state throughout the
simulation. Additionally, the DPTNResiliency plugin manages Kafka commits to make sure events are
only marked as read once a checkpoint for them has been created.</p>
        <p>The responsibility for checkpoint creation lies with the DPTNResiliency plugin, which utilizes the
current marking and communication status. To ensure fault tolerance, especially in the event of a
simulator crash, each checkpoint is initially written to a temporary location. Only upon successful
creation is it copied to its final destination. Subsequently, the corresponding event is marked as
consumed, ensuring that it cannot be processed again.</p>
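        <p>The write-to-temporary-location pattern can be sketched as follows. Two deliberate simplifications apply: the sketch serializes the state as JSON, whereas Renew uses its own serialized simulation state, and it moves the finished file into place with an atomic rename instead of a copy. The function names and the file layout are hypothetical.</p>
        <preformat>
```python
import json
import os
import tempfile


def write_checkpoint(directory, name, state):
    """Write state to a temporary file, then atomically move it into place."""
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())  # force the bytes to the storage device
        final_path = os.path.join(directory, name)
        # Atomic when both paths are on the same file system.
        os.replace(tmp_path, final_path)
        return final_path
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a partial file behind
        raise


def read_checkpoint(directory, name):
    """Return the stored state, or None if no checkpoint exists yet."""
    path = os.path.join(directory, name)
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        return json.load(handle)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as directory:
        write_checkpoint(directory, "storage.json", {"buffer": 1})
        print(read_checkpoint(directory, "storage.json"))
```
        </preformat>
        <p>A reader of the final path therefore sees either the previous complete checkpoint or the new complete checkpoint, never a partially written file.</p>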
        <p>
          Checkpoints generated by the DPTNResiliency plugin must be stored in a resilient, highly available
storage system. This storage operation is triggered after each Kafka event has been successfully
processed. For this purpose, the distributed storage system Ceph [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] has been selected. Deployed on
at least three nodes, Ceph provides high availability and fault tolerance, thereby
eliminating single points of failure and mitigating split-brain scenarios.
        </p>
        <p>The creation of a checkpoint can also be regarded as a transaction in a (multi-)database system, making
it effectively a sub-transaction of the firing of a distributed synchronous channel.
As its write operations conform to ACID principles, Ceph ensures the correctness and durability of
stored simulation checkpoints. Being a distributed system itself, Ceph is also subject to the
constraints of the CAP theorem. It is designed to prioritize Consistency and Partition Tolerance: in the
event of network partitioning, it may temporarily compromise the availability of specific components
to uphold global consistency guarantees, which matches the specification of our system.</p>
      </sec>
      <sec id="sec-6-4">
        <title>6.4. Implementation</title>
        <p>The DPTNResiliency plugin is implemented as an additional modular Renew plugin. The simulators, as
well as the synchronization service, run in Docker containers within Kubernetes (Section 2.5), deployed
via StatefulSets that include liveness probes. Kafka (Section 2.2), our event broker, is deployed with
high availability on Kubernetes. Attached to each Renew container is a Persistent Volume provided by
Rook, the Kubernetes Deployment of Ceph, on which the checkpoints are stored.</p>
      </sec>
      <sec id="sec-6-5">
        <title>6.5. Evaluation</title>
        <p>Our recovery mechanism is backed by the Kafka event history, which acts as a highly available and
distributed log. Our checkpoints are likewise stored on a highly available distributed storage system.
After processing an event, a simulator creates a checkpoint, using an atomic copy operation
when moving it to its final location to prevent checkpoint corruption. Only after the checkpoint has been
created does the simulator commit its Kafka offset, marking the event as consumed in the log and ensuring
that it is not consumed again. Furthermore, any simulator that fails is automatically restarted, which
triggers the recovery mechanism that upholds the simulation’s integrity.</p>
        <p>If a failure occurs before a checkpoint for an event is written, the simulator restores from the previous
checkpoint and executes the event again, thus recovering successfully. If a failure occurs after a
checkpoint is written, but before the Kafka commit has been completed, the simulator restores from
the new checkpoint but is unable to execute the event again. This will lead to a global event timeout
and the event will be attempted again, in which case it will now succeed, thus completing the recovery.
Failures during the recovery process also fall into one of these two categories.</p>
        <p>When drawing a parallel to database transactions, it becomes evident that our solution adheres to the
ACID properties. If we view the processing of each Kafka event as a single transaction, it becomes clear
that our methods guarantee atomicity and durability, in a similar way to how databases implement
transaction logic. This is because we either process a single event fully or not at all (in which case
we later recover and do process it), and we make the changes durable by writing them to our storage
medium. The non-resilient Distributed P/T Net implementation already ensures consistency and
isolation and remains unaffected by our enhancements.</p>
        <p>Altogether, these guarantees fulfill all functional requirements for our prototype, thereby confirming
the resilience of the simulators. Nonetheless, one limitation persists: unresolved race conditions
exist within the simulation thread pool of Renew. Although a delay mechanism has been introduced
to temporarily mitigate this issue—and occurrences are exceedingly rare—it nonetheless imposes a
slowdown on the distributed simulation.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Producer-Storage-Consumer Scenario</title>
      <p>To validate the functionality of the prototypes outlined in the preceding sections, we conducted an
experiment in our Kubernetes cluster. The structure of the experiment is presented
as follows: Section 7.1 details the experimental setup, Section 7.2 describes the execution process, and
Section 7.3 discusses our observations.</p>
      <p>The experimental scenario is based on a classical problem in computer science: the
Producer-Consumer-Storage example. In this context, we employ a modeling approach based on distributed
P/T nets (Section 2.2), featuring distributed up- and downlinks in place of standard synchronous
communication channels, as depicted in Figure 1.</p>
      <p>The scenario comprises three components: a producer, a consumer, and a storage unit. The producer
operates within a cyclic process in which a message is first generated and subsequently transmitted to the
storage via a distributed downlink. Conversely, the consumer follows a cyclic process in which messages
are actively retrieved from storage—again via a distributed downlink—and subsequently consumed. This
configuration implies the presence of two active entities: the producer and the consumer. In contrast,
the storage component, which exclusively features distributed uplinks, remains entirely passive.</p>
      <p><bold>7.1. Setup</bold></p>
      <p>As the foundation for this experiment, we employ the distributed system described previously in
Section 4, comprising three physical machines. Within our Kubernetes cluster, we deploy a highly available
Kafka cluster to serve as the communication medium. Furthermore, we deploy four components—each
a Renew instance—using Kubernetes StatefulSets: the synchronization service, Producer, Storage, and
Consumer. Each of these components consists of exactly one Container in a Kubernetes Pod with a
replica size of one. All StatefulSets are configured with liveness probes that target endpoints exposed
by the CloudNative Renew plugin. Moreover, each component is provisioned with 5 GiB of highly
available persistent storage via Rook/Ceph.</p>
      <p>Prior to initiating the experiment, we ensure that all system components are returned to their default
state. To this end, we delete the four StatefulSets, if present, and erase the data stored in their associated
volumes. Additionally, we remove all Kafka topics and consumer offsets to prevent residual data from
affecting communication in the upcoming simulation. This reset procedure is critical, as remnants of
prior experiments could otherwise influence the outcomes.</p>
      <preformat>ScriptCommand: Try to load file startscript_storage.txt
Opening gui...
Passing args to gui...
Initialising CheckpointStorageServiceImpl with no previous checkpoint
Starting Simulation...
...
Simulation initialized.
INFO: Consumed event UpdateUplink from topic receiveMessage.
INFO: Consumed event UpdateDownlink from topic sendMessage.
INFO: Consumed event RequestFiring from topic sendMessage.
INFO: Record ConfirmUplink sent successfully on topic sendMessage.
INFO: Consumed event ConfirmUplink from topic sendMessage.
INFO: Consumed event ConfirmDownlink from topic sendMessage.
INFO: Consumed event ConfirmFiring from topic sendMessage.
INFO: Fired Confirm Transition of channel: sendMessage in Storage.
INFO: Record UpdateDownlink sent successfully on topic receiveMessage.
INFO: Consumed event UpdateDownlink from topic sendMessage.
INFO: Consumed event UpdateDownlink from topic receiveMessage.
INFO: Consumed event UpdateDownlink from topic sendMessage.
INFO: Consumed event RequestFiring from topic sendMessage.
INFO: Record ConfirmUplink sent successfully on topic sendMessage.
INFO: Consumed event ConfirmDownlink from topic sendMessage.
INFO: Consumed event RequestFiring from topic receiveMessage.
INFO: Record ConfirmDownlink sent successfully on topic receiveMessage.
INFO: Fired Request Transition of channel: receiveMessage in Storage.
INFO: Record UpdateDownlink sent successfully on topic receiveMessage.
INFO: Consumed event ConfirmUplink from topic receiveMessage.
INFO: Consumed event ConfirmUplink from topic sendMessage.
INFO: Consumed event UpdateDownlink from topic sendMessage.
INFO: Consumed event ConfirmFiring from topic sendMessage.
...</preformat>
      <sec id="sec-7-1">
        <title>7.2. Execution</title>
        <p>Once the setup has been completed, the experiment can be initiated. The simulation commences with
the deployment of the four system components. To verify correct execution, the logs of each Pod are
inspected to ensure that the simulation is actively running.</p>
        <p>Following confirmation of execution, a brief waiting period is introduced. This period is sufficiently
long to allow the simulation to make measurable progress, yet not so extensive that it reaches completion
during this interval.</p>
        <p>Subsequently, a fail-stop fault is emulated by deliberately terminating one of the simulator Pods.
Upon automatic restart, the process resumes, and the simulation continues until it reaches completion.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.3. Observations</title>
        <preformat>k8user@artpc17:~$ kubectl get pods -n dptn
NAME                        READY   STATUS    RESTARTS   AGE
consumer-statefulset-0      1/1     Running   0          60s
producer-statefulset-0      1/1     Running   0          60s
storage-statefulset-0       1/1     Running   0          60s
syncservice-statefulset-0   1/1     Running   0          60s
k8user@artpc17:~$ kubectl delete pod storage-statefulset-0 -n dptn
pod "storage-statefulset-0" deleted
k8user@artpc17:~$ kubectl get pods -n dptn
NAME                        READY   STATUS    RESTARTS   AGE
consumer-statefulset-0      1/1     Running   0          115s
producer-statefulset-0      1/1     Running   0          115s
storage-statefulset-0       1/1     Running   0          21s
syncservice-statefulset-0   1/1     Running   0          115s</preformat>
        <preformat>ScriptCommand: Try to load file startscript_storage.txt
Opening gui...
Passing args to gui...
Initialising CheckpointStorageServiceImpl with checkpoint Storage-232.rst
Starting Simulation...
...
Simulation initialised.
INFO: Record RegisterDownlink sent successfully on topic RegisterTopic.
INFO: Subscribed to topic: receiveMessage.
INFO: Subscribed to topic: sendMessage.
INFO: Consumed event ConfirmDownlink from topic receiveMessage.
INFO: Consumed event ConfirmFiring from topic receiveMessage.
INFO: Fired Confirm Transition of channel: receiveMessage in Storage.
INFO: Consumed event UpdateDownlink from topic receiveMessage.
INFO: Consumed event UpdateUplink from topic receiveMessage.
INFO: Consumed event UpdateDownlink from topic receiveMessage.
INFO: Consumed event RequestFiring from topic receiveMessage.
INFO: Record ConfirmDownlink sent successfully on topic receiveMessage.
INFO: Fired Request Transition of channel: receiveMessage in Storage.
INFO: Record UpdateDownlink sent successfully on topic receiveMessage.
...</preformat>
        <p>
          Upon initiating the simulation, the logs of all simulator instances consistently indicate that the
simulation is actively running. As illustrated in Figure 4, Kafka events are successfully transmitted and
received, confirming the correct operation of the simulation. In this context, the UpdateUplink and
UpdateDownlink events represent communication with the synchronization service regarding updates
to the up- and downlinks of the registered distributed synchronous channels. The RequestFiring
event is employed to coordinate a distributed firing across all relevant simulators. Upon successful
execution, the ConfirmUplink, ConfirmDownlink, and ConfirmFiring events confirm the resulting
state changes. Detailed specifications of these event types are provided by Clasen et al. [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>Following the deletion of a simulator Pod, Kubernetes automatically initiates a replacement Pod to
restore the simulation topology. As depicted in Figure 5, one simulator Pod exhibits a delayed startup
relative to the others, corresponding to the previously deleted instance.</p>
        <p>Subsequently, once the newly instantiated Pod begins to receive Kafka events, Figure 6 demonstrates
that it resumes participation in distributed transitions, including up- and downlink operations. Moreover,
an examination of the logs from the remaining simulator Pods verifies that interaction with the restarted
Pod is functioning as expected.</p>
        <p>This experiment can be replicated with either the Consumer or the Producer Pod, yielding analogous
results. These observations collectively substantiate the resilience of our simulator architecture.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Discussion</title>
      <p>A key advantage of the proposed concept lies in the inherent resilience of the simulators. This resilience
ensures that the failure of individual simulators during runtime does not compromise the integrity or
continuity of the overall simulation process.</p>
      <p>Furthermore, the system enables the execution of resilient distributed simulations over extended
durations—potentially spanning several weeks or even months. This capability markedly enhances the
system’s usability, as it alleviates user concerns regarding potential simulator failures during long-term
simulations.</p>
      <p>A notable drawback of the proposed system concerns the complexity of the underlying execution
infrastructure. Deployment necessitates the orchestration of multiple physical nodes within
containerized environments such as Kubernetes. These infrastructural requirements impose substantial demands
on the system’s architectural and operational design.</p>
      <p>Another disadvantage, albeit with a very low probability of occurrence, is the potential for deadlocks
arising from the interaction between the DPTNResiliency plugin and the internal Renew simulation
thread pool. Specifically, to utilize Renew’s existing functionality for saving simulation states, the
simulation must be paused. However, there is no guarantee that all queued events will be processed
before the pause takes effect. Given the highly asynchronous architecture of Renew, a rapid succession
of pause and resume operations—especially in conjunction with checkpoint creation—can, under rare
circumstances, cause the internal thread pool to enter a deadlock state. If such a deadlock occurs, it halts
the simulator and remains undetectable. This issue is currently the subject of ongoing investigation.</p>
      <p>One limitation of the current system design is the non-resilient nature of the synchronization service.
Ensuring system stability under failure conditions necessitates the development and integration of
supplementary recovery mechanisms for this component.</p>
      <p>The checkpointing-based recovery strategy, while critical for fault tolerance, introduces additional
computational overhead. This overhead negatively impacts the performance of individual simulators
and the overall simulation, resulting in increased execution times. In particular, delays may occur in
the processing of events transmitted via Kafka, further contributing to reduced simulation efficiency.</p>
      <p>A further limitation stems from the current checkpointing policy, which creates a checkpoint for
every Kafka event consumed. This leads to the generation of a substantial number of checkpoints
during the simulation. Future work will focus on optimizing checkpoint frequency to minimize this
overhead while maintaining resilience.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Related Work</title>
      <p>
        Research conducted by Moldt et al. [
        <xref ref-type="bibr" rid="ref62">62</xref>
        ] and Röwekamp et al. [
        <xref ref-type="bibr" rid="ref18 ref63 ref64 ref65 ref66 ref67 ref68">63, 64, 65, 66, 67, 18, 68</xref>
        ] focuses on
distributed Reference Net simulations, with a particular emphasis on platform management. Their
work incorporates Mulan agent concepts [
        <xref ref-type="bibr" rid="ref69">69</xref>
        ] and employs Spring Boot to enable initial experimental
implementations. While these contributions establish essential foundations, they do not fully exploit the
potential of distributed systems, thereby limiting their applicability to complex, real-world application
scenarios.
      </p>
      <p>
        In contrast, the studies presented in [
        <xref ref-type="bibr" rid="ref70 ref71 ref72 ref73">70, 71, 72, 73</xref>
        ] address the distributed simulation of timed Petri
nets, employing various strategies to ensure accurate simulation of time-dependent behaviors. This
work differentiates itself from these approaches by focusing on a P/T net class that does not incorporate
a notion of time.
      </p>
      <p>
        A related yet distinct approach is proposed in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which addresses resilient simulation by replicating
entire simulators to tolerate failures. The primary distinction from the concept presented in this paper
lies in the recovery mechanism: rather than replication, checkpointing is employed as the means of
fault tolerance.
      </p>
      <p>
        This contribution builds on the article by Clasen et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] and extends the simulators developed
there with resilience through a checkpoint-based recovery technique. Clasen et al. [
        <xref ref-type="bibr" rid="ref74">74</xref>
        ], in contrast, focus on scalability rather than resilience: the idea there is that the
number of simulators can be adapted dynamically.
      </p>
    </sec>
    <sec>
      <title>10. Conclusion</title>
      <p>The following conclusion begins with a concise summary (Section 10.1) of the main findings and
contributions of this work. It then outlines potential directions for future research and development
(Section 10.2).</p>
      <sec>
        <title>10.1. Summary</title>
        <p>After introducing the foundational concepts (Renew, distributed P/T nets, failure semantics, resilience
strategies, and Kubernetes; Section 2), this work delineates the central research problem (Section 3): the
design of a resilient simulation framework for P/T nets within a Kubernetes-based cloud environment.
The distributed system described in Section 4 comprises a communication medium, highly available
storage, and simulation components for distributed P/T nets.</p>
        <p>Subsequently, we present the developed prototypes. In Section 5, Detecting Failures of
Simulators, we examine strategies for reliably identifying faults within simulation components. Building
upon these insights, Section 6, Recovering Failures of Simulators, introduces the novel Renew plugin
DPTNResiliency, which facilitates the automated recovery of failed simulation instances based on
checkpointing.</p>
        <p>To evaluate the proposed system, we employ the well-established Producer-Storage-Consumer
scenario (Section 7). During simulation, a simulator instance is intentionally terminated. Recovery is
demonstrated as a new instance seamlessly resumes execution using state checkpoints generated by its
predecessor. Finally, Section 8 offers a critical assessment of the approach's strengths, limitations, and
potential trade-offs, followed by a contextualization within the scope of related research (Section 9).</p>
      </sec>
      <sec>
        <title>10.2. Future Work</title>
        <p>The immediate next steps involve systematically resolving existing workarounds and eliminating current
race conditions. This effort is expected to significantly reduce the overhead associated with the current
approach. Furthermore, we intend to eliminate the need for checkpoint creation on every consumed
event, thereby further enhancing performance.</p>
        <p>Concurrently, a more advanced and fully automated testing infrastructure is required, one that can
realistically emulate a broad range of failure scenarios, including those studied in the field of chaos
engineering. Such a framework is essential to bolster the reliability and credibility of the proposed
solution.</p>
        <p>The synchronization service must also be extended to ensure fault tolerance not only at the individual
simulator level but across the entire simulation system. Using the aforementioned improved testing
infrastructure, the proposed concept should be rigorously validated against real-world models to
substantiate its empirical robustness.</p>
        <p>Additional refinements are conceivable concerning the health metrics reported by Renew. Increasing
the granularity and precision of these metrics could facilitate more nuanced operational responses and
may enable the early detection of system-level anomalies such as deadlocks.</p>
        <p>A further avenue for research pertains to the scalability of the simulation framework. To date, a
fixed number of simulators has been employed, resulting in uneven load distribution during
simulations. Exploring dynamic scaling strategies could mitigate this inefficiency and unlock new levels of
performance.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <sec id="sec-10-1">
        <title>During the preparation of this work, the authors used . . .</title>
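        <list list-type="bullet">
          <list-item><p>. . . Bing Translate in order to: Translate text.</p></list-item>
          <list-item><p>. . . DeepL in order to: Translate text.</p></list-item>
          <list-item><p>. . . ChatGPT in order to: Rephrasing.</p></list-item>
          <list-item><p>. . . Grammarly in order to: Grammar and spelling check, rephrasing.</p></list-item>
        </list>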
        <p>After using these tool(s)/service(s), the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>R.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <article-title>Parallel and distributed simulation</article-title>
          ,
          <source>in: Proceedings of the 2015 Winter Simulation Conference</source>
          , Huntington Beach, CA, USA, December 6-9, 2015, IEEE/ACM,
          <year>2015</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>59</lpage>
          . URL: https://doi.org/10.1109/WSC.2015.7408152. doi:10.1109/WSC.2015.7408152.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <article-title>Research challenges in parallel and distributed simulation</article-title>
          ,
          <source>ACM Transactions on Modeling and Computer Simulation (TOMACS) 26</source>
          (
          <year>2016</year>
          )
          <fpage>1</fpage>
          -
          <lpage>29</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Fujimoto</surname>
          </string-name>
          ,
          <article-title>Development of the parallel and distributed simulation field</article-title>
          ,
          <source>Simul.</source>
          <volume>100</volume>
          (
          <year>2024</year>
          )
          <fpage>1197</fpage>
          -
          <lpage>1223</lpage>
          . URL: https://doi.org/10.1177/00375497241261407. doi:10.1177/00375497241261407.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Avizienis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Laprie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Randell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Landwehr</surname>
          </string-name>
          ,
          <article-title>Basic concepts and taxonomy of dependable and secure computing</article-title>
          ,
          <source>IEEE transactions on dependable and secure computing 1</source>
          (
          <year>2004</year>
          )
          <fpage>11</fpage>
          -
          <lpage>33</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E. N.</given-names>
            <surname>Elnozahy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Alvisi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-M.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. B.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>A survey of rollback-recovery protocols in message-passing systems</article-title>
          ,
          <source>ACM Computing Surveys (CSUR) 34</source>
          (
          <year>2002</year>
          )
          <fpage>375</fpage>
          -
          <lpage>408</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>D'Angelo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Marzolla</surname>
          </string-name>
          ,
          <article-title>Fault tolerant adaptive parallel and distributed simulation through functional replication</article-title>
          ,
          <source>Simulation Modelling Practice and Theory</source>
          <volume>93</volume>
          (
          <year>2019</year>
          )
          <fpage>192</fpage>
          -
          <lpage>207</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Budde</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kautz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kuhlenkamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Züllighoven</surname>
          </string-name>
          ,
          <article-title>What is prototyping?</article-title>
          ,
          <source>Information Technology &amp; People</source>
          <volume>6</volume>
          (
          <year>1990</year>
          )
          <fpage>89</fpage>
          -
          <lpage>95</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Pomberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Pree</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stritzinger</surname>
          </string-name>
          ,
          <article-title>Methoden und Werkzeuge für das Prototyping und ihre Integration</article-title>
          ,
          <source>Inform. Forsch. Entwickl.</source>
          <volume>7</volume>
          (
          <year>1992</year>
          )
          <fpage>49</fpage>
          -
          <lpage>61</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wilde</surname>
          </string-name>
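          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hess</surname>
          </string-name>
          ,
          <article-title>Forschungsmethoden der Wirtschaftsinformatik</article-title>
          ,
          <source>Wirtschaftsinformatik</source>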
          <volume>4</volume>
          (
          <year>2007</year>
          )
          <fpage>280</fpage>
          -
          <lpage>287</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>L.</given-names>
            <surname>Clasen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bartelt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Stahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <article-title>Distributed P/T Net Simulation Prototypes Based on Event Streaming</article-title>
          , in: M. Köhler-Bußmeier, D. Moldt, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2024 co-located with the 45th International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2024)</source>
          , June 24 - 25, 2024, Geneva, Switzerland, volume
          <volume>3730</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2024</year>
          , pp.
          <fpage>192</fpage>
          -
          <lpage>216</lpage>
          . URL: https://ceur-ws.org/Vol-3730.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kummer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wienberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Duvigneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cabac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haustermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Mosteller</surname>
          </string-name>
          ,
          <source>Renew - the Reference Net Workshop</source>
          ,
          <year>2023</year>
          . URL: http://www.renew.de/, release 4.1.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>Oracle</string-name>
          ,
          <source>Java Documentation</source>
          ,
          <year>2025</year>
          . URL: https://docs.oracle.com/en/java/, accessed: 2025-04-25.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>Gradle</string-name>
          ,
          <source>Gradle Documentation</source>
          ,
          <year>2025</year>
          . URL: https://docs.gradle.org/, accessed: 2025-04-25.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M.</given-names>
            <surname>Duvigneau</surname>
          </string-name>
          ,
          <article-title>Konzeptionelle Modellierung von Plugin-Systemen mit Petrinetzen</article-title>
          , volume
          <volume>4</volume>
          <source>of Agent Technology - Theory and Applications</source>
          , Logos Verlag, Berlin,
          <year>2010</year>
          . URL: http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=2561&amp;lng=eng&amp;id=.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>L.</given-names>
            <surname>Clasen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Willrodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Voß</surname>
          </string-name>
          ,
          <article-title>Enhancement of Renew to Version 4.0 using JPMS</article-title>
          , in: M. Köhler-Bußmeier, D. Moldt, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2022 co-located with the 43rd International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2022)</source>
          , Bergen, Norway, June 20th, 2022, volume
          <volume>3170</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>165</fpage>
          -
          <lpage>176</lpage>
          . URL: https://ceur-ws.org/Vol-3170.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Johnsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Streckenbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Clasen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haustermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Heinze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hansson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ihlenfeldt</surname>
          </string-name>
          ,
          <article-title>RENEW: Modularized Architecture and New Features</article-title>
          , in: L. Gomes, R. Lorenz (Eds.),
          <source>Application and Theory of Petri Nets and Concurrency - 44th International Conference, PETRI NETS 2023, Lisbon, Portugal, June 25-30, 2023, Proceedings</source>
          , volume
          <volume>13929</volume>
          of Lecture Notes in Computer Science, Springer Nature Switzerland AG, Cham, Switzerland,
          <year>2023</year>
          , pp.
          <fpage>217</fpage>
          -
          <lpage>228</lpage>
          . URL: https://doi.org/10.1007/978-3-031-33620-1_12.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>O.</given-names>
            <surname>Kummer</surname>
          </string-name>
          , Referenznetze, Logos Verlag, Berlin,
          <year>2002</year>
          . URL: http://www.logos-verlag.de/cgi-bin/engbuchmid?isbn=0035&amp;lng=eng&amp;id=.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Taube</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mohr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <article-title>Cloud Native Simulation of Reference Nets</article-title>
          , in: M. Köhler-Bußmeier, E. Kindler, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2021 co-located with the 42nd International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2021)</source>
          , Paris, France, June 25th, 2021 (due to COVID-19: virtual conference), volume
          <volume>2907</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>104</lpage>
          . URL: http://ceur-ws.org/Vol-2907.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Johnson</surname>
          </string-name>
          , J. Hoeller,
          <string-name>
            <given-names>K.</given-names>
            <surname>Donald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Sampaleanu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Harrop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Risberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Arendsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kopylenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pollack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Templier</surname>
          </string-name>
          , E. Vervaet,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Colyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Leau</surname>
          </string-name>
          , M. Fisher, S. Brannen,
          <string-name>
            <given-names>R.</given-names>
            <surname>Laddad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poutsma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Beams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Abedrabbo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Clement</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Syer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Gierke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Stoyanchev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Webb</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Winch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Clozel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nicoll</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deleuze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bryant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Paluch</surname>
          </string-name>
          ,
          <source>Spring Framework Reference Documentation</source>
          ,
          <year>2025</year>
          . Version 6.2.6. URL: https://docs.spring.io/spring-framework/reference/index.html, accessed: 2025-04-25.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>L.</given-names>
            <surname>Voß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Willrodt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Haustermann</surname>
          </string-name>
          ,
          <article-title>Between Expressiveness and Verifiability: P/T-nets with Synchronous Channels and Modular Structure</article-title>
          , in: M. Köhler-Bußmeier, D. Moldt, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2022 co-located with the 43rd International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2022)</source>
          , Bergen, Norway, June 20th, 2022, volume
          <volume>3170</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>40</fpage>
          -
          <lpage>59</lpage>
          . URL: https://ceur-ws.org/Vol-3170.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>P.</given-names>
            <surname>Fettke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Reisig</surname>
          </string-name>
          ,
          <article-title>Once and for all: how to compose modules - The composition calculus</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2408.15031. arXiv:2408.15031.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kreps</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Narkhede</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rao</surname>
          </string-name>
          , et al.,
          <article-title>Kafka: A distributed messaging system for log processing</article-title>
          ,
          <source>in: NetDB 2011: 6th Workshop on Networking meets Databases</source>
          , volume
          <volume>11</volume>
          , Athens, Greece,
          <year>2011</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <collab>Apache Software Foundation</collab>
          , Apache Kafka Documentation,
          <year>2025</year>
          . URL: https://kafka.apache.org/documentation/, accessed: 2025-01-21.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>N.</given-names>
            <surname>Garg</surname>
          </string-name>
          , Apache Kafka, Packt Publishing Birmingham, UK,
          <year>2013</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McDirmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Bergan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bodík</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Musuvathi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Failure Recovery: When the Cure Is Worse Than the Disease</article-title>
          , in: HotOS, USENIX,
          <year>2013</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          . URL: https://www.microsoft.com/en-us/research/publication/failure-recovery-when-the-cure-is-worse-than-the-disease/.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>D.</given-names>
            <surname>Calvanese</surname>
          </string-name>
          , Types of program errors,
          <year>2006</year>
          . URL: https://www.inf.unibz.it/~calvanese/teaching/06-07-ip/lecture-notes/uni10/node2.html.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M.</given-names>
            <surname>Ghanavati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrzejak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Seboek</surname>
          </string-name>
          ,
          <article-title>Memory and resource leak defects in java projects: an empirical study</article-title>
          ,
          in:
          <source>Proceedings of the 40th International Conference on Software Engineering: Companion Proceedings</source>
          , ICSE '18, Association for Computing Machinery, New York, NY, USA,
          <year>2018</year>
          , pp.
          <fpage>410</fpage>
          -
          <lpage>411</lpage>
          . URL: https://doi.org/10.1145/3183440.3195032. doi:10.1145/3183440.3195032.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>D.</given-names>
            <surname>Giebas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Wojszczyk</surname>
          </string-name>
          ,
          <article-title>Deadlocks Detection in Multithreaded Applications Based on Source Code Analysis</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>10</volume>
          (
          <year>2020</year>
          ). URL: https://www.mdpi.com/2076-3417/10/2/532. doi:10.3390/app10020532.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>B.</given-names>
            <surname>Schroeder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Gibson</surname>
          </string-name>
          ,
          <article-title>A Large-Scale Study of Failures in High-Performance Computing Systems</article-title>
          ,
          <source>IEEE Transactions on Dependable and Secure Computing</source>
          <volume>7</volume>
          (
          <year>2010</year>
          )
          <fpage>337</fpage>
          -
          <lpage>350</lpage>
          . doi:10.1109/TDSC.2009.4.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <collab>hello2morrow GmbH</collab>
          ,
          <article-title>SonarGraph - Static Analysis and Architecture Validation Tool</article-title>
          , https://www.hello2morrow.com/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>B.</given-names>
            <surname>Pugh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hovemeyer</surname>
          </string-name>
          , FindBugs - Static Bug Detector for Java, https://findbugs.sourceforge.net/,
          <year>2015</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Seward</surname>
          </string-name>
          ,
          <collab>Valgrind Developers</collab>
          , Valgrind - Debugging and Profiling Tools, https://valgrind.org/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <collab>Free Software Foundation</collab>
          ,
          <article-title>GDB: The GNU Project Debugger</article-title>
          , https://sourceware.org/gdb/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <collab>Google LLC</collab>
          , AddressSanitizer - A Fast Memory Error Detector, https://github.com/google/sanitizers/wiki/AddressSanitizer,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <collab>Prometheus Authors</collab>
          , Prometheus - Monitoring System &amp; Time Series Database, https://prometheus.io/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>W.</given-names>
            <surname>Hasselbring</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>van Hoorn</surname>
          </string-name>
          ,
          <article-title>Kieker: A monitoring framework for software engineering research</article-title>
          ,
          <source>Software Impacts</source>
          <volume>5</volume>
          (
          <year>2020</year>
          ) 100019. URL: https://www.sciencedirect.com/science/article/pii/S2665963820300063. doi:https://doi.org/10.1016/j.simpa.2020.100019.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ledmi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bendjenna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Hemam</surname>
          </string-name>
          ,
          <article-title>Fault Tolerance in Distributed Systems: A Survey</article-title>
          ,
          <source>in: 2018 3rd International Conference on Pattern Analysis and Intelligent Systems (PAIS)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . doi:10.1109/PAIS.2018.8598484.
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Malhotra</surname>
          </string-name>
          ,
          <article-title>Study of Various Proactive Fault Tolerance Techniques in Cloud Computing</article-title>
          ,
          <source>International Journal of Computer Sciences and Engineering</source>
          <volume>06</volume>
          (
          <year>2018</year>
          )
          <fpage>81</fpage>
          -
          <lpage>87</lpage>
          . doi:10.26438/ijcse/v6si3.8187.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>W.</given-names>
            <surname>Bland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bouteiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Herault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Bosilca</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dongarra</surname>
          </string-name>
          ,
          <article-title>Post-failure recovery of MPI communication capability: Design and rationale</article-title>
          ,
          <source>The International Journal of High Performance Computing Applications</source>
          <volume>27</volume>
          (
          <year>2013</year>
          )
          <fpage>244</fpage>
          -
          <lpage>254</lpage>
          . URL: https://doi.org/10.1177/1094342013488238. doi:10.1177/1094342013488238.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dagur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Yadav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranvijay</surname>
          </string-name>
          ,
          <article-title>Fault Tolerance in Real Time Distributed System</article-title>
          ,
          <source>International Journal on Computer Science and Engineering</source>
          <volume>3</volume>
          (
          <year>2011</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Laprie</surname>
          </string-name>
          , et al.,
          <article-title>From dependability to resilience</article-title>
          ,
          <source>in: 38th IEEE/IFIP Int. Conf. On dependable systems and networks</source>
          ,
          <year>2008</year>
          , pp.
          <fpage>G8</fpage>
          -
          <lpage>G9</lpage>
          . URL: https://2008.dsn.org/fastabs/dsn08fastabs_laprie.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>S.</given-names>
            <surname>Pradhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Levendovszky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. S.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. A.</given-names>
            <surname>Emfinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Balasubramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Otte</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Karsai</surname>
          </string-name>
          ,
          <article-title>Achieving resilience in distributed software systems via self-reconfiguration</article-title>
          ,
          <source>Journal of Systems and Software</source>
          <volume>122</volume>
          (
          <year>2016</year>
          )
          <fpage>344</fpage>
          -
          <lpage>363</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S0164121216300590. doi:https://doi.org/10.1016/j.jss.2016.05.038.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <collab>The Kubernetes Authors</collab>
          , Kubernetes,
          <year>2025</year>
          . URL: https://kubernetes.io/docs.
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Potdar</surname>
          </string-name>
          , N. D G,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kengond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Mulla</surname>
          </string-name>
          ,
          <article-title>Performance Evaluation of Docker Container and Virtual Machine</article-title>
          ,
          <source>Procedia Computer Science</source>
          <volume>171</volume>
          (
          <year>2020</year>
          )
          <fpage>1419</fpage>
          -
          <lpage>1428</lpage>
          . URL: https://www.sciencedirect.com/science/article/pii/S1877050920311315. doi:https://doi.org/10.1016/j.procs.2020.04.152, Third International Conference on Computing and Network Communications (CoCoNet'19).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <collab>Docker Inc.</collab>
          , Docker - Empowering App Development for Developers, https://www.docker.com/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <collab>Red Hat, Inc.</collab>
          ,
          <article-title>Podman - A Daemonless Container Engine for Developers</article-title>
          , https://docs.podman.io/en/latest/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <collab>Open Container Initiative</collab>
          , OCI - Open Container Initiative Specifications, https://opencontainers.org/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pace</surname>
          </string-name>
          ,
          <article-title>Zero Trust Networks with Istio</article-title>
          , Master's thesis, Politecnico di Torino,
          <year>2021</year>
          . URL: https://webthesis.biblio.polito.it/21170/.
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <collab>Open Container Initiative</collab>
          ,
          <source>OCI Runtime Spec</source>
          ,
          <year>2025</year>
          . URL: https://github.com/opencontainers/runtime-spec/blob/main/runtime.md.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>C.</given-names>
            <surname>Albuquerque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Relvas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. F.</given-names>
            <surname>Correia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <article-title>Proactive monitoring design patterns for cloud-native applications</article-title>
          , in:
          <source>Proceedings of the 27th European Conference on Pattern Languages of Programs</source>
          , EuroPLop '22, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>13</lpage>
          . URL: https://doi.org/10.1145/3551902.3551961. doi:10.1145/3551902.3551961.
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Weil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Brandt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. D. E.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Maltzahn</surname>
          </string-name>
          ,
          <article-title>Ceph: A Scalable, High-Performance Distributed File System</article-title>
          , in:
          <source>7th USENIX Symposium on Operating Systems Design and Implementation (OSDI 06)</source>
          , USENIX Association, Seattle, WA,
          <year>2006</year>
          , pp.
          <fpage>307</fpage>
          -
          <lpage>320</lpage>
          . URL: https://www.usenix.org/conference/osdi-06/ceph-scalable-high-performance-distributed-file-system.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <collab>Rook Authors</collab>
          , Rook Documentation,
          <year>2025</year>
          . URL: https://rook.io/docs/rook/latest-release, accessed: 2025-04-25.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>L.</given-names>
            <surname>Mercl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlik</surname>
          </string-name>
          ,
          <article-title>Public Cloud Kubernetes Storage Performance Analysis</article-title>
          , in: N. T. Nguyen, R. Chbeir, E. Exposito, P. Aniorté, B. Trawiński (Eds.),
          <source>Computational Collective Intelligence</source>
          , Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>649</fpage>
          -
          <lpage>660</lpage>
          . URL: https://doi.org/10.1007/978-3-030-28374-2_56. doi:10.1007/978-3-030-28374-2_56.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <collab>Cloud Native Computing Foundation</collab>
          , CNCF - Cloud Native Computing Foundation, https://www.cncf.io/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <collab>The Linux Foundation</collab>
          , The Linux Foundation - Supporting Open Source Innovation, https://www.linuxfoundation.org/,
          <year>2024</year>
          . Accessed on June 1, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <collab>The Linux Foundation</collab>
          , Cloud Native Landscape,
          <year>2025</year>
          . URL: https://landscape.cncf.io/.
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <collab>Ceph Authors</collab>
          , Ceph Documentation: Reef Release,
          <year>2025</year>
          . URL: https://docs.ceph.com/en/reef/, accessed: 2025-04-25.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Breitbart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Garcia-Molina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Silberschatz</surname>
          </string-name>
          ,
          <article-title>Overview of multidatabase transaction management</article-title>
          , in:
          <source>CASCON First Decade High Impact Papers</source>
          , CASCON '10, IBM Corp., USA,
          <year>2010</year>
          , pp.
          <fpage>93</fpage>
          -
          <lpage>126</lpage>
          . URL: https://doi.org/10.1145/1925805.1925811.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>T.</given-names>
            <surname>Haerder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Reuter</surname>
          </string-name>
          ,
          <article-title>Principles of transaction-oriented database recovery</article-title>
          ,
          <source>ACM Computing Surveys (CSUR)</source>
          <volume>15</volume>
          (
          <year>1983</year>
          )
          <fpage>287</fpage>
          -
          <lpage>317</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>A.</given-names>
            <surname>Kemper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Eickler</surname>
          </string-name>
          ,
          <source>Datenbanksysteme - Eine Einführung</source>
          , 8th ed., Oldenbourg Verlag,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>E. A.</given-names>
            <surname>Brewer</surname>
          </string-name>
          ,
          <article-title>Towards robust distributed systems</article-title>
          , in: PODC, volume
          <volume>7</volume>
          , Portland, OR,
          <year>2000</year>
          , pp.
          <fpage>343</fpage>
          -
          <lpage>477</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <article-title>A Simple Prototype of Distributed Execution of Reference Nets Based on Virtual Machines</article-title>
          , in: R. Bergenthum, E. Kindler (Eds.),
          <source>Algorithms and Tools for Petri Nets, Proceedings of the Workshop AWPN 2017</source>
          , Kgs. Lyngby, Denmark, October 19-20, 2017, DTU Compute Technical Report 2017-06,
          <year>2017</year>
          , pp.
          <fpage>51</fpage>
          -
          <lpage>57</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <article-title>Investigation of Containerizing Distributed Petri Net Simulations</article-title>
          , in: D. Moldt, E. Kindler, H. Rölke (Eds.),
          <source>Petri Nets and Software Engineering. International Workshop, PNSE'18, Bratislava, Slovakia, June 25-26, 2018. Proceedings</source>
          , volume
          <volume>2138</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2018</year>
          , pp.
          <fpage>133</fpage>
          -
          <lpage>142</lpage>
          . URL: http://ceur-ws.org/Vol-2138/.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <article-title>Investigating the Java Spring Framework to Simulate Reference Nets with Renew</article-title>
          , in: R.
          <string-name>
            <surname>Lorenz</surname>
          </string-name>
          , J. Metzger (Eds.),
          <source>Algorithms and Tools for Petri Nets</source>
          , number 2018-02 in Reports / Technische Berichte der Fakultät für Angewandte Informatik der Universität Augsburg,
          <year>2018</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>46</lpage>
          . URL: https://opus.bibliothek.uni-augsburg.de/opus4/41861.
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <article-title>RenewKube: Reference Net Simulation Scaling with Renew and Kubernetes</article-title>
          , in: S. Donatelli, S. Haar (Eds.),
          <source>Application and Theory of Petri Nets and Concurrency - 40th International Conference, PETRI NETS 2019, Aachen, Germany, June 23-28, 2019, Proceedings</source>
          , volume
          <volume>11522</volume>
          of Lecture Notes in Computer Science, Springer,
          <year>2019</year>
          , pp.
          <fpage>69</fpage>
          -
          <lpage>79</lpage>
          . URL: https://doi.org/10.1007/978-3-030-21571-2_4.
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Feldmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simon</surname>
          </string-name>
          ,
          <article-title>Simulating Place/Transition Nets by a Distributed, Web Based, Stateless Service</article-title>
          , in: D. Moldt, E. Kindler, M. Wimmer (Eds.),
          <source>Petri Nets and Software Engineering. International Workshop, PNSE'19, Aachen, Germany, June 24, 2019. Proceedings</source>
          , volume
          <volume>2424</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>
          , pp.
          <fpage>163</fpage>
          -
          <lpage>164</lpage>
          . URL: http://CEUR-WS.org/Vol-2424.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Buchholz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moldt</surname>
          </string-name>
          ,
          <article-title>Petri Net Sagas</article-title>
          , in: M. Köhler-Bußmeier, E. Kindler, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2021 co-located with the 42nd International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2021), Paris, France, June 25th, 2021 (due to COVID-19: virtual conference)</source>
          , volume
          <volume>2907</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2021</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>84</lpage>
          . URL: http://ceur-ws.org/Vol-2907.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>J. H.</given-names>
            <surname>Röwekamp</surname>
          </string-name>
          ,
          <article-title>Skalierung von nebenläufigen und verteilten Simulationssystemen für interagierende Agenten</article-title>
          ,
          <source>Ph.D. thesis</source>
          , University of Hamburg, Department of Informatics, Vogt-Kölln Str. 30, D-22527 Hamburg,
          <year>2023</year>
          . URL: https://ediss.sub.uni-hamburg.de/handle/ediss/10040.
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          [69]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rölke</surname>
          </string-name>
          ,
          <source>Modellierung von Agenten und Multiagentensystemen - Grundlagen und Anwendungen</source>
          , volume
          <volume>2</volume>
          of Agent Technology - Theory and Applications, Logos Verlag, Berlin,
          <year>2004</year>
          . URL: http://logos-verlag.de/cgi-bin/engbuchmid?isbn=0768&amp;lng=eng&amp;id=.
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Thomas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zahorjan</surname>
          </string-name>
          ,
          <article-title>Parallel simulation of performance Petri nets: Extending the domain of parallel simulation</article-title>
          ,
          <source>Technical Report, Institute of Electrical and Electronics Engineers (IEEE)</source>
          ,
          <year>1991</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          [71]
          <string-name>
            <given-names>H. H.</given-names>
            <surname>Ammar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <article-title>Time warp simulation of stochastic Petri nets</article-title>
          , in:
          <source>Proceedings of the Fourth International Workshop on Petri Nets and Performance Models PNPM91</source>
          , IEEE,
          <year>1991</year>
          , pp.
          <fpage>186</fpage>
          -
          <lpage>195</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref72">
        <mixed-citation>
          [72]
          <string-name>
            <given-names>G.</given-names>
            <surname>Chiola</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferscha</surname>
          </string-name>
          ,
          <article-title>Distributed simulation of Petri nets</article-title>
          ,
          <source>IEEE Parallel and Distributed Technology</source>
          <volume>1</volume>
          (
          <year>1993</year>
          )
          <fpage>33</fpage>
          -
          <lpage>50</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref73">
        <mixed-citation>
          [73]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferscha</surname>
          </string-name>
          ,
          <article-title>Adaptive time warp simulation of timed Petri nets</article-title>
          ,
          <source>IEEE Transactions on Software Engineering</source>
          <volume>25</volume>
          (
          <year>1999</year>
          )
          <fpage>237</fpage>
          -
          <lpage>257</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref74">
        <mixed-citation>
          [74]
          <string-name>
            <given-names>L.</given-names>
            <surname>Clasen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Nayci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Nacyi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Middendorf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Mack</surname>
          </string-name>
          ,
          <article-title>Investigations Towards Dynamic Scaling of Distributed P/T Nets</article-title>
          , in: M. Köhler-Bußmeier, D. Moldt, H. Rölke (Eds.),
          <source>Proceedings of the International Workshop on Petri Nets and Software Engineering 2025 co-located with the 46th International Conference on Application and Theory of Petri Nets and Concurrency (PETRI NETS 2025), June 22-27, 2025, Paris, France</source>
          , CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>