<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Enhancing process understanding through multimodal data analysis and extended reality</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandar Gavric</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Business Informatics, TU Wien</institution>
          ,
          <addr-line>Favoritenstrasse 9-11/194-3, 1040 Vienna</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <issue>1</issue>
      <abstract>
        <p>The significance of process mining lies in its ability to enable organizations to gain insights and improve efficiency by analyzing vast amounts of data generated in IT-supported processes. Yet process mining often falls short when it comes to understanding manual processes, as it primarily captures the what but not the how of such activities. We see great potential in the growing number of videos capturing various processes, which can serve as a rich source of data for process understanding and analysis, enabling more comprehensive insight into manual workflows. However, extracting actionable insights from these videos poses significant challenges due to their unstructured nature. This doctoral thesis presents an approach that combines multimodal data analysis and extended reality (XR) techniques to enhance business process understanding. By integrating visual, textual, and audio information from videos, the proposed solution enables comprehensive analysis of processes, facilitating process mining, monitoring, and guidance. To address the complexities arising from an amalgamation of rich data sources, we propose three primary research objectives: (1) evaluating methods for mining relevant process information from these multimodal data sources, particularly video; (2) exploring the integration of XR technologies with enriched event logs to foster an immersive, interactive data visualization experience and accurate domain-specific modeling; and (3) determining the influence of integrating these technologies on the interpretability of process mining results. Ultimately, our study paves the way for a specialized, comprehensive approach to process mining, harnessing the power of XR for enriched event log analysis. To demonstrate the effectiveness of the proposed approach, the thesis elaborates applications in various domains, including manufacturing, logistics, healthcare, and beyond.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal data analysis</kwd>
        <kwd>Extended reality</kwd>
        <kwd>Process understanding</kwd>
        <kwd>Process monitoring</kwd>
        <kwd>Process guidance</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In an era where data is both ubiquitous and diverse, we see significant promise in
the proliferation of rich data sources documenting diverse processes, which can serve as
valuable data for comprehending and dissecting processes in depth, affording a broader
comprehension of manual workflows. One such type of data source is video, which is rich in
content but also inherently complex and high-dimensional. The goal of process discovery is to take an
event log containing example behaviors and create a process model that adequately describes
the underlying process [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Despite the omnipresence of event log data, most organizations
diagnose problems based on fiction rather than facts [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and there are limitations in what
process mining practitioners tend to use actively [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. Existing techniques for process understanding have
traditionally relied on manual observation, interpretation, and note-taking. These techniques
are not only time-consuming but also prone to human error and inconsistency. The growth of
the process mining field and the challenges faced by analysts, as identified through
comprehensive interviews and surveys [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ], underscore the need for enhanced support and
research at multiple levels, including individual, technical, and organizational, to fully leverage
process mining’s potential in competitive business environments.
      </p>
      <p>
        The rapid advancements in multimodal data analysis and extended reality technologies present
promising opportunities for the automatic extraction of valuable process-related information.
Various studies show that the interdisciplinary combination of conceptual modeling and artificial
intelligence yields mutual benefits [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. However, despite these advancements, the integration
of multimodal data, particularly videos, into process analysis and guidance remains
underexplored and lacks effective approaches.
      </p>
      <p>
        Past work has shown that adding a further modality to process mining can yield positive
results. Some studies showed that in fields such as the humanities, social sciences, and medicine,
where workers follow processes and log their execution manually in textual form,
process discovery can achieve very satisfactory results, with 88% correctly discovered
activities [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, other studies showed that there is a considerable research gap concerning the
semantic aspects of process model text labels and natural language descriptions [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which
can be enhanced by attaching more modalities as descriptors, provided that those modalities are
mineable. Even at the scale of a meta-model for representing all aspects of the digital enterprise,
adding the notion of constraints and modalities can give modelers the option to add more
precision to their models where needed [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ]. Alongside, for example, an enterprise’s operational capabilities, its
modeling capabilities will become an increasingly important foundational capability,
and the challenge is to further improve these modeling capabilities by means of
tools, modeling languages, and associated processes while balancing the return on modeling
effort [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ].
      </p>
      <p>There is an imperative need to develop robust, scalable, and versatile frameworks that can
seamlessly analyze multimodal data and translate them into comprehensive process
understanding and guidance. This doctoral dissertation revolves around a strategy that connects the
analysis of multimodal data with extended reality (XR) methods to elevate the comprehension
of business processes.</p>
      <p>The remainder of this paper is structured as follows. In Section 2, we discuss problem
formulation. Following that, in Section 3, we explore the landscape of related work. Section 4
is dedicated to an examination of the planned research strategy, encompassing our proposed
solution design and the pertinent techniques involved. Moving forward to Section 5, we carefully
dissect the evaluation plans of our solution design and identify the key audience and beneficiaries
of this thesis. In addition, we highlight and categorize the contributions of this thesis into three
organized topics. Given these intended contributions, Section 6 discusses the actual research
plan. Finally, in Section 7, we draw our paper to a conclusion.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Problem formulation</title>
      <p>
        The age of digital transformation has brought complex processes to prominence in
various industries. Nevertheless, gaining a profound understanding of these processes, especially
from multimodal data sources like videos, remains an unresolved challenge. In
many organizations, documentation of process knowledge is scattered across various process
information sources, which introduces considerable problems [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ]; beyond this, we introduce the new
concept of information fragmentation across various modalities. To the best of our knowledge,
no prior research has addressed this specific topic. Effective data analysis, visualization, and
interpretation can bridge the comprehension gap, ensuring processes are not just efficient but
also understandable.
      </p>
      <p>To achieve this, we can capture the digital footprint of activities and transactions, allowing
businesses to streamline their processes, identify bottlenecks, and enhance efficiency. However,
it is important to recognize that not all aspects of real-world processes are adequately captured
in any representation of the business process. To underscore the limitations of current process
mining methods, let us consider a familiar scenario: the repair process for a mobile phone. Fig. 1
represents a simplified business process for phone repair, specifically focusing on repairing
a phone with a broken camera. The figure uses a flowchart format to map out
the sequence of activities involved in the repair process, with the manual tasks highlighted in
grey. This visualization aids in comprehending the steps and stages essential for conducting
phone repair within a business setting, providing both a high-level overview and a detailed
understanding of the repair procedure. Such visual representations are valuable tools for
enhancing process efficiency, quality control, and workforce training within businesses that
offer phone repair services.</p>
      <p>[Figure 1: Simplified business process for repairing a phone with a broken camera. Flowchart elements: Customer agreed to the repair; Perform diagnostics; Repair; Test and confirm proper functioning; gateway “Phone is repaired?” (No / Yes); Contact the customer with final cost and readiness; Customer received repaired phone.]</p>
      <p>Valuable representations of a process can be built from event logs such as the one presented in
Table 1. In this example, the event log serves as a detailed record of activities, their timestamps,
sources of information, actors involved, objects in use, duration, and the respective environments
in which these activities took place.</p>
      <p>One aspect highlighted within these event logs is the source of information for each event.
Most information sources are manual, indicating that a human actor manually
entered data into the system to record these events. Another noteworthy element in this table
is the presence of unknown timestamps, denoted by question marks, which occur when there is
no interaction with a digital system to record a specific event. Consequently, this lack of digital
tracking makes it challenging to determine the exact duration of certain activities, resulting in
estimates provided as a range of minimal to maximal values.</p>
      <table-wrap>
        <label>Table 1</label>
        <table>
          <tbody>
            <tr><th>Timestamp</th><td /><td>[00:28-00:33]</td><td>[00:33-?]</td><td>[?-00:59]</td><td>[00:59-01:04]</td></tr>
            <tr><th>Activity</th><td /><td>Camera Diagnosis</td><td>Replacement Part Order</td><td>Camera Replacement</td><td>Quality Check</td></tr>
            <tr><th>Source</th><td /><td>Machine log</td><td>Manual</td><td>Manual</td><td>Manual</td></tr>
            <tr><th>Actor</th><td /><td>Technician A</td><td>Administrator</td><td>Technician A</td><td>Quality Inspector</td></tr>
            <tr><th>Objects</th><td>Phone, Screwdriver</td><td>Phone, Diagnostic Tool</td><td>None</td><td>Phone, Camera Module, Screwdriver</td><td>Phone</td></tr>
            <tr><th>Duration</th><td>10 minutes</td><td>5 minutes</td><td>? minutes</td><td>? minutes</td><td>5 minutes</td></tr>
            <tr><th>Environment</th><td>Workshop</td><td>Workshop</td><td>Office</td><td>Workshop</td><td>Workshop</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-2-18">
        <p>Additionally, the table reveals that
tracking the objects in use during each event is a somewhat vague aspect of this process. The
list of objects is not exhaustive, and it is unclear whether some objects can be used in parallel.
This ambiguity can complicate resource allocation and potentially affect the efficiency of the
repair process.</p>
        <p>Integrating multimodal data sources into event logs can offer innovative solutions to address
the challenges of enriching event logs. By supplementing manual data entry with diverse data
streams such as video recordings from multiple perspectives, sensor data like depth sensing
maps, and audio recordings capturing specific sound events, businesses can significantly enhance
the accuracy, completeness, and richness of their event logs, as illustrated in Fig. 2.</p>
        <p>Multiple camera angles can capture different aspects of the repair procedure, allowing for
precise and real-time documentation of technician actions and the condition of the device. Video
data also offers the advantage of providing a visual timeline, eliminating the need for uncertain
timestamps. With this visual evidence, the duration of each repair step can be accurately
determined, enabling more precise process analysis. Moreover, video data can be used to verify
the usage of tools and objects, making it easier to track resources used in parallel and optimize
resource allocation. Sensor data, particularly unstructured data, offers a wealth of information
that can complement traditional event logs. These data streams can provide insights into the
intricate details of a repair, such as the depth and dimensions of components, the accuracy of
alignments, and the quality of connections. Audio recordings of specific sound events can
add another layer of context to event logs. For instance, the detection of particular sounds, such
as clicks, whirrs, or engine sounds during the repair process, can serve as additional markers for
events and their timing. These audio cues can be correlated with the corresponding visual and
sensor data, offering a holistic understanding of the repair workflow. Furthermore, audio data
can assist in identifying potential issues or anomalies during the repair, facilitating real-time
intervention and quality assurance.</p>
        <p>Table 2 shows a sample event log obtained by adding multimodal
sources of information to the system. The exact, measurable duration of specific activities
is a crucial factor in evaluating manual processes, as it often indicates efficiency, identifies
areas for improvement, and guides resource allocation. Shorter durations suggest streamlined
and optimized processes, while longer durations may signal inefficiencies. Analyzing activity
durations is essential for benchmarking, compliance, quality control, and ensuring a positive
customer experience. However, it is vital to consider other factors such as accuracy, quality, and
process effectiveness in conjunction with timing to comprehensively evaluate manual processes,
as not all processes prioritize speed above all else.</p>
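        <p>The effect of video-derived timestamps on duration estimates can be illustrated with a minimal Python sketch. It is not part of the proposed solution: the Event structure, the duration_bounds and enrich helpers, and the video intervals are hypothetical. Timestamps are assumed to be minutes since the start of the recording, loosely following the phone repair example, and events are assumed to be listed in execution order without overlap.</p>
        <preformat><![CDATA[
```python
from dataclasses import dataclass, replace
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Event:
    activity: str
    start: Optional[float]  # minutes since recording start; None if not logged
    end: Optional[float]
    source: str


def duration_bounds(events: List[Event], i: int) -> Tuple[float, Optional[float]]:
    """Return (min, max) duration of events[i] in minutes.

    Fully timestamped events yield an exact value. Otherwise the duration is
    bounded by the nearest known timestamps around the event (assumption:
    events are in execution order and do not overlap).
    """
    ev = events[i]
    if ev.start is not None and ev.end is not None:
        exact = ev.end - ev.start
        return (exact, exact)
    before = [ev.start] + [t for e in reversed(events[:i]) for t in (e.end, e.start)]
    after = [ev.end] + [t for e in events[i + 1:] for t in (e.start, e.end)]
    lower = next((t for t in before if t is not None), None)
    upper = next((t for t in after if t is not None), None)
    if lower is None or upper is None:
        return (0.0, None)  # no upper bound can be derived
    return (0.0, upper - lower)


def enrich(events: List[Event], video: Dict[str, Tuple[float, float]]) -> List[Event]:
    """Fill missing timestamps from video-derived activity intervals."""
    enriched = []
    for ev in events:
        if ev.activity in video and (ev.start is None or ev.end is None):
            start, end = video[ev.activity]
            ev = replace(
                ev,
                start=ev.start if ev.start is not None else start,
                end=ev.end if ev.end is not None else end,
                source=ev.source + ", Video",
            )
        enriched.append(ev)
    return enriched


manual_log = [
    Event("Camera Diagnosis", 28.0, 33.0, "Machine log"),
    Event("Replacement Part Order", 33.0, None, "Manual"),
    Event("Camera Replacement", None, 59.0, "Manual"),
    Event("Quality Check", 59.0, 64.0, "Manual"),
]
# Hypothetical intervals that a video analysis step might report.
video_intervals = {
    "Replacement Part Order": (33.0, 40.0),
    "Camera Replacement": (40.5, 59.0),
}

print(duration_bounds(manual_log, 1))    # range derived from neighbouring events
enriched_log = enrich(manual_log, video_intervals)
print(duration_bounds(enriched_log, 1))  # exact duration after enrichment
```
]]></preformat>
        <p>With the manual log alone, the duration of the part order can only be bounded by the surrounding known timestamps; once the hypothetical video intervals are merged in, the same call returns a single exact value.</p>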
        <p>To fully harness the potential of incorporating business logic on top of enriched event logs,
there is a pressing need for a specialized approach that can deal with the unique demands of working
with video and volumetric data in conjunction with temporal and spatial information. Such an
approach may involve applications of extended reality (XR) for immersive data manipulation
and precise domain-specific modeling. Incorporating XR into the analysis of enriched event logs
may not only enhance the understanding of complex processes but also offer a more intuitive
and interactive approach to data analysis. To this end, three research questions emerge:</p>
        <p>[RQ1] How can we effectively mine and extract process mining-relevant information from
multimodal data sources, particularly from video data?</p>
        <p>[RQ2] How can extended reality technologies be effectively integrated with enriched event
logs to facilitate immersive data manipulation and visualization for precise modeling and
domain-knowledge annotation?</p>
        <p>[RQ3] How does the integration of these technologies influence the interpretability of process
mining results?</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Related work</title>
      <table-wrap>
        <label>Table 2</label>
        <table>
          <tbody>
            <tr><th>Timestamp</th><td /><td>[00:40-00:52]</td><td>[00:52-1:11]</td></tr>
            <tr><th>Activity</th><td /><td>Module Replacement</td><td>Quality Check</td></tr>
            <tr><th>Source</th><td /><td>Video</td><td>Video, Manual</td></tr>
            <tr><th>Actor</th><td>Technician A</td><td>Technician A</td><td>Quality Inspector</td></tr>
            <tr><th>Objects</th><td>Phone, Cleaner, Textile</td><td>Phone, Camera Module, Screwdriver</td><td>Phone</td></tr>
            <tr><th>Duration</th><td>7.1 minutes</td><td>12.3 minutes</td><td>19.2 minutes</td></tr>
            <tr><th>Environment</th><td>Work-desk1</td><td>Work-desk2</td><td>Workshop</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <sec id="sec-3-14">
        <p>In this section, we examine related work. A broader study into related work is still ongoing, but
two areas that provide promising inputs are: Immersive Process Manipulations and Video and
Sensory Analysis.</p>
        <sec id="sec-3-14-1">
          <title>3.1. Immersive process manipulations</title>
          <p>
            There are studies on how modeling tools visualize the models, and how modelers interact with
the models [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ]. Immersive visualizations have gained significant attention as a powerful tool
for exploring and understanding complex data. Researchers have explored the application of
immersive technologies, such as virtual reality (VR) and augmented reality (AR), to visualize
and interact with process logs.
          </p>
          <p>
            Several scholarly papers have investigated the utility of immersive visualizations as
effective tools for immersive analytics [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ]. Additionally, specific research has critically reviewed
immersive environments, particularly in the context of process planning [
            <xref ref-type="bibr" rid="ref16">16</xref>
            ], where authors
proposed a set of design guidelines for the development of VR-based Computer-Aided Process
Planning. Their findings indicate that immersive VR technologies have the potential to enhance
various process planning scenarios, including decision-making, real-time response support,
verification, training, and the automatic generation of process plans. However, the
challenges that these technologies still need to address are data interoperability, incorporation of
organizational aspects, and technological operational accuracy. Our approach aims to address
these challenges by mining processes from richer sources of information and
semantics, specifically from unstructured and high-dimensional data.
          </p>
          <p>
            A general review of immersive design approaches [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ]
reveals limitations in design reviews and untapped potential in utilizing
VR during the design process. The identified state-of-the-art practices involve the
creation of design sketches and the activation of functions within VR using a personal data
assistant. Our approach implements a solution that facilitates model manipulation within
an immersive world with the help of an AI-driven personal assistant.
          </p>
          <p>
            Oberhauser et al. previously published VR-based tools for visualizing, navigating, and
interacting with various modeling notations such as ArchiMate [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ], Business Process Modeling
Notation (BPMN) [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ], Process Mining results [
            <xref ref-type="bibr" rid="ref20">20</xref>
            ], and fly-throughs of program code structures [
            <xref ref-type="bibr" rid="ref21">21</xref>
            ].
Zenner et al. contributed to the field with a tool for process model exploration [
            <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
            ], in which
they introduce a concept that spatializes event-driven process chains (EPCs) by mapping
traditional 2D graphs to a 3D virtual environment. While the tools presented by these authors
represent the first public implementations of room-scale floating platforms, allowing users to
explore various models through natural walking, none of them explores the potential of 1)
semi-automated process mining or 2) the use of audio-visual multimodal sensory data as
sources of process logs and interactive entities. Additionally, these tools are available neither
commercially nor as open source.
          </p>
          <p>
            A survey [
            <xref ref-type="bibr" rid="ref22">22</xref>
            ] makes an important contribution to our proposal, as it demonstrates that
using a VR interface as an alternative to traditional paper or desktop monitor representations
does not result in a significant decrease in model understanding performance.
          </p>
          <p>
            To preserve process knowledge, models can be encoded in various ways that make
them suitable for applying machine learning algorithms [
            <xref ref-type="bibr" rid="ref24">24</xref>
            ]. Studies show
that event knowledge graphs are a very versatile tool that opens the door to process mining
analyses in multiple behavioral dimensions at once [
            <xref ref-type="bibr" rid="ref25">25</xref>
            ], and the production of platforms for
transforming conceptual models into knowledge graphs is emerging [
            <xref ref-type="bibr" rid="ref26">26</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-14-2">
          <title>3.2. Video and sensory analysis</title>
          <p>Understanding the semantics and meaning conveyed in instructional videos is a fundamental
challenge in computer vision and natural language processing. This research stream focuses on
developing approaches to automatically analyze and interpret instructional videos to extract
high-level semantic information. Researchers have explored diferent methods to tackle this
problem, ranging from video segmentation and action recognition to language understanding
and multimodal fusion.</p>
          <p>
            Some studies, such as [
            <xref ref-type="bibr" rid="ref27">27</xref>
            ], propose an approach for generating graphical representations
of instructional videos that does not require any annotations. They use cross-modal attention to
exploit agreement between multiple modalities and learn a unified graphical structure
that represents videos as a joint embedding of visual, audio, and textual signals, the latter obtained from automatic
speech recognition. To learn complex activities in videos and reduce the computational complexity
of global, dataset-level representations of sub-actions, the researchers move the pipeline down
to the local, video level. They performed a rigorous evaluation of the generated graphs with a user
study as well as graphical and qualitative analysis. In response to the prompt “Tell me what happened”,
other authors present a framework [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ] that enables video prediction, video rewind, and video
infilling, all at inference time. They evaluated their approach in various video scenarios such as
animation and gaming. When it comes to action recognition, novel approaches such as [
            <xref ref-type="bibr" rid="ref29">29</xref>
            ] do not
use object-level graphs or scene graphs to represent the dynamics of objects and the relationships
between them, but instead model relationship transitions directly. Their solution recognises
attribute transitions of objects, which opens up potential for reasoning methods that
are aware of relationships between objects.
          </p>
          <p>
            In contrast to our approach, none of the approaches we found focuses on aligning recognised
semantics with business logic, quantifying processes into business-relevant metrics,
or, most importantly, efficiently incorporating a human in the loop to represent
domain-expert knowledge. We identify as a critical step in the pipeline from raw data to valuable
representations of semantics a modeling method that matches the complexity of the data being
analysed. In our approach, we aim to move modeling into an immersive world where
complex raw data such as 360° video data can be properly visualized. The approach from [
            <xref ref-type="bibr" rid="ref28">28</xref>
            ]
relies on text guidance, which in our approach is to be provided by an engaged modeler
who is fully immersed in the mining process. With newly available datasets for panoptic scene
graph generation such as [
            <xref ref-type="bibr" rid="ref30">30</xref>
            ], we want to take a step toward incorporating business logic
into the relations between entities in the scene. For such a challenging task, we want to involve
modelers by giving them a high-quality, abstract-level, highly specific process mining tool
that is semi-automated through AI-trained guidance. Furthermore, we propose an entirely new
methodology for multiple conceptualizations of spatio-temporal event logs, which opens up
various valuable applications, such as re-mining logged data when the insights considered
relevant change after mining. We plan to explore the use of automatic techniques on multimodal event logs
to identify activity correspondences that represent similar behaviors [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] or
to annotate process models with concepts such as a taxonomy [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-14-3">
          <title>3.3. Process mining based on audio-visual sources</title>
          <p>
            The topic of process mining from video sources yields very few results in the literature. In this regard, [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ]
provides a reference architecture for process mining from video data. Their solution,
ViProMiRA, is a supervised learning-based, case-driven, context-specific tool for extracting
event logs from unstructured video data. With a prototypical implementation, they showed that
ViProMiRA was capable of automatically extracting more than 70% of the process-relevant events
from a real-world synthetic video dataset. They explicitly state as a limitation that
ViProMiRA itself is not directly transferable to practical use, as it is a prototypical instantiation
serving as a blueprint. The authors also point out that ViProMiRA was evaluated
on video data that does not represent a real-world process and is limited in duration.
Notably, ViProMiRA focuses exclusively on video data and does not consider other modalities
such as audio, sensor data, or text information. Furthermore, [
            <xref ref-type="bibr" rid="ref34">34</xref>
            ] discusses an approach that
applies process mining to video surveillance data of pigpens. The authors highlight that the
process analytics pipeline from raw video data to a discovered process model has not yet been
fully implemented, and they see further use cases for their approach in medicine and materials
science. They recorded process-specific videos of 4 pigpens with cameras installed at
different angles. For knowledge mining, they tailored techniques to their use case by creating a
heat map of common pig positions that is used for pig activity recognition and tracking. Their
process model discovery is enhanced with domain-specific knowledge. Finally, in the context
of process mining from videos, a study [
            <xref ref-type="bibr" rid="ref35">35</xref>
            ] analyses the consistency
between the process model and a predefined Petri net model in order to perform conformance checking
on process models extracted from videos. The authors pre-process the video data by removing
background information irrelevant to the moving target in the video frame and keeping only
the interest point area. They then perform classification for action placement and recognition.
          </p>
          <p>
            Our approach fills the gap in labeling continuous video and other multimodal sensory
data by using immersive technologies and by incorporating more extensive, more detailed,
process-related spatio-temporal semantics. With our approach, a modeler can interact in a novel and
more effective way with significantly more valuable data sources. In contrast to the solutions we found,
our approach will consider multiple conceptualizations of detected entities, using novel
techniques such as [
            <xref ref-type="bibr" rid="ref27 ref29">27, 29</xref>
            ] for entity and action segmentation on video and other spatially and/or
temporally related data. Modern techniques have the potential to automate the understanding of
highly unstructured data in a semantically richer way, and we aim to apply those techniques
in an immersive VR-based setting where a modeler involved in process mining can use automatic
entity detections as a toolkit to specify business logic very efficiently. To the best of our
knowledge, no publicly available literature demonstrates a similar application, which suggests
that our approach would contribute: 1) an innovative, flexible, robust, scalable,
multi-person modeling tool for real-world data, 2) a larger set of semantically rich activities that can
be recognised, and 3) process mining from valuable sources of multimodal data, including 360°
video data.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Research strategy</title>
      <sec id="sec-4-1">
        <title>4.1. Background</title>
        <p>The underlying principle of event log creation from videos involves the extraction and analysis
of meaningful semantic information from low-level event logs. While low-level logs capture
detailed data such as object detection, sound recognition, and speech detection, the higher-level
analysis focuses on incorporating the underlying semantics into these logs. This process involves
identifying and categorizing events based on their semantic context, such as recognizing specific
events composed of actions, identifying objects or people in different contexts, and understanding
their interactions. By incorporating semantics into the low-level event logs, the resulting
higher-level event logs provide a more comprehensive and abstract representation of the video content,
enabling advanced applications like activity recognition, video summarization, and semantic
search.</p>
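        <p>The lifting step described above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: the rule table, the fixed five-second window, and all labels are assumptions introduced here for the example.</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LowLevelEvent:
    timestamp: float   # seconds from the start of the recording
    modality: str      # e.g. "object", "sound", "speech"
    label: str         # detector output, e.g. "screwdriver"


# Hypothetical abstraction rules: a high-level activity is emitted when all
# of its required low-level labels co-occur inside one time window.
RULES = {
    "remove_screen": {"phone", "screwdriver", "screwing_sound"},
    "clean_camera": {"camera_module", "cloth"},
}


def lift_events(events, window=5.0, rules=RULES):
    """Group low-level events into fixed windows and apply the rules."""
    buckets = {}
    for ev in events:
        buckets.setdefault(int(ev.timestamp // window), set()).add(ev.label)
    high_level = []
    for index in sorted(buckets):
        for activity, required in sorted(rules.items()):
            if required.issubset(buckets[index]):
                # The window start time stands in for the event timestamp.
                high_level.append((index * window, activity))
    return high_level


low_level_log = [
    LowLevelEvent(0.5, "object", "phone"),
    LowLevelEvent(1.2, "object", "screwdriver"),
    LowLevelEvent(2.0, "sound", "screwing_sound"),
    LowLevelEvent(6.1, "object", "camera_module"),
    LowLevelEvent(7.4, "object", "cloth"),
]
high_level_log = lift_events(low_level_log)
print(high_level_log)
```

        <p>Fixed windows keep the sketch short; a realistic pipeline would more likely use overlapping or event-driven segments and learned, rather than hand-written, abstraction rules.</p>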
        <p>
          Scene logging involves creating a knowledge graph of entities present in videos and their
spatio-temporal changes within the video content. By constructing a knowledge graph, entities
can be represented as nodes and their relationships and interactions can be modeled as edges
connecting these nodes with additional data which may include the date, time, location, duration,
and participants of a given interaction. To capture the relationships between entities, methods
such as [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] can be applied to identify and recognize relations between two given objects in
videos based on a collected dataset. By analyzing the changes in spatial proximity, relative
positions, and trajectories of objects, meaningful relationships can be inferred.
Ontology modeling can organize objects, people, actions, and events into a hierarchical
structure. For example, in a scene, different types of vehicles can be categorized into subclasses
such as cars, trucks, or motorcycles. Maintaining this hierarchical organization is crucial
for efficient object identification, as it allows for a more detailed understanding of the scene
and facilitates higher-level reasoning. Hierarchical object recognition can be achieved through
hierarchical classifiers or by employing deep learning architectures that incorporate hierarchical
features. These approaches enable the system to identify objects at different levels of abstraction,
from general object categories to fine-grained subclasses.
        </p>
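        <p>As a rough illustration of scene logging and ontology modeling, the sketch below derives "near" edges for a small knowledge graph from object trajectories and looks up superclasses in a class hierarchy. The tracks, the distance threshold, and the ontology entries are hypothetical, and the two tracks are assumed to be sampled at the same time steps.</p>

```python
import math

# Hypothetical per-frame tracks: entity -> list of (time_s, x, y) positions.
# Both tracks are assumed to share the same sampling times.
TRACKS = {
    "actor1": [(0, 0.0, 0.0), (1, 1.0, 0.0), (2, 1.9, 0.0)],
    "tool_a": [(0, 2.0, 0.0), (1, 2.0, 0.0), (2, 2.0, 0.0)],
}

# Hypothetical ontology: child class -> parent class.
ONTOLOGY = {"car": "vehicle", "truck": "vehicle", "vehicle": "object",
            "tool_a": "tool", "tool": "object"}


def ancestors(cls, ontology=ONTOLOGY):
    """Return the chain of superclasses from most to least specific."""
    chain = []
    while cls in ontology:
        cls = ontology[cls]
        chain.append(cls)
    return chain


def near_edges(tracks, a, b, threshold=0.5):
    """Emit one 'near' edge per time step where a and b are within threshold."""
    edges = []
    for (t, ax, ay), (_, bx, by) in zip(tracks[a], tracks[b]):
        distance = math.hypot(ax - bx, ay - by)
        if threshold >= distance:
            edges.append({"source": a, "target": b, "relation": "near",
                          "time": t, "distance": round(distance, 2)})
    return edges


graph = {"nodes": sorted(TRACKS), "edges": near_edges(TRACKS, "actor1", "tool_a")}
print(graph["edges"])
print(ancestors("truck"))
```

        <p>Each edge carries its own timestamp, so the same pair of nodes can be linked by different relations at different moments, which is what makes the graph spatio-temporal rather than static.</p>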
        <p>Combining data from different modalities. To gain a comprehensive understanding of the
processes, data from different modalities, such as visual, audio, or sensor data, can be combined.
Integration techniques and data fusion algorithms can be employed to merge information
from various sources, enriching the analysis and providing a more holistic view of the tracked
processes.</p>
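        <p>One simple form of such fusion is a timestamp-based merge of per-modality streams. The sketch below is only an assumption about how this could look: the stream contents and the tolerance value are invented, and every stream is assumed to be sorted by time already.</p>

```python
import heapq

# Hypothetical per-modality streams, each already sorted by timestamp (seconds).
video = [(0.5, "video", "phone detected"), (4.0, "video", "screwing detected")]
audio = [(3.8, "audio", "drill sound")]
machine_log = [(1.0, "log", "ticket opened"), (9.0, "log", "ticket closed")]


def fuse(*streams):
    """Merge sorted streams into one timeline ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda record: record[0]))


def co_occurring(timeline, tolerance=0.3):
    """Pair consecutive records from different modalities that are close in time."""
    pairs = []
    for first, second in zip(timeline, timeline[1:]):
        close = tolerance >= second[0] - first[0]
        if close and first[1] != second[1]:
            pairs.append((first[2], second[2]))
    return pairs


timeline = fuse(video, audio, machine_log)
print([record[2] for record in timeline])
print(co_occurring(timeline))
```

        <p>The cross-modal pairs are candidates for mutual confirmation, for example a drill sound that supports a visually detected screwing action.</p>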
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Solution architecture design</title>
        <p>The solution architecture depicted in Fig. 3 offers a comprehensive, layered approach to
integrating and analyzing multimodal data sources in the context of process mining. This design
embodies a holistic strategy, enabling the conversion of raw data inputs into enriched conceptual
models. The architecture is intended to be applicable across a multitude of fields. The solution
should have an impact across Process Mining, where the model aids in
deciphering complex processes; Process Monitoring/Tracking, where real-time insights are
extracted; and Process Guidance, where the model provides direction and recommendations
that can be valuable both for training purposes and for real-time on-field guidance.</p>
        <p>Source data acquisition. The foundation of our architecture is the Source Data
Acquisition phase. This phase accumulates a wide array of input sources, ranging from videos
captured from diverse vantage points to audio recordings, in-depth sensor data maps, and
detailed machine logs that track all interactions and database entries with the associated software.
Given the heterogeneity of these data sources, a unified architecture is needed
to ensure that all these disparate data streams are captured, processed, and made ready for
subsequent layers of analysis.</p>
        <p>Automated contextualization (observable entity recognition). The Automated Observable
Entity Recognition phase, also termed Automated Contextualization, is predominantly
AI-driven and offers capabilities such as object recognition (initially unnamed), identification of
people (who are not yet recognized as system actors), motion tracking that detects sequences of
postures of individuals and objects, classifiers for predicting the categories of various resources
or activities, and background isolation to distinguish between diverse environments and settings.
In this phase, AI transforms the raw data into recognizable
entities and contexts, paving the way for deeper analysis and integration.</p>
        <p>Immersive contextualization. The Immersive Contextualization phase is manual and calls
upon domain-knowledge experts to embed business logic into the evolving model. This
immersive experience is facilitated through cutting-edge augmented reality glasses or virtual reality
headsets, enabling experts to navigate through 360° videos. At this juncture, the Entity Naming
process comes to the fore, serving as a verification mechanism for the automated classifications
performed earlier. This is complemented by Artifact Manipulations, which provide the flexibility
to either group entities into more abstract categories or further specify them. For instance, an
environment can be identified as a derivative of another setting, contingent on the availability
of specific tools. Relationship Annotation further enriches the model, charting out a graph
of pertinent links between artifacts and activities. The culmination of this phase is the Event
Triggers module, where intricate business logic rules and advanced connections between the
recognized entities and activities are established.</p>
        <p>[Figure 3: Solution architecture with four layers: Source Data Acquisition (video, audio, depth sensing, machine logs, free-form entries); Automated Contextualization (recognized objects, actions, motion sequences, and environments; resource classification with confidence scores); Immersive Contextualization by a domain-knowledge modeler in extended reality (entity annotation, artifact manipulation, relationship annotation, event logic triggers); and the Enhanced Conceptual Model.]</p>
        <p>Enhanced conceptual model. After traversing through the intricate layers of the architecture,
what emerges is the Enhanced Conceptual Model. This model is not only aware of the enriched
multimodal event log but is also primed for diverse applications. It encapsulates the depth and
breadth of the information from the varied data sources and the insights garnered through
automated and manual contextualization.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Source data acquisition</title>
        <p>The foundation of our architecture is the Source Data Acquisition phase. This phase
accumulates a wide array of input sources, ranging from videos captured from diverse vantage
points to audio recordings, in-depth sensor data maps, and detailed machine logs that track all
interactions and database entries with the associated software. Given the heterogeneity of these
data sources, a unified architecture is needed to ensure that all these disparate data
streams are captured, processed, and made ready for subsequent layers of analysis.</p>
      </sec>
      <sec>
        <title>4.4. Automated contextualization (observable entity recognition)</title>
        <p>Following the data collection phase, the architecture moves to the Automated Observable
Entity Recognition phase, also termed Automated Contextualization. This phase is
predominantly AI-driven and offers capabilities such as object recognition (initially unnamed),
identification of people (who are not yet recognized as system actors), motion tracking that
detects sequences of postures of individuals and objects, classifiers for predicting the categories
of various resources or activities, and background isolation to distinguish between diverse
environments and settings. In this phase, artificial intelligence transforms the raw
data into recognizable entities and contexts, paving the way for deeper
analysis and integration.</p>
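        <p>Because a classifier may propose several labels for the same entity, the output of this phase can be thought of as a set of scored candidates. The following sketch, under stated assumptions, keeps every candidate above a threshold and queues the rest for manual review in the immersive phase. The labels and scores are illustrative values in the spirit of Fig. 3, and the 0.60 acceptance threshold is an assumption made for this example.</p>

```python
# Illustrative classifier output: entity -> list of (label, confidence).
CANDIDATES = {
    "Object 1": [("Phone", 0.68), ("Electronic device", 0.77)],
    "Object 2": [("Tool", 0.98)],
    "Object 3": [("Box", 0.84), ("Toy", 0.34)],
    "Action 1": [("Screwing", 0.53)],
    "Environment 1": [("Office", 0.40)],
}


def triage(candidates, accept=0.60):
    """Split entities into auto-accepted labels and a manual review queue."""
    accepted, review = {}, []
    for entity, labels in candidates.items():
        kept = sorted(label for label, score in labels if score >= accept)
        if kept:
            accepted[entity] = kept      # may hold several conceptualizations
        else:
            review.append(entity)        # left for the domain expert in XR
    return accepted, review


accepted, review = triage(CANDIDATES)
print(accepted)
print(review)
```

        <p>Keeping both "Phone" and "Electronic device" for the same object preserves multiple conceptualizations, so the modeler can later choose the level of abstraction that fits the process.</p>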
      </sec>
      <sec id="sec-4-4">
        <title>4.5. Immersive contextualization</title>
        <p>The subsequent tier, Immersive Contextualization, is manual and calls upon domain-knowledge
experts to embed business logic into the evolving model. This immersive experience is facilitated
through cutting-edge augmented reality glasses or virtual reality headsets, enabling experts to
navigate through 360-degree videos. At this juncture, the Entity Naming process comes to the
fore, serving as a verification mechanism for the automated classifications performed earlier.
This is complemented by Artifact Manipulations, which provide the flexibility to either group
entities into more abstract categories or further specify them. For instance, an environment can
be identified as a derivative of another setting, contingent on the availability of specific tools.
Relationship Annotation further enriches the model, charting out a graph of pertinent links
between artifacts and activities. The culmination of this phase is the Event Triggers module,
where intricate business logic rules and advanced connections between the recognized entities
and activities are established.</p>
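        <p>Event triggers of this kind can be pictured as rules attached to annotated activities. The sketch below is a hypothetical illustration only: the TriggerEngine class, the handler names, and the event fields are invented for this example and do not describe the tool under development.</p>

```python
from collections import defaultdict


class TriggerEngine:
    """Tiny rule registry: handlers fire when a matching event arrives."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self.reports = []

    def on(self, activity, handler):
        self._handlers[activity].append(handler)

    def feed(self, event):
        for handler in self._handlers[event["activity"]]:
            handler(event, self)


def report_tool_use(event, engine):
    # Rule: if the event involves Tool A, report who used it.
    if "Tool A" in event.get("objects", []):
        engine.reports.append(f"{event['actor']} used Tool A")


def save_duration(event, engine):
    # Rule: on this event, record how long the activity took.
    engine.reports.append(f"{event['activity']} took {event['end'] - event['start']} s")


engine = TriggerEngine()
engine.on("removingScreen", report_tool_use)
engine.on("removingScreen", save_duration)
engine.feed({"activity": "removingScreen", "actor": "Actor1",
             "objects": ["Tool A", "P1"], "start": 10, "end": 42})
engine.feed({"activity": "cleaning", "actor": "Actor1", "start": 42, "end": 50})
print(engine.reports)
```

        <p>Events without registered rules, such as the second one here, pass through without side effects, so experts can add rules incrementally while annotating.</p>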
      </sec>
      <sec id="sec-4-5">
        <title>4.6. Enhanced conceptual model</title>
        <p>After traversing through the intricate layers of the architecture, what emerges is the Enhanced
Conceptual Model. This model is not only aware of the enriched multimodal event log but is also
primed for diverse applications. It encapsulates the depth and breadth of the information from the
varied data sources and the insights garnered through automated and manual contextualization.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation discussion</title>
      <sec>
        <title>5.1. Identification of the target audience and beneficiaries of the thesis</title>
        <p>This thesis primarily targets industries that experience gaps or “blind spots” in their event
logs due to manual or physical processes. Industries like manufacturing, logistics,
warehousing, agriculture, and construction, which rely heavily on manual labor and physical
operations, often lack comprehensive electronic tracking of every nuanced event. The proposed
approach for extracting information from multimodal data sources, especially video data,
offers these sectors a way to digitize and analyze manual processes that
were previously opaque. Extended reality technologies can provide these industries with
immersive training and process understanding tools, making the shift from manual oversight to
digital monitoring smoother and more intuitive. Additionally, with the inclusion of BPMN in
multimodal data analysis, businesses can gain contextually richer insights, aiding in process
optimization. Operational managers in these sectors can not only interpret the newly available
process mining results but also predict future process bottlenecks and constraints. Furthermore,
organizations aiming to create a standardized blueprint of their manual processes can leverage
the business process libraries proposed by this research. Overall, this thesis aims to enable a
transformational shift for industries traditionally limited by manual process constraints.</p>
      </sec>
      <sec id="sec-5-1">
        <title>5.2. Validation techniques</title>
        <p>To ensure the robustness and applicability of the approach proposed for the planned thesis, it
is imperative to employ rigorous validation techniques. One such technique is prototyping.
Given the novel amalgamation of video, audio, and sensor data with XR technologies, creating
a prototype offers a tangible representation of our concepts and allows stakeholders to
interact with and refine the proposed system. By offering an early visualization of the system’s
functionalities and interfaces, prototyping facilitates immediate feedback, highlights potential
pitfalls, and offers avenues for iterative refinement. Moreover, a prototype can simulate the data
integration process, giving researchers a glimpse into the challenges and solutions of working
with disparate data types in a unified XR environment.</p>
        <p>Additionally, field experiments stand out as a powerful validation approach for our research.
By applying our methods in real-world settings, we can assess the practicality, efficiency, and
accuracy of our proposed techniques. Observing domain experts as they engage with our
XR-powered process mining tool can provide valuable insights into its usability, effectiveness,
and areas for improvement. Such experiments also help in determining the tool’s impact on
business process workflows, stakeholder satisfaction, and the overall quality of monitored
processes. Meanwhile, argumentation, although more theoretical, offers a structured platform
to articulate and defend the reasoning behind our approach, ensuring that the choices made in
data integration, XR application, and process analysis are not only innovative but also logically
sound and consistent with existing domain knowledge.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Contribution of the thesis</title>
        <p>The contribution of the thesis follows three topics:</p>
        <p>Topic 1 – Mining: Focused on extracting meaningful information from data. It handles
the visualization of data, manages multi-view video data, detects process loops, visualizes
dependencies, identifies critical paths, and more.</p>
        <p>Topic 2 – Tracking and monitoring: Uses videos, other modalities and extended reality for
real-time process tracking, ensuring continuity of processes. It can detect anomalies, handle
variations in processes over time, and create reports like summaries.</p>
        <p>Topic 3 – Guiding: Utilizes extended reality for process guidance, transforming data into
interactive guides. The topic also delves into data visualization techniques, user-friendly
interface design, and training requirements.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Research execution plan</title>
      <p>In our journey thus far, we have explored and tested an array of technologies, harnessing their
potential for process mining in an enriched multimodal environment. Our current explorations
have centered around fine-tuning pre-trained models tailored for both visual and audio
recognition. Recognizing the power of visual language models, we conducted experiments to mine
business logic from video frames. By furnishing these models with tailored prompts, we’ve
made significant strides in extracting meaningful insights from the visual data. Additionally,
our tests of human body pose estimation, image segmentation, and image classification have
further equipped us with tools to understand and interpret the vast and varied visual inputs at
our disposal.</p>
      <p>Looking ahead, our roadmap is illustrated in Fig. 4. Our immediate endeavor is the
development of an XR tool designed to empower modelers in navigating 360° videos, allowing them
to pinpoint and select relevant objects with precision. This selection process is envisioned
to be intuitive and immersive, leveraging a cube selector to volumetrically extract the mesh
of the scene. This hands-on approach ensures that the extracted data is both relevant and
contextual. Once this foundational tool is in place, our energies will shift toward the creation
of a comprehensive immersive modeling tool. This tool will be pivotal in preparing us for
the subsequent phase of data labeling, ensuring that the raw data is categorized and marked
for further analysis. As we progress, the collection of source data will be paramount, and we
anticipate diving deeper into state-of-the-art pre-trained models that can assist in AI-driven
contextualizations.</p>
      <p>The proposed solution is designed to address a multitude of challenges and considerations
across various domains. One critical aspect is ensuring interoperability with different data
formats. This entails the integration, standardization, and compatibility of diverse data inputs,
allowing the solution to handle data from various sources effectively. This feature is pivotal in
enabling the solution to be adaptable and versatile in processing information. Another essential
feature focuses on accommodating diverse user needs. By offering customization options, user
interface personalization, and adaptive features, the solution aims to enhance user-friendliness
and satisfaction. This approach ensures that users from different backgrounds and with varying
requirements can effectively utilize the solution to meet their specific needs.</p>
      <p>Handling biases in multimodal data is another crucial aspect addressed in the solution.
It explores methods for detecting, mitigating, and assessing biases, which is essential for
maintaining the integrity and reliability of analysis results. This feature underscores the
commitment to fairness and objectivity in data analysis. Dealing with missing or incomplete
data is a common challenge in data analysis, and the solution provides strategies for addressing
this issue. Techniques such as imputation, data augmentation, and sensitivity analysis are
explored to mitigate the impact of missing information on analysis results, ensuring more
robust outcomes.</p>
      <p>[Figure 4: Research roadmap. Stages: problem statement and testing objectives; immersive modeling of multimodal event logs; AI-assisted video event log creation; AI-assisted multimodal event log creation; improving interpretability of process mining; exploring application domains for process mining, tracking and guidance.]</p>
      <p>Concluding our roadmap, our focus will pivot towards diverse applications for our integrated
system. Moreover, as we venture deeper into process mining, ensuring the interpretability of
these mined processes will be crucial. This will ensure that the insights gleaned from our system
are not only accurate but also actionable, lending them real-world relevance and applicability.</p>
      <p>Future work can address advanced topics such as real-time monitoring, faster data analysis,
and accelerated process improvement, showcasing the potential for innovation in data analysis
methods. Additionally, future work will consider efficient storage strategies, concurrent process
mining, uncovering hidden process patterns, predicting process performance improvement, and
structuring business process libraries, highlighting a comprehensive approach to data analysis
and process optimization. Applying the solution successfully in various industries
would demonstrate its adaptability and potential for widespread utility.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>Our investigation into the integration of multimodal data sources and extended reality
technologies in process mining has illuminated the transformative potential of this synergistic approach.
As demonstrated in our solution design architecture, the journey from Source Data Acquisition
to an Enhanced Conceptual Model underscores the power of combining raw data inputs with
both automated and immersive manual contextualization techniques. The value of this integrated
approach lies not only in its capacity to transform heterogeneous data into actionable insights
but also in its versatility, offering applications across process mining, monitoring, and guidance
domains.</p>
      <p>The implications of this research are profound for industries reliant on process analyses,
especially those in repair and maintenance sectors. Our architectural model offers organizations
a blueprint to harness their diverse data streams, contextualize them through AI and human
expertise, and subsequently derive enhanced, actionable conceptual models.</p>
      <p>In summation, while the horizon of process mining and analysis is expansive, our study
introduces a pioneering approach that marries the best of technology and human expertise.
Future endeavors in this field should focus on scalability, ensuring that our architecture can
cater to even more complex and voluminous data environments, and further bridging the gap
between raw data and actionable insights.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>The author is deeply grateful to Prof. Dr. Henrik Leopold at Kühne Logistics University (KLU) for
his invaluable guidance in shaping this work.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Foundations of process discovery</article-title>
          , in:
          <source>Process Mining Handbook</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>37</fpage>
          -
          <lpage>75</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Data science in action</article-title>
          , in:
          <source>Process Mining</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>23</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>W. M.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>A practitioner's guide to process mining: Limitations of the directly-follows graph</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zimmermann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zerbato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Weber</surname>
          </string-name>
          ,
          <article-title>What makes life for process mining analysts difficult? A reflection of challenges</article-title>
          ,
          <source>Software and Systems Modeling</source>
          (
          <year>2023</year>
          ). doi: 10.1007/s10270-023-01134-0.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Michael</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wimmer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Mayr</surname>
          </string-name>
          ,
          <article-title>Quo vadis modeling?</article-title>
          ,
          <source>Software and Systems Modeling</source>
          (
          <year>2023</year>
          ). doi: 10.1007/s10270-023-01128-y.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Roelens</surname>
          </string-name>
          ,
          <article-title>Conceptual modeling and artificial intelligence: A systematic mapping study</article-title>
          ,
          <year>2023</year>
          . arXiv:2303.06758.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kulkarni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reddy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <source>The AI-Enabled Enterprise</source>
          , Springer International Publishing, Cham,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          . doi: 10.1007/978-3-031-29053-4_1.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Gils</surname>
          </string-name>
          ,
          <source>Coordinated Continuous Digital Transformation</source>
          , Springer International Publishing, Cham,
          <year>2023</year>
          , pp.
          <fpage>101</fpage>
          -
          <lpage>120</lpage>
          . doi: 10.1007/978-3-031-29053-4_6.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>E. V.</given-names>
            <surname>Epure</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Martin-Rodilla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hug</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Deneckère</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Salinesi</surname>
          </string-name>
          ,
          <article-title>Automatic process model discovery from textual methodologies</article-title>
          ,
          in:
          <source>2015 IEEE 9th International Conference on Research Challenges in Information Science (RCIS)</source>
          , IEEE,
          <year>2015</year>
          , pp.
          <fpage>19</fpage>
          -
          <lpage>30</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Leopold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pittke</surname>
          </string-name>
          ,
          <article-title>25 challenges of semantic process modeling</article-title>
          ,
          <source>International Journal of Information Systems and Software Engineering for Big Companies</source>
          <volume>1</volume>
          (
          <year>2015</year>
          )
          <fpage>78</fpage>
          -
          <lpage>94</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>B.</given-names>
            <surname>van Gils</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <source>Next-Generation Enterprise Modeling</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>279</fpage>
          -
          <lpage>305</lpage>
          . doi: 10.1007/978-3-031-30214-5_21.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>van Gils</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Haki</surname>
          </string-name>
          ,
          <source>Final Conclusions and Outlook</source>
          , Springer Nature Switzerland, Cham,
          <year>2023</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>314</lpage>
          . doi: 10.1007/978-3-031-30214-5_23.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>H.</given-names>
            <surname>van der Aa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Leopold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mannhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Reijers</surname>
          </string-name>
          ,
          <article-title>On the fragmentation of process information: Challenges, solutions, and outlook</article-title>
          , in:
          <source>Enterprise, Business-Process and Information Systems Modeling</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>De Carlo</surname>
          </string-name>
          ,
          <article-title>An extended taxonomy of advanced information visualization and interaction in conceptual modeling</article-title>
          ,
          <source>Data &amp; Knowledge Engineering</source>
          <volume>147</volume>
          (
          <year>2023</year>
          )
          <fpage>102209</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.datak.2023.102209</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>M.</given-names>
            <surname>Kraus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Klein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fuchs</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Keim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schreiber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sedlmair</surname>
          </string-name>
          ,
          <article-title>The value of immersive visualization</article-title>
          ,
          <source>IEEE Computer Graphics and Applications</source>
          <volume>41</volume>
          (
          <year>2021</year>
          )
          <fpage>125</fpage>
          -
          <lpage>132</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Camba</surname>
          </string-name>
          ,
          <article-title>Computer-aided process planning in immersive environments: A critical review</article-title>
          ,
          <source>Computers in Industry</source>
          <volume>133</volume>
          (
          <year>2021</year>
          )
          <fpage>103547</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Polzin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Cristiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zickner</surname>
          </string-name>
          ,
          <article-title>Virtual reality approaches for immersive design</article-title>
          ,
          <source>CIRP Annals</source>
          <volume>56</volume>
          (
          <year>2007</year>
          )
          <fpage>139</fpage>
          -
          <lpage>142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oberhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pogolski</surname>
          </string-name>
          ,
          <article-title>VR-EA: Virtual reality visualization of enterprise architecture models with ArchiMate and BPMN</article-title>
          ,
          in:
          <source>Business Modeling and Software Design: 9th International Symposium, BMSD 2019, Lisbon, Portugal, July 1-3, 2019, Proceedings 9</source>
          , Springer,
          <year>2019</year>
          , pp.
          <fpage>170</fpage>
          -
          <lpage>187</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oberhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pogolski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Matic</surname>
          </string-name>
          ,
          <article-title>VR-BPMN: Visualizing BPMN models in virtual reality</article-title>
          ,
          in:
          <source>Business Modeling and Software Design: 8th International Symposium, BMSD 2018, Vienna, Austria, July 2-4, 2018, Proceedings 8</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>83</fpage>
          -
          <lpage>97</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oberhauser</surname>
          </string-name>
          ,
          <article-title>VR-ProcessMine: Immersive process mining visualization and analysis in virtual reality</article-title>
          ,
          in:
          <source>Proceedings of the Fourteenth International Conference on Information, Process, and Knowledge Management</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>75</fpage>
          -
          <lpage>80</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oberhauser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lecon</surname>
          </string-name>
          ,
          <article-title>Virtual reality flythrough of program code structures</article-title>
          ,
          in:
          <source>Proceedings of the Virtual Reality International Conference - Laval Virtual 2017</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>4</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Makhsadov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Klingner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liebemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krüger</surname>
          </string-name>
          ,
          <article-title>Immersive process model exploration in virtual reality</article-title>
          ,
          <source>IEEE Transactions on Visualization and Computer Graphics</source>
          <volume>26</volume>
          (
          <year>2020</year>
          )
          <fpage>2104</fpage>
          -
          <lpage>2114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Zenner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Klingner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liebemann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Makhsadov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Krüger</surname>
          </string-name>
          ,
          <article-title>Immersive process models</article-title>
          ,
          in:
          <source>Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <article-title>Encoding conceptual models for machine learning: A systematic review</article-title>
          (
          <year>2023</year>
          )
          <fpage>562</fpage>
          -
          <lpage>570</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/MODELS-C59198.2023.00094</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fahland</surname>
          </string-name>
          ,
          <article-title>Process mining over multiple behavioral dimensions with event knowledge graphs</article-title>
          ,
          in:
          <source>Process Mining Handbook</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>274</fpage>
          -
          <lpage>319</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>M.</given-names>
            <surname>Smajevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <article-title>CM2KGcloud - an open web-based platform to transform conceptual models into knowledge graphs</article-title>
          ,
          <source>Science of Computer Programming</source>
          <volume>231</volume>
          (
          <year>2024</year>
          )
          <fpage>103007</fpage>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.scico.2023.103007</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>M. C.</given-names>
            <surname>Schiappa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <article-title>SVGraph: Learning semantic graphs from instructional videos</article-title>
          , in:
          <source>2022 IEEE Eighth International Conference on Multimedia Big Data (BigMM)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>45</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>T.-J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bell</surname>
          </string-name>
          ,
          <article-title>Tell me what happened: Unifying text-guided video completion via multimodal masked video generation</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>10681</fpage>
          -
          <lpage>10692</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Mi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Object-relation reasoning graph for action recognition</article-title>
          ,
          in:
          <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>20133</fpage>
          -
          <lpage>20142</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. Z.</given-names>
            <surname>Ang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Panoptic scene graph generation</article-title>
          ,
          in:
          <source>European Conference on Computer Vision</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>178</fpage>
          -
          <lpage>196</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>H.</given-names>
            <surname>Leopold</surname>
          </string-name>
          ,
          <article-title>Business process model matching</article-title>
          , in: S. Sakr, A. Y. Zomaya (Eds.),
          <source>Encyclopedia of Big Data Technologies</source>
          , Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Leopold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meilicke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fellmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pittke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Stuckenschmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <article-title>Towards the automated annotation of process models</article-title>
          ,
          in:
          <source>International Conference on Advanced Information Systems Engineering</source>
          , Springer,
          <year>2015</year>
          , pp.
          <fpage>401</fpage>
          -
          <lpage>416</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kratsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>König</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Röglinger</surname>
          </string-name>
          ,
          <article-title>Shedding light on blind spots - developing a reference architecture to leverage video data for process mining</article-title>
          ,
          <source>Decision Support Systems</source>
          <volume>158</volume>
          (
          <year>2022</year>
          )
          <fpage>113794</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>A.</given-names>
            <surname>Lepsien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bosselmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Melfsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <article-title>Process mining on video data</article-title>
          ,
          <source>ZEUS</source>
          <volume>3113</volume>
          (
          <year>2022</year>
          )
          <fpage>56</fpage>
          -
          <lpage>62</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>S.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <article-title>Video process mining and model matching for intelligent development: Conformance checking</article-title>
          ,
          <source>Sensors</source>
          <volume>23</volume>
          (
          <year>2023</year>
          )
          <fpage>3812</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>