<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>for Real-time Surgical Conformance Checking and Guidance</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aleksandar Gavric</string-name>
          <email>aleksandar.gavric@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dominik Bork</string-name>
          <email>dominik.bork@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Henderik A. Proper</string-name>
          <email>henderik.proper@tuwien.ac.at</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Business Informatics, TU Wien</institution>
          ,
          <addr-line>Erzherzog-Johann-Platz 1, Vienna, 1040</addr-line>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Multimodal data analysis</institution>
          ,
          <addr-line>Mixed reality, Surgery AI, Surgical guidance, Process mining, BPMN, LLM, Healthcare</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>20</fpage>
      <lpage>21</lpage>
      <abstract>
<p>This paper discusses an end-to-end methodology for real-time surgical conformance checking that uses multimodal process mining, mixed reality (MR), and large language model (LLM) prompting. Our approach aims to support surgeons and medical teams by comparing as-is operational data, captured through a variety of sensors including MR-based gaze tracking, with a reference surgical process model encoded in Business Process Model and Notation (BPMN). We illustrate how shallow and deep human-in-the-loop feedback mechanisms can be integrated with chain-of-thought prompting to provide relevant, context-aware, and iterative feedback during surgery. We further indicate which aspects of the surgery can be monitored (and hence queried) by our multimodal process mining engine. By enabling precise, actionable feedback during critical surgical procedures, our approach enhances the ability to identify deviations, ensure adherence to best practices, and reduce human error. Ultimately, this methodology empowers surgical teams to make data-driven adjustments, promotes better patient outcomes, and allows hospitals to monitor surgical conformance effectively, setting a new standard for process-driven healthcare.</p>
        <p>Vienna'25: 17th Central European Workshop on Services and their Composition (ZEUS), February 20-21, 2025, Vienna, Austria</p>
      </abstract>
<kwd-group>
        <kwd>Multimodal data analysis</kwd>
        <kwd>Mixed reality</kwd>
        <kwd>Surgery AI</kwd>
        <kwd>Surgical guidance</kwd>
        <kwd>Process mining</kwd>
        <kwd>BPMN</kwd>
        <kwd>LLM</kwd>
        <kwd>Healthcare</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Modern surgical procedures are intricate and involve numerous steps, actors, instruments, and real-time
decisions. Ensuring that each step in the as-is surgery conforms to a reference (or “desired”) model is
crucial for patient safety, consistent outcomes, and compliance with institutional guidelines. Traditional
methods of process oversight often rely on paper-based checklists or single-modality digital signals
(e.g., time stamps of major milestones), which offer limited real-time insight.</p>
      <p>
        Process mining [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] addresses this gap by extracting event logs from complex systems and
reconstructing an as-is process model. Yet standard process mining may overlook the depth of real-time
information available from modern medical devices, images, sensor data, and user interactions in an
operating room [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The growing accessibility of mixed reality (MR) systems and advanced wearable
sensors (like gaze trackers) opens the door to multimodal process mining [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], where we capture a
richer set of signals beyond textual or numeric logs (e.g., surgeon gaze, instrument position, physical
environment changes, real-time vitals).
      </p>
      <p>
        Meanwhile, Large Language Models (LLMs) allow us to harness conversational and chain-of-thought
prompting to incorporate human expertise dynamically. Surgeons, nurses, and other staff can interact
with the system at various depths: (1) Shallow feedback: Quick confirmations or corrections to immediate
queries (e.g., “Is the incision completed?”), and (2) Deep feedback: More reflective input that leads to
refining the underlying process model or augmenting the system’s domain knowledge [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Mixed reality interfaces can further project relevant information in the surgical environment,
supporting Spatial Conceptual Modeling [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to visualize conformance data in situ. This integration bridges
the gap between human expertise and automated systems by enabling real-time contextual feedback and
adaptive process modeling. For instance, visual overlays or auditory alerts can notify surgeons of
deviations from standard procedures or highlight critical decision points, leveraging AI-based interpretation
of multimodal data [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The integration of artificial intelligence (AI) and mixed reality (MR) in surgical environments has
emerged as a promising research area, driven by advancements in computer vision, language models,
and multimodal process mining. This section reviews the most relevant contributions in this domain.</p>
      <p>
Recent efforts, such as Surgical-LLaVA [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], have demonstrated the potential of large language and
vision models for understanding surgical scenarios, offering a foundation for enhanced decision support
systems. Similarly, Yuan et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] proposed a procedure-aware surgical video-language pretraining
approach, utilizing hierarchical knowledge augmentation to improve the interpretability of surgical
workflows. Digital twins, as described by Ding et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], provide a unifying framework for surgical data
science, leveraging geometric scene understanding to create comprehensive models of the operating
room (OR). Complementing this, holistic OR domain modeling using semantic scene graphs has been
explored by Özsoy et al. [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], enabling a detailed representation of surgical environments.
      </p>
      <p>
        Further advancements in surgical scene graph knowledge have been achieved by Yuan et al. [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
who incorporated scene graphs into visual question answering (VQA) systems for surgical applications,
thereby enhancing context-awareness in automated systems. The OphNet benchmark by Hu et al. [12]
provides a large-scale video dataset for ophthalmic surgical workflow understanding, facilitating the
development of robust AI models in the domain.
      </p>
      <p>
        Incorporating mixed reality into surgical planning and execution has also gained traction. Bracale
et al. [13] highlighted the utility of MR in preoperative planning for colorectal surgery, showcasing
its potential to improve surgical outcomes. From a conceptual perspective, Fill [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] introduced
spatial conceptual modeling, which anchors knowledge in the physical world using augmented reality
technologies, enabling innovative applications in medical and other domains.
      </p>
      <p>
        Our prior contributions have laid the groundwork for advancing multimodal process mining and
its applications. In Multimodal Process Mining [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], we introduced an approach to enrich traditional
process mining with multimodal evidence, capturing data from diverse sources such as sensors, images,
and user interactions. Building on this, we explored how to enhance business process event logs with
multimodal evidence in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], demonstrating the potential for deeper insights. In [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], we addressed the
challenge of tailoring multimodal data representations to stakeholder-specific terminology for improved
interpretability. Finally, in [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], we extended the multimodal paradigm to conceptual modeling,
showcasing how AI can leverage visual and auditory cues to interpret UML diagrams. These contributions
collectively highlight the potential of multimodal approaches in augmenting traditional process and
conceptual modeling practices.
      </p>
      <p>By uniting algorithmic-symbolic rigor with LLM-driven sub-symbolic flexibility and human expertise,
our approach transcends the constraints of rule-based process mining, enabling a more dynamic and
contextually rich analysis of surgical workflows.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology Overview</title>
<p>We formalize the multimodal process monitoring and adaptive feedback mechanism as an optimization
problem over a hybrid state space S consisting of structured process models, multimodal sensor inputs,
and user feedback mechanisms.</p>
      <p>State Representation. Let the state at time t be represented as s_t = (m_t, o_t, f_t), where m_t ∈ M
represents the current process model state (e.g., a BPMN graph representation, stored in a Retrieval-Augmented
Graph [14]), o_t ∈ O denotes the vector of multimodal sensor observations (e.g., gaze
tracking, instrument logs, voice commands), and f_t ∈ F captures the human feedback at time t, either
shallow (e.g., confirmation) or deep (e.g., structural model changes).</p>
      <p>Transition Function. The state transition function δ : S × A → S maps the current state and action
to a new state, s_{t+1} = δ(s_t, a_t), where a_t represents an action taken by the system or user:
• a_t^sys (System Actions): process conformance checking, real-time alerting, adaptive workflow
modification;
• a_t^hum (Human Actions): explicit feedback confirmation, model refinement, procedural adjustments.</p>
      <p>Objective Function. The system aims to minimize a cumulative deviation function D that quantifies
non-conformance with the desired process model while maximizing the incorporation of human
feedback. This ensures continuous process adaptation and human-in-the-loop refinement over time.</p>
      <sec id="sec-3-1">
        <title>3.1. Application to a Specific Use Case: Surgery</title>
        <p>We instantiate our proposed framework in the context of surgery, a domain characterized by strict
procedural adherence and real-time decision-making.</p>
<p>Process Modeling and Sensor Integration. The BPMN model for procedures includes predefined
steps such as incision, trocar placement, laparoscope insertion, and organ manipulation. The system
continuously maps real-world observations to this structured model through:
• Vision-based Instrument Detection (o_inst): identifies tool usage and compares it with expected
sequences;
• Eye-tracking (o_gaze): confirms whether surgeons are focusing on critical areas at the appropriate steps;
• Hand Gesture Recognition (o_gest): detects compliance with required movements (e.g., correct
suturing technique);
• Voice Commands (o_voice): captures surgeon-nurse communication for validation;
• Real-time Imaging (o_img): analyzes anatomical landmarks for correct procedure execution.</p>
        <p>Illustrative Scenario. Consider a scenario where a surgeon employs a novel technique requiring a
secondary incision. The system detects a deviation (o_img and o_inst differ from the expected process).</p>
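<p>The per-modality conformance check can be sketched as follows. The task names, modality keys, and expected evidence values are assumptions for illustration only:</p>

```python
# Expected evidence per BPMN task and modality (illustrative values).
EXPECTED = {
    "Incision":           {"inst": "scalpel",     "gaze": "abdomen"},
    "Trocar Placement":   {"inst": "trocar",      "gaze": "port_site"},
    "Insert Laparoscope": {"inst": "laparoscope", "gaze": "monitor"},
}

def deviating_modalities(task: str, observations: dict) -> list[str]:
    """Return the modalities whose current observation differs from what the
    reference model expects for this task; an empty list means conformance."""
    expected = EXPECTED.get(task, {})
    return [modality for modality, value in expected.items()
            if observations.get(modality) != value]
```

<p>In the scenario above, an unmodeled secondary incision would surface as deviations in the instrument and imaging modalities, triggering a feedback prompt to the surgical team.</p>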
<p>Figure 1 provides a high-level schematic overview of our proposed framework adapted for the domain
of surgery. Two major phases of human-in-the-loop involvement are depicted:
1. Shallow Feedback (A):
• The system continuously captures data from multiple sources (e.g., gaze tracking, instrument
usage, sensor logs).
• It compares the as-is process to the desired BPMN model.
• When a potential discrepancy or question arises, the system prompts the user (surgeon,
nurse, etc.) for feedback.
• The user provides confirmation, correction, or small clarifications. This feedback is used to
adjust or annotate the current run-time process instance.
2. Deep Feedback (B):
• In-depth reflections by human experts are used to refine the model itself or the methods
that interpret the captured data.
• For instance, if the current process model does not account for a new device or step
introduced by the surgical team, deep feedback cycles can lead to an updated BPMN model or a
reconfiguration of the data capture pipeline.
• Over time, repeated deep feedback loops result in an evolving knowledge base that is more
robust and better tailored to each specific surgery or environment.</p>
        <p>
          A critical component of our setup is the Mixed Reality (MR) environment, which serves multiple
purposes:
• Precision of Multimodal Recordings: By using MR headsets, the system can track the surgeon’s
gaze in relation to specific instruments or areas of the patient’s body. Likewise, position and
orientation of surgical staff can be recorded.
• Spatial Conceptual Modeling for Feedback: We build on Spatial Conceptual Modeling [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ],
which allows us to overlay real-time process conformance data directly into the OR environment.
For example, a soft highlight (visible in the MR headset) might appear over the next instrument
to be used, or an alert icon might appear above a piece of equipment that must be sanitized.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Chain-of-Thought LLM Prompting</title>
        <p>The proposed methodology employs a conversation engine powered by Large Language Models (e.g.,
GPT variants) that can: (1) Parse sensor events and interpret them in the context of the BPMN model,
(2) Generate feedback prompts when conformance might be violated, (3) Solicit clarifications and
deeper insights from the surgeon or nurse (for refining the model), and (4) Provide intermediate
“food-for-thought” (chain-of-thought) to guide the surgical team or system designers on why certain
steps are suggested or flagged.</p>
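<p>A hypothetical sketch of how such a conversation engine could assemble its prompt layers from domain knowledge and live sensor events; the function and constant names are our own, not part of the methodology's implementation:</p>

```python
# Static domain-knowledge layer (illustrative content).
DOMAIN_CONTEXT = (
    "Typical laparoscopic steps: incision, trocar placement, "
    "laparoscope insertion, organ manipulation."
)

def build_prompt(current_task: str, sensor_events: list[str]) -> str:
    """Compose the layered system prompt: domain knowledge, recent sensor
    evidence, current BPMN position, and a chain-of-thought instruction."""
    layers = [
        f"Domain knowledge: {DOMAIN_CONTEXT}",
        "Recent sensor events: " + "; ".join(sensor_events),
        f"BPMN model position: task '{current_task}'.",
        "Reason step by step whether this task is complete. "
        "If uncertain, ask the user for feedback.",
    ]
    return "\n".join(layers)
```

<p>Each layer stays separately maintainable: domain knowledge can be curated offline, while the sensor layer is refreshed on every event.</p>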
        <p>A key component of our methodology is the construction of well-curated LLM prompts that merge
domain knowledge (e.g., typical steps in a laparoscopic procedure) with real-time sensor data (e.g., the
last tool recognized by a vision sensor was a cauterizing instrument). Below is a conceptual example of
the layered prompts:</p>
        <sec id="sec-3-2-1">
          <title>Example Prompt for Understanding Mixed Reality Inputs</title>
          <p>System (LLM context): “The surgeon’s gaze has been fixated on the laparoscope for 5 seconds,
and the nurse passed the laparoscope 10 seconds ago. The BPMN model indicates we are in the
“Insert Laparoscope” task. Confirm if this step is complete. If uncertain, ask for feedback from
the user.”</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>Example Prompt for Generating Feedback</title>
          <p>System (LLM context after receiving user input): “User indicated that they are testing a
new technique requiring a secondary incision. The current BPMN model does not include this
step. Rather than adding an optional sub-process, this should be modeled as an alternative
process path. Insert an exclusive Gateway with the existing technique subprocess and the new
technique subprocess as subsequent elements. Record new recommended tasks accordingly.</p>
          <p>Provide a revised BPMN snippet.”</p>
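<p>For illustration, a revised fragment of the kind this prompt requests might look as follows in BPMN 2.0 XML; all element ids and names below are invented for this sketch and would sit inside the model's enclosing process element:</p>

```xml
<!-- Illustrative fragment only: ids and names are invented for this sketch -->
<exclusiveGateway id="gw_technique_choice" name="Technique choice" />
<sequenceFlow id="flow_existing" sourceRef="gw_technique_choice"
              targetRef="sub_existing_technique" />
<sequenceFlow id="flow_new" sourceRef="gw_technique_choice"
              targetRef="sub_new_technique" />
<subProcess id="sub_existing_technique" name="Existing technique" />
<subProcess id="sub_new_technique" name="New technique with secondary incision" />
```

<p>The exclusive gateway makes the two techniques alternative paths, rather than modeling the new incision as an optional sub-process of the existing flow.</p>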
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Aspects to Monitor for Multimodal Process Mining</title>
<p>Beyond the questions a surgeon might explicitly ask, the system continuously mines data to update
the as-is process model. Table 1 outlines various aspects of surgery that are relevant for conformance
checking, each corresponding to multimodal sensor inputs.</p>
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption><p>Aspects of surgery monitored by the multimodal process mining engine.</p></caption>
          <table>
            <thead>
              <tr><th>Aspect</th><th>Data Sources</th><th>Reason for Relevance to Conformance</th></tr>
            </thead>
            <tbody>
              <tr><td>Instrument usage</td><td>Instrument detection via computer vision (camera) + staff input logs</td><td>To confirm the correct usage sequence, detect missing/extra usage, alert if an instrument was not sterilized, etc.</td></tr>
              <tr><td>Surgeon gaze</td><td>MR headset with eye-tracking</td><td>To assess if the surgeon is focused on the correct region/patient area. Non-conformance might arise if the surgeon fails to visually confirm a step (e.g., lack of inspection).</td></tr>
              <tr><td>Hand gestures</td><td>MR/IR sensors, glove-based trackers</td><td>To detect if certain steps (e.g., suturing technique) are performed in a standard manner, or to confirm that a gesture-based command has been recognized.</td></tr>
              <tr><td>Patient vitals</td><td>Anesthesia machine logs, heart rate monitor, SpO2 sensor</td><td>To ensure anesthesia compliance steps and watch for anomalies that might require altering the process (e.g., emergency protocols).</td></tr>
              <tr><td>Instrument/sponge count</td><td>Vision-based object detection, manual logs from nurses</td><td>To check if the correct number of instruments/sponges are present before closure (avoid retained surgical items).</td></tr>
              <tr><td>Sterilization</td><td>UV sensor logs, staff compliance logs (handwashing, glove changes)</td><td>Conformance checking for infection control steps, verifying that each area is sanitized prior to the next step.</td></tr>
              <tr><td>Staff positioning</td><td>MR device tracking (position/orientation)</td><td>To ensure the correct posture or vantage point is taken for certain steps (e.g., for a laparoscopic approach, a specific angle might be recommended).</td></tr>
              <tr><td>Incision and closure</td><td>Camera feed from laparoscope or overhead camera</td><td>To verify compliance with the recommended incision size, location, and closure technique.</td></tr>
              <tr><td>Anatomical identification</td><td>Imaging data (e.g., real-time ultrasound, MRI overlays)</td><td>To confirm that the correct organ or region is identified before proceeding (e.g., right kidney instead of left).</td></tr>
              <tr><td>Task timing</td><td></td><td>To confirm that each task is within an acceptable time window (e.g., prophylactic antibiotics repeated in time).</td></tr>
              <tr><td>Communication Logs</td><td>Voice recognition or typed notes</td><td>To verify that critical verbal confirmations are done (e.g., the “Time Out” procedure).</td></tr>
              <tr><td>Clinical Documentation</td><td>EHR (Electronic Health Record) system</td><td>To confirm that data entry is complete and consistent with the surgical plan (e.g., procedure codes, lab results).</td></tr>
              <tr><td>Unexpected Events</td><td>Automatic anomaly detection (vitals, sudden camera movements)</td><td>To trigger re-routing of the BPMN process to an emergency sub-process if necessary (e.g., severe hemorrhage).</td></tr>
              <tr><td>Surgeon/Staff Vitals</td><td>Smartwatches, wearable health trackers</td><td>To monitor the physical state of surgeons and staff (e.g., heart rate, stress levels, fatigue) and proactively suggest breaks or duty switches when signs of exhaustion or stress are detected, especially in surgeries involving multiple surgeons.</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Proposed Evaluation and Future Work</title>
<p>We propose a multi-faceted evaluation framework, leveraging established surgical video datasets
[15, 16, 17] to benchmark performance across several key metrics:
• Annotation Accuracy: measure tool and event recognition accuracy against expert annotations.
• Temporal Consistency: evaluate the alignment between detected events and ground-truth
timelines, ensuring timely alerts and correct sequencing.
• Process Conformance: assess the system’s ability to detect deviations from standard protocols
using conformance checking metrics, such as deviation frequency and critical event
misclassifications.
• Cross-modal Robustness: analyze performance consistency across different sensor inputs to
ensure reliable multimodal integration.</p>
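<p>Two of these metrics admit a simple illustrative computation. The formulas below are assumptions (position-wise agreement for annotation accuracy, position-wise mismatch share for deviation frequency), not the paper's evaluation code:</p>

```python
def annotation_accuracy(predicted: list[str], expert: list[str]) -> float:
    """Fraction of events whose predicted label matches the expert annotation."""
    assert len(predicted) == len(expert)
    return sum(p == e for p, e in zip(predicted, expert)) / len(expert)

def deviation_frequency(observed: list[str], protocol: list[str]) -> float:
    """Share of observed events that deviate from the protocol, position-wise."""
    pairs = list(zip(observed, protocol))
    return sum(o != p for o, p in pairs) / len(pairs)
```
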
      <p>By applying these metrics on diverse datasets from cataract [15], laparoscopic [16], and robotic
surgery domains [17], we aim to demonstrate the system’s versatility and readiness for real-time
surgical support.</p>
      <p>Our roadmap for future work outlines key steps to ensure continuous improvement and user-centric
development: (1) Extend evaluations to large, diverse datasets, including complex and rare surgical
procedures, (2) implement rigorous testing on annotated datasets to validate real-time performance and
scalability, and (3) engage with final users (surgeons, nurses) to gather feedback on system performance
and usability.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>By integrating multimodal process mining with Mixed Reality and LLM-driven chain-of-thought prompting,
we propose a highly granular, real-time conformance checking methodology for surgical processes.
User confirmations can augment immediate decisions in the operating room, while deeper reflection
iteratively improves the process model over time.</p>
      <p>As a result, surgeons can rely on the system to (1) provide step-by-step prompts and clarifications, (2)
alert them when tasks are out of sequence or incomplete, (3) suggest new tasks when a procedure deviates
from established protocols, and (4) support advanced analytics using chain-of-thought reasoning that
ties sensor data to context-specific knowledge of surgical procedures.</p>
<p>Despite its promising capabilities, our approach has several limitations. Inaccuracies in sensor data
(e.g., video feeds, gaze tracking) or inconsistent data quality may affect the system’s reliability. The
methodology, validated on selected surgical datasets, may require significant adaptation to perform
effectively across diverse surgical procedures and environments. Achieving true real-time performance
can be challenging due to the computational complexity of multimodal data fusion and chain-of-thought
processing.</p>
      <p>Addressing these threats and limitations through continued testing, iterative user feedback, and
technological refinements will be essential for future deployments in dynamic surgical environments.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used Grammarly for grammar and spelling
checking. After using this tool, the authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
<mixed-citation>
          [1]
          <string-name>W. M. P. van der Aalst</string-name>
          , Process Mining: Discovery, Conformance and Enhancement of Business Processes, Springer, Berlin, Heidelberg,
          <year>2011</year>
          . doi:10.1007/978-3-642-19345-3.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <article-title>Multimodal process mining</article-title>
          ,
          <source>in: 26th International Conference on Business Informatics (CBI)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <article-title>Enriching business process event logs with multimodal evidence</article-title>
          ,
<source>in: The 17th IFIP WG 8.1 Working Conference on the Practice of Enterprise Modeling (PoEM)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
          <article-title>Stakeholder-specific jargon-based representation of multimodal data within business process</article-title>
          ,
<source>in: Companion Proceedings of the 17th IFIP WG 8.1 Working Conference on the Practice of Enterprise Modeling (PoEM Forum 2024)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.-G.</given-names>
            <surname>Fill</surname>
          </string-name>
, Spatial Conceptual Modeling:
          <article-title>Anchoring Knowledge in the Real World</article-title>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>35</fpage>
          -
          <lpage>50</lpage>
          . doi:10.1007/978-3-031-56862-6_3.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gavric</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Bork</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Proper</surname>
          </string-name>
          ,
<article-title>How does UML look and sound? Using AI to interpret UML diagrams through multimodal evidence</article-title>
          ,
          <source>in: 43rd International Conference on Conceptual Modeling (ER)</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Jeong</surname>
          </string-name>
, Surgical-LLaVA:
          <article-title>Toward surgical scenario understanding via large language and vision models</article-title>
          ,
          <year>2024</year>
          . arXiv:2410.09750.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srivastav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Padoy</surname>
          </string-name>
          ,
          <article-title>Procedure-aware surgical video-language pretraining with hierarchical knowledge augmentation</article-title>
          ,
          <source>arXiv preprint arXiv:2410.00263</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>H.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Seenivasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. D.</given-names>
            <surname>Killeen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Unberath</surname>
          </string-name>
          ,
          <article-title>Digital twins as a unifying framework for surgical data science: the enabling role of geometric scene understanding</article-title>
          ,
          <source>Artificial Intelligence Surgery</source>
          <volume>4</volume>
          (
          <year>2024</year>
          )
          <fpage>109</fpage>
          -
          <lpage>138</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Özsoy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Czempiel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Örnek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Eck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tombari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
<article-title>Holistic OR domain modeling: a semantic scene graph approach</article-title>
          ,
          <source>International Journal of Computer Assisted Radiology and Surgery</source>
          <volume>19</volume>
          (
          <year>2024</year>
          )
          <fpage>791</fpage>
          -
          <lpage>799</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>K.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kattel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Lavanchy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Navab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srivastav</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Padoy</surname>
          </string-name>
, Advancing surgical VQA with scene graph knowledge, International Journal of Computer Assisted Radiology and Surgery (2024) 1-9.
        </mixed-citation>
</ref>
      <ref id="ref12">
        <mixed-citation>[12] M. Hu, P. Xia, L. Wang, S. Yan, F. Tang, Z. Xu, Y. Luo, K. Song, J. Leitner, X. Cheng, et al., OphNet: A large-scale video benchmark for ophthalmic surgical workflow understanding, in: European Conference on Computer Vision, Springer, 2025, pp. 481-500.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] U. Bracale, B. Iacone, A. Tedesco, A. Gargiulo, M. M. Di Nuzzo, D. Sannino, S. Tramontano, F. Corcione, The use of mixed reality in the preoperative planning of colorectal surgery: Preliminary experience with a narrative review, Cirugía Española (English Edition) (2024).</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-augmented generation for knowledge-intensive NLP tasks, 2021. arXiv:2005.11401.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] H. Al Hajj, M. Lamard, P.-H. Conze, S. Roychowdhury, X. Hu, G. Maršalkaitė, O. Zisimopoulos, M. A. Dedmari, F. Zhao, J. Prellberg, M. Sahu, A. Galdran, T. Araújo, D. M. Vo, C. Panda, N. Dahiya, S. Kondo, Z. Bian, A. Vahdat, J. Bialopetravičius, E. Flouty, C. Qiu, S. Dill, A. Mukhopadhyay, P. Costa, G. Aresta, S. Ramamurthy, S.-W. Lee, A. Campilho, S. Zachow, S. Xia, S. Conjeti, D. Stoyanov, J. Armaitis, P.-A. Heng, W. G. Macready, B. Cochener, G. Quellec, CATARACTS: Challenge on automatic tool annotation for cataract surgery, Medical Image Analysis 52 (2019) 24-41. doi:10.1016/j.media.2018.11.008.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. P. Twinanda, S. Shehata, D. Mutter, J. Marescaux, M. de Mathelin, N. Padoy, EndoNet: A deep architecture for recognition tasks on laparoscopic videos, 2016. arXiv:1602.03012.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] M. Allan, S. Kondo, S. Bodenstedt, S. Leger, R. Kadkhodamohammadi, I. Luengo, F. Fuentes, E. Flouty, A. Mohammed, M. Pedersen, A. Kori, V. Alex, G. Krishnamurthi, D. Rauber, R. Mendel, C. Palm, S. Bano, G. Saibro, C.-S. Shih, H.-A. Chiang, J. Zhuang, J. Yang, V. Iglovikov, A. Dobrenkii, M. Reddiboina, A. Reddy, X. Liu, C. Gao, M. Unberath, M. Kim, C. Kim, C. Kim, H. Kim, G. Lee, I. Ullah, M. Luna, S. H. Park, M. Azizian, D. Stoyanov, L. Maier-Hein, S. Speidel, 2018 robotic scene segmentation challenge, 2020. arXiv:2001.11190.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>