<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Clinical Guidelines, Domain Ontology, and LLMs for Personalized Leukemia Treatment Recommendations</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xingru Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michel Dumontier</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chang Sun</string-name>
          <email>chang.sun@maastrichtuniversity.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Advanced Computing Sciences, Faculty of Science and Engineering, Maastricht University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Institute of Data Science, Faculty of Science and Engineering, Maastricht University</institution>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Large Language Models (LLMs) offer new opportunities for clinical decision support, but they face challenges in reliability, in producing precise recommendations for individual patients, and in adhering to medical guidelines. Insufficient domain knowledge, generic outputs, and hallucination pose risks to their clinical adoption. This paper proposes an approach that integrates LLMs with Clinical Practice Guidelines (CPGs) and medical ontologies to enhance personalized treatment recommendations. We compared four strategies for generating treatment recommendations with and without integrating clinical guidelines: (1) LLMs without any guideline input; (2) providing the full guideline document as textual input to LLMs with the retrieval-augmented generation (RAG) technique; (3) converting guideline documents from PDF to markdown files that capture the structure of tables, diagrams, and references, and using Chain-of-Thought prompting to reason through each decision step; and (4) structuring guidelines as graphs and linking medical concepts to ontologies as input to LLMs. We experimented with GPT-3.5 Turbo, GPT-4, and Llama 2. The evaluations assessed guideline adherence, treatment completeness, path alignment, and answer relevancy, with Acute Lymphoblastic Leukemia as the primary use case. Additionally, we developed a user interface for health professionals to input patient descriptions and obtain treatment recommendations and explanations. Preliminary results demonstrate the feasibility of the graph-based approach for decision path tracing, graph-augmented reasoning, and natural language explanations that enhance transparency for clinician validation.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Clinical Practice Guidelines</kwd>
        <kwd>Ontologies</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Treatment Recommendation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) have been proposed as promising tools for enhancing clinical decision
support systems with their ability to process vast amounts of medical literature, guidelines, and patient
records to assist clinicians in diagnosis, treatment planning, and patient management. However, their
adoption in clinical settings is hindered by unreliable outputs, lack of explainability, and risks of
hallucination — where models may generate incorrect or misleading recommendations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. To
enhance the reliability of LLMs for clinical decision support, it is crucial to integrate authoritative,
structured medical knowledge directly into their reasoning process. One approach is to provide LLMs
with access to Clinical Practice Guidelines (CPGs). CPGs offer standardized, evidence-based treatment
protocols that clinicians use to guide medical decisions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        However, current LLMs face challenges in accurately interpreting and reliably applying CPGs. Unlike
other medical knowledge sources, CPGs are often lengthy, complex, semi-structured documents that
combine narrative text, cross-references, logic diagrams, workflow visualizations, tables, and conditional
recommendations [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Simply converting CPGs to plain text and feeding it to an LLM can result in
lost information, leading the model to miss key steps or generate wrong recommendations. Many
recommendations in CPGs depend on “if-then” conditions, requiring logical reasoning to determine
which pathway is relevant for a specific patient case. When CPGs are presented as raw text, LLMs
struggle to navigate these dependencies and generalize without properly following the decision flow.
Then, the majority of recommendations in CPGs are not stated in one place but instead refer to other
sections or external guidelines.
      </p>
      <p>LLMs often fail to resolve these references correctly. Furthermore, unlike
other medical knowledge sources, CPGs require precise interpretation of dosages, contraindications, risk
stratifications, and exception cases. LLMs, which rely on probabilistic text generation, may oversimplify
nuanced recommendations or fail to capture clinically significant details. Finally, the lack of decision
traceability in LLM-generated outputs poses a significant barrier to clinical adoption, as clinicians need
transparent reasoning to validate the generated recommendations.</p>
      <p>To address these challenges, we propose a framework that converts CPGs to graph structure and
links them with relevant entities from medical ontologies. The graphs are fed into LLMs to improve the
accuracy of treatment recommendations. Our approach centers on three innovations: (1) transforming
decision diagrams in CPGs to graphs navigable by LLMs, formalizing clinical workflows for risk
stratification, treatment staging, and decision logic; (2) integrating text, tables, and references into
LLMs to ground recommendations; and (3) deploying domain knowledge from medical ontology to
augment reasoning and minimize hallucinations. We designed four generation strategies: a baseline
model with structured prompting, a RAG model, a chain-of-thought model, and a graph-based model. We
experimented with three LLMs: GPT-3.5 Turbo, GPT-4, and Llama-2-70B. Our study focuses on Acute
Lymphoblastic Leukemia (ALL), utilizing clinical guidelines from authoritative sources such as the
National Comprehensive Cancer Network (NCCN). We evaluate the framework using synthetically
generated patient datasets that mirror clinical scenarios, assessing performance through accuracy,
guideline adherence, pathway alignment, and treatment completeness. We include explainability
mechanisms, such as decision path tracing and rule-based reasoning, to enable clinicians to audit model
outputs against CPGs. We demonstrate how structured guidelines and medical ontology integration
enhance LLMs’ capacity to deliver transparent and reliable treatment recommendations.</p>
      <p>This paper is organized as follows: Section 2 reviews related work on medical LLMs and guideline
integration. Section 3 describes the proposed methodology and experimental architecture covering
information retrieval and clinical guidelines integration. Section 4 details the experiment setting and
evaluation methods. Section 5 discusses the results and implications. Finally, Section 6 summarizes the
work and proposes future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Translating CPGs into machine-actionable formats and integrating them with LLMs remains a significant
challenge. Recent research has explored various approaches to address this issue, ranging from
unstructured text-based methods to structured representations that enhance model reasoning. A straightforward
approach is to incorporate guidelines as unstructured text, either through fine-tuning or real-time
retrieval [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Fine-tuning can improve performance on relevant queries, but the model’s knowledge
is static, and it may hallucinate or misapply guidance. Retrieval-Augmented Generation (RAG)
enables LLMs to fetch relevant guideline content dynamically from an external source, making it more
adaptable to new and updated guidelines [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. However, it does not inherently enforce structured
decision-making during clinical reasoning.
      </p>
      <p>
        Beyond using raw text, researchers have explored structured representations of CPGs, such as decision
trees, to enforce systematic adherence to medical guidelines [
        <xref ref-type="bibr" rid="ref8 ref9">8, 9</xref>
        ]. In this approach, LLMs function as
reasoning engines, traversing a structured decision tree to make stepwise, logic-driven clinical decisions.
Each node in the tree corresponds to a clinical decision point, guiding the model through a series of
intermediate steps until it arrives at the recommended treatment. This method has been tested with
multiple LLMs, including GPT-4, GPT-3.5, and PaLM-2, demonstrating improved alignment with correct
treatment recommendations compared to zero-shot prompting. This highlights that hard-coding the
guideline’s decision logic can lead to more reliable outputs.
      </p>
      <p>
        Another structured approach involves graph-based representations of CPGs, which capture complex
relationships between medical concepts and decision pathways [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Guidelines are encoded as a graph,
and the LLM selects a path through the graph that matches the patient’s conditions. The LLM builds a
reasoning path from patient data to a guideline-prescribed action, treating the guideline like a map of
connected decisions. The study also found that preserving the structure of guidelines (tables, flowcharts,
condition-action pairs) is crucial. In summary, structured integration approaches treat CPGs not just as
reference text but as algorithms or knowledge graphs that guide the model’s reasoning.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Guideline Extraction</title>
        <p>
          We selected the National Comprehensive Cancer Network (NCCN) guidelines for Acute Lymphoblastic
Leukemia (Version 1.2024) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] as an external knowledge source for LLMs to provide treatment
recommendations for different patients. The NCCN guidelines cover the entire patient management process
from diagnosis to treatment to follow-up monitoring. The guidelines provide detailed diagnostic criteria,
stratified risk assessment protocols, and treatment pathways for multiple patient subgroups.
        </p>
        <p>The guideline contains various types of information - text, tables, references, and decision flowcharts -
accompanied by detailed supporting documentation. We first parse the guideline PDF files using Google
Gemini 2.0 Flash Thinking, a large multi-modal language model with visual understanding capabilities
and an extensive context window. The model was prompted to extract all textual explanations, tables,
footnotes, references, and particularly the visual elements (e.g., decision flowcharts) to a semi-structured
textual description in a markdown file.</p>
        <p>Furthermore, we utilized Anthropic’s Claude 3.7 Sonnet model (with extended thinking) to transform the extracted
components from the markdown file to a Directed Graph (DG) representation. An example of a part of
the graph is shown in Figure 1. In the graph, each node represents a decision point or treatment option
in the clinical pathway, while each edge represents transitions between nodes with conditional logic.
Properties include the patient characteristics and clinical variables that may influence decisions and
references that are traceable to the source guidelines or articles. For the ALL case, we created three
graphs for different ALL subtypes (Ph+ B-ALL, Ph- B-ALL, and T-ALL) to accurately reflect the different
treatment pathways for each subtype. Finally, the graph structures were manually validated by the
authors to ensure the accuracy of the extracted knowledge from the original guidelines, including
verifying the decision logic, checking the completeness of the treatment pathways and inclusion of the
references and footnotes, and correcting the misinterpretation of the visual elements.</p>
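        <p>To make the encoding concrete, the following Python sketch illustrates how a fragment of such a directed graph and its conditional edges could be represented. The node names, conditions, and guideline page references here are hypothetical, not the actual extracted NCCN content:</p>
        <preformat>
```python
# Minimal sketch of the directed-graph encoding described above.
# Node and edge contents are hypothetical, not the actual NCCN graph.
graph = {
    "nodes": {
        "risk_assessment": {"type": "decision", "ref": "ALL-2"},
        "induction_ph_pos": {"type": "treatment",
                             "label": "TKI + chemotherapy", "ref": "ALL-3"},
        "induction_ph_neg": {"type": "treatment",
                             "label": "Multiagent chemotherapy", "ref": "ALL-4"},
    },
    "edges": [
        {"source": "risk_assessment", "target": "induction_ph_pos",
         "condition": {"feature": "bcr_abl1", "equals": True}},
        {"source": "risk_assessment", "target": "induction_ph_neg",
         "condition": {"feature": "bcr_abl1", "equals": False}},
    ],
}

def next_node(graph, current, patient):
    """Follow the edge whose condition matches the patient's features."""
    for edge in graph["edges"]:
        if edge["source"] == current:
            cond = edge["condition"]
            if patient.get(cond["feature"]) == cond["equals"]:
                return edge["target"]
    return None  # leaf node or no matching condition

patient = {"bcr_abl1": True}  # e.g., a Ph+ B-ALL patient
print(next_node(graph, "risk_assessment", patient))  # induction_ph_pos
```
        </preformat>
        <p>In the actual system the graph is produced by the LLM-based extraction pipeline and validated manually; this sketch only fixes the shape of the node, edge, and condition properties.</p>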
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLMs and Generation Strategies</title>
        <p>We design and compare different strategies to retrieve content from the CPGs and generate recommendations:
a baseline LLM, a RAG LLM, a RAG LLM with chain-of-thought reasoning, and a
graph-based RAG combined with LLM logical reasoning. We apply GPT-3.5 Turbo, GPT-4, and Llama-2-70B in
the experiments.</p>
        <p>Baseline LLM Model The baseline model consists of an LLM without CPGs augmentation or retrieval
mechanisms. The baseline model generates responses solely based on patient descriptions and its
inherent pre-trained knowledge, showing its capabilities of providing accurate recommendations in the
absence of external CPGs. We constructed a comprehensive prompting template for the model to extract
key characteristics (e.g., age, ALL subtype, treatment history) from the patient description and shape
the responses to follow the treatment pathway and consider the risk factors indicated in the guideline.
The baseline model serves as a comparative standard to assess the impact of integrating guidelines in
the subsequent generation strategies. The following models use the same prompting template to prevent
the potential influence of different prompts.</p>
        <p>RAG-Based LLM Model The second model employs Retrieval-Augmented Generation (RAG) to
retrieve relevant sections from CPGs during the response generation process. In this model, CPGs are
injected directly as textual input, meaning that tables, references, decision pathways, and flowcharts are
converted into plain text, resulting in the loss of decision logic and hierarchical structure. Compared to
the baseline model, the RAG-based model can retrieve specific information from the guideline content
based on patient description, which can help reduce hallucinations and improve the relevance and
accuracy of the generated responses.</p>
        <p>Chain-of-Thought (CoT) Enhanced LLM The CoT-enhanced LLM builds on the RAG-based
model and integrates stepwise reasoning over a structure-preserving textual representation of the guideline.
In this model, the LLM is provided with textual input converted from the PDF
version of the guideline to a markdown file, in which tables, diagrams, and references are presented as text
with their structures preserved. We provide a structured reasoning framework, including a multi-step
decision process: 1) patient symptom analysis, 2) risk stratification, 3) treatment stage identification,
and 4) personalized treatment recommendations. With the structured reasoning steps, CoT-enhanced
LLM is enforced to follow the decision-making process defined in the CPG to reduce the possibility of
irrational inferences or logical errors.</p>
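        <p>The four-step framework can be sketched as a prompt template. The exact wording used in our experiments may differ; the step descriptions and function names below are illustrative:</p>
        <preformat>
```python
# Sketch of the four-step CoT reasoning framework as a prompt template.
# The exact instructions used in the experiments may differ.
COT_STEPS = [
    "1) Patient symptom analysis: list the key clinical findings.",
    "2) Risk stratification: assign a risk group using the guideline criteria.",
    "3) Treatment stage identification: determine the current treatment phase.",
    "4) Personalized treatment recommendation: cite the guideline section.",
]

def build_cot_prompt(guideline_markdown, patient_description):
    steps = "\n".join(COT_STEPS)
    return (
        "You are a clinical decision-support assistant.\n"
        "Guideline (markdown, structure preserved):\n"
        f"{guideline_markdown}\n\n"
        f"Patient description:\n{patient_description}\n\n"
        "Reason step by step, strictly in this order:\n"
        f"{steps}\n"
    )

prompt = build_cot_prompt("# NCCN ALL (excerpt) ...", "34-year-old male ...")
```
        </preformat>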
        <p>
          Graph-Guided LLM The fourth approach is Graph-Guided LLM, in which the guideline is
transformed into a graph data structure, and the relevant medical concepts are linked with medical ontologies
such as Human Phenotype Ontology [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and Orphanet Rare Disease ontology [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. The LLM extracts
the defined key features and values from the patient description, such as age, comorbidity, and response
to therapy, and constructs queries to retrieve relevant information from the graphs. The decision path is
guided by the patient’s conditions and the outcomes retrieved from the graph. Finally, the treatment
recommendation is generated, including the extracted features from the patient, the decision pathway
selection, and how the path was navigated at each step.
        </p>
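        <p>A condensed sketch of this loop is shown below. Feature extraction is stubbed out with simple string checks (the deployed system performs that step with an LLM), and the graph content and feature names are illustrative:</p>
        <preformat>
```python
# Sketch of the graph-guided decision loop. Feature extraction is stubbed
# (the real system uses an LLM); graph contents are illustrative only.
EDGES = {
    "start": [("ph_positive", "ph_pos_pathway"),
              ("ph_negative", "ph_neg_pathway")],
    "ph_pos_pathway": [("age_under_65", "tki_plus_chemo"),
                       ("age_65_plus", "tki_plus_steroids")],
}

def extract_features(description):
    # Stub: the deployed system prompts an LLM for this step.
    feats = set()
    if "BCR::ABL1" in description:
        feats.add("ph_positive")
    else:
        feats.add("ph_negative")
    if "34-year-old" in description:
        feats.add("age_under_65")
    return feats

def navigate(description):
    """Walk the graph, recording each decision for the explanation trace."""
    feats = extract_features(description)
    node, trace = "start", []
    while node in EDGES:
        for condition, target in EDGES[node]:
            if condition in feats:
                trace.append((node, condition, target))
                node = target
                break
        else:
            break  # no matching edge: stop and report the partial trace
    return node, trace

leaf, trace = navigate("34-year-old male ... BCR::ABL1 fusion ...")
print(leaf)  # tki_plus_chemo
```
        </preformat>
        <p>The recorded trace of (node, condition, target) triples is what allows the final recommendation to cite each decision step and its guideline reference.</p>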
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Architectural Implementation</title>
        <p>We implement the system with a user interface using Streamlit (shown in Figure 3). The interaction flow
of the system is illustrated in Figure 4. The process begins when a user inputs a patient description in
a chatbox. The LLM then processes and analyzes the question and identifies the patient characteristics
that are required by the guideline. Next, the model retrieves and queries the additional input,
which is either the textual content and/or the graph constructed from the CPGs, to find relevant treatment
paths based on the patient characteristics. For the graph-based models, reasoning is generated for each
decision step, including the clinical rationale, supporting evidence from the graph, treatment decisions,
and references to the guidelines. On the UI, the user can choose the LLM and generation strategy. In
the generated responses, the extracted patient characteristics, selected pathway, pathway navigation,
treatment recommendation, and the explanations of each step are displayed as results (Figure 3).</p>
        <p>In practical clinical use, clinicians often consider multiple factors simultaneously, request additional
information, or revise earlier decisions based on new data. We address these
requirements by treating each clinical interaction as a series of events that can be processed, tracked, and
audited independently. The system captures every decision and modification in an event log, storing a
history of the decision-making process. The system can reconstruct the state of any clinical case so
that clinicians can examine how recommendations would have differed under various scenarios.</p>
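        <p>The event-sourced design can be sketched as follows; the event types and payload fields are illustrative, not the system’s actual schema:</p>
        <preformat>
```python
# Sketch of the event-sourced case history described above. Replaying the
# log up to any index reconstructs the case state at that moment.
import time

class CaseLog:
    def __init__(self):
        self.events = []

    def record(self, event_type, payload):
        self.events.append({"ts": time.time(), "type": event_type,
                            "payload": payload})

    def state_at(self, index):
        """Replay events[0:index] to rebuild the case state."""
        state = {}
        for event in self.events[:index]:
            state.update(event["payload"])
        return state

log = CaseLog()
log.record("patient_described", {"subtype": "Ph+ B-ALL", "age": 34})
log.record("info_added", {"ecog": 1})
log.record("decision_revised", {"subtype": "Ph- B-ALL"})

print(log.state_at(2))  # state before the revision
print(log.state_at(3))  # state after the revision
```
        </preformat>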
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Results</title>
      <p>To evaluate the different models, we generated synthetic data for 20 patients, including patient and condition
descriptions, treatment pathways, and recommendations based on the guideline. We first extracted
valid treatment paths from the guideline’s graph data and then generated corresponding patient
descriptions based on these paths using GPT-4o. The patient descriptions include patients’ demographic
information, medical history, clinical symptoms, and test results. Then, we generated treatment
recommendations based on the diagnostic and treatment processes of each pathway as a reference for
model output. The following shows an example of a patient description. The whole dataset is accessible
at: https://github.com/MaastrichtU-IDS/guideline_graph_chatbot
“34-year-old male with newly diagnosed acute lymphoblastic leukemia. Flow cytometry shows B-lineage
ALL (CD19+, CD20+, CD10+, CD34+, TdT+). FISH analysis confirms BCR::ABL1 fusion (p190 variant). No
CNS involvement detected. Medical history notable for well-controlled hypertension on lisinopril 10mg daily.
Performance status ECOG 1. Laboratory studies show normal liver and kidney function.”</p>
      <sec id="sec-4-1">
        <title>4.1. Evaluation Metrics</title>
        <p>
          In our experiments, we used four evaluation metrics to measure the model’s performance regarding
guideline adherence, treatment completeness, path alignment, and answer relevancy. All metrics are
scaled to [0, 1], with higher values indicating better performance.
        </p>
        <sec id="sec-4-1-1">
          <title>Guideline Adherence (GA)</title>
          <p>This metric measures whether the generated responses from the model
follow the specific treatment phases (induction, consolidation, maintenance, and surveillance) and
adhere to the terminology indicated in the CPGs. Let K = {k_1, k_2, ..., k_n} represent the defined key
terms mandated by the CPGs (such as TKI, tyrosine kinase inhibitor, transplantation), let G denote the set of
terms in the expected response (the reference answer), and let R denote the set of terms in the generated response.</p>
          <p>GA = (1/n) ∑_{i=1}^{n} s_i, where s_i = 1 if (k_i ∈ G ∩ R) ∨ (k_i ∉ G ∪ R), s_i = 0.5 if k_i ∈ R ∖ G, and s_i = 0 if k_i ∈ G ∖ R. (1)</p>
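          <p>The GA scoring rule described in this subsection can be implemented directly. The partial-credit assignment below (0.5 for terms generated but not expected, 0 for expected terms that are missing) reflects our reading of the scoring cases, and the term sets are illustrative:</p>
          <preformat>
```python
# Implementation of the GA scoring rule (illustrative term sets).
def guideline_adherence(key_terms, expected_terms, generated_terms):
    G, R = set(expected_terms), set(generated_terms)
    total = 0.0
    for k in key_terms:
        if (k in G and k in R) or (k not in G and k not in R):
            total += 1.0   # generated and expected responses agree on the term
        elif k in R:
            total += 0.5   # term generated but not expected: partial credit
        # term expected but missing from the generation scores 0
    return total / len(key_terms)

K = ["TKI", "transplantation", "consolidation"]
score = guideline_adherence(K, expected_terms=["TKI", "consolidation"],
                            generated_terms=["TKI"])
print(round(score, 3))  # 0.667
```
          </preformat>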
        </sec>
        <sec id="sec-4-1-2">
          <title>Treatment Completeness (TC)</title>
          <p>This metric assesses whether the generated recommendations cover
the complete treatment steps specified in the CPG, including initial treatment, follow-up monitoring,
and potential follow-up treatment. This assessment is important to prevent the model from missing
important treatment steps that could affect the quality of patient care.</p>
          <p>TC = |G_step ∩ R_step| / |R_step| (2)</p>
          <p>where G_step denotes the set of generated treatment steps and R_step the set of required steps in the guideline.</p>
        </sec>
        <sec id="sec-4-1-3">
          <title>Path Alignment (PA)</title>
          <p>This metric measures how closely the LLM’s reasoning paths are consistent
with the expected paths from the graph data structure derived from the CPGs. We examine whether the LLM agent follows
the correct decision points (nodes in the graph) by computing the longest common subsequence (LCS) of
nodes between the LLM agent’s path and the expected path as the measure of alignment. PA = 1 indicates that
the LLM’s reasoning path perfectly aligns with the expected guideline path, while PA = 0 indicates no
overlap between the two paths.</p>
          <p>PA = LCS(L_path, G_path) / |G_path| (3)</p>
          <p>where L_path = [l_1, l_2, …, l_m] denotes the LLM’s reasoning path, G_path = [g_1, g_2, …, g_k] represents the expected
guideline path, and LCS(L_path, G_path) is the length of the longest common subsequence between L_path and G_path.</p>
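          <p>Both metrics are straightforward to compute. The sketch below uses illustrative step and node names, with the LCS computed by the classic dynamic-programming recurrence:</p>
          <preformat>
```python
# Sketch computing TC and PA as defined in this section.
# Step and node names are illustrative, not from the actual guideline graph.
def treatment_completeness(generated_steps, required_steps):
    G, R = set(generated_steps), set(required_steps)
    return len(G.intersection(R)) / len(R)

def lcs_length(a, b):
    # Classic dynamic-programming longest common subsequence.
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def path_alignment(llm_path, guideline_path):
    return lcs_length(llm_path, guideline_path) / len(guideline_path)

required = ["induction", "consolidation", "maintenance", "surveillance"]
generated = ["induction", "maintenance"]
print(treatment_completeness(generated, required))  # 0.5

expected_path = ["diagnosis", "risk", "induction", "consolidation"]
llm_path = ["diagnosis", "induction", "consolidation"]
print(path_alignment(llm_path, expected_path))  # 0.75
```
          </preformat>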
        </sec>
        <sec id="sec-4-1-4">
          <title>Answer Relevancy (AR)</title>
          <p>This metric measures how relevant the generated recommendations are to
the given patient’s case. A high score indicates that the generated recommendations identify and
address the specific characteristics of the patient, such as the ALL subtype, age, and risk factors.</p>
          <p>AR = (1 / ((α + β) · |C|)) ∑_{i=1}^{n} [α · I(c_i ∈ R) + β · I(v_i ∈ R)] (4)</p>
          <p>where C = {c_1, c_2, ..., c_n} denotes the key patient characteristics from the description, V = {v_1, v_2, ..., v_n}
represents the values of these characteristics, R denotes the generated response, I(·) is an indicator function (1 if true, 0 otherwise), and α
and β weight the characteristic terms and their values, respectively (default α = 1, β = 1).</p>
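          <p>A sketch of the AR computation follows; the normalization by (α + β)·|C| keeps the score within [0, 1], and the characteristic names, values, and response text are illustrative:</p>
          <preformat>
```python
# Sketch of the AR metric with illustrative inputs. Normalizing by
# (alpha + beta) * |C| keeps the score within [0, 1].
def answer_relevancy(characteristics, values, response, alpha=1.0, beta=1.0):
    total = 0.0
    for c, v in zip(characteristics, values):
        # Booleans act as the indicator function: True is 1, False is 0.
        total += alpha * (c in response) + beta * (v in response)
    return total / ((alpha + beta) * len(characteristics))

C = ["subtype", "age", "CNS involvement"]
V = ["Ph+ B-ALL", "34", "none"]
response = ("For this 34-year-old patient with Ph+ B-ALL subtype, "
            "start a TKI-based induction regimen.")
print(answer_relevancy(C, V, response))  # 0.5
```
          </preformat>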
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Preliminary results</title>
        <p>The performance results of using GPT-3.5 Turbo, GPT-4, and Llama-2-70B with the four different generation strategies
are presented in the following tables. Among these, GPT-4 outperforms the other two models across all
evaluation metrics. GPT-4 achieved baseline scores ranging from 0.462 to 0.513 across different metrics,
compared to GPT-3.5 Turbo’s range of 0.268 to 0.325 and Llama-2’s range of 0.448 to 0.505. These results
indicate that larger and more advanced language models may contain more medical knowledge and
have better inherent capabilities in understanding and generating clinically relevant recommendations.</p>
        <p>All three LLMs demonstrated a consistent pattern of improvement when progressively advanced
generation strategies were applied. The RAG-based approach outperformed the baseline model by
showing the contribution of external guideline documents. Specifically, RAG-based generation improved
GPT-4’s Guideline Adherence from 0.475 to 0.549 and Treatment Completeness from 0.462 to 0.541.
Similar trends were observed for Llama-2 and GPT-3.5 Turbo.</p>
        <p>The CoT-based model yielded further improvements by enabling better reasoning and sequential
decision-making, leading to increased Treatment Completeness and Path Alignment scores. For example,
CoT-based generation yielded Path Alignment scores of 0.574 for GPT-3.5 Turbo, 0.624 for Llama-2, and
0.639 for GPT-4.</p>
        <p>Among all generation strategies, the Graph-based approach consistently achieved the highest
evaluation scores across all LLM models. GPT-4’s Graph-based generation attained an Answer Relevancy of
0.690, Guideline Adherence of 0.662, Treatment Completeness of 0.685, and a Path Alignment score of
0.698. Similarly, Llama-2 achieved 0.669, 0.652, 0.661, and 0.672 on the same metrics, respectively. These
results demonstrate the efficacy of explicitly incorporating structured knowledge representations during the
generation process.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>As observed from the experiments, the baseline model’s performance showed limited capability in
generating recommendations that adhered strictly to clinical guidelines. The incorporation of RAG
improved performance by providing the models with access to relevant and unstructured guideline
content. However, its effect was comparatively modest, likely due to the limitations of LLMs in
processing lengthy and complex unstructured documents with heterogeneous input (text, table, figure,
diagram, and references).</p>
      <p>By providing structured input from the guidelines, the CoT model can produce consistent and adherent
recommendations. By enforcing step-wise reasoning, the CoT model achieved significant gains in metrics such as
Treatment Completeness and Path Alignment. These improvements demonstrate that reasoning-based
prompting may lead to a better reproduction of clinical decision-making processes.</p>
      <p>The Graph-based generation strategy outperformed all other methods, showing the advantages
of representing clinical guidelines as structured graphs. Graph-based methods encourage models to
generate recommendations that are not only complete and relevant but also strictly aligned with the established treatment pathways
defined in the guidelines. The improvements observed in Path Alignment scores, particularly for
GPT-4, highlight the model’s ability to maintain consistency and logical sequencing in complex treatment
recommendations.</p>
      <p>We observe a consistent ranking of generation strategies across all three models: Baseline &lt; RAG &lt; CoT &lt;
Graph-based. This trend indicates that the knowledge structuring strategies for LLMs to retrieve and
reason are model-agnostic and can be generalized across LLM architectures.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Limitations and Future Work</title>
      <p>This work can be improved in several aspects. For the test dataset in this study, we generated synthetic
patient descriptions using GPT-4o, which may not fully capture real-world patient scenarios. In future
work, we plan to incorporate real patient cases from the MIMIC Clinical Database. The MIMIC
database has clinical notes containing comprehensive patient information, including the brief hospital
course, discharge summaries, prescriptions, patient illness histories, and treatment recommendations
documented by clinicians. Integrating real patient descriptions and their actual clinical outcomes will
enhance the reliability and diversity of the evaluation data and provide a more accurate assessment of
model performance. For further validation, it will be valuable to ask clinicians to evaluate the generated
recommendations and collect feedback from them.</p>
      <p>Moreover, to generalize the proposed methods and enable their application to a broader range
of diseases, a more flexible graph construction approach is required. Although the current method
employs Gemini and other LLMs to automate the extraction process from guideline documents to graph
structures, it still requires a substantial amount of manual work. Specifically, manual corrections are often needed
to adjust the extracted diagrams and validate their structures to ensure they accurately represent the
decision logic of the original guidelines. Future work will focus on improving the methods to extract
content and construct graphs in a more effective and accurate way. In addition, the current graph
construction process is not fully integrated with disease and phenotype ontologies. At present, only
medical entities are linked, and their definitions are added to the graph. Future efforts could aim to
strengthen these connections to enable richer semantic representation and interoperability.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgement</title>
      <p>This work has been supported by ICAI lab GENIUS (Generative Enhanced Next-Generation Intelligent
Understanding Systems), a part of the NWO Long-Term Programme ROBUST initiated by the Innovation
Centre for Artificial Intelligence (ICAI) and by REALM (Real-world data-enabled assessment for health
regulatory decision-making), a project funded by Horizon Europe with grant number 101095435.
The authors used GPT-4 for grammar and spelling checking. The authors reviewed and edited the
content as needed and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Liévin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Hother</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Motzfeldt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Winther</surname>
          </string-name>
          ,
          <article-title>Can large language models reason about medical questions?</article-title>
          ,
          <source>Patterns</source>
          <volume>5</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferdush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Begum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. T.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <article-title>Chatgpt and clinical decision support: scope, application, and limitations</article-title>
          ,
          <source>Annals of Biomedical Engineering</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          <fpage>1119</fpage>
          -
          <lpage>1124</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A.</given-names>
            <surname>Qaseem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Forland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Macbeth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ollenschläger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>van der Wees</surname>
          </string-name>
          , Board of Trustees of the Guidelines International Network,
          <article-title>Guidelines international network: toward international standards for clinical practice guidelines</article-title>
          ,
          <source>Annals of Internal Medicine</source>
          <volume>156</volume>
          (
          <year>2012</year>
          )
          <fpage>525</fpage>
          -
          <lpage>531</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Fast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Busch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fallon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Huppertz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Siepmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Prucker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Truhn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Makowski</surname>
          </string-name>
          , et al.,
          <article-title>Autonomous medical evaluation for guideline adherence of large language models</article-title>
          ,
          <source>NPJ Digital Medicine</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>14</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Petro</surname>
          </string-name>
          ,
          <article-title>Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine</article-title>
          ,
          <source>New England Journal of Medicine</source>
          <volume>388</volume>
          (
          <year>2023</year>
          )
          <fpage>1233</fpage>
          -
          <lpage>1239</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zakka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Shad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaurasia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. R.</given-names>
            <surname>Dalal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Moor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Phillips</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Alexander</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Ashley</surname>
          </string-name>
          , et al.,
          <article-title>Almanac: retrieval-augmented language models for clinical medicine</article-title>
          ,
          <source>NEJM AI</source>
          <volume>1</volume>
          (
          <year>2024</year>
          )
          AIoa2300068.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ong</surname>
          </string-name>
          ,
          <article-title>Potential for gpt technology to optimize future clinical decision-making using retrieval-augmented generation</article-title>
          ,
          <source>Annals of Biomedical Engineering</source>
          <volume>52</volume>
          (
          <year>2024</year>
          )
          <fpage>1115</fpage>
          -
          <lpage>1118</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ruan</surname>
          </string-name>
          ,
          <article-title>MedDM: LLM-executable clinical guidance tree for clinical decision-making</article-title>
          ,
          <source>arXiv preprint arXiv:2312.02441</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Oniani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Visweswaran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kapoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kooragayalu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Polanska</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Enhancing large language models for clinical decision support by incorporating clinical practice guidelines</article-title>
          ,
          <source>in: 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI)</source>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>694</fpage>
          -
          <lpage>702</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <collab>National Comprehensive Cancer Network (NCCN)</collab>
          ,
          <source>NCCN Clinical Practice Guidelines for Acute Lymphoblastic Leukemia</source>
          ,
          <year>2024</year>
          . URL: https://www.nccn.org/guidelines/guidelines-detail?category=1&amp;id=1410, accessed: 2024-03-08.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Castellanos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Caufield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chute</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cruz-Rojo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dahan-Oliel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davids</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>de Dieuleveult</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>de Vries</surname>
          </string-name>
          , et al.,
          <article-title>The human phenotype ontology in 2024: phenotypes around the world</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>52</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Rath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Olry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ouchenne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lucano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lagorce</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hanauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Lanneau</surname>
          </string-name>
          ,
          <source>Orphanet Rare Disease Ontology</source>
          ,
          <year>2023</year>
          . URL: https://www.orphadata.com/data/ontologies/ordo/last_version/ORDO_en_4.4.owl.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>