<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Applying Retrieval-Augmented Generation on Open LLMs for a Medical Chatbot Supporting Hypertensive Patients</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gianluca Aguzzi</string-name>
          <email>gianluca.aguzzi@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Magnini</string-name>
          <email>matteo.magnini@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Pio Salcuni</string-name>
          <email>giuseppepio.salcuni@studio.unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Ferretti</string-name>
          <email>s.ferretti@unibo.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sara Montagna</string-name>
          <email>sara.montagna@uniurb.it</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, University of Bologna</institution>
          ,
          <addr-line>Via dell'Università 50, Cesena</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Pure and Applied Sciences, University of Urbino</institution>
          ,
          <addr-line>Piazza della Repubblica 13, Urbino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Disease management, especially for chronic conditions or the elderly, involves continuous monitoring, lifestyle adjustments, and frequent healthcare interactions, necessitating effective home-care ICT solutions. To address these needs, chatbot technology has emerged as a promising tool for supporting patients in managing their health autonomously. In this context, chatbots must provide timely and accurate information and continuous empathetic support to maintain patient engagement. Additionally, data privacy concerns necessitate avoiding third-party Natural Language Processing and Generation services. To meet these needs, in this paper we propose the development of a chatbot to support patients in managing chronic conditions, focusing on hypertension. In particular, we use open-source large language models and avoid proprietary systems because of privacy requirements. Given that their performance on state-of-the-art metrics does not match that of third-party services, we incorporate retrieval-augmented generation (RAG) techniques, building a knowledge base with input from medical professionals to enhance model performance. We evaluated seven open-source models, including two specifically trained in the medical domain. Our results indicate that RAG significantly improves performance, surpassing that of specialised medical-domain models without RAG. This approach offers a promising solution for managing chronic conditions independently and securely.</p>
      </abstract>
      <kwd-group>
        <kwd>Chronic Disease Self-management</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Chronic disease management presents a substantial challenge for both healthcare systems and patients.
Conditions like hypertension require continuous monitoring, lifestyle adjustments, and often involve
significant healthcare costs. This burden is amplified by the need for frequent interaction with healthcare
professionals, leading to increased wait times and potential access barriers for patients. To address
this, we propose the development of a chatbot designed to support patients in the self-management
of chronic conditions, with a focus on hypertension. The goal is to empower hypertensive patients to
manage their condition more independently by providing them with timely, accurate, and empathetic
guidance, particularly aimed at periodically acquiring patient vital signs and at maintaining a healthy
lifestyle [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>Two critical requirements emerge. First, the interaction between the patient and the chatbot must be
as empathetic as possible to ensure patients remain motivated and engaged in managing their condition.
At the same time, the information provided by the chatbot must be highly accurate, as there is no
healthcare professional directly mediating the conversation. Second, given that patients are likely
to share personal health data during these interactions, the chatbot must comply with data privacy
regulations, which precludes the use of third-party systems for natural language processing (NLP) and
natural language generation (NLG).</p>
      <p>
        Considering these requirements, we look to large language models (LLMs) as the core technology
for our chatbot. Their demonstrated ability to produce trustworthy, reliable and empathetic text makes
them an ideal choice for this application. For instance, [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] shows that patients who received
responses in a public online forum from both physicians and an LLM-based chatbot preferred the
chatbot replies, rating both the quality and the empathy of the chatbot-generated responses higher
than those of physicians. We focus on open LLMs, including both models pre-trained specifically for
medical domains and general-purpose models, to avoid reliance on proprietary third-party services.
To enhance the performance of these open LLMs, we explored the integration of retrieval-augmented
generation (RAG) techniques [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>This approach involved constructing a knowledge base from data collected from medical professionals
and then using it as the retrieval source for RAG. This enabled us to perform a
comprehensive comparison of various retrieval strategies and LLMs (both specialised and general-purpose).
Our findings demonstrate that RAG is a vital technique for significantly improving the performance of
these models, even surpassing the performance of specialised models in nearly all the tested cases.</p>
      <p>The remainder of this paper is organised as follows. Section 2 provides background information
and motivation for our work in the context of chronic disease management and the development of
a chatbot for hypertensive patients; Section 3 describes the methods used in our study, including the
RAG technique, and details the dataset used for our experiments; Section 4 presents the results of our
study and discusses the implications of our findings; and Section 5 concludes the paper with a summary
of our work and suggestions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Motivation</title>
      <p>LLM applications in the healthcare field span various contexts and serve multiple purposes. Over the
past year, the number of research studies focusing on LLMs has grown exponentially, reflecting the
increasing interest within the scientific community working on artificial intelligence (AI) applied to
medicine. The adoption of LLMs has become a focal point for numerous investigations in this area.</p>
      <p>
        Potential application areas, as outlined by [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], can be divided into three main categories:
      </p>
      <p>Patient Care. There is general agreement that LLM-based chatbots may become a methodological
tool assisting physicians or nurses in various areas of medicine during their clinical practice. For
example, they may support clinical decisions by abstracting key results from the literature, or
detect medical errors by identifying discrepancies between diagnosis and treatment.
On the patient side, they may be a crucial component for bootstrapping patient empowerment by
providing trustworthy and empathetic answers to user queries. In this context, they must resemble
a dialogue between the physician and the patient, which is a key element of effective
and compassionate care. Moreover, they should be able to proactively suggest actions by reasoning
on tracked patient activities and vital-sign dynamics.</p>
      <p>Research. In research, an LLM-based chatbot may assist basic research by automating certain tasks,
such as data analysis, acquisition and interpretation, summarising information, paraphrasing
text, searching the scientific literature for medical knowledge, and extracting related work.</p>
      <p>Education. In medical education, an LLM-based chatbot may be used to provide teaching material and
as a tool for students, who can benefit from interactive tutoring. In this context, the strong
performance that LLMs have demonstrated in passing medical examinations is noteworthy.</p>
      <p>
        In this study, we focus on the use of LLMs as an integral part of a chatbot designed for supporting
hypertensive patients. This chatbot is designed to collect hypertension parameters, motivate patients
with periodic messages suggesting healthy lifestyle changes, and assist them with any concerns related
to their chronic condition [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ].
      </p>
      <p>
        Three key requirements emerge for such an application:
1. Ethical concerns, including risks of privacy and security [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] must be addressed: third-party
technology, such as ChatGPT, carries an inherent risk of compromising patient privacy, if patients
enter test results, photos of their face, communication information, and more. All of this vital
health information is collected and stored, potentially compromising patient privacy.
2. The system must be highly reliable, with no hallucinations or erroneous information: Before
deploying LLMs in real-world medical environments, it is essential to ensure that models designed
for healthcare are accurate, unbiased, and safe for patient use [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
3. The system should communicate empathetically, motivating the patient, making them feel heard,
and providing ongoing real-time support [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Many of today’s top-performing LLMs are proprietary models with hundreds of billions of parameters
trained on vast amounts of data. However, due to the first requirement, the immediate choice falls on
open-source LLMs that can be deployed locally. While these models show promise in matching the
performance of proprietary counterparts, they have notable limitations, such as generating false or
irrelevant information and mistyped or incomplete sentences, which can undermine the trustworthiness,
accuracy, and regulatory compliance of LLM-generated content. On the positive side, open-source
models allow developers to access model weights directly, enabling hosting on their own infrastructure.
Accordingly, to address the second and third requirements, the literature suggests two primary
techniques: querying local databases to complete specific tasks through RAG and fine-tuning, for which
technical details will be provided in Section 3. These approaches are recommended to improve model
performance and enhance the conversational experience according to a domain-specific dataset.</p>
      <sec id="sec-2-1">
        <title>2.1. Related Work</title>
        <p>RAG in the Medical Domain. RAG represents a significant innovation in the
integration of information retrieval and generative models: it gives the model access to a specific medical knowledge
base for generating precise and contextually relevant responses, thus enabling safer and more effective
deployment of LLMs in healthcare. This is particularly useful in healthcare,
a field that demands high precision and sensitivity, where accuracy and specificity of information are
not just metrics but directly affect the quality of patient care.</p>
        <p>
          Most studies in the literature report efforts to support clinical decision-making in
specific medical domains by deploying clinical decision support systems based on LLMs that exploit
RAG over relevant national guidelines in diverse contexts. For example, in [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], a RAG approach
is implemented to improve the accuracy of LLMs, ensuring their outputs are consistent with expert
knowledge in the field of digestive diseases. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] introduces a new LLM framework that combines clinical
guidelines with RAG to enhance text interpretation for managing Hepatitis C Virus infection. The
findings indicate that this integrated framework outperforms the baseline GPT-4 Turbo model in
delivering precise, guideline-specific recommendations.
        </p>
        <p>
          However, implementing RAG in healthcare presents numerous challenges due to the inherent
diversity of clinical practices across healthcare institutions. These variations are influenced by multiple
factors, such as patient demographics, available resources, geographical context, and specific cultural
sensitivities. Therefore, tailoring RAG solutions to local needs is crucial. The models must adapt to
variables like resource availability, specific medical protocols, and ethical and cultural practices. In this
context, RAG’s flexibility becomes essential, requiring a modular and highly configurable architecture.
        </p>
        <p>
          Fine-tuning in the Medical Domain. Fine-tuning LLMs has demonstrated impressive results on
medical benchmarks. The literature is rich with examples that demonstrate the successes of fine-tuning
in various specialised medical domains, highlighting its effectiveness and versatility in enhancing
model performance [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. For instance, Wang et al. [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] deploy the Llama 2 base model to generate
training sentences that incorporate clinical concepts drawn from standardised vocabularies, particularly
focusing on rare diseases. This approach utilises resources like the Human Phenotype Ontology
(HPO) to ensure accurate concept normalisation. By aligning model outputs with established medical
terminologies, the study aimed to mitigate issues related to underdiagnosis, misdiagnosis, and
mistreatment.
        </p>
        <p>In this paper we focus on experimenting with RAG, since the current dataset size does not
allow for efficient fine-tuning. Moreover, the RAG approach allows the knowledge base to be updated
dynamically, ensuring that the model remains up to date with recent information.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Materials &amp; Methods</title>
      <sec id="sec-3-1">
        <title>3.1. Fine-tuning</title>
        <p>LLMs are initially trained on extensive, general-purpose text corpora with the aim of predicting the next
token in a sequence, adjusting model parameters to maximise the likelihood of accurate predictions.
This process, known as pre-training, equips the model with a broad understanding of language and
serves as a foundation for fine-tuning.</p>
        <p>During fine-tuning, the model is further trained on a smaller, domain-specific dataset, allowing it to
adapt to the nuances of a particular task or field. This approach is computationally efficient compared
to training from scratch, as it leverages the model’s pre-existing knowledge while refining it for specific
applications. Fine-tuning adjusts the model’s weights based on new data to better align the model with
the intended task, such as summarisation, translation, or domain-specific text generation.</p>
        <p>Full Fine-Tuning. This method updates all the model’s parameters to specialise its performance for a
specific task. While this enhances model capabilities, it is resource-intensive, requiring significant
computational power and memory, comparable to initial training.</p>
        <p>
          Parameter-Efficient Fine-Tuning (PEFT). A more efficient approach, PEFT [
          <xref ref-type="bibr" rid="ref12">12, 13</xref>
          ] updates only
a subset of the model’s parameters, freezing the rest. This reduces memory requirements
while retaining the model’s general linguistic knowledge. Techniques like low-rank
adaptation (LoRA) [14] train small low-rank matrices that are added to the frozen weights, minimising resource usage, and Quantized
LoRA (QLoRA) [15] further reduces memory demands by quantising the frozen base-model
weights to lower precision while training the LoRA adapters.
        </p>
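        <p>As a rough illustration of why low-rank adaptation is cheaper than full fine-tuning, the sketch below counts the trainable parameters for a single weight matrix. The function names, the 4096 × 4096 layer size, and the rank of 8 are illustrative assumptions, not values used in this study.</p>
<preformat>
```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when the whole d_out x d_in weight matrix is trained."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the two low-rank factors B (d_out x rank) and A (rank x d_in).

    The frozen matrix W is left untouched; only B and A are trained, and the
    adapted layer uses W + B @ A.
    """
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, chosen only for illustration.
d_out, d_in, rank = 4096, 4096, 8
full = full_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```
</preformat>
        <p>Under these assumed sizes, the two low-rank factors hold well under 1% of the parameters of the full matrix, which is why only the adapters need gradients and optimiser state.</p>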
        <p>By selecting the appropriate fine-tuning method, models can be optimised for specific use cases,
maintaining performance while minimising resource consumption.</p>
      </sec>
      <sec id="sec-3-rag">
        <title>3.2. RAG</title>
        <p>Traditionally, LLMs generate responses solely based on patterns and information learned during the
training phase. However, these models are inherently limited by the data on which they were trained,
often leading to responses that may lack depth or specific knowledge. RAG overcomes this limitation
by drawing on external data during the response generation process.</p>
        <p>RAG operates in two phases: first, it retrieves relevant information from a large
dataset or knowledge base in response to a query; second, it uses that information to inform and guide the
generation of the response. This approach enables software agents – such as chatbots – to provide more
accurate and context-specific answers by supplementing the model’s internal knowledge with relevant
external information, such as private documentation, PDF files, or SQL databases. Figure 1 summarises
the key components of a RAG system.</p>
        <p>The retriever in a RAG system identifies relevant information to help answer a query. It begins by
loading documents, splitting them into smaller fragments (a.k.a., chunks), and converting these chunks
into embedding vectors using specialised algorithms. These vectors are stored in an indexed knowledge
base for efficient retrieval.</p>
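        <p>The splitting step can be sketched as follows. This is a minimal character-based splitter with overlapping windows; the function name, the chunk size, and the overlap are assumptions chosen for illustration, and the embedding step is deliberately left out because it depends on the embedding model in use.</p>
<preformat>
```python
def split_into_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters that share `overlap` characters."""
    if not (chunk_size > overlap >= 0):
        raise ValueError("need chunk_size > overlap >= 0")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Tiny example with a hypothetical 10-character document.
print(split_into_chunks("abcdefghij", chunk_size=4, overlap=1))
```
</preformat>
        <p>The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, at the cost of storing some text twice.</p>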
        <p>[Figure 1: Key components of a RAG system. The embedding model turns the query into a query embedding; a similarity search over the vector store returns the relevant context; the prompt is combined with that context; and the LLM responds using information from the relevant context.]</p>
        <p>When a new query is processed, the system generates a query vector and matches it with stored
document vectors using vector similarity techniques. Two main types of embeddings are used:</p>
        <p>Sparse Embeddings. These rely on keyword matches, using algorithms like TF-IDF [16] and BM25 [17].
They are computationally efficient but may struggle with synonyms and semantic meaning.</p>
        <p>Dense Embeddings. These are generated by language models like BERT [18], capturing deeper
semantic relationships and enabling retrieval based on meaning rather than exact words.</p>
        <p>Hybrid approaches, combining both methods, can optimise retrieval by balancing speed and semantic
depth, leading to more accurate and context-aware responses.</p>
        <p>The generator is a language model that produces the final text in response to a query. It does not
work in isolation but relies on the context provided by the retriever to guide its response, ensuring that
the output is both plausible and detailed. Once the most relevant passages are retrieved, the generator
synthesises this information and expresses it in natural language to produce the final answer.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.3. RAG vs. Fine-tuning</title>
        <p>RAG and fine-tuning represent two distinct approaches to enhancing foundational models, and their
combination can yield significant performance improvements.</p>
        <p>Knowledge Integration vs. Task Specialisation. RAG integrates external knowledge into the
generation process, providing the model with greater versatility and the ability to stay current. In
contrast, fine-tuning hones the model for a specific task, thereby improving task-specific accuracy
and efficiency.</p>
        <p>Dynamic vs. Static Learning. RAG facilitates dynamic access to up-to-date external data, ensuring
the model remains relevant with recent information. Fine-tuning, however, is a static method,
updating the model solely based on the most recent training cycle.</p>
        <p>Generalisation vs. Customisation. RAG preserves the model’s generality by enriching it with
external data, thus increasing adaptability across multiple tasks. Fine-tuning, in contrast, tailors the
model for a specific use case, which may limit its performance in more generalised applications.</p>
        <p>Resource Demands. RAG requires a continuous retrieval and integration mechanism for external
data, potentially making it resource-intensive during runtime. Fine-tuning, while resource-heavy
during the training phase, does not demand additional resources during deployment.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Chosen Approach</title>
        <p>
          In our study, we opted to experiment with RAG techniques to enhance the performance of open LLMs
for the development of a chatbot supporting hypertensive patients. We selected different retrieval
strategies and several LLMs, including both general-purpose and medical-domain-specific models, to
evaluate the impact of RAG on model performance. We chose RAG over fine-tuning for several reasons.
First, we plan to improve our previous work on the hypertension chatbot [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Because the users of
the chatbot are intended to be numerous and diverse, we need a system that can adapt to different
contexts and provide accurate information. In this respect, RAG is a more suitable approach, as it
allows the model to dynamically access a wide range of external data sources. Second, fine-tuning
requires more extensive computational resources, which we currently lack. Third, we want to avoid
undesirable behaviours such as model drift (e.g., due to outdated knowledge), hallucinations, and biases
that can occur during fine-tuning. For all these reasons, we believe that RAG is the most appropriate
approach for our study.
        </p>
        <p>Concerning the retrievers, we tested three main strategies:</p>
        <p>Base Retriever. This strategy retrieves chunks by vector similarity with the query, using maximum marginal
relevance [19] for selection. To generate the embeddings, we used the nomic-embed-text
model [20], available through Ollama [21].</p>
        <p>Multi-Query Retriever. To overcome the limits of the base retriever, this method uses multiple queries
to retrieve information, enhancing the diversity and relevance of the retrieved data. The queries
are generated by an LLM – Llama3.1 [22] – from the original query; for each generated query, the
retriever then retrieves the relevant chunks.</p>
        <p>Ensemble Retriever. This strategy combines the outputs of multiple retrievers to enhance retrieval
performance. We used two retrievers – the base retriever and a second retriever based on BM25
– and merged their results with the reciprocal rank fusion
algorithm [23] to select the most relevant information.</p>
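        <p>The fusion step of the ensemble retriever can be sketched as follows. This is a minimal stand-alone version of reciprocal rank fusion, not the implementation used in this study; the document identifiers are hypothetical, and the smoothing constant k = 60 is a commonly used default that is assumed here.</p>
<preformat>
```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into a single ranking.

    Each document receives 1 / (k + rank) from every list in which it appears
    (rank starts at 1); documents are returned by decreasing total score.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Ties are broken by document id so that the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a BM25 retriever.
dense_results = ["q3", "q1", "q7"]
bm25_results = ["q1", "q9", "q3"]
print(reciprocal_rank_fusion([dense_results, bm25_results]))
```
</preformat>
        <p>Because only ranks are used, the two retrievers do not need comparable score scales, which is convenient when mixing cosine similarities with BM25 scores.</p>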
        <p>The documents that we use for RAG consist of a collection of question-answer pairs
extracted from medical consultations. The dataset has 1,473 question-answer pairs, with each pair
consisting of a question that a patient might pose to a chatbot and the corresponding answer that the
chatbot should generate. The questions cover a wide range of topics related to hypertension, including
symptoms, causes, treatments, and lifestyle recommendations. All the questions and answers are in
Italian, as the chatbot is intended for use in Italy. All the answers have been reviewed by medical
professionals to ensure their accuracy and relevance to the questions.</p>
        <p>The LLMs we evaluated include both general-purpose models (Llama 3.1, Qwen2, Mistral
Nemo, Phi3, Gemma2) and medical-domain-specific models (Llama3.1-Medical, Qwen2-Medical);
see Table 1 for more details.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental setup</title>
        <table-wrap id="tab-1">
          <label>Table 1</label>
          <table>
            <thead>
              <tr><th>Model</th><th>Description</th></tr>
            </thead>
            <tbody>
              <tr><td>Llama 3.1</td><td>A general-purpose LLM based on the Llama3 architecture with 8B parameters, trained on a diverse range of text data [24].</td></tr>
              <tr><td>Llama3.1-Medical</td><td>A medical-domain-specific version of Llama3.1, fine-tuned on medical data to enhance its performance in healthcare applications: https://ollama.com/qordmlwls/llama3.1-medical.</td></tr>
              <tr><td>Qwen2</td><td>A general-purpose LLM with 7B parameters, trained on 29 different languages to improve cross-lingual performance [25].</td></tr>
              <tr><td>Qwen2-Medical</td><td>A medical-domain-specific version of Qwen2, fine-tuned on medical data to improve its performance in healthcare tasks: https://ollama.com/echelonify/med-qwen2.</td></tr>
              <tr><td>Mistral Nemo</td><td>A 12B parameter model with a large context window (128K tokens) developed by Mistral AI in collaboration with Nvidia: https://ollama.com/library/mistral-nemo.</td></tr>
              <tr><td>Phi3</td><td>A relatively small LLM with 3B parameters, trained by Microsoft on filtered high-quality data [26].</td></tr>
              <tr><td>Gemma2</td><td>A 9B parameter model based on Deepmind Gemini developed by Google [27].</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>We assessed the effectiveness of various RAG techniques using the RAGAS framework. This framework
provides a comprehensive suite for evaluating different metrics and employs a test set derived from our
training set. Adhering to the RAGAS methodology, instead of partitioning the dataset into training and
test sets, we leveraged an external LLM (GPT-4o) to generate a test set of 20 question-context-answer
triplets. These generated triplets maintain statistical relevance to the original dataset. We evaluated a
range of state-of-the-art open-source LLMs, both specialised (medical domain) and general-purpose
(see Section 3.4), with and without RAG, to check the impact of these different configurations. Each
model was evaluated using the following prompt:</p>
        <p>You are an AI medical assistant specializing in hypertension. Provide detailed and evidence-based
answers, using clear and accessible language. Always respect patient privacy, and if you are unsure of
the answer, state “I am not sure of the answer.” Base your response on the provided context to answer
accurately. Include current recommendations and explain medical concepts in an understandable way.
**Context:** { context }
**Question:** { question }</p>
        <p>Here, { context } is the context provided to the model, and { question } is the question taken from
the test set. With RAG, the context is the information retrieved from the knowledge base; otherwise,
it is an empty string.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Metrics</title>
        <p>To assess the performance of the different RAG systems, we employed a set of metrics specifically chosen
to evaluate the retrieval and generation aspects of these systems. Following the RAGAS framework, we
focused on three key metrics: Answer relevancy for assessing the relevance of the generated response to
the original question; Answer correctness for assessing the factual correctness of the generated response;
Faithfulness for evaluating the alignment of the generated response with the information present in the
retrieved context. Each metric uses an LLM as a reference model to analyse the quality of the generated
responses. For this study, we used GPT-4o as the reference model as it is one of the most advanced
LLMs currently available.</p>
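        <p>Answer relevancy. This metric is computed by measuring the cosine similarity between the
embedding of the original question, denoted as <inline-formula><tex-math>E_o</tex-math></inline-formula>, and the embeddings of <inline-formula><tex-math>N</tex-math></inline-formula> generated questions, denoted as
<inline-formula><tex-math>E_{g_i}</tex-math></inline-formula> where <inline-formula><tex-math>i \in \{1, \ldots, N\}</tex-math></inline-formula>. Formally:
<disp-formula><tex-math>\text{answer relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos(E_{g_i}, E_o) = \frac{1}{N} \sum_{i=1}^{N} \frac{E_{g_i} \cdot E_o}{\lVert E_{g_i} \rVert \, \lVert E_o \rVert}</tex-math></disp-formula>
where:</p>
        <p>• <inline-formula><tex-math>\cos(E_{g_i}, E_o)</tex-math></inline-formula> is the cosine similarity between the embedding of the <inline-formula><tex-math>i</tex-math></inline-formula>-th generated question and
the embedding of the original question.</p>
        <p>• <inline-formula><tex-math>E_{g_i}</tex-math></inline-formula> is the embedding of the <inline-formula><tex-math>i</tex-math></inline-formula>-th generated question.</p>
        <p>• <inline-formula><tex-math>E_o</tex-math></inline-formula> is the embedding of the original question.</p>
        <p>• <inline-formula><tex-math>N</tex-math></inline-formula> is the number of generated questions.</p>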
        <p>Note that while the cosine similarity ranges from -1 to 1, in practice, the answer relevance score typically
falls between 0 and 1. This metric provides a measure of how well the generated responses align with
the original question, with higher values indicating greater relevance.</p>
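        <p>The averaging of cosine similarities described above can be sketched in a few lines. The embeddings are assumed to be given as plain lists of numbers; the two-dimensional vectors are toy values chosen for illustration, not real embeddings.</p>
<preformat>
```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy(original: list[float], generated: list[list[float]]) -> float:
    """Mean cosine similarity between each generated-question embedding and the original one."""
    return sum(cosine(g, original) for g in generated) / len(generated)


# Toy embeddings: one generated question matches the original, one is orthogonal to it.
original_question = [1.0, 0.0]
generated_questions = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original_question, generated_questions))
```
</preformat>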
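        <p>Answer correctness. This metric, denoted as <inline-formula><tex-math>AC</tex-math></inline-formula>, measures the accuracy of a generated answer <inline-formula><tex-math>a</tex-math></inline-formula>
with respect to a ground truth answer <inline-formula><tex-math>g</tex-math></inline-formula>. It combines two key aspects: factual correctness (<inline-formula><tex-math>FC</tex-math></inline-formula>) and
semantic similarity (<inline-formula><tex-math>SS</tex-math></inline-formula>), both ranging from 0 to 1, with higher values indicating greater accuracy. In
RAGAS, <inline-formula><tex-math>FC</tex-math></inline-formula> is computed using a language model to quantify the factual overlap between <inline-formula><tex-math>a</tex-math></inline-formula> and <inline-formula><tex-math>g</tex-math></inline-formula>:
True Positive (TP): Statements present in both <inline-formula><tex-math>a</tex-math></inline-formula> and <inline-formula><tex-math>g</tex-math></inline-formula>; False Positive (FP): Statements present in <inline-formula><tex-math>a</tex-math></inline-formula> but
not in <inline-formula><tex-math>g</tex-math></inline-formula>; False Negative (FN): Statements present in <inline-formula><tex-math>g</tex-math></inline-formula> but not in <inline-formula><tex-math>a</tex-math></inline-formula>. The <inline-formula><tex-math>F_1</tex-math></inline-formula> score, a harmonic mean of
precision and recall, is used to calculate <inline-formula><tex-math>FC</tex-math></inline-formula>:
<disp-formula><tex-math>FC = F_1\ \text{score} = \frac{|TP|}{|TP| + 0.5 \times (|FP| + |FN|)}</tex-math></disp-formula></p>
        <p>For <inline-formula><tex-math>SS</tex-math></inline-formula>, RAGAS employs a language model to measure the semantic resemblance between <inline-formula><tex-math>a</tex-math></inline-formula> and <inline-formula><tex-math>g</tex-math></inline-formula>.
The model generates embeddings for both answers, and their cosine similarity is calculated, resulting
in the <inline-formula><tex-math>SS</tex-math></inline-formula> score. Finally, <inline-formula><tex-math>AC</tex-math></inline-formula> is computed as a weighted average of <inline-formula><tex-math>FC</tex-math></inline-formula> and <inline-formula><tex-math>SS</tex-math></inline-formula>:
<disp-formula><tex-math>AC = w_1 \times FC + w_2 \times SS</tex-math></disp-formula></p>
        <p>where <inline-formula><tex-math>w_1</tex-math></inline-formula> and <inline-formula><tex-math>w_2</tex-math></inline-formula> are user-defined weights that determine the relative importance of factual correctness
and semantic similarity, respectively, with <inline-formula><tex-math>w_1 + w_2 = 1</tex-math></inline-formula>.</p>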
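        <p>As a worked example of the two formulas above, the sketch below computes the factual-correctness score from statement counts and then the weighted average. The statement counts, the semantic-similarity value, and the weights 0.75 and 0.25 are assumptions chosen for illustration; in RAGAS the counts and the similarity are produced by a language model.</p>
<preformat>
```python
def factual_correctness(tp: int, fp: int, fn: int) -> float:
    """F1 score from counts of true-positive, false-positive and false-negative statements."""
    denominator = tp + 0.5 * (fp + fn)
    return tp / denominator if denominator else 0.0


def answer_correctness(tp: int, fp: int, fn: int, semantic_similarity: float,
                       w_fc: float = 0.75, w_ss: float = 0.25) -> float:
    """Weighted average of factual correctness and semantic similarity."""
    if abs(w_fc + w_ss - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return w_fc * factual_correctness(tp, fp, fn) + w_ss * semantic_similarity


# Hypothetical answer: 3 statements shared with the ground truth,
# 1 unsupported statement, and 1 ground-truth statement that was missed.
print(factual_correctness(tp=3, fp=1, fn=1))
print(answer_correctness(tp=3, fp=1, fn=1, semantic_similarity=0.9))
```
</preformat>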
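        <p>
          Faithfulness. Let <inline-formula><tex-math>a</tex-math></inline-formula> be a generated answer and <inline-formula><tex-math>c</tex-math></inline-formula> be the given context. Let <inline-formula><tex-math>s</tex-math></inline-formula> represent a claim, where
a claim is defined as a unit of information that can be independently verified. Define the set of claims in
<inline-formula><tex-math>a</tex-math></inline-formula> as <inline-formula><tex-math>S = \{s_1, s_2, \ldots, s_n\}</tex-math></inline-formula>, where <inline-formula><tex-math>|S| = n</tex-math></inline-formula>. The faithfulness score, <inline-formula><tex-math>F</tex-math></inline-formula>, is calculated as follows:
<disp-formula><label>(1)</label><tex-math>F(a, c) = \frac{|\{s \in S \mid s \text{ can be inferred from } c\}|}{n}</tex-math></disp-formula>
where <inline-formula><tex-math>F(a, c) \in [0, 1]</tex-math></inline-formula>. A higher score indicates greater factual consistency with the provided
context.
        </p>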
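        <p>Once each claim has been judged against the context, the faithfulness score reduces to a simple ratio. The sketch below assumes the per-claim verdicts are already available as booleans (in RAGAS they come from a language model); the function name and the example verdicts are illustrative.</p>
<preformat>
```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of the answer's claims that can be inferred from the context."""
    if not claim_supported:
        raise ValueError("the answer must contain at least one claim")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical answer with four claims, three of which are supported by the context.
print(faithfulness([True, True, False, True]))
```
</preformat>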
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Results</title>
        <p>This section presents the findings of our experiments, focusing on the performance comparison of
RAG-based systems against plain LLM systems. We investigate the impact of different retrieval strategies
and the effectiveness of domain-specific fine-tuning.</p>
        <p>RAG vs. Base LLM. Our experiments revealed that RAG-based systems consistently outperformed
their plain LLM counterparts in terms of both answer relevancy and answer correctness (Figure 2). This
observation strongly suggests that integrating external knowledge sources via retrieval augmentation
significantly improves the quality of generated responses. The performance boost (up to 20% in answer
correctness and 40% in answer relevancy in some cases) provided by RAG proved particularly relevant
for models trained on medical data, showcasing the effectiveness of this approach in domain-specific
tasks.</p>
        <p>[Figure 2. Panels: (a) Answer Correctness; (b) Answer Relevancy.]</p>
        <p>Impact of Retrieval Strategies. The choice of retrieval strategy significantly influenced the
performance of the RAG systems. As depicted in Figure 3, each strategy exhibits strengths and weaknesses:
Base Retriever excelled in answer correctness, relevance, and faithfulness, showcasing a good balance
across all metrics; Ensemble Retriever demonstrated superior answer relevance but lagged in answer
correctness and faithfulness; MultiQuery Retriever showed promise in answer relevance and
faithfulness but suffered in answer correctness, potentially due to the increased likelihood of incorporating
irrelevant information from the larger context. These results underscore the importance of aligning the
retrieval strategy with the specific task and model. For instance, LlamaMed performed best with the
Base Retriever, while Mistral Nemo excelled with the MultiQuery Retriever. A detailed metric-based
analysis is crucial for identifying the optimal strategy for a given scenario.</p>
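        <p>One way to make such a metric-based comparison concrete is to aggregate the per-metric scores of each retrieval strategy with task-specific weights and pick the highest aggregate. The sketch below is only an illustration of that idea: the function name, the weighting scheme, and the dictionary layout are assumptions and do not correspond to the procedure used in the experiments.</p>
        <preformat>
```python
from typing import Mapping


def best_strategy(
    scores: Mapping[str, Mapping[str, float]],
    weights: Mapping[str, float],
) -> str:
    """Return the strategy whose weighted sum of metric scores is highest.

    scores maps a strategy name to its metric scores; weights maps a metric
    name to its importance. Metrics absent from weights are ignored.
    """
    if not scores:
        raise ValueError("at least one strategy is required")

    def aggregate(metric_scores: Mapping[str, float]) -> float:
        return sum(weight * metric_scores[metric] for metric, weight in weights.items())

    return max(scores, key=lambda name: aggregate(scores[name]))
```
        </preformat>
        <p>Changing the weights changes the winner: a deployment that prioritises relevance may select a different retriever than one that prioritises correctness, which is why the choice should be made per task and per model.</p>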
        <p>Domain-Specific Fine-tuning. The comparison between base LLMs and specialised LLMs
highlighted the benefits of domain-specific fine-tuning. Specialised models, such as LlamaMed and Qwen2
Med, consistently outperformed their base counterparts (Llama and Qwen2) in terms of answer
relevance (up to 5%) and correctness (up to 3%). This finding underscores the value of tailoring LLMs to
specific domains, particularly in specialised fields like healthcare, to maximise accuracy and relevance
in generated responses.</p>
        <p>RAG’s Advantage Over Specialised LLMs. Our results showed that RAG-augmented base models
consistently outperformed even the specialised LLMs without RAG across all evaluation metrics (see
Figure 2). This observation suggests that when sufficient data is available to construct a comprehensive
knowledge base, employing RAG techniques can be more advantageous than solely relying on
domain-specific fine-tuning. This finding holds significant practical implications, particularly in scenarios where
data scarcity might make fine-tuning challenging or impractical.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Discussion</title>
        <p>The findings from our study underscore the effectiveness of RAG as a robust tool for enhancing the
capabilities of open-source LLMs in the medical domain. By incorporating a knowledge base of verified
medical data, RAG significantly improves both the accuracy and relevance of the chatbot’s responses,
making it a strong option for supporting hypertensive patients.</p>
        <p>One key insight from the study is that RAG-augmented models consistently outperform specialised
LLMs. This suggests that even without large datasets for fine-tuning, RAG can serve as an efficient,
out-of-the-box solution for creating high-performance chatbots in specialised domains. Furthermore,
our experiments emphasise the critical importance of choosing an appropriate retrieval strategy tailored
to the specific task and model. A thorough evaluation of different retrieval methods is essential to
ensure optimal performance before deployment.</p>
        <p>A limitation of this study lies in the evaluation methodology. We utilised the RAGAS framework,
which assesses chatbot responses based on a set of metrics using a reference model. However, given that
the reference model is itself an LLM, it may not capture all aspects of medical knowledge and context.
Future work should therefore incorporate human
evaluation by medical professionals to assess the clinical accuracy and appropriateness of the chatbot’s
responses.</p>
        <p>Another aspect not addressed in our study is the measurement of empathetic responses. While the
generated replies were accurate and relevant with respect to the provided dataset (which was reviewed
by medical professionals), we did not directly evaluate the empathy of the responses, which is crucial
in patient interactions. Future research should focus on establishing metrics for assessing empathy,
potentially involving a group of experts to evaluate responses from this perspective.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study investigated the application of RAG on open-source LLMs for developing an empathetic
medical chatbot designed to support hypertensive patients. Driven by the need for data privacy, accuracy,
and empathetic communication, we explored RAG as a viable alternative to proprietary LLMs and
fine-tuning approaches. Our experiments evaluated several open-source LLMs, including both
general-purpose and medical domain-specific models, across three different retrieval strategies: Base Retriever,
Ensemble Retriever, and MultiQuery Retriever.</p>
      <p>Our findings demonstrate that incorporating RAG significantly improves the performance of
open-source LLMs in this specific medical context. Across all tested models, RAG consistently enhanced both
answer relevance and correctness compared to their base LLM counterparts. Notably, RAG-augmented
models outperformed even the specialised, medically fine-tuned LLMs, indicating the potential of RAG
as a powerful and efficient alternative, particularly in scenarios where fine-tuning data is limited.</p>
      <p>However, this study has limitations. The evaluation relied on the RAGAS framework, which utilises
another LLM (GPT-4) as a reference model. While this approach offers a standardised and comprehensive
evaluation, it might not fully capture the complexities of medical expertise and context. Furthermore,
our evaluation did not explicitly address the empathy of generated responses, a crucial aspect of patient
interaction in healthcare applications.</p>
      <p>Future research should focus on two key areas: first, incorporating human evaluation by medical
professionals to assess the clinical accuracy and appropriateness of the chatbot’s responses; second,
developing and integrating metrics specifically designed to evaluate the empathy and emotional
intelligence of the chatbot’s communication. Addressing these limitations will pave the way for developing
more robust and clinically valuable empathetic chatbots for supporting patients in managing chronic
conditions like hypertension.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially funded by the European Union - NextGenerationEU within the framework
of PNRR Mission 4 - Component 2 - Investment 1.1 under the Italian Ministry of University and Research
(MUR) programme "PRIN 2022" - grant number 2022N2NH42 - SmartShires - CUP: H53D23003570006.</p>
      <p>[12] Z. Han, C. Gao, J. Liu, J. Zhang, S. Q. Zhang, Parameter-efficient fine-tuning for large models: A comprehensive survey, CoRR abs/2403.14608 (2024). doi:10.48550/ARXIV.2403.14608. arXiv:2403.14608.</p>
      <p>[13] S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, B. Bossan, Peft: State-of-the-art parameter-efficient fine-tuning methods, https://github.com/huggingface/peft, 2022.</p>
      <p>[14] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, Lora: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.</p>
      <p>[15] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, Qlora: Efficient finetuning of quantized llms, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL: http://papers.nips.cc/paper_files/paper/2023/hash/1feb87871436031bdc0f2beaa62a049b-Abstract-Conference.html.</p>
      <p>[16] J. Leskovec, A. Rajaraman, J. D. Ullman, Mining of Massive Datasets, 2nd Ed, Cambridge University Press, 2014. URL: http://www.mmds.org/.</p>
      <p>[17] S. E. Robertson, H. Zaragoza, The probabilistic relevance framework: BM25 and beyond, Found. Trends Inf. Retr. 3 (2009) 333–389. doi:10.1561/1500000019.</p>
      <p>[18] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 4171–4186. doi:10.18653/V1/N19-1423.</p>
      <p>[19] J. G. Carbonell, J. Goldstein, The use of mmr, diversity-based reranking for reordering documents and producing summaries, in: W. B. Croft, A. Moffat, C. J. van Rijsbergen, R. Wilkinson, J. Zobel (Eds.), SIGIR ’98: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, August 24-28 1998, Melbourne, Australia, ACM, 1998, pp. 335–336. doi:10.1145/290941.291025.</p>
      <p>[20] Z. Nussbaum, J. X. Morris, B. Duderstadt, A. Mulyar, Nomic embed: Training a reproducible long context text embedder, CoRR abs/2402.01613 (2024). doi:10.48550/ARXIV.2402.01613. arXiv:2402.01613.</p>
      <p>[21] H. Chase, LangChain, 2022. URL: https://github.com/langchain-ai/langchain.</p>
      <p>[22] A. D. et al., The Llama 3 Herd of Models, CoRR abs/2407.21783 (2024). doi:10.48550/ARXIV.2407.21783. arXiv:2407.21783.</p>
      <p>[23] G. V. Cormack, C. L. A. Clarke, S. Büttcher, Reciprocal rank fusion outperforms condorcet and individual rank learning methods, in: J. Allan, J. A. Aslam, M. Sanderson, C. Zhai, J. Zobel (Eds.), Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2009, Boston, MA, USA, July 19-23, 2009, ACM, 2009, pp. 758–759. doi:10.1145/1571941.1572114.</p>
      <p>[24] A. Dubey, et al., The Llama 3 Herd of Models, 2024. doi:10.48550/ARXIV.2407.21783.</p>
      <p>[25] A. Yang, et al., Qwen2 technical report, 2024. doi:10.48550/ARXIV.2407.10671.</p>
      <p>[26] M. Abdin, et al., Phi-3 technical report: A highly capable language model locally on your phone, 2024. URL: https://arxiv.org/abs/2404.14219. arXiv:2404.14219.</p>
      <p>[27] G. Team, Gemma 2: Improving open language models at a practical size, 2024. URL: https://arxiv.org/abs/2408.00118. arXiv:2408.00118.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          , G. Aguzzi,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Pengo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Klopfenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ungolo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <article-title>LLM-based Solutions for Healthcare Chatbots: a Comparative Analysis</article-title>
          ,
          <source>in: 2024 IEEE International Conference on Pervasive Computing and Communications Workshops and other Affiliated Events (PerCom Workshops)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>346</fpage>
          -
          <lpage>351</lpage>
          . doi:10.1109/PerComWorkshops59983.2024.10503257.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Montagna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ferretti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Klopfenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Florio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Pengo</surname>
          </string-name>
          ,
          <article-title>Data decentralisation of llm-based chatbot systems in chronic disease self-management</article-title>
          ,
          <source>in: Proceedings of the 2023 ACM Conference on Information Technology for Social Good</source>
          , GoodIT '23, Association for Computing Machinery, New York, NY, USA,
          <year>2023</year>
          , pp.
          <fpage>205</fpage>
          -
          <lpage>212</lpage>
          . doi:10.1145/3582515.3609536.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Ayers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Poliak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dredze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Leas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. B.</given-names>
            <surname>Kelley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Faix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Longhurst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hogarth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <article-title>Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum</article-title>
          ,
          <source>JAMA Internal Medicine</source>
          (
          <year>2023</year>
          ). doi:10.1001/jamainternmed.2023.1838.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          . URL: https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Clusmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. R.</given-names>
            <surname>Kolbinger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. S.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. I.</given-names>
            <surname>Carrero</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-N.</given-names>
            <surname>Eckardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. G.</given-names>
            <surname>Laleh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M. L.</given-names>
            <surname>Löffler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-C.</given-names>
            <surname>Schwarzkopf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Unger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. P.</given-names>
            <surname>Veldhuizen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Wagner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. N.</given-names>
            <surname>Kather</surname>
          </string-name>
          ,
          <article-title>The future landscape of large language models in medicine</article-title>
          ,
          <source>Communications Medicine</source>
          <volume>3</volume>
          (
          <year>2023</year>
          ) 141. doi:10.1038/s43856-023-00370-1.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. T.</given-names>
            <surname>Moon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Purkayastha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. A.</given-names>
            <surname>Celi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Trivedi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Gichoya</surname>
          </string-name>
          ,
          <article-title>Ethics of large language models in medicine and medical research</article-title>
          ,
          <source>The Lancet Digital Health</source>
          <volume>5</volume>
          (
          <year>2023</year>
          )
          <fpage>e333</fpage>
          -
          <lpage>e335</lpage>
          . doi:10.1016/S2589-7500(23)00083-3.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Haltaufderheide</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ranisch</surname>
          </string-name>
          ,
          <article-title>The ethics of chatgpt in medicine and healthcare: a systematic review on large language models (llms)</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>7</volume>
          (
          <year>2024</year>
          ) 183. doi:10.1038/s41746-024-01157-x.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Giuffrè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kresevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Pugliese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>You</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Shung</surname>
          </string-name>
          ,
          <article-title>Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes</article-title>
          ,
          <source>Liver International</source>
          <volume>44</volume>
          (
          <year>2024</year>
          )
          <fpage>2114</fpage>
          -
          <lpage>2124</lpage>
          . doi:10.1111/liv.15974.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kresevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Giuffrè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ajcevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Accardo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. S.</given-names>
            <surname>Crocè</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. L.</given-names>
            <surname>Shung</surname>
          </string-name>
          ,
          <article-title>Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>7</volume>
          (
          <year>2024</year>
          ) 102. URL: https://doi.org/10.1038/s41746-024-01091-y.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maharjan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garikipati</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. P.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Cyrus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciobanu</surname>
          </string-name>
          , G. Barnes,
          <string-name>
            <given-names>R.</given-names>
            <surname>Thapa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <article-title>Openmedlm: prompt engineering can out-perform fine-tuning in medical question-answering with open-source large language models</article-title>
          ,
          <source>Scientific Reports</source>
          <volume>14</volume>
          (
          <year>2024</year>
          ) 14156. doi:10.1038/s41598-024-64827-6.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Weng</surname>
          </string-name>
          ,
          <article-title>Fine-tuning large language models for rare disease concept normalization</article-title>
          ,
          <source>Journal of the American Medical Informatics Association</source>
          <volume>31</volume>
          (
          <year>2024</year>
          )
          <fpage>2076</fpage>
          -
          <lpage>2083</lpage>
          . doi:10.1093/jamia/ocae133.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Parameter-efficient fine-tuning for large models:</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>