<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>“Doctor, is it normal?” Enabling medical chatbots to provide certified replies to normalcy questions</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Sanna</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Magnolini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Patrizio Bellan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Saba Ghanbari Haez</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marina Segala</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Monica Consolandi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mauro Dragoni</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Free University of Bozen</institution>
          ,
          <addr-line>Bozen</addr-line>
          ,
          <country country="IT">ITALY</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents a work in progress to enhance a Retrieval-Augmented Generation (RAG) pipeline for a medical chatbot designed to address evaluative questions related to patient concerns about “normalcy”. The chatbot uses a novel approach called Hypothetical Document Embeddings (HyDoc) to augment queries and improve the retrieval of certified medical information. Our first evaluation of the chatbot showed that evaluative queries often fail both to retrieve relevant documents and to produce appropriately framed responses. We therefore experiment with an additional naive-RAG module to improve retrieval and with a Chain-of-Thought (CoT) inspired prompting strategy to better contextualize queries and improve response generation. Results demonstrate that this method enhances document retrieval and the framing of generated replies, improving the chatbot’s ability to generate responses that consider emotional and communicative aspects.</p>
      </abstract>
      <kwd-group>
        <kwd>Medical Chatbot</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>Certified Information</kwd>
        <kwd>Chain-of-Thought</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Normalcy is particularly relevant in medical conversations [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The term “normal” is used particularly
to describe wellness, hence the absence of critical health conditions [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. When designing an FAQ-oriented
conversational agent, we should, therefore, expect that a significant share of the questions will concern
normalcy.
      </p>
      <p>
        In our first implementation of an FAQ-based chatbot for pregnancy assistance, we ran a user evaluation
to test user perception towards the agent [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. While our Retrieval-Augmented Generation (RAG) pipeline
has shown promising results in initial testing, it still faces significant challenges, particularly when
handling questions related to normalcy. In our first round of evaluations, we observed that the chatbot
often provides off-topic or incomplete information when responding to normalcy questions, failing to
frame its answers to meet the patient’s emotional and communicative needs. Moreover, the pipeline
often fails to retrieve relevant documents.
      </p>
      <p>
        The core approach behind our chatbot is to provide certified medical information via a RAG pipeline.
Our idea is to enhance document retrieval by augmenting the user query with a generated document
in a framework called Hypothetical Document Embeddings [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Although Large Language Models
(LLMs) may sometimes produce hallucinations, they have shown significant reliability in answering
medical care-related questions [
        <xref ref-type="bibr" rid="ref6 ref7 ref8">6, 7, 8</xref>
        ]. We can, therefore, assume that LLMs can capture the essential
information in a query and generate relevant textual patterns tied to specific medical knowledge; in
other words, LLMs should be able to generate a text that is consistently similar to actual documents
that contain certified information. This text, called Hypothetical Document (HyDoc), can be employed
in RAG systems to enhance document retrieval. Indeed, to search our certified repository, we use the
sentence embedding of the HyDoc to retrieve the most similar text chunks and their related documents.
The retrieved records are then used to generate the final response, which, in this initial implementation,
is a summary of the certified sources. The chatbot’s response not only provides this summary but also
includes references to the original certified documents, specifically highlighting the three key sources
used to generate the answer.
      </p>
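<p>As a minimal sketch of this retrieval step (with a toy bigram-hashing function standing in for the actual sentence-embedding model, which is not specified here; function names are illustrative), the top-3 selection could look like:</p>

```python
import zlib
import numpy as np

def embed(text: str, dim: int = 16) -> np.ndarray:
    # Toy stand-in for a real sentence-embedding model: hash character
    # bigrams into a fixed-size, L2-normalized vector (illustration only).
    v = np.zeros(dim)
    for a, b in zip(text, text[1:]):
        v[zlib.crc32((a + b).encode()) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve_top_3(hydoc: str, chunks: list[str]) -> list[str]:
    # Rank the certified text chunks by cosine similarity to the HyDoc
    # embedding and keep the three most similar ones, as in the pipeline.
    q = embed(hydoc)
    ranked = sorted(chunks, key=lambda c: float(q @ embed(c)), reverse=True)
    return ranked[:3]
```

<p>In the real system the three retrieved chunks point back to their certified source documents, which are then cited in the reply.</p>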
      <p>
        In the case of evaluative questions, i.e. queries on normalcy, it is difficult to retrieve relevant
documents, probably due to the structure of our certified data. Our texts (1512 in total) came from FAQ
sections; therefore, they are not meant to be used in conversational settings, nor is the text structured
to answer evaluative queries. On the other hand, the difficulty in producing an appropriately framed
answer is probably due to the fact that LLMs struggle to understand the pragmatic context of a query [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Producing an appropriately framed response addressing the patient’s medical and emotional needs
is just as important as retrieving accurate information. Merely providing factual content can be
insufficient in a medical setting, potentially eroding patient trust if the response fails to account for
emotional concerns. A chatbot’s inability to properly frame answers in evaluative scenarios can lead to
misunderstandings and disrupt the flow of communication between the patient and the system [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ].
      </p>
      <p>
        This paper explores possible solutions to enhance document retrieval and text generation. On the
one hand, we introduce an additional RAG step, using the patient’s query before generating the HyDoc.
Although less precise, in the case of normalcy questions, this should help retrieve some of the related
documents, also preventing the HyDoc from being off-topic. On the other hand, we experiment with
a prompting strategy where we provide the LLM with the pragmatic context needed to answer the
question. This pragmatic context is made of implicatures and presuppositions [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. The implicature is
the unstated message of a sentence; for instance, the sentence "It’s getting late" typically implies
that we should probably leave. The presupposition, on the other hand, is what is assumed as
common background in a conversation; for instance, the sentence "Her brother is a talented musician"
presupposes that she has a brother.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>
        Retrieval-Augmented Generation (RAG) integrates external knowledge into LLM prompts via data
retrieval, utilizing parametric and non-parametric memory [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ]. By incorporating retrieval
mechanisms, RAG surpasses parametric-only seq2seq models in tasks such as Question Answering (QA) and
summarization, leading to enhanced language generation [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. However, it still faces challenges when
dealing with data outside its training set.
      </p>
      <p>
        The foundational works of Lewis et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and Karpukhin et al. [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] opened new perspectives for
subsequent developments in RAG. For example, the framework introduced in Guu et al. [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] incorporates
a knowledge retrieval mechanism into neural language models, while the approach presented in Izacard
and Grave [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] uses a two-step process combining Dense Passage Retrieval (DPR) with generative
seq2seq LMs, effectively leveraging both methods to generate comprehensive and contextually relevant
responses in open-domain QA tasks.
      </p>
      <p>
        In this work, we introduce an additional step in our pipeline that we call naive-RAG. We call it naive
since we use the user question instead of the HyDoc to query our certified repository. Since our
retrieval system employs a Bi-Encoder model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] in the first stage, this might result in inaccurate
retrieval because of the mismatch between the length of our document chunks and that of the user query. Nonetheless, we
keep employing a Cross-Encoder1 to re-rank the first selection of documents and prevent completely
off-topic retrieval.
      </p>
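<p>The two-stage retrieval described above can be sketched as follows; the scoring functions and names are toy, illustrative stand-ins for the real Bi-Encoder and Cross-Encoder models:</p>

```python
def two_stage_retrieve(query, chunks, bi_score, cross_score, first_n=10, top_k=3):
    # Stage 1: cheap bi-encoder-style scoring over every chunk to build
    # a shortlist of candidates.
    shortlist = sorted(chunks, key=lambda c: bi_score(query, c), reverse=True)[:first_n]
    # Stage 2: slower but more precise cross-encoder-style re-ranking of
    # the shortlist, which guards against off-topic first-stage hits.
    reranked = sorted(shortlist, key=lambda c: cross_score(query, c), reverse=True)
    return reranked[:top_k]

def token_overlap(query, chunk):
    # Toy scorer used here in place of the real encoder models.
    return len(set(query.split()).intersection(chunk.split()))
```

<p>The design choice mirrored here is that the cross-encoder only sees the shortlist, keeping re-ranking affordable while correcting the coarser first-stage scores.</p>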
      <p>
        On the other hand, a vast range of prompting techniques has been proposed to elicit (or at least simulate)
reasoning capabilities in LLMs2. The most renowned method is Chain of Thought (CoT)
prompting. This type of prompting has demonstrated significant effectiveness in enhancing LLMs’ ability
to manage complex reasoning tasks, particularly those involving heterogeneous data such as tables and
questions, with substantial empirical improvements reported [
        <xref ref-type="bibr" rid="ref19 ref20 ref21">19, 20, 21</xref>
        ].
      </p>
      <sec id="sec-2-1">
        <p>1: https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2</p>
        <p>2: For a complete review, see Vatsal, S., &amp; Dubey, H. (2024). A Survey of Prompt Engineering Methods in Large Language Models
for Different NLP Tasks. arXiv preprint arXiv:2407.12994.</p>
        <p>
          Further advancements have shown that generating intermediate reasoning steps—breaking down
problems into manageable parts known as a Chain of Thought (CoT)—significantly improves LLM
performance on complex reasoning tasks [
          <xref ref-type="bibr" rid="ref19 ref22 ref23">19, 22, 23</xref>
          ].
        </p>
        <p>In this work, we propose a simple CoT approach. Instead of solving a reasoning problem, our
goal is to generate an LLM response that addresses both implicature and presupposition.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <sec id="sec-3-1">
        <title>Research questions</title>
        <p>The goal of this paper is to answer the following two research questions:</p>
        <p>RQ1: Does the naive-RAG step improve document retrieval for evaluative questions?
RQ2: Can we produce appropriately framed responses using CoT to provide pragmatic context?</p>
        <p>Regarding RQ1, our goal is to see whether or not we can retrieve a higher number of documents with
the naive-RAG approach. Our retrieval returns, in fact, 3 documents for each query; using the HyDoc
generated after an evaluative question on normalcy often results in few (or even zero) documents being
retrieved. On the other hand, RQ2 aims to explore a more qualitative dimension of text generation,
namely whether presupposition and implicature are correctly addressed in the LLM-generated reply.</p>
        <p>To answer our research questions, we generated 50 questions on normalcy starting from our
documents. Each question has been manually validated to ensure that it is relevant as an evaluative question.
We then tested our original pipeline against a first variation, which adds a naive-RAG step before
HyDoc generation, and a second updated version that also includes an augmented CoT-style prompt to
generate the final reply (see Figure 1). We used Prompt 1 to pass our implicatures and presuppositions
to the LLM.</p>
        <p>Prompt 1 (CoT prompt). “You are an assistant for pregnant women. You have answered the question from
the input message: {hydoc} ### Input: {question} The correct answer is found in this text: {text_1} {text_2}
{text_3}. Keep in mind that the text implies that {Implicature}. Keep in mind that the text presupposes
that {Presupposition}. Re-adapt the answer considering the provided information. ### Answer: ”</p>
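<p>Prompt 1 can be assembled programmatically; the sketch below reuses the template text from the prompt above, while the helper name and its parameters are illustrative assumptions:</p>

```python
# Hypothetical reconstruction of how Prompt 1 could be built in code; the
# template mirrors the placeholders of Prompt 1 ({hydoc}, {question},
# {text_1..3}, implicature, presupposition).
COT_TEMPLATE = (
    "You are an assistant for pregnant women. You have answered the question "
    "from the input message: {hydoc} ### Input: {question} The correct answer "
    "is found in this text: {text_1} {text_2} {text_3}. Keep in mind that the "
    "text implies that {implicature}. Keep in mind that the text presupposes "
    "that {presupposition}. Re-adapt the answer considering the provided "
    "information. ### Answer: "
)

def build_cot_prompt(hydoc, question, texts, implicature, presupposition):
    # One fixed implicature and one fixed presupposition are used for
    # every query in this experimental setting.
    return COT_TEMPLATE.format(
        hydoc=hydoc,
        question=question,
        text_1=texts[0],
        text_2=texts[1],
        text_3=texts[2],
        implicature=implicature,
        presupposition=presupposition,
    )
```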
        <p>The LLM used in this study is Llama3, specifically the 70B version. We accessed the model through
the GroqCloud API3.</p>
        <p>In the case of normalcy questions, to keep the evaluation as simple as possible, we considered one
presupposition and one implicature.</p>
        <p>• Presupposition: what is taken for granted in a sentence. In our case, it is that “The
patient thinks that something is not normal”.
• Implicature: broadly speaking, it includes the patient’s main goal as well as the emotional context. In
our case, it is that “The patient is worried and wants reassurance”.
We decided to use the same presupposition and the same implicature for all our queries for two main
reasons. First, as mentioned in Section 1, LLMs face challenges in understanding the pragmatic context
of a query, so automatically generating implicatures and presuppositions is not the most effective
solution. Second, we want to test the consistency of our prompting strategy in
an experimental setting. Using a fixed implicature and presupposition enables us to experiment
with our prompt by varying only one element, i.e., the question.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>
        To evaluate the reliability of the produced response, we compared the pipeline’s final reply against the
retrieved documents using BERTScore [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ].
      </p>
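<p>This comparison can be sketched as follows; a simple lexical F1 stands in for BERTScore so that the example is self-contained (the real evaluation would use the bert-score package on the reply and the retrieved texts), and both function names are illustrative:</p>

```python
def lexical_f1(candidate: str, reference: str) -> float:
    # Toy lexical F1 standing in for BERTScore's F1; it measures token
    # overlap instead of contextual-embedding similarity.
    cand, ref = candidate.split(), reference.split()
    common = len(set(cand).intersection(ref))
    if common == 0:
        return 0.0
    precision = common / len(cand)
    recall = common / len(ref)
    return 2 * precision * recall / (precision + recall)

def reply_reliability(reply: str, retrieved_docs: list[str]) -> float:
    # Score the pipeline's final reply against each retrieved document
    # and keep the best match as a rough reliability signal.
    return max(lexical_f1(reply, doc) for doc in retrieved_docs)
```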
      <p>As shown in Table 1, our final responses are quite comparable. The summary is, not surprisingly, the
most similar to the original texts; however, the influence of CoT appears minimal, ensuring that the
text generated by the LLM remains reliable.</p>
      <p>We then compared the capacity of the naive-RAG module to overcome HyDoc limitations in evaluative
contexts. As shown in Figure 2, the improvement is substantial.</p>
      <p>To answer RQ2, we opted for a qualitative evaluation, manually checking whether the pragmatic
aspects had been addressed. As shown in Figure 3, the CoT strategy notably improves text generation.
Admittedly, it is difficult to answer evaluative questions by using the LLM to summarize the documents.
The length of the documents is influential and makes it difficult to capture evaluative aspects.</p>
      <p>The naive-RAG pipeline, without including CoT, provides the LLM response as the final output, and
we observe that it already performs better than the summary. Indeed, normalcy questions often make
their presupposition explicit (e.g. “Is it normal that I feel this way?”). In that case, the LLM can produce
better responses. Nonetheless, it is only by providing the CoT prompt with pragmatic context that we
can properly frame our final reply.</p>
      <p>Addressing implicature and presupposition not only produces more natural responses but also
strongly contributes to providing highly relevant information. For instance, in Example 1 and Example 2,
we show the different replies to Query 1. As we can see, in Example 1, the LLM response is limited to
providing bare information on the average weight of a newborn. However, this information alone is
insufficient to give a satisfactory reply.</p>
      <p>By contrast, Example 2, thanks to the correct addressing of presupposition and implicature, provides
crucial additional information that might help the patient deal with her problem.
Query 1. What is considered “low” weight for an infant at birth?
Example 1 (LLM reply + naive-RAG). The weight considered “low” for an infant at birth is less than 2,500
g.</p>
      <p>Example 2 (LLM reply + naive-RAG + CoT ). Hi. I hope I can reassure you. The weight considered “low”
for a newborn at birth is less than 2,500 g. This is not an isolated case; in fact, about 5% of newborns in
Italy are born weighing less than this value. It is important to know that even if your child is born with a
low weight, there are many things we can do to help him or her grow and develop healthily. Breastfeeding,
for example, is crucial for babies born premature or low birth weight, as breast milk contains anti-infective
proteins that can help prevent infections and other illnesses. Also, it is important to remember that your
child’s weight growth will be calculated from the lowest weight he or she reaches after birth, so don’t worry
if it takes a while to get back to birth weight. We are here to help and support you on this journey.</p>
    </sec>
    <sec id="sec-5">
      <title>5. Limitations</title>
      <p>We ran our first test on a rather limited question dataset. Although this was necessary to perform an
in-depth qualitative evaluation, further experiments on larger datasets are certainly needed. Dataset
expansion should also take into account different types of evaluative questions.</p>
      <p>Regarding the retrieval evaluation, we limited our experiment to measuring whether documents were
retrieved at all, regardless of their relevance. Our data is currently completely unstructured, and therefore,
it is challenging to evaluate the quality of the retrieval. We are currently in the process of annotating
our data so that we can better assess the retrieval step.</p>
      <p>Moreover, the qualitative evaluation of LLM replies did not take into account possible
hallucinations, which may occasionally be present even in well-structured responses.</p>
      <p>Finally, we used Llama 70B via an API. Despite our effort to keep the responses consistent by setting
the model temperature to 0.2, some variability is still possible, and it might affect reproducibility.
Further experiments should be conducted in more controlled environments, and other LLMs should be
considered to evaluate whether our approach can be applied across different models.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusions</title>
      <p>In this paper, we propose an enhancement to the existing RAG pipeline to address medical evaluative
queries related to normalcy. Our initial experiment demonstrates that incorporating a naive-RAG
module helps overcome the limitations of evaluative queries, where document retrieval is often difficult
using the HyDoc approach. Additionally, we developed a CoT-inspired prompting strategy to provide
the LLM with the pragmatic context necessary for answering evaluative questions. Our qualitative
analysis reveals that CoT prompting significantly improves the quality of text generation.</p>
      <p>Our ongoing research is focused on expanding the dataset and further evaluating various
CoT-prompting strategies. In the long run, we aim to implement data annotation to rigorously assess
retrieval performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Catita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Águas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Morgado</surname>
          </string-name>
          ,
          <article-title>Normality in medicine: a critical review</article-title>
          ,
          <source>Philosophy, Ethics, and Humanities in Medicine</source>
          <volume>15</volume>
          (
          <year>2020</year>
          )
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>C.</given-names>
            <surname>Handberg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Seibaek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Beedholm</surname>
          </string-name>
          ,
          <article-title>Reflections on the complexity of normalcy in nursing and health care</article-title>
          ,
          <source>Advances in Nursing Science</source>
          <volume>46</volume>
          (
          <year>2023</year>
          )
          <fpage>210</fpage>
          -
          <lpage>218</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Gutzmer</surname>
          </string-name>
          , W. A. Beach, “
          <article-title>having an ovary this big is not normal”: Physicians' use of normal to assess wellness and sickness during oncology interviews</article-title>
          ,
          <source>Health Communication 30</source>
          (
          <year>2015</year>
          )
          <fpage>8</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>S.</given-names>
            <surname>Ghanbari Haez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Segala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bellan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Magnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sanna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Consolandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <article-title>A retrieval-augmented generation strategy to enhance medical chatbot reliability</article-title>
          , in: J. Finkelstein, R. Moskovitch, E. Parimbelli (Eds.),
          <source>Artificial Intelligence in Medicine</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>213</fpage>
          -
          <lpage>223</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <article-title>Precise zero-shot dense retrieval without relevance labels</article-title>
          , in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023</source>
          , Association for Computational Linguistics,
          <year>2023</year>
          , pp.
          <fpage>1762</fpage>
          -
          <lpage>1777</lpage>
          . URL: https://doi.org/10.18653/v1/2023.acl-long.99. doi:10.18653/v1/2023.acl-long.99.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mihalache</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Popovic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. H.</given-names>
            <surname>Muni</surname>
          </string-name>
          ,
          <article-title>Chatgpt-4: an assessment of an upgraded artificial intelligence chatbot in the united states medical licensing examination</article-title>
          ,
          <source>Medical Teacher</source>
          <volume>46</volume>
          (
          <year>2024</year>
          )
          <fpage>366</fpage>
          -
          <lpage>372</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R. C. T.</given-names>
            <surname>Cheong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. P.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Unadkat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Mcneillis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williamson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Randhawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Andrews</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Paleri</surname>
          </string-name>
          ,
          <article-title>Performance of artificial intelligence chatbots in sleep medicine certification board exams: Chatgpt versus google bard</article-title>
          ,
          <source>European Archives of Oto-Rhino-Laryngology</source>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Cascella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Montomoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Bellini</surname>
          </string-name>
          , E. Bignami,
          <article-title>Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios</article-title>
          ,
          <source>Journal of Medical Systems</source>
          <volume>47</volume>
          (
          <year>2023</year>
          )
          <fpage>33</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sravanthi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Doshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tankala</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Murthy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          , P. Bhattacharyya,
          <article-title>PUB: A pragmatics understanding benchmark for assessing LLMs' pragmatics capabilities</article-title>
          , in: L.-W. Ku, A. Martins, V. Srikumar (Eds.),
          <source>Findings of the Association for Computational Linguistics ACL</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Bangkok, Thailand and virtual meeting,
          <year>2024</year>
          , pp.
          <fpage>12075</fpage>
          -
          <lpage>12097</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>719</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .findings-acl.
          <volume>719</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>E.</given-names>
            <surname>Engdahl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lidskog</surname>
          </string-name>
          ,
          <article-title>Risk, communication and trust: Towards an emotional understanding of trust</article-title>
          ,
          <source>Public Understanding of Science</source>
          <volume>23</volume>
          (
          <year>2014</year>
          )
          <fpage>703</fpage>
          -
          <lpage>717</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Consolandi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Magnolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dragoni</surname>
          </string-name>
          ,
          <article-title>Misunderstanding and risk communication in healthcare</article-title>
          ,
          <source>in: NL4AI@AI*IA</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Grice</surname>
          </string-name>
          ,
          <article-title>Presupposition and conversational implicature</article-title>
          ,
          <source>Radical Pragmatics</source> 183
          (
          <year>1981</year>
          )
          <fpage>41</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          , in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>S.</given-names>
            <surname>Siriwardhana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Weerasekera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Kaluarachchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nanayakkara</surname>
          </string-name>
          ,
          <article-title>Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>11</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          . URL: https://aclanthology.org/2023.tacl-1.1. doi:10.1162/tacl_a_00530.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>K.</given-names>
            <surname>Guu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Tung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pasupat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <article-title>Retrieval augmented language model pre-training</article-title>
          , in: H. Daumé III, A. Singh (Eds.),
          <source>Proceedings of the 37th International Conference on Machine Learning</source>
          , volume
          <volume>119</volume>
          of Proceedings of Machine Learning Research, PMLR,
          <year>2020</year>
          , pp.
          <fpage>3929</fpage>
          -
          <lpage>3938</lpage>
          . URL: https://proceedings.mlr.press/v119/guu20a.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <article-title>Leveraging passage retrieval with generative models for open domain question answering</article-title>
          , in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.),
          <source>Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>874</fpage>
          -
          <lpage>880</lpage>
          . URL: https://aclanthology.org/2021.eacl-main.74. doi:10.18653/v1/2021.eacl-main.74.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using Siamese BERT-networks</article-title>
          , in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          , Hong Kong, China, November 3-7,
          <year>2019</year>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , pp.
          <fpage>3980</fpage>
          -
          <lpage>3990</lpage>
          . URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/v1/D19-1410.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>in: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought reasoning in tabular language models</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>11006</fpage>
          -
          <lpage>11019</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Terry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <article-title>AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts</article-title>
          ,
          <source>in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI)</source>
          , ACM, New Orleans, LA, USA,
          <year>2022</year>
          . URL: https://doi.org/10.1145/3491102.3517582. doi:10.1145/3491102.3517582.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Callan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>PAL: Program-aided language models</article-title>
          ,
          <source>in: Proceedings of the 40th International Conference on Machine Learning (ICML)</source>
          , PMLR, Honolulu, Hawaii, USA,
          <year>2023</year>
          . URL: http://reasonwithpal.com.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Memisevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <article-title>Deductive verification of chain-of-thought reasoning</article-title>
          ,
          <source>in: 37th Conference on Neural Information Processing Systems (NeurIPS 2023)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2020</year>
          . URL: https://openreview.net/forum?id=SkeHuCVFDr.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>