                         Chain-of-Thought to Enhance Document Retrieval in Certified
                         Medical Chatbots
                         Leonardo Sanna1,* , Simone Magnolini1 , Patrizio Bellan1 , Saba Ghanbari Haez1,2 , Marina Segala1 ,
                         Monica Consolandi1 and Mauro Dragoni1,*
                         1
                             Fondazione Bruno Kessler, Trento, ITALY
                         2
                             Free University of Bozen, Bozen, ITALY


Abstract
We propose a Retrieval-Augmented Generation pipeline aimed at retrieving certified medical information. Inspired by the recently introduced Hypothetical Document Embeddings framework, we use an LLM to generate a document with which to query our certified repository. Although it shows promising results in a first user evaluation, the proposed pipeline sometimes fails to retrieve the correct documents. We therefore propose a second, Chain-of-Thought-inspired pipeline to enhance the generation of the Hypothetical Document and, consequently, the retrieval of the certified documents.

Keywords
Conversational Agent, Digital Health, Chain-of-Thought, Certified Information



1. Introduction

The Hypothetical Document Embeddings (HyDE) framework has recently been introduced as an effective method for building dense retrievers in a completely unsupervised fashion [1]. The key idea behind HyDE is to leverage the creative abilities of a Large Language Model (LLM) to generate a Hypothetical Document (HyDoc), which is then used to retrieve a real document from a repository.

Hence, HyDE is particularly well-suited for building medical chatbots that operate with "certified information", i.e., conversational agents capable of providing trustworthy information that has been created or verified by domain experts such as physicians or other healthcare professionals in the digital health industry.

To provide "certified information", the chatbot's reply must be predetermined, namely, we must have a predefined set of answers for each specific question. The current lack of conversational datasets in the medical domain, however, poses a substantial challenge in creating a certified medical chatbot. To tackle this issue, we devised a Retrieval-Augmented Generation (RAG) pipeline within the HyDE framework, so that we could benefit from the conversational capabilities of an LLM and, at the same time, exploit the LLM to retrieve the certified sources supporting the reply.

We believe that adopting HyDE addresses two major issues of RAG pipelines. First of all, we are trying to build a FAQ-based chatbot; therefore, most of the interactions with the patients would be short questions. In a FAQ-oriented conversational agent using a simple naive-RAG pipeline, the user query would be employed to retrieve the certified sources. Yet, since we are operating with vector databases, the vector representation of the query might be significantly distant from the certified documents in the semantic space, yielding a remarkable risk of excluding relevant documents in the retrieval process.

Moreover, in a digital health context, it is important to keep our certified medical chatbot explainable [2]. RAG approaches add a further layer of algorithmic opacity, since the user is unaware of the documents used to generate the reply. Therefore, on the one hand, we use the retrieved documents to produce a well-grounded and informed reply, while on the other hand, we provide the certified sources that have been retrieved, computing their similarity with the HyDoc.

Nonetheless, the quality of the generated HyDoc remains a substantial issue in medical domains. Although LLMs have shown impressive results in addressing medical queries [3, 4, 5], relying on the sole abilities of the LLM might result in generating inaccurate or low-quality HyDocs.

In fact, in a first user evaluation of our proposed modular pipeline, we found evidence that the retrieval step might be problematic when encountering specific types of questions, e.g., evaluative questions. This paper therefore introduces the main challenges we found in developing a modular RAG pipeline in a certified context. In particular, we focus on the proposal of a Chain-of-Thought-inspired pipeline to enhance the HyDoc generation and, consequently, improve the retrieval of the certified sources.

Information Retrieval's Role in RAG Systems (IR-RAG) - 2024
* Corresponding author.
lsanna@fbk.eu (L. Sanna); magnolini@fbk.eu (S. Magnolini); pbellan@fbk.eu (P. Bellan); sghanbarihaez@fbk.eu (S. G. Haez); msegala@fbk.eu (M. Segala); mconsolandi@fbk.eu (M. Consolandi); dragoni@fbk.eu (M. Dragoni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

2. Related work

LLMs' credibility and effectiveness are crucial in AI research, especially in areas like digital health and wellbeing that require precision and reliability [6]. RAG and Chain-of-Thought (CoT) prompting are highly effective in reducing hallucinations and enhancing factual content generation in LLMs by integrating external knowledge.

RAG integrates external knowledge into LLMs' prompts through data retrieval using parametric and non-parametric memory [7, 8]. It has been shown that RAG outperforms parametric-only seq2seq models in tasks like Question Answering (QA) and summarization, improving text generation [9].

Various approaches have been explored to advance QA systems. For instance, the work in [10] involves a two-stage process that combines Dense Passage Retrieval (DPR) with generative sequence-to-sequence LMs. Other examples are the iterative integration of retrieval and generation [11], a combination of retrieval and generation techniques for informative answers [12], and dynamic real-time retrieval during generation [13]. Other approaches include techniques to improve the accuracy of language models by integrating external knowledge [14, 15], as well as advancing implicit reasoning and adaptability in QA tasks [9].

On the other hand, CoT methods have been highly effective in improving LLMs' ability to handle complex reasoning tasks, such as those that involve heterogeneous data from tables and questions [16, 17, 18]. Some recent studies have shown that breaking down problems into manageable steps significantly enhances LLMs' performance in complex reasoning tasks [16, 19, 20].

The work of [21] refines self-consistency decoding for broader applications like translation strategies and sentiment analysis, while [22] introduces the Zero-shot-CoT approach, a technique to improve LLM performance on diverse reasoning tasks without hand-crafted few-shot examples.

Finally, we should mention the Tree of Thoughts (ToT) framework [23], on which a particularly relevant approach for QA is based, namely Probabilistic Tree-of-Thought Reasoning (ProbTree) [24]. This approach breaks down QA into two stages, understanding and reasoning, to solve retrieval issues and prevent error propagation.

Despite the high research interest and the diversity of approaches both in RAG and CoT, there are currently no studies focusing on certified medical chatbots. Moving within the HyDE framework, we believe that we can employ CoT techniques to improve the generation of the Hypothetical Document that is then used as the query to retrieve the certified documents.

3. Dataset

In our dataset, we have three certified sources: (i) 179 informational cards, which were created by the Obstetrician Department of the Hospital of Trento (Italy); (ii) 953 documents from UPPA, a medical webzine; and (iii) 380 documents from ISS-Salute, the informative website of the Istituto Superiore di Sanità - ISS (Italian National Institute of Health).

It is important to highlight that the dataset we have is not conversational, nor is it meant to be used in a medical chatbot. All sources are what we might call content made for FAQ sections; therefore, the text is often quite verbose and dense in information. All the data we have is unstructured text, with a notable stylistic heterogeneity within the same source. This characteristic is combined with the semantic homogeneity given by the specific medical domain, creating a substantial issue for automatic topic extraction.

Finally, we should recall that content editing is not permitted due to the certified nature of our information. Since each specific question should consistently correspond to a particular set of equivalent answers, the adoption of modular RAG solutions becomes essential.

4. Methods

In this section, we explain the methods used in our implementation. Our first implementation was a sort of zero-shot implementation, since we generated the HyDoc relying only on the LLM's knowledge, without providing any other context. This solution is shown in Figure 1. We assessed the performance of this first implementation through a user evaluation. The technology presented in this section is the same used for the second implementation, illustrated in Section 5.

In this work, we used GPT-4-turbo (gpt-4-0125-preview specifically) as LLM. However, our pipeline is intended to be LLM-agnostic. The use of OpenAI-GPT has, therefore, been intended as a convenient solution to test our RAG pipeline using a stable and well-performing LLM. Indeed, to deploy a conversational assistant in a real-case scenario, an open-source model would likely be required, due to cost and privacy issues in accessing any LLM via API.

4.1. A first (zero-shot) implementation

Figure 1: An overview of the RAG model we are implementing.

Our approach employs a modular RAG framework designed to address the challenge of delivering natural, verified responses through a medical chatbot by leveraging unstructured data. To achieve this, we create a HyDoc in response to the user's questions.

The essence of our strategy lies in enhancing the document retrieval process with the HyDoc. Despite the potential for inaccuracies and hallucinations, the LLM is expected to discern the fundamental aspects of the query and identify textual patterns pertinent to the specific domain of knowledge. Given the proven efficacy of LLMs in fielding medical queries [3, 4, 5], the HyDoc is anticipated to closely align with genuine documents that provide accurate, verified responses to the user's question.

To query our verified document repository, we utilize the sentence embeddings generated from our HyDoc. The area of general-purpose sentence embeddings remains an active field of research [25], in contrast to the more established universal word embedding techniques like word2vec [26]. Our workflow incorporates the paraphrase-multilingual-mpnet-base-v2 Bi-Encoder model [27] for generating embeddings of both the HyDoc and the verified data.

This model introduces a pooling operation to produce a fixed-size embedding vector normalized to a length of 1.00. These vectors are then compared using cosine similarity. However, the Bi-Encoder model encounters challenges in accurately comparing documents of varying lengths, which can lead to the retrieval of irrelevant documents due to the disparity in length between our HyDocs and the documents in the repository.
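Since the pooled embeddings are normalized to unit length, cosine similarity reduces to a plain dot product. The ranking step can be sketched as follows; the toy three-dimensional vectors and document ids are invented stand-ins for the Bi-Encoder's high-dimensional output, so this is a minimal illustration rather than our actual implementation.

```python
from math import sqrt

def normalize(v):
    # The Bi-Encoder pools token embeddings into a fixed-size vector
    # normalized to length 1.0; we mimic that normalization here.
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(x * y for x, y in zip(a, b))

def retrieve(hydoc_vec, repo, k=3):
    # Rank every certified document by similarity to the HyDoc embedding
    # and keep the ids of the top-k documents.
    ranked = sorted(repo.items(),
                    key=lambda kv: cosine(hydoc_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy 3-dimensional embeddings standing in for real model outputs;
# the document ids are hypothetical.
hydoc = normalize([0.9, 0.1, 0.2])
repo = {
    "card_12": normalize([0.8, 0.2, 0.1]),
    "uppa_77": normalize([0.1, 0.9, 0.3]),
    "iss_05": normalize([0.7, 0.3, 0.4]),
}
print(retrieve(hydoc, repo, k=2))  # → ['card_12', 'iss_05']
```

In the real pipeline, `k` would be 50 for the shortlist handed to the cross-encoder described below; here a toy `k=2` suffices to show the ranking.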
To address this issue, we employ the ms-marco-MiniLM-L-6-v2 cross-encoder (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2). Unlike the Bi-Encoder, which uses separate encoders for each input, the cross-encoder processes pairs of sentences through a single shared encoder, producing a joint representation that is evaluated by a classifier to yield a similarity score between the texts.

Given the computational demands of the Cross-Encoder, it is applied selectively to a shortlist of potential documents. Following the computation of cosine similarity across all (HyDoc, 𝐷𝑖) pairs, where 𝑖 ranges from 1 to 𝑛 and 𝐷𝑖 represents the 𝑖-th document in the verified repository, we rank the documents and select the top 50 for their relevance. This guarantees an acceptable number of documents from an information retrieval perspective [28]. Subsequently, the top 3 documents from this refined list are chosen to augment the original prompt, enhancing the text of the final response provided to the user. This decision is based on preliminary tests indicating that using more than three documents could negatively impact the framework's effectiveness.

Finally, a Guard-Rail module (see Mangaokar et al., https://arxiv.org/abs/2402.15911, for an example) is implemented to ensure the response generated by the LLM adheres to the specified prompt length, incorporating the generated text and references to the three selected certified documents in the final answer.

An initial user evaluation of our zero-shot model was conducted using 100 questions related to pregnancy, deemed representative by expert reviewers. This evaluation focused on seven metrics: {Q1} the relevance of the answer to the question, {Q2} the relevance of the links (documents) provided, {Q3} text quality, {Q4} reliability, {Q5} clarity, {Q6} completeness, and {Q7} an overall evaluation score. According to Table 1, while the model demonstrated potential in text quality, it highlighted the need for improved document retrieval, as evidenced by the document link relevance scoring an average of 0.44. This value demonstrates that there is still room for improvement, but, on average, half of the documents included in the links sent to the users have been considered fully relevant.

Table 1
The results of the first user evaluation. All metrics are Likert scales with a range of 1 to 5, except {Q1}, which is a binary metric (1 for positive, 0 for negative), and {Q2}, which is a precision score calculated on the three links.

    Evaluation Criterion              Avg        Max      Min     Var
    {Q1} Relevance to question        0.93       1.00     0.50    0.02
    {Q2} Links relevance              0.44       1.00     0.00    0.05
    {Q3} Text quality                 4.59       5.00     3.33    0.06
    {Q4} Reliability                  3.79       4.75     2.33    0.40
    {Q5} Clarity                      4.60       5.00     3.33    0.05
    {Q6} Completeness                 3.38       4.75     1.33    0.81
    {Q7} Overall evaluation           3.40       4.75     1.67    0.59

5. Towards a CoT pipeline

As shown in Section 4, our first implementation has substantial room for improvement in the retrieval step. In particular, we noticed a decline in the link relevance evaluation for a particular type of question, i.e., evaluative questions. Evaluative questions are quite common in the medical domain, and they represent 23% of the dataset within the user evaluation we performed. In a nutshell, they are inquiries that need direct feedback on a particular aspect (e.g., "Why am I feeling so tired?"). In this case, the average link relevance is 0.31, whereas non-evaluative questions have a 0.48 average link relevance.

We argue that the worse performance on evaluative questions is mostly because generating an evaluative answer might be complex for the LLM as well. Moreover, the generated HyDoc would likely be a punctual reply on the precise aspect, since this is the expected natural reply in a conversation. Since we are retrieving full documents, it might be that the vector representation of an evaluative HyDoc is quite distant from the original document where we can find the reply.

Therefore, we are annotating our dataset to enable the retrieval of shorter text segments. The idea is that we can split our documents into shorter and more meaningful segments to ease the retrieval step and enhance the generation part.

A second version of our pipeline has been tested on the subset of evaluative questions (Figure 2). The new pipeline is inspired by a CoT logic and is therefore aimed at generating a better HyDoc. First, we generate the HyDoc after a naive-RAG step: in this pre-retrieval step, the user question is used to query our certified repository, and the retrieved context is used to generate the HyDoc. Moreover, we also include more contextual information about the query, aimed at enhancing the similarity between the HyDoc and the contexts that need to be retrieved in the augmented prompt. For instance, we provide within the prompt useful pragmatic information for generating an evaluative reply, such as presuppositions and implications [29].

The CoT has proven to be capable of enhancing the quality of the generated HyDoc. Moreover, it has shown the ability to increase the semantic similarity between the HyDoc and the relevant documents to retrieve. This comparison considers the relevant textual segments containing the pertinent information, using the paraphrase-multilingual-mpnet-base-v2 Bi-Encoder.

In the naive-RAG step, we employ a Chroma vector database. We experimented with three different embedders, namely the two OpenAI models text-embedding-3-small (hereafter GPT-small) and text-embedding-3-large (hereafter GPT-large), and the Bi-Encoder model used for the document retrieval module. As shown in Table 2, using CoT prompting generated a better HyDoc with OpenAI embeddings, while it seems not influential for the Bi-Encoder model. Even though the increase in cosine similarity is small, we should recall that our documents share a considerable degree of semantic similarity. Consequently, this leads to a densely populated vector space, where even marginal enhancements in similarity can yield substantial benefits in the retrieval process. Anyhow, the naive-RAG step effectively enhances HyDoc similarity both with GPT-large and with the Bi-Encoder embeddings.

Finally, the last step of the pipeline uses the HyDoc, the query context, and the retrieved certified context to generate the reply. This provides the user with an appropriately framed answer as well as the documents involved in the generation process.
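The CoT pipeline described in this section can be sketched as follows. All names are illustrative: the LLM call (GPT-4-turbo in our experiments) and the Chroma vector-store lookups are replaced by trivial stubs, so the sketch only shows the flow of the four steps (pre-retrieval, HyDoc generation, certified retrieval, final generation), not our actual implementation.

```python
def naive_rag_retrieve(question, store):
    # Pre-retrieval step: query the certified repository with the raw text.
    # Stub: naive word overlap stands in for embedding-based search.
    def overlap(doc):
        return len(set(question.lower().split()) & set(doc.lower().split()))
    return max(store, key=overlap)

def generate_hydoc(llm, question, context, pragmatics):
    # CoT-style prompt: the pre-retrieved context and pragmatic hints
    # (presuppositions, implications) guide the hypothetical document.
    prompt = (
        f"Question: {question}\n"
        f"Context from certified sources: {context}\n"
        f"Pragmatic hints: {pragmatics}\n"
        "Write a document that would answer the question."
    )
    return llm(prompt)

def answer(llm, question, store, pragmatics):
    context = naive_rag_retrieve(question, store)               # step 1: pre-retrieval
    hydoc = generate_hydoc(llm, question, context, pragmatics)  # step 2: HyDoc
    certified = naive_rag_retrieve(hydoc, store)                # step 3: HyDoc as query
    reply = llm(f"Answer using only: {certified}")              # step 4: grounded reply
    return reply, certified

def stub_llm(prompt):
    # Placeholder LLM that simply echoes the last line of its prompt.
    return prompt.splitlines()[-1]

store = ["feeling tired is common in pregnancy", "guidelines on iron intake"]
reply, certified = answer(stub_llm, "why am i feeling so tired",
                          store, "user presupposes a cause")
```

With real components, `naive_rag_retrieve` would be a Chroma collection query and `stub_llm` a chat-completion call; the control flow above stays the same.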
          Figure 2: The proposed CoT pipeline



    Table 2
    The average cosine similarity between the HyDoc and the actual certified context in the "Evaluative Questions" subset

                          Prompt                                 GPT-small        GPT-large     Bi-encoder
                          Question + Context + Naive-RAG         0.766            0.820         0.801
                          Question + Naive-RAG                   0.736            0.806         0.807
                          Question                               0.717            0.717         0.717
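The comparison reported in Table 2 can be sketched as follows, assuming a list of (HyDoc, certified context) pairs for the evaluative-question subset; the `embed` function here is a toy character-frequency stand-in for GPT-small, GPT-large, or the Bi-Encoder, so the numbers it produces are illustrative only.

```python
from math import sqrt

def cosine(a, b):
    # Standard cosine similarity between two (unnormalized) vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def avg_similarity(pairs, embed):
    # pairs: list of (hydoc_text, certified_context_text) tuples.
    # Average the cosine similarity over the whole question subset.
    sims = [cosine(embed(h), embed(c)) for h, c in pairs]
    return sum(sims) / len(sims)

def embed(text):
    # Toy embedder: character-frequency vector over a tiny alphabet,
    # standing in for a real sentence-embedding model.
    alphabet = "abcdefghijklmnopqrstuvwxyz "
    return [text.lower().count(ch) for ch in alphabet]

# Hypothetical (HyDoc, certified context) pairs.
pairs = [("tiredness is normal", "fatigue is common in pregnancy"),
         ("iron helps fatigue", "iron intake reduces tiredness")]
score = avg_similarity(pairs, embed)
print(round(score, 3))
```

Repeating this computation per prompt variant (question only, question + naive-RAG, question + context + naive-RAG) and per embedder yields one cell of Table 2 each.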



6. Conclusions

We have presented a modular RAG approach that enables the delivery of certified medical information. The modular pipeline allowed us to operate on unstructured texts with limited data annotation possibilities. A first user evaluation showed promising results for our approach, although it revealed some flaws in specific types of questions, namely evaluative questions.

We therefore tested a CoT pipeline on this specific subtype of questions, to overcome the limitations shown in the user evaluation. This approach proved to have a positive impact on the retrieval modules, enhancing the semantic similarity between the HyDoc and the certified contexts, as well as on textual generation.

Surely, we should consider that we tested the CoT pipeline on a rather small dataset and that we used OpenAI-GPT as a readily available state-of-the-art LLM. Our research efforts are currently focused on expanding the dataset and testing different open-source LLMs, as we intend our pipeline to be completely LLM-agnostic.

Finally, we should also recall that in this work we presented a user evaluation and the analysis of its results. Further work is needed to create a ground truth on a comprehensive dataset of questions to assess the performance of the retrieval modules.

Acknowledgments

We acknowledge the support provided by the PNRR initiatives: INEST (Interconnected North-East Innovation Ecosystem), project code ECS00000043, and FAIR (Future AI Research), project code PE00000013. These projects are part of the NRRP MUR program, funded by the NextGenerationEU. This paper is supported by the TrustAlert project, funded by Fondazione Compagnia San Paolo and Fondazione CDP under the "Artificial Intelligence" call.

References

[1] L. Gao, X. Ma, J. Lin, J. Callan, Precise zero-shot dense retrieval without relevance labels, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 1762–1777. URL: https://doi.org/10.18653/v1/2023.acl-long.99. doi:10.18653/V1/2023.ACL-LONG.99.
[2] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities, Knowledge-Based Systems 263 (2023) 110273.
[3] A. Mihalache, R. S. Huang, M. M. Popovic, R. H. Muni, ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination, Medical Teacher 46 (2024) 366–372.
[4] R. C. T. Cheong, K. P. Pang, S. Unadkat, V. Mcneillis, A. Williamson, J. Joseph, P. Randhawa, P. Andrews, V. Paleri, Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard, European Archives of Oto-Rhino-Laryngology (2023) 1–7.
[5] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios, Journal of Medical Systems 47 (2023) 33.
[6] K. T. Pham, A. Nabizadeh, S. Selek, Artificial intelligence and chatbots in psychiatry, Psychiatr Q 93 (2022) 249–253. URL: https://doi.org/10.1007/s11126-022-09973-8. doi:10.1007/s11126-022-09973-8.
[7] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[8] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://aclanthology.org/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
[9] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, S. Nanayakkara, Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering, Transactions of the Association for Computational Linguistics 11 (2023) 1–17. URL: https://aclanthology.org/2023.tacl-1.1. doi:10.1162/tacl_a_00530.
[10] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 874–
[15] Knowledge-augmented language model verification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 1720–1736. URL: https://aclanthology.org/2023.emnlp-main.107. doi:10.18653/v1/2023.emnlp-main.107.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
[17] M. Zheng, Y. Hao, W. Jiang, Z. Lin, Y. Lyu, Q. She, W. Wang, Chain-of-thought reasoning in tabular language models, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, 2023, pp. 11006–11019.
[18] T. Wu, M. Terry, C. J. Cai, AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI), ACM, New Orleans, LA, USA, 2022. URL: https://doi.org/10.1145/3491102.3517582. doi:10.1145/3491102.3517582.
[19] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, G. Neubig, PAL: Program-aided language models, in: Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR, Honolulu,
     880. URL: https://aclanthology.org/2021.eacl-main.74.           Hawaii, USA, 2023. URL: http://reasonwithpal.com,
     doi:10.18653/v1/2021.eacl-main.74.                              copyright 2023 by the author(s).
[11] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan,              [20] Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memise-
     W. Chen, Enhancing retrieval-augmented large lan-               vic, H. Su, Deductive verification of chain-of-thought
     guage models with iterative retrieval-generation syn-           reasoning, in: 37th Conference on Neural Information
     ergy, in: H. Bouamor, J. Pino, K. Bali (Eds.), Find-            Processing Systems (NeurIPS 2023), NeurIPS, 2023.
     ings of the Association for Computational Linguis-         [21] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. Chi,
     tics: EMNLP 2023, Association for Computational Lin-            S. Narang, A. Chowdhery, D. Zhou, Self-consistency
     guistics, Singapore, 2023, pp. 9248–9274. URL: https:           improves chain of thought reasoning in language mod-
     //aclanthology.org/2023.findings-emnlp.620. doi:10.             els, in: International Conference on Learning Repre-
     18653/v1/2023.findings-emnlp.620.                               sentations (ICLR), Google Research, Brain Team, 2023.
[12] W. Huang, M. Lapata, P. Vougiouklis, N. Papasaran-         [22] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa,
     topoulos, J. Z. Pan, Retrieval augmented generation             Large language models are zero-shot reasoners, in:
     with rich answer encoding, in: Proceedings of the               The U 36th Conference on Neural Information Pro-
     13th International Joint Conference on Natural Lan-             cessing Systems (NeurIPS 2022), 2022.
     guage Processing and the 3rd Conference of the Asia-       [23] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao,
     Pacific Chapter of the Association for Computational            K. Narasimhan,         Tree of thoughts: Deliberate
     Linguistics (Volume 1: Long Papers), Association for            problem solving with large language models, in:
     Computational Linguistics, 2023, pp. 1012–1025.                 A. Oh, T. Neumann, A. Globerson, K. Saenko,
[13] Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-            M. Hardt, S. Levine (Eds.), Advances in Neural
     Yu, Y. Yang, J. Callan, G. Neubig, Active retrieval             Information Processing Systems, volume 36, Curran
     augmented generation, in: H. Bouamor, J. Pino,                  Associates, Inc., 2023, pp. 11809–11822. URL: https:
     K. Bali (Eds.), Proceedings of the 2023 Conference              //proceedings.neurips.cc/paper_files/paper/2023/file/
     on Empirical Methods in Natural Language Process-               271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.
     ing, Association for Computational Linguistics, Singa-          pdf.
     pore, 2023, pp. 7969–7992. URL: https://aclanthology.      [24] S. Cao, J. Zhang, J. Shi, X. Lv, Z. Yao, Q. Tian, J. Li,
     org/2023.emnlp-main.495. doi:10.18653/v1/2023.                  L. Hou, Probabilistic tree-of-thought reasoning for
     emnlp-main.495.                                                 answering knowledge-intensive complex questions,
[14] Z. Yu, C. Xiong, S. Yu, Z. Liu, Augmentation-adapted            in: Findings of the Association for Computational Lin-
     retriever improves generalization of language models            guistics: EMNLP, Association for Computational Lin-
     as generic plug-in, in: Proceedings of the 61st An-             guistics, Beijing, China, 2023, pp. 12541–12560.
     nual Meeting of the Association for Computational          [25] R. Li, X. Zhao, M. Moens, A brief overview of universal
     Linguistics, Volume 1: Long Papers, Association for             sentence representation methods: A linguistic view,
     Computational Linguistics, 2023, pp. 2421–2436.                 ACM Comput. Surv. 55 (2023) 56:1–56:42. URL: https:
[15] J. Baek, S. Jeong, M. Kang, J. Park, S. Hwang,                  //doi.org/10.1145/3482853. doi:10.1145/3482853.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[27] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, November 3–7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/v1/D19-1410.
[28] H. Li, Learning to Rank for Information Retrieval and Natural Language Processing, Springer Nature, 2022.
[29] H. P. Grice, Logic and conversation, in: Speech Acts, Brill, 1975, pp. 41–58.