=Paper=
{{Paper
|id=Vol-3784/short1
|storemode=property
|title=Chain-of-Thought to Enhance Document Retrieval in Certified Medical Chatbots
|pdfUrl=https://ceur-ws.org/Vol-3784/short1.pdf
|volume=Vol-3784
|authors=Leonardo Sanna,Simone Magnolini,Patrizio Bellan,Saba Ghanbari Haez,Marina Segala,Monica Consolandi,Mauro Dragoni
|dblpUrl=https://dblp.org/rec/conf/ir-rag/SannaMBHSCD24
}}
==Chain-of-Thought to Enhance Document Retrieval in Certified Medical Chatbots==
Leonardo Sanna¹*, Simone Magnolini¹, Patrizio Bellan¹, Saba Ghanbari Haez¹,², Marina Segala¹, Monica Consolandi¹ and Mauro Dragoni¹*

¹ Fondazione Bruno Kessler, Trento, Italy
² Free University of Bozen, Bozen, Italy
Abstract
We propose a Retrieval-Augmented Generation pipeline aimed at retrieving certified medical information. Inspired by the recently
introduced Hypothetical Document Embeddings framework, we use the LLM to generate a document to query our certified repository.
Although showing promising results in the first user evaluation, the proposed pipeline sometimes fails to retrieve the correct documents.
We therefore propose a second Chain-of-thought-inspired pipeline to enhance the generation of the Hypothetical Document and,
consequently, the retrieval of the certified documents.
Keywords
Conversational Agent, Digital Health, Chain-of-Thought, Certified Information
Information Retrieval's Role in RAG Systems (IR-RAG) - 2024
* Corresponding author.
lsanna@fbk.eu (L. Sanna); magnolini@fbk.eu (S. Magnolini); pbellan@fbk.eu (P. Bellan); sghanbarihaez@fbk.eu (S. G. Haez); msegala@fbk.eu (M. Segala); mconsolandi@fbk.eu (M. Consolandi); dragoni@fbk.eu (M. Dragoni)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

The Hypothetical Document Embeddings (HyDE) framework has recently been introduced as an effective method to build dense retrievers in a completely unsupervised fashion [1]. The key idea behind HyDE is to leverage the creative abilities of the Large Language Model (LLM) to generate a Hypothetical Document (HyDoc), which is then used to retrieve a real document from a repository.

Hence, HyDE is particularly well suited for building medical chatbots that operate with "certified information", i.e. conversational agents capable of providing trustworthy information that has been created or verified by domain experts such as physicians or other healthcare professionals in the digital health industry.

To provide "certified information", the chatbot's reply must be predetermined, namely we must have a predefined set of answers for each specific question. The existing lack of conversational datasets in the medical domain, however, poses a substantial challenge in creating a certified medical chatbot. To tackle this issue, we devised a Retrieval-Augmented Generation (RAG) pipeline within the HyDE framework so that we could benefit from the conversational capabilities of an LLM and, at the same time, exploit the LLM to retrieve the certified sources supporting the reply.

We believe that adopting HyDE addresses two major issues of RAG pipelines. First of all, we are trying to build a FAQ-based chatbot, so most of the interactions with the patients would be short questions. In a FAQ-oriented conversational agent using a simple naive-RAG pipeline, the user query would be employed to retrieve the certified sources. Yet, since we are operating with vector databases, the vector representation of the query might be significantly distant from the certified documents in the semantic space, yielding a remarkable risk of excluding relevant documents in the retrieval process.

Moreover, in a digital health context, it is important to keep our certified medical chatbot explainable [2]. RAG approaches add a further layer of algorithmic opacity, since the user is unaware of the documents used to generate the reply. Therefore, on the one hand, we use the retrieved document to produce a well-grounded and informed reply, while on the other hand, we provide the certified sources that have been retrieved, computing their similarity with the HyDoc.

Nonetheless, the quality of the generated HyDoc remains a substantial issue in medical domains. Although LLMs have shown impressive results in addressing medical queries [3, 4, 5], relying on the sole abilities of the LLM might result in generating inaccurate or low-quality HyDocs.

In fact, in a first user evaluation of our proposed modular pipeline, we found evidence that the retrieval step might be problematic when encountering specific types of questions, e.g. evaluative questions. This paper therefore introduces the main challenges we found in developing a modular RAG pipeline in a certified context. In particular, we focus on the proposal of a Chain-of-Thought-inspired pipeline to enhance the HyDoc generation and, consequently, improve the retrieval of the certified sources.

2. Related work

LLMs' credibility and effectiveness are crucial in AI research, especially in areas like digital health and wellbeing that require precision and reliability [6]. RAG and Chain-of-Thought (CoT) prompting are highly effective in reducing hallucinations and enhancing factual content generation in LLMs by integrating external knowledge.

RAG integrates external knowledge into LLMs' prompts through data retrieval using parametric and non-parametric memory [7, 8]. It has been shown that RAG outperforms parametric-only seq2seq models in tasks like Question Answering (QA) and summarization, improving text generation [9].

Various approaches have been explored to advance QA systems. For instance, the work in [10] involves a two-stage process that combines Dense Passage Retrieval (DPR) with generative sequence-to-sequence LMs. Other examples are the iterative integration of retrieval and generation [11], a combination of retrieval and generation techniques for informative answers [12], and dynamic real-time retrieval during generation [13]. Other approaches include techniques to improve the accuracy of language models by integrating external knowledge [14, 15], as well as advancing implicit reasoning and adaptability in QA tasks [9].

On the other hand, CoT methods have been highly effective in improving LLMs' ability to handle complex reasoning tasks, such as those that involve heterogeneous data from tables and questions [16, 17, 18]. Some recent studies have shown that breaking down problems into manageable steps significantly enhances LLMs' performance in complex reasoning tasks [16, 19, 20].

The work of [21] refines self-consistency decoding for broader applications like translation strategies and sentiment analysis, while [22] introduces the Zero-shot-CoT approach, a technique to improve LLM performance on diverse reasoning tasks without hand-crafted few-shot examples.

Finally, we should mention the Tree of Thoughts (ToT) framework [23], which has a particularly relevant approach for QA, namely Probabilistic Tree-of-Thought Reasoning (ProbTree) [24]. This approach breaks down QA into two stages, understanding and reasoning, to solve retrieval issues and prevent error propagation.

Despite the high research interest and the diversity of approaches in both RAG and CoT, there are currently no studies focusing on certified medical chatbots. Moving within the HyDE framework, we believe that we can employ CoT techniques to improve the generation of the Hypothetical Document that is then used as the query to retrieve the certified documents.

3. Dataset

In our dataset, we have three certified sources: (i) 179 informational cards, created by the Obstetrician Department of the Hospital of Trento (Italy); (ii) 953 documents from UPPA, a medical webzine; and (iii) 380 documents from ISS-Salute, the informative website of the Istituto Superiore di Sanità - ISS (Italian National Institute of Health).

It is important to highlight that our dataset is not conversational, nor is it meant to be used in a medical chatbot. All sources are what we might call content made for FAQ sections. Therefore, it is often quite verbose and dense in information. All the data we have is unstructured text, with a notable stylistic heterogeneity within the same source. This characteristic is combined with the semantic homogeneity given by the specific medical domain, creating a substantial issue for automatic topic extraction.

Finally, we should recall that content editing is not permitted due to the certified nature of our information. Since each specific question should consistently correspond to a particular set of equivalent answers, the adoption of modular RAG solutions becomes essential.

4. Methods

In this section, we explain the methods used in our implementation. Our first implementation was a sort of zero-shot implementation, since we generated the HyDoc relying only on LLM knowledge, without providing any other context. This solution is shown in Figure 1. We assessed the performance of this first implementation through a user evaluation. The technology presented in this section is the same used for the second implementation illustrated in Section 5.

In this work, we used GPT-4-turbo (gpt-4-0125-preview, specifically) as the LLM. However, our pipeline is intended to be LLM-agnostic. The use of OpenAI-GPT has, therefore, been intended as a convenient solution to test our RAG pipeline using a stable and well-performing LLM. Indeed, to deploy a conversational assistant in a real-case scenario, an open-source model would likely be required due to cost and privacy issues in accessing any LLM via API.

4.1. A first (zero-shot) implementation

Figure 1: An overview of the RAG model we are implementing.

Our approach employs a modular RAG framework designed to address the challenge of delivering natural, verified responses through a medical chatbot by leveraging unstructured data. To achieve this, we create a HyDoc in response to the user's questions.

The essence of our strategy lies in enhancing the document retrieval process with the HyDoc. Despite the potential for inaccuracies and hallucinations, the LLM is expected to discern the fundamental aspects of the query and identify textual patterns pertinent to the specific domain of knowledge. Given the proven efficacy of LLMs in fielding medical queries [3, 4, 5], the HyDoc is anticipated to closely align with genuine documents that provide accurate, verified responses to the user's question.

To query our verified document repository, we utilize the sentence embeddings generated from our HyDoc. The area of general-purpose sentence embeddings remains an active field of research [25], in contrast to the more established universal word embedding techniques like word2vec [26]. Our workflow incorporates the paraphrase-multilingual-mpnet-base-v2 Bi-Encoder model [27] for generating embeddings of both the HyDoc and the verified data.

This model introduces a pooling operation to produce a fixed-size embedding vector normalized to unit length (1.00). These vectors are then compared using cosine similarity. However, the Bi-Encoder model encounters challenges in accurately comparing documents of varying lengths, which can lead to the retrieval of irrelevant documents due to the disparity in length between our HyDocs and the documents in the repository.

To address this issue, we employ the ms-marco-MiniLM-L-6-v2 cross-encoder (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L-6-v2). Unlike the Bi-Encoder, which uses separate encoders for each input, the cross-encoder processes pairs of sentences through a single shared encoder,
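The ranking step described above can be sketched in a few lines. Here `normalize` and the toy vectors are stand-ins for the paraphrase-multilingual-mpnet-base-v2 Bi-Encoder, whose embeddings are likewise unit-normalized, so cosine similarity reduces to a dot product (a minimal sketch under these assumptions, not the deployed code):

```python
import numpy as np

def normalize(v):
    # Normalize to unit length so cosine similarity is a plain dot product.
    return v / np.linalg.norm(v)

def cosine_rank(hydoc_vec, doc_vecs):
    """Rank repository documents by cosine similarity to the HyDoc vector."""
    sims = doc_vecs @ hydoc_vec          # all vectors are unit-normalized
    order = np.argsort(-sims)            # indices, most similar first
    return order, sims[order]

# Toy stand-in embeddings (a real pipeline would use the Bi-Encoder model).
hydoc = normalize(np.array([0.9, 0.1, 0.0]))
docs = np.vstack([normalize(np.array(v)) for v in
                  ([1.0, 0.0, 0.0],     # close to the HyDoc
                   [0.0, 1.0, 0.0],     # unrelated
                   [0.7, 0.3, 0.0])])   # somewhat close
order, sims = cosine_rank(hydoc, docs)
print(order[0])  # → 0, the index of the most similar document
```

The same dot-product shortcut is what makes the shortlisting stage cheap compared with the cross-encoder applied later.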
producing a joint representation that is evaluated by a classifier to yield a similarity score between the texts.

Given the computational demands of the cross-encoder, it is applied selectively to a shortlist of potential documents. Following the computation of cosine similarity across all HyDoc-document pairs (HyDoc, 𝐷𝑖), where 𝑖 ranges from 1 to 𝑛 and 𝐷𝑖 represents the 𝑖-th document in the verified repository, we rank and select the top 50 documents for their relevance. This guarantees an acceptable number of documents from an information retrieval perspective [28]. Subsequently, the top 3 documents from this refined list are chosen to augment the original prompt, enhancing the text of the final response provided to the user. This decision is based on preliminary tests indicating that using more than three documents could negatively impact the framework's effectiveness.

Finally, a Guard-Rail module (refer to Mangaokar et al., https://arxiv.org/abs/2402.15911, for an example) is implemented to ensure the response generated by the LLM adheres to the specified prompt length, incorporating generated text and references to the three selected certified documents in the final answer.

An initial user evaluation of our zero-shot model was conducted using 100 questions related to pregnancy, deemed representative by expert reviewers. This evaluation focused on seven metrics: {Q1} the relevance of the answer to the question, {Q2} the relevance of the links (documents) provided, {Q3} text quality, {Q4} reliability, {Q5} clarity, {Q6} completeness, and {Q7} an overall evaluation score. According to Table 1, while the model demonstrated potential in text quality, it highlighted the need for improved document retrieval, as evidenced by the document link relevance scoring an average of 0.44. This value demonstrates that there is still room for improvement, but, on average, half of the documents included in the links sent to the users have been considered fully relevant.

Table 1
The results of the first user evaluation. All metrics are Likert scales with a range of 1 to 5 except {Q1}, which is a binary metric (1 for positive, zero for negative), and {Q2}, which is a precision score calculated on the three links.

Evaluation Criterion        Avg   Max   Min   Var
{Q1} Relevance to question  0.93  1.00  0.50  0.02
{Q2} Links relevance        0.44  1.00  0.00  0.05
{Q3} Text quality           4.59  5.00  3.33  0.06
{Q4} Reliability            3.79  4.75  2.33  0.40
{Q5} Clarity                4.60  5.00  3.33  0.05
{Q6} Completeness           3.38  4.75  1.33  0.81
{Q7} Overall evaluation     3.40  4.75  1.67  0.59

5. Towards a CoT pipeline

As shown in Section 4, our first implementation has substantial room for improvement in the retrieval step. In particular, we noticed a decline in the link relevance evaluation regarding a particular type of question, i.e., evaluative questions. Evaluative questions are quite common in the medical domain, and they represent 23% of the dataset within the user evaluation we performed. In a nutshell, they are inquiries that need direct feedback on a particular aspect (e.g., "Why am I feeling so tired?"). In this case, the average link relevance is 0.31, whereas non-evaluative questions have a 0.48 average link relevance.

We argue that the worse performance on evaluative questions is mostly because generating an evaluative answer might be complex for the LLM as well. Moreover, the generated HyDoc would likely be a punctual reply on the precise aspect, since this is the expected natural reply in a conversation. Since we are retrieving full documents, it might be that the vector representation of an evaluative HyDoc is quite distant from the original document where we can find the reply.

Therefore, we are annotating our dataset to enable the retrieval of shorter text segments. The idea is that we can split our documents into shorter and more meaningful segments to ease the retrieval step and enhance the generation part.

A second version of our pipeline has been tested on the subset of evaluative questions (Figure 2). The new pipeline is inspired by a CoT logic and, therefore, is aimed at generating a better HyDoc. First, we generate the HyDoc after a naive-RAG step. In a pre-retrieval step, the user question is hence used to query our certified repository, and the retrieved context is used to generate the HyDoc. Moreover, we also include more contextual information about the query, aimed at enhancing the similarity between the HyDoc and the contexts that need to be retrieved in the augmented prompt. For instance, we provide within the prompt useful pragmatic information to generate an evaluative reply, such as presuppositions and implications [29].

The CoT has proven to be capable of enhancing the quality of the generated HyDoc. Moreover, it has shown the ability to increase the semantic similarity between the HyDoc and the relevant documents to retrieve. This comparison considers the relevant textual segments containing the pertinent information, using the paraphrase-multilingual-mpnet-base-v2 Bi-Encoder.

In the naive-RAG step, we employ a Chroma vector database. We experimented with three different embedders, namely the two OpenAI models text-embedding-3-small (hereafter GPT-small) and text-embedding-3-large (hereafter GPT-large), and the Bi-Encoder model used for the document retrieval module. As shown in Table 2, using CoT prompting generated a better HyDoc with OpenAI embeddings, while it seems not influential for the Bi-Encoder model. Even though the increase in cosine similarity is small, we should recall that our documents share a considerable degree of semantic similarity. Consequently, this leads to a densely populated vector space, where even marginal enhancements in similarity can yield substantial benefits in the retrieval process. Anyhow, the naive-RAG step effectively enhances HyDoc similarity both using GPT-large and the Bi-Encoder embeddings.

Finally, the last step of the pipeline uses the HyDoc, the query context and the retrieved certified context to generate the reply. This provides the user with an appropriately framed answer as well as the documents involved in the generation process.

6. Conclusions

We have presented a modular RAG approach that enables the delivery of certified medical information. The modular pipeline allowed us to operate on unstructured texts with limited data annotation possibilities. A first user evaluation
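The two-stage selection just described (cosine shortlist of 50, cross-encoder rerank, keep 3) can be sketched as follows. The `cross_score` callable is a stand-in for the ms-marco-MiniLM-L-6-v2 cross-encoder, and the random similarities are toy data, so this is a structural sketch rather than the production component:

```python
import numpy as np

SHORTLIST = 50  # candidates kept after the cheap cosine stage
FINAL_K = 3     # documents used to augment the prompt

def two_stage_select(cosine_sims, cross_score, shortlist=SHORTLIST, k=FINAL_K):
    """Shortlist by Bi-Encoder cosine similarity, then rerank the shortlist
    with a (more expensive) cross-encoder score and keep the top k."""
    shortlist_idx = np.argsort(-cosine_sims)[:shortlist]
    reranked = sorted(shortlist_idx, key=lambda i: -cross_score(i))
    return reranked[:k]

# Toy scores: 100 documents with random cosine similarities; the
# cross-encoder stand-in arbitrarily prefers low document indices.
rng = np.random.default_rng(0)
sims = rng.random(100)
top3 = two_stage_select(sims, cross_score=lambda i: -i)
print(top3)
```

Applying the expensive scorer only to the shortlist is the point of the design: the cross-encoder must run once per candidate pair, while the cosine stage is a single matrix product.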
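Given the embeddings, the Table 2 figures can be reproduced by averaging pairwise cosine similarities between each HyDoc and the certified context it should retrieve. A sketch assuming row-wise embedding matrices (the vectors here are toy values, not Bi-Encoder outputs):

```python
import numpy as np

def avg_hydoc_similarity(hydoc_vecs, gold_vecs):
    """Average cosine similarity between each generated HyDoc and the
    certified context that should be retrieved for the same question."""
    h = hydoc_vecs / np.linalg.norm(hydoc_vecs, axis=1, keepdims=True)
    g = gold_vecs / np.linalg.norm(gold_vecs, axis=1, keepdims=True)
    return float(np.mean(np.sum(h * g, axis=1)))  # mean of row-wise dots

# Toy vectors standing in for embeddings of three question/context pairs.
hydocs = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
golds  = np.array([[1.0, 0.0], [0.6, 0.8], [1.0, 0.0]])
print(round(avg_hydoc_similarity(hydocs, golds), 3))  # → 0.667
```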
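The CoT-inspired pipeline of this section can be summarized as a chain of calls. Below, `retrieve` and `llm` are hypothetical stand-ins for the vector store and the LLM, so this is a sketch of the control flow, not the actual implementation:

```python
def cot_pipeline(question, retrieve, llm):
    """Sketch of the CoT-inspired pipeline: a naive-RAG pre-retrieval step
    grounds the HyDoc generation, and the HyDoc then drives the retrieval
    of the certified documents used in the final answer."""
    # Step 1: naive-RAG pre-retrieval with the raw user question.
    pre_context = retrieve(question)
    # Step 2: generate the Hypothetical Document, conditioned on the
    # pre-retrieved context plus pragmatic cues about the question.
    hydoc = llm(f"Question: {question}\nContext: {pre_context}\n"
                "Write a plausible answer document (HyDoc).")
    # Step 3: use the HyDoc (not the short question) to query the
    # certified repository.
    certified = retrieve(hydoc)
    # Step 4: generate the reply from question and certified context.
    reply = llm(f"Question: {question}\nCertified context: {certified}\n"
                "Answer using only the certified context.")
    return reply, certified

# Stubs standing in for the vector store and the LLM.
reply, certified = cot_pipeline(
    "Why am I feeling so tired?",
    retrieve=lambda q: f"<docs for: {q[:20]}>",
    llm=lambda prompt: f"<generated from {len(prompt)} chars>",
)
print(reply)
```

The key difference from the zero-shot variant is Step 1: the HyDoc is generated after, not before, a first retrieval pass over the certified repository.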
Figure 2: The proposed CoT pipeline
Table 2
The average cosine similarity between the HyDoc and the actual certified context in the "Evaluative Questions" subset
Prompt                          GPT-small  GPT-large  Bi-encoder
Question + Context + Naive-RAG  0.766      0.820      0.801
Question + Naive-RAG            0.736      0.806      0.807
Question                        0.717      0.717      0.717
showed promising results for our approach, although it revealed some flaws for specific types of questions, namely evaluative questions.

We therefore tested a CoT pipeline on this specific subtype of questions, to overcome the limitations shown in the user evaluation. This approach proved to have a positive impact on the retrieval modules, enhancing semantic similarity between the HyDoc and the certified contexts, as well as on textual generation.

Surely, we should consider that we tested the CoT pipeline on a rather small dataset and that we used OpenAI-GPT as a readily available state-of-the-art LLM. Our research efforts are currently focusing on expanding the dataset and testing different open-source LLMs, as we intend our pipeline to be completely LLM-agnostic.

Finally, we should also recall that in this work we presented a user evaluation and the analysis of its results. Further work is needed to create a ground truth on a comprehensive dataset of questions to assess the performance of the retrieval modules.

Acknowledgments

We acknowledge the support provided by the PNRR initiatives: INEST (Interconnected North-East Innovation Ecosystem), project code ECS00000043, and FAIR (Future AI Research), project code PE00000013. These projects are part of the NRRP MUR program, funded by the NextGenerationEU. This paper is supported by the TrustAlert project, funded by Fondazione Compagnia San Paolo and Fondazione CDP under the "Artificial Intelligence" call.

References

[1] L. Gao, X. Ma, J. Lin, J. Callan, Precise zero-shot dense retrieval without relevance labels, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 1762–1777. URL: https://doi.org/10.18653/v1/2023.acl-long.99. doi:10.18653/v1/2023.acl-long.99.
[2] W. Saeed, C. Omlin, Explainable AI (XAI): A systematic meta-survey of current challenges and future opportunities, Knowledge-Based Systems 263 (2023) 110273.
[3] A. Mihalache, R. S. Huang, M. M. Popovic, R. H. Muni, ChatGPT-4: an assessment of an upgraded artificial intelligence chatbot in the United States Medical Licensing Examination, Medical Teacher 46 (2024) 366–372.
[4] R. C. T. Cheong, K. P. Pang, S. Unadkat, V. Mcneillis, A. Williamson, J. Joseph, P. Randhawa, P. Andrews, V. Paleri, Performance of artificial intelligence chatbots in sleep medicine certification board exams: ChatGPT versus Google Bard, European Archives of Oto-Rhino-Laryngology (2023) 1–7.
[5] M. Cascella, J. Montomoli, V. Bellini, E. Bignami, Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios, Journal of Medical Systems 47 (2023) 33.
[6] K. T. Pham, A. Nabizadeh, S. Selek, Artificial intelligence and chatbots in psychiatry, Psychiatr Q 93 (2022) 249–253. URL: https://doi.org/10.1007/s11126-022-09973-8.
doi:10.1007/s11126-022-09973-8.
[7] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al., Retrieval-augmented generation for knowledge-intensive NLP tasks, Advances in Neural Information Processing Systems 33 (2020) 9459–9474.
[8] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W.-t. Yih, Dense passage retrieval for open-domain question answering, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 6769–6781. URL: https://aclanthology.org/2020.emnlp-main.550. doi:10.18653/v1/2020.emnlp-main.550.
[9] S. Siriwardhana, R. Weerasekera, E. Wen, T. Kaluarachchi, R. Rana, S. Nanayakkara, Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering, Transactions of the Association for Computational Linguistics 11 (2023) 1–17. URL: https://aclanthology.org/2023.tacl-1.1. doi:10.1162/tacl_a_00530.
[10] G. Izacard, E. Grave, Leveraging passage retrieval with generative models for open domain question answering, in: P. Merlo, J. Tiedemann, R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics, Online, 2021, pp. 874–880. URL: https://aclanthology.org/2021.eacl-main.74. doi:10.18653/v1/2021.eacl-main.74.
[11] Z. Shao, Y. Gong, Y. Shen, M. Huang, N. Duan, W. Chen, Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 9248–9274. URL: https://aclanthology.org/2023.findings-emnlp.620. doi:10.18653/v1/2023.findings-emnlp.620.
[12] W. Huang, M. Lapata, P. Vougiouklis, N. Papasarantopoulos, J. Z. Pan, Retrieval augmented generation with rich answer encoding, in: Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2023, pp. 1012–1025.
[13] Z. Jiang, F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, G. Neubig, Active retrieval augmented generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 7969–7992. URL: https://aclanthology.org/2023.emnlp-main.495. doi:10.18653/v1/2023.emnlp-main.495.
[14] Z. Yu, C. Xiong, S. Yu, Z. Liu, Augmentation-adapted retriever improves generalization of language models as generic plug-in, in: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, 2023, pp. 2421–2436.
[15] J. Baek, S. Jeong, M. Kang, J. Park, S. Hwang, Knowledge-augmented language model verification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 1720–1736. URL: https://aclanthology.org/2023.emnlp-main.107. doi:10.18653/v1/2023.emnlp-main.107.
[16] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
[17] M. Zheng, Y. Hao, W. Jiang, Z. Lin, Y. Lyu, Q. She, W. Wang, Chain-of-thought reasoning in tabular language models, in: Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, 2023, pp. 11006–11019.
[18] T. Wu, M. Terry, C. J. Cai, AI chains: Transparent and controllable human-AI interaction by chaining large language model prompts, in: Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems (CHI), ACM, New Orleans, LA, USA, 2022. URL: https://doi.org/10.1145/3491102.3517582. doi:10.1145/3491102.3517582.
[19] L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, G. Neubig, PAL: Program-aided language models, in: Proceedings of the 40th International Conference on Machine Learning (ICML), PMLR, Honolulu, Hawaii, USA, 2023. URL: http://reasonwithpal.com.
[20] Z. Ling, Y. Fang, X. Li, Z. Huang, M. Lee, R. Memisevic, H. Su, Deductive verification of chain-of-thought reasoning, in: 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023.
[21] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. H. Chi, S. Narang, A. Chowdhery, D. Zhou, Self-consistency improves chain of thought reasoning in language models, in: International Conference on Learning Representations (ICLR), 2023.
[22] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022), 2022.
[23] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, K. Narasimhan, Tree of thoughts: Deliberate problem solving with large language models, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 11809–11822. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/271db9922b8d1f4dd7aaef84ed5ac703-Paper-Conference.pdf.
[24] S. Cao, J. Zhang, J. Shi, X. Lv, Z. Yao, Q. Tian, J. Li, L. Hou, Probabilistic tree-of-thought reasoning for answering knowledge-intensive complex questions, in: Findings of the Association for Computational Linguistics: EMNLP, Association for Computational Linguistics, Beijing, China, 2023, pp. 12541–12560.
[25] R. Li, X. Zhao, M. Moens, A brief overview of universal sentence representation methods: A linguistic view, ACM Comput. Surv. 55 (2023) 56:1–56:42. URL: https://doi.org/10.1145/3482853. doi:10.1145/3482853.
[26] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems 26 (2013).
[27] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, Association for Computational Linguistics, 2019, pp. 3980–3990. URL: https://doi.org/10.18653/v1/D19-1410. doi:10.18653/v1/D19-1410.
[28] H. Li, Learning to rank for information retrieval and natural language processing, Springer Nature, 2022.
[29] H. P. Grice, Logic and conversation, in: Speech acts, Brill, 1975, pp. 41–58.