PersonaRAG: Enhancing Retrieval-Augmented Generation Systems with User-Centric Agents

Saber Zerhoudi1, Michael Granitzer1
1 University of Passau, Passau, Germany

Abstract
Large Language Models (LLMs) struggle with generating reliable outputs due to outdated knowledge and hallucinations. Retrieval-Augmented Generation (RAG) models address this by enhancing LLMs with external knowledge, but often fail to personalize the retrieval process. This paper introduces PersonaRAG, a novel framework incorporating user-centric agents to adapt retrieval and generation based on real-time user data and interactions. Evaluated across various question answering datasets, PersonaRAG demonstrates superiority over baseline models, providing answers tailored to user needs. The results suggest promising directions for user-adapted information retrieval systems. Findings and resources are available at https://github.com/padas-lab-de/ir-rag-sigir24-persona-rag.

Keywords
User interactions, Retrieval-Augmented Generation (RAG), Personalized Information Retrieval, Multi-Agent RAG

Information Retrieval's Role in RAG Systems (IR-RAG) workshop at SIGIR 2024, Washington D.C., USA
saber.zerhoudi@uni-passau.de (S. Zerhoudi)

1. Introduction

Large Language Models (LLMs) such as GPT-4 [2] and LLaMA 3 [3] have significantly advanced the field of natural language processing (NLP) by demonstrating impressive performance across various tasks and exhibiting emergent abilities that push the boundaries of artificial intelligence [4]. However, these models face challenges such as generating unreliable outputs due to issues like hallucination and outdated parametric memories [5].

Retrieval-Augmented Generation (RAG) models have shown promise in addressing these issues by integrating externally retrieved information to support more effective performance on complex, knowledge-intensive tasks [6]. Despite these advancements, the deployment of RAG systems within broader AI frameworks continues to face significant challenges, particularly in handling noise and irrelevance in retrieved data [7].

A key limitation of existing RAG systems is their inability to adapt outputs to users' specific informational and contextual needs. Personalized techniques in information retrieval, such as adaptive retrieval based on user interaction data and context-aware strategies, are increasingly recognized as essential for enhancing user interaction and satisfaction [8, 9]. These methods aim to refine the retrieval process dynamically, tailoring it more closely to individual user profiles and situational contexts [10].

The integration of agent-based systems with personalized RAG architectures presents a compelling avenue for research. Such systems use a multi-agent framework to simulate complex, adaptive interactions tailored to user-specific requirements [11]. By embedding intelligent, user-oriented agents within the RAG framework, these systems can evolve into more sophisticated tools that not only retrieve relevant information but also align it closely with the user's specific preferences and contexts in real time. Importantly, the personalization strategy employed in these systems is fully transparent to the user, ensuring that the user is aware of how their information is being used to tailor the results.

In this study, we present PersonaRAG, a methodology that extends traditional RAG frameworks by incorporating user-centric agents into the retrieval process. This approach addresses the limitations above by promoting active engagement with retrieved content and by using dynamic, real-time user data to continuously refine and personalize interactions. PersonaRAG aims to enhance the precision and relevance of LLM outputs, adapting dynamically to user-specific needs while maintaining full transparency regarding the personalization process.

Our experiments, conducted using GPT-3.5, develop the PersonaRAG model and evaluate its performance across various question answering datasets. The results indicate that PersonaRAG achieves an improvement of over 5% in accuracy compared to baseline models. Furthermore, PersonaRAG demonstrates an ability to adapt responses based on user profiles and information needs, enhancing the personalization of results. Additional analysis shows that the principles underlying PersonaRAG generalize to different LLM architectures, such as Llama 3 70b and Mixture of Experts (MoE) 8x7b [12]; these architectures also benefit from the external knowledge integration facilitated by PersonaRAG, with improvements exceeding 10% in some cases. This evidence indicates that PersonaRAG not only contributes to the progress of RAG systems but also provides notable advantages for various LLM applications, a meaningful step toward more intelligent and user-adapted information retrieval systems.

2. Related Work

Retrieval-Augmented Generation (RAG) systems have emerged as a significant advancement in natural language processing and machine learning, enhancing language models by integrating external knowledge bases to improve performance across various tasks, such as question answering, dialog understanding, and code generation [6, 13]. These systems employ dense retrievers to pull relevant information, which the language model then uses to generate responses. However, the development of RAG systems and their integration within broader artificial intelligence frameworks is an ongoing area of research, with several challenges and opportunities for improvement.
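To make the dense-retrieval step mentioned above concrete, the following toy sketch scores passages against a query by inner product over embedding vectors. The 3-dimensional vectors are fabricated for illustration; a real system would use a learned encoder, and this is not the retriever used in the paper (which relies on sparse BM25 retrieval, see Section 4.4).

```python
# Toy dense-retrieval step of a RAG system: query and passages are
# represented as vectors and ranked by inner-product similarity.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def dense_retrieve(query_vec, passage_vecs, k=2):
    """Return indices of the k passages with the highest similarity score."""
    ranked = sorted(range(len(passage_vecs)),
                    key=lambda i: dot(query_vec, passage_vecs[i]),
                    reverse=True)
    return ranked[:k]
```

In practice the vectors come from a trained bi-encoder and the ranking is served by an approximate nearest-neighbor index, but the scoring logic is the same.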
michael.granitzer@uni-passau.de (M. Granitzer)
ORCID: 0000-0003-2259-0462 (S. Zerhoudi); 0000-0003-3566-5507 (M. Granitzer)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

Recent developments in RAG systems have focused on refining these models to better handle the noise and irrelevant information often retrieved during the process. Xu et al. [13] addressed this issue by employing natural language inference models to select pertinent sentences, thereby enhancing the RAG's robustness. Additionally, advancements have been made in adaptively retrieving information, with systems like those proposed by Jiang et al. [14] dynamically fetching passages that are most likely to improve generation accuracy.

Despite these improvements, RAG systems still face limitations, particularly in adapting their output to the user's specific profile, such as their information needs or intellectual knowledge. This limitation stems from the current design of most RAG systems, which do not typically incorporate user context or personalized information retrieval strategies [15]. Consequently, a gap exists between the general effectiveness of RAG systems and their applicability in personalized user experiences, where context and individual user preferences play a crucial role.

Personalization in information retrieval is increasingly recognized as essential for enhancing user interaction and satisfaction [16]. Techniques such as user profiling, context-aware retrieval, and adaptive feedback mechanisms are commonly employed to tailor search results to individual users' needs. For instance, Jeong et al. [17] proposed adaptive retrieval strategies that dynamically adjust the retrieval process based on the complexity of the query and the user's historical interaction data. These personalized approaches not only improve user satisfaction but also increase the efficiency of information retrieval by reducing the time users spend sifting through irrelevant information.

The integration of personalized techniques with agent-based systems provides a promising pathway to augment the capabilities of RAG systems. Agent-based systems, particularly in the form of LLM-based multi-agent frameworks [18], enable the simulation of complex interactions that can lead to more nuanced and contextually appropriate outputs. By incorporating multi-agent systems into RAG frameworks, there is potential for developing more robust and adaptive retrieval mechanisms that can handle a broader range of queries and generate more accurate responses, closely tailored to the specific needs and contexts of individual users.

In conclusion, while significant progress has been made in enhancing the effectiveness and personalization of RAG systems, ongoing research is crucial to address their existing limitations and expand their applications. The integration of personalized information retrieval and agent-based enhancements represents a promising avenue for further improving the adaptability and accuracy of RAG systems, potentially leading to intelligent information retrieval tailored to the specific needs of users.

3. Methodology

In this section, we present the methodology underlying our PersonaRAG approach, which aims to enhance the ability of Large Language Models (LLMs) to actively engage with, understand, and leverage user profile information for personalized content generation. We begin by discussing the fundamental concepts of Retrieval-Augmented Generation (RAG) models (Section 3.1) and then introduce our PersonaRAG technique, which encourages LLMs to actively assimilate knowledge from live search sessions (Section 3.2).

Figure 1: Illustrations of Various RAG Models. Vanilla RAG and Chain-of-Thought [1] use passive learning, while PersonaRAG involves user-centric knowledge acquisition.

3.1. Fundamentals of Retrieval-Augmented Generation (RAG) Models

State-of-the-art RAG models, as described in previous studies [19, 20, 21], employ retrieval systems to identify a set of passages D = {d1, …, dn} when given a query q. These passages are intended to enhance the generative capabilities of LLMs by providing them with contextually relevant information.

Early versions of RAG models typically employ a traditional retrieval-generation framework, in which the retrieved set D = {d1, …, dn} is directly fed into the LLM to generate responses to the query q. However, these passages often contain irrelevant information, and this direct-utilization approach has been shown to restrict the potential benefits of the RAG framework [22]. This limitation has sparked further discussion on how to improve LLMs by integrating retrieval results with outputs generated by the models themselves [23].

Figure 2: Overview of Our PersonaRAG Model, showcasing the dynamic interaction among specialized agents within the system, facilitated by a global message pool for structured communication. The diagram illustrates the flow from a user query through the User Profile, Contextual Retrieval, Live Session, Document Ranking, and Feedback Agents, highlighting their contributions to real-time adaptation and personalized content generation by integrating live user data and feedback for continuous improvement and contextually relevant search experiences.
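The retrieve-then-generate loop of Section 3.1 can be sketched as follows. This is a minimal illustration only: the term-overlap scorer stands in for a real retriever, and the prompt format is an assumption, not the paper's actual prompt.

```python
# Minimal vanilla-RAG sketch: retrieve top-k passages for query q, then
# feed them directly to the LLM as context (the direct-utilization
# approach described in Section 3.1).

def score(query: str, passage: str) -> float:
    """Crude lexical relevance: fraction of query terms found in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / max(len(q_terms), 1)

def retrieve(query: str, corpus: list, k: int = 3) -> list:
    """Return the top-k passages D = {d1, ..., dn} for the query q."""
    return sorted(corpus, key=lambda p: score(query, p), reverse=True)[:k]

def build_prompt(query: str, passages: list) -> str:
    """Concatenate retrieved passages and the question into one LLM prompt."""
    context = "\n".join(f"Passage {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{context}\n\nQuestion: {query}\nAnswer:"
```

The prompt returned by build_prompt would then be sent to the generator; because the passages are used verbatim, any retrieval noise flows straight into generation, which is exactly the weakness PersonaRAG's agents are designed to mitigate.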
3.2. PersonaRAG: RAG with User-Centric Agents

Drawing from the principles of adaptive learning and user-centered design, we develop a new PersonaRAG architecture that enables IR systems to dynamically learn from and adapt to user behavior in real time. As shown in Figure 2, PersonaRAG introduces a three-step pipeline: retrieval, user interaction analysis, and cognitive dynamic adaptation. Unlike traditional IR models that respond statically to queries, PersonaRAG leverages live user data to continually refine its understanding and responses without the need for manual retraining. To foster this, we engage the IR system as five specialized agents that analyze user interactions based on retrieved data. At present, the focus is on the functionality and interaction of these agents rather than their individual performance metrics.

3.2.1. User Interaction Analysis

To understand user behavior from live interactions, PersonaRAG treats the IR system as a cognitive structure capable of receiving, interpreting, and acting upon user feedback [24]. Mimicking human learning behaviors, we establish four distinct agents within the system dedicated to analyzing user interactions from different perspectives: engagement tracking, preference analysis, context understanding, and feedback integration. These agents' roles are detailed in Section 3.2.2.

3.2.2. Cognitive Dynamic Adaptation

Following adaptive learning principles, we employ a dynamic adaptation mechanism to assist the IR system in utilizing real-time user data for continuous improvement. This mechanism facilitates the integration of insights gained from User Interaction Analysis into the system's retrieval protocols [28]. The process involves the User Profile Agent, Contextual Retrieval Agent, Live Session Agent, Document Ranking Agent, and Feedback Agent working together to refine search queries, prioritize relevant results, and improve document scoring and re-ranking based on the user profile, session-specific contexts, and feedback. Specifically, we prompt the system to adjust its query responses based on an initial understanding of the user's needs and to refine these responses as more user data becomes available. This approach not only personalizes the search results but also helps correct any misalignments or errors in real time.

PersonaRAG employs a highly specialized agent architecture, with each agent focusing on a specific aspect of the information retrieval process. All agents utilize in-context learning, i.e., prompting, to perform their designated tasks. This role specialization allows for the efficient decomposition of complex user queries into manageable tasks [25].

User Profile Agent. This component manages and updates user profile data, incorporating historical user interactions and preferences [26, 27]. It monitors how users interact with search results, such as click-through rates and navigation paths. The User Profile Agent helps the system understand what captures user interest and leads to deeper engagement, enabling personalized search experiences.

Contextual Retrieval Agent. This agent is responsible for the initial retrieval of documents based on the user's current query. It accesses both a traditional search index and a more dynamic, context-aware system that can consider broader aspects of the query environment. It utilizes user profile data to modify and refine search queries or to prioritize search results. For instance, if a user consistently engages more with certain types of documents or topics, the retrieval agent can boost those document types in the search results, ensuring that the most relevant information is presented to the user.
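The role-specialized, prompt-driven agents described above can be sketched as plain LLM calls that share a global memory string. The instruction texts below are abbreviated paraphrases of the prompts in Section 4.5, and call_llm is a placeholder for any chat-completion client; none of this is the authors' code.

```python
# Sketch of PersonaRAG's in-context-learning agents: each agent is an LLM
# call with its own role instructions plus the shared global memory, and
# each agent publishes its answer back to that memory (the message pool).

AGENT_INSTRUCTIONS = {
    "user_profile": "Analyze clues about the user's search preferences.",
    "contextual_retrieval": "Identify strategies to refine document retrieval.",
    "live_session": "Infer the user's immediate needs from this session.",
    "document_ranking": "Prioritize documents to match user interests.",
    "feedback": "Identify implicit and explicit feedback signals.",
}

def build_agent_prompt(role, question, passages, global_memory):
    """Fill one agent's template with the shared inputs."""
    return (
        f"{AGENT_INSTRUCTIONS[role]}\n"
        f"Question: {question}\n"
        f"Passages: {passages}\n"
        f"Global Memory: {global_memory}"
    )

def run_agents(question, passages, call_llm, global_memory=""):
    """Run the specialized agents in turn, accumulating shared memory."""
    answers = {}
    for role in AGENT_INSTRUCTIONS:
        prompt = build_agent_prompt(role, question, passages, global_memory)
        answers[role] = call_llm(prompt)
        global_memory += f"\n[{role}] {answers[role]}"  # publish to the pool
    return answers, global_memory
```

Running the agents sequentially lets later agents read earlier agents' findings through the memory string; the paper's Global Message Pool plays this role with richer structure and an additional consolidation prompt.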
Live Session Agent. This agent analyzes the current session in real time, observing user actions such as clicks, time spent on documents, modifications to the query, and any feedback provided. It creates a session-specific context model that captures the user's immediate needs and interests. The real-time data collected by this agent is used to adjust the ongoing session, potentially re-ranking search results or suggesting new queries based on the user's behavior and preferences. Additionally, the Live Session Agent updates the user profile with new insights gleaned from the session, allowing for a more personalized and efficient search experience in future interactions.

Document Ranking Agent. This agent is responsible for re-ranking the documents retrieved by the Contextual Retrieval Agent. It integrates insights from both the User Profile Agent and the Live Session Agent to score and order the documents more effectively. By considering the user's historical preferences and current session behavior, the Document Ranking Agent ensures that the most relevant and valuable documents are presented to the user in a prioritized manner. This agent continuously adapts its ranking algorithms based on the feedback received from the user and the insights provided by the other agents in the system.

Feedback Agent. This agent gathers implicit and explicit feedback during and after user interactions. Implicit feedback includes behavioral data like time spent on documents, click counts, and navigation patterns. Explicit feedback involves direct user input on document relevance and quality, collected through ratings, surveys, or comments. The agent uses this information to train and refine models for the other agents, particularly the Document Ranking Agent. This process enhances the system's ability to anticipate user needs and deliver relevant documents based on accumulated feedback and insights.

By dynamically integrating insights from the User Profile Agent, Contextual Retrieval Agent, Live Session Agent, Document Ranking Agent, and Feedback Agent into the IR process, PersonaRAG not only adapts to immediate user needs but also evolves over time to better anticipate and meet user expectations. This multi-agent approach enables PersonaRAG to embody a truly adaptive and user-focused information retrieval system, leveraging specialized agents to analyze user interactions from different behavioral perspectives and deliver highly personalized, contextually relevant search experiences. The inclusion of the Document Ranking Agent ensures that the most pertinent documents are identified and presented to users, further enhancing the system's ability to satisfy user information needs.

3.3. PersonaRAG Operational Workflow

The PersonaRAG framework employs a structured workflow that allows for sequential and parallel processing of tasks, ensuring clarity and consistency in communication between agents through well-defined data structures and protocols. PersonaRAG's modular design allows for flexibility in the system setup, enabling researchers to focus on the most relevant aspects of the user's profile, session, and feedback data. Agents work collaboratively by utilizing content from the Global Message Pool, which serves as a central hub for inter-agent communication [28], eliminating inefficiencies and enabling agents to access or update information as required.

The Feedback Agent collects and analyzes implicit and explicit user feedback to generate insights into the effectiveness of retrieval strategies and document relevance. This feedback is used to make dynamic adjustments to the system, refining retrieval methods and altering the weighting of user profile factors. Through this iterative process, PersonaRAG continuously adapts and improves its performance, enhancing the accuracy and user satisfaction of the retrieval results [29].

4. Experimental Setups

In this section, we present the experimental setup employed in our study, including the datasets, baseline models, evaluation metrics, and implementation details. We also provide an overview of the prompts used in our experiments.

4.1. Datasets

Our experiments are conducted on three widely used single-hop benchmark datasets in the field of Information Retrieval (IR): NaturalQuestions (NQ) [30], TriviaQA [31], and WebQuestions (WebQ) [32]. NQ is a well-known dataset in Natural Language Understanding (NLU), consisting of structured questions and corresponding Wikipedia pages annotated with long and short answers. TriviaQA comprises question-answer pairs collected from trivia and quiz-league websites, while WebQ consists of questions selected using the Google Suggest API, with answers being entities in Freebase.

Table 1 summarizes the datasets used in our initial study. Due to the high cost of using language models and the large number of API calls required, we randomly sampled 500 questions from each raw dataset to create more manageable subsets for our experiments. While this sampling approach limits the scope of our study, it allows us to conduct an initial investigation into the performance of different RAG systems on these datasets. We acknowledge that future work with larger sample sizes and more comprehensive experiments will be necessary to draw definitive conclusions. Nonetheless, we believe this preliminary study provides valuable insights into the relative strengths and weaknesses of the tested RAG approaches.

4.2. Models

We compare PersonaRAG with several baseline models, including prompt learning and RAG models. The prompt templates used in user interaction analysis and dynamic adaptation are presented in Section 4.5.
Initially, the question-answering (QA) instruction is fed to ChatGPT to implement the vanilla answer generation model. Following the work of Wei et al. [33], the Chain-of-Thought model is implemented, which generates a rationale for the question before producing the final answer. Additionally, the Guideline model serves as a baseline, generating problem-solving steps and guiding the LLM to generate the answer. For the RAG-based baselines, two models are implemented: vanilla RAG and Chain-of-Thought, which include utilizing raw retrieved passages (CoT with Passage) and refining the passages as notes (CoT with Note). The vanilla RAG model directly feeds the top-ranked passages to the LLM. The Chain-of-Note model [1] is also implemented, which refines and summarizes the retrieved passages for generation. Inspired by Self-RAG by Asai et al. [34], a Self-Rerank model is implemented, which filters out unrelated content without fine-tuning the LLMs.

    Dataset    #Query   #Corpus   Sampling Rate
    NQ          8,757    79,168    5.7%
    TriviaQA    8,837    78,785    5.7%
    WebQ        2,032     3,417   24.6%

Table 1: Summary of datasets. Each dataset consists of 500 randomly sampled questions from the raw dataset.

4.3. Evaluation Metrics

When evaluating adaptive models, it is crucial to consider both task performance and user-centric adaptability simultaneously, along with their trade-offs. Therefore, the results are reported using different metrics, some of which measure effectiveness and others efficiency.

For effectiveness, accuracy is used, following the standard evaluation protocol in the field of Information Retrieval (IR) [35, 36, 34]. Accuracy assesses whether the predicted answer contains the ground-truth answer. Both the outputs of the LLM and the golden answers are converted to lowercase, and string matching (StringEM) is performed between each golden answer and the model prediction to calculate accuracy.

To evaluate user-centric adaptability, the BLEU-2 score is measured to assess the text similarity between different RAG and baseline setups and how closely the generated answers resemble each other. This metric provides insights into the system's ability to generate consistent and coherent responses across various configurations. Additionally, the average sentence length and the average number of syllables of the answers from different RAG setups are reported as a post-hoc analysis. These measures validate whether the RAG system effectively adjusts its responses based on user knowledge levels, ensuring that the generated answers are tailored to the user's understanding and expertise.

Combining these evaluation strategies provides a comprehensive view of both the effectiveness and the user-centric adaptability of the RAG system. The accuracy metric ensures that the system generates correct answers, while the BLEU-2 score and the post-hoc analysis of sentence length and syllable count confirm the system's ability to adapt to user knowledge levels. As the understanding of user needs and system capabilities evolves, it is essential to continuously refine these metrics to maintain the RAG system's effectiveness in delivering personalized, context-aware responses that cater to the diverse requirements of users in the field of IR.

4.4. Implementation Details

For a fair comparison and following the work of Mallen et al. [35] and Trivedi et al. [37], the same retriever, a term-based sparse retrieval model known as BM25 [38], is used across all different models. The retrieval model is implemented using the OpenMatch toolkit [39]. For the external document corpus, the KILT-Wikipedia corpus preprocessed by Petroni et al. [40] is used, and the top-k relevant documents are retrieved.

Regarding the LLMs used to generate answers, the Llama 3 instruct model (ref) with 70b parameters, the Mixture of Experts (MoE) 8x7b model (ref), and the GPT-3.5 model (gpt-3.5-turbo-0125) are employed. For the retrieval-augmented LLM design, the implementation details from Trivedi et al. [37] are followed, including input prompts, instructions, and the number of test samples for evaluation (e.g., 500 samples per dataset).

4.5. Prompts Used in PersonaRAG

This subsection presents the prompt templates employed in the construction of the PersonaRAG model. The prompts utilized in the User Interaction Analysis and Cognitive Dynamic Adaptation components are detailed below. The prompt templates used by the baseline models are available in the project repository (https://github.com/padas-lab-de/ir-rag-sigir24-persona-rag). In the templates, {question} represents the input question, {global_memory} the Global Message Pool, and {passages} the retrieved passages. Additionally, {cot_answer} is populated with the output generated by the Chain-of-Thought model. The placeholder {user_profile_answer} is filled with the response produced by the User Profile agent model. Similarly, {contextual_answer} corresponds to the Contextual Retrieval agent model, {live_session_answer} to the Live Session agent model, {document_ranking_answer} to the Document Ranking agent model, and {feedback_answer} to the Feedback agent model.

4.5.1. Prompts Used in User Interaction Analysis

User Profile Agent
    Your task is to help the User Profile Agent improve its understanding of user preferences based on ranked document lists and the shared global memory pool.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Task Description:
    From the provided passages and global memory pool, analyze clues about the user's search preferences. Look for themes, types of documents, and navigation behaviors that reveal user interest. Use these insights to recommend how the User Profile Agent can refine and expand the user profile to deliver better-personalized results.

Contextual Retrieval Agent
    You are a search technology expert guiding the Contextual Retrieval Agent to deliver context-aware document retrieval.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Task Description:
    Using the global memory pool and the retrieved passages, identify strategies to refine document retrieval. Highlight how user preferences, immediate needs, and global insights can be leveraged to adjust search queries and prioritize results that align with the user's interests. Ensure the Contextual Retrieval Agent uses this shared information to deliver more relevant and valuable results.

Live Session Agent
    Your expertise in session analysis is required to assist the Live Session Agent in dynamically adjusting results.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Task Description:
    Examine the retrieved passages and the information in the global memory pool. Determine how the Live Session Agent can use this data to refine its understanding of the user's immediate needs. Suggest ways to dynamically adjust search results or recommend new queries in real-time, ensuring that session adjustments align with user preferences and goals.

Document Ranking Agent
    Your task is to help the Document Ranking Agent prioritize documents for better ranking.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Task Description:
    Analyze the retrieved passages and global memory pool to identify ways to rank documents effectively. Focus on combining historical user preferences, immediate needs, and session behavior to refine ranking algorithms. Your insights should ensure that documents presented by the Document Ranking Agent are prioritized to match user interests and search context.

Feedback Agent
    You are an expert in feedback collection and analysis, guiding the Feedback Agent to gather and utilize user insights.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Task Description:
    Using the retrieved passages and global memory pool, identify methods for collecting implicit and explicit user feedback. Suggest ways to refine feedback mechanisms to align with user preferences, such as ratings, surveys, or behavioral data. Your recommendations should guide the Feedback Agent in updating other agents' models for more personalized and relevant results.

Global Message Pool
    You are responsible for maintaining and enriching the Global Message Pool, serving as a central hub for inter-agent communication.
    Question: {question}
    Agent Responses: {agent_responses}
    Existing Global Memory: {global_memory}
    Task Description:
    Using the responses from individual agents and the existing global memory, consolidate key insights into a shared repository. Your goal is to organize a comprehensive message pool that includes agent-specific findings, historical user preferences, session-specific behaviors, search queries, and user feedback. This structure should provide all agents with meaningful data points and strategic recommendations, reducing redundant communication and improving the system's overall efficiency.

4.5.2. Prompts Used in Cognitive Dynamic Adaptation

Chain-of-Thought
    To solve the problem, please think and reason step by step, then answer.
    Question: {question}
    Passages: {passages}
    Reasoning process:
    1. Read the given question and passages to gather relevant information.
    2. Write reading notes summarizing the key points from these passages.
    3. Discuss the relevance of the given question and passages.
    4. If some passages are relevant to the given question, provide a brief answer based on the passages.
    5. If no passage is relevant, directly provide the answer without considering the passages.
    Answer:

Cognitive Agent
    Your task is to help the Cognitive Agent enhance its understanding of user insights to continuously improve the system's responses.
    Question: {question}
    Passages: {passages}
    Global Memory: {global_memory}
    Initial Response: {cot_answer}
    User Insights from Interaction Analysis:
    User Profile Agent: {user_profile_answer},
    Contextual Retrieval Agent: {contextual_answer},
    Live Session Agent: {live_session_answer},
    Document Ranking Agent: {document_ranking_answer},
    Feedback Agent: {feedback_answer}
    Task Description:
    Verify the reasoning process in the initial response for errors or misalignments. Use insights from user interaction analysis to refine this response, correcting any inaccuracies and enhancing the query answers based on the user profile. Ensure that your refined response aligns more closely with the user's immediate needs and incorporates foundational or advanced knowledge from other sources.
    Answer:

5. Experimental Results and Analyses

In this section, we show the overall experimental results and offer in-depth analyses of our method.

5.1. Main Results

Table 2 summarizes the primary findings for PersonaRAG across various single-hop question answering datasets. The approach was evaluated against multiple baseline models, including large language models (LLMs) without retrieval-augmented generation (RAG), the conventional RAG model, and self-refined variants, such as utilizing raw retrieved passages (CoT with Passage) or refining passages into notes (CoT with Note).

    Method (Setting)                     Top-3                     Top-5
                                 WebQ   TriviaQA   NQ      WebQ   TriviaQA   NQ
    w/o RAG:
      gpt-3.5-turbo-0125        59.61    97.36   43.90    62.43    97.36   41.46
      Guideline                 36.53    42.10   17.07    47.21    36.84   21.95
    RAG:
      vanillaRAG                38.46    78.94   36.58    50.14    81.57   41.46
      Chain-of-Thought (CoT)    57.69    89.47   39.02    67.51    89.47   41.46
    Self-Refined:
      Chain-of-Note (CoN)       57.17    81.57   48.78    65.15    92.10   48.78
      Self-Rerank (SR)          32.63    81.57   43.90    40.26    84.21   51.21
    PersonaRAG                  63.46    94.73   49.02    67.50    89.47   48.78

Table 2: Overall Accuracy Performance Comparison Using Top-3 and Top-5 Passages. PersonaRAG results are reported in bold.

PersonaRAG demonstrated superior performance compared to most of the baseline models, achieving significant improvements over the conventional RAG (i.e., vanillaRAG) of over 10%, particularly on the WebQ dataset. It also consistently outperformed the ChatGPT-3.5 model, except on TriviaQA, which we suspect is part of the model's training data. These results suggest PersonaRAG's capability to guide LLMs in extracting relevant information through active learning techniques.

Specifically, the performance of the RAG models was assessed using the top 3 and top 5 ranked passages. While other RAG models generally benefited from more passages, PersonaRAG maintained consistent performance with either 3 or 5 passages, suggesting that 3 passages were adequate for generating accurate answers. PersonaRAG agents played a crucial role in efficiently extracting the necessary information regarding the user's information need to achieve these improvements.

Furthermore, on the WebQ dataset, PersonaRAG achieved accuracy scores of 63.46% and 67.50% using Top-3 and Top-5 passages, respectively, surpassing the vanillaRAG model by 25% and 17.36%, and nearly all other baseline models (except for Chain-of-Thought using Top-5, which performed equally). On the NQ dataset, PersonaRAG maintained similarly robust performance with scores of 49.02% and 48.78%, outperforming all baselines (except for Chain-of-Thought and Self-Rerank (SR) using Top-5). This pattern was further validated by experiments on other datasets, with results showing that PersonaRAG consistently outperforms conventional RAG models while providing answers tailored to the user's interactions and information need. The comprehensive understanding it provides contributes to the generation of accurate and user-centric answers across various question complexities.

5.2. Comparative Analysis of RAG Configurations

Further experiments explored PersonaRAG's adaptive capabilities (Figure 3). BLEU-2 scores compared outputs from Chain-of-Note (consistently the strongest baseline apart from PersonaRAG) with the other methods. PersonaRAG showed higher similarity scores, indicating its ability to generate responses that address user needs rather than merely summarizing the input. Additionally, PersonaRAG provides personalized answers tailored to user profiles, extending beyond mere information provision.

Figure 3: Text Similarity between Chain-of-Note (CoN) and Other Methods Using BLEU-2 Score for Evaluation, with Normalized Average Sentence Length and Average Syllable Count. (a) Text Similarity for Top-3 Passages; (b) Text Similarity for Top-5 Passages.

The Chain-of-Note approach demonstrated performance comparable to the Chain-of-Thought approach, implying that both techniques effectively extract pertinent information from the retrieved passages and adapt it to align with the user's information need. In contrast, vanillaGPT and vanillaRAG outputs differed significantly from the Chain-of-Note approach, indicating that counterfactual cognition often leads to diverse outcomes rather than focusing solely on query-relevant content. This suggests that LLMs can construct knowledge from multiple perspectives and customize responses based on user understanding.

Post-hoc analyses of average sentence length and syllable count across RAG configurations provided insights into the system's ability to adapt responses to user comprehension levels. These observations highlight PersonaRAG's capacity to synthesize knowledge from various perspectives and tailor responses to different levels of user expertise.

relevant knowledge due to limitations in their processing capacities. In contrast, the PersonaRAG method provides notable performance improvements: over 8% for LLaMA3-70B and more than 10% for MoE-8x7b across all datasets, underscoring its effectiveness. The PersonaRAG methodology distinguishes itself from the Chain-of-Note approach by offering a cognitive framework that connects retrieved passages with prior knowledge. This framework models the instructor's (GPT-3.5) reasoning process, guiding the student models (LLaMA3-70B and MoE-8x7b) to better understand knowledge retrieved from passages. The results demonstrate that the LLMs are capable of selecting appropriate passages to build more accurate responses, highlighting the benefits of the PersonaRAG approach for improving generalization.

Question: Who stole the Mona Lisa from the Louvre in 1911?
Answer: "Vincenzo Peruggia"

Chain-of-Thought:
1. Read the given question and passages to gather relevant information.
2. Write reading notes summarizing the key points.
3. Discuss the relevance of the given question and passages.
4. If some passages are relevant to the given question, provide a brief answer based on the passages.
5. If no passage is relevant, directly provide the answer without considering the passages.

Step 1: Passage Retrieval
1. The Mona Lisa was stolen from the Louvre Museum while the museum was closed for cleaning. Witnesses reported that a tall, stout individual was carrying what appeared to be a large panel covered with a horse blanket. Two years later, Vincenzo Peruggia was arrested in Florence.
LLaMA3-70B 2. On August 22, 1911, the Mona Lisa was stolen from the w/o RAG 45.25 82.17 38.95 Louvre. Vincenzo Peruggia claimed he stole the painting to vanillaRAG 55.14 85.02 40.37 return it to Italy. Chain-of-Thought 60.52 88.72 45.10 3. In 1911, Vincenzo Peruggia, a Louvre employee, stole the Chain-of-Note 62.67 89.37 48.25 Mona Lisa. Self-Rerank 54.25 84.50 47.77 Step 2: User Interaction Analysis PersonaRAG 66.09 92.12 50.85 User Insights from Interaction Analysis: MoE-8x7b - User Profile Agent: The user is interested in art heists and w/o RAG 38.24 75.82 34.26 historical mysteries. vanillaRAG 48.44 80.25 38.50 - Contextual Retrieval Agent: The user frequently searches Chain-of-Thought 54.12 85.46 42.37 for art history topics. Chain-of-Note 55.98 87.55 45.14 - Live Session Agent: The user reads about unsolved Self-Rerank 52.50 83.04 44.96 mysteries and historic crimes. PersonaRAG 61.35 90.30 49.27 - Document Ranking Agent: The user favors detailed, chronological accounts. Table 3 - Feedback Agent: The user interacts with articles about Overall Accuracy Performance Comparison for Top-5 Passages famous art heists. using LLaMA3-70B and MoE-8x7b. Step 3: Cognitive Dynamic Adaptation The user has a consistent interest in famous art heists and historical mysteries. They are drawn to unsolved mysteries 5.3. Analysis on Generalization Ability and historic crimes. They favor detailed, chronological accounts, so the final response should highlight theft This experiment evaluates the quality of knowledge con- motives, the method, and the investigation leading to the struction using different large language models (LLMs). As arrest, providing comprehensive and authoritative sources illustrated in Table 3, the PersonaRAG outcomes are used to satisfy their engagement with articles on art heists. to prompt open-source LLMs, specifically LLaMA3-70B and Answer: Vincenzo Peruggia, a Louvre employee, stole the MoE-8x7b, to generate accurate answers. 
Mona Lisa from the Louvre Museum on August 21, 1911. He Compared to LLMs without retrieval-augmented gener- claimed he stole the painting to return it to Italy, and the theft remained unsolved for two years before he was ation (w/o RAG), vanilla RAG and Chain-of-Note often ex- arrested in Florence in December 1913. The painting was hibit lower performance. This result suggests that retrieved later recovered and returned to the Louvre. passages can act as noise, adversely affecting model perfor- mance even after refinement through note generation. One Table 4 primary reason for this behavior is that both LLaMA3-70B PersonaRAG Case Study. and MoE-8x7b struggle to efficiently analyze and identify 5.4. Case Study References Finally, we randomly sample one case in Table to demon- [1] W. Yu, H. Zhang, X. Pan, K. Ma, H. Wang, D. Yu, strate the effectiveness of PersonaRAG. Chain-of-note: Enhancing robustness in retrieval- The user interaction analysis mechanism effectively gen- augmented language models, CoRR abs/2311.09210 erates comprehensive results by integrating foundational (2023). URL: https://doi.org/10.48550/arXiv.2311.09210. and advanced insights from user data. Retrieved pas- doi:10.48550/ARXIV.2311.09210 . sages provide critical clues for answering questions, while [2] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 agent analyses summarize and illustrate the applicability (2023). URL: https://doi.org/10.48550/arXiv. of external information to user queries. The cognitive dy- 2303.08774. doi:10.48550/ARXIV.2303.08774 . namic adaptation module refines initial chain-of-thought arXiv:2303.08774 . responses using these insights, generating accurate answers. [3] H. Touvron, T. Lavril, G. Izacard, X. Martinet, For example, including knowledge about the ”theft of the M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Ham- Mona Lisa in 1911,” ”Vincenzo Peruggia,” and ”Florence” bro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. 
Lam- enhances the reasoning process’s precision and detail. This ple, Llama: Open and efficient foundation language demonstrates PersonaRAG’s effectiveness in helping IR models, CoRR abs/2302.13971 (2023). URL: https://doi. agents combine external knowledge with intrinsic user data org/10.48550/arXiv.2302.13971. doi:10.48550/ARXIV. to produce well-informed responses. 2302.13971 . arXiv:2302.13971 . [4] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Ka- 6. Conclusion plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- try, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, This paper proposes PersonaRAG, which constructs the T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, retrieval-augmentation architecture incorporating user in- C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, teraction analysis and cognitive dynamic adaptation. Per- S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, sonaRAG builds the user interaction agents and dynamic A. Radford, I. Sutskever, D. Amodei, Language mod- cognitive mechanisms to facilitate the understanding of user els are few-shot learners, in: H. Larochelle, M. Ran- needs and interests and enhance the system capabilities to zato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in deliver personalized, context-aware responses with the in- Neural Information Processing Systems 33: Annual trinsic cognition of LLMs. Conference on Neural Information Processing Systems Furthermore, PersonaRAG demonstrates effectiveness 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. in leveraging external knowledge and adapting responses URL: https://proceedings.neurips.cc/paper/2020/hash/ based on user profiles, knowledge levels, and information 1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html. needs to support LLMs in generation tasks without fine- [5] Y. Bang, S. Cahyawijaya, N. Lee, W. Dai, D. Su, B. Wilie, tuning. However, this approach requires multiple calls to the H. Lovenia, Z. Ji, T. Yu, W. Chung, Q. V. Do, Y. 
Xu, LLM’s API, which can introduce additional time latency and P. Fung, A multitask, multilingual, multimodal eval- increase API calling costs when addressing questions. The uation of chatgpt on reasoning, hallucination, and process involves constructing the initial Chain-of-Thought, interactivity, in: J. C. Park, Y. Arase, B. Hu, W. Lu, processing the User Interaction Agents results, and execut- D. Wijaya, A. Purwarianti, A. A. Krisnadhi (Eds.), Pro- ing the Cognitive Dynamic Adaptation to generate the final ceedings of the 13th International Joint Conference on answer. Furthermore, the inputs to LLMs in this approach Natural Language Processing and the 3rd Conference tend to be lengthy due to the inclusion of extensive retrieved of the Asia-Pacific Chapter of the Association for Com- passages and the incorporation of user needs, interests, and putational Linguistics, IJCNLP 2023 -Volume 1: Long profile construction results. These factors can impact the ef- Papers, Nusa Dua, Bali, November 1 - 4, 2023, Associa- ficiency and cost-effectiveness of the PersonaRAG approach tion for Computational Linguistics, 2023, pp. 675–718. in practical applications of Information Retrieval (IR) sys- URL: https://doi.org/10.18653/v1/2023.ijcnlp-main.45. tems. doi:10.18653/V1/2023.IJCNLP- MAIN.45 . Future research will aim to optimize the process by reduc- [6] P. S. H. Lewis, E. Perez, A. Piktus, F. Petroni, ing API calls and developing concise representations of user V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, profiles and retrieved information without compromising T. Rocktäschel, S. Riedel, D. Kiela, Retrieval- response quality. We also plan to explore more user-centric augmented generation for knowledge-intensive NLP agents to better capture writing styles and characteristics tasks, in: H. Larochelle, M. Ranzato, R. Hadsell, of RAG users/searchers. This will enhance the system’s M. Balcan, H. 
Lin (Eds.), Advances in Neural In- ability to understand and adapt to individual preferences, formation Processing Systems 33: Annual Confer- improving personalization and relevance in IR tasks. ence on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL: https://proceedings.neurips.cc/paper/2020/hash/ Acknowledgments 6b493230205f780e1bc26945df7481e5-Abstract.html. [7] J. Chen, H. Lin, X. Han, L. Sun, Benchmarking This work has received funding from the European Union’s large language models in retrieval-augmented gen- Horizon Europe research and innovation program under eration, in: M. J. Wooldridge, J. G. Dy, S. Natara- grant agreement No 101070014 (OpenWebSearch.EU, https: jan (Eds.), Thirty-Eighth AAAI Conference on Artifi- //doi.org/10.3030/101070014). cial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Ad- vances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, AAAI Press, 2024, pp. [18] Y. Li, Y. Zhang, L. Sun, Metaagents: Simulating in- 17754–17762. URL: https://doi.org/10.1609/aaai.v38i16. teractions of human behaviors for llm-based task- 29728. doi:10.1609/AAAI.V38I16.29728 . oriented coordination via collaborative generative [8] J. Teevan, S. T. Dumais, E. Horvitz, Personalizing agents, CoRR abs/2310.06500 (2023). URL: https://doi. search via automated analysis of interests and activ- org/10.48550/arXiv.2310.06500. doi:10.48550/ARXIV. ities, SIGIR Forum 51 (2017) 10–17. URL: https://doi. 2310.06500 . org/10.1145/3190580.3190582. doi:10.1145/3190580. [19] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, 3190582 . J. Sun, Q. Guo, M. Wang, H. Wang, Retrieval- [9] K. Sugiyama, K. Hatano, M. Yoshikawa, Adaptive augmented generation for large language models: A web search based on user profile constructed without survey, CoRR abs/2312.10997 (2023). URL: https://doi. 
any effort from users, in: S. I. Feldman, M. Uretsky, M. Najork, C. E. Wills (Eds.), Proceedings of the 13th International Conference on World Wide Web, WWW 2004, New York, NY, USA, May 17-20, 2004, ACM, 2004, pp. 675–684. URL: https://doi.org/10.1145/988672.988764.
[10] G. Adomavicius, B. Mobasher, F. Ricci, A. Tuzhilin, Context-aware recommender systems, AI Mag. 32 (2011) 67–80. URL: https://doi.org/10.1609/aimag.v32i3.2364.
[11] M. J. Wooldridge, An Introduction to MultiAgent Systems, Second Edition, Wiley, 2009.
[12] A. Q. Jiang, A. Sablayrolles, A. Roux, A. Mensch, B. Savary, C. Bamford, D. S. Chaplot, D. de Las Casas, E. B. Hanna, F. Bressand, G. Lengyel, G. Bour, G. Lample, L. R. Lavaud, L. Saulnier, M. Lachaux, P. Stock, S. Subramanian, S. Yang, S. Antoniak, T. L. Scao, T. Gervet, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mixtral of experts, CoRR abs/2401.04088 (2024). URL: https://doi.org/10.48550/arXiv.2401.04088.
[13] F. Xu, W. Shi, E. Choi, RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation, CoRR abs/2310.04408 (2023). URL: https://doi.org/10.48550/arXiv.2310.04408.
[14] Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, G. Neubig, Active retrieval augmented generation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 7969–7992. URL: https://doi.org/10.18653/v1/2023.emnlp-main.495.
[15] H. Zamani, W. B. Croft, Embedding-based query language models, in: B. Carterette, H. Fang, M. Lalmas, J. Nie (Eds.), Proceedings of the 2016 ACM International Conference on the Theory of Information Retrieval, ICTIR 2016, Newark, DE, USA, September 12-16, 2016, ACM, 2016, pp. 147–156. URL: https://doi.org/10.1145/2970398.2970405.
[16] M. R. Ghorab, D. Zhou, A. O'Connor, V. Wade, Personalised information retrieval: survey and classification, User Model. User-Adapt. Interact. 23 (2013) 381–443. URL: https://doi.org/10.1007/s11257-012-9124-1.
[17] S. Jeong, J. Baek, S. Cho, S. J. Hwang, J. C. Park, Adaptive-RAG: Learning to adapt retrieval-augmented large language models through question complexity, CoRR abs/2403.14403 (2024). URL: https://doi.org/10.48550/arXiv.2403.14403.
[20] Y. Huang, J. Huang, A survey on retrieval-augmented text generation for large language models, arXiv preprint arXiv:2404.10981 (2024).
[21] S. Siriwardhana, R. Weerasekera, T. Kaluarachchi, E. Wen, R. Rana, S. Nanayakkara, Improving the domain adaptation of retrieval augmented generation (RAG) models for open domain question answering, Trans. Assoc. Comput. Linguistics 11 (2023) 1–17. URL: https://transacl.org/ojs/index.php/tacl/article/view/4029.
[22] J. Chen, H. Lin, X. Han, L. Sun, Benchmarking large language models in retrieval-augmented generation, in: M. J. Wooldridge, J. G. Dy, S. Natarajan (Eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, AAAI Press, 2024, pp. 17754–17762. URL: https://doi.org/10.1609/aaai.v38i16.29728.
[23] K. Wu, E. Wu, J. Zou, How faithful are RAG models? Quantifying the tug-of-war between RAG and LLMs' internal prior, arXiv preprint arXiv:2404.10198 (2024).
[24] R. C. Atkinson, R. M. Shiffrin, Human memory: A proposed system and its control processes, in: K. W. Spence, J. T. Spence (Eds.), Psychology of Learning and Motivation, volume 2 of Psychology of Learning and Motivation, Elsevier, 1968, pp. 89–195. URL: https://doi.org/10.1016/s0079-7421(08)60422-3.
[25] A. Sharma, S. Kumar, Semantic web-based information retrieval models: a systematic survey, in: Data Science and Analytics: 5th International Conference on Recent Developments in Science, Engineering and Technology, REDSET 2019, Gurugram, India, November 15–16, 2019, Revised Selected Papers, Part II, Springer, 2020, pp. 204–222.
[26] A. Kacem, Personalized Information Retrieval based on Time-Sensitive User Profile (Recherche d'Information Personnalisée basée sur un Profil Utilisateur Sensible au Temps), Ph.D. thesis, Paul Sabatier University, Toulouse, France, 2017. URL: https://tel.archives-ouvertes.fr/tel-01707423.
[27] A. Singh, A. Sharma, A multi-agent framework for context-aware dynamic user profiling for web personalization, in: Software Engineering: Proceedings of CSI 2015, Springer, 2019, pp. 1–16.
[28] S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, Metagpt: Meta programming for multi-agent collaborative framework, CoRR abs/2308.00352 (2023). URL: https://doi.org/10.48550/arXiv.2308.00352.
[29] D. K. Limbu, A. M. Connor, R. Pears, S. G. MacDonell, Contextual relevance feedback in web information retrieval, in: I. Ruthven (Ed.), Proceedings of the 1st International Conference on Information Interaction in Context, IIiX 2006, Copenhagen, Denmark, October 18-20, 2006, ACM, 2006, pp. 138–143. URL: https://doi.org/10.1145/1164820.1164848.
[30] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. P. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, S. Petrov, Natural questions: a benchmark for question answering research, Trans. Assoc. Comput. Linguistics 7 (2019) 452–466. URL: https://doi.org/10.1162/tacl_a_00276.
[31] M. Joshi, E. Choi, D. S. Weld, L. Zettlemoyer, TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension, in: R. Barzilay, M. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers, Association for Computational Linguistics, 2017, pp. 1601–1611. URL: https://doi.org/10.18653/v1/P17-1147.
[32] J. Berant, A. Chou, R. Frostig, P. Liang, Semantic parsing on Freebase from question-answer pairs, in: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2013, pp. 1533–1544. URL: https://aclanthology.org/D13-1160/.
[33] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou, Chain-of-thought prompting elicits reasoning in large language models, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL: http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
[34] A. Asai, Z. Wu, Y. Wang, A. Sil, H. Hajishirzi, Self-RAG: Learning to retrieve, generate, and critique through self-reflection, CoRR abs/2310.11511 (2023). URL: https://doi.org/10.48550/arXiv.2310.11511.
[35] A. Mallen, A. Asai, V. Zhong, R. Das, D. Khashabi, H. Hajishirzi, When not to trust language models: Investigating effectiveness of parametric and non-parametric memories, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 9802–9822. URL: https://doi.org/10.18653/v1/2023.acl-long.546.
[36] J. Baek, S. Jeong, M. Kang, J. C. Park, S. J. Hwang, Knowledge-augmented language model verification, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Association for Computational Linguistics, 2023, pp. 1720–1736. URL: https://doi.org/10.18653/v1/2023.emnlp-main.107.
[37] H. Trivedi, N. Balasubramanian, T. Khot, A. Sabharwal, Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 10014–10037. URL: https://doi.org/10.18653/v1/2023.acl-long.557.
[38] S. E. Robertson, S. Walker, S. Jones, M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, in: D. K. Harman (Ed.), Proceedings of The Third Text REtrieval Conference, TREC 1994, Gaithersburg, Maryland, USA, November 2-4, 1994, volume 500-225 of NIST Special Publication, National Institute of Standards and Technology (NIST), 1994, pp. 109–126. URL: http://trec.nist.gov/pubs/trec3/papers/city.ps.gz.
[39] S. Yu, Z. Liu, C. Xiong, Z. Liu, OpenMatch-v2: An all-in-one multi-modality PLM-based information retrieval toolkit, in: H. Chen, W. E. Duh, H. Huang, M. P. Kato, J. Mothe, B. Poblete (Eds.), Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2023, Taipei, Taiwan, July 23-27, 2023, ACM, 2023, pp. 3160–3164. URL: https://doi.org/10.1145/3539618.3591813.
[40] F. Petroni, A. Piktus, A. Fan, P. S. H. Lewis, M. Yazdani, N. D. Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, V. Plachouras, T. Rocktäschel, S. Riedel, KILT: a benchmark for knowledge intensive language tasks, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 2523–2544. URL: https://doi.org/10.18653/v1/2021.naacl-main.200.
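For illustration, the Cognitive Agent prompt template quoted in this paper can be assembled programmatically from the five user-interaction agents' outputs. The following sketch is ours, not part of the PersonaRAG release; the function name and the example insight values are hypothetical, and only the template text itself is taken from the paper.

```python
# Illustrative sketch (not from the PersonaRAG codebase): fill the
# Cognitive Agent template with the question, the initial chain-of-thought
# answer, and the outputs of the five user-interaction agents.

COGNITIVE_AGENT_TEMPLATE = """Your task is to help the Cognitive Agent enhance its understanding of user insights to continuously improve the system's responses.

Question: {question}
Initial Response: {cot_answer}

User Insights from Interaction Analysis:
User Profile Agent: {user_profile_answer},
Contextual Retrieval Agent: {contextual_answer},
Live Session Agent: {live_session_answer},
Document Ranking Agent: {document_ranking_answer},
Feedback Agent: {feedback_answer}

Task Description:
Verify the reasoning process in the initial response for errors or misalignments. Use insights from user interaction analysis to refine this response, correcting any inaccuracies and enhancing the query answers based on user profile. Ensure that your refined response aligns more closely with the user's immediate needs and incorporates foundational or advanced knowledge from other sources.

Answer:"""


def build_cognitive_prompt(question: str, cot_answer: str, insights: dict) -> str:
    """Return the Cognitive Agent prompt, ready to send to the LLM.

    `insights` maps the five placeholder names (user_profile_answer,
    contextual_answer, live_session_answer, document_ranking_answer,
    feedback_answer) to the corresponding agent outputs.
    """
    return COGNITIVE_AGENT_TEMPLATE.format(
        question=question, cot_answer=cot_answer, **insights
    )


# Example values loosely based on the Mona Lisa case study (hypothetical).
prompt = build_cognitive_prompt(
    "Who stole the Mona Lisa from the Louvre in 1911?",
    "Vincenzo Peruggia",
    {
        "user_profile_answer": "interested in art heists and historical mysteries",
        "contextual_answer": "frequently searches for art history topics",
        "live_session_answer": "reads about unsolved mysteries and historic crimes",
        "document_ranking_answer": "favors detailed, chronological accounts",
        "feedback_answer": "interacts with articles about famous art heists",
    },
)
```

The resulting string is the exact prompt text, with all placeholders replaced, that would be passed to the backbone model (GPT-3.5 in the paper's experiments) as the Cognitive Dynamic Adaptation step.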