RAGSys: Item-Cold-Start Recommender as RAG System

Emile Contal1,∗,†, Garrin McGoldrick1,†
1 Crossing Minds Inc, San Francisco, USA


Abstract
Large Language Models (LLMs) hold immense promise for real-world applications, but their generic knowledge often falls short of domain-specific needs. Fine-tuning, a common approach, can suffer from catastrophic forgetting and hinder generalizability. In-Context Learning (ICL) offers an alternative, which can leverage Retrieval-Augmented Generation (RAG) to provide LLMs with relevant demonstrations for few-shot learning tasks. This paper explores the desired qualities of a demonstration retrieval system for ICL. We argue that ICL retrieval in this context resembles item-cold-start recommender systems, prioritizing discovery and maximizing information gain over strict relevance. We propose a novel evaluation method that measures the LLM's subsequent performance on NLP tasks, eliminating the need for subjective diversity scores. Our findings demonstrate the critical role of diversity and quality bias in retrieved demonstrations for effective ICL, and highlight the potential of recommender system techniques in this domain.

Keywords
Recommender systems, Information Retrieval, Large Language Models, Few-Shot Learning, In-Context-Learning



Information Retrieval's Role in RAG Systems (IR-RAG) - 2024
∗ Corresponding author.
† Both authors contributed equally to this research.
emile@crossingminds.com (E. Contal); garrin.mcgoldrick@crossingminds.com (G. McGoldrick)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

Large Language Models (LLMs) have emerged as a powerful tool for natural language processing, demonstrating remarkable abilities in areas like text completion, summarization, and question answering [1]. One of their most intriguing capabilities is their potential to learn "common sense": general knowledge about the world that allows them to reason and make inferences beyond the literal meaning of text. This has fueled excitement about the possibility of achieving zero-shot learning, where LLMs can solve unseen problems without any prior training on specific tasks [2].

However, a crucial distinction exists between generic public knowledge and the specific private knowledge required for most real-world use cases. While LLMs excel at generic text completion or chat-like interactions, practical applications often demand solving specific and repeatable downstream tasks within a particular domain [3]. This typically necessitates knowledge specific to a business or organization, such as understanding internal processes, up-to-date product details, or customer behavior.

Fine-tuning, a technique where LLMs are trained on large datasets tailored to the target task, offers a path towards adapting LLMs to these domain-specific needs. Yet, fine-tuning presents significant challenges. When trained on task-specific data, LLMs tend to forget knowledge and skills gained in the initial training, a phenomenon referred to as Catastrophic Forgetting [4]. Consequently, a fine-tuned LLM loses some of its ability to generalize to novel examples that aren't well represented in its fine-tuning training data. Moreover, while fine-tuning allows an LLM to memorize task-specific information, it doesn't necessarily allow the LLM to reason about that information [5]. As a final consideration, keeping LLMs constantly up-to-date using fine-tuning can be infeasible, especially for domains with frequently changing information like e-commerce product inventory, whereas it is easy to update in real-time a database from which information is retrieved.

As an alternative to fine-tuning, In-Context Learning (ICL) offers a promising approach for leveraging LLMs in scenarios with limited data. This approach exploits the demonstrated ability of LLMs for "meta-learning", essentially learning how to learn. In [6], the authors prove the capacity of LLMs to effectively ingest in-context training data points and solve statistical optimization problems such as gradient descent. ICL enables practitioners to leverage Retrieval Augmented Generation (RAG), that is, enriching the input prompt with information retrieved in real-time [7]. We refer to [8] for a recent survey on ICL.

This paper focuses on few-shot learning and the retrieval of relevant demonstrations for this process, where a demonstration is some text which is included in the LLM's context to demonstrate how the LLM should formulate correct answers. Few-shot learning presents a well-structured problem, allowing us to evaluate the quality of the retrieval algorithm using established classification metrics. Crucially, we show that enriching a language model with a few-shot example retriever offers a powerful method to achieve fine-tuning-like behavior, steering the output of the LLM towards the desired outcome even with limited data. Interestingly, increasing the context size in prompts beyond a certain point yields diminishing returns. The most impactful information resides within a relatively small set of well-chosen demonstrations, rather than overloading the prompt with vast amounts of data [9]. This highlights the importance of effective retrieval strategies, transforming k-shot learning into a top-k information retrieval problem at its core.

Building upon this concept, this paper identifies desirable properties for a RAG system under the framework of few-shot learning. We demonstrate that state-of-the-art retrieval systems in this context resemble item-cold-start recommender systems. Unlike exact search algorithms that prioritize precision and recall, our focus is on discovery, maximizing the collective information gain of the retrieved demonstrations. This necessitates solving various trade-offs between query relevance, quality scoring, and diversity algorithms to ensure a variety of informative examples are surfaced. Furthermore, we propose a method for evaluating RAG system performance through the subsequent performance of the enriched LLM on established NLP tasks like question answering or text generation. This methodology offers a valuable approach to directly assessing diversity- and quality-based retrieval systems, which removes the need to define a subjective diversity score, a historically challenging aspect of evaluating such systems in academic settings [10].
To summarize, in this paper we study the impact of diversity and quality bias in retrieving demonstrations for ICL. We start by reviewing the use of diversity and other biases in both ICL and Information Retrieval works. We then propose a method for evaluating the performance of different retrieval algorithms. Then we present experiments and results demonstrating the impact of diversity and quality bias on an LLM's ability to generate correct answers. Finally, we discuss the applicability of state-of-the-art ICL retrieval algorithms in real-world settings, and show that recommendation engines offer a better solution than semantic search engines.

2. Related Work

This paper sits at the intersection of two distinct, but increasingly intertwined, research areas: In-Context Learning for Large Language Models and Information Retrieval. While ICL focuses on enabling LLMs to learn from carefully selected contextual information, IR deals with retrieving relevant information from document collections. Our work leverages concepts from both fields to address the challenge of few-shot learning with LLMs.

2.1. Few-Shot In-Context Learning with Retrieval Augmented Generation

Within the context of RAG, few-shot learning can be defined as a specific scenario where the "documents" retrieved are actually "examples" used to guide the LLM. These examples can also be referred to interchangeably as "In-Context Examples" (ICE) or "demonstrations". The importance of ICL in achieving state-of-the-art LLM performance is undeniable, with its ubiquitous presence in top benchmarks across various domains. Consequently, ICL research is a rapidly evolving field with numerous proposed algorithms.

2.1.1. Pure Diversity

Several noteworthy ICL approaches have emerged that address the challenge of retrieving informative examples for few-shot learning. Some methods only promote the diversity of the demonstrations, like in [11] where the authors utilize k-means clustering in a dense embedding space to achieve diversity. By applying k-means to the sentence embeddings of the demonstrations, this approach ensures that the retrieved examples cover a variety of semantic spaces, inherently increasing the mutual information of the retrieved set, but without taking into account the relevancy to the query.

2.1.2. Pure Quality

Other approaches focus on identifying examples where the LLM exhibits low token-level uncertainty. In [12] the authors analyze token probabilities within candidate 0-shot prompts. By prioritizing prompts where the LLM has the highest generation likelihood (low perplexity), this approach aims to select examples that hold the potential for significant learning gains for the LLM. The intuition that the authors give is that a prompt that is more expected by the LLM is more likely to help it extract the relevant information. Accessing the per-token probabilities for all examples incurs a significant compute cost, but they can be pre-computed as they do not depend on the query.

2.1.3. Pure Relevance

Notably, a connection can be drawn between traditional full-text search algorithms and pure relevance approaches. In [13] the authors use BM25 [14], a well-established retrieval ranking function commonly used in information retrieval tasks. This approach essentially leverages the strengths of BM25 in identifying examples with terms highly similar to the query, to select the most relevant examples for the specific task at hand. This strategy ensures the retrieved examples are topically relevant to the task while potentially introducing some variation in the specific phrasing or wording used.

Finally, neural ranking, one of the most commonly used ICL approaches, typically yielding superior results [15, 16, 17, 18, 19], maximizes similarity in a dense embedding space. These methods, like KATE [20], utilize k-Nearest Neighbors (kNN) search using the cosine distance of sentence embeddings to retrieve the examples most semantically similar to the prompt. Scaling this method leverages vector search algorithms, commonly used in large-scale information retrieval tasks, where efficient retrieval of semantically similar documents is crucial.

While general-purpose pre-trained embeddings like BERT [21] form a strong baseline, learning specific embeddings for retrieval, and in particular for ICL, is a very active area of research. In [22] the authors build upon BERT and introduce ColBERT, which improves the retrieval performance by an order of magnitude. Other embedding models have been proposed for ICL retrieval in [16] and [17]. Some authors also explored training full language models [15], as well as LLMs [19], showing further improvements compared to traditional embedding-based approaches. While leading to superior results, these supervised neural ranking models for learning-to-rank necessitate orders of magnitude more training data, which is typically not available to practitioners. In addition, without an explicit metric space such as dense embeddings, efficient retrieval indexing such as [23] cannot be used.
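To make the embedding-based relevance retrieval above concrete, the following sketch retrieves the k demonstrations whose questions are closest to the prompt by cosine similarity of sentence embeddings, in the spirit of KATE [20]. The specific model name and the brute-force in-memory search are illustrative assumptions of ours, not the setup of any cited paper.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model choice; any sentence-embedding model could be substituted.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def knn_demonstrations(prompt, demonstrations, k=6):
    """Return the k (question, answer) demonstrations whose questions are most
    cosine-similar to the prompt (pure relevance, no diversity or quality bias)."""
    questions = [q for q, _ in demonstrations]
    # normalize_embeddings=True makes the dot product equal to cosine similarity.
    q_emb = encoder.encode(prompt, normalize_embeddings=True)
    d_emb = encoder.encode(questions, normalize_embeddings=True)
    scores = d_emb @ q_emb
    top = np.argsort(-scores)[:k]
    return [demonstrations[i] for i in top]
```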

2.1.4. Diversity/Relevance Trade-off

While both relevance and diversity are crucial for effective few-shot learning, methods yielding the best ICL results combine these two paradigms rather than prioritizing one over the other. This is achieved by maximizing a carefully balanced trade-off between semantic similarity to the prompt and diversity of the retrieved examples. Unsupervised techniques can be adapted to prioritize the selection of examples that are both relevant to the prompt and dissimilar to each other. In [24] the authors introduce a greedy method to select relevant demonstrations while ensuring enough coverage. They define specific coverage strategies adapted to the problem of program generation. In [25] the authors employ an active learning setting, where a voting algorithm selects a set of examples while penalizing the top-k closest to already selected ones, using cosine distance in an embedding space.

The most popular unsupervised approach for achieving this balance between relevance and diversity is Maximal Marginal Relevance (MMR). MMR retrieves a set of examples by iteratively selecting the example that maximizes a linear combination of the scores of relevance to the prompt and dissimilarity to the previously retrieved examples. It was analyzed in [26] for ICL and was shown to outperform simpler methods.
Alternatively to MMR, Determinantal Point Processes (DPP) have been used in [18] to optimize the joint information of the selected k examples. However, exactly solving the DPP optimization is NP-hard, hence the authors also employ greedy maximization.

2.2. Diversity in Information Retrieval and Online Learning

The concept of diversity in information retrieval has been a long-running topic of research. In this section we propose a short review of the use of diversity in the IR literature and related domains.

2.2.1. In Information Retrieval

The MMR algorithm was analyzed in [27], and compared against other approaches like KL-divergence optimization in [28]. Pure retrieval algorithms typically optimize for recall, not information maximization. Agreeing on a diversity objective function remains a challenge. Diversity is sometimes introduced as a heuristic to cover possible interpretations of the same query, instead of minimizing information overlap from near-duplicated results. In [10] the authors leverage a concept of information nuggets within documents to estimate the redundancy of the set of retrieved documents. Topic modeling is also employed, such as in [29], which uses a taxonomy of categories labelling the documents and the queries. The desired properties of diverse retrieval are furthermore characterized in [30]. A varied set of similarity methods and diversification algorithms is analyzed in [31] on sparse feature vectors. Among diversity evaluation methods based on topic modelling, three notable criteria used in the TREC Diversity track [32], ERR-IA [33], α-nDCG@k [10], and NRBP [34], are compared in [35].

2.2.2. In Recommender Systems

Within IR, the recommender system literature brings an additional point of view on studying diversity in retrieval, by focusing on the benefit of diverse results for a user, instead of evaluating methods against a potentially arbitrary relevance/diversity trade-off. The difficulty of evaluating the impact of diversity, and the necessity for large-scale real-world recommendation studies, has been explored in [36]. In [37] and [38] the authors model the user behavior conditioned on the set of retrieved items. In [39] the authors improve the diversity versus relevance trade-off in recommender systems by directly learning a ranking model that favors diversity, instead of only applying diversity re-ranking methods.

2.2.3. In Online Learning

Learning a trade-off between relevancy and diversity also naturally occurs in related online frameworks such as active learning, multi-armed bandits and Bayesian optimization. In [40] the authors modify a learning-to-rank algorithm from user feedback to inherently learn diverse rankings, and demonstrate a positive impact on the original relevance metric. Other approaches such as [41] also introduce diversity in learning-to-rank algorithms while preserving the offline setting, but are then limited to evaluation using direct diversity measures.

Within batch-mode Bayesian optimization, in [42] and [43] the authors analyze two greedy exploration/exploitation algorithms to select the next batch of items maximizing the cumulative reward. Like with recommender systems, these online settings exemplify the theoretical and empirical importance of diversifying the selected set of items despite the true objective function only including pure relevance, the cumulative reward.

2.3. Quality Biases in Information Retrieval

Complementing the discussion on diversity in information retrieval, quality bias also plays a crucial role in effective retrieval. Quality bias refers to the prioritization of documents or examples considered to be more reliable or informative within the retrieved set. Incorporating quality considerations into retrieval algorithms can significantly improve standard unbiased IR metrics.

Several approaches have been explored to address quality bias in pure IR tasks. These can be broadly categorized into content-based and graph-based methods.

2.3.1. Content-Based Quality Biases

Content-based methods leverage existing signals inside the documents themselves to identify potentially lower-quality content. Examples include spam detection scores developed in works like [44] and [45]. By incorporating such scores during retrieval, the system can prioritize higher-quality documents. More sophisticated content-based approaches do not stop at spam classification, but extract more generic quality features from the content of documents. Works like [46] explore features such as stop-word statistics or the entropy of the documents to generate quality scores. The authors demonstrate that biasing standard retrieval using these features leads to improved retrieval effectiveness even using unbiased IR metrics like nDCG.

2.3.2. Graph-Based Quality Biases

Instead of relying on the content itself, graph-based algorithms inherently capture implicit quality signals within their ranking model. PageRank, a seminal algorithm for web search ranking introduced in [47], exemplifies this approach. PageRank leverages the link structure between web articles to assign higher importance to web pages that are linked to by other high-quality pages. This process implicitly prioritizes documents with a higher perceived quality based on the quality of their in-links.

2.3.3. Connections to Recommender Systems

Interestingly, the concept of inherent quality bias in graph-based IR approaches resembles collaborative filtering techniques employed in recommender systems. In an analogous manner to learning-to-rank on an (item, item) graph, collaborative filtering addresses learning-to-rank on a bipartite (user, item) graph. In this way, collaborative filtering also implicitly learns a trade-off between item similarity and popularity, favoring items that are both similar to the user's past preferences and generally well-received by other users.
3. Methodology

We propose to frame the ICL problem as an item-cold-start recommendation problem, where the query is an unseen item, and the objective is to retrieve from the pool of candidate few-shot demonstrations a set of items maximizing the cumulative reward to the user (the LLM). In this case, the reward is a measure of how much the retrieved items increase the probability that the LLM generates a correct answer. A solution to this optimization problem requires not only relevance, but also diversity and quality in the retrieved items, such that the amount of useful information presented to the LLM in the context is maximized.

Further, we propose to measure the impact of diversity on the retrieved items by directly calculating the probability of the LLM generating a correct answer given the context items. This is in contrast to a typical retrieval context where the retriever is evaluated by calculating some metric relating to the accuracy and recall of documents most similar to the query. In such a setting, it is typical to add a term to the metric which measures the diversity of the retrieved documents to promote more diverse retrievers, knowing that diversity improves the reward to the user but without having an explicit model connecting diversity to the user's reward. In the case of retrieving demonstrations for inclusion in an LLM's context, we can directly measure the impact of diversity on the LLM's reward by calculating the probability of the LLM generating a correct answer.

3.1. Problem Statement

Consider an answer a that should be generated by an LLM in response to a query q. The query can be a simple question such as "Who was the first man to walk on the moon?", or a more general message such as "I'd like to find red shoes". The answer could take on many forms, such as a factual response "Neil Armstrong", a clarifying question "What size and style of shoe are you looking for?", a JSON payload to send to an API {"search_terms":["red","shoe"]}, etc.

Consider a set of demonstrations 𝒟, where each demonstration is a pair (q, a) containing a query q and correct answer a, or a triple (q, a, ā) which additionally contains an incorrect answer ā. Datasets under this latter triplet form are commonly used in Contrastive Learning approaches. We call C, a subset of demonstrations retrieved from 𝒟, the context:

    C ⊂ 𝒟 = {(q_1, a_1, ā_1), …, (q_n, a_n, ā_n)}

Given an auto-regressive LLM M, the query q, and a retrieved context C, we define p_M(a | q, C) as the probability that M generates the answer a. In practice, the tokens of the examples from the context C are appended to the tokens of the query q, using prompt formatting techniques that may be optimized to a specific LLM.

Putting it all together, for an unseen query q and unseen correct answer a, a few-shot retriever R_𝒟 must efficiently retrieve a subset of k demonstrations R_𝒟(q) ∈ 𝒟^k such that p_M(a | q, R_𝒟(q)) is maximized.

3.2. Evaluation

Consider the probability of generating the correct answer a given an empty context, p_M(a | q). We are interested in evaluating how much the context C increases the probability of the correct answer a. That is, we want a metric which is related to the difference between p_M(a | q, C) and p_M(a | q).

In a pure retrieval setting, we would be interested in finding the context C which contains the k demonstrations that are most similar to q. And we could argue that if there exists a smooth function f : q → a which maps a query to its correct answer, then by retrieving the demonstrations whose queries are nearest to q, we should also be retrieving the answers which are closest to a, and this should help the language model M generate the correct answer a.

However, it is doubtful that the space in which q is compared to the demonstrations is one in which the function f : q → a is smooth, so it is not necessarily true that the retrieved answers are closest to a. Nor is it necessarily true that p_M(a | q, C) is maximized when C contains those answers closest to a. Consider that the answer a might depend on some information which isn't contained in a or any nearby answer.

Therefore, we prefer to measure p_M(a | q, C) directly. In practice, given that M is an auto-regressive language model, this is done by taking the product of the probability of each token generated by M. The model generates text sequentially by predicting one token at a time based on the previously generated tokens. Let a = (a_1, a_2, …, a_n) represent a sequence of tokens produced by the model. The probability of the model generating the sequence a can be expressed as the joint probability of generating each token in the sequence, conditioned on the tokens that precede it. This can be mathematically represented as:

    p(a) = p(a_1) · p(a_2 | a_1) · p(a_3 | a_1, a_2) ⋯ p(a_n | a_1, a_2, …, a_{n-1})

Thus, p_M(a | q, C) is the product of the conditional probabilities of each token, and these probabilities are output by the LLM at inference time and are readily available in APIs serving LLMs such as the OpenAI API.
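As a minimal illustration of how this probability is assembled in practice, the sketch below builds a few-shot prompt from the context C and accumulates per-token log-probabilities of the answer. The prompt template and the score_answer_tokens callable are placeholders for whatever template and log-probability-returning API or local model is actually used; they are assumptions for illustration, not part of the method described above.

```python
import math
from typing import Callable, List, Sequence, Tuple

def format_prompt(context: Sequence[Tuple[str, str]], query: str) -> str:
    """Append the (question, answer) demonstrations of the context C to the query q.
    One simple template among many; real prompts may be tuned per LLM."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in context)
    return f"{shots}\n\nQ: {query}\nA:"

def answer_probability(
    context: Sequence[Tuple[str, str]],
    query: str,
    answer: str,
    score_answer_tokens: Callable[[str, str], List[float]],
) -> float:
    """p_M(a | q, C): product of the conditional probabilities of each answer token.
    score_answer_tokens(prompt, answer) stands in for an LLM call that returns one
    log-probability per token of `answer`, conditioned on `prompt` and the preceding tokens."""
    prompt = format_prompt(context, query)
    token_logprobs = score_answer_tokens(prompt, answer)
    return math.exp(sum(token_logprobs))
```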
3.2.1. Classification Metrics

In binary classification, accuracy is typically used as an evaluation metric, and can be defined as:

    (1/|𝒟|) Σ_{(x, y, ȳ) ∈ 𝒟} 1( p(y | x) > p(ȳ | x) )

where |𝒟| is the number of examples in the dataset 𝒟; 1(·) is the indicator function that returns 1 if the condition is true and 0 otherwise; and y (resp. ȳ) is the correct (resp. incorrect) label for example x.

Given a retriever R_𝒟 and a demonstration (q, a, ā) ∈ 𝒟, we introduce the simplified leave-one-out notation R(q) = R_{𝒟 ∖ {(q, a, ā)}}(q). We define the metric MC1, which is related to accuracy:

    MC1 = (1/|𝒟|) Σ_{(q, a, ā) ∈ 𝒟} 1( p_M(a | q, R(q)) > p_M(ā | q, R(q)) )

In the case that many incorrect answers are provided for each query, forming a set Ā, we can extend this in the same manner as multi-class classification by requiring that the correct answer have greater probability than all the incorrect answers:

    (1/|𝒟|) Σ_{(q, a, Ā) ∈ 𝒟} Π_{ā ∈ Ā} 1( p_M(a | q, R(q)) > p_M(ā | q, R(q)) )
We also define a metric MC2, which extends this further to the case that multiple correct answers (a set A) and multiple incorrect answers (a set Ā) are provided for each query. This metric is the average number of correct answers which have greater probability than all incorrect answers:

    MC2 = (1/|𝒟|) Σ_{(q, A, Ā) ∈ 𝒟} (1/|A|) Σ_{a ∈ A} Π_{ā ∈ Ā} 1( p_M(a | q, R(q)) > p_M(ā | q, R(q)) )

Finally, we define the related metric MC3. This metric is the ratio of the probability of correct answers to the probability of incorrect answers:

    MC3 = (1/|𝒟|) Σ_{(q, A, Ā) ∈ 𝒟} [ Σ_{a ∈ A} p_M(a | q, R(q)) ] / [ Σ_{ā ∈ Ā} p_M(ā | q, R(q)) ]

These metrics and their names follow those defined in [48]. While they are easy to interpret, these metrics are not well normalized: they don't take into account all possible correct and incorrect answers. As a result, if the samples of correct and incorrect answers have varying lengths and use of rare vocabulary tokens, these will impact the metrics.
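A compact sketch of how these three metrics can be computed once the leave-one-out probabilities p_M(· | q, R(q)) have been collected for every answer. The per-example data layout and the use of the first listed correct answer as MC1's reference are our own assumptions for illustration, not the released TruthfulQA evaluation code.

```python
def mc_metrics(examples):
    """examples: list of (correct_probs, incorrect_probs) pairs, one per query q, where
    correct_probs[i] = p_M(a_i | q, R(q)) for the correct answers, likewise for incorrect."""
    mc1 = mc2 = mc3 = 0.0
    for correct, incorrect in examples:
        best_incorrect = max(incorrect)
        # MC1: the reference correct answer beats every incorrect answer.
        mc1 += float(correct[0] > best_incorrect)
        # MC2: fraction of correct answers that beat every incorrect answer.
        mc2 += sum(a > best_incorrect for a in correct) / len(correct)
        # MC3: ratio of total correct probability mass to total incorrect probability mass.
        mc3 += sum(correct) / sum(incorrect)
    n = len(examples)
    return mc1 / n, mc2 / n, mc3 / n
```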
                                                                          avoid redundancy, we incorporate the Maximal Marginal
3.2.2. Direct Preference Optimization Metric                              Relevance (MMR) algorithm. MMR iteratively selects the
                                                                          items that maximizes the combined score of relevance to the
We postulate that an ideal metric should obey the following               query and dissimilarity to the previously chosen items. This
properties: it should be positive when the retrieved context              ensures a balance between retrieving relevant items and
increases the probability of a correct answer; it should be               ensuring they cover a variety of information. A parameter,
equal in magnitude when the probability of a correct an-                  𝜆𝑑 , is used to control the trade-off between relevance and
swer halves or doubles; it should relate to the probability               diversity. Higher values of 𝜆𝑑 prioritize relevance, whereas
of getting all correct answers such that if any one correct               lower values prioritize diversity.
answer is impossible, the metric is minimized. Moreover,
in the case that incorrect answers are provided, it should
                                                                          3.3.3. Demonstration Quality Bias
be positive when the context 𝐶 increases the probability of
correct answers more than that of incorrect answers.                      While the pre-trained BERT embeddings capture semantic
   We define the DPO metric as the negative of the Direct                 relationships, they do not inherently account for the quality
Preference Optimization loss [49], which satisfies these prop-            of the few-shot demonstrations. To address this, we explic-
erties:                                                                   itly introduce a demonstration quality bias term related to
                                                                          the popularity of an item in a training dataset. This score
                                                                          is computed using the log perplexity of the demonstration
                      𝑝𝑀 (𝑎 ∣ 𝑞, 𝑅(𝑞))           𝑝𝑀 (𝑎 ̄ ∣ 𝑞, 𝑅(𝑞))       answer 𝑎, given the demonstration question 𝑞.
  DPO = log 𝜎( log                       − log                        )
                         𝑝𝑀 (𝑎 ∣ 𝑞)                 𝑝𝑀 (𝑎 ̄ ∣ 𝑞)
                                                                                               1
                                                                                                  ∑ log 𝑝𝑀 (𝑎𝑖 ∣ 𝑞)
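The DPO metric for a single (q, a, ā) triplet then reduces to a few arithmetic operations on the four probabilities involved; a minimal sketch, with names of our choosing:

```python
import math

def dpo_metric(p_correct_ctx, p_correct, p_incorrect_ctx=None, p_incorrect=None):
    """Negative DPO loss for one triplet.
    p_correct_ctx   = p_M(a | q, R(q));   p_correct   = p_M(a | q)
    p_incorrect_ctx = p_M(ā | q, R(q));   p_incorrect = p_M(ā | q)
    The incorrect-answer term is dropped when no incorrect answer is available."""
    margin = math.log(p_correct_ctx / p_correct)
    if p_incorrect_ctx is not None and p_incorrect is not None:
        margin -= math.log(p_incorrect_ctx / p_incorrect)
    # log sigmoid(margin); a numerically stable log-sigmoid is preferable in practice
    return -math.log1p(math.exp(-margin))
```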
3.3. Retrieval Algorithm

This section details the core algorithm employed for few-shot demonstration retrieval, which leverages a greedy strategy to maximize a combination of three key scores: query relevance, demonstration diversity, and demonstration quality bias. The full retrieval algorithm is presented in Algorithm 1.

3.3.1. Query Relevance

The relevance score between the query and each candidate demonstration is calculated using the cosine similarity of their respective BERT embeddings [21]. By computing the cosine similarity between the query embedding and the embedding of each demonstration's query, we obtain a score that reflects the topical similarity and semantic alignment between the query and the candidate demonstration.

3.3.2. Retrieved Demonstrations Diversity

To promote diversity in the retrieved demonstrations and avoid redundancy, we incorporate the Maximal Marginal Relevance (MMR) algorithm. MMR iteratively selects the item that maximizes the combined score of relevance to the query and dissimilarity to the previously chosen items. This ensures a balance between retrieving relevant items and ensuring they cover a variety of information. A parameter, λ_d, is used to control the trade-off between relevance and diversity. Higher values of λ_d prioritize relevance, whereas lower values prioritize diversity.

3.3.3. Demonstration Quality Bias

While the pre-trained BERT embeddings capture semantic relationships, they do not inherently account for the quality of the few-shot demonstrations. To address this, we explicitly introduce a demonstration quality bias term related to the popularity of an item in a training dataset. This score is computed using the log perplexity of the demonstration answer a, given the demonstration question q:

    (1/|a|) Σ_{a_i ∈ a} log p_M(a_i | q)

This can be interpreted as measuring the probability of the correct answer a given the query q, normalized by the length of the answer. It can also be interpreted as a proxy for a popularity bias, akin to the number of connections of an item in graph-based retrieval algorithms like recommender systems. Like in the article [12], the intuition is that the more frequently a related sequence of tokens occurs in the pre-training dataset of the LLM, the more likely the model will be able to extract its relevant information. Rather than directly analyzing the massive amount of text data (often trillions of tokens) used to pre-train the LLM, we focus on the perplexity of the sequence. Perplexity acts as a proxy, indicating how surprised the LLM is by the sequence, essentially how well it aligns with what the LLM expects to see. A parameter λ_b controls the trade-off between relevance/diversity and quality bias. Lower values of λ_b emphasize high-quality demonstrations.
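This quality-bias score does not depend on the query, so it can be pre-computed once per demonstration. A small sketch, reusing the abstract per-token scoring callable from Section 3.2 and an illustrative "Q:/A:" template (both assumptions of ours); the resulting values are the b_i consumed by Algorithm 1 below.

```python
def quality_biases(demonstrations, score_answer_tokens):
    """b_i for each (question, answer) demonstration: mean log-probability of the
    answer tokens given the question alone (empty context), i.e. the negative
    log-perplexity of the answer. Computed once and stored alongside the index."""
    biases = []
    for question, answer in demonstrations:
        logprobs = score_answer_tokens(f"Q: {question}\nA:", answer)
        biases.append(sum(logprobs) / len(logprobs))
    return biases
```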
Algorithm 1: MMR with quality bias

Require: 1 ≤ k ≤ n;  0 ≤ λ_d ≤ 1;  0 ≤ λ_b ≤ 1
Require: Q ∈ R^d                              ▷ query embedding
Require: E_i ∈ R^d for all 1 ≤ i ≤ n          ▷ example question embeddings
Require: b_i ∈ R for all 1 ≤ i ≤ n            ▷ example quality biases
  Q ← Q / ‖Q‖;   E_i ← E_i / ‖E_i‖  for all 1 ≤ i ≤ n
  v_i ← λ_b · Q E_i^⊤ + (1 - λ_b) · b_i  for all 1 ≤ i ≤ n
  C_1 ← argmax_i v_i
  for 2 ≤ s ≤ k do
      m_i ← max_{1 ≤ j < s} E_i E_{C_j}^⊤  for all 1 ≤ i ≤ n
      w_i ← λ_d · v_i - (1 - λ_d) · m_i  for all 1 ≤ i ≤ n
      C_s ← argmax_{i ∉ {C_1, …, C_{s-1}}} w_i
  end for
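A minimal numpy sketch of Algorithm 1 (numpy is what Section 4.2 reports using, but this particular code is our own illustrative rendering, not the authors' released implementation). The default λ values match those fixed in the experiments of Section 4.2.

```python
import numpy as np

def mmr_with_quality_bias(Q, E, b, k, lambda_d=0.75, lambda_b=0.95):
    """Greedy MMR selection with a quality-bias term.
    Q: (d,) query embedding; E: (n, d) candidate question embeddings;
    b: (n,) quality biases; returns the indices of the k selected demonstrations."""
    Q = Q / np.linalg.norm(Q)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)
    # Biased relevance v_i: mix of cosine similarity to the query and quality bias.
    v = lambda_b * (E @ Q) + (1.0 - lambda_b) * b
    selected = [int(np.argmax(v))]
    for _ in range(1, k):
        # m_i: redundancy w.r.t. the demonstrations already selected.
        m = (E @ E[selected].T).max(axis=1)
        w = lambda_d * v - (1.0 - lambda_d) * m
        w[selected] = -np.inf  # never re-select an index
        selected.append(int(np.argmax(w)))
    return selected
```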
4. Experiments

4.1. Experimental Setup

4.1.1. The Dataset Choice

We are interested in a publicly available dataset which meets the following criteria: it should have enough statistical power so that we can resolve small differences in accuracy, ideally with hundreds of examples or more; it doesn't need to have vast amounts of data, as this isn't a typical setting for few-shot learning and the cost of conducting experiments can become burdensome; it should provide correct and incorrect answers so that we can report the classification metrics from Section 3.2.1; and it should be big enough to contain similar examples with partially redundant information, so that the use of diversity can improve the collective information presented to the LLM in a context.

4.1.2. The TruthfulQA Dataset

We chose to conduct our experiments using the TruthfulQA dataset [48], which meets these requirements. The dataset contains 817 distinct examples, which yields a standard error in the range of 1% to 2% for accuracy measures in the range of 90% to 50%. Each example contains a single query, and a variable number of correct and incorrect answers. And by considering each distinct (q, a) pair as a demonstration for the purpose of building a context, the retriever is faced with similar demonstrations, as multiple (q, a) pairs share the same query (on average, the dataset contains 3.5 correct answers for each query).

4.1.3. Generating Demonstration Pairs and Triplets

The dataset is used in three different ways in this paper:

    • 𝒟_MC: this is the dataset as described in [48]. It contains 817 examples, each of which contains a variable number of correct and incorrect answers. The metrics MC1, MC2 and MC3, which can accept an arbitrary number of correct and incorrect answers as inputs, are calculated over this dataset.
    • 𝒟_DPO: this is the set of every distinct (q, a, ā) triple contained in 𝒟_MC. It contains 12,485 such triplets. The DPO metric is calculated over this dataset.
    • 𝒟_ICL: this is the set of every distinct (q, a) pair contained in 𝒟_MC. It contains 2,846 such pairs. This is the set of demonstrations from which a context is drawn. That is, C ⊂ 𝒟_ICL.

When calculating a metric score for an example (q_i, A_i, Ā_i), all demonstrations with the query q_i are left out from the demonstrations available for inclusion in the context. In this manner, the correct answers A_i are not included in the context when the LLM is presented with query q_i.

4.1.4. The Language Models

We conducted our experiments using four noteworthy LLMs: the smaller base text-completion model Mistral-7B-v0.1 (7B parameters) and the larger instruct-fine-tuned mixture-of-experts model Mixtral-8x22B-Instruct-v0.1 (141B parameters) from Mistral [50]; as well as a smaller chat-tuned model Llama-3-8B-chat (8B parameters) and a larger chat-tuned model Llama-3-70B-chat (70B parameters) from Llama (https://llama.meta.com/llama3/).

All four models are open-weights LLMs, meaning their internal parameters are publicly available for scrutiny and potential fine-tuning. These modern models stand out for achieving impressive performance on various tasks despite their relatively compact size. This efficiency makes them an attractive option for resource-constrained environments where deploying colossal models might not be feasible.

4.2. Implementation Details and Ablation Study

We implemented Algorithm 1 and the metrics from Section 3.2.1 in Python. We computed BERT embeddings using the package sentence_transformers (https://sbert.net/), and implemented the retrieval algorithms in numpy. We queried all LLM models using the Together API (https://docs.together.ai/docs/inference-python).

We did not perform hyper-parameter tuning, and fixed the two parameters to λ_d = 0.75 and λ_b = 0.95 in all experiments. We fixed the number of retrieved demonstrations to k = 6, matching the number of few-shot examples in the fixed primer from the TruthfulQA paper [48].

To measure the impact of the separated components of Algorithm 1 (relevance, diversity, and bias), we implemented variants of the retrieval algorithm using only one or two of the three components:

    • Fix: fixed primer examples [48]
    • Bias: pure quality bias [12]
    • Rel: pure semantic similarity [20] (KATE)
    • Rel+Bias: semantic similarity plus quality bias
    • Rel+Div: semantic similarity plus diversity [26]
    • Rel+Div+Bias: Algorithm 1

4.3. Main Results

We present the experimental metrics for the 6 retrievers and the 4 different LLMs in Tables 1 and 2 for the Mistral models, and Tables 3 and 4 for the Llama-3 models.

Our evaluation relies on a combination of metrics to assess the effectiveness of different retrieval strategies for ICL. The normalized DPO metric provides the most valuable insights for each LLM individually but cannot be directly compared across models. The three additional classification metrics allow for objective performance comparisons across models. However, these metrics are susceptible to bias based on token sequence length.
Table 1
Evaluation Metrics with Mistral-7B-v0.1

  Method         DPO       MC1      MC2      MC3
  Fix            -20.40    0.2815   0.2086   0.4285
  Bias           -33.56    0.2411   0.1652   0.3596
  Rel            -12.71    0.4455   0.3664   0.5925
  Rel+Bias       -13.63    0.4602   0.3663   0.5969
  Rel+Div        -12.37    0.5177   0.3930   0.6616
  Rel+Div+Bias   -14.54    0.4676   0.3592   0.6255

Table 2
Evaluation Metrics with Mixtral-8x22B-Instruct-v0.1

  Method         DPO       MC1      MC2      MC3
  Fix            -19.06    0.5202   0.3896   0.6799
  Bias           -27.30    0.4382   0.3096   0.5948
  Rel            -15.17    0.6193   0.5004   0.7616
  Rel+Bias       -14.77    0.6389   0.5080   0.7657
  Rel+Div        -12.67    0.6879   0.5181   0.8092
  Rel+Div+Bias   -13.29    0.6573   0.5071   0.7924

Table 3
Evaluation Metrics with Llama-3-8B-chat

  Method         DPO       MC1      MC2      MC3
  Fix            -22.12    0.3623   0.2709   0.5195
  Bias           -17.55    0.3831   0.2876   0.5729
  Rel            -17.20    0.4920   0.4046   0.6518
  Rel+Bias       -17.14    0.5043   0.4083   0.6570
  Rel+Div        -16.14    0.5520   0.4173   0.7009
  Rel+Div+Bias   -15.80    0.5177   0.4007   0.6841

Table 4
Evaluation Metrics with Llama-3-70B-chat

  Method         DPO       MC1      MC2      MC3
  Fix            -23.05    0.4382   0.3375   0.6184
  Bias           -20.41    0.4455   0.3303   0.6424
  Rel            -19.02    0.5483   0.4482   0.6958
  Rel+Bias       -19.09    0.5532   0.4495   0.7054
  Rel+Div        -13.93    0.6389   0.4834   0.7758
  Rel+Div+Bias   -13.72    0.6022   0.4621   0.7583

The impact of few-shot learning is best seen by comparing the three MC metrics for Rel+Div for a smaller model against Fix for a larger model: the smaller models (7B and 8B parameters) enriched with ICL RAG are essentially matching or outperforming the bigger models (141B and 70B parameters) without ICL RAG.

The results consistently demonstrate that incorporating both relevance and diversity into the retrieval strategy leads to superior performance across all metrics and all LLMs. For all models, and for all metrics, Rel+Div largely outperforms Rel. This finding reinforces the importance of not just retrieving relevant demonstrations but also ensuring a diverse set that maximizes the informative value for the LLM.

Interestingly, the impact of the low-perplexity bias yields contrasting results. For both Mistral models, adding this bias results in a decline in performance on almost all metrics. Conversely, both Llama-3 models exhibit overall improvement with the low-perplexity bias, in particular on the DPO metric. This intriguing observation suggests that LLM-dependent hyper-parameter tuning of λ_b might be necessary to optimize retrieval strategies for specific models. Alternatively, the low-perplexity bias itself may benefit from further refinement. Using an opposite intuition, we may argue that instead of prioritizing demonstrations the LLM already finds likely, introducing demonstrations that surprise the model the most could be beneficial for certain LLMs, potentially maximizing the learning impact of each demonstration. These findings open exciting new avenues for future research in ICL retrieval strategies, creating a parallel with the novelty and serendipity concepts in recommender systems.

4.4. Calibrating Diversity using DPO

Calibrating the amount of diversity in the retrieved set is crucial when optimizing ICL retrieval. We highlight the difficulty of achieving this without our proposed methodology by demonstrating the non-monotonous relationship between the amount of diversity in the retrieved demonstrations and the resulting benefit to the LLM performance. To quantify diversity, we calculate the average cosine similarity between the BERT embeddings of each demonstration pair within the retrieved set. The LLM's benefit is measured using the DPO metric. We then systematically vary λ_d while keeping the LLM fixed (Llama-3-8B-chat), the quality bias fixed (λ_b = 0.95), and the number of retrieved demonstrations constant (k = 6), to observe the empirical correlation between diversity and DPO. The results are visualized in Figure 1. This experiment underscores the importance of a metric measuring the impact of the retrieved context on the LLM, like DPO. Without such a metric, it would be challenging to effectively calibrate λ_d and achieve the optimal balance between relevance and diversity in the retrieved demonstrations.
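A small sketch of this diversity proxy, assuming the embeddings of the k retrieved demonstrations have already been computed and stacked in a (k, d) numpy array:

```python
import numpy as np

def average_pairwise_similarity(embeddings):
    """Mean cosine similarity over all unordered pairs of retrieved-demonstration
    embeddings; lower values indicate a more diverse retrieved set."""
    E = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = E @ E.T
    iu = np.triu_indices(len(E), k=1)  # strict upper triangle: each pair once
    return float(sims[iu].mean())
```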
Figure 1: Diversity Metric and DPO. Non-monotonous relationship between a diversity metric, the average cosine similarity between embedding pairs, and the quality metric DPO. Obtained by varying λ_d with Llama-3-8B-chat and k = 6.

5. Discussion: Real-World RAG Systems

While the importance of diversity in ICL retrieval is paramount, we note that readily available RAG systems rarely implement it directly within the core retrieval algorithm. There are several practical considerations to keep in mind for successful deployment.
rarely implement it directly within the core retrieval algo-                  5.3. Achieving State-of-the-Art Retrieval
rithm. There are several practical considerations to keep in                       with Available Tools
mind for successful deployment.
5.1. Balancing Performance and Efficiency

Retrieval latency is crucial at scale. Exhaustive, brute-force nearest neighbor search is computationally expensive and impractical. Instead, real-world systems leverage efficient indexing techniques and approximate kNN algorithms, as described in [23], to ensure fast retrieval times. This approach is essential for handling large datasets while maintaining responsiveness. To seamlessly integrate with existing retrieval engines and leverage their optimized search capabilities, a retrieval algorithm for RAG must ensure its data is stored in a format compatible with these engines. Commonly indexed data structures include the text itself or low-dimensional dense vector embeddings. By adhering to these indexing practices, RAG systems can effectively leverage the power of existing retrieval engines and achieve fast, scalable retrieval of informative examples.
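As a minimal sketch of this indexing pattern, assuming the FAISS library (one of several possible choices, see Section 5.3) and an HNSW index; the dimensionality, data, and value of k below are placeholders.

```python
import numpy as np
import faiss  # assumption: the faiss-cpu package is installed

d = 384                                                  # embedding dimensionality (placeholder)
demo_vecs = np.random.rand(10_000, d).astype("float32")  # demonstration embeddings (placeholder)
faiss.normalize_L2(demo_vecs)                            # unit-norm vectors: inner product == cosine

# Approximate kNN via an HNSW graph index, instead of exhaustive brute-force search.
index = faiss.IndexHNSWFlat(d, 32, faiss.METRIC_INNER_PRODUCT)
index.add(demo_vecs)

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 6)                     # top-k (k = 6) candidate demonstrations
```

A self-hosted or SaaS vector store (Section 5.3) exposes the same add/search contract behind a network API, so the retrieval logic is unchanged.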
5.2. ICL RAG versus Fine-Tuning

The computational cost of ICL may be evaluated against the cost of fine-tuning. For instance, consider a large LLM like gpt-3.5, whose fine-tuned variant is currently priced 6x higher per input token than the default model⁴. While ICL requires additional input tokens, it is guaranteed to offer cost savings compared to fine-tuning when 𝑘 < 6 with this model.

⁴ In May 2024, the price of gpt-3.5-turbo-0125 is $0.5/M input tokens and $1.5/M output tokens; the fine-tuned price is $3/M input tokens, $6/M output tokens, and $8/M fine-tuning tokens.
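To make this break-even explicit, assume as a back-of-the-envelope simplification (ours, not stated in footnote 4) that each demonstration is roughly the size of the query prompt itself. Per million query-sized input tokens:

```latex
% Prices from footnote 4; \text requires amsmath.
\[
  \underbrace{(k+1) \times \$0.5}_{\text{ICL, default model}}
  \;\le\;
  \underbrace{\$3}_{\text{fine-tuned model}}
  \quad\Longleftrightarrow\quad k \le 5 .
\]
```

Since fine-tuning additionally pays the one-off $8/M training-token cost, ICL with 𝑘 < 6 demonstrations never costs more under this assumption.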
An interesting contrast between ICL and fine-tuning is highlighted in [51]. The paper argues that fine-tuning can be more efficient than few-shot ICL in terms of cost and latency due to the super-linear increase in LLM latency with growing prompt sizes. However, this latency concern is less relevant with inference-throughput-optimized LLM systems built on large GPU clusters, such as commonly used APIs. In these systems, the observed latency remains independent of the prompt size. From a latency perspective, adding ICL demonstrations can be considered free. Additionally, the paper suggests that ICL requires scanning through 100% of the demonstrations at query time. However, this does not hold when employing real retrieval engines with indexing and approximate kNN, which significantly reduce the number of examples scanned during retrieval.
   Furthermore, building a curated database of few-shot demonstrations offers significant advantages to practitioners. These demonstrations are not specific to a single LLM but can be readily utilized with any LLM architecture. This eliminates vendor lock-in and lets practitioners leverage the best LLM for the task at hand without concerns about compatibility. Perhaps even more importantly, a well-maintained database of few-shot examples automatically benefits from the continuous advancements in LLM technology. As newer, more powerful pre-trained LLMs rapidly become available, existing demonstrations can be used to enrich them quickly. This ensures applications leverage the latest capabilities without the need to completely re-engineer workflows. This reusability and adaptability position our few-shot learning engine as a powerful tool for harnessing the ever-evolving potential of LLMs to solve real business challenges.
5.3. Achieving State-of-the-Art Retrieval with Available Tools

Traditional full-text search algorithms like BM25 lead to empirically lower ICL quality. Vector stores offer a more suitable solution for efficient retrieval based on semantic similarity. Numerous vendors provide vector store solutions, and they can be broadly categorized as follows:
- In-Memory vector indexes, such as FAISS and nmslib, offer exceptional speed with minimal setup complexity, but limited scalability for larger datasets. They may not implement in-place addition or deletion of the indexed vectors.
- Self-Hosted vector databases, such as Elasticsearch and Postgres, provide a balance between scalability and performance, at a much larger setup complexity. They typically implement efficient addition and deletion of the indexed vectors.
- SaaS vector stores, such as Pinecone and VertexAI, offer a convenient option with pre-configured infrastructure and almost no setup complexity.
We invite the reader to consult the lists of integrated vector stores of LangChain⁵ and LlamaIndex⁶ for a near-exhaustive overview of available tools.

⁵ https://python.langchain.com/docs/integrations/vectorstores/
⁶ https://docs.llamaindex.ai/en/stable/module_guides/storing/vector_stores/
Due to the complexities of incorporating such rules directly within the retrieval indexing algorithm [52], none of the solutions known to the authors, from any of the above categories, implements diversity or quality biasing of the results. A common heuristic to mitigate this problem is to retrieve a larger set of candidate examples (e.g., double the desired number) and then apply diversity techniques like MMR on the retrieved candidates as a post-processing step.
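As an illustration of this oversample-then-rerank heuristic, a minimal sketch of MMR [27] applied to the candidates returned by a vector store; the function and parameter names are illustrative, and the trade-off parameter below is the standard MMR one, distinct from our 𝜆𝑑.

```python
import numpy as np

def mmr_rerank(query_vec, cand_vecs, k, trade_off=0.5):
    """Greedy Maximal Marginal Relevance over an oversampled candidate pool:
    each step picks the candidate balancing relevance to the query against
    similarity to the demonstrations already selected."""
    q = query_vec / np.linalg.norm(query_vec)
    C = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    relevance = C @ q                                    # cosine relevance to the query
    selected, remaining = [], list(range(len(C)))
    while remaining and len(selected) < k:
        if selected:
            redundancy = (C[remaining] @ C[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(remaining))
        scores = trade_off * relevance[remaining] - (1 - trade_off) * redundancy
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected                                      # indices into cand_vecs

# usage sketch: fetch 2*k candidates from the vector store, then keep k of them:
# keep = mmr_rerank(query_vec, cand_vecs, k=6)
```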
Quality biasing can be indirectly achieved by modifying the indexed embeddings themselves. For instance, reducing the norm of embeddings associated with low-quality content can nudge the retrieval algorithm towards higher-quality examples. An exact implementation in the context of cosine-similarity or dot-product relevance is to add an additional column storing the quality bias, and set the corresponding value to 1 in the embedding of the query.
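A minimal sketch of this extra-column trick, shown here for dot-product relevance; the function names are illustrative.

```python
import numpy as np

def index_with_quality(doc_vecs: np.ndarray, quality_bias: np.ndarray) -> np.ndarray:
    """Append one extra column holding each demonstration's quality bias."""
    return np.hstack([doc_vecs, quality_bias[:, None]])

def query_with_quality(query_vec: np.ndarray) -> np.ndarray:
    """Set the extra coordinate of the query to 1, so that the dot product
    becomes <query, doc> + quality_bias(doc)."""
    return np.append(query_vec, 1.0)
```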
While vector search offers a powerful foundation for practical ICL retrieval, it often lacks native support for essential considerations like diversity or quality bias. These aspects are crucial for ensuring informative and effective retrieval of few-shot learning examples. Existing tools for recommendation engines, on the other hand, often excel in these areas. Recommendation engines natively incorporate rules that promote diversity by recommending a variety of items, or quality bias by prioritizing the most popular products. Future research directions as well as practical systems for ICL retrieval could explore adapting or integrating these well-established techniques from recommender systems to further enhance the effectiveness and sophistication of few-shot learning through information retrieval. State-of-the-art ICL for real-world applications can be achieved by combining the strengths of vector search with the established "diversity-aware" retrieval approaches from recommender systems.

6. Conclusion

This paper explored the critical role of information retrieval in ICL for few-shot learning with Large Language Models. Our work identified key desirable properties for ICL
retrieval systems. We demonstrated that state-of-the-art retrieval in this domain resembles recommender systems under the item cold-start problem. Unlike traditional information retrieval, which prioritizes exact recall, our approach emphasizes discovery by maximizing the collective information gain from retrieved demonstrations. This necessitates balancing query relevance, quality scoring, and diversity to ensure a variety of informative examples are surfaced. Furthermore, we proposed a novel evaluation method for ICL retrieval based on the subsequent performance of the enriched LLM on NLP tasks. This approach eliminates the need for subjective diversity scores, a challenge in information retrieval evaluation. Our findings demonstrate the significant impact of diversity and quality bias in retrieving demonstrations for ICL. By incorporating these well-established techniques from recommender systems, we can unlock the full potential of ICL for few-shot learning and empower LLMs to tackle real-world tasks with limited data.
7. Acknowledgments

To Ching-Wei Chen, for finding the name RAGSys, and to the entire Crossing Minds team for their support during this research.
References

[1] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[2] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, Y. Iwasawa, Large language models are zero-shot reasoners, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 22199–22213. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/8bb0d291acd4acf06ef112099c16f326-Paper-Conference.pdf.
[3] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, in: International Conference on Learning Representations, 2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[4] Y. Luo, Z. Yang, F. Meng, Y. Li, J. Zhou, Y. Zhang, An empirical study of catastrophic forgetting in large language models during continual fine-tuning, ArXiv abs/2308.08747 (2023). URL: https://api.semanticscholar.org/CorpusID:261031244.
[5] L. Berglund, M. Tong, M. Kaufmann, M. Balesni, A. C. Stickland, T. Korbak, O. Evans, The reversal curse: LLMs trained on “A is B” fail to learn “B is A”, ArXiv abs/2309.12288 (2023). URL: https://api.semanticscholar.org/CorpusID:262083829.
[6] Y. Bai, F. Chen, H. Wang, C. Xiong, S. Mei, Transformers as statisticians: Provable in-context learning with in-context algorithm selection, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 57125–57211. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/b2e63e36c57e153b9015fece2352a9f9-Paper-Conference.pdf.
[7] O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, Y. Shoham, In-Context Retrieval-Augmented Language Models, Transactions of the Association for Computational Linguistics 11 (2023) 1316–1331. doi:10.1162/tacl_a_00605.
[8] Q. Dong, L. Li, D. Dai, C. Zheng, Z. Wu, B. Chang, X. Sun, J. Xu, L. Li, Z. Sui, A survey for in-context learning, CoRR abs/2301.00234 (2023).
[9] D. Machlab, R. Battle, LLM In-Context Recall is Prompt Dependent, CoRR abs/2404.08865 (2024). doi:10.48550/arXiv.2404.08865.
[10] C. L. Clarke, M. Kolla, G. V. Cormack, O. Vechtomova, A. Ashkan, S. Büttcher, I. MacKinnon, Novelty and diversity in information retrieval evaluation, in: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 659–666. doi:10.1145/1390334.1390446.
[11] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, M. Jiang, Generate rather than retrieve: Large language models are strong context generators, in: International Conference on Learning Representations (ICLR), 2023.
[12] H. Gonen, S. Iyer, T. Blevins, N. Smith, L. Zettlemoyer, Demystifying prompts in language models via perplexity estimation, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 10136–10148. doi:10.18653/v1/2023.findings-emnlp.679.
[13] S. Agrawal, C. Zhou, M. Lewis, L. Zettlemoyer, M. Ghazvininejad, In-context examples selection for machine translation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 8857–8873. doi:10.18653/v1/2023.findings-acl.564.
[14] S. Robertson, S. Walker, S. Jones, M. M. Hancock-Beaulieu, M. Gatford, Okapi at TREC-3, in: Overview of the Third Text REtrieval Conference (TREC-3), Gaithersburg, MD: NIST, 1995, pp. 109–126. URL: https://www.microsoft.com/en-us/research/publication/okapi-at-trec-3/.
[15] O. Rubin, J. Herzig, J. Berant, Learning to retrieve prompts for in-context learning, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2655–2671. doi:10.18653/v1/2022.naacl-main.191.
[16] L. Wang, N. Yang, X. Huang, B. Jiao, L. Yang, D. Jiang, R. Majumder, F. Wei, Text Embeddings by Weakly-Supervised Contrastive Pre-training, CoRR abs/2212.03533 (2022). doi:10.48550/arXiv.2212.03533.
[17] X. Li, K. Lv, H. Yan, T. Lin, W. Zhu, Y. Ni, G. Xie, X. Wang, X. Qiu, Unified demonstration retriever for in-context learning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4644–4668. doi:10.18653/v1/2023.acl-long.256.
[18] J. Ye, Z. Wu, J. Feng, T. Yu, L. Kong, Compositional exemplars for in-context learning, in: A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, J. Scarlett (Eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, PMLR, 2023, pp. 39818–39833. URL: https://proceedings.mlr.press/v202/ye23c.html.
[19] L. Wang, N. Yang, F. Wei, Learning to retrieve in-context examples for large language models, in: Y. Graham, M. Purver (Eds.), Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp. 1752–1767.
[20] J. Liu, D. Shen, Y. Zhang, B. Dolan, L. Carin, W. Chen, What makes good in-context examples for GPT-3?, in: E. Agirre, M. Apidianaki, I. Vulić (Eds.), Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, Association for Computational Linguistics, Dublin, Ireland and Online, 2022, pp. 100–114. doi:10.18653/v1/2022.deelio-1.10.
[21] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171–4186. doi:10.18653/v1/N19-1423.
[22] O. Khattab, M. Zaharia, ColBERT: Efficient and effective passage search via contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’20, Association for Computing Machinery, New York, NY, USA, 2020, pp. 39–48. doi:10.1145/3397271.3401075.
[23] A. Beygelzimer, S. Kakade, J. Langford, Cover trees for nearest neighbor, in: Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 97–104. doi:10.1145/1143844.1143857.
[24] I. Levy, B. Bogin, J. Berant, Diverse demonstrations improve in-context compositional generalization, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 1401–1422. doi:10.18653/v1/2023.acl-long.78.
[25] H. Su, J. Kasai, C. H. Wu, W. Shi, T. Wang, J. Xin, R. Zhang, M. Ostendorf, L. Zettlemoyer, N. A. Smith, T. Yu, Selective annotation makes language models better few-shot learners, in: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023, OpenReview.net, 2023. URL: https://openreview.net/pdf?id=qY1hlv7gwg.
[26] X. Ye, S. Iyer, A. Celikyilmaz, V. Stoyanov, G. Durrett, R. Pasunuru, Complementary explanations for effective in-context learning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 4469–4484. doi:10.18653/v1/2023.findings-acl.273.
[27] J. Carbonell, J. Goldstein, The use of MMR, diversity-based reranking for reordering documents and producing summaries, in: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’98, Association for Computing Machinery, New York, NY, USA, 1998, pp. 335–336. doi:10.1145/290941.291025.
[28] C. X. Zhai, W. W. Cohen, J. Lafferty, Beyond independent relevance: methods and evaluation metrics for subtopic retrieval, in: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’03, Association for Computing Machinery, New York, NY, USA, 2003, pp. 10–17. doi:10.1145/860435.860440.
[29] R. Agrawal, S. Gollapudi, A. Halverson, S. Ieong, Diversifying search results, in: Proceedings of the Second ACM International Conference on Web Search and Data Mining, WSDM ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 5–14. doi:10.1145/1498759.1498766.
[30] S. Gollapudi, A. Sharma, An axiomatic approach for result diversification, in: Proceedings of the 18th International Conference on World Wide Web, WWW ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 381–390. doi:10.1145/1526709.1526761.
[31] M. R. Vieira, H. L. Razente, M. C. N. Barioni, M. Hadjieleftheriou, D. Srivastava, C. Traina, V. J. Tsotras, On query result diversification, in: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering, ICDE ’11, IEEE Computer Society, USA, 2011, pp. 1163–1174. doi:10.1109/ICDE.2011.5767846.
[32] C. L. Clarke, N. Craswell, I. Soboroff, Overview of the TREC 2009 web track, in: TREC, volume 9, 2009, pp. 20–29.
[33] O. Chapelle, D. Metzler, Y. Zhang, P. Grinspan, Expected reciprocal rank for graded relevance, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM ’09, Association for Computing Machinery, New York, NY, USA, 2009, pp. 621–630. doi:10.1145/1645953.1646033.
[34] C. L. Clarke, M. Kolla, O. Vechtomova, An effectiveness measure for ambiguous and underspecified queries, in: Proceedings of the 2nd International Conference on Theory of Information Retrieval: Advances in Information Retrieval Theory, ICTIR ’09, Springer-Verlag, Berlin, Heidelberg, 2009, pp. 188–199. doi:10.1007/978-3-642-04417-5_17.
[35] C. L. Clarke, N. Craswell, I. Soboroff, A. Ashkan, A comparative analysis of cascade measures for novelty and diversity, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 75–84. doi:10.1145/1935826.1935847.
[36] C.-N. Ziegler, S. M. McNee, J. A. Konstan, G. Lausen, Improving recommendation lists through topic diversification, in: Proceedings of the 14th International Conference on World Wide Web, WWW ’05, Association for Computing Machinery, New York, NY, USA, 2005, pp. 22–32. doi:10.1145/1060745.1060754.
[37] S. Vargas, P. Castells, Rank and relevance in novelty and diversity metrics for recommender systems, in: Proceedings of the Fifth ACM Conference on Recommender Systems, RecSys ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 109–116. doi:10.1145/2043932.2043955.
[38] S. Vargas, Novelty and diversity enhancement and evaluation in recommender systems and information retrieval, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, Association for Computing Machinery, New York, NY, USA, 2014, p. 1281. doi:10.1145/2600428.2610382.
[39] Y. Zheng, C. Gao, L. Chen, D. Jin, Y. Li, DGCN: Diversified recommendation with graph convolutional networks, in: Proceedings of the Web Conference 2021, WWW ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 401–412. doi:10.1145/3442381.3449835.
[40] F. Radlinski, R. Kleinberg, T. Joachims, Learning diverse rankings with multi-armed bandits, in: Proceedings of the 25th International Conference on Machine Learning, ICML ’08, Association for Computing Machinery, New York, NY, USA, 2008, pp. 784–791. doi:10.1145/1390156.1390255.
[41] Y. Zhu, Y. Lan, J. Guo, X. Cheng, S. Niu, Learning for search result diversification, in: Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, Association for Computing Machinery, New York, NY, USA, 2014, pp. 293–302. doi:10.1145/2600428.2609634.
[42] T. Desautels, A. Krause, J. W. Burdick, Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization, Journal of Machine Learning Research 15 (2014) 4053–4103. URL: http://jmlr.org/papers/v15/desautels14a.html.
[43] E. Contal, D. Buffoni, A. Robicquet, N. Vayatis, Parallel Gaussian process optimization with upper confidence bound and pure exploration, in: H. Blockeel, K. Kersting, S. Nijssen, F. Železný (Eds.), Machine Learning and Knowledge Discovery in Databases, Springer Berlin Heidelberg, Berlin, Heidelberg, 2013, pp. 225–240.
[44] A. Ntoulas, M. Najork, M. Manasse, D. Fetterly, Detecting spam web pages through content analysis, in: Proceedings of the 15th International Conference on World Wide Web, WWW ’06, Association for Computing Machinery, New York, NY, USA, 2006, pp. 83–92. doi:10.1145/1135777.1135794.
[45] G. V. Cormack, M. D. Smucker, C. L. A. Clarke, Efficient and effective spam filtering and re-ranking for large web datasets, Information Retrieval 14 (2011) 441–465. doi:10.1007/s10791-011-9162-z.
[46] M. Bendersky, W. B. Croft, Y. Diao, Quality-biased ranking of web documents, in: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, WSDM ’11, Association for Computing Machinery, New York, NY, USA, 2011, pp. 95–104. doi:10.1145/1935826.1935849.
[47] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Computer Networks and ISDN Systems 30 (1998) 107–117. URL: https://www.sciencedirect.com/science/article/pii/S016975529800110X. Proceedings of the Seventh International World Wide Web Conference.
[48] S. C. Lin, J. Hilton, O. Evans, TruthfulQA: Measuring how models mimic human falsehoods, in: Annual Meeting of the Association for Computational Linguistics, 2021. URL: https://api.semanticscholar.org/CorpusID:237532606.
[49] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, C. Finn, Direct preference optimization: Your language model is secretly a reward model, in: A. Oh, T. Neumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 53728–53741. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/a85b405ed65c6477a4fe8302b5e06ce7-Paper-Conference.pdf.
[50] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. Singh Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. Renard Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, W. El Sayed, Mistral 7B, CoRR abs/2310.06825 (2023). doi:10.48550/arXiv.2310.06825.
[51] H. Liu, D. Tam, M. Muqeeth, J. Mohta, T. Huang, M. Bansal, C. A. Raffel, Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning, in: S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, A. Oh (Eds.), Advances in Neural Information Processing Systems, volume 35, Curran Associates, Inc., 2022, pp. 1950–1965. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/0cde695b83bd186c1fd456302888454c-Paper-Conference.pdf.
[52] L. Li, C.-Y. Chan, Efficient indexing for diverse query results, Proceedings of the VLDB Endowment 6 (2013) 745–756. doi:10.14778/2536360.2536373.