<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TeamCMU at Touché: Adversarial Co-Evolution for Advertisement Integration and Detection in Conversational Search</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>To Eun Kim</string-name>
          <email>toeunk@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>João Coelho</string-name>
          <email>jmcoelho@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gbemileke Onilude</string-name>
          <email>gonilude@cs.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jai Singh</string-name>
          <email>jsingh2@andrew.cmu.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Carnegie Mellon University</institution>
          ,
          <addr-line>Pittsburgh, PA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>As conversational search engines increasingly adopt generation-based paradigms powered by Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG), the integration of advertisements into generated responses presents both commercial opportunities and challenges for user experience. Unlike traditional search, where advertisements are clearly delineated, generative systems blur the boundary between informational content and promotional material, raising concerns around transparency and trust. In this work, we propose a modular pipeline for advertisement management in RAG-based conversational systems, consisting of an ad-rewriter for seamless ad integration and a robust ad-classifier for detection. We leverage synthetic data to train high-performing classifiers, which are then used to guide two complementary ad-integration strategies: supervised fine-tuning of the ad-rewriter and a best-of-N sampling approach that selects the least detectable ad-integrated response among multiple candidates. Our evaluation focuses on two core questions: the effectiveness of ad classifiers in detecting diverse ad integration strategies, and the training methods that best support coherent, minimally intrusive ad insertion. Experimental results show that our ad-classifier, trained on synthetic advertisement data inspired by marketing strategies and enhanced through curriculum learning, achieves robust detection performance. Additionally, we demonstrate that classifier-guided optimization, through both fine-tuning and best-of-N sampling, significantly improves ad stealth, enabling more seamless integration. These findings contribute an adversarial co-evolution framework for developing more sophisticated ad-aware generative search systems and robust ad classifiers.</p>
      </abstract>
      <kwd-group>
        <kwd>Conversational Search</kwd>
        <kwd>Retrieval-Augmented Generation</kwd>
        <kwd>LLM</kwd>
        <kwd>Advertisement</kwd>
        <kwd>Classification</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Conversational search engines powered by Large Language Models (LLMs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Retrieval-Augmented
Generation (RAG) [
        <xref ref-type="bibr" rid="ref2 ref29 ref3">2, 3</xref>
        ] are increasingly integrating advertisements into responses to enhance
monetization. As these systems shift toward generation-driven paradigms, the inclusion of advertising content
in LLM outputs has become both a timely and underexplored area, especially as state-of-the-art industry
systems move toward ad-supported deployments [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ]. Given that advertising has historically served
as the primary revenue stream for search engines [6], this transition raises critical questions about
how to embed ads in generated content without compromising response utility or user trust. Unlike
traditional search interfaces, where sponsored content is explicitly demarcated, generative systems risk
blurring the line between organic information and promotional material, potentially obfuscating ad
presence in the absence of clear markers [7].
      </p>
      <p>Despite its significance for the future of commercial LLM systems, advertisement integration and
transparency in LLM-generated responses remain insufficiently studied. While prior work has
introduced auction frameworks for generative ads and investigated methods for detecting LLM-generated
advertisements [8, 9], comprehensive generation-side strategies remain limited. In addition,
foundational insights from marketing research, such as the distinctions between explicit vs. implicit advertising
and soft vs. hard selling [10, 11, 12], have yet to be meaningfully incorporated into generative model
design. It also remains unclear whether existing ad-detection systems [13, 14], originally developed
for traditional media, can generalize to the diverse and increasingly subtle forms of ads integrated in
LLM-generated content. Furthermore, recent efforts that rely on naive ad insertion strategies [9] may
risk compromising response quality and user experience.</p>
      <p>[Figure 1: Pipeline overview. Given a query and retrieved docs, an LLM for QA produces a response; the Ad-Rewriter (an LLM for rewriting, trained with feedback from the Ad-Classifier) produces a rewritten response with an advertisement for a given item.]</p>
      <p>To address these challenges, we participate in both sub-tasks (generation and classification) of the
Advertisement in Retrieval-Augmented Generation shared task at the Touché lab [15], CLEF 2025, where
our systems were submitted via the TIRA platform [16]. We propose a modular pipeline (Figure 1)
for advertisement management in RAG-based conversational systems. Our architecture consists of a
standalone RAG-based QA System, followed by an Ad-Rewriter that integrates advertisements into
the generated responses, and an Ad-Classifier trained to detect them.</p>
      <p>The Ad-Rewriter is implemented in three variants: a zero-shot version prompted directly for ad
integration, a supervised fine-tuning (SFT) variant trained with feedback from a robust ad-classifier,
and a zero-shot version enhanced with best-of-N sampling, where the final response is selected from
multiple candidates based on the classifier’s ad probability scores.</p>
      <p>In this adversarial co-evolution setup, the robustness of the Ad-Classifier is critical to the
effectiveness of the Ad-Rewriter. To enhance the robustness of the classifier, we augment the provided
dataset with carefully curated synthetic data, including hard positive and hard negative instances. The
enhanced classifier is then used as a feedback mechanism to guide the optimization of the Ad-Rewriter
across different implementation strategies.</p>
      <p>Our study explores the feasibility of this framework through two central research questions:
• RQ1: How can we train an Ad-Classifier that achieves robust classification performance
across diverse types of ad-integrated responses?
• RQ2: How can we develop an Ad-Rewriter that enables seamless ad integration while
minimizing the likelihood of ad detection?</p>
      <p>Through our experiments, we demonstrate the effectiveness of training the Ad-Classifier using
hard positive and hard negative synthetic data to improve robustness. We also show that training the
Ad-Rewriter in an adversarial setup (i.e., using feedback from the robust classifier) leads to more
effective and less detectable ad integration. We publicly release our code for further research.1</p>
      <p>1https://github.com/kimdanny/TeamCMU-AdRAG</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Open Domain QA</title>
        <p>In this section, we survey related work on open domain QA and emerging strategies for advertising in
LLM-based search applications.</p>
        <p>
          Early QA systems relied on extracting answer spans from retrieved documents and machine reading
comprehension models [17]. Models such as DrQA [18] set the foundation for modern QA by improving
retrieval and answer span prediction. Large language models (LLMs) revolutionized QA by enabling
zero-shot and few-shot learning [19]. While these models provide high-quality answers, challenges
remain in evaluation, due to large but semantically sound answers that differ from the gold label [20],
and other LLM-related problems such as hallucination [21]. Retrieval-Augmented Generation (RAG) [
          <xref ref-type="bibr" rid="ref2 ref29">2</xref>
          ],
a specialized method for generation as part of a retrieval-enhanced machine learning strategy [
          <xref ref-type="bibr" rid="ref3 ref6">3, 22, 23</xref>
          ]
combines retrieval mechanisms with LLMs to improve factual correctness and response relevance,
especially in knowledge-intensive tasks such as QA and fact-checking [
          <xref ref-type="bibr" rid="ref7">24</xref>
          ].
        </p>
        <p>
          The MS-MARCO dataset [
          <xref ref-type="bibr" rid="ref8">25</xref>
          ] has driven significant web QA advancements, with state-of-the-art
methods employing dense retrieval [
          <xref ref-type="bibr" rid="ref10 ref9">26, 27</xref>
          ] and contrastive learning to optimize both response quality
and retrieval accuracy. Recent hybrid architectures that strategically combine LLMs with dense retrievers
have demonstrated measurable improvements over standalone GPT-3.5 or LLaMA-7B prompting [
          <xref ref-type="bibr" rid="ref11">28</xref>
          ].
Similar approaches [
          <xref ref-type="bibr" rid="ref12 ref13">29, 30</xref>
          ] have also obtained state-of-the-art results on other benchmarks [
          <xref ref-type="bibr" rid="ref14 ref15">31, 32</xref>
          ],
showing the versatility of retrieval-enhanced language models across domains.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Advertisement in the Era of LLMs</title>
        <p>
          Search systems are actively employing LLMs to display search results [
          <xref ref-type="bibr" rid="ref16">33</xref>
          ]. In the era of LLMs, revenue
generation through online advertising within LLM-generated responses is gaining attention. Accordingly,
researchers are starting to investigate auctions and advertising strategies in the context of LLM-based
search systems. Dubey et al. [8] studied an auction framework ensuring higher bidders receive greater
ad placement in LLM outputs. Inspired by them, Hajiaghayi et al. [
          <xref ref-type="bibr" rid="ref17">34</xref>
          ] examined advertisement auctions
with a focus on RAG, by considering both relevance (from the retriever) and bids when allocating ads
within generated responses. Soumalias et al. [
          <xref ref-type="bibr" rid="ref18">35</xref>
          ] proposed an auction framework where advertisers
influence LLM responses through reinforcement learning from human feedback. The detection of
generated ad content is also a growing research area. Schmidt et al. [9] introduced the Webis Generated
Native Ads 2024 dataset, focusing on identifying LLM-generated ads.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Problem Definition</title>
      <p>In this section, we restate the two sub-tasks for the Advertisement in Retrieval-Augmented Generation
shared task at the Touché lab [15] with more details.</p>
      <sec id="sec-3-1">
        <title>3.1. Sub-task 1: Ad-Augmented QA</title>
        <p>
          In Sub-task 1, the QA system is provided with an open-domain web query, a set of relevant passages,
and a set of external items to be advertised. The objective is to build a system that leverages the
passages to answer the query while incorporating an advertisement for one of the provided items. If
n items are given, the system should generate n independent answers, each integrating a distinct
item. These advertisements should be seamlessly woven into the response and difficult to detect as ads.
Additionally, the system must be capable of generating standard, non-advertising answers when no
items are provided, ensuring those responses do not exhibit ad-like characteristics.
Touché-25 Advertisement-in-Retrieval-Augmented-Generation (Ad-RAG) Dataset The
Ad-RAG dataset comprises approximately 3,000 queries, for which systems are required to generate both
ad-augmented and standard responses.2 The queries are typically short phrases that describe a topic or
product (e.g., “good triceps workout equipment", “corvette z06"). For half of the queries, no advertisements
are needed, but just an informative response. For the remaining queries, each requires the inclusion of,
on average, two advertisements. Information about the items to be advertised is provided in the form of
short descriptions averaging six words. Each query is supported by up to 100 passages retrieved from
the MS MARCO v2.1 dataset [
          <xref ref-type="bibr" rid="ref8">25</xref>
          ] using BM25 retrieval [
          <xref ref-type="bibr" rid="ref19">36</xref>
          ]. These queries in the Ad-RAG dataset were
obtained from the Webis-Ads dataset [9], which will be used in Sub-task 2.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Sub-task 2: Ad Detection</title>
        <p>The objective of Sub-task 2 is to determine whether a given response contains an embedded
advertisement. Specifically, the system receives a response as input and performs binary classification to
predict whether the response includes an advertisement or is purely informative. An effective classifier
should be robust to subtle ad insertions, ensuring that even seamlessly integrated advertisements can
be accurately detected.</p>
        <p>Webis Generated Native Ads 2024 (Webis-Ads) Dataset The Webis-Ads dataset [9] was created
to train an ad-blocker system for conversational search engines. This dataset comprises approximately
7,500 queries, along with responses generated by Microsoft Copilot and YouChat. For half of these
queries, a second version of the response was produced by prompting GPT-4 to insert advertisements
without altering the original informative content. As a result, the dataset includes 7,500 responses
without ads and 3,800 responses with ads.</p>
        <p>
          Notably, the data in this dataset is relatively easy to fit. In our preliminary experiments, a simple
DeBERTa-based text classifier [
          <xref ref-type="bibr" rid="ref20">37</xref>
          ] achieved around 98% accuracy on held-out data, suggesting that
the naive ad-insertion strategy used to construct the dataset results in easily detectable patterns. This
observation motivates the need for more challenging training data. To address this, we construct
synthetic hard positive and hard negative examples, which we discuss in detail in the following sections.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Pipeline Overview</title>
        <p>In this section, we describe our methodology for building a more robust Ad-Classifier and leveraging
it as a feedback mechanism to improve the effectiveness of the Ad-Rewriter.</p>
        <p>Figure 1 presents an overview of our system. Given a user query q, the retrieval-augmented QA System
retrieves relevant passages and generates an initial response r without any advertisements.3 When a
specific item i is provided for advertisement, the Ad-Rewriter module 𝒢 modifies the base response
r to seamlessly incorporate the promotional content, yielding a rewritten response r′.</p>
        <p>
          The Ad-Rewriter can operate in several modes: it can be 1) prompted to produce a rewritten
response directly, 2) guided by best-of-N sampling [
          <xref ref-type="bibr" rid="ref21 ref22">38, 39</xref>
          ] using feedback from a trained Ad-Classifier,
or 3) fine-tuned through supervised learning using training data generated with classifier feedback.
        </p>
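        <p>As a minimal sketch of this dataflow (the component names and signatures below are ours, not from the released code), the pipeline in Figure 1 can be written as three pluggable callables:</p>

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# Hypothetical stand-ins: in our system the QA component is a RAG LLM,
# the rewriter is an instruction-tuned LLM, and the classifier is DeBERTa.
QASystem = Callable[[str, List[str]], str]      # (query, passages) -> response
AdRewriter = Callable[[str, str, str], str]     # (query, response, item) -> rewritten response
AdClassifier = Callable[[str, str], float]      # (query, response) -> P(contains ad)

@dataclass
class AdRAGPipeline:
    qa_system: QASystem
    ad_rewriter: AdRewriter
    ad_classifier: AdClassifier

    def respond(self, query: str, passages: List[str], item: Optional[str] = None) -> str:
        response = self.qa_system(query, passages)
        if item is None:
            # Half of the Ad-RAG queries require a plain informative answer.
            return response
        return self.ad_rewriter(query, response, item)
```

        <p>With the components abstracted this way, the three rewriting modes (zero-shot, SFT, best-of-N) only change how the ad_rewriter callable is implemented; the classifier is reused both as a feedback signal and as an evaluator.</p>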
      </sec>
      <sec id="sec-4-2">
        <title>4.2. QA System</title>
        <p>The QA System is responsible for generating contextually relevant responses to open-domain queries
prior to any advertisement integration. While a typical QA pipeline involves both retrieval and generation,
in the Touché competition setting, retrieved passages are provided.3 Thus, we directly proceed to
the generation step using the top-k passages.</p>
        <sec id="sec-4-2-1">
          <title>2https://zenodo.org/records/14699130</title>
          <p>3In the Touché competition, retrieved documents are provided. As a result, we do not evaluate retrieval effectiveness and
simply use the top-k passages.</p>
          <p>Given the top-k retrieved passages D, we prompt a language model ℱ to synthesize a coherent and
self-contained response.4 Prompts are constructed using a prompt generation function 𝒫(q, D), which
is designed to elicit cohesive and informative responses from the model. The base QA output, r, serves
as the input to the Ad-Rewriter module when advertisement integration is required. The prompt used for
response generation can be found in Appendix B.</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ad-Classifier</title>
        <p>The Ad-Classifier ℋ is formulated as a standard binary text classification task: given a query q
and its corresponding response r, the model predicts whether the response contains an advertisement.
To build increasingly robust classifiers, we incrementally expand the training data with progressively
harder examples derived from multiple sources.</p>
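        <p>Our DeBERTa classifiers follow the standard transformers sequence-classification fine-tuning recipe. As a runnable illustration of the same binary formulation, the sketch below packs query and response into one input and trains a lightweight bag-of-words stand-in; the [SEP] packing, helper names, and toy examples are ours, not the actual training data:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def format_example(query: str, response: str) -> str:
    # Pack (q, r) into a single input sequence; with DeBERTa this
    # would be tokenizer(query, response) with a [SEP] token.
    return f"{query} [SEP] {response}"

# Toy training pairs: label 1 = response contains an advertisement.
train = [
    ("best running shoes", "Interval training builds endurance.", 0),
    ("best running shoes", "Try AeroStride X1, the shoe runners trust!", 1),
    ("coffee brewing tips", "Use a 1:16 coffee-to-water ratio.", 0),
    ("coffee brewing tips", "BrewMaster Pro makes every cup perfect.", 1),
]
texts = [format_example(q, r) for q, r, _ in train]
labels = [y for _, _, y in train]

# Lightweight stand-in for DeBERTa: tf-idf features + logistic regression.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit(texts, labels)
```

        <p>The design point is that the classifier sees the query as context, so it can learn that the same sentence may be promotional for one query and merely informative for another.</p>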
        <p>
          The initial version (V0.0) was trained solely on the Webis-Ads dataset [9]; a simple DeBERTa-based
classifier [
          <xref ref-type="bibr" rid="ref20">37</xref>
          ] achieved strong performance on held-out data.5 However, we found that it failed to
generalize to more naturally embedded or implicit forms of advertising.
        </p>
        <p>To address this limitation, we introduced two complementary types of synthetic training data. The
first, the NaiveSynthetic dataset, involves prompting an LLM to insert fictional advertisements into baseline
QA responses without constraints, resulting in a wide variety of superficially embedded ads. With this
data, we trained two classifiers: V0.1 and V0.2.</p>
        <p>
          The second, StructuredSynthetic dataset, incorporates real-world product entities sourced from
Wikipedia. Drawing on advertising and marketing literature [10, 11, 12], we extract descriptive features
and generate two categories of training examples: (i) hard positives, where the product is promoted
through indirect or implicit language, and (ii) hard negatives, which are neutral informative passages
about the product with no advertising intent. With the StructuredSynthetic dataset, we train successive
versions of the classifier using combinations of the Webis-Ads, NaiveSynthetic, and StructuredSynthetic
datasets: V0.3, V0.4, and V0.5. In the last two versions (V0.4, V0.5), we incorporate curriculum learning
[
          <xref ref-type="bibr" rid="ref23">40</xref>
          ] based on classification difficulty, as estimated by the output logits from an earlier classifier (V0.1).
This training strategy produces classifiers with improved generalization and robustness to diverse ad
integration strategies, including those grounded in effective marketing practices.
        </p>
        <p>4.3.1. Creation of NaiveSynthetic Data
NaiveSynthetic data generation follows the original Webis-Ads dataset approach, i.e., given an answer
without an advertisement, we prompt an LLM to inject an ad. The generation prompts include no
specific item; rather, the LLM is instructed to generate an advertisement for an item that fits the context,
which may result in the creation of fictional products. To promote diversity, we use a combination
of 5 different LLMs: GPT-4o, Gemma-2-9B-it6, LLaMA-3.1-8B-Instruct7, Qwen2.5-7B-Instruct8, and
Mistral-7B-Instruct.9 Moreover, we devise 12 different prompts for ad insertion, targeting various
advertising strategies (e.g., direct, indirect, explicit, implicit, hard-sell and soft-sell). An example prompt
for NaiveSynthetic query generation can be found in Appendix D.1.
        </p>
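        <p>The easy-to-hard ordering used in V0.4 and V0.5 can be sketched as follows; the margin-based difficulty estimate below is an illustrative proxy for the logit-based estimate from V0.1, and the function names are ours:</p>

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (classifier input text, label)

def curriculum_order(examples: List[Example],
                     ad_prob: Callable[[str], float]) -> List[Example]:
    """Order training examples easy-to-hard for curriculum learning.

    Difficulty is estimated from an earlier classifier (V0.1 in our setup,
    via its output logits); here we use the margin |p - 0.5| of a predicted
    ad probability, treating confident predictions as easy.
    """
    def margin(ex: Example) -> float:
        return abs(ad_prob(ex[0]) - 0.5)
    # High-margin (easy) examples first, low-margin (hard) ones later.
    return sorted(examples, key=margin, reverse=True)
```

        <p>Training then proceeds over the reordered data, so the classifier sees blatant insertions before the subtle, marketing-inspired ones.</p>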
        <p>Using this setup, we trained two versions of the classifier. Both V0.1 and V0.2 leverage the same set
of LLMs. However, V0.1 uses a single prompt for data generation, while V0.2 randomly samples from
the full pool of 12 prompts. The HuggingFace model pages contain the prompts used for insertion.10
4Qwen2.5-7B-Instruct is used as the language model in our experiments.
5V0.0: https://huggingface.co/jmvcoelho/ad-classifier-v0.0
6https://huggingface.co/google/gemma-2-9b-it
7https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
8https://huggingface.co/Qwen/Qwen2.5-7B-Instruct
9https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1
10V0.1: https://huggingface.co/jmvcoelho/ad-classifier-v0.1; V0.2: https://huggingface.co/jmvcoelho/ad-classifier-v0.2</p>
        <sec id="sec-4-3-1">
          <title>4.3.2. Creation of StructuredSynthetic Data</title>
          <p>We generate the StructuredSynthetic dataset through the following steps:</p>
          <p>1. Systematic collection of product entities from Wikipedia.</p>
          <p>We manually select Wikipedia "infobox" namespaces likely to contain product-related pages that
can be advertised (e.g., ‘product’, ‘brand’, ‘camera’, ‘automobile’), collecting a total of 25 infoboxes.
To ensure that each page within these namespaces refers to a real product (e.g., iPhone) rather
than a general concept (e.g., Mobile phone), we filter pages using Wikidata properties that strongly
indicate "product-ness" (e.g., P162 – producer, P593 – model number). This allows us to curate a
set of non-fictional product entities along with their associated Wikipedia content.</p>
          <p>For each verified entity, we retrieve its release year, rank the entities by recency, and retain only
those released in or after the year 2000.</p>
          <p>2. Wikipedia article summarization and extraction of key promotional features.</p>
          <p>For each selected entity, we prompt GPT-4o to summarize the corresponding Wikipedia
page and extract key features and qualities suitable for promotional purposes.</p>
          <p>3. Creation of hard positives (indirect and implicit advertisements) and hard negatives (factual,
non-promotional texts).</p>
          <p>Drawing on insights from advertising literature, we generate two types of data using GPT-4o:</p>
          <p>• Hard positives: indirect and implicit advertisements.</p>
          <p>• Hard negatives: factual and informative descriptions without promotional intent.</p>
          <p>The list of infoboxes, Wikidata properties, and the prompts used for hard positive and negative query
generation can be found in Appendix D.2.</p>
          <p>
            Using this setup, we trained three versions of the classifier. In V0.3, the classifier was trained
on a combined dataset consisting of Webis-Ads, NaiveSynthetic, and StructuredSynthetic instances.
In V0.4, we applied curriculum learning [
            <xref ref-type="bibr" rid="ref23">40</xref>
            ], where instance difficulty was determined by the V0.1
model. Finally, V0.5 used the same training regime as V0.4, but balanced the NaiveSynthetic and
StructuredSynthetic instances by upsampling the StructuredSynthetic dataset. Further details are available
on the corresponding HuggingFace model pages.11
          </p>
        </sec>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Ad-Rewriter</title>
        <p>The Ad-Rewriter module 𝒢 takes as input a query q, an ad-free QA response r, and a product or
service to be advertised i. These elements are combined into a prompt, denoted as 𝒫(q, r, i), which
conditions the rewriting process. The goal of the Ad-Rewriter is to produce a fluent, contextually
relevant, and minimally intrusive ad-integrated version of the original response.</p>
        <p>Method 1: Zero-shot rewriting Our initial implementation relies on prompt-based zero-shot
rewriting, exploring advertisement strategies from the marketing literature, such as direct vs. indirect
and explicit vs. implicit advertising. The prompt used for rewriting can be found in Appendix C.
Method 2: Supervised fine-tuning-based rewriting To move beyond prompt engineering, we
construct a training dataset using our synthetic query generation pipeline. For each (q, r, i) triplet,
we generate five candidate ad-integrated responses: r_j ∼ 𝒢(𝒫(q, r, i)) for j ∈ 1..5, where 𝒢 can be
one of various LLMs with different temperatures. Each rewritten response r_j is then scored by the ad-classifier
ℋ, which estimates the likelihood that the response contains an advertisement: ℋ(r_j).
11V0.3: https://huggingface.co/teknology/ad-classifier-v0.3; V0.4: https://huggingface.co/teknology/ad-classifier-v0.4; V0.5:
https://huggingface.co/teknology/ad-classifier-v0.5</p>
        <p>We adopt a supervised fine-tuning (SFT) regime in which the objective is to train the model to prefer
completions with lower predicted ad probability. Formally, we define the optimal response r* and the
negative log-likelihood loss ℒ as:</p>
        <p>r* = argmin_{j∈{1,...,5}} ℋ(r_j),   (1)</p>
        <p>ℒ = − log P(r* | 𝒫(q, r, i)).   (2)</p>
        <p>Method 3: Zero-shot rewriting with Best-of-N sampling Due to the computational cost of
fine-tuning a language model, we apply a best-of-N sampling strategy in the zero-shot method. We set
our generation temperature above zero, and the model produces a diverse set of N candidate rewrites
for each input. Each candidate response is then evaluated using a trained ad-classifier, which assigns
an ad probability score. We select the response with the lowest predicted ad probability as the final
rewritten output. In our experiments, we use N = 10.</p>
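        <p>A minimal sketch of Method 3, with the generator and classifier abstracted as callables (the names are ours):</p>

```python
from typing import Callable, List

def best_of_n_rewrite(generate: Callable[[], str],
                      ad_prob: Callable[[str], float],
                      n: int = 10) -> str:
    """Sample n candidate rewrites (generation temperature > 0 so the
    candidates differ) and return the one the trained Ad-Classifier
    assigns the lowest ad probability."""
    candidates: List[str] = [generate() for _ in range(n)]
    return min(candidates, key=ad_prob)
```

        <p>Unlike SFT, this requires no rewriter training: the classifier is consulted only at inference time, trading N generations per query for a less detectable output.</p>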
        <p>This feedback loop, where classifiers guide the training of rewriters, forms the backbone of our
approach, which aims to result in more natural ad integration within the generated responses.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>
          For Sub-task 1, systems are evaluated adversarially using a hidden classifier maintained by the task
organizers. The primary evaluation metric is advertisement detection accuracy, where a lower score
indicates more successful ad integration, i.e., the advertisement is more difficult to detect. As the
official classifier is unavailable during development, we employ a series of in-house Ad-Classifiers
(V0.0–V0.5) as proxy evaluators for iterative tuning and model comparison. Specifically, we compare
detection accuracy across the three different Ad-Rewriter strategies (zero-shot, zero-shot with BoN,
and supervised fine-tuning-based rewriting) to assess the detectability of inserted ads. This setup
also serves to evaluate the performance of the Ad-Classifier for Sub-task 2 using standard binary
classification accuracy. For the Ad-Classifier, we use a pre-trained DeBERTa model [
          <xref ref-type="bibr" rid="ref20">37</xref>
          ], and for the QA
System and Ad-Rewriter, we use Qwen2.5-7B-Instruct [
          <xref ref-type="bibr" rid="ref24">41</xref>
          ].
        </p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Results</title>
        <p>RQ1: How can we train an Ad-Classifier that achieves robust classification performance
across diverse types of ad-integrated responses? Before proceeding to ad rewriting, we first
identify which classifiers perform well across a range of ad-integration strategies. To evaluate this,
we test six versions of classifiers on responses generated using three ad rewriting approaches: a pure
zero-shot method, a fine-tuned rewriter, and a zero-shot method with best-of-N (BoN) sampling. Each
approach is evaluated under two different generation temperature settings. Recall that these rewriters
modify the base QA response using various advertising techniques (e.g., indirect promotion through
storytelling), producing a diverse set of ad-integrated outputs.</p>
        <p>Table 1 presents the ad classification accuracy for each classifier version across the different rewriting
approaches. When comparing V0.0, which is trained without synthetic data or specialized training
methods such as curriculum learning, to the other versions, we observe a substantial improvement in
ad detection accuracy. This indicates that synthetic data inspired by marketing strategies can effectively
enhance classifier performance. In particular, versions V0.1 and V0.4 achieve consistently high accuracy
across both temperature settings, underscoring the value of combining diverse synthetic training data
with curriculum learning.</p>
        <p>[Table 1: Ad classification accuracy per classifier version, across rewriting methods (Zero-Shot, SFT, Zero-Shot-BoN) and generation temperatures.]</p>
        <p>RQ2: How can we develop an Ad-Rewriter that enables seamless ad integration while
minimizing the likelihood of ad detection? In addition to the basic zero-shot ad rewriter, we
develop two enhanced versions: the SFT and Zero-Shot-BoN-based rewriter. Both approaches rely
on a classifier to provide feedback, either during inference in the BoN setting or during training data
construction for the SFT approach. Based on the classification performance observed on outputs from
the pure Zero-Shot rewriter, we select V0.4 as the most robust classifier and use it both for scoring
candidate responses in the best-of-N sampling strategy and for fine-tuning the rewriter.</p>
        <p>
          As shown in Table 1, both the SFT and Zero-Shot-BoN-based rewriters lead to a substantial reduction
in ad detection accuracy across all classifier versions when compared to the basic Zero-Shot rewriter.
These results demonstrate that incorporating classifier feedback, either at inference time or during
training, is effective in achieving more seamless ad integration. In fact, these methods are able to
deceive robust classifiers in approximately 50 percent of cases. A paired Wilcoxon signed-rank test [
          <xref ref-type="bibr" rid="ref25">42</xref>
          ]
confirms that this reduction is statistically significant for all classifiers, indicating that classifier-guided
generation can reliably produce responses that are more difficult to detect as advertisements.
Shared Task Results For Sub-Task 1 (ad generation), the primary evaluation metric designated by
the lab organizers is recall; lower recall indicates more subtle ad insertion, and thus a better-performing
system. In addition to automatic metrics, the organizers manually reviewed 100 responses per submission
to validate classifier labels and ensure the legitimacy of each run. We submitted our system based on
the Zero-Shot-BoN approach, using the Qwen2.5-7B-Instruct model [
          <xref ref-type="bibr" rid="ref24">41</xref>
          ] for both the QA System and
the Ad-Rewriter, and employing the V0.4 classifier for best-of-N sampling. Out of six submissions,
including a baseline, our system ranked 3rd by recall (see Table 2). If F1-Score had been the ranking
criterion, our system would have placed 2nd.
        </p>
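        <p>The significance test can be reproduced in miniature. The sketch below implements an exact one-sided Wilcoxon signed-rank test in pure Python (equivalent in spirit to scipy.stats.wilcoxon); the paired detection scores are illustrative stand-ins, not the paper's data, and the implementation assumes no tied or zero differences.</p>

```python
from itertools import product

def wilcoxon_signed_rank(x, y):
    """Exact one-sided Wilcoxon signed-rank test (H1: x > y), pure Python.
    Enumerates all 2^n sign assignments, so keep n small.
    Assumes no tied or zero differences (average-ranking omitted)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    # rank the absolute differences from smallest (rank 1) to largest
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    for r, i in enumerate(order, start=1):
        ranks[i] = float(r)
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    # null distribution: each difference is equally likely to be +/-
    count, total = 0, 0
    for signs in product((1, -1), repeat=n):
        w = sum(r for s, r in zip(signs, ranks) if s > 0)
        count += (w >= w_plus)
        total += 1
    return w_plus, count / total

# illustrative paired classifier ad-probabilities on the same queries:
# baseline Zero-Shot rewriter vs. classifier-guided BoN rewriter
baseline = [0.91, 0.88, 0.95, 0.84, 0.90, 0.87, 0.93, 0.89]
bon      = [0.52, 0.48, 0.61, 0.41, 0.54, 0.46, 0.58, 0.51]
w, p = wilcoxon_signed_rank(baseline, bon)
```

With all eight differences positive, the one-sided exact p-value is 1/256, well below 0.05.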
        <p>For Sub-Task 2 (ad classification), the primary evaluation metric set by the lab organizers is F1-Score.
We submitted our V0.4 classifier for evaluation. Among 16 runs, including baselines, our classifier
ranked 3rd in terms of F1-Score (see Table 3).</p>
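      <p>To make the two ranking criteria concrete, the snippet below computes recall and F1 from confusion counts for the "ad" class; the counts are invented for illustration.</p>

```python
def recall_f1(tp: int, fp: int, fn: int):
    """Recall and F1 for the positive ('ad') class from confusion counts.
    For Sub-Task 1, lower detector recall means stealthier ad insertion;
    for Sub-Task 2, higher F1 means a better classifier."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return recall, f1

# e.g. a detector that finds 40 of 100 ad responses, with 10 false alarms
r, f1 = recall_f1(tp=40, fp=10, fn=60)
```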
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Methodological Implications Our findings across the two research questions and the strong
evaluation results highlight the effectiveness of using carefully curated synthetic data and classifier-guided
training in managing ad integration within generative systems. For ad detection, we show that classifiers
trained on synthetic data inspired by marketing strategies—particularly when combined with curriculum
learning—achieve significantly more robust performance across a diverse set of ad-integration styles.
On the generation side, incorporating classifier feedback, either through best-of-N sampling or during
supervised fine-tuning, leads to ad-integrated responses that are substantially harder to detect. These
results suggest that adversarial training dynamics between rewriters and classifiers can be effective in
shaping both components for more seamless and harder-to-detect ad insertion.</p>
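      <p>A minimal sketch of the curriculum schedule mentioned above, assuming a per-example difficulty score is available (e.g., one minus the current classifier's confidence on the true label); the example texts, product names, and scores below are hypothetical.</p>

```python
def curriculum_batches(examples, difficulty, batch_size):
    """Order training examples from easy to hard and yield mini-batches.
    `difficulty` maps an example to a score; the schedule here is a plain
    sorted sweep, the simplest form of a curriculum."""
    ordered = sorted(examples, key=difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]

# toy (text, label, difficulty) triples: blatant ads are easy (low score),
# subtle testimonial-style hard positives are hard (high score)
data = [
    ("Buy SparkleClean now, 50% off!", 1, 0.1),
    ("I switched to SparkleClean last spring; my sink looks great.", 1, 0.8),
    ("Vinegar is a common household descaling agent.", 0, 0.2),
    ("Some reviewers note QuietCup grinders run quieter than blade models.", 1, 0.6),
]
batches = list(curriculum_batches(data, difficulty=lambda ex: ex[2], batch_size=2))
```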
      <p>Among these generation strategies, we also observe that responses generated at lower temperatures
tend to yield lower ad detection rates. One possible explanation for this pattern is that the model produces
more coherent and well-structured responses at lower temperatures [19], allowing ad insertions to
blend more naturally with the surrounding content. In contrast, higher temperatures introduce greater
variability, which can result in phrasing or transitions that are less contextually aligned, making the
presence of advertisements more noticeable to the classifier.</p>
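      <p>The mechanism behind this temperature effect is temperature-scaled softmax sampling: dividing the next-token logits by a temperature below 1 sharpens the distribution toward the top token, while a temperature above 1 flattens it. A small self-contained illustration with made-up logits:</p>

```python
import math

def softmax_with_temperature(logits, temperature):
    """Temperature-scaled softmax over next-token logits.
    Lower temperature concentrates probability on the top token, one
    mechanism behind the more uniform, coherent phrasing observed at
    low sampling temperatures."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

logits = [2.0, 1.0, 0.5]
cold = softmax_with_temperature(logits, temperature=0.3)  # sharper
hot = softmax_with_temperature(logits, temperature=1.5)   # flatter
```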
      <p>Limitations A key constraint of this study is the reliance on synthetic data generated by LLMs
necessitates more rigorous validation and incorporation of more challenging scenarios to ensure
robustness. The binary nature of the current advertisement classifier may also fall short in fully
capturing nuanced or context-dependent advertisements. Additionally, the metric of ad detectability
is grounded in classifier performance. However, human users may perceive ads differently, and what
evades a model may still be obvious to a human reader.</p>
      <p>
        Ethical Considerations This work reveals that advertisements can be seamlessly integrated into
LLM-generated responses in ways that are difficult even for strong classifiers to detect. While this
demonstrates the technical feasibility of subtle ad insertion, it also underscores the importance of
accompanying such capabilities with appropriate transparency controls. Without explicit labeling
or disclosure mechanisms, users may be unknowingly exposed to persuasive content, potentially
diminishing trust in conversational systems [7]. Moreover, false positives from ad classifiers risk
misclassifying informative content, which could disadvantage legitimate content providers. Ethical
challenges are amplified when ads appear in sensitive contexts, such as mental health or
emergency-related queries, or when cultural stereotypes and provider-side exposure imbalances propagate through
system components. These findings highlight the need for careful design choices and deployment
safeguards to ensure that stealthy ad integration does not come at the cost of user agency or marketplace
fairness [
        <xref ref-type="bibr" rid="ref26">43</xref>
        ].
      </p>
      <p>
        Future Direction Future work can address current limitations through comprehensive validation
of synthetic data using approaches like system rank correlation and linguistic analysis [
        <xref ref-type="bibr" rid="ref27">44</xref>
        ]. Beyond
any technical improvements, future implementations can explore more realistic scenarios involving
retrieval based on dynamic ad bidding information [
        <xref ref-type="bibr" rid="ref17">34</xref>
          ]. Moreover, evaluating and ensuring
provider-side fairness will be essential for maintaining a balanced and sustainable advertisement ecosystem,
demanding rigorous assessment of both provider-consumer dynamics and systemic biases [
        <xref ref-type="bibr" rid="ref28">45</xref>
        ].
      </p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We show that fine-tuning an advertisement classifier using synthetic query data inspired by marketing
strategies, along with progressively harder detection examples, significantly enhances its robustness
and effectiveness in identifying seamlessly integrated ads. Notably, we find that feedback from such a
well-trained classifier, whether used during test-time sampling or as part of the training objective, can be
leveraged to guide ad generators that strategically evade detection, successfully deceiving even strong
classifiers. This adversarial dynamic underscores both the potential and the challenge of developing
reliable and transparent advertising in LLM-based search systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank Professor Eric Nyberg, Professor Teruko Mitamura, and Kimihiro Hasegawa for their valuable
feedback during the development of our system.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
<p>During the preparation of this work, the authors used generative AI in order to identify and correct
grammatical errors and typos. The authors reviewed and edited the content as needed and take full
responsibility for the publication’s content.</p>
      <p>[6] J. Gleason, A. Koeninger, D. Hu, J. Teurn, Y. Bart, S. Knight, R. E. Robertson, C. Wilson, Search
engine revenue from navigational and brand advertising, in: Proceedings of the International
AAAI Conference on Web and Social Media, volume 18, 2024, pp. 488–501.
[7] I. Zelch, M. Hagen, M. Potthast, A user study on the acceptance of native advertising in generative
IR, in: Proceedings of the 2024 Conference on Human Information Interaction and Retrieval,
CHIIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 142–152. URL:
https://doi.org/10.1145/3627508.3638316. doi:10.1145/3627508.3638316.
[8] A. Dubey, Z. Feng, R. Kidambi, A. Mehta, D. Wang, Auctions with LLM summaries, in: Proceedings
of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24,
Association for Computing Machinery, New York, NY, USA, 2024, pp. 713–722. URL:
https://doi.org/10.1145/3637528.3672022. doi:10.1145/3637528.3672022.
[9] S. Schmidt, I. Zelch, J. Bevendorf, B. Stein, M. Hagen, M. Potthast, Detecting generated native
ads in conversational search, in: Companion Proceedings of the ACM Web Conference 2024,
WWW ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 722–725. URL:
https://doi.org/10.1145/3589335.3651489. doi:10.1145/3589335.3651489.
[10] Y. Yi, Direct and indirect approaches to advertising persuasion: Which is more effective?, Journal
of Business Research 20 (1990) 279–291.
[11] S. Shapiro, H. S. Krishnan, Memory-based measures for assessing advertising effects: A comparison
of explicit and implicit memory effects, Journal of Advertising 30 (2001) 1–13.
[12] S. Okazaki, B. Mueller, C. R. Taylor, Measuring soft-sell versus hard-sell advertising appeals,
Journal of Advertising 39 (2010) 5–20.
[13] E. L. Post, C. N. Sekharan, Comparative study and evaluation of online ad-blockers, in: 2015 2nd
International Conference on Information Science and Security (ICISS), IEEE, 2015, pp. 1–4.
[14] B. Shiller, J. Waldfogel, J. Ryan, The effect of ad blocking on website traffic and quality, The RAND
Journal of Economics 49 (2018) 43–63.
[15] J. Kiesel, Ç. Çöltekin, M. Gohsen, S. Heineking, M. Heinrich, M. Fröbe, T. Hagen, M. Aliannejadi,
T. Erjavec, M. Hagen, M. Kopp, N. Ljubešić, K. Meden, N. Mirzakhmedova, V. Morkevičius, H. Scells,
I. Zelch, M. Potthast, B. Stein, Overview of Touché 2025: Argumentation Systems, in: Experimental
IR Meets Multilinguality, Multimodality, and Interaction. 16th International Conference of the
CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg
New York, 2025.
[16] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast,
Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot,
F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances
in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes
in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241.
doi:10.1007/978-3-031-28241-6_20.
[17] M. Seo, A. Kembhavi, A. Farhadi, H. Hajishirzi, Bidirectional attention flow for machine
comprehension, in: International Conference on Learning Representations, 2017. URL:
https://openreview.net/forum?id=HJ0UKP9ge.
[18] D. Chen, Reading Wikipedia to answer open-domain questions, arXiv preprint arXiv:1704.00051
(2017).
[19] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information
Processing Systems 33 (2020) 1877–1901.
[20] E. Kamalloo, N. Dziri, C. L. A. Clarke, D. Rafiei, Evaluating Open-Domain Question Answering in
the Era of Large Language Models, in: Proceedings of the 61st Annual Meeting of the Association
for Computational Linguistics, 2023.
[21] J. Li, X. Cheng, X. Zhao, J. Nie, J. Wen, HaluEval: A Large-Scale Hallucination Evaluation
Benchmark for Large Language Models, in: Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, 2023.
[22] H. Zamani, F. Diaz, M. Dehghani, D. Metzler, M. Bendersky, Retrieval-enhanced machine learning,
in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development
in Information Retrieval, 2022, pp. 2875–2886.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Computational Cost</title>
      <p>All inference experiments, except for ablations on models larger than 7B parameters, were conducted
on a single L40 GPU. For efficiency, we distributed experiments across 2 to 32 GPUs depending on
availability. Training was performed using 8 L40 GPUs for supervised fine-tuning on 7B parameter
models, while ad-classifier training required only a single GPU.</p>
    </sec>
    <sec id="sec-11">
      <title>B. Prompt for QA System response generation</title>
      <sec id="sec-11-1">
        <title>Answer the following web query, given the context.</title>
        <p>Context: {context}.</p>
        <p>Query: {query}.</p>
<p>Reply only with a "well formed answer", i.e., human-like and cohesive. The answer should
be written in a style where advertisements are easy to be integrated. Do not include any
advertisements or explicit advertisement space, just write with the stylistic choice in mind.</p>
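        <p>Instantiating the template is a plain string-formatting step; the sketch below fills the placeholders with hypothetical inputs (the helper name build_qa_prompt is ours, not part of the system):</p>

```python
# Condensed version of the QA System prompt shown above.
QA_PROMPT = """Answer the following web query, given the context.

Context: {context}.

Query: {query}.

Reply only with a "well formed answer", i.e., human-like and cohesive."""

def build_qa_prompt(context: str, query: str) -> str:
    """Fill the template's placeholders before sending it to the LLM."""
    return QA_PROMPT.format(context=context, query=query)

prompt = build_qa_prompt(
    context="Passages about trail running shoes.",
    query="how to choose trail running shoes",
)
```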
      </sec>
    </sec>
    <sec id="sec-12">
      <title>C. Prompt for Ad-Rewriter rewritten response generation</title>
      <p>Consider this existing response to a conversational search query, and modify it to include an
implicit advertisement.</p>
      <p>The advertisement should:
- Mention this {type}: {item};
- Highlight these qualities: {qualities};
- Blend naturally with the original response;
- Not appear overly promotional;
- Be hard to detect by current textual ad-detection systems.</p>
      <sec id="sec-12-1">
        <title>Original query: {query} Original response: {response} Return only the modified response with the advertisement integrated.</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>D. Details on Synthetic Data Generation</title>
      <sec id="sec-13-1">
        <title>D.1. Naive Synthetic Data</title>
        <p>Example prompt for NaiveSynthetic query generation:</p>
        <p>Task: You are tasked with subtly integrating an advertisement into a search query response. The goal is
to make the advertisement feel natural and helpful within the context of the response, not disruptive or
overtly promotional. First, you should define the item to advertise. You should keep in mind the context
of the query and original response. Consider the following advertisement qualities when choosing the
product:
* Product/Service Category.
* Specific Advertisement Theme.
* Key Selling Points/Qualities to Emphasize.</p>
        <p>Instructions for Integration:
1. Contextual Relevance is Key: The advertisement must feel relevant to the search query and the existing
response text. Think about how the advertised product/service genuinely relates to the user’s needs
implied by the query.
2. Subtle and Natural Language: Use language that blends seamlessly with the original response. Avoid
overly promotional or sales-oriented phrasing. Focus on informative and helpful language.
3. Focus on Benefits, Not Just Features: Instead of directly listing qualities, rephrase them as benefits the
user might gain in relation to the search query and response context.
4. Strategic Placement: Choose the most natural and impactful location(s) within the response to subtly
introduce the advertisement. This might involve:
* Briefly modifying an existing sentence to subtly incorporate the advertisement.
* Adding a short, relevant phrase or clause to an existing sentence.
* In rare cases, adding a very short, contextual sentence (only if it feels truly natural).
5. Maintain Original Meaning: Ensure the core meaning and factual accuracy of the original response
remain unchanged. The advertisement should enhance, not distort, the original information.
6. Review for Subtlety: Before returning the response, critically evaluate if the advertisement feels
genuinely subtle and integrated. If it feels forced or obvious, refine your approach.</p>
        <p>Output: Return **only** the modified response with the subtly integrated advertisement.</p>
        <sec id="sec-13-1-1">
          <title>Search Query: {query} Original Response: {response} Modified Response:</title>
          <p>The following versions of Ad-Classifier were trained using the NaiveSynthetic data:</p>
          <p>• V0.1: multiple models, single data generation prompt.
– https://huggingface.co/jmvcoelho/ad-classifier-v0.1
• V0.2: multiple models, multiple data generation prompts.
– https://huggingface.co/jmvcoelho/ad-classifier-v0.2</p>
          <p>The HuggingFace model pages contain the prompts used for insertion.</p>
        </sec>
      </sec>
      <sec id="sec-13-2">
        <title>D.2. Structured Synthetic Data</title>
<p>D.2.1. List of infoboxes selected</p>
        <p>product, brand, automobile, motorcycle, tractor, calculator, computing device, keyboard, software, camera, mobile phone, night vision device, synthesizer, tool, watch, pinball, toy, film, book, Asian comic series, comic, musical, furniture, video game, drug.</p>
        <p>D.2.3. Hard positive creation prompt</p>
        <p>Your task is to generate an indirect and implicit advertisement for a {infobox_name} named {product_name}.
The advertisement
* must not indicate that it is an advertisement or promotional content.
* must include the {infobox_name} name, {product_name}.
* must avoid any direct call to action.
* must be brief and contained within one paragraph.
* may present the {infobox_name} as part of natural, conversational, or informational content, or as a
synthetic personal experience that could occur in real life.
* may use testimonial or storytelling styles that describe the experiences of people with {page_title}.
* may include detailed, scientific/research-backed statements.</p>
        <sec id="sec-13-2-1">
          <title>The following information about {page_title} may be useful for your writing: {summary} The advertisement can implicitly promote some of the following aspects of {page_title}: {key_features} Write only the advertisement without any explanations.</title>
          <p>D.2.4. Hard negative creation prompt</p>
          <p>Your task is to write a concise, informative text about a {infobox_name} named {product_name}.
The text:
* must focus on delivering factual information.
* must not include expressions of preference or favoritism toward {page_title} and should focus solely on
the facts.
* must include the name {product_name} at least once.
* can mention other {infobox_name}s related to {page_title} to provide comprehensive information about
the subject.</p>
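          <p>The two templates above yield labeled training pairs for the classifier. A minimal sketch, with truncated template strings ("..." elides the remaining instructions shown above), a hypothetical product name, and an echo function standing in for the LLM call:</p>

```python
# Truncated versions of the two prompt templates above.
HARD_POSITIVE = ("Your task is to generate an indirect and implicit advertisement "
                 "for a {infobox_name} named {product_name}. ...")
HARD_NEGATIVE = ("Your task is to write a concise, informative text about "
                 "a {infobox_name} named {product_name}. ...")

def make_classifier_pair(infobox_name, product_name, generate):
    """Build one hard positive (label 1 = ad) and one hard negative
    (label 0 = non-ad) for classifier training. `generate` is a stand-in
    for the LLM call that produces text from a filled prompt."""
    pos = generate(HARD_POSITIVE.format(infobox_name=infobox_name,
                                        product_name=product_name))
    neg = generate(HARD_NEGATIVE.format(infobox_name=infobox_name,
                                        product_name=product_name))
    return [(pos, 1), (neg, 0)]

# stand-in "LLM" that just echoes the prompt, for demonstration
pairs = make_classifier_pair("camera", "LumixFan X100", generate=lambda p: p)
```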
        </sec>
        <sec id="sec-13-2-2">
          <title>The following information about {page_title} may be useful for your writing: {summary} Write only the informative text without any explanations.</title>
          <p>The following versions of Ad-Classifier were trained with the StructuredSynthetic data:
• V0.4:</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Radlinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <article-title>A theoretical framework for conversational search</article-title>
          ,
          <source>in: Proceedings of the 2017 conference on conference human information interaction and retrieval</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>126</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>9459</fpage>
          -
          <lpage>9474</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drozdov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <article-title>Retrieval-enhanced machine learning: Synthesis and opportunities</article-title>
          ,
          <source>arXiv preprint arXiv:2407.12982</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Perplexity</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <article-title>Why we're experimenting with advertising</article-title>
          ,
          <year>2024</year>
          . URL: https://www.perplexity.ai/hub/blog/why-we-re-experimenting-with-advertising, accessed: 2025-04-30.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] OpenAI,
          <article-title>Improved shopping results from chatgpt search</article-title>
          ,
          <year>2025</year>
          . URL: https://help.openai.com/en/articles/11146633-improved-shopping-results-from-chatgpt-search, accessed: 2025-04-30.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drozdov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Salemi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zamani</surname>
          </string-name>
          ,
          <article-title>Retrieval-enhanced machine learning: Synthesis and opportunities</article-title>
          ,
          <source>in: Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>299</fpage>
          -
          <lpage>302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yazdani</surname>
          </string-name>
          , N. De Cao,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Maillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Plachouras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          , S. Riedel,
          <article-title>KILT: a benchmark for knowledge intensive language tasks</article-title>
          , in: K.
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Rumshisky</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Hakkani-Tur</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <string-name>
            <surname>Beltagy</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Bethard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Cotterell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Chakraborty</surname>
          </string-name>
          , Y. Zhou (Eds.),
          <source>Proceedings of the</source>
          <year>2021</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , Online,
          <year>2021</year>
          , pp.
          <fpage>2523</fpage>
          -
          <lpage>2544</lpage>
          . URL: https://aclanthology.org/2021.naacl-main.200. doi:10.18653/v1/2021.naacl-main.200.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>P.</given-names>
            <surname>Bajaj</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Campos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Craswell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>McNamara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , et al.,
          <article-title>Ms marco: A human generated machine reading comprehension dataset</article-title>
          ,
          <source>arXiv preprint arXiv:1611.09268</source>
          (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Oguz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Edunov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          , W.-t. Yih,
          <article-title>Dense passage retrieval for open-domain question answering</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>6769</fpage>
          -
          <lpage>6781</lpage>
          . URL: https://aclanthology.org/2020.emnlp-main.550/. doi:10.18653/v1/2020.emnlp-main.550.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Tang</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Bennett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Overwijk</surname>
          </string-name>
          ,
          <source>Approximate Nearest Neighbor Negative Contrastive Learning for Dense Text Retrieval</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dou</surname>
          </string-name>
          ,
          <article-title>Unigen: A unified generative framework for retrieval and question answering with large language models</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jackson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Retrieve what you need: A mutual learning framework for open-domain question answering</article-title>
          ,
          <source>Trans. Assoc. Comput. Linguistics</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Carenini</surname>
          </string-name>
          ,
          <article-title>ARR: Question Answering with Large Language Models via Analyzing, Retrieving, and Reasoning</article-title>
          ,
          <source>arXiv preprint arXiv:2502.04689</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Weld</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <article-title>TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension</article-title>
          ,
          <source>in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kwiatkowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Palomaki</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Redfield</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Collins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Alberti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Epstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kelcey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <article-title>Natural questions: A benchmark for question answering research</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>W.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <article-title>Is ChatGPT good at search? Investigating large language models as re-ranking agents</article-title>
          ,
          <source>in: The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          . URL: https://openreview.net/forum?id=3Q6LON8y2I.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hajiaghayi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lahaie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Rezaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>Ad auctions for LLMs via retrieval augmented generation</article-title>
          ,
          <source>in: The Thirty-eighth Annual Conference on Neural Information Processing Systems</source>
          ,
          <year>2024</year>
          . URL: https://openreview.net/forum?id=Ujo8V7iXmR.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>E.</given-names>
            <surname>Soumalias</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Curry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Seuken</surname>
          </string-name>
          ,
          <article-title>Truthful aggregation of llms with an application to online advertising</article-title>
          ,
          <source>arXiv preprint arXiv:2405.05905</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>S. E.</given-names>
            <surname>Robertson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zaragoza</surname>
          </string-name>
          ,
          <article-title>The probabilistic relevance framework: BM25 and beyond</article-title>
          ,
          <source>Found. Trends Inf. Retr</source>
          . (
          <year>2009</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing</article-title>
          ,
          <source>arXiv preprint arXiv:2111.09543</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>N.</given-names>
            <surname>Stiennon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <article-title>Learning to summarize from human feedback</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Nakano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Balaji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Saunders</surname>
          </string-name>
          , et al.,
          <article-title>Webgpt: Browser-assisted question-answering with human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2112.09332</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Louradour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Collobert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <article-title>Curriculum learning</article-title>
          ,
          <source>in: Proceedings of the 26th annual international conference on machine learning</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>41</fpage>
          -
          <lpage>48</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Dang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Men</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Ge</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fan</surname>
          </string-name>
          ,
          <article-title>Qwen2 technical report</article-title>
          ,
          <source>arXiv preprint arXiv:2407.10671</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>R. F.</given-names>
            <surname>Woolson</surname>
          </string-name>
          ,
          <article-title>Wilcoxon signed-rank test</article-title>
          ,
          <source>Encyclopedia of biostatistics 8</source>
          (
          <year>2005</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mehrotra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McInerney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouchard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lalmas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <article-title>Towards a fair marketplace: Counterfactual evaluation of the trade-off between relevance, fairness &amp; satisfaction in recommendation systems</article-title>
          ,
          <source>in: Proceedings of the 27th acm international conference on information and knowledge management</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>2243</fpage>
          -
          <lpage>2251</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Arguello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mitra</surname>
          </string-name>
          ,
          <article-title>Tip of the tongue query elicitation for simulated evaluation</article-title>
          ,
          <source>arXiv preprint arXiv:2502.17776</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>T. E.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Diaz</surname>
          </string-name>
          ,
          <article-title>Towards fair rag: On the impact of fair ranking in retrieval-augmented generation</article-title>
          ,
          <source>arXiv preprint arXiv:2409.11598</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <article-title>List of Wikidata properties selected: P50 (author), P86 (composer), P110 (illustrator), P123 (publisher), P162 (producer), P170 (creator), P176 (manufacturer), P178 (developer), P179 (product series), P287 (designed by), P593 (model number), P676 (lyricist), P943 (programmer), P3640 (National Drug Code), P4087 (MyAnimeList manga ID), P8731 (AniList manga ID), P9618 (AlternativeTo software ID), P9897 (App Store age rating), and P12969 (game designer)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>