<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>JU-NLP at Touché: Covert Advertisement in Conversational AI: Generation and Detection Strategies</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Arka Dutta</string-name>
          <email>arka08652@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Agrik Majumdar</string-name>
          <email>agrik.maz33@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sombrata Biswas</string-name>
          <email>sombrata.biswas@gmail.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dipankar Das</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaji Bandyopadhyay</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Jadavpur University</institution>
          ,
          <addr-line>Kolkata, 700032</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper proposes a comprehensive framework for the generation of covert advertisements within Conversational AI systems, along with robust techniques for their detection. It explores how subtle promotional content can be crafted within AI-generated responses and introduces methods to identify and mitigate such covert advertising strategies. For generation (Sub-Task 1), we propose a novel framework that leverages user context and query intent to produce contextually relevant advertisements. We employ advanced prompting strategies and curate paired training data to fine-tune a large language model (LLM) for enhanced stealthiness. For detection (Sub-Task 2), we explore two effective strategies: a fine-tuned CrossEncoder (all-mpnet-base-v2) for direct classification, and a prompt-based reformulation using a fine-tuned DeBERTa-v3-base model. Both approaches rely solely on the response text, ensuring practicality for real-world deployment. Experimental results show high effectiveness in both tasks, achieving a precision of 1.0 and recall of 0.71 for ad generation, and F1-scores ranging from 0.99 to 1.00 for ad detection. These results underscore the potential of our methods to balance persuasive communication with transparency in conversational AI.</p>
      </abstract>
      <kwd-group>
        <kwd>llm fine-tuning</kwd>
        <kwd>stealth advertisement</kwd>
        <kwd>binary classification</kwd>
        <kwd>sentence transformers</kwd>
        <kwd>cross encoder</kwd>
        <kwd>retrieval-augmented generation</kwd>
        <kwd>context-aware generation</kwd>
        <kwd>prompt-based learning</kwd>
        <kwd>DeBERTa</kwd>
        <kwd>transformer models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The detection and generation of covert advertisements in conversational AI is an emerging challenge
that intersects language understanding, marketing ethics, and human-computer interaction. As
conversational agents and retrieval-augmented generation (RAG) systems become increasingly integrated into
user-facing platforms, there is a growing concern around the insertion of native advertisements that
may subtly influence user behavior without clear disclosure. The ability to generate such responses in
a contextually relevant yet stealthy manner, as well as to accurately detect them post-generation, is
essential for preserving trust and transparency in AI-mediated communication [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The 2025 edition of the Touché shared task addresses this concern by introducing two sub-tasks [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]:
• Sub-Task 1: Given a user query and supporting document context, generate a relevant response
that optionally includes a covert advertisement for a given item or service.
• Sub-Task 2: Given a system-generated response, classify whether it contains a covert (native)
advertisement.
      </p>
      <p>
        In our participation as Team JU-NLP, we propose tailored solutions for both sub-tasks. For
Sub-Task 1, we construct a high-quality training dataset by leveraging a large language model (LLM) as
a judge to evaluate a multiset of responses generated by a pretrained LLM across iterative prompts,
scoring them based on advertisement detectability. These preference-labeled pairs are then used to
fine-tune an LLM (e.g., the Mistral-7B model [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) using the ORPO (Odds Ratio Preference Optimization)
training framework [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], which encourages the generation of contextually coherent yet covert
promotional content. For Sub-Task 2, we explore two complementary detection strategies: (1) a fine-tuned
CrossEncoder (all-mpnet-base-v2) that performs binary classification using only the response text,
and (2) a prompt-based reformulation of the task utilizing a fine-tuned DeBERTa-v3-base model,
aimed at improving detection performance through instruction-style inputs and enhanced contextual
understanding [
        <xref ref-type="bibr" rid="ref5">5, 6</xref>
        ].
      </p>
      <p>This paper is structured as follows: Section 2 presents our method for advertisement generation
using a preference-tuned LLM. Section 3 describes the techniques employed for advertisement
detection. Finally, Section 4 concludes the paper with a discussion of the results and potential directions
for future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Covert Ad Insertion in Conversational AI</title>
      <p>In this work, we introduce a novel framework for generating covert advertisements that are seamlessly
integrated into contextually relevant responses. The proposed system embeds promotional content
related to a product or service in a manner that preserves the coherence and informativeness of the
response while minimizing the likelihood of detection as an advertisement.</p>
      <sec id="sec-2-1">
        <title>2.1. Objective</title>
        <p>The goal of this task is to generate fluent, contextually grounded responses that incorporate promotional
content in a subtle and undetectable manner. The system should address the user’s query while
seamlessly embedding product or service mentions without disrupting coherence or raising suspicion
of advertising intent.</p>
        <sec id="sec-2-1-1">
          <title>Inputs: The system is provided with:</title>
          <p>• A natural-language user query q, representing an information need.
• An optional item i (e.g., product, service, or brand) to be promoted.
• A set of associated attributes A for the item (e.g., features, benefits, or keywords).
• A document index D = {d_1, d_2, . . . , d_N} containing external knowledge passages for retrieval.
Outputs: The system is expected to return:
• A generated response ŷ that:
– is relevant to the query q,
– is grounded in retrieved content C ⊆ D,
– subtly incorporates the item i and its attributes A without overt advertisement cues,
– minimizes detectability as an advertisement.
• A supporting document set S of top-k retrieved segments used during generation, provided for
transparency and verification.</p>
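The input/output contract above can be sketched as a minimal Python interface. This is an illustrative sketch only; the field names and types are our own, not part of the shared-task specification:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class AdGenInput:
    """Inputs to the covert-ad generation system (Sub-Task 1)."""
    query: str                                            # user query
    item: Optional[str] = None                            # optional item to promote
    attributes: List[str] = field(default_factory=list)   # item attributes/qualities
    documents: List[str] = field(default_factory=list)    # document index for retrieval

@dataclass
class AdGenOutput:
    """Outputs the system is expected to return."""
    response: str                                         # generated response
    supporting_docs: List[str] = field(default_factory=list)  # top-k retrieved segments

# Example instance
inp = AdGenInput(query="What laptop is good for travel?",
                 item="AeroBook X",
                 attributes=["lightweight", "long battery life"])
```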
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Contribution</title>
        <p>
          We adopt a hybrid framework that combines Retrieval-Augmented Generation (RAG) [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] with
Cache-Augmented Generation (CAG) [7] to provide rich contextual grounding for the language model based
on the user query. In the first stage, relevant document segments are retrieved using a BM25-based
retrieval module to supply external knowledge. In the second stage, we fine-tune a Mistral-7B model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
using preference pairs generated by a large language model acting as a judge, which scores responses
based on the detectability of embedded advertisements. This preference-based supervision enables the
model to learn subtle promotional strategies. The resulting fine-tuned model generates responses that
are both contextually coherent and covertly promotional, making the advertisements difficult to detect.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Background</title>
        <p>
          In recent years, Retrieval-Augmented Generation (RAG) has emerged as a robust framework for
enhancing the factual grounding and contextual precision of large language models (LLMs). By retrieving
relevant external document segments during inference, RAG-based systems can generate responses
that are not only fluent but also anchored in real-world information, making them particularly effective
for tasks demanding domain-specific or query-sensitive outputs [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ].
        </p>
        <p>However, integrating covert advertisements into such generated responses introduces distinct
challenges. Unlike traditional advertising approaches—which often employ overt markers, stylistic shifts,
or explicit endorsements—covert advertisements require the seamless embedding of promotional
content within natural language. These insertions must remain undetectable to both human readers and
automated detection systems while preserving topicality and coherence.</p>
        <p>To address this, we adopt a preference-based fine-tuning strategy that trains the model to
distinguish and favor subtly promotional responses over explicitly advertorial ones. We employ a large
language model as an automated judge to evaluate candidate response pairs, scoring them based on the
detectability of the embedded advertisement. Each pair consists of one overt and one covertly phrased
advertisement, which are then labeled with preferences indicating the more discreet option.</p>
        <p>
          These labeled preferences are used to fine-tune a Mistral-7B model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] under the Odds Ratio Preference
Optimization (ORPO) framework [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. ORPO enables the model to learn fine-grained distinctions in
promotional phrasing, encouraging generation that aligns with strategic communication goals such as
persuasive stealth marketing.
        </p>
        <p>The resulting system is capable of producing high-quality, context-aware responses that incorporate
product or service mentions in a subtle and natural manner. This enables the generation of content
that fulfills both informational and marketing intents without disrupting user experience or triggering
advertisement detection heuristics. Overall, our approach represents a significant step forward in
training LLMs for applications requiring nuanced, goal-aligned generation such as covert advertising.</p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. System Overview</title>
        <p>2.4.1. Data Preprocessing
We utilize the Webis Generated Native Ads 2024 dataset [8], which comprises user queries, associated
items (e.g., products or services), and corresponding item-specific attributes (e.g., features or qualities).
To facilitate training for covert ad generation, we extract and normalize the relevant fields: queries,
items, item qualities, and response texts.</p>
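The field extraction and normalization described above can be sketched as follows. This is a simplified illustration, assuming JSON-like records; the field names ("query", "item", "item_qualities", "response") are placeholders and may differ from the Webis dataset's actual schema:

```python
import re

def normalize_record(rec: dict) -> dict:
    """Lowercase text fields and normalize punctuation/whitespace
    so that retrieval and generation modules see consistent input."""
    def clean(text: str) -> str:
        text = text.lower().strip()
        text = re.sub(r"[“”]", '"', text)   # normalize curly double quotes
        text = re.sub(r"[‘’]", "'", text)   # normalize curly single quotes
        text = re.sub(r"\s+", " ", text)    # collapse runs of whitespace
        return text

    return {k: clean(v) if isinstance(v, str) else v for k, v in rec.items()}

rec = normalize_record({"query": "  Best “budget” Phones? ", "label": 1})
```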
        <p>For the preference-based fine-tuning setup, we construct a dedicated training set of preference-labeled
response pairs. This involves generating multiple candidate responses per query-item pair using the
base LLM within the RAG+CAG framework. These candidates are then scored for advertisement
detectability using a large language model acting as an automated judge. Each pair consists of one
subtly promotional and one more explicitly advertorial response, with the less detectable response
labeled as preferred. These preference pairs serve as supervision signals for fine-tuning under the ORPO
paradigm.</p>
        <p>
          All textual inputs are tokenized using the Mistral tokenizer with a maximum sequence length of 8000.
Standard preprocessing steps such as lowercasing, punctuation normalization, and dynamic truncation
are applied to ensure consistency across retrieval and generation modules.
2.4.2. Preparing the Preference-Labeled Pairs for Training
To enable effective preference-based fine-tuning, we construct a dataset of response pairs labeled
according to their advertisement detectability. The preparation process is illustrated in Figure 1 and
involves several key steps:
• Context Assembly: For each user query and associated item (with its qualities), we assemble a
context using both Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation
(CAG) mechanisms. This ensures that the model has access to relevant background information
and item-specific details.
• Candidate Generation: The Mistral-7B model, conditioned on the assembled context, generates
multiple candidate responses. These responses vary in how overtly or subtly they incorporate
the promotional content.
• Detectability Scoring: An LLM-based judge (also a Mistral-7B model) evaluates each candidate
response, assigning a detectability score that reflects how easily the advertisement can be identified
within the text. The LLM judge is used for its ability to capture contextual cues and subtle
language patterns that traditional classifiers often miss, enabling it to effectively distinguish
between naturally integrated and overt advertisements.
• Preference Pair Construction: For each query-context, we select pairs of responses where one is
less detectable (more covert) and the other is more easily identified as an advertisement. The less
detectable response is labeled as preferred. These preference-labeled pairs form the training data
for the Odds Ratio Preference Optimization (ORPO) fine-tuning process.
• Iterative Loop: The process is iterative—feedback from the LLM judge can be used to refine
generation strategies, encouraging the model to produce increasingly subtle promotional content
over successive rounds.
        </p>
        <p>This workflow ensures that the training data explicitly encodes the distinction between overt and
covert advertisement strategies, allowing the fine-tuned model to internalize nuanced preferences for
stealthy ad insertion.</p>
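The preference-pair construction step above can be sketched as follows: given judge-scored candidates, the lowest-scoring (most covert) response is preferred over the highest-scoring (most overt) one. A simplified sketch, not the exact submission code:

```python
def build_preference_pair(candidates):
    """candidates: list of (response_text, detectability_score) tuples,
    where the judge scores 1 for clearly detectable promotion and 0 for
    undetectable promotion. Returns an ORPO-style training example with
    the most covert response as 'chosen' and the most overt as 'rejected'."""
    ranked = sorted(candidates, key=lambda c: c[1])  # ascending detectability
    chosen, rejected = ranked[0], ranked[-1]
    return {"chosen": chosen[0], "rejected": rejected[0]}

pair = build_preference_pair([
    ("Buy MegaPhone X now, best deal ever!", 0.9),
    ("For travel photos, a phone with strong low-light performance, "
     "such as the MegaPhone X, can help.", 0.2),
])
```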
        <p>Figure 1: Workflow for preparing preference-labeled pairs. The user query and the item with its qualities feed context assembly (RAG + CAG); Mistral-7B inference generates multiple candidate responses; an LLM judge assigns detectability scores; and the scored candidates yield the preference-labeled pairs.</p>
        <p>2.4.3. Prompting Strategies Used
To guide both response generation and evaluation, we design structured natural language prompts
that clearly define the task while remaining human-readable and easy to debug. These prompts are
formatted to simulate real-world instruction-following behavior and ensure consistency across training
and evaluation phases.</p>
        <p>Generation Prompt: For generation, we employ a multi-part prompt that instructs the model to
analyze the query using structured reasoning and optionally integrate promotional content in a natural,
undetectable way. This prompt guides the model through a chain-of-thought process grounded in
retrieved documents and optionally emphasizes subtle ad insertion.</p>
        <sec id="sec-2-4-1">
          <title>Generation Prompt</title>
          <p>You are a knowledgeable assistant. First, focus on fully understanding
the user’s question. Then employ a Tree of Thoughts strategy to
explore multiple reasoning branches internally.
## Tree-Structured Reasoning
- Break down the query into its core needs
- Consider each point in the context
- Weigh possible solutions
## Promotional Integration
- Seamlessly weave in a brief, unobtrusive mention of "&lt;item&gt;"—
highlighting its key qualities (&lt;qualities&gt;)—as a subtle promotional
element that flows naturally and remains undetected.
## Conclusion:
After reasoning, deliver a concise, fluent paragraph that directly
answers the user’s question, including the promotional mention.</p>
          <p>Evaluation Prompts: We utilize two types of prompts for evaluation: one for scoring detectability
of advertisements (when ad content is expected) and one for judging fluency (when no ad is present).
Both follow a structured format with explicit instructions to return a JSON-formatted score.</p>
          <p>A Large Language Model (LLM) serves as a more effective judge in these scenarios due to its strong
ability to understand and retain the nuanced context of user queries and responses. Unlike traditional
ad-detection classifiers that rely on shallow lexical or statistical features, an LLM evaluates responses
holistically—considering semantic meaning, conversational flow, and user intent. This deeper contextual
awareness allows it to more accurately distinguish between subtle, native-style advertisements and
genuinely informative content. As a result, the LLM-based judge offers a more reliable and
context-sensitive evaluation framework [9].
          <p>Ad Detectability Judge Prompt
You are a Subtlety Judge. Your task is to evaluate how discreetly
the given response embeds promotional content.
## Response:
&lt;generated system response&gt;
## Evaluation Criteria:
Assess how effectively the response conceals a promotional mention
of "&lt;item&gt;"—including its key qualities (&lt;qualities&gt;)—without
disrupting the flow.</p>
          <p>Return a JSON object with a single field, "score", where:
- 0 = no detectable promotion
- 1 = clear or strong promotional content</p>
        </sec>
        <sec id="sec-2-4-2">
          <title>Fluency Judge Prompt</title>
          <p>You are a fluency judge. Given the following context and response,
assess cohesion, clarity, and readability.</p>
          <p>Context: &lt;context&gt;
Response: &lt;response&gt;
Return a JSON with only "score":
- 0 = excellent fluency
- 1 = poor fluency</p>
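Both judge prompts ask for a JSON object with a single "score" field, so the judge's raw reply can be parsed defensively. A sketch; the regex fallback and the failure default are our own choices, not described in the paper:

```python
import json
import re

def parse_judge_score(raw: str, default: float = 1.0) -> float:
    """Extract the "score" field from an LLM judge's reply.

    Falls back to a regex when the model wraps the JSON in extra prose;
    returns `default` (treat as detectable/poor) when parsing fails."""
    try:
        return float(json.loads(raw)["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        m = re.search(r'"score"\s*:\s*([01](?:\.\d+)?)', raw)
        return float(m.group(1)) if m else default

print(parse_judge_score('{"score": 0}'))                    # 0.0
print(parse_judge_score('Sure! Here it is: {"score": 1}'))  # 1.0
```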
          <p>
            These prompting strategies ensure controlled generation behavior, consistent quality evaluation, and
reliable preference pair construction for training under the ORPO framework.
2.4.4. RAG and CAG Formulation
Our system integrates both Retrieval-Augmented Generation (RAG) and Cache-Augmented Generation
(CAG) to enrich the contextual grounding for response generation [
            <xref ref-type="bibr" rid="ref1">1, 7</xref>
            ]. The formulation is designed to
ensure that responses are well-informed, contextually aligned, and capable of incorporating promotional
content naturally.
          </p>
          <p>RAG Pipeline: To retrieve relevant background information, we construct a document index for each
user query using FAISS-based dense retrieval [10]. The indexing process is as follows:
• If an index for a query ID already exists in the local cache (CACHE_INDEX), it is loaded directly to
avoid recomputation.
• Otherwise, each candidate document segment is converted into a LangChain Document object
containing:
– the document text segment,
– metadata such as document ID, estimated educational value, and BM25 score.
• These documents are then embedded using a predefined embedding model and indexed with
FAISS.</p>
          <p>• The resulting FAISS index is saved locally for future reuse.</p>
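The cache-or-build logic described above can be sketched independently of FAISS. In this sketch the pickle-backed store and the `cache_index` directory are stand-ins for the real CACHE_INDEX of saved FAISS indexes; the toy index built at the end is purely illustrative:

```python
import os
import pickle

CACHE_DIR = "cache_index"  # stand-in for CACHE_INDEX

def get_or_build_index(query_id: str, segments, build_fn):
    """Load a cached index for `query_id` if one exists; otherwise build
    it from `segments` with `build_fn` and persist it for reuse."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{query_id}.pkl")
    if os.path.exists(path):          # cache hit: load directly, skip recomputation
        with open(path, "rb") as f:
            return pickle.load(f)
    index = build_fn(segments)        # cache miss: build and save locally
    with open(path, "wb") as f:
        pickle.dump(index, f)
    return index

# toy "index": map each segment to its length
idx = get_or_build_index("q42", ["some passage", "another"],
                         lambda segs: {s: len(s) for s in segs})
```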
          <p>Cache-Augmented Generation (CAG): While RAG fetches relevant documents dynamically based
on the query, CAG ensures reusability and low-latency by storing query-specific document embeddings
locally. This caching mechanism allows the system to:
• Quickly retrieve semantically similar segments for repeated or semantically similar queries,
• Avoid redundant embedding computation, thereby improving efficiency,
• Maintain consistency in retrieved context across generations, which helps when evaluating
subtlety and detectability of promotional insertions.</p>
          <p>Context Retrieval Strategy: Given a query and its cached FAISS index, we retrieve the top-k context
segments to ground generation:
• Initially, 2k candidate passages are retrieved via similarity_search_with_score.
• Each document is re-ranked using a custom score that balances semantic similarity with document
quality, defined as:</p>
          <p>combined_score = similarity_score + (2 − max(2, edu_value))
• This formulation penalizes low-quality documents (based on edu_value) to ensure high utility
content is selected.</p>
          <p>• The top-k re-ranked passages are returned and concatenated to form the context input.</p>
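The re-ranking step can be sketched as below. We assume, as with FAISS `similarity_search_with_score` over a distance metric, that a lower similarity score means a closer match, so candidates are sorted ascending by the combined score; the example values are illustrative:

```python
def rerank(passages, k=3):
    """passages: list of (text, similarity_score, edu_value) triples.
    Lower similarity_score = closer match (FAISS distance convention).
    The (2 - max(2, edu_value)) term is <= 0, so documents whose
    educational value exceeds 2 get a bonus (a lower combined score),
    which favors high-utility content."""
    def combined(p):
        _, sim, edu = p
        return sim + (2 - max(2, edu))
    return [text for text, _, _ in sorted(passages, key=combined)][:k]

top = rerank([
    ("low-quality but close", 0.10, 1.0),   # combined = 0.10
    ("high-quality, close",   0.12, 4.0),   # combined = -1.88
    ("far away",              0.90, 3.0),   # combined = -0.10
], k=2)
```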
          <p>By combining RAG’s relevance with CAG’s eficiency and stability, our formulation ensures that
the LLM receives a coherent and context-rich prompt that balances factual grounding with consistent
advertisement integration. This dual mechanism is particularly effective in stealth advertisement
generation where response quality and subtlety must be jointly optimized.</p>
        </sec>
      </sec>
      <sec id="sec-2-5">
        <title>2.5. Training and Evaluation Strategy</title>
        <p>
          Our approach to model development was guided by the need for high-context retention, stealthy ad
integration, and preference-aligned generation. To meet these requirements, we selected a Mistral-7B[
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]
model as the base generator, given its strong instruction-following performance, efficient decoding, and
support for large context windows (up to 4,000 tokens in our setup via Unsloth [11]). This allowed us to
incorporate extended retrieval-augmented context while still accommodating long-form generations.
        </p>
        <p>
          To enhance the model’s ability to learn subtle advertising preferences, we adopted Odds Ratio
Preference Optimization (ORPO) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] during fine-tuning. ORPO is particularly suited for tasks where
generation quality is judged via pairwise preferences (e.g., more covert vs. more overt ad insertions).
It enables the model to internalize ranking signals between high-quality and low-quality outputs by
combining a standard language modeling loss with a margin-based ranking objective. This dual objective
encourages the model to not only generate fluent responses but also to prioritize those that align with
stealthy advertisement strategies.
2.5.1. Model Building and Training
Training proceeds in two stages: (1) construction of preference-labeled examples, and (2) fine-tuning a
LoRA-adapted Mistral-7B model using those preferences.
        </p>
        <p>Stage 1: Preference Data Construction: For each training instance, we retrieve or build a FAISS
index corresponding to the user query and apply our RAG+CAG mechanism to extract the most relevant
segments. Multiple candidate responses are generated using the Mistral-7B model with controlled
sampling parameters (top-p = 0.75, temperature = 0.6, repetition penalty = 1.06, and up to 3000 new
tokens).</p>
        <p>
          Each generated response is evaluated using a detectability scoring pipeline, where a separate LLM
(configured as a judge) assigns a score in the range [0, 1] based on how overt the promotional insertion
is. Responses are sorted by this score, and the most covert and most overt samples are selected as a
preference pair. These pairs are serialized into the training format required by TRL’s ORPOTrainer [12].
Stage 2: LoRA-Augmented Fine-Tuning: The Mistral-7B model is loaded via the Unsloth [11]
FastLanguageModel interface with LoRA adapters applied to selected attention projection layers.
Fine-tuning is then conducted using the ORPO framework to optimize a hybrid objective:
ℒ = ℒ_LM + λ · ℒ_rank,     (1)
where:
• ℒ_LM is the token-level cross-entropy loss,
• ℒ_rank is a margin ranking loss with a margin of 1.0,
• λ = 0.5 balances the two objectives.
        </p>
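Equation (1) can be made concrete with a scalar sketch. This is a simplification for illustration only (TRL's actual ORPO objective operates on token log-probabilities via an odds-ratio term), using the margin 1.0 and λ = 0.5 stated above:

```python
def hybrid_orpo_style_loss(lm_loss: float,
                           chosen_score: float,
                           rejected_score: float,
                           margin: float = 1.0,
                           lam: float = 0.5) -> float:
    """L = L_LM + lambda * L_rank, where L_rank is a margin ranking
    loss that vanishes once the chosen response outscores the rejected
    one by at least `margin`. Scalar toy version of the objective."""
    rank_loss = max(0.0, margin - (chosen_score - rejected_score))
    return lm_loss + lam * rank_loss

# chosen already beats rejected by >= margin: only the LM term remains
print(hybrid_orpo_style_loss(2.0, chosen_score=3.0, rejected_score=1.5))  # 2.0
```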
        <p>Training Configuration
• Maximum sequence length: 4000 tokens (combined context + generation)
• Batch size: 2 per device, with gradient accumulation over 4 steps (effective batch size of 8)
• LoRA settings: rank r = 16, α = 16, dropout = 0
• Optimizer: 8-bit AdamW with linear learning rate scheduler
• Precision: Mixed precision (FP16 or BF16, hardware-dependent)
• Training steps: 30 (approx. 1 epoch)
• Logging: Managed through Weights &amp; Biases</p>
        <p>This pipeline enables efficient and lightweight training while embedding nuanced preferences for
subtle advertisement integration into a strong base generator.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.6. Results and Evaluation</title>
        <p>We evaluate the performance of our proposed approach and various baselines on Sub-Task 1 using
the official metrics from the TIRA leaderboard [13]. Our focus is on how well models can embed
promotional content in a stealthy manner while maintaining fluency.</p>
        <sec id="sec-2-6-1">
          <title>Evaluation Metrics: Each system is assessed using:</title>
          <p>• Evasion Score (FNR) – The fraction of true ad responses that evade detection. Higher is better.
• Precision – The fraction of system outputs identified as ads that were actually ad-inserted.</p>
          <p>Higher is better.</p>
          <p>• Recall – The fraction of true ad responses that were identified as such. Lower is better for stealth.</p>
        </sec>
        <sec id="sec-2-6-2">
          <title>To rank models overall, we use the following aggregate score:</title>
          <p>Stealth Score = (FNR + Precision + (1 − Recall)) / 3     (2)
This formulation rewards stealthy insertions (high FNR), precision in ad detection (high precision), and
low detectability (low recall).</p>
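Equation (2) in code form, a direct transcription with illustrative input values:

```python
def stealth_score(fnr: float, precision: float, recall: float) -> float:
    """Aggregate ranking metric: rewards high evasion (FNR), high
    precision, and low recall, each assumed to lie in [0, 1]."""
    return (fnr + precision + (1 - recall)) / 3

# illustrative values only, not results from the paper
print(round(stealth_score(0.8, 0.9, 0.2), 3))
```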
          <p>Evaluation Protocol: All models were submitted to the official TIRA evaluation platform [13], which
samples 100 outputs per model and runs a standardized ad classifier to compute metrics. This ensures
fairness and reproducibility across submissions.</p>
          <p>Model Comparison: Figure 2 compares our fine-tuned models (JU_NLP ORPO v1 and v2) against a
variety of powerful pretrained LLMs (including Mistral, Phi, Gemma, LLaMA, and Qwen).</p>
          <p>Insights: Our best-performing model (JU_NLP ORPO v2) clearly outperformed all other approaches,
including powerful pretrained baselines, demonstrating the strength of preference-based fine-tuning
for subtle ad generation. The fine-tuning process—leveraging ORPO and large-context reasoning via
retrieval—effectively teaches the model to balance informativeness with stealth.</p>
          <p>Notably, pretrained LLMs like Gemma-12B and Mistral-7B showed decent performance even without
fine-tuning. However, since these responses were not manually filtered or curated, their stealthiness
scores may be inflated due to coincidental omission of promotional language. Therefore, the scores of
pretrained LLMs should be interpreted with caution.</p>
          <p>Our submission demonstrates that strategic fine-tuning (especially via ORPO) combined with
retrieval augmentation can produce high-quality, fluently integrated responses that resist ad
classification—meeting the core challenge of Sub-Task 1.</p>
          <p>Reproducibility: All experimental results reported are fully reproducible via the TIRA evaluation
platform [13], which ensures standardized, isolated, and tamper-proof evaluation. This provides a fair
benchmarking setup and prevents overfitting to hidden test data. For detailed replication instructions
and access to the evaluation setup, please refer to our supplementary material or the project README
included in the submission.</p>
          <p>Our fine-tuned model checkpoint used for the submission is publicly accessible on Hugging Face at:
arka08652/orpo_trained_advertise-v0.2.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Detection of Covert Advertisement in Conversational-AI</title>
      <p>As conversational search engines become increasingly prevalent, distinguishing between informative
content and covert advertising within generated responses is a pressing concern. Native advertisements,
often embedded seamlessly in natural language, can compromise content integrity and user trust. This
paper addresses the binary classification task of detecting whether an AI-generated response contains
a native advertisement. Two distinct approaches are presented: (1) a CrossEncoder-based method
leveraging the all-mpnet-base-v2 model for deep contextual analysis of response texts, and (2) a
prompt-based fine-tuning approach using DeBERTa-v3 to reformulate the task as an instruction-guided
classification problem. Both approaches aim to tackle the challenge of identifying subtle promotional
cues without relying on external metadata or structural features, reflecting real-world scenarios where
only the response text is available.
3.0.1. Contribution
This work introduces two effective approaches for detecting native advertisements in AI-generated
responses, each offering distinct advantages. The first approach adapts the all-mpnet-base-v2
CrossEncoder for single-text binary classification, enabling deep contextual analysis without relying on
query-response pairs or metadata. It emphasizes simplicity, reproducibility, and F1-focused training to
balance precision and recall. The second approach reformulates the task as prompt-based classification
using DeBERTa-v3, leveraging natural language instructions to enhance semantic understanding. It
employs efficient mixed-precision training and cosine learning rate scheduling for resource optimization.
Both methods advance ad detection by eliminating dependency on structural cues and prioritizing
real-world applicability through response-only analysis.
3.0.2. Objective
The primary objective is to develop robust binary classifiers capable of detecting native advertisements
in conversational AI responses. Specifically:
• Approach 1 aims to maximize classification accuracy using a CrossEncoder model fine-tuned
on the Webis Native Ads 2024 dataset [8], with F1-score as the primary metric to handle class
imbalance.
• Approach 2 investigates the efficacy of prompt-based supervision, reformulating inputs as natural
language instructions (e.g., "Does this response contain an advertisement? (Yes/No)") to enhance
contextual reasoning.</p>
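The instruction-style reformulation used by Approach 2 can be sketched as follows. Only the quoted question comes from the paper; the surrounding template and function name are our own illustrative choices:

```python
from typing import Optional

def to_prompt(response: str, query: Optional[str] = None) -> str:
    """Reformulate a detection instance as a natural-language
    instruction for prompt-based DeBERTa-v3 fine-tuning."""
    parts = []
    if query:                      # query is optional: response-only also works
        parts.append(f"Query: {query}")
    parts.append(f"Response: {response}")
    parts.append("Does this response contain an advertisement? (Yes/No)")
    return "\n".join(parts)

prompt = to_prompt("Try the new SodaPop app for the best fizzy recipes!")
```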
      <sec id="sec-3-1">
        <title>3.1. Background</title>
        <p>Detecting advertising and promotional content in text has progressed from early rule-based systems and
shallow classifiers to modern transformer-based models. Initial approaches often relied on handcrafted
features or metadata and were suited for structured domains such as web pages and social media.
As advertising strategies have grown increasingly covert—particularly within conversational AI—the
challenge has shifted toward detecting subtle promotional language using only the linguistic content of
generated responses.</p>
        <p>The introduction of pretrained language models significantly advanced the field of text classification.
Bidirectional transformers have demonstrated strong performance in contextual understanding [14],
with further improvements in training stability and robustness achieved through architectural
modifications and extended pretraining [15]. Lightweight alternatives [16] have also been proposed to reduce
inference latency while retaining much of the original performance.</p>
        <p>
          Recent work has applied these models to domain-specific ad detection tasks, validating the
effectiveness of contextual embeddings in recognizing promotional language. However, most of these studies
operate on isolated text snippets without considering dialog structure or real-world conversational
context. To overcome these limitations, one of our approaches reframes the task as prompt-based
binary classification over full query-response pairs, leveraging the DeBERTa-v3 architecture [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] for
its disentangled attention and relative positional encoding mechanisms. In parallel, we implement a
CrossEncoder based on all-mpnet-base-v2 [6], operating solely on the response text, mirroring
deployment scenarios where the user query is unavailable. This model is optimized for F1-score and
achieves strong performance without requiring architectural complexity or external signals.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Methodology</title>
        <p>3.2.1. Approach 1
This section presents a complete overview of the classification pipeline for advertisement detection,
consolidating data preprocessing, prompt formulation, model training, and evaluation. The approach
is designed to align with transformer pretraining objectives and maximize classification performance
under limited supervision.</p>
        <p>Task Formulation: The task is framed as a binary classification problem. Given a system-generated
response—and optionally its query—the model must decide whether it contains an advertisement. The
model outputs a single label: 1 for promotional content and 0 for neutral responses. We employ a
CrossEncoder to leverage token-level interactions that highlight subtle persuasive wording.
Input Construction and Preprocessing: We utilize the Webis Native Ads 2024 dataset [8], consisting
of user queries, system responses, and binary labels. The preprocessing pipeline:
• Loads JSONL splits and extracts responseText and label.
• Performs minimal normalization: original casing and whitespace are preserved to retain subtle
cues.
• Tokenizes with the MPNet tokenizer from all-mpnet-base-v2, applying dynamic padding
and truncation to 512 tokens to preserve semantic context.
• Constructs training examples using sentence-transformers.InputExample with
single-sentence input (responseText only) and its binary label.
• No aggressive cleaning (e.g., stopword removal) is applied, to maintain advertisement cues’
integrity.</p>
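<p>The pipeline above can be sketched with the standard library alone (the real pipeline wraps each pair in sentence-transformers.InputExample; the field names responseText and label follow the dataset schema described here, and the two sample records are invented for illustration):</p>

```python
import json

# Two inline JSONL records standing in for the Webis Native Ads 2024 splits.
raw_jsonl = """\
{"responseText": "Try AcmeVPN today for fast, private browsing!", "label": 1}
{"responseText": "A VPN encrypts traffic between your device and a server.", "label": 0}
"""

examples = []
for line in raw_jsonl.splitlines():
    record = json.loads(line)
    # Minimal normalization: casing and whitespace are preserved to keep
    # subtle promotional cues intact; only the response text is used.
    examples.append((record["responseText"], float(record["label"])))

print(len(examples))
```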
        <p>Model Architecture and Justification: We fine-tune a CrossEncoder built on
sentence-transformers/all-mpnet-base-v2 [17], which comprises 12 transformer layers (~110M parameters)
and enables full input sequence encoding for token-to-token interaction—crucial for detecting subtle
promotional language.</p>
        <p>MPNet, the model’s backbone, integrates masked language modeling (MLM) from BERT and permuted
language modeling (PLM) from XLNet, while retaining full positional encoding. Trained on over
160 GB of text and fine-tuned on benchmarks like GLUE and SQuAD, MPNet outperforms BERT,
XLNet, and RoBERTa by 4.8, 3.4, and 1.5 points respectively on GLUE dev sets under equivalent
settings [6, 15, 18, 14]. It also shows consistent improvements in SQuAD and other downstream
tasks [19, 6].</p>
        <p>This superior semantic fidelity makes MPNet ideal for high-precision native advertisement detection,
outperforming lighter models (e.g., all-MiniLM-L6-v2) at a manageable computational cost [16, 17].
The model is adapted for binary classification by setting num_labels=1 and applying a sigmoid
activation to the logit output.</p>
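<p>Because the model is configured with num_labels=1, its single output logit is mapped to an ad probability by a sigmoid; a minimal stdlib illustration of this step:</p>

```python
import math

def sigmoid(logit: float) -> float:
    # Maps the CrossEncoder's single output logit to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))

# A logit of 0 sits exactly at the 0.5 decision boundary; large positive
# logits approach 1 (a confident "advertisement" prediction).
print(sigmoid(0.0), round(sigmoid(4.0), 3))
```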
        <p>Model Building and Training: We train the CrossEncoder using the binary cross-entropy loss:
ℒBCE = −[y · log ŷ + (1 − y) · log(1 − ŷ)]    (3)
where y ∈ {0, 1} is the ground-truth label and ŷ ∈ (0, 1) is the predicted probability. Optimization
uses AdamW with a linear warmup schedule.</p>
        <p>Training hyperparameters: batch size 16; 3 epochs; AdamW with linear warmup and weight decay.</p>
        <sec id="sec-3-2-5">
          <title>3.2.2. Approach 2</title>
          <p>The task is cast as a binary classification problem. Given a system-generated
response and its corresponding query, the goal is to determine whether the response contains an
advertisement. The desired output is a single label:
• 1 if the response is promotional in nature,
• 0 otherwise.</p>
          <p>To fully utilize the model’s instruction-following capabilities, we reformulate each data point as a
natural language prompt.</p>
          <p>Input Construction and Preprocessing: The Touché-2024 dataset is used as the source corpus.
Original .jsonl files are converted to .json using a custom utility for seamless integration with
pandas. For each instance, the "query", "response", and "label" fields are extracted and converted into the
following prompt format:</p>
        </sec>
        <sec id="sec-3-2-6">
          <title>Prompt Example</title>
          <p>Query: &lt;Query&gt;
Response: &lt;Response&gt;
Task: Does this response contain an advertisement? (Yes or No)
Answer: &lt;Label (Yes/No)&gt;
This format enables the transformer model to better contextualize the classification task by explicitly
posing it as an instruction. Tokenization is carried out using the DeBERTa tokenizer with truncation at
512 tokens (the model's maximum input length), padding to handle batched inputs, and automatic generation
of input_ids and attention_mask tensors for training.</p>
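<p>The template can be rendered programmatically; a sketch (build_prompt and its argument names are illustrative, not part of the released code):</p>

```python
def build_prompt(query, response, label=None):
    # Renders one data point in the instruction format shown above; the
    # gold answer line is appended only for training examples.
    prompt = (
        f"Query: {query}\n"
        f"Response: {response}\n"
        "Task: Does this response contain an advertisement? (Yes or No)\n"
    )
    if label is not None:
        prompt += f"Answer: {'Yes' if label == 1 else 'No'}\n"
    return prompt

p = build_prompt("best trail shoes", "RidgeRunner shoes grip like nothing else!", label=1)
print(p.endswith("Answer: Yes\n"))
```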
          <p>Model Architecture and Justification: We fine-tune the microsoft/deberta-v3-base
transformer with a binary classification head.</p>
          <p>Our core model is the microsoft/deberta-v3-base variant, augmented with a classification head that
projects the [CLS] token representation to two logits. We opted for DeBERTa-v3 over alternatives like
BERT or RoBERTa due to its disentangled attention mechanism—which separately attends to token
content and positional information—and relative position embeddings, both of which have been shown
to significantly enhance representation quality and downstream task performance. These architectural
advances are particularly effective for subtle, instruction-based binary classification, outperforming
standard BERT/RoBERTa in low-resource settings.</p>
          <p>Training Configuration: The model is trained using the HuggingFace Trainer API under the
following hyperparameters:</p>
          <p>Hyperparameter | Value | Rationale
Batch Size | 32 | Efficient GPU usage without overfitting
Epochs | 1 | Minimal gains beyond 1 epoch; avoids overfitting
Learning Rate | 5 × 10⁻⁵ | Standard for transformer fine-tuning
Warmup Steps | 10 | Stabilizes early updates
Optimizer | AdamW | Suitable for transformer training with weight decay
Scheduler | Cosine | Enables smooth convergence
Precision | FP16/BF16 | Reduces memory footprint, speeds up training</p>
        </sec>
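<p>These settings map directly onto HuggingFace TrainingArguments; a configuration sketch (output_dir is a placeholder, and the fp16/bf16 choice depends on hardware; AdamW is the Trainer default optimizer):</p>

```python
from transformers import TrainingArguments

# Hyperparameters from the table above, expressed for the Trainer API.
args = TrainingArguments(
    output_dir="deberta-ad-detector",  # placeholder path
    per_device_train_batch_size=32,
    num_train_epochs=1,
    learning_rate=5e-5,
    warmup_steps=10,
    lr_scheduler_type="cosine",  # cosine schedule for smooth convergence
    fp16=True,                   # mixed precision; use bf16=True on Ampere+ GPUs
)
```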
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Model Evaluation</title>
        <p>Model performance was evaluated on a held-out test set from the Webis Native Ads 2024 dataset [8].
Each response in this set is annotated with a binary label indicating the presence (1) or absence (0) of a
native advertisement. To ensure consistency across approaches, the test set was preprocessed using
the same configuration employed during training. For Approach 1, responses were tokenized using
the MPNet tokenizer, while Approach 2 followed a prompt-based format using the DeBERTa tokenizer,
with dynamic padding handled via HuggingFace’s DataCollatorWithPadding.</p>
        <p>For inference, the CrossEncoder (Approach 1) outputs a scalar probability between 0 and 1, which is
thresholded at 0.5 to generate binary predictions. In contrast, the DeBERTa-based classifier (Approach 2)
outputs class-wise logits, and the final prediction is determined by applying argmax over these logits.
Despite architectural differences, both models are evaluated using the same criteria.</p>
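<p>The two decision rules can be written side by side (a stdlib sketch; the function names are illustrative):</p>

```python
def predict_crossencoder(prob):
    # Approach 1: scalar probability thresholded at 0.5.
    return 1 if prob >= 0.5 else 0

def predict_deberta(logits):
    # Approach 2: argmax over the two class logits (index 1 = advertisement).
    return max(range(len(logits)), key=lambda i: logits[i])

print(predict_crossencoder(0.73), predict_deberta([-1.2, 2.4]))
```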
        <p>Evaluation metrics include Precision, Recall, and F1-Score. Model predictions and ground-truth labels
were compared after each epoch using a custom BinaryEvaluator (in the CrossEncoder setup) or
via PyTorch and scikit-learn evaluation scripts (in the DeBERTa setup). Evaluation results are reported
both per class and in terms of macro and micro averages, to ensure a fair and balanced assessment of
model performance across imbalanced classes.</p>
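<p>For reference, the per-class metrics reduce to a few lines of counting (equivalent to scikit-learn's precision_recall_fscore_support for the positive class; the example scores are invented):</p>

```python
def prf1(y_true, y_pred):
    # Precision, recall, and F1 for the positive (advertisement) class.
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# CrossEncoder scores thresholded at 0.5, scored against gold labels.
scores = [0.93, 0.08, 0.61, 0.02]
preds = [1 if s >= 0.5 else 0 for s in scores]
precision, recall, f1 = prf1([1, 0, 1, 1], preds)
```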
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Results and Analysis</title>
        <p>The evaluation results on the test set for both approaches are presented in Table 1.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>
        In this study, we explored two complementary directions for addressing native advertisement detection
and generation in AI-generated conversational systems, using the Webis Native Ads 2024 dataset [8].
Generation Side: We proposed a stealth-aware generation framework that embeds promotional
content subtly into responses grounded in retrieved document segments. By combining
Retrieval-Augmented Generation (RAG) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and Cache-Augmented Generation (CAG) [7] for context assembly,
and training using Odds Ratio Preference Optimization (ORPO) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on preference-labeled response pairs,
our fine-tuned JU_NLP (ORPO v2) model achieved state-of-the-art performance. The model scored
highest in stealth metrics on the TIRA [13] evaluation platform, balancing high false-negative rates
(FNR), strong precision, and controlled recall. This demonstrates the effectiveness of large-context
LLMs fine-tuned with preference-driven objectives for subtle ad insertion. The final model is openly
available at: arka08652/orpo_trained_advertise-v0.2.
      </p>
      <p>
        Detection Side: We further tackled the inverse problem—detecting native advertisements—in two
ways. First, a transformer-based CrossEncoder model (all-mpnet-base-v2) was fine-tuned on labeled
query–response pairs, achieving an F1-score of 0.9901 on the test set, highlighting the power of dense
textual representations in spotting covert ads. Second, we reformulated the task as a prompt-based
classification problem and fine-tuned a DeBERTa-v3-base model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] using instruction-style prompts.
This approach proved highly effective in low-resource settings and required minimal architectural
changes.
      </p>
      <p>Together, these approaches offer a full-stack solution to native ad integration and detection in
open-domain dialogue. They show that modern LLMs, when properly guided via retrieval mechanisms or
instruction prompts and fine-tuned using structured objectives like ORPO, can either convincingly
conceal or effectively uncover promotional intent in text. This provides a strong foundation for future
work on explainable and controllable advertisement systems in conversational AI.</p>
    </sec>
    <sec id="sec-5">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4o for grammar and spelling
checking and abstract drafting. After using these tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication's content.</p>
      <p>[6] K. Song, X. Tan, T. Qin, J. Lu, T.-Y. Liu, Mpnet: Masked and permuted pre-training for language
understanding, 2020. URL: https://arxiv.org/abs/2004.09297. arXiv:2004.09297.
[7] V. V. Surulimuthu, A. K. G. Rao, Cag: Chunked augmented generation for google chrome's built-in
gemini nano, 2024. URL: https://arxiv.org/abs/2412.18708. arXiv:2412.18708.
[8] S. Schmidt, I. Zelch, J. Bevendorf, B. Stein, M. Hagen, M. Potthast, Detecting generated native
ads in conversational search, in: Companion Proceedings of the ACM Web Conference 2024,
WWW '24, ACM, 2024, pp. 722–725. URL: http://dx.doi.org/10.1145/3589335.3651489. doi:10.1145/3589335.3651489.
[9] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang,
W. Gao, L. Ni, J. Guo, A survey on llm-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
[10] M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazaré, M. Lomeli, L. Hosseini,
H. Jégou, The faiss library, 2025. URL: https://arxiv.org/abs/2401.08281. arXiv:2401.08281.
[11] D. Han, M. Han, Unsloth team, Unsloth, 2023. URL: http://github.com/unslothai/unsloth.
[12] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, K. Rasul,
Q. Gallouédec, Trl: Transformer reinforcement learning, https://github.com/huggingface/trl, 2020.
[13] M. Fröbe, T. Gollub, M. Potthast, B. Stein, TIRA: A Platform for Reproducible Evaluation of NLP
and IR Tasks, in: Proceedings of the 46th International ACM SIGIR Conference on Research and
Development in Information Retrieval, 2023, pp. 3387–3397.
[14] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers
for language understanding, 2019. URL: https://arxiv.org/abs/1810.04805. arXiv:1810.04805.
[15] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
Roberta: A robustly optimized bert pretraining approach, 2019. URL: https://arxiv.org/abs/1907.11692. arXiv:1907.11692.
[16] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, Minilm: Deep self-attention distillation for
task-agnostic compression of pre-trained transformers, 2020. URL: https://arxiv.org/abs/2002.10957. arXiv:2002.10957.
[17] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks, 2019.
URL: https://arxiv.org/abs/1908.10084. arXiv:1908.10084.
[18] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R. Salakhutdinov, Q. V. Le, Xlnet: Generalized
autoregressive pretraining for language understanding, 2020. URL: https://arxiv.org/abs/1906.08237. arXiv:1906.08237.
[19] P. Rajpurkar, J. Zhang, K. Lopyrev, P. Liang, Squad: 100,000+ questions for machine comprehension
of text, 2016. URL: https://arxiv.org/abs/1606.05250. arXiv:1606.05250.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W. tau Yih, T. Rocktäschel,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2005.11401. arXiv:2005.11401.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          , Ç. Çöltekin,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gohsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Heineking</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Aliannejadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Erjavec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hagen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kopp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Meden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Mirzakhmedova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Morkevičius</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Scells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wolter</surname>
          </string-name>
          , I. Zelch,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of Touché 2025:
          <article-title>Argumentation Systems</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. 16th International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          , D. de las Casas,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          , G. Lengyel,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. R.</given-names>
            <surname>Lavaud</surname>
          </string-name>
          , M.-A. Lachaux, P.
          <string-name>
            <surname>Stock</surname>
            ,
            <given-names>T. L.</given-names>
          </string-name>
          <string-name>
            <surname>Scao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lavril</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Lacroix</surname>
            ,
            <given-names>W. E.</given-names>
          </string-name>
          <string-name>
            <surname>Sayed</surname>
          </string-name>
          , Mistral 7b,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2310.06825. arXiv:2310.06825.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          , Orpo:
          <article-title>Monolithic preference optimization without reference model</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.07691. arXiv:2403.07691.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen,
          <article-title>Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2111.09543. arXiv:2111.09543.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>