<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Akram Elbouanani</string-name>
          <email>elbouanani.akram@gmail.com</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evan Dufraisse</string-name>
          <email>evan.dufraisse@cea.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aboubacar Tuo</string-name>
          <email>aboubacar.tuo@cea.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adrian Popescu</string-name>
          <email>adrian.popescu@cea.fr</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff1">
          <label>1</label>
          <institution>Université Paris-Saclay, CEA-List</institution>
          ,
          <addr-line>F-91120, Palaiseau</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Few-Shot Learning</kwd>
        <kwd>Prompt Engineering</kwd>
        <kwd>Subjectivity Detection</kwd>
        <kwd>Debate Prompting</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>This
work aims to demonstrate that LLMs can indeed compete with, and potentially outperform, fine-tuned
SLMs in subjectivity detection tasks through the application of techniques such as careful prompting and
few-shot prompting. We investigate the extent to which precise prompt engineering and the provision
of relevant examples within the prompt can enhance the performance of LLMs, thereby showcasing
their potential for robust and adaptable subjectivity detection in academic and real-world settings.
While these techniques show promise in theory, our experiments reveal that they do not consistently
improve performance on the tested datasets, suggesting important limitations and directions for further
research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>The computational detection of subjective language has evolved significantly from its early rule-based
foundations to contemporary data-driven approaches. This progression reflects broader trends in
natural language processing while addressing challenges in identifying opinionated content. Current
research emphasizes two critical aspects: (1) the development of increasingly sophisticated models
capable of capturing linguistic nuance [9, 10], and (2) the creation of specialized techniques to optimize
these models for subjectivity analysis [11, 12, 13].</p>
      <p>
        Architectural Evolution. Traditional lexicon-based and supervised learning approaches have given
way to transformer-based models, with BERT-style architectures demonstrating strong performance on
binary subjectivity classification [14, 15, 16]. However, the emergence of LLMs has introduced new
capabilities in detecting implicit subjectivity through contextual reasoning. Indeed, a key advantage of
LLMs is their advanced ability to recognize subtle linguistic cues like irony, sarcasm, or implicit bias [17].
Their architectural design enables them to capture complex dependencies within sequential data, leading
to a deeper understanding of intricate relationships between words and emotions. Recent empirical
studies have demonstrated that LLMs consistently achieve higher overall accuracy in sentiment analysis,
often outperforming specialized pre-trained transformer models due to their comprehensive grasp of
human thought and emotion [10]. This ability indicates strong potential for subjectivity detection, as
recognizing nuanced evaluative language is key to distinguishing subjective from objective statements
[
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>Prompting Strategies. Despite their inherent capabilities, optimizing LLM performance in
specialized tasks like subjectivity detection often requires appropriate prompting. While a wide range of
prompting strategies exists, we focus in this work on a selected subset of strategies which we evaluate
in our experiments.</p>
      <p>
        • Prompt engineering involves meticulously designing input queries to guide LLMs toward
desired outputs. This technique is crucial for clearly defining the nuances of subjectivity for the
model, specifying output formats, and aligning the LLM’s reasoning with task-specific objectives
[18, 19]. However, prompt effectiveness can be highly sensitive to wording, and small changes
may lead to inconsistent results [20, 21].
• In-context Learning (ICL) refers to the ability of LLMs to perform tasks by conditioning on
input-output examples provided directly in the prompt, without updating model parameters. A
common subcategory of ICL is few-shot learning, where the prompt includes a small number
of labeled examples to help the model infer the task and classification criteria [
        <xref ref-type="bibr" rid="ref5">22, 5</xref>
        ]. A key
challenge in few-shot ICL is selecting representative and diverse examples, as LLMs can overfit
to or ignore suboptimal demonstrations.
• Multi-agent LLM Systems are an emerging paradigm for enhancing LLM performance. This
approach distributes responsibilities across multiple specialized agents, each focusing on specific
functions like information retrieval, complex reasoning, or decision-making [23]. Multi-agent
systems offer several advantages, including enhanced reliability through cross-verification, refined
decision-making through collaborative information sharing, and improved handling of complex
tasks by dividing workloads [23, 24, 25]. Yet, coordination overhead and potential inconsistencies
between agents remain significant challenges [23].
      </p>
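      <p>To make the few-shot ICL setup above concrete, the following sketch assembles a balanced few-shot prompt for subjectivity classification. The instruction wording, example sentences, and helper name are illustrative assumptions, not the exact prompts used in any cited system:</p>

```python
# Sketch: assembling a few-shot in-context learning prompt for
# subjectivity classification. Wording and examples are hypothetical.
def build_fewshot_prompt(examples, sentence):
    """examples: list of (text, label) pairs with label in {"SUBJ", "OBJ"}."""
    lines = ["Classify each sentence as SUBJ (subjective) or OBJ (objective)."]
    for text, label in examples:
        lines.append(f"Sentence: {text}\nLabel: {label}")
    # The test sentence is appended with an empty label slot for the
    # model to complete, without any parameter update.
    lines.append(f"Sentence: {sentence}\nLabel:")
    return "\n\n".join(lines)

demo = [("I think the policy is a disaster.", "SUBJ"),
        ("The law was passed in 2019.", "OBJ")]
prompt = build_fewshot_prompt(demo, "The vote takes place on Tuesday.")
```

      <p>The model is expected to continue the final "Label:" line, which is what makes the demonstrations, rather than gradient updates, carry the task definition.</p>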
      <p>
        The CheckThat! lab [
        <xref ref-type="bibr" rid="ref9">26, 27</xref>
        ], organized within CLEF, serves as a prominent platform for advancing
subjectivity detection research, particularly in distinguishing subjective from objective statements at
the sentence level within news articles across multiple languages [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. CheckThat! evaluations have
systematically demonstrated the strengths and limitations of different approaches to fact-checking and
subjectivity detection. In early iterations (2018-2020), traditional machine learning models, such as
SVM with carefully engineered linguistic features, achieved competitive results, particularly for English
texts [
        <xref ref-type="bibr" rid="ref10 ref11">28, 29</xref>
        ]. The 2020-2021 evaluations marked a transition period in which fine-tuned BERT-style
models began to dominate the leaderboards [
        <xref ref-type="bibr" rid="ref12 ref13">30, 31</xref>
        ]. These results established transformer architectures
as the new baseline for subjectivity detection tasks. The most recent CheckThat! cycles (2023-2024)
included early approaches relying on LLMs. While early submissions underperformed due to inadequate
prompting strategies, subsequent systems demonstrated that properly optimized LLMs could match or
exceed specialized models [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>We utilize the provided multilingual dataset, which comprises sentence-level annotations labeled as
either OBJ (objective) or SUBJ (subjective). Table 1 summarizes the number of annotated sentences
per language and split. The dataset exhibits class imbalance across languages and splits, with some
languages (e.g., Italian and Arabic) showing a predominance of OBJ labels, while others (e.g., Bulgarian)
present a more balanced distribution.</p>
      <p>During the exploratory phase, we focus primarily on English and Arabic. These two languages were
selected due to their differing class distributions and dataset sizes. We operate under the assumption
that insights obtained from these languages are transferable to the other languages in the dataset.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>We explore several strategies to improve subjective sentence classification. We experiment with the
following three main approaches: prompt engineering, few-shot learning, and multi-agent LLM setups.
We also use fine-tuned SLMs as baselines.</p>
      <sec id="sec-4-1">
        <title>4.1. Prompt Engineering</title>
        <p>We systematically evaluate the impact of prompt phrasing and label framing on classification
performance. We compare:
• A minimal, generic prompt vs. a detailed one generated from the annotation guidelines.
• Label framing using explicit terms (“Subjective”/“Objective”) vs. neutral terms (“Category
0”/“Category 1”).
• Binary yes/no questions (e.g., “Is the sentence subjective?”) as an alternative to direct classification.</p>
        <p>These variations are designed to probe how linguistic framing affects the model’s interpretability and
consistency.</p>
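        <p>A minimal sketch of how these three framings might be encoded as alternative system prompts. The variant names and exact wording are paraphrased assumptions, not the prompts reported in the appendix:</p>

```python
# Illustrative encodings of the three label framings compared in
# Section 4.1 (paraphrased, hypothetical wording).
PROMPT_VARIANTS = {
    "explicit": "Classify the sentence as Subjective or Objective.",
    "neutral": "Classify the sentence as Category 0 or Category 1.",
    "binary": "Is the sentence subjective? Answer Yes or No.",
}

def make_messages(variant, sentence):
    # Standard chat-message structure accepted by most LLM APIs.
    return [{"role": "system", "content": PROMPT_VARIANTS[variant]},
            {"role": "user", "content": sentence}]

msgs = make_messages("binary", "The economy grew by 2% last year.")
```

        <p>Only the system message changes between variants, so any performance difference can be attributed to label framing alone.</p>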
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Few-Shot Learning Strategies</title>
        <p>We experiment with a range of few-shot prompting configurations to assess how the number and
selection of support examples influence performance. Specifically, we compare:
• Prompting setups with 0 (zero-shot), 6-shot, and 12-shot examples.
• Example selection strategies based on: (a) semantic similarity, (b) semantic dissimilarity, and (c)
random sampling.</p>
        <p>For the semantic approaches, similarity and dissimilarity are measured using cosine similarity between
sentence embeddings generated by OpenAI’s text-embedding-3-small model. To ensure class balance
in the few-shot prompts, we selected an equal number of examples from each class (subjective and
objective). For instance, in the 6-shot setting, the prompt included the three most similar (or dissimilar)
subjective examples and the three most similar (or dissimilar) objective ones. This balance helps prevent
prompt-induced bias toward a particular class label.</p>
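        <p>The balanced, similarity-ranked selection described above can be sketched as follows. Random vectors stand in for text-embedding-3-small embeddings, and the helper name is illustrative rather than the authors' code:</p>

```python
import numpy as np

# Sketch: class-balanced few-shot example selection ranked by cosine
# similarity to the test sentence (dummy embeddings used here).
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_examples(query_emb, pool, k=6, most_similar=True):
    """pool: list of (embedding, label) with label in {"SUBJ", "OBJ"}.
    Returns k examples, k // 2 per class, ranked by cosine similarity
    (or reverse-ranked when most_similar is False)."""
    chosen = []
    for wanted in ("SUBJ", "OBJ"):
        scored = [(cosine(query_emb, emb), emb, lab)
                  for emb, lab in pool if lab == wanted]
        scored.sort(key=lambda t: t[0], reverse=most_similar)
        chosen.extend((emb, lab) for _, emb, lab in scored[: k // 2])
    return chosen

rng = np.random.default_rng(0)
pool = [(rng.normal(size=8), "SUBJ") for _ in range(5)] + \
       [(rng.normal(size=8), "OBJ") for _ in range(5)]
picked = select_examples(rng.normal(size=8), pool, k=6)
```

        <p>Ranking within each class before truncating enforces the equal SUBJ/OBJ split that the text argues prevents prompt-induced label bias.</p>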
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Multi-Agent LLM Reasoning</title>
        <p>We design a set of multi-agent prompting experiments to investigate the interpretability and robustness
of LLM outputs:
• Debate setup: Two agents argue why a sentence is subjective vs. objective; a third model acts as
a judge.
• Adversarial reasoning: One agent argues why the sentence is not subjective and another why
it is not objective, and a judge makes the final call.
• Extended framing: We include all four perspectives (Subjective, Not Subjective, Objective, and
Not Objective), with a judge making the final decision.</p>
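        <p>The debate setup can be sketched as the following control flow. Here call_llm is a placeholder for any chat-completion client, and all prompt wording is an illustrative assumption (the actual prompts appear in the appendix):</p>

```python
# Sketch of the two-arguer-plus-judge debate pipeline.
def debate_classify(sentence, call_llm):
    # Agent 1: argues for subjectivity.
    pro_subj = call_llm(f"Argue why this sentence is subjective: {sentence}")
    # Agent 2: argues for objectivity.
    pro_obj = call_llm(f"Argue why this sentence is objective: {sentence}")
    # Agent 3: judges the two arguments and emits the final label.
    verdict = call_llm(
        "You are a judge. Given the two arguments below, answer only "
        "SUBJ or OBJ.\n"
        f"Argument for subjective: {pro_subj}\n"
        f"Argument for objective: {pro_obj}\n"
        f"Sentence: {sentence}")
    return verdict.strip()

# Stub LLM for illustration: the judge call always answers OBJ.
verdict = debate_classify(
    "The sky is blue.",
    lambda prompt: "OBJ" if "judge" in prompt else "...")
```

        <p>The adversarial and extended-framing variants differ only in the set of arguer prompts fed to the judge.</p>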
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>This section presents a detailed evaluation of multiple modeling strategies for subjectivity classification,
including fine-tuned transformers, prompted LLMs with and without few-shot examples, advanced
prompt reframing, and agent-based debating approaches. Given the dataset’s imbalance, the official
primary evaluation metric is the macro-averaged F1-score, which equally weights both classes. We also
pay particular attention to SUBJ recall, as the subjective class is often underrepresented and may be
more informative in downstream analyses. This imbalance was mitigated using a weighted BCE loss,
warmup training, and early stopping to ensure stability and generalizability. We first report results
with different system variants on the development subset and then discuss the official results obtained
with the 2025 test set.</p>
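      <p>The class-weighted BCE used to counter the imbalance can be sketched as follows. This is a NumPy stand-in for a framework loss (for example PyTorch's BCEWithLogitsLoss with pos_weight); the logits and weight values are illustrative only:</p>

```python
import numpy as np

# Sketch of class-weighted binary cross-entropy: errors on the
# minority (SUBJ, target 1) class are scaled up by pos_weight.
def weighted_bce(logits, targets, pos_weight):
    p = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    t = np.asarray(targets, dtype=float)
    eps = 1e-12  # numerical guard for log(0)
    loss = -(pos_weight * t * np.log(p + eps)
             + (1.0 - t) * np.log(1.0 - p + eps))
    return float(loss.mean())

# With pos_weight set to n_OBJ / n_SUBJ, a miss on a subjective
# sentence costs proportionally more than a miss on an objective one.
balanced = weighted_bce([2.0, -1.0], [1, 0], pos_weight=3.0)
plain = weighted_bce([2.0, -1.0], [1, 0], pos_weight=1.0)
```
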
      <sec id="sec-5-1">
        <title>5.1. Preliminary Results</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Fine-Tuned Transformers</title>
          <p>Table 2 summarizes the performance of supervised transformer models trained on English and Arabic
data. RoBERTa-Base, fine-tuned on English data, achieves the best performance overall, with a macro
F1 of 0.70 and a notably high macro precision of 0.79. However, the model struggles with subjective
instances, achieving only 0.39 recall for the subjective class, which suggests a strong bias toward the
majority (objective) class.</p>
          <p>In Arabic, the results are considerably weaker across the board. While BERT-Base-Arabertv02
achieves the best Arabic macro F1 score (0.55), subjective recall remains modest (0.47). Despite the
use of language-specific models and XLM-RoBERTa for cross-lingual encoding, the performance gap
between English and Arabic remains substantial.</p>
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Prompt Engineering and Few-Shot Learning</title>
        </sec>
        <sec id="sec-5-1-3">
          <title>5.1.3. Few-Shot Selection Strategies</title>
          <p>We further compare several strategies for selecting few-shot examples: random sampling,
similarity-based sampling, and dissimilarity-based sampling. In similarity-based sampling, we choose examples
that are most similar to the test sentence, while in dissimilarity-based sampling, we select those that
are the most different. Similarity is measured using the cosine similarity between sentence embeddings
generated by OpenAI’s text-embedding-3-small model. Results are shown in Table 4.</p>
          <p>
            Interestingly, random selection outperforms similarity-based strategies across all models. For
GPT-4o-mini, random sampling yields the best macro F1 (0.76), whereas dissimilarity-based selection offers
a better recall (0.73) but slightly lower overall F1. A similar trend is observed for Qwen-72B, where
dissimilar sampling boosts recall (+0.07 over similarity) but offers minimal F1 gain. This suggests that
dissimilar examples may help capture broader linguistic variance, aiding generalization. These results
contrast with earlier findings highlighting the benefits of semantically similar exemplars for in-context
learning [
            <xref ref-type="bibr" rid="ref14">32</xref>
            ].
          </p>
        </sec>
        <sec id="sec-5-1-4">
          <title>5.1.4. Prompt Reframing and Debate-Based Inference</title>
          <p>We investigate whether the way labels are framed affects model behavior. Reframing “subjective
vs. objective” as a binary question (e.g., “Is the sentence subjective? Yes/No”) or as category labels
(“Category 1 vs. Category 2”) leads to slight F1 gains over the base prompt (Table 5). Framing clearly
influences the model’s inductive bias, with category labels yielding better subjective precision (0.69)
and macro F1 (0.72).</p>
          <p>Debating-based prompting (Table 6) also provides strong results. The setup where one LLM argues
for subjectivity, another for objectivity, and a judge decides, achieves the best macro F1 overall (0.77).
Notably, this format significantly enhances subjective recall (up to 0.74), suggesting that
reasoning-focused prompting facilitates more balanced decisions. Debate variants using negated prompts (e.g.,
“Not Subjective” vs. “Not Objective”) also perform competitively.</p>
        </sec>
        <sec id="sec-5-1-5">
          <title>5.1.5. LLM Ensemble Results</title>
          <p>Finally, we evaluate an ensemble voting strategy that aggregates predictions from five diverse models:
RoBERTa-Base, GPT-4o-mini, LLaMA 70B, Qwen 72B, and Aya-Expanse 32B. As shown in Table 7,
this ensemble achieves the highest overall macro F1 score (0.79), with a strong subjective precision
of 0.77. These results indicate that ensembling models with heterogeneous architectures and training
paradigms can effectively capture complementary perspectives on subjectivity, enhancing robustness
and performance.</p>
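          <p>The voting scheme can be sketched as a simple majority over per-model labels. The model names follow the list above, but the predictions shown are dummies, not actual system outputs:</p>

```python
from collections import Counter

# Sketch of the majority-vote ensemble over the five models' labels.
def majority_vote(predictions):
    """predictions: dict mapping model name to its predicted label."""
    counts = Counter(predictions.values())
    return counts.most_common(1)[0][0]

votes = {"RoBERTa-Base": "OBJ", "GPT-4o-mini": "SUBJ", "LLaMA-70B": "SUBJ",
         "Qwen-72B": "SUBJ", "Aya-Expanse-32B": "OBJ"}
label = majority_vote(votes)
```

          <p>An odd number of voters over a binary label set guarantees a strict majority, so no tie-breaking rule is needed.</p>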
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Final Results</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Evaluation Setup</title>
          <p>The official evaluation of the CheckThat! 2025 campaign comprised three settings:
• Monolingual: train and test on data in a given language (Arabic, Italian, German, English).
• Multilingual: train and test on data comprising several languages.
• Zero-shot: train on several languages and test on unseen languages (Romanian, Polish, Ukrainian,
Greek).</p>
          <p>For our final submitted system in the official campaign evaluation, we adopted the extended-prompt
strategy using randomly selected 6-shot examples, paired with an ensemble of multiple models, including
GPT-4 variants (GPT-4o-mini, GPT-4.1-mini), RoBERTa, LLaMA 70B, and Qwen 72B. In the zero-shot
setting, the in-context examples were provided in English, following the task guidelines.</p>
          <p>Final results are reported in Table 8.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Discussion</title>
          <p>Our team demonstrated strong results across multiple languages, achieving first place in Arabic and
Polish, and top-three positions in the majority of the evaluated languages, including Italian, English,
and multilingual settings. This consistent performance underscores the robustness and generalizability
of our approach using LLM-based few-shot learning.</p>
          <p>Our experiments demonstrate that leveraging large language models (LLMs) instead of fine-tuned
smaller language models (SLMs) can yield highly competitive results across multiple languages and
settings. However, the effectiveness of LLMs critically depends on the quality of prompt design. For
instance, advanced prompting strategies such as debating LLMs, where multiple model outputs are
cross-examined, did not lead to substantial improvements over standard few-shot prompting. Similarly,
varying the example selection method, whether by similarity, dissimilarity, or random choice, showed
no significant impact on final performance. These findings suggest that while prompt engineering
remains essential, more complex example selection or ensemble strategies may not always provide
additional gains.</p>
          <p>
            The most notable result was in Arabic, where we outperformed the second-ranked team by a
substantial margin of +0.10 Macro F1-score. We attribute this advantage partly to the nature of the
Arabic dataset, which exhibits annotation inconsistencies. Unlike fine-tuned models that heavily depend
on high-quality labeled training data, our few-shot LLM approach is less affected by such noise. Prior
research has indicated that in-context learning with LLMs can be relatively independent of the exact
label quality provided in training examples [
            <xref ref-type="bibr" rid="ref15">33</xref>
            ]. Consequently, our method was more resilient to
inconsistencies, resulting in superior evaluation performance. This highlights a significant practical
benefit of using LLMs: they can better handle noisy or imperfect datasets, offering an edge in real-world
scenarios where high-quality annotations are difficult to obtain.
          </p>
          <p>Overall, our findings suggest that LLMs, combined with carefully crafted few-shot prompts, offer a
powerful and flexible alternative to traditional fine-tuning approaches, especially when training data
quality varies. This has important implications for future multilingual sentiment analysis tasks and
other NLP challenges where data quality and multilingual coverage are key concerns.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Dataset quality</title>
      <p>During our evaluation of the Arabic dataset, we consistently observed limited performance across all
tested configurations. Regardless of the prompting strategy employed, ranging from simple to extended
prompts, in both zero-shot and few-shot settings, the macro F1-score remained below 0.55. This plateau
in performance was observed across multiple models, including GPT-4 and fine-tuned BERT models as
shown in Table 2, and suggests potential issues beyond model capacity or prompt design.</p>
      <p>Interestingly, this contrasts with the results reported in the original dataset paper [13], where a
five-shot setup using a Maximal Marginal Relevance (MMR)-based example selection achieved an
F1-score of 0.80. In our comparable few-shot setting with GPT-4 and similarity-based example selection,
the F1-score reached only 0.547, indicating a significant gap in reproducibility.</p>
      <p>Upon closer inspection, we identified potential sources of annotation inconsistency. Manual review
revealed several instances where the assigned labels did not seem to align with the guidelines outlined
in the original paper. For example, the sentence:
، م هع م ىرخأ لﺍ ﺕا ح اسلﺍ ل م اكتت ن ﺃ ب ج ي نيطسلف ل هﺃو نين ج ل هﺃو نين ج ل خمياطبﺃ ن ﺍو فنع و ةيمز ع ّ ل ك ع م "
".هدح و س يل ينيطسل فلﺍ ب ع ش لﺍ ّ ن ﺃ هديّؤي ن م و ّودعلﺍ ف رعي ن ﺃو
“With all the determination and fervor of the heroes of Jenin camp, the people of Jenin, and
the people of Palestine, other fronts must unite with them, and the enemy and those who
support it must know that the Palestinian people are not alone.”
is labeled as objective, despite the presence of emotive language and the term “the enemy” (العدوّ),
which could reasonably be interpreted as subjective under the dataset’s own criteria, which
state that such politically charged language should be labeled as subjective. Conversely, clearly factual
sentences such as:</p>
      <p>24 ن انبلربع2024 ملاعلﺍ ﺱأك ﺡاتتف ا ل رشابملﺍ ثبلﺍ ن آلﺍ ﺍوده ا ش
“Watch now the live broadcast of the opening of the 2024 World Cup via Lebanon 24.”
are labeled as subjective, despite appearing to report straightforward event announcements. Similarly,
sentences comprising purely reported speech, such as the following about COVID-19 statistics, are
labeled as subjective even though the annotation guidelines specify otherwise:</p>
      <p>
        ﺽﺍرم أ لﺍ زكرم س يئﺭ ن ا لعﺇ ع م ن م ﺍزتلاب ةيضاملﺍ ةعاس 24ـلﺍ ل ا لخ ن يرﺍﺇ يف انوﺭو ك ﺕاباصإو ﺕياف و ﺓﺩياﺯ"
ﺓﺩياﺯ ن ع ﺓﺭﺍﺯولاب ة م اعلﺍ ﺕاق ا لعلﺍ ﺓﺭﺍﺩﺇ ت نلعﺃ ،انوﺭو ك ﺕﺍﺀاص ح ﺇ يف ي ج ي ﺭدت ع ج رﺍت ن ع ةحصلﺍ ﺓﺭﺍﺯوب ةيدعملﺍ
".ةيضاملﺍ ةعاس 24ـلﺍ ل ا لخ ﺕاباص إ لﺍو ﺕيافولﺍ ﺩدع يف ىرخﺃ
“An increase in COVID-19 deaths and infections in Iran over the past 24 hours, coinciding with
the announcement by the head of the Infectious Diseases Center at the Ministry of Health of a
gradual decline in COVID-19 statistics, while the Public Relations Department of the Ministry
announced another increase in the number of deaths and infections during the past 24 hours.”
To ensure this was not a limitation inherent to the task or language, we ran comparable experiments
on the Arabic dataset from the 2023 edition of the task. In that case, our models achieved significantly
better performance (F1 = 0.84) using a six-shot extended prompt, and the top team of that year had
reported an F1-score of 0.79 [
        <xref ref-type="bibr" rid="ref16">34</xref>
        ], demonstrating the feasibility of high performance on well-annotated
Arabic datasets.
      </p>
      <p>To further test the hypothesis that label quality rather than linguistic features was the bottleneck,
we translated the dataset into English using DeepL and reran the experiments. However, this also did
not lead to improved performance (F1 &lt; 0.6), reinforcing our initial hypothesis. A small-scale manual
reannotation conducted by one of the authors, who is a native Arabic speaker, led to a moderate increase
in performance (F1 = 0.65), providing further evidence that inconsistencies in labeling may play a role
in the observed results.</p>
      <p>These observations highlight the challenges of subjectivity annotation, especially in politically
sensitive contexts, and underline the importance of annotation consistency for benchmarking tasks
involving subtle linguistic distinctions.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this study, we have shown that large language models (LLMs) used with well-designed few-shot
prompting can rival or surpass fine-tuned smaller models (SLMs) across diverse languages and settings.
Our approach proved particularly robust in the face of noisy or inconsistent training data, as
demonstrated by our strong performance on the Arabic dataset. By consistently ranking among the top teams,
securing first place in Arabic and Polish and top-three finishes in most other languages, we illustrate
the versatility and effectiveness of LLMs for multilingual subjectivity detection.</p>
      <p>While advanced prompt engineering strategies such as debating and varied example selection did
not yield major improvements, our results emphasize the critical role of prompt quality itself. The
flexibility of LLMs combined with minimal reliance on extensive labeled data offers a promising path
forward for multilingual NLP tasks, especially when dealing with data of varying quality.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>This work was supported by the BOOM ANR Project (ANR-20-CE23-0024) and benefited from the use
of the FactoryIA supercomputer, funded by the Île-de-France Regional Council.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>This work was assisted by generative AI tools used to improve clarity and style, specifically GPT-4. The
authors reviewed and verified all content to ensure accuracy and maintain the integrity of the scientific
work.
[9] S. Javdan, B. Minaei-Bidgoli, et al., Applying transformers and aspect-based sentiment analysis
approaches on sarcasm detection, in: Proceedings of the second workshop on figurative language
processing, 2020, pp. 67–71.
[10] W. Zhang, Y. Deng, B. Liu, S. Pan, L. Bing, Sentiment analysis in the era of large language
models: A reality check, in: K. Duh, H. Gomez, S. Bethard (Eds.), Findings of the Association
for Computational Linguistics: NAACL 2024, Association for Computational Linguistics, Mexico
City, Mexico, 2024, pp. 3881–3906. URL: https://aclanthology.org/2024.findings-naacl.246/. doi:10.
18653/v1/2024.findings- naacl.246.
[11] M. Shokri, V. Sharma, E. Filatova, S. Jain, S. Levitan, Subjectivity detection in english news using
large language models, in: Proceedings of the 14th Workshop on Computational Approaches to
Subjectivity, Sentiment, &amp; Social Media Analysis, 2024, pp. 215–226.
[12] T. Huang, E. Fan, Structured reasoning for fairness: A multi-agent approach to bias detection in
textual data, 2025. URL: https://arxiv.org/abs/2503.00355. arXiv:2503.00355.
[13] R. Suwaileh, M. Hasanain, F. Hubail, W. Zaghouani, F. Alam, Thatiar: subjectivity detection in
arabic news sentences, arXiv preprint arXiv:2406.05559 (2024).
[14] Kusrini, M. Mashuri, Sentiment analysis in twitter using lexicon based and polarity multiplication,
in: 2019 International Conference of Artificial Intelligence and Information Technology (ICAIIT),
2019, pp. 365–368. doi:10.1109/ICAIIT.2019.8834477.
[15] S. Zahoor, R. Rohilla, Twitter sentiment analysis using lexical or rule based approach: A case study,
in: 2020 8th International Conference on Reliability, Infocom Technologies and Optimization
(Trends and Future Directions) (ICRITO), 2020, pp. 537–542. doi:10.1109/ICRITO48877.2020.
9197910.
[16] A. Kotelnikova, D. Paschenko, K. Bochenina, E. Kotelnikov, Lexicon-based methods vs. bert for
text sentiment analysis, in: International Conference on Analysis of Images, Social Networks and
Texts, Springer, 2021, pp. 71–83.
[17] R. A. Potamias, G. Siolas, A.-G. Stafylopatis, A transformer-based approach to irony and sarcasm
detection, Neural Computing and Applications 32 (2020) 17309–17320.
[18] G. Marvin, N. Hellen, D. Jjingo, J. Nakatumba-Nabende, Prompt engineering in large language
models, in: International conference on data intelligence and cognitive informatics, Springer, 2023,
pp. 387–402.
[19] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, A. Chadha, A systematic survey of prompt
engineering in large language models: Techniques and applications, arXiv preprint arXiv:2402.07927
(2024).
[20] J. Zhuo, S. Zhang, X. Fang, H. Duan, D. Lin, K. Chen, Prosa: Assessing and understanding the
prompt sensitivity of llms, 2024. URL: https://arxiv.org/abs/2410.12405. arXiv:2410.12405.
[21] F. Errica, G. Siracusano, D. Sanvito, R. Bifulco, What did i do wrong? quantifying llms’
sensitivity and consistency to prompt engineering, 2025. URL: https://arxiv.org/abs/2406.12334.
arXiv:2406.12334.
[22] Y. Wang, Q. Yao, J. T. Kwok, L. M. Ni, Generalizing from a few examples: A survey on few-shot
learning, ACM computing surveys (csur) 53 (2020) 1–34.
[23] Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, I. Mordatch, Improving factuality and reasoning in
language models through multiagent debate, in: Forty-first International Conference on Machine
Learning, 2024. URL: https://openreview.net/forum?id=zj7YuTE4t8.
[24] T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, Z. Tu, Encouraging divergent
thinking in large language models through multi-agent debate, in: Y. Al-Onaizan, M. Bansal,
Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 17889–17904.</p>
      <p>URL: https://aclanthology.org/2024.emnlp-main.992/. doi:10.18653/v1/2024.emnlp- main.992.
[25] N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, S. Yao, Reflexion: Language agents with verbal
reinforcement learning, Advances in Neural Information Processing Systems 36 (2023) 8634–8652.
[26] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri,
S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, The CLEF-2025 CheckThat! lab: Subjectivity,
fact-checking, claim normalization, and retrieval, in: Advances in Information Retrieval, Springer
Nature Switzerland, Cham, 2025, pp. 467–478.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Prompts Used</title>
      <p>In this section, we report the prompts used for the classification of the sentences. We report both
the simple prompt we experimented with at first and the extended prompt that provided the best
performance. We also report the prompts used for the debating LLMs. In particular, we report the
prompts used for: (1) the LLM tasked with explaining why a sentence is objective, (2) the LLM tasked
with explaining why a sentence is subjective, (3) the LLM tasked with explaining why a sentence is not
objective, (4) the LLM tasked with explaining why a sentence is not subjective, and (5) the judge LLM.</p>
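      <p>As a minimal sketch of how these five prompts fit together (not the authors' exact implementation), the four explainer prompts can be queried in parallel roles and their outputs passed to the judge. The chat function below is a hypothetical stand-in for any chat-completion API and is stubbed so the snippet runs without a model:</p>

```python
# Sketch of the debate-then-judge pipeline described above.
# `chat` is a hypothetical stand-in for a chat-completion API call;
# it is stubbed here so the example runs without a model.

ROLES = {
    "pro_obj": "Explain why the sentence is objective.",
    "pro_subj": "Explain why the sentence is subjective.",
    "anti_obj": "Explain why the sentence is not objective.",
    "anti_subj": "Explain why the sentence is not subjective.",
}

def chat(system_prompt: str, user_message: str) -> str:
    # Stub: a real implementation would send both messages to an LLM.
    return f"[{system_prompt}] {user_message}"

def debate_classify(sentence: str, judge_prompt: str) -> str:
    # 1. Collect one opinion per debating role.
    opinions = [chat(instruction, sentence) for instruction in ROLES.values()]
    # 2. The judge sees the sentence plus all four opinions and must
    #    answer only with "objective" or "subjective".
    judge_input = sentence + "\n\nOpinions:\n" + "\n".join(opinions)
    verdict = chat(judge_prompt, judge_input)
    return "subjective" if "subjective" in verdict.lower() else "objective"
```

      <p>In a real run, each call would use the corresponding prompt from sections A.3 to A.7 as the system message.</p>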
      <sec id="sec-10-1">
        <title>A.1. Simple Prompt (English):</title>
        <p>You are a linguistic expert, able to detect whether a sentence is objective (OBJ) or subjective (SUBJ).
Answer only with OBJ or SUBJ.</p>
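        <p>For illustration only (the surrounding client code is not specified in the paper), a zero-shot call with this simple prompt could look as follows; the chat function is a hypothetical stand-in for any chat-completion client, stubbed here so the snippet runs:</p>

```python
# Hypothetical zero-shot classification with the simple prompt above.
# `chat` is a stub standing in for a real chat-completion call.

SIMPLE_PROMPT = (
    "You are a linguistic expert, able to detect whether a sentence is "
    "objective (OBJ) or subjective (SUBJ). Answer only with OBJ or SUBJ."
)

def chat(system_prompt: str, user_message: str) -> str:
    # Stub response; replace with a real LLM query.
    return "OBJ"

def classify(sentence: str) -> str:
    answer = chat(SIMPLE_PROMPT, sentence).strip().upper()
    # Guard against verbose model answers: keep only the label.
    return "SUBJ" if "SUBJ" in answer else "OBJ"

print(classify("The meeting took place on Tuesday."))  # prints "OBJ" with this stub
```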
      </sec>
      <sec id="sec-10-2">
        <title>A.2. Extended Prompt (English):</title>
        <p>You are a linguistic expert specializing in detecting whether a sentence is objective or subjective. Your
task is to classify sentences according to the following criteria:
– Intensifiers: Words or phrases that amplify a statement (e.g., ‘so damaged’) can indicate
subjectivity, as they may reflect the author’s personal perspective.
– Speculations: Statements that imply uncertainty, predictions, or unverifiable claims should
be labeled as subjective. For example, phrases like ‘will hope to sow uncertainty’ suggest an
interpretation rather than a fact.</p>
        <p>Answer only with the words objective or subjective based on these criteria.</p>
        <p>Note: For other languages, this extended prompt was translated using DeepL to ensure semantic
accuracy and consistency.</p>
      </sec>
      <sec id="sec-10-3">
        <title>A.3. Subjectivity Explanation Prompt:</title>
        <p>You are a linguistic expert specializing in detecting whether a sentence is objective (OBJ) or subjective
(SUBJ). Your task is to classify sentences according to the following criteria:
• Objective: A sentence is objective if it presents factual information, even if the information is
debatable or controversial. Additionally:
– Emotions: Statements conveying emotions should be labeled as objective if they reflect the
author’s beliefs or sensations that cannot be fact-checked or rephrased in a more neutral
form.
– Quotes: If a sentence contains a direct quote, label it as objective, since the task concerns
only the subjectivity of the article’s author, not the quoted speaker. I repeat: SENTENCES
WHICH ONLY CONTAIN REPORTED SPEECH SHOULD NEVER BE LABELED
SUBJECTIVE.
• Subjective: A sentence is subjective if it reflects personal opinions, interpretations, or evaluations.</p>
        <p>Indicators of subjectivity include:
– Intensifiers: Words or phrases that amplify a statement (e.g., “so damaged”) can indicate
subjectivity, as they may reflect the author’s personal perspective.
– Speculations: Statements that imply uncertainty, predictions, or unverifiable claims should
be labeled as subjective. For example, phrases like “will hope to sow uncertainty” suggest
an interpretation rather than a fact.</p>
        <p>Given the following sentence, explain why it is classified as subjective based on these criteria. Try
to be concise and explain why it is classified as such. Do not repeat the sentence in your answer.
Keep to the annotation guidelines given above. Maintain a critical mindset, you can disagree with the
classification, but do so only if you are certain. Do not speculate about the sentence’s intention.</p>
      </sec>
      <sec id="sec-10-4">
        <title>A.4. Objectivity Explanation Prompt:</title>
        <p>You are a linguistic expert specializing in detecting whether a sentence is objective (OBJ) or subjective
(SUBJ). Your task is to classify sentences according to the following criteria:</p>
        <p>Given the following sentence, explain why it is classified as objective based on these criteria. Try
to be concise and explain why it is classified as such. Do not repeat the sentence in your answer.
Keep to the annotation guidelines given above. Maintain a critical mindset, you can disagree with the
classification, but do so only if you are certain. Do not speculate about the sentence’s intention.</p>
      </sec>
      <sec id="sec-10-5">
        <title>A.5. Non-Subjectivity Explanation Prompt:</title>
        <p>You are a linguistic expert specializing in detecting whether a sentence is objective (OBJ) or subjective
(SUBJ). Your task is to classify sentences according to the following criteria:
• Objective: A sentence is objective if it presents factual information, even if the information is
debatable or controversial. Additionally:
– Emotions: Statements conveying emotions should be labeled as objective if they reflect the
author’s beliefs or sensations that cannot be fact-checked or rephrased in a more neutral
form.
– Quotes: If a sentence contains a direct quote, label it as objective, since the task concerns
only the subjectivity of the article’s author, not the quoted speaker. SENTENCES WHICH
ONLY CONTAIN REPORTED SPEECH SHOULD NEVER BE LABELED SUBJECTIVE.
• Subjective: A sentence is subjective if it reflects personal opinions, interpretations, or evaluations.</p>
        <p>Indicators of subjectivity include:
– Intensifiers: Words or phrases that amplify a statement (e.g., “so damaged”) can indicate
subjectivity, as they may reflect the author’s personal perspective.
– Speculations: Statements that imply uncertainty, predictions, or unverifiable claims should
be labeled as subjective. For example, phrases like “will hope to sow uncertainty” suggest
an interpretation rather than a fact.</p>
        <p>Given the following sentence, explain why it should not be classified as subjective based on these
criteria. Try to be concise and explain why it does not fit the criteria for subjectivity. Do not repeat the
sentence in your answer. Focus only on why it fails to meet the conditions for subjectivity.</p>
      </sec>
      <sec id="sec-10-6">
        <title>A.6. Non-Objectivity Explanation Prompt:</title>
        <p>You are a linguistic expert specializing in detecting whether a sentence is objective (OBJ) or subjective
(SUBJ). Your task is to classify sentences according to the following criteria:</p>
        <p>Given the following sentence, explain why it should not be classified as objective based on these
criteria. Try to be concise and explain why it does not fit the criteria for objectivity. Do not repeat the
sentence in your answer. Focus only on why it fails to meet the conditions for objectivity.</p>
      </sec>
      <sec id="sec-10-7">
        <title>A.7. Judge Prompt:</title>
        <p>You are a judge LLM tasked with determining whether a sentence is objective (OBJ) or subjective (SUBJ)
based on opinions defending different points of view. Your job is to evaluate these opinions according
to the following criteria:
• Objective (OBJ): A sentence is objective if it presents factual information, even if the information
is debatable or controversial. Additionally:
– Emotions: Statements conveying emotions should be labeled as objective if they reflect the
author’s beliefs or sensations that cannot be fact-checked or rephrased in a more neutral
form.
– Quotes: If a sentence contains a direct quote, label it as objective, since the task concerns
only the subjectivity of the article’s author, not the quoted speaker.
• Subjective (SUBJ): A sentence is subjective if it reflects personal opinions, interpretations, or
evaluations. Indicators of subjectivity include:
– Intensifiers: Words or phrases that amplify a statement (e.g., so damaged) can indicate
subjectivity, as they may reflect the author’s personal perspective.
– Speculations: Statements that imply uncertainty, predictions, or unverifiable claims should
be labeled as subjective. For example, phrases like will hope to sow uncertainty suggest an
interpretation rather than a fact.
• Edge Cases:
– Emotions: Although statements carrying emotions convey a subjective point of view,
they cannot be verified or confuted by a fact-checking system and are therefore labeled as
objective.
– Quotes: When authors use quotes to support their thesis, the quoted content may be
subjective, but for classification, it should be labeled as objective, focusing only on the
article’s author.
– Intensifiers: The presence of intensifiers can indicate subjectivity, but it’s important to
assess whether they genuinely reflect the author’s perspective or serve a descriptive purpose.
– Speculations: Speculative statements should be regarded as subjective, as they often reflect
the author’s interpretation and not just factual content.</p>
        <p>Given the sentence and the opinions, your task is to make a final decision and answer only with
objective or subjective.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Antici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barron</surname>
          </string-name>
          , et al.,
          <article-title>On the definition of prescriptive annotation guidelines for language-agnostic subjectivity detection</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3370</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>111</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Hatzivassiloglou</surname>
          </string-name>
          ,
          <article-title>Towards answering opinion questions: Separating facts from opinions and identifying the polarity of opinion sentences</article-title>
          ,
          <source>in: Proceedings of the 2003 conference on Empirical methods in natural language processing</source>
          ,
          <year>2003</year>
          , pp.
          <fpage>129</fpage>
          -
          <lpage>136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Abimbola</surname>
          </string-name>
          , E. de La Cal Marin,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <article-title>Enhancing legal sentiment analysis: A convolutional neural network-long short-term memory document-level model</article-title>
          ,
          <source>Machine Learning and Knowledge Extraction</source>
          <volume>6</volume>
          (
          <year>2024</year>
          )
          <fpage>877</fpage>
          -
          <lpage>897</lpage>
          . URL: https://www.mdpi.com/2504-4990/6/2/41. doi:10.3390/make6020041.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>B.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <source>Sentiment analysis and opinion mining</source>
          , Springer Nature,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Antici</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Köhler</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the clef-2023 checkthat! lab: Task 2 on subjectivity in news articles</article-title>
          ,
          <source>in: 24th Working Notes of the Conference and Labs of the Evaluation Forum, CLEF-WN 2023, CEUR Workshop Proceedings (CEUR-WS. org)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>236</fpage>
          -
          <lpage>249</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dimitrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Galassi</surname>
          </string-name>
          , G. Pachov,
          <string-name>
            <surname>I. Koychev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Siegel</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the clef-2024 checkthat! lab task 2 on subjectivity in news articles</article-title>
          ,
          <source>in: CEUR Workshop Proceedings</source>
          , volume
          <volume>3740</volume>
          ,
          <string-name>
            <surname>CEUR-WS</surname>
          </string-name>
          ,
          <year>2024</year>
          , pp.
          <fpage>287</fpage>
          -
          <lpage>298</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A. I.</given-names>
            <surname>Paran</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. H.</given-names>
            <surname>Shohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hossain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ahsan</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Hoque</surname>
          </string-name>
          ,
          <article-title>SemanticCUETSync at CheckThat! 2024: Finding subjectivity in news articles using Llama</article-title>
          , in: Faggioli et al. [22] (
          <year>2024</year>
          ). The CLEF-2025 CheckThat! lab: Subjectivity,
          <article-title>fact-checking, claim normalization, and retrieval</article-title>
          , in: C.
          <string-name>
            <surname>Hauff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Macdonald</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Jannach</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          <string-name>
            <surname>Kazai</surname>
            ,
            <given-names>F. M.</given-names>
          </string-name>
          <string-name>
            <surname>Nardini</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Pinelli</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Silvestri</surname>
          </string-name>
          , N. Tonellotto (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barron-Cedeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kyuchukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2018 checkthat! lab on automatic identification and verification of political claims. task 1: Check-worthiness</article-title>
          , arXiv preprint arXiv:1808.05542 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , G. Karadzhov,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mohtarami</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2019 checkthat! lab: Automatic identification and verification of claims. task 1: Check-worthiness</article-title>
          .,
          <source>CLEF (Working Notes)</source>
          <volume>2380</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          , G. Da San Martino, et al.,
          <article-title>Overview of checkthat! 2020 english: Automatic identification and verification of claims in social media</article-title>
          .,
          <source>CLEF (working notes)</source>
          <volume>2696</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamdan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z. S.</given-names>
            <surname>Ali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          , G. Da San Martino, et al.,
          <article-title>Overview of the clef-2021 checkthat! lab task 1 on check-worthiness estimation in tweets and political debates</article-title>
          .,
          <source>in: CLEF (working notes)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>369</fpage>
          -
          <lpage>392</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>She</surname>
          </string-name>
          , Y. Zhang,
          <article-title>kNN prompting: Beyond-context learning with calibration-free nearest neighbor inference</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.13824. arXiv:2303.13824.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>S.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lyu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Artetxe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>Rethinking the role of demonstrations: What makes in-context learning work?</article-title>
          , in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>11048</fpage>
          -
          <lpage>11064</lpage>
          . URL: https://aclanthology.org/2022.emnlp-main.759/. doi:10.18653/v1/2022.emnlp-main.759.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>K.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Tarannum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. R. H.</given-names>
            <surname>Noori</surname>
          </string-name>
          ,
          <article-title>NN at CheckThat!-2023: Subjectivity in news articles classification with transformer based models</article-title>
          ,
          <source>in: CLEF (Working Notes)</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>318</fpage>
          -
          <lpage>328</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>