<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>ClaimIQ at CheckThat! 2025: Comparing Prompted and Fine-Tuned Language Models for Verifying Numerical Claims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Anirban Saha Anik</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Fahimul Kabir Chowdhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrew Wyckoff</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sagnik Ray Choudhury</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, University of North Texas</institution>
          ,
          <addr-line>Denton, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Data Science, University of North Texas</institution>
          ,
          <addr-line>Denton, TX</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents our system for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses on verifying numerical and temporal claims using retrieved evidence. We explore two complementary approaches: zero-shot prompting with instruction-tuned large language models (LLMs) and supervised fine-tuning using parameter-efficient LoRA. To enhance evidence quality, we investigate several selection strategies, including full-document input and top-<italic>k</italic> sentence filtering using BM25 and MiniLM. Our best-performing model, LLaMA fine-tuned with LoRA, achieves strong performance on the English validation set. However, a notable performance drop on the test set highlights a generalization challenge. These findings underscore the importance of evidence granularity and model adaptation for robust numerical fact verification.</p>
      </abstract>
      <kwd-group>
        <kwd>Fact-checking</kwd>
        <kwd>LLM</kwd>
        <kwd>Numerical Claim Verification</kwd>
        <kwd>Fine-Tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        As misinformation continues to spread across digital platforms, the ability to automatically verify factual
claims has become increasingly important [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among the most challenging forms of misinformation
are those involving numerical or temporal elements, claims that reference statistics, quantities, dates,
or trends [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These claims are often persuasive and deceptively simple, yet verifying them requires
not just factual knowledge but also precise reasoning over quantitative details.
      </p>
      <p>
        To support the verification of numerical misinformation, Viswanathan et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] proposed the
QuanTemp dataset. This benchmark targets real-world quantitative and temporal claims, including
multilingual evidence retrieved from fact-checking sources. It serves as the foundation for CLEF 2025 Task 3
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Task 3 of the CLEF 2025 CheckThat! Lab [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] focuses on verifying such claims by classifying them
as True, False, or Conflicting based on a small set of retrieved evidence. This task is especially
challenging because evidence is frequently noisy, partially relevant, or even contradictory, and claims
may rely on implicit or contextualized numerical reasoning.
      </p>
      <p>
        Recent advancements in large language models (LLMs) have shown promising capabilities in
understanding and generating human-like text [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7, 8</xref>
        ]. However, their effectiveness in structured fact
verification, especially when reasoning over multiple retrieved evidence passages, remains an open
research problem. Additionally, aligning LLM outputs with factual correctness while managing
computational efficiency is a key consideration.
      </p>
      <p>In this work, we explore two complementary strategies for numerical claim verification: zero-shot
prompting with instruction-tuned LLMs and supervised fine-tuning using parameter-efficient methods
(LoRA). We also experiment with various evidence selection techniques, including full-document input
and top-<italic>k</italic> sentence retrieval via BM25 and MiniLM, to assess the impact of evidence granularity on
model performance.</p>
      <p>Our approach aims to evaluate the balance between generalization and supervision, and to investigate
how LLMs can be adapted for precise, scalable, and reliable numerical fact-checking.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent years have seen growing interest in fact verification systems that integrate natural language
processing, information retrieval, and reasoning [9, 10]. A prominent line of work in this space is
retrieval-augmented generation (RAG), which combines document retrieval with large language models
(LLMs) to produce contextually grounded and factually accurate outputs [11, 12]. Yue et al. [13]
introduced RARG, a retrieval-augmented response generation framework that incorporates scientific literature to
generate polite, evidence-based counter-responses. Their use of reinforcement learning with
document-level supervision demonstrated the benefits of aligning generation with factual evidence. Expanding on
this, RAFTS [14] introduced a contrastive fact verification pipeline that generates both supporting and
refuting responses from retrieved passages. RAFTS emphasized interpretability and achieved strong
results using parameter-efficient models.</p>
      <p>Systems such as FactGenius [15] improve zero-shot prompt-based fact-checking abilities of LLMs
by integrating them with external knowledge bases (DBPedia) and similarity measures (fuzzy text
matching). ClaimMatch [16] leverages LLMs in both zero-shot and few-shot settings (e.g.,
GPT-3.5-turbo, Gemini, LLaMA) for claim matching (CM), utilizing natural language inference and paraphrase
detection. Tang et al. [17] developed MiniCheck, a sentence-level verifier that approaches GPT-4
performance using synthetic training data and smaller models. Their work shows that compact models
can perform competitively when fine-tuned appropriately.</p>
      <p>Several researchers have employed Full-Context Retrieval and Verification frameworks to perform
LLM-based claim extraction in conjunction with Retrieval-Augmented Generation (RAG). RAG enhances
the detection process by constructing a comprehensive context for fact-checking [18, 19, 20].</p>
      <p>Our approach builds on these insights by combining sentence-level retrieval (BM25 and MiniLM),
fine-tuned generation with LLaMA, and multilingual claim-evidence alignment. Unlike
decomposition-heavy pipelines, we show that strong performance can be achieved with simpler architectures and
focused supervision.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <sec id="sec-3-1">
        <title>3.1. Task Overview</title>
        <p>We participate in Task 3: Fact-Checking Numerical Claims as part of the CLEF 2025 CheckThat!
Lab [21]. This task aims to verify the factual correctness of claims that include numerical quantities
or temporal expressions. Such claims require not only linguistic understanding but also the ability to
interpret quantities, dates, and time-based facts in context.</p>
        <p>Participants are provided with a set of claims and corresponding evidence passages retrieved using
top-100 BM25 ranking. The goal is to classify each claim into one of three labels:</p>
        <list list-type="bullet">
          <list-item><p>True - the claim is fully supported by the evidence;</p></list-item>
          <list-item><p>False - the claim is clearly refuted by the evidence;</p></list-item>
          <list-item><p>Conflicting - the evidence is ambiguous, partially supportive, or contradictory.</p></list-item>
        </list>
        <p>The task challenges systems to handle ambiguous evidence, resolve conflicting numbers or dates, and
reason over concise or incomplete textual data. Participants are allowed to apply re-ranking, retrieval
filtering, and generation techniques to improve verification performance.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset Summary</title>
        <p>For Task 3, we use a dataset sourced from fact-checking reports gathered via the Google Fact Check
Explorer API. We filter claims to include only those with numerical or temporal expressions. Each claim
comes with a ranked set of evidence documents, retrieved using BM25 and claim decomposition.</p>
        <p>Though the dataset supports multiple languages, we limit our experiments to the English portion.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Problem Formulation</title>
        <p>The goal of this task is to automatically verify the factual correctness of claims that contain numerical or
temporal expressions. Each instance in the dataset consists of a claim <italic>c</italic> and a corresponding evidence
set <italic>E</italic> = {<italic>e</italic><sub>1</sub>, <italic>e</italic><sub>2</sub>, ..., <italic>e</italic><sub><italic>n</italic></sub>}, where each <italic>e</italic><sub><italic>i</italic></sub> is a sentence or a document retrieved from a fact-checking
corpus. The task is to classify the claim into one of three categories: True, False, or Conflicting.</p>
        <p>We treat this as a three-way classification problem, where the model learns a function
<italic>f</italic>(<italic>c</italic>, <italic>E</italic>) → <italic>y</italic> ∈ {True, False, Conflicting}.
Here, <italic>f</italic> can be instantiated as either a generative language model prompted in zero-shot fashion, or a
fine-tuned discriminative classifier.</p>
        <p>The evidence set <italic>E</italic> is varied across different experimental configurations. In some cases, <italic>E</italic> includes
the full document retrieved via BM25, while in others, it consists of a ranked subset of top-<italic>k</italic> relevant
sentences, or a summary generated by a large language model. This flexible formulation allows us to
investigate the effect of evidence selection on both prompted and fine-tuned approaches.</p>
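        <p>As a purely illustrative sketch of this formulation (the keyword-based stand-in below is hypothetical, not the paper's model), the verifier is a function from a claim and its evidence set to one of the three labels:</p>
        <preformat>
```python
# Sketch of the task formulation f(c, E) -> y in {True, False, Conflicting}.
# The labels and the (claim, evidence-set) interface follow the paper; the
# trivial keyword heuristic below only illustrates the signature.

LABELS = ("True", "False", "Conflicting")

def verify(claim, evidence_sentences):
    """Map a claim and its evidence set to one of the three labels.

    A real instantiation of f would be a prompted LLM or a fine-tuned
    classifier; this stand-in just shows the input/output contract.
    """
    votes = []
    for sentence in evidence_sentences:
        if "not" in sentence.lower():
            votes.append("False")
        else:
            votes.append("True")
    if len(set(votes)) > 1:
        return "Conflicting"  # mixed signals across the evidence set
    return votes[0]

verdict = verify("GDP grew 5% in 2020",
                 ["GDP did not grow in 2020.", "GDP rose sharply."])
```
        </preformat>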
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompting with LLaMA</title>
        <p>We employ LLaMA [22] to perform zero-shot claim verification using a prompting-based approach. In
this setup, we construct an instruction-style prompt that includes the task definition, the numerical
claim, and the selected evidence (either full document, top-<italic>k</italic> sentences, or a generated summary). The
model is then asked to classify the claim into one of the three predefined categories: True, False, or
Conflicting.</p>
        <p>The prompt is designed to guide the model toward generating a concise classification rather than an
open-ended explanation. A typical example of the input prompt is as follows:</p>
        <sec id="sec-4-2-1">
          <title>Fact-Checking Prompt</title>
          <p>You are a helpful and concise fact-checking assistant. Given a claim
and supporting evidence, your task is to determine the truthfulness of
the claim.</p>
          <p>Respond strictly with one of the following labels: True, False, or
Conflicting.</p>
          <p>Claim: [CLAIM]
Evidence: [EVIDENCE]
Based on the evidence, what is the correct classification?</p>
          <p>LLaMA’s output is processed with simple regex patterns to extract the first valid label found. We
also clean ambiguous responses such as ‘partially true’ or ‘half false’ by mapping them to the nearest
predefined label (typically Conflicting).</p>
          <p>With prompted inference (no gradient updates), we efficiently test different evidence setups. This
lets us evaluate how well the model generalizes for fact-checking without task-specific fine-tuning.</p>
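          <p>The prompt construction and label post-processing described above can be sketched as follows; this is a minimal stdlib illustration, assuming the raw model generation is available as a string:</p>
          <preformat>
```python
import re

# Sketch of the zero-shot pipeline: fill the instruction template, then
# pull the first valid label out of the raw generation with a regex,
# mapping hedged wordings (e.g. 'partially true') to Conflicting.

PROMPT = (
    "You are a helpful and concise fact-checking assistant. Given a claim "
    "and supporting evidence, your task is to determine the truthfulness "
    "of the claim.\n"
    "Respond strictly with one of the following labels: True, False, or "
    "Conflicting.\n"
    "Claim: {claim}\nEvidence: {evidence}\n"
    "Based on the evidence, what is the correct classification?"
)

LABEL_RE = re.compile(r"\b(True|False|Conflicting)\b", re.IGNORECASE)

def extract_label(generation):
    """Return the first valid label, defaulting hedged answers to Conflicting."""
    text = generation.strip()
    # Ambiguous phrasings such as 'partially true' or 'half false' are
    # mapped to the nearest predefined label (typically Conflicting).
    if re.search(r"\b(partially|half|somewhat)\b", text, re.IGNORECASE):
        return "Conflicting"
    match = LABEL_RE.search(text)
    return match.group(1).capitalize() if match else "Conflicting"

prompt = PROMPT.format(claim="[CLAIM]", evidence="[EVIDENCE]")
```
          </preformat>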
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evidence Selection Strategies</title>
        <p>Each claim in the dataset is accompanied by up to 100 retrieved evidence documents, obtained using the
BM25 ranking algorithm. However, these documents often contain irrelevant or redundant information,
which can negatively impact model performance, particularly for length-sensitive models or those
affected by context dilution. To address this, we evaluate several evidence selection strategies to enhance
the signal-to-noise ratio of the input.
        <p>Full Document. In the baseline approach, we use the complete top-ranked BM25-retrieved document
without filtering. While this preserves full context, it frequently includes off-topic or low-relevance
content.</p>
        <p>Top-3 BM25 Sentences. We apply BM25 [23] at the sentence level, treating the claim as a query to
select the three highest-scoring sentences from top documents. This efficient method favors lexical
matches but may miss semantically relevant content.</p>
        <p>Top-3 MiniLM Sentences. For improved semantic matching, we embed both claims and sentences
using all-MiniLM-L6-v2<sup>1</sup>, then select the three sentences with highest cosine similarity to the claim.
This approach captures meaning beyond surface-level lexical overlap.</p>
        <p>Each of these evidence types is paired with both prompting and fine-tuned models to study the effect
of evidence quality on downstream fact-checking performance.</p>
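        <p>The sentence-level filtering step can be illustrated with a pure-Python BM25 scorer (the paper uses library BM25 and MiniLM embeddings; this stdlib sketch, with standard k1/b defaults assumed, covers only the lexical variant):</p>
        <preformat>
```python
import math

# Score each candidate evidence sentence against the claim with BM25
# and keep the top k, mirroring the Top-3 BM25 strategy above.

def bm25_top_sentences(claim, sentences, k=3, k1=1.5, b=0.75):
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    query = claim.lower().split()

    def idf(term):
        df = sum(1 for doc in tokenized if term in doc)
        return math.log((n - df + 0.5) / (df + 0.5) + 1.0)

    def score(doc):
        total = 0.0
        for term in set(query):
            tf = doc.count(term)
            denom = tf + k1 * (1.0 - b + b * len(doc) / avgdl)
            total += idf(term) * tf * (k1 + 1.0) / denom
        return total

    ranked = sorted(sentences, key=lambda s: score(s.lower().split()),
                    reverse=True)
    return ranked[:k]

top3 = bm25_top_sentences(
    "inflation rose 7 percent in 2022",
    ["Inflation rose 7 percent in 2022, data shows.",
     "The weather was mild.",
     "Unemployment fell slightly.",
     "Prices climbed about 7 percent during 2022."])
```
        </preformat>
        <p>The MiniLM variant replaces the lexical score with cosine similarity between sentence embeddings, but the top-<italic>k</italic> selection logic is the same.</p>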
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Model Architectures</title>
        <p>We evaluate three model variants for numerical claim verification: (1) a zero-shot prompted LLM, (2)
a fine-tuned RoBERTa classifier, and (3) a parameter-efficient fine-tuned LLaMA (using LoRA). Each
model takes a claim and selected evidence as input, outputting one of {True, False, Conflicting}.</p>
        <p>Prompted LLaMA (Zero-Shot). Using LLaMA in zero-shot mode, we provide a natural language
prompt containing the claim and evidence, instructing the model to return a single label. The prompt
defines the task and response format. No model updates occur during training; we extract predictions
through simple post-processing of the generated output.</p>
        <p>Fine-Tuned RoBERTa. We fine-tune roberta-base [24] via supervised learning. The
concatenated claim-evidence pair serves as input, with the model outputting label probabilities. Trained for
three epochs on stratified data using cross-entropy loss, this provides a strong discriminative baseline.</p>
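        <p>The discriminative objective can be made concrete with a small stdlib sketch: softmax over the three label logits followed by cross-entropy against the gold label. The logits below are made-up numbers for illustration, not actual model outputs.</p>
        <preformat>
```python
import math

# Illustrative sketch of the classifier's training objective:
# softmax over three label logits, then cross-entropy loss.

LABELS = ["True", "False", "Conflicting"]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, gold_label):
    probs = softmax(logits)
    return -math.log(probs[LABELS.index(gold_label)])

loss = cross_entropy([2.0, 0.5, -1.0], "True")  # confident correct: low loss
```
        </preformat>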
        <sec id="sec-4-4-1">
          <title>1https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2</title>
          <p>Fine-Tuned LLaMA with LoRA. Using Low-Rank Adaptation (LoRA) [25], we fine-tune
LLaMA-3.1-8B with prompt-response pairs (claim+evidence as prompt, label as response). LoRA is applied to the
query, key, value, and output projections (<italic>r</italic> = 8, <italic>α</italic> = 16, dropout = 0.05). The Hugging Face Trainer
implements 3-epoch fine-tuning with mixed precision and gradient checkpointing, balancing task
alignment with computational efficiency.</p>
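          <p>A configuration of this kind can be sketched with the Hugging Face <monospace>peft</monospace> package; the hyperparameters mirror those stated above, while the exact module names are the LLaMA attention projections as exposed by Transformers (a sketch, not the authors' exact script):</p>
          <preformat>
```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# LoRA setup matching the text: r=8, alpha=16, dropout=0.05 on the
# query/key/value/output projection layers of a causal LM.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = get_peft_model(base, lora_config)  # only the adapter weights train
```
          </preformat>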
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evaluation Metrics</title>
        <p>We follow the official evaluation protocol defined by the CLEF 2025 CheckThat! Lab for Task 3. The
primary evaluation metric is the macro-averaged F1 score across the three classification labels: True,
False, and Conflicting.</p>
        <p>In addition to macro-F1, we report class-wise F1 scores to better understand model behavior across
different types of claims. This is particularly important given the inherent class imbalance in the dataset
and the difficulty of predicting Conflicting cases.</p>
        <p>All results are computed on the official English validation and test splits using a consistent
preprocessing and evaluation pipeline.</p>
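        <p>The metric itself is straightforward to state in code: per-class F1, then the unweighted average over the three labels (a stdlib sketch with toy predictions):</p>
        <preformat>
```python
# Macro-averaged F1: per-class F1, then the unweighted mean over
# True / False / Conflicting, as used by the shared task.

LABELS = ["True", "False", "Conflicting"]

def f1_per_class(gold, pred, label):
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(gold, pred):
    return sum(f1_per_class(gold, pred, l) for l in LABELS) / len(LABELS)

gold = ["True", "False", "Conflicting", "False"]
pred = ["True", "False", "False", "False"]
score = macro_f1(gold, pred)  # Conflicting missed entirely drags it down
```
        </preformat>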
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>We conduct experiments on the English subset of the CLEF 2025 Task 3 dataset, which contains 15,514
claims annotated with one of three labels: True, False, or Conflicting. Each claim is associated
with a list of up to 100 evidence documents retrieved using BM25 over a pooled web corpus.</p>
        <p>For supervised learning, we split the dataset into 90% training and 10% validation sets using stratified
sampling to preserve label distribution. All evidence selection methods (full document, top-3 BM25, and
top-3 MiniLM) are applied to both training and validation sets to evaluate their downstream impact.</p>
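        <p>The stratified 90/10 split can be sketched in a few lines of stdlib Python: group claims by label, shuffle within each group, and send 10% of every label to validation so the label distribution is preserved (a sketch; field names are illustrative):</p>
        <preformat>
```python
import random
from collections import defaultdict

def stratified_split(examples, val_fraction=0.1, seed=42):
    """90/10 split that preserves the per-label proportions."""
    by_label = defaultdict(list)
    for ex in examples:
        by_label[ex["label"]].append(ex)
    rng = random.Random(seed)
    train, val = [], []
    for label, group in by_label.items():
        rng.shuffle(group)
        cut = int(len(group) * val_fraction)
        val.extend(group[:cut])
        train.extend(group[cut:])
    return train, val

data = [{"claim": f"c{i}", "label": lab}
        for i, lab in enumerate(["True", "False", "Conflicting"] * 10)]
train, val = stratified_split(data)  # 10% of each label goes to validation
```
        </preformat>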
        <p>We evaluate model performance using the macro-averaged F1 score, which is the official metric for
the shared task. Additionally, we report class-wise F1 scores to better understand how models handle
imbalanced or ambiguous labels, especially the Conflicting class. For qualitative analysis, we also
examine confusion matrices and sample errors.</p>
        <p>To ensure comparability, all models are evaluated using the same preprocessing pipeline and evidence
configuration across prompting, fine-tuning, and hybrid setups.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Training and Inference Setup</title>
        <p>We implement all models using the Hugging Face Transformers, PEFT, and SentenceTransformers
libraries. Experiments are conducted on a high-performance server equipped with dual Intel(R) Xeon(R)
Gold 6226R CPUs (64 threads), 125GB of RAM, and three NVIDIA Quadro RTX 8000 GPUs, each with
48GB of memory. Training jobs are executed using PyTorch with CUDA 12.6, and GPU utilization is
managed dynamically based on availability.</p>
        <p>Prompted LLaMA (Zero-Shot). We use the meta-llama/Llama-3.1-8B-Instruct<sup>2</sup> model
without fine-tuning for zero-shot generation. The model is prompted using an instruction-style format
that defines the task and presents the claim and evidence. We use nucleus sampling with temperature
0.3, top-<italic>p</italic> of 0.9, and a maximum of 30 new tokens. Model outputs are post-processed using regular
expressions to extract the first valid verdict label. Ambiguous generations (e.g., “somewhat true”) are
mapped to the closest predefined class, typically Conflicting.</p>
        <p>RoBERTa Fine-Tuning. We fine-tune the roberta-base<sup>3</sup> model using cross-entropy loss over the
three output labels. Claims and evidence are tokenized as a sequence pair and truncated to a maximum
length of 512 tokens. We use the AdamW optimizer with a learning rate of 2 × 10<sup>−5</sup>, a batch size of 8,</p>
        <sec id="sec-5-2-1">
          <title>2https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct 3https://huggingface.co/FacebookAI/roberta-base</title>
          <p>and train for 3 epochs with early stopping based on macro-F1 score on the validation set. The model is
evaluated using softmax-based prediction.</p>
          <p>LLaMA Fine-Tuning (LoRA). We fine-tune the same LLaMA-3.1-8B-Instruct model using
Low-Rank Adaptation (LoRA). LoRA is applied to the query, key, value, and output projection layers
with a rank of <italic>r</italic> = 8, a scaling factor of <italic>α</italic> = 16, and a dropout rate of 0.05. Each training instance is
formatted as a prompt-response pair, where the response corresponds to a single label. We use a batch
size of 2 with gradient accumulation over 4 steps. Training is performed in mixed precision (FP16), and
gradient checkpointing is enabled to reduce memory usage. The model is trained for 3 epochs using the
Hugging Face Trainer API<sup>4</sup>.</p>
          <p>All experimental runs are tracked using Weights &amp; Biases<sup>5</sup> for reproducibility, and each configuration
is evaluated using identical preprocessing and scoring scripts.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Validation Results</title>
        <p>Table 2 presents the F1 scores of various model configurations on the English validation set. We evaluate
performance across three model types: prompted LLaMA, fine-tuned RoBERTa, and fine-tuned LLaMA
with LoRA under different evidence selection strategies.</p>
        <p>Among the prompted models, LLaMA achieves its best performance using full-document input,
reaching a macro-F1 of 0.609. However, it struggles significantly with the Conflicting class, indicating
limitations in handling ambiguous evidence without task-specific fine-tuning.</p>
        <p>Fine-tuned models consistently outperform prompted ones. RoBERTa performs well across both
BM25 and MiniLM sentence-level evidence, with the best Conflicting class F1 (0.510) achieved using
MiniLM. This suggests that sentence-level semantic filtering benefits models lacking strong pretraining
on numerical reasoning.</p>
        <p>The best overall performance is achieved by the fine-tuned LLaMA with LoRA using full-document
evidence. It reaches a macro-F1 of 0.945 and shows balanced performance across all three classes.
Sentence-level evidence (e.g., Top-3 MiniLM) also provides strong results, particularly improving
precision on harder examples while reducing irrelevant context.</p>
        <p>These results confirm that combining large language models with parameter-efficient tuning and
retrieval-aware evidence selection leads to substantial improvements in numerical claim verification.</p>
        <sec id="sec-6-1-1">
          <title>4https://huggingface.co/docs/transformers/en/main_classes/trainer 5https://wandb.ai/</title>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Test Set Performance</title>
        <p>Table 3 summarizes the F1 scores for all model configurations on the English test set. We evaluated
prompted LLaMA, fine-tuned RoBERTa, and fine-tuned LLaMA with LoRA, each paired with different
evidence selection strategies.</p>
        <p>Among the prompted models, LLaMA with full-document input achieved a macro-F1 of 0.40, while
Top-3 BM25 and Top-3 MiniLM sentence selection resulted in similar scores (0.41 and 0.40, respectively).
These results indicate that zero-shot prompting generalized better than fine-tuned RoBERTa, whose
macro-F1 dropped to 0.35 (Top-3 BM25) and 0.34 (Top-3 MiniLM).</p>
        <p>Fine-tuned LLaMA with LoRA achieved the highest macro-F1 on the test set (0.43) with both Top-3
BM25 and Top-3 MiniLM evidence. Notably, fine-tuning with full-document evidence, despite yielding
the best validation macro-F1, led to a macro-F1 of 0.42 on the test set, with a modest improvement on
the Conflicting class (F1: 0.32).</p>
        <p>Across all configurations, models consistently achieved higher F1 scores for False claims, while True
and Conflicting claims remained challenging. The Conflicting class in particular showed low F1 except
for the full-document fine-tuned LLaMA, suggesting that richer context helps resolve ambiguous or
contradictory evidence.</p>
      </sec>
      <sec id="sec-6-3">
        <title>6.3. Discussion</title>
        <p>Our results demonstrate that large language models, when fine-tuned with parameter-efficient
techniques and supported by retrieval-aware evidence selection, can achieve strong performance on
numerical claim verification. In particular, sentence-level evidence filtering using MiniLM embeddings helped
improve model precision for ambiguous cases, especially in the Conflicting class.</p>
        <p>However, as shown in Table 4, there remains a substantial performance gap between the validation
and test sets. While the model performed well on validation data, it struggled to maintain comparable
performance on the test set, particularly for the True and Conflicting categories. This suggests that
the model may have overfit to patterns in the validation data or faced difficulties adapting to shifts in
evidence structure and language style in the test set.</p>
        <p>Preliminary review of errors indicates that failures were often related to numerical reasoning
challenges, ambiguous or contradictory evidence, or missing key supporting facts. These patterns highlight
the complexity of verifying numerical claims in the presence of noisy or incomplete context.</p>
        <p>Overall, our findings underscore the importance of both model architecture and evidence quality
in developing robust fact verification systems. Future work should explore domain-adaptive training,
reasoning-aware approaches, and improved evidence selection techniques to enhance model
generalization in real-world scenarios.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this paper, we presented our approach for Task 3 of the CLEF 2025 CheckThat! Lab, which focuses
on verifying numerical claims using retrieved evidence. We explored both zero-shot prompting and
parameter-efficient fine-tuning of large language models, alongside multiple evidence selection strategies
including sentence-level filtering via BM25 and MiniLM.</p>
      <p>Our experiments showed that fine-tuning LLaMA with LoRA on full-document evidence achieved the
best performance on the validation set. Sentence-level filtering improved performance for ambiguous
claims, especially in the Conflicting class. However, the performance drop on the test set highlighted
challenges in generalization, likely due to domain shift and the nuanced nature of real-world evidence.</p>
      <p>Future work will focus on enhancing model robustness through domain-adaptive training, improved
retrieval filtering, and reasoning-aware modeling strategies. Our findings suggest that large language
models, when combined with structured evidence processing, are a promising foundation for building
scalable and accurate fact verification systems.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4o and Grammarly for grammar
and clarity revision. These tools were employed to refine sentence structure, correct typographical
errors, and improve overall language quality. No generative content was used for analysis, figures,
or experimental sections. After using these tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>References</title>
      <p>[8] Z. Wang, J. Valdez, D. Basu Mallick, R. G. Baraniuk, Towards human-like educational question generation with large language models, in: International Conference on Artificial Intelligence in Education, Springer, 2022, pp. 153-166.</p>
      <p>[9] E. Lazarski, M. Al-Khassaweneh, C. Howard, Using NLP for fact checking: A survey, Designs 5 (2021) 42.</p>
      <p>[10] L. Hong, X. Song, A. S. Anik, V. Frias-Martinez, Dynamic fusion of large language models for crisis communication, in: Proceedings of the International ISCRAM Conference, 2025.</p>
      <p>[11] Y. Huang, J. Huang, A survey on retrieval-augmented text generation for large language models, arXiv preprint arXiv:2404.10981 (2024).</p>
      <p>[12] A. S. Anik, X. Song, E. Wang, B. Wang, B. Yarimbas, L. Hong, Multi-agent retrieval-augmented framework for evidence-based counterspeech against health misinformation, arXiv preprint arXiv:2507.07307 (2025).</p>
      <p>[13] Z. Yue, H. Zeng, Y. Lu, L. Shang, Y. Zhang, D. Wang, Evidence-driven retrieval augmented response generation for online misinformation, arXiv preprint arXiv:2403.14952 (2024).</p>
      <p>[14] Z. Yue, H. Zeng, L. Shang, Y. Liu, Y. Zhang, D. Wang, Retrieval augmented fact verification by synthesizing contrastive arguments, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 10331-10343.</p>
      <p>[15] S. Gautam, R. Pop, FactGenius: Combining zero-shot prompting and fuzzy relation mining to improve fact verification with knowledge graphs, in: The Seventh Fact Extraction and VERification Workshop, 2024, p. 297.</p>
      <p>[16] D. Pisarevskaya, A. Zubiaga, Zero-shot and few-shot learning with instruction-following LLMs for claim matching in automated fact-checking, arXiv preprint arXiv:2501.10860 (2025).</p>
      <p>[17] L. Tang, P. Laban, G. Durrett, MiniCheck: Efficient fact-checking of LLMs on grounding documents, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 8818-8847.</p>
      <p>[18] Y. Bai, K. Fu, A large language model-based fake news detection framework with RAG fact-checking, in: 2024 IEEE International Conference on Big Data (BigData), IEEE, 2024, pp. 8617-8619.</p>
      <p>[19] P. Laban, A. R. Fabbri, C. Xiong, C.-S. Wu, Summary of a haystack: A challenge to long-context LLMs and RAG systems, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024, pp. 9885-9903.</p>
      <p>[20] D. Russo, S. Menini, J. Staiano, M. Guerini, Face the facts! Evaluating RAG-based fact-checking pipelines in realistic settings, arXiv preprint arXiv:2412.15189 (2024).</p>
      <p>[21] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri, S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, Overview of the CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval, in: J. Carrillo-de Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), 2025.</p>
      <p>[22] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</p>
      <p>[23] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333-389.</p>
      <p>[24] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).</p>
      <p>[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al., LoRA: Low-rank adaptation of large language models, ICLR 1 (2022) 3.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Á.</given-names>
            <surname>Figueira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <article-title>The current state of fake news: challenges and opportunities</article-title>
          ,
          <source>Procedia computer science 121</source>
          (
          <year>2017</year>
          )
          <fpage>817</fpage>
          -
          <lpage>825</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>P.</given-names>
            <surname>Meel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. K.</given-names>
            <surname>Vishwakarma</surname>
          </string-name>
          ,
          <article-title>Fake news, rumor, information pollution in social media and web: A contemporary survey of state-of-the-arts, challenges and opportunities</article-title>
          ,
          <source>Expert Systems with Applications</source>
          <volume>153</volume>
          (
          <year>2020</year>
          )
          <fpage>112986</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>Quantemp: A real-world open-domain benchmark for fact-checking numerical claims</article-title>
          ,
          <source>in: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          ,
          SIGIR 2024, Association for Computing Machinery (ACM),
          <year>2024</year>
          , pp.
          <fpage>650</fpage>
          -
          <lpage>660</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bendou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Iturra-Bocaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims</article-title>
          ,
          <source>in: [5]</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum</source>
          , CLEF 2025, Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Harnessing the power of large language models for empathetic response generation: Empirical investigations and improvements</article-title>
          ,
          <source>in: The 2023 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>W.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Iter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sanyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <article-title>Generate rather than retrieve: Large language models are strong context generators</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>