<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>TIFIN at CheckThat! 2025: X-VERIFY - Multi-lingual NLI-based Fact Checking with Condensed Evidence</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bharatdeep Hazarika</string-name>
          <email>bharatdeep@askmyfi.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Prasanna Devadiga</string-name>
          <email>prasanna@askmyfi.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pawan Kumar Rajpoot</string-name>
          <email>pawan@tifin.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aditya U Baliga</string-name>
          <email>aditya.baliga@iiitkottayam.ac.in</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kishan Gurumurthy</string-name>
          <email>Kishan.gurumurthy@workifi.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manish Jain</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Manan Sharma</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashish Shrivastava</string-name>
          <email>Ashish.shrivastava@workifi.com</email>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arya Suneesh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anshuman B Suresh</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>TIFIN - Technology</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Innovation in Finance</string-name>
        </contrib>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In the era of pervasive digital misinformation, automated fact-checking systems for numerical claims present unique challenges due to the complexity of quantitative reasoning and evidence synthesis. This paper presents our approach for Task 3 of the CLEF 2025 CheckThat! Lab, which requires verifying numerical claims as True, False, or Conflicting based on retrieved evidence documents. Our methodology combines optimized evidence selection, LLM-based summarization using IBM Granite 3.3 8B, and advanced classification with DeBERTa-large [1] fine-tuned using LoRA [2]. We address severe class imbalance through multilingual data augmentation, translating Arabic and Spanish datasets to English to strengthen underrepresented classes. Our comprehensive pipeline achieves a macro-averaged F1 score of 0.6858 on the English validation set, a substantial relative improvement over the provided RoBERTa [3] baseline. Extensive ablation studies demonstrate that multilingual augmentation provides the most substantial performance gains (macro F1 improved from the 0.5815 baseline to 0.6859, an absolute increase of 0.1044, approximately 17.95% relative), while evidence optimization and LLM-based summarization contribute consistent improvements across all veracity classes. These results highlight the effectiveness of cross-lingual data expansion for addressing class imbalance in specialized fact-verification tasks and establish new benchmarks for numerical claim verification. Our approach ranked 3rd place in both the English and Arabic tracks in the official evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>numerical fact verification</kwd>
        <kwd>evidence summarization</kwd>
        <kwd>multilingual data augmentation</kwd>
        <kwd>class imbalance mitigation</kwd>
        <kwd>DeBERTa classification</kwd>
        <kwd>cross-lingual transfer learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The proliferation of numerical misinformation in digital media presents a critical challenge for automated
fact-checking systems. Unlike subjective opinion verification, numerical claims require precise
quantitative reasoning and evidence synthesis to determine veracity. Claims involving statistics, financial
figures, temporal expressions, and comparative quantities are particularly susceptible to manipulation
and misinterpretation, making their automated verification both essential and technically demanding.</p>
      <p>
        The CLEF 2025 CheckThat! Lab Task 3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] addresses this challenge by focusing specifically on
fact-checking numerical claims across multiple languages. Participants must classify claims containing
explicit or implicit quantitative details as True, False, or Conflicting based on curated evidence documents
retrieved through BM25 ranking. This task represents a significant advancement over traditional
fact-checking approaches by emphasizing the unique complexities of numerical reasoning in multilingual
contexts.
      </p>
      <p>Numerical fact verification presents several distinct challenges that differentiate it from general claim
verification. First, numerical claims often require understanding of mathematical relationships, temporal
sequences, and quantitative comparisons that demand sophisticated reasoning capabilities. Second,
the evidence supporting or refuting numerical claims may be scattered across multiple documents,
requiring effective synthesis and summarization techniques. Third, numerical expressions can vary
significantly across languages and cultural contexts, complicating cross-lingual verification approaches.</p>
      <p>The motivation for this work stems from the critical role that numerical accuracy plays in public
discourse. Misleading statistics in political debates, inflated financial claims in business communications,
and distorted temporal relationships in historical reporting can significantly impact public understanding
and decision-making. Automated systems capable of rapidly and accurately verifying numerical claims
are essential for maintaining information integrity in our increasingly data-driven society.</p>
      <p>Our approach addresses these challenges through a multi-component pipeline that optimizes evidence
selection, employs LLM-based summarization for noise reduction, and leverages advanced transformer
architectures for robust classification. Recognizing the severe class imbalance inherent in fact-checking
datasets, where False claims typically dominate, we introduce a novel multilingual data augmentation
strategy that substantially improves minority class performance.</p>
      <p>The contributions of this work include: (1) systematic evidence optimization demonstrating that top-5
BM25 documents provide optimal performance, (2) an LLM-based summarization framework using IBM
Granite 3.3 8B that effectively reduces evidence noise while preserving critical information, (3) advanced
classification using DeBERTa-large fine-tuned through parameter-efficient LoRA, and (4) multilingual
data augmentation that addresses class imbalance through cross-lingual knowledge transfer.</p>
      <p>Our experimental results demonstrate the effectiveness of this comprehensive approach, achieving a
macro-averaged F1 score of 0.6858 on the English validation set, a 17.9% relative improvement over
the provided baseline. Extensive ablation studies reveal that multilingual augmentation contributes
the majority of performance gains, while each pipeline component provides consistent improvements
across all veracity classes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Fact Verification and Claim Verification</title>
        <p>
          The field of automated fact verification has evolved significantly from early rule-based approaches to
sophisticated neural architectures. Thorne et al. (2018) [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] introduced the FEVER dataset, establishing a
benchmark for fact verification that emphasized the importance of evidence retrieval and reasoning.
Subsequent work by Nie et al. (2019) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and Hanselowski et al. (2019) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] demonstrated the effectiveness
of transformer-based models for claim verification, setting the foundation for modern approaches.
        </p>
        <p>
          The specific challenges of numerical fact verification have been addressed by Venktesh et al. (2024)
[
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] in their QuanTemp benchmark, which focuses on temporal and quantitative claims. Their work
highlights the unique difficulties in verifying numerical content, including the need for mathematical
reasoning and temporal understanding. Chen et al. (2022) [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] further explored numerical reasoning in
fact verification, demonstrating that specialized architectures can improve performance on quantitative
claims.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Evidence Retrieval and Selection</title>
        <p>
          Evidence retrieval forms a critical component of fact verification systems. Nie et al. (2019) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]
demonstrated the importance of high-quality evidence selection, showing that performance is heavily
dependent on the relevance and coverage of retrieved documents. Recent work by Lewis et al. (2020) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]
introduced retrieval-augmented generation, which combines neural retrieval with language modeling
for improved evidence synthesis.
        </p>
        <p>
          The specific challenge of determining optimal evidence quantity has been explored by various
researchers. Wadden et al. (2020) [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] investigated the trade-offs between evidence quantity and quality,
while our work provides systematic analysis showing that 5 evidence documents represent the optimal
balance for numerical claims.
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Neural Summarization for Fact Verification</title>
        <p>
          The application of neural summarization to fact verification has gained attention as a method for
reducing evidence noise and improving model performance. Kryściński et al. (2020) [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] demonstrated
that summarization can enhance evidence quality in verification tasks. Our work extends this line
of research by specifically focusing on numerical claims and employing carefully designed prompt
templates to maintain factual accuracy while achieving conciseness.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Multilingual and Cross-lingual Fact Verification</title>
        <p>
          The challenge of fact verification across multiple languages has been addressed by several recent studies.
Popat et al. (2018) [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] introduced cross-lingual claim verification, while Augenstein et al. (2019) [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]
demonstrated the effectiveness of multilingual pre-trained models for fact-checking tasks. Recent
work by Nakov et al. (2021) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] in the CheckThat! lab has pushed the boundaries of multilingual fact
verification.
        </p>
        <p>The specific challenge of class imbalance in multilingual settings has received limited attention in
the literature. Our work addresses this gap by demonstrating that cross-lingual data augmentation can
efectively mitigate class imbalance while improving overall model robustness.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Task Description</title>
      <p>The CLEF 2025 Task 3 centers on the automatic verification of factual claims containing numerical
quantities and temporal expressions. These claims, which may include explicit or implicit quantitative
details, are sourced from real-world fact-checking scenarios through the Google Fact-check Explorer
API. The task challenges participants to classify each claim into one of three categories (True, False, or
Conflicting) based on a curated evidence set.</p>
      <p>Participants are provided with top-k evidence documents retrieved using the BM25 ranking algorithm
from a comprehensive, pooled evidence corpus. This corpus is constructed using multiple advanced
claim decomposition strategies to ensure a diverse and context-rich evidence base. The task supports
multilingual evaluation across English, Spanish, and Arabic.</p>
      <p>The dataset for the task comprises over 17,000 annotated claims:
• English: 9,935 training and 3,084 validation instances
• Spanish: 1,506 training and 377 validation instances
• Arabic: 2,191 training and 587 validation instances</p>
      <p>Each data instance includes metadata such as the original claim, its taxonomy (e.g., temporal, statistical,
comparison, interval), a label (veracity verdict), an oracle document summarizing the rationale behind
the label, and associated evidence texts either at the original or decomposed claim level.</p>
      <p>Performance is evaluated using macro-averaged F1 and class-wise F1 scores across the three labels.
The task provides official baseline models, which utilize a RoBERTa-large model fine-tuned on natural
language inference (NLI), along with inference scripts and scoring tools to benchmark participants’
approaches.</p>
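      <p>For concreteness, the snippet below shows how the macro-averaged and class-wise F1 scores used for
evaluation can be computed with scikit-learn; the toy gold and predicted verdicts are illustrative only,
and the official scoring tools wrap equivalent logic.</p>
      <preformat>
# Macro-averaged and class-wise F1 over the three veracity labels.
from sklearn.metrics import f1_score

labels = ["True", "False", "Conflicting"]
y_true = ["False", "True", "Conflicting", "False"]  # gold verdicts (toy data)
y_pred = ["False", "True", "False", "False"]        # system verdicts (toy data)

macro_f1 = f1_score(y_true, y_pred, labels=labels, average="macro")
class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
print(macro_f1, dict(zip(labels, class_f1)))
      </preformat>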
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Our approach to CLEF 2025 Task 3 (Fact-Checking Numerical Claims) explores two distinct methodological
paradigms before converging on an optimal solution. We systematically evaluated both large language
model (LLM) based approaches and specialized transformer architectures, ultimately developing a
multi-stage pipeline that combines the strengths of both paradigms through LLM-based summarization
and transformer-based classification.</p>
      <sec id="sec-4-1">
        <title>4.1. System Overview</title>
        <p>The architecture demonstrates our key innovation of multilingual data augmentation, where Arabic
and Spanish claims undergo translation before joining the English stream for evidence summarization
and subsequent model training.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Methodological Approach Comparison</title>
        <p>We conducted comprehensive experiments across two primary methodological paradigms for numerical
fact verification:</p>
        <p>Large Language Model Approach: We evaluated direct fact verification using LLMs ranging from
1.5B to 70B parameters, including few-shot learning, chain-of-thought reasoning, and supervised
fine-tuning. This approach leveraged the inherent reasoning capabilities and world knowledge of modern
LLMs without requiring domain-specific architectural modifications.</p>
        <p>Specialized Transformer Approach: We developed a pipeline combining optimized evidence
selection, LLM-based summarization, and fine-tuned transformer classification. This approach emphasized
domain adaptation, evidence optimization, and class imbalance mitigation through carefully designed
components.</p>
        <p>Our systematic evaluation revealed that while LLMs achieved reasonable performance (maximum
0.49 macro F1), the specialized transformer pipeline substantially outperformed direct LLM approaches
(0.6858 macro F1), leading us to adopt the hybrid methodology described in subsequent sections that
leverages LLMs for summarization while employing specialized transformers for classification.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evidence Selection and Optimization</title>
        <p>While the provided dataset includes top-3 BM25-retrieved evidence documents for each claim, we
extended our analysis to top-5 evidence documents based on systematic evaluation. Our empirical
analysis guided our decision to optimize at five documents per claim, as this configuration provided the
optimal balance between information coverage and computational efficiency.</p>
        <p>Notably, reranking approaches applied to the top-100 BM25 evidence pool did not yield performance
improvements over the raw BM25 ranking, suggesting that the initial retrieval effectively captured the
most relevant contextual information for numerical claim verification.</p>
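        <p>A minimal sketch of this selection step is shown below, assuming a pooled list of candidate documents
per claim; the rank_bm25 package and simple whitespace tokenization are assumptions, standing in for
the task's provided BM25 rankings.</p>
        <preformat>
# Score a claim's pooled evidence with BM25 and keep the top-5 documents.
from rank_bm25 import BM25Okapi

def top_k_evidence(claim, documents, k=5):
    tokenized = [doc.lower().split() for doc in documents]
    bm25 = BM25Okapi(tokenized)
    scores = bm25.get_scores(claim.lower().split())
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:k]]
        </preformat>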
      </sec>
      <sec id="sec-4-4">
        <title>4.4. LLM-based Evidence Summarization</title>
        <p>To address evidence redundancy and enhance semantic coherence, we implemented a summarization
pipeline using IBM Granite 3.3 8B, selected for its strong multilingual capabilities and
instruction-following performance. The summarization process concatenates each claim with its top-5 evidence
documents and generates a concise summary capped at 35 words.</p>
        <p>Figure 2 demonstrates our summarization approach with different word constraints, illustrating how
the 35-word limit preserves critical contextual information while maintaining conciseness.</p>
        <sec id="sec-4-4-1">
          <title>Our prompt design emphasizes neutrality and factual precision:</title>
          <preformat>
Evidence Summarization Prompt Template

system_prompt = """You are an expert evidence summarizer.

Your task is to create unbiased, concise summaries of evidence
related to claims.

INSTRUCTIONS:
1. Analyze the provided claim and evidence objectively.
2. Create a neutral summary in under 35 words that:
   - Captures only the most relevant evidence points
   - Presents key information without judgment
   - Uses precise, factual language
   - Omits unnecessary context or hedging
3. Focus solely on what the evidence actually contains,
   not assumptions.
4. When evidence presents conflicting information, include
   the most significant points from both sides.

Your response must be direct, factual, and under 35 words total."""
          </preformat>
          <p>The 35-word limit was determined through empirical evaluation: shorter summaries (&lt;=20 words)
suffered from significant context loss, while longer summaries introduced noise without performance
gains. We conducted a manual evaluation of 250 randomly sampled summaries (maintaining the
original class distribution) to validate summary quality and fine-tune the prompt template. While
our summarization approach appears domain-agnostic, we specifically designed the prompt template
to preserve numerical and temporal information critical for Task 3. The instruction to "capture only
the most relevant evidence points" implicitly prioritizes quantitative details, as these represent the
core factual claims being verified. We experimented with explicitly numerical-focused prompts (e.g.,
"focus on numbers, dates, and statistics") but found that such constraints often led to context loss when
numerical claims required qualitative supporting evidence for proper verification.</p>
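          <p>A hedged sketch of the summarization call is given below: the claim and its top-5 evidence documents
are concatenated under the system prompt above. The Hugging Face checkpoint id and chat message
format are assumptions; this illustrates the approach rather than the exact production code.</p>
          <preformat>
# Generate a 35-word-capped evidence summary with an instruction-tuned LLM.
from transformers import pipeline

generator = pipeline("text-generation", model="ibm-granite/granite-3.3-8b-instruct")

def summarize(claim, evidence_docs):
    user_msg = "Claim: " + claim + "\n\nEvidence:\n" + "\n".join(evidence_docs)
    messages = [
        {"role": "system", "content": system_prompt},  # template shown above
        {"role": "user", "content": user_msg},
    ]
    out = generator(messages, max_new_tokens=64, do_sample=False)
    return out[0]["generated_text"][-1]["content"].strip()
          </preformat>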
        </sec>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Advanced Classification with DeBERTa-MNLI</title>
        <p>For veracity classification, we employed DeBERTa-large pre-trained on MNLI. DeBERTa
(Decoding-enhanced BERT with disentangled attention) offers several architectural advantages over RoBERTa for
our task: (1) a disentangled attention mechanism that separately encodes content and position information,
improving handling of complex numerical relationships, (2) an enhanced mask decoder that better captures
token dependencies crucial for fact verification, and (3) improved training efficiency through virtual
adversarial training. MNLI (Multi-Genre Natural Language Inference) is a large-scale dataset for
training models to determine textual entailment relationships between premise and hypothesis pairs.
Pre-training on MNLI provides a strong foundation for fact verification tasks, as both require reasoning
about logical relationships between claims and evidence. The entailment, contradiction, and neutral
classifications in MNLI directly parallel our True, False, and Conflicting labels, making this pre-training
particularly relevant for numerical claim verification.</p>
        <p>The classifier processes the concatenation of the original claim and its LLM-based summary, predicting
one of three veracity labels: True, False, or Conflicting. We implemented the model using LoRA
(Low-Rank Adaptation) to enable efficient fine-tuning while maintaining model stability.</p>
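        <p>The following sketch shows one way to assemble this classifier with PEFT; the checkpoint id and LoRA
target modules are assumptions (DeBERTa v1 fuses the attention projections into in_proj, while v2/v3
checkpoints expose query_proj and value_proj), and the r/α/dropout values follow Section 4.8.</p>
        <preformat>
# DeBERTa-large (MNLI) with a LoRA adapter for 3-way veracity classification.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "microsoft/deberta-large-mnli"  # MNLI head: 3 classes, paralleling our labels
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=3)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.1,
                  target_modules=["in_proj"],  # DeBERTa v1 attention projection
                  task_type="SEQ_CLS")
model = get_peft_model(model, lora)

# Input format: the original claim paired with its LLM-based evidence summary.
inputs = tokenizer("claim text", "evidence summary",
                   truncation=True, return_tensors="pt")
logits = model(**inputs).logits  # scores for True / False / Conflicting
        </preformat>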
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Class Imbalance Mitigation</title>
        <p>The dataset exhibits severe class imbalance with approximately 58% False, 18% True, and 24% Conflicting
labels. We addressed this through multiple strategies (sketched in code after this list):
• Weighted Loss Functions: Applied inverse frequency weighting to penalize misclassification of
minority classes
• Focal Loss with Label Smoothing: Implemented focal loss (γ = 2.0) combined with label
smoothing (ε = 0.1) to focus learning on hard examples
• Weighted Random Sampling: Used oversampling during training to balance class exposure</p>
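        <p>A sketch of the loss and sampler pieces follows, assuming integer class ids and inverse-frequency
weights; the focal/smoothing formulation is a standard one, shown here to make the combination concrete.</p>
        <preformat>
# Focal loss (gamma=2.0) with label smoothing (epsilon=0.1) and class weights.
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

def focal_loss(logits, targets, class_weights, gamma=2.0, smoothing=0.1):
    num_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Smoothed one-hot target distribution.
    true_dist = torch.full_like(log_probs, smoothing / (num_classes - 1))
    true_dist.scatter_(1, targets.unsqueeze(1), 1.0 - smoothing)
    # Down-weight easy examples via the focal term (1 - p_t)^gamma.
    pt = log_probs.exp().gather(1, targets.unsqueeze(1)).squeeze(1)
    ce = -(true_dist * log_probs).sum(dim=-1)
    return (class_weights[targets] * (1.0 - pt) ** gamma * ce).mean()

# Weighted random sampling: oversample minority classes during training.
train_labels = torch.tensor([1, 1, 0, 2, 1])             # toy class ids
class_weights = 1.0 / torch.bincount(train_labels).float()
sampler = WeightedRandomSampler(class_weights[train_labels],
                                num_samples=len(train_labels))
        </preformat>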
      </sec>
      <sec id="sec-4-7">
        <title>4.7. Multilingual Data Augmentation</title>
        <p>To further address class imbalance and enhance model robustness, we incorporated Arabic and Spanish
datasets through LLM-based translation. Using IBM Granite 3.3 8B’s multilingual capabilities, we
translated non-English claims and evidence to English, effectively expanding our training corpus and
significantly reducing the True class underrepresentation.</p>
        <p>This multilingual augmentation strategy not only improved class balance but also enhanced the
model’s exposure to diverse linguistic patterns and cultural contexts in numerical claims, contributing
to improved generalization.</p>
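        <p>A hedged sketch of the translation step is shown below, reusing the generator pipeline from the
summarization sketch; the prompt wording is illustrative rather than the exact instruction used.</p>
        <preformat>
# Translate a non-English claim or evidence text to English, preserving numbers.
def translate_to_english(text, source_lang):
    messages = [
        {"role": "system",
         "content": "You are a precise translator. Translate the user's text from "
                    + source_lang + " to English, preserving all numbers, dates, "
                      "units, and named entities exactly."},
        {"role": "user", "content": text},
    ]
    out = generator(messages, max_new_tokens=512, do_sample=False)
    return out[0]["generated_text"][-1]["content"].strip()

# arabic_claims: list of claim strings from the Arabic split (placeholder name).
augmented = [translate_to_english(c, "Arabic") for c in arabic_claims]
        </preformat>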
      </sec>
      <sec id="sec-4-8">
        <title>4.8. Training Configuration</title>
        <p>We trained the model using the following optimized hyperparameters and infrastructure (a configuration
sketch follows this list):
Hyperparameters:
• Learning Rate: 5e-4 (adjusted for LoRA)
• Batch Size: 8 with gradient accumulation (effective batch size: 16)
• LoRA Configuration: r = 16, α = 32, dropout = 0.1
• Training Epochs: 8 with linear warmup (10% of total steps)
• Regularization: R-Drop (α = 4.0) for consistency regularization
Training Infrastructure:
• Hardware: NVIDIA RTX 4090 (24GB VRAM), NVIDIA A100 (40GB VRAM)
• Inference: NVIDIA H100 GPU with vLLM engine
• Framework: PyTorch 1.13+ with HuggingFace Transformers
• Optimization: Mixed precision (FP16) training
• Early Stopping: Validation F1 score with patience of 3 epochs</p>
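        <p>The sketch below mirrors these settings with the HuggingFace Trainer API; the metric wiring assumes
a compute_metrics function that reports "macro_f1", and the R-Drop consistency term would live in a
custom Trainer subclass, omitted here for brevity.</p>
        <preformat>
# Trainer configuration mirroring the hyperparameters listed above.
from transformers import TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="xverify-deberta-lora",
    learning_rate=5e-4,                  # adjusted for LoRA
    per_device_train_batch_size=8,
    gradient_accumulation_steps=2,       # effective batch size 16
    num_train_epochs=8,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                    # linear warmup over 10% of steps
    fp16=True,                           # mixed precision training
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="macro_f1",    # assumes a matching compute_metrics
)
early_stopping = EarlyStoppingCallback(early_stopping_patience=3)
        </preformat>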
      </sec>
      <sec id="sec-4-9">
        <title>4.9. Alternative LLM-based Experimental Pipeline</title>
        <p>Before settling on our hybrid transformer-based approach, we conducted extensive experiments with
large language models (LLMs) for direct fact verification. Our motivation was to leverage the reasoning
capabilities and world knowledge embedded in modern LLMs to directly classify numerical claims
without requiring fine-tuning on domain-specific data.</p>
        <p>
          We evaluated a comprehensive range of LLM architectures spanning from 1.5B to 70B
parameters, including Qwen2.5 7B, Qwen3 14B, Qwen3 8B, Mistral Small 24B, Llama3.3 70B, Llama 3.2 3B,
IBM Granite 3.3 8B, and DeepSeek-R1-Distill-Qwen-1.5B. For in-context learning (ICL), we employed
intfloat/multilingual-e5-large-instruct [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as both a base retrieval model and later fine-tuned it for
improved context selection.
        </p>
        <sec id="sec-4-9-1">
          <title>4.9.1. Experimental Configurations</title>
          <p>We systematically evaluated four distinct approaches with increasing complexity:</p>
        <p>Basic Few-Shot Inference: Using 3 carefully selected few-shot examples, we prompted models to
classify claims directly. This baseline approach achieved macro F1 scores ranging from 0.28 to 0.42
across all model sizes, with larger models not consistently outperforming smaller ones.</p>
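          <p>For illustration, the prompt assembly for this baseline looked roughly like the sketch below; the three
exemplars are placeholders, not the ones actually used.</p>
          <preformat>
# Build a 3-shot classification prompt for direct LLM verification.
FEW_SHOT = [
    ("The company tripled its revenue to $3B in 2020.", "False"),
    ("Unemployment fell by 2 percentage points last year.", "True"),
    ("Reports disagree on whether the city added 500 or 700 buses.", "Conflicting"),
]

def build_prompt(claim, evidence_summary):
    shots = "\n".join("Claim: " + c + "\nVerdict: " + v for c, v in FEW_SHOT)
    return (shots + "\n\nClaim: " + claim
            + "\nEvidence: " + evidence_summary
            + "\nVerdict (True, False, or Conflicting):")
          </preformat>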
        <p>Short Chain-of-Thought (CoT): We incorporated brief reasoning chains (20-30 words) before the
final prediction to encourage step-by-step analysis. This approach showed modest improvements, with
scores ranging from 0.36 to 0.49, representing a consistent but limited enhancement over basic inference.</p>
        <p>Extended Chain-of-Thought: Expecting that more detailed reasoning would improve performance,
we extended CoT explanations to 100-120 words. Surprisingly, this approach underperformed the
shorter CoT variant, achieving scores between 0.32 and 0.45, suggesting that overly verbose reasoning
may introduce noise or hallucinations.</p>
          <p>Supervised Fine-Tuning (SFT): Using the Unsloth [17] framework, we fine-tuned Qwen2.5 7B on our
training data with ICL-enhanced examples. Despite domain-specific adaptation, the fine-tuned model
peaked at 0.483 macro F1, still substantially below our transformer-based approach.</p>
        </sec>
        <sec id="sec-4-9-2">
          <title>4.9.2. Key Findings and Limitations</title>
          <p>Our extensive LLM experimentation revealed several critical insights:</p>
          <p>Model Bias in Geopolitical Claims: We observed systematic bias differences between model
families when evaluating claims related to Western countries. Llama models consistently exhibited
more positive assessments of Western-related numerical claims, while Qwen models showed notably
more skeptical evaluations. This bias manifested in Chain-of-Thought reasoning and significantly
impacted classification accuracy, highlighting the challenge of ensuring neutrality in fact verification
systems.</p>
          <p>Scale-Performance Paradox: Contrary to expectations, model size did not correlate with verification
performance. Our highest-performing LLM configuration achieved 0.49 macro F1 using Llama 3.2 3B,
substantially outperforming the 70B Llama 3.3 model (0.43 macro F1) under identical settings. This
suggests that architectural efficiency and training data alignment may be more critical than parameter
count for numerical reasoning tasks.</p>
          <p>Limited Numerical Reasoning: Despite their impressive general capabilities, LLMs struggled
with the precise quantitative analysis required for numerical fact verification. The models frequently
hallucinated numerical relationships or failed to accurately process mathematical comparisons within
the evidence documents.</p>
          <p>Context Length Limitations: We systematically tested context length variations by extending
evidence from 3 to 10 documents across all LLM models. Despite increased context windows,
performance either saturated or slowly degraded with additional evidence, contradicting expectations that
more information would improve accuracy. This aligns with the well-documented "lost in the middle"
phenomenon [18], where language models show degraded performance on information positioned in
the middle of long contexts while maintaining better recall for content at the beginning and end of the
input sequence.</p>
          <p>Our evaluation across the entire English training dataset revealed that all tested models exhibited
this limitation, with the effect being particularly pronounced for the Conflicting and True classes, where
nuanced reasoning across multiple evidence sources was crucial. Notably, reranking approaches using
top-ranked embedding models from the MTEB leaderboard did not mitigate this issue, suggesting that
the limitation stems from the models’ inherent context processing capabilities rather than evidence
ordering. This context utilization bottleneck represents a fundamental constraint for LLM-based fact
verification systems that rely on synthesizing information from multiple diverse evidence sources.</p>
        </sec>
        <sec id="sec-4-9-3">
          <title>4.9.3. Transition to Hybrid Approach</title>
          <p>Our initial motivation was to leverage state-of-the-art LLMs to achieve competitive performance in
numerical fact verification, expecting that larger, more capable models would provide superior reasoning
abilities for this complex task. However, our systematic evaluation revealed counterintuitive findings:
larger LLMs provided no performance boost over smaller variants, with some smaller models (3B
parameters) actually outperforming their 70B counterparts. This scale-performance paradox contradicted
expected scaling laws and suggested that raw model size was insufficient for numerical reasoning tasks.</p>
          <p>Given these unexpected limitations and the fact that we had always considered MNLI and
NLI-based approaches as viable alternatives, we pivoted to exploring specialized transformer architectures.
The natural language inference framework offered by MNLI pre-training directly aligned with fact
verification tasks, where entailment, contradiction, and neutral classifications parallel our True, False,
and Conflicting labels. Recognizing the complementary strengths of both paradigms, we developed
a hybrid approach that incorporates LLM capabilities for evidence summarization and multilingual
processing while employing domain-adapted transformer architectures for the core classification task.</p>
          <p>This methodology leverages the text generation and multilingual strengths of LLMs in preprocessing
components while addressing their numerical reasoning limitations through specialized transformers
with carefully designed training objectives for fact verification.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Ablation Studies</title>
      <p>We conducted comprehensive ablation experiments to validate each component of our methodology.
All experiments were performed on the English validation set unless otherwise specified.</p>
      <sec id="sec-5-1">
        <title>5.1. Evidence Length Optimization</title>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Impact of LLM-based Summarization</title>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Summary Length Optimization</title>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Model Architecture Comparison</title>
      </sec>
      <sec id="sec-5-5">
        <title>5.5. Multilingual Data Augmentation Impact</title>
        <p>Critical Finding: Multilingual data augmentation provides the most substantial performance
improvement (+9.79% macro F1), with particularly dramatic gains for True (+11.78%) and Conflicting
(+15.64%) classes, effectively addressing the severe class imbalance in the original English dataset.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <sec id="sec-6-1">
        <title>6.1. Final System Performance</title>
        <p>Our complete methodology achieved a macro-averaged F1 score of 0.6858 on the English validation
set, representing a 17.9% relative improvement over the provided RoBERTa baseline (0.5815). This
substantial improvement demonstrates the effectiveness of our multi-component approach to numerical
fact verification.</p>
        <p>The results clearly demonstrate that multilingual data augmentation provides the most substantial
performance boost, contributing over 90% of the total improvement beyond the baseline enhancements.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Computational Efficiency</title>
        <p>Our use of LoRA achieved competitive performance with significant computational advantages:
• Training Time: 5 hours vs. 10 hours for full fine-tuning
• Memory Usage: 40% reduction compared to full parameter updates
• Inference Speed: Real-time performance suitable for practical deployment</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <sec id="sec-7-1">
        <title>Summary of Contributions and Key Findings</title>
        <p>This work presents a comprehensive approach to automated numerical fact verification that addresses
key challenges in the CLEF 2025 Task 3. Our primary contributions include systematic evidence
optimization demonstrating that top-5 BM25 documents provide optimal performance, a neural summarization
pipeline using IBM Granite 3.3 8B that reduces evidence noise while preserving critical information
with 35-word constraints, advanced classification using DeBERTa-large with parameter-efficient LoRA,
multilingual data augmentation that addresses class imbalance through cross-lingual knowledge transfer,
and comprehensive imbalance mitigation integrating focal loss, label smoothing, and weighted sampling
techniques.</p>
        <p>Our extensive ablation studies reveal several important insights for numerical fact verification.
Evidence quantity optimization shows performance saturates at 5 evidence documents, suggesting
that additional context introduces more noise than signal. LLM-based summarization consistently
improves performance across all evidence lengths and classes, with 35-word summaries providing
optimal information density. Cross-lingual benefits demonstrate that multilingual data augmentation
not only addresses class imbalance but also enhances model robustness to diverse linguistic expressions
of numerical claims. Architecture choices confirm that DeBERTa’s enhanced attention mechanisms
provide meaningful improvements for numerical reasoning tasks over traditional BERT-based models.
These findings establish clear best practices for evidence selection, summarization design, and imbalance
mitigation in numerical fact verification tasks.</p>
      </sec>
      <sec id="sec-7-2">
        <title>Limitations and Future Work</title>
        <p>While our approach achieves substantial improvements, several limitations warrant discussion:
• Translation Dependency: the multilingual augmentation strategy relies on neural translation quality,
which may introduce artifacts or lose nuanced meaning in numerical expressions.
• Evidence Corpus Constraints: performance is bounded by the quality and coverage of the BM25-retrieved
evidence corpus, suggesting potential for improvement through advanced retrieval methods.
• Computational Overhead: the summarization pipeline adds inference time, though this may be mitigated
through caching strategies in production environments.
• Domain Generalization: evaluation focuses on the CLEF task dataset; broader evaluation across diverse
numerical claim domains would strengthen generalizability claims.</p>
        <p>Future research directions include investigation of retrieval-augmented generation approaches for
evidence synthesis, development of claim decomposition strategies for complex numerical assertions,
exploration of few-shot learning techniques for rapid adaptation to new domains, and integration of
structured knowledge bases for enhanced numerical reasoning.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>The authors employed Claude Sonnet 4 and GPT-4o for initial prompt design and methodology
refinement. After using these tools, the authors reviewed and edited all generated content and take full
responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , W. Chen, Deberta:
          <article-title>Decoding-enhanced bert with disentangled attention</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
          <year>2006</year>
          .03654. arXiv:
          <year>2006</year>
          .03654.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          , Lora:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:
          <volume>2106</volume>
          .
          <fpage>09685</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Joshi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Roberta: A robustly optimized bert pretraining approach</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1907</year>
          . 11692. arXiv:
          <year>1907</year>
          .11692.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bendou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Iturra-Bocaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims</article-title>
          , ????
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>Fever: a large-scale dataset for fact extraction</article-title>
          and verification,
          <year>2018</year>
          . URL: https://arxiv.org/abs/
          <year>1803</year>
          .05355. arXiv:
          <year>1803</year>
          .05355.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Williams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Dinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Weston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Adversarial nli: A new benchmark for natural language understanding</article-title>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/
          <year>1910</year>
          .14599. arXiv:
          <year>1910</year>
          .14599.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanselowski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Sorokin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Gurevych</surname>
          </string-name>
          ,
          <article-title>UKP-athene: Multi-sentence textual entailment for claim verification</article-title>
          , in: J.
          <string-name>
            <surname>Thorne</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Vlachos</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          <string-name>
            <surname>Cocarascu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Christodoulopoulos</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Mittal (Eds.),
          <source>Proceedings of the First Workshop on Fact Extraction and VERification (FEVER)</source>
          ,
          <source>Association for Computational Linguistics</source>
          , Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>103</fpage>
          -
          <lpage>108</lpage>
          . URL: https://aclanthology.org/W18-5516/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>W18</fpage>
          -5516.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>V. V</given-names>
            ,
            <surname>A. Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>Quantemp: A real-world open-domain benchmark for factchecking numerical claims</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2403.17169. arXiv:
          <volume>2403</volume>
          .
          <fpage>17169</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Smiley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ma</surname>
          </string-name>
          , S. Shah, W. Y. Wang,
          <article-title>ConvFinQA: Exploring the chain of numerical reasoning in conversational finance question answering</article-title>
          , in: Y.
          <string-name>
            <surname>Goldberg</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Kozareva</surname>
          </string-name>
          , Y. Zhang (Eds.),
          <source>Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>6279</fpage>
          -
          <lpage>6292</lpage>
          . URL: https://aclanthology.org/
          <year>2022</year>
          .emnlp-main.
          <volume>421</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          . emnlp-main.
          <volume>421</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W. tau Yih, T. Rocktäschel,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive nlp tasks</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/
          <year>2005</year>
          .11401. arXiv:
          <year>2005</year>
          .11401.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wadden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. van Zuylen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Fact or fiction: Verifying scientific claims</article-title>
          , in: B.
          <string-name>
            <surname>Webber</surname>
            , T. Cohn,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>He</surname>
          </string-name>
          , Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Online,
          <year>2020</year>
          , pp.
          <fpage>7534</fpage>
          -
          <lpage>7550</lpage>
          . URL: https://aclanthology.org/
          <year>2020</year>
          . emnlp-main.
          <volume>609</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/
          <year>2020</year>
          .emnlp-main.
          <volume>609</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>W.</given-names>
            <surname>Kryściński</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McCann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <article-title>Evaluating the factual consistency of abstractive text summarization</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/
          <year>1910</year>
          .12840. arXiv:
          <year>1910</year>
          .12840.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>K.</given-names>
            <surname>Popat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yates</surname>
          </string-name>
          , G. Weikum,
          <article-title>DeClarE: Debunking fake news and false claims using evidence-aware deep learning</article-title>
          , in: E.
          <string-name>
            <surname>Rilof</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Chiang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Hockenmaier</surname>
          </string-name>
          , J. Tsujii (Eds.),
          <source>Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Brussels, Belgium,
          <year>2018</year>
          , pp.
          <fpage>22</fpage>
          -
          <lpage>32</lpage>
          . URL: https://aclanthology.org/ D18-1003/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D18</fpage>
          -1003.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>I.</given-names>
            <surname>Augenstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lioma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Chaves</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hansen</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. G. Simonsen,</surname>
          </string-name>
          <article-title>MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims</article-title>
          , in: K. Inui,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <string-name>
            <surname>X.</surname>
          </string-name>
          Wan (Eds.),
          <source>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</source>
          ,
          <article-title>Association for Computational Linguistics</article-title>
          , Hong Kong, China,
          <year>2019</year>
          , pp.
          <fpage>4685</fpage>
          -
          <lpage>4697</lpage>
          . URL: https://aclanthology.org/D19-1475/. doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>D19</fpage>
          -1475.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , G. Da San Martino, T. Elsayed,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          , T. Mandl,
          <article-title>The clef-2021 checkthat! lab on detecting check-worthy claims, previously fact-checked claims, and fake news</article-title>
          ,
          <source>in: Advances in Information Retrieval: 43rd European Conference on IR Research</source>
          , ECIR
          <year>2021</year>
          ,
          <string-name>
            <surname>Virtual</surname>
            <given-names>Event</given-names>
          </string-name>
          ,
          <year>March</year>
          28 - April 1,
          <year>2021</year>
          , Proceedings,
          <string-name>
            <surname>Part</surname>
            <given-names>II</given-names>
          </string-name>
          , Springer-Verlag, Berlin, Heidelberg,
          <year>2021</year>
          , p.
          <fpage>639</fpage>
          -
          <lpage>649</lpage>
          . URL: https://doi.org/10.1007/978-3-
          <fpage>030</fpage>
          -72240-1_
          <fpage>75</fpage>
          . doi:
          <volume>10</volume>
          .1007/978-3-
          <fpage>030</fpage>
          -72240-1_
          <fpage>75</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Majumder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wei</surname>
          </string-name>
          , Multilingual e5 text embeddings: A
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>