<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>LIS at CheckThat! 2025: Multi-Stage Open-Source Large Language Models for Fact-Checking Numerical Claims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quy Thanh Le</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ismail Badache</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aznam Yacoub</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maamar El Amine Hamri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Aix Marseille Université</institution>
          ,
          <addr-line>CNRS, LIS, Marseille</addr-line>
          ,
          <country country="FR">France</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Windsor</institution>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The fast and large-scale spread of information through social networks and digital platforms has become an important global issue for today's societies, making automated fact-checking necessary. This paper presents the contribution of the R2I and MoFED teams from the LIS Laboratory to the task of automated fact-checking of claims, in the context of the 2025 CheckThat! Lab, specifically Task 3: Fact-Checking Numerical Claims. This task focuses on the verification of claims expressed in Arabic, English, and Spanish, particularly those involving numerical data or temporal references. In this study, we explore the effectiveness of recent open-source large language models (LLMs), such as Mistral and Qwen, for automating the fact-checking of numerical claims. We propose a two-stage pipeline that incorporates these LLMs into the fact-checking process: evidence retrieval and veracity prediction. First, we employ the QwQ-32B model to automatically generate questions from each claim, guiding the retrieval of relevant evidence from the corpus provided for Task 3. Second, we fine-tune the Mistral-Small-24B-Instruct-2501 model using the LoRA (Low-Rank Adaptation) technique to predict the veracity of each claim. This hybrid approach is designed to enhance both the performance and efficiency of the fact-checking pipeline. Despite variations in performance across languages, our method achieved outstanding results, ranking first in all three languages: Spanish, English, and Arabic. The multilingual nature of the datasets played a crucial role in improving the generalizability of claim validation across linguistic contexts. Our approach obtained macro-F1 scores of 0.503 for Spanish, 0.595 for English, and an exceptional 0.960 for Arabic, significantly outperforming the second-best Arabic score of 0.635. These results not only underscore the efficacy of leveraging open-source LLMs for fact-checking, but also contribute to ongoing research in claim detection. They further highlight the importance of language-specific adaptations and the potential of multilingual strategies in the development of robust, automated fact-checking systems.</p>
      </abstract>
      <kwd-group>
        <kwd>NLP</kwd>
        <kwd>Automated Fact-checking</kwd>
        <kwd>Numerical claims</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Multilingual datasets</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The issue of Big Data often refers to a situation where the speed at which data is generated and
disseminated far exceeds current computational and processing capacities. This results in a gap
between the massive production of data and the ability to analyze or exploit it effectively in real
time. A similar phenomenon occurs in the field of information verification: the rapid proliferation of
digital content, particularly on social media, makes the swift and systematic validation of circulating
claims extremely difficult [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. In this context, the implementation of robust mechanisms to assess the
veracity of information has become an increasingly critical challenge. Moreover, the speed at which
information circulates, often with viral dynamics, surpasses the capacity of verification processes,
allowing misinformation to take hold before corrections can be made. This situation highlights the
urgent need to develop effective tools and strategies to address this major informational challenge, and
among the possible solutions is fact-checking [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Fact-checking techniques, initially developed within the field of journalism, aim to evaluate the
veracity of information by systematically comparing claims with verified and trustworthy sources. In
recent years, these techniques have been increasingly adapted and enhanced through computational
approaches. Automated fact-checking generally follows a standardized three-step pipeline: claim
detection, evidence retrieval, and verdict prediction [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Fact-checking can be performed either manually
or automatically. Manual fact-checking, typically carried out by trained experts or journalists, is known
for its high reliability and contextual sensitivity. However, it is inherently limited in terms of scalability
and response time. For instance, platforms like Snopes manually verify viral claims, often requiring
several hours to a full day to process a single claim [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. These temporal constraints are especially
problematic given that false information tends to spread significantly faster and more broadly than
truthful content, particularly on social media platforms [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This growing asymmetry between the
pace of misinformation and the capacity for human-led verification highlights the pressing need for
automated fact-checking systems. Such systems are designed to identify claims requiring validation
and to assess their credibility by retrieving relevant evidence and determining whether the claims are
supported, contradicted, or unverifiable due to insufficient information [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. The development of robust,
scalable, and accurate automated fact-checking methods is therefore crucial to addressing the challenges
posed by the rapid and widespread diffusion of false information in contemporary digital ecosystems.
      </p>
      <p>
        However, automating fact-checking presents substantial challenges, especially when dealing with
complex claims that require nuanced reasoning and contextual understanding. This complexity is
further exacerbated when claims include numerical information, which is often perceived as more
credible due to the numeric-truth effect, a cognitive bias whereby numbers enhance perceived accuracy
[
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Recent research has shown that verifying numerical claims is more difficult than verifying
non-numerical ones [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. For example, the social media claim that “Pfizer admitted its COVID-19 vaccine
is only 12% effective, not 95% as previously stated” is a misinterpretation of Pfizer’s briefing document
submitted to the FDA in December 2020, prior to the issuance of the Emergency Use Authorization
(EUA), and has caused unnecessary public concern. As a result, developing methods to automatically
verify numerical claims has become an essential area of investigation in fact-checking research.
      </p>
      <p>In this paper, we present our participation in CheckThat! 2025 Task 3: Fact-Checking Numerical
Claims, a challenge that focuses on verifying claims involving numerical quantities and temporal
expressions. The main objective of this task is to assess the veracity of claims containing explicit or
implicit quantitative or temporal information by classifying them as True, False, or Conflicting, based on
evidence retrieved from a predefined corpus. To address this task, we propose a two-stage fact-checking
framework specifically tailored to the challenges posed by numerical and temporal claims. The first stage
consists of an evidence retrieval module, where we automatically generate natural language questions
from each claim using the multilingual large language model (LLM) QwQ-32B. These questions guide
the retrieval of the most relevant supporting or contradicting documents from the corpus. In the second
stage, we implement a veracity prediction module, in which the model Mistral-Small-24B-Instruct-2501
is fine-tuned on the task’s training data to classify the claims based on the retrieved evidence. Our
study is guided by a research question shaped by the democratization of open-source LLMs: are LLMs
capable of performing accurate fact-checking on the dataset provided in CheckThat! 2025 Task 3?</p>
      <p>The remainder of this paper is organized as follows. Section 2 provides a short review of some related
work, situating our contribution within the current research on fact-checking. Section 3 presents the
methodology, detailing the LLMs experimented with and the prompts employed. Section 4 discusses the
experimental results, highlighting key findings and observed limitations. Finally, Section 5 concludes
the paper by summarizing the main insights and outlining directions for future research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Although an increasing number of studies have been conducted to enhance fact-checking methods [
        <xref ref-type="bibr" rid="ref1 ref9">1, 9</xref>
        ],
the majority of existing work primarily focuses on verifying textual claims using either structured or
unstructured data sources [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ]. In contrast, research that specifically addresses the verification
of numerical claims remains relatively scarce.
      </p>
      <p>
        To the best of our knowledge, only a few prior works have directly addressed the fact-checking
of numerical claims. For example, Wallat et al. [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] present a focused evaluation of LLMs’ ability to
verify temporal factual claims through a dedicated Temporal Fact Checking task. They use a dataset
of 4196 manually verified claims sourced from fact-checking websites [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] to assess whether models
can classify statements as True, False, or contradictory. This task differs from traditional question
answering by requiring critical judgment rather than mere fact retrieval. The results reveal limited
performance across all models, with classification accuracy ranging from 29% (Llama 3.1) to 74.7%
(Qwen 2.5). Surprisingly, even highly capable models like GPT-4 underperform, often refusing to answer
when uncertain—a behavior interpreted as cautious calibration rather than lack of knowledge. This
part of the study demonstrates that while LLMs may store temporal facts, they often lack the reasoning
ability or confidence to verify them reliably, especially when temporal specificity is required.
      </p>
      <p>Building on the state of the art and the existing literature on claim fact-checking, while adopting a
somewhat different perspective in the formulation of our research question, we investigate the
capability of open-source LLMs to accurately perform fact-checking on the specific dataset provided in
CheckThat! 2025 Task 3.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Open-Source LLMs in the Fact-Checking Process</title>
      <p>Our goal is to apply large open-source language models to two key stages of the fact-checking process,
evidence retrieval and veracity prediction, for predicting the label of numerical claims. This section
introduces the dataset, the open-source LLMs used in our experiments, and the proposed pipeline our
models follow to fact-check claims. Since we use a shared pipeline across all three
languages, we focus on the English setup here; the setups for Arabic and Spanish are analogous, with
prompts adapted to each language while maintaining the same format.</p>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          The CheckThat! Lab 2025 task 3 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] provided participants with datasets in English, Spanish and
Arabic. Each dataset includes training (Train), development (Dev), and test (Test) splits, with the test set
reserved for final submissions. Additionally, the organizers provided a collection of evidence used to
verify all claims for each language (see Table 1). The distribution of claims shows that the Spanish and
Arabic datasets are significantly smaller than the English dataset (see Table 2). The training (Train) and
development (Dev) splits show a highly imbalanced label distribution across the 3 labels, with the False
label accounting for the majority in both English and Spanish, whereas the Arabic dataset exhibits a
more balanced class distribution. However, the Arabic dataset includes only 2 labels, True and False,
compared to 3 labels in English and Spanish (see Table 3).
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Open-Source LLMs Used in the Experiments</title>
        <p>Open-source LLMs provide significant advantages in terms of cost-effectiveness, transparency, and
community collaboration. Therefore, we utilize two open-source LLMs, along with an embedding model
developed upon LLM foundations, as outlined in Table 4.</p>
        <p>QwQ-32B (https://huggingface.co/Qwen/QwQ-32B) is the reasoning-focused model from the Qwen
series. Unlike conventional instruction-tuned models, QwQ-32B demonstrates enhanced capabilities in
reasoning and complex problem-solving tasks. It performs competitively with state-of-the-art models
such as DeepSeek-R1 and o1-mini, making</p>
        <sec id="sec-3-2-1">
          <p>it well-suited for fact-checking tasks that require logical inference. This model is specifically employed
to perform the question generation task as part of our inference pipeline.</p>
          <p>
            Linq-Embed-Mistral (https://huggingface.co/Linq-AI-Research/Linq-Embed-Mistral) builds upon the
foundations of E5-mistral-7b-instruct and Mistral-7B-v0.1. It demonstrates strong retrieval performance,
ranking 2nd on the MTEB benchmark leaderboard [
            <xref ref-type="bibr" rid="ref15">15</xref>
            ] with
a score of 60.2. This model is specifically used for embedding and retrieval tasks within our pipeline.
          </p>
          <p>Mistral Small 3 (https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501), also referred
to as 2501, is a compact model with 24B parameters, making it one of the most capable models in the
sub-70B category. Developed by Mistral AI, it optimizes transformer-based architectures for language
tasks, combining high performance with computational efficiency. Within our inference pipeline, this
model is specifically responsible for assessing the veracity of the claim based on the retrieved evidence,
the claim, and the generated questions.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Methodology</title>
        <p>Figure 1 presents our proposed pipeline for fact-checking, which follows a two-stage architecture:
evidence retrieval and veracity prediction. The input to the system consists of a claim to be verified and
a corpus of evidence provided by the task organizers, which serves as a knowledge base containing
relevant information for evaluating claims. In the first stage, the system retrieves the top-k most relevant
pieces of evidence from the corpus by applying a decomposition method to the input claims. In the
second stage, veracity prediction, the input includes the retrieved evidence, the original claim, and the
generated questions. These are fed into an LLM that has been fine-tuned for the veracity prediction
task on the training dataset to produce a final verdict, reflecting the claim’s degree of
validity. To ensure consistency, the same pipeline described above is used to retrieve evidence for the
training, development, and test sets. We do not use the generated questions, retrieved evidence, or
gold labels provided by the task organizers at any point in this process.</p>
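<p>For illustration, the two-stage flow can be condensed into a short, hypothetical sketch. The function names below are stand-ins for QwQ-32B (question generation), Linq-Embed-Mistral (retrieval), and the fine-tuned Mistral-Small-24B-Instruct-2501 (veracity prediction); the stub bodies exist only to make the control flow runnable and do not reflect the actual models:</p>

```python
# Minimal, hypothetical sketch of the two-stage fact-checking pipeline.

def generate_questions(claim: str, n: int) -> list[str]:
    # Stub: the real system prompts QwQ-32B (see Figure 2).
    return [f"Question {i + 1} about: {claim}" for i in range(n)]

def retrieve_top1(query: str, corpus: list[str]) -> str:
    # Stub: the real system embeds the query and ranks by cosine similarity;
    # here we rank by word overlap just to keep the sketch runnable.
    words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(words & set(doc.lower().split())))

def predict_verdict(claim: str, qa_pairs: list[tuple[str, str]]) -> str:
    # Stub: the real system prompts the fine-tuned model (see Figure 4).
    return "True"

def fact_check(claim: str, corpus: list[str], n_questions: int = 3) -> str:
    questions = generate_questions(claim, n_questions)                     # stage 1a
    evidence = [retrieve_top1(f"{claim} {q}", corpus) for q in questions]  # stage 1b
    return predict_verdict(claim, list(zip(questions, evidence)))          # stage 2
```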
        <sec id="sec-3-3-1">
          <p>Although the Arabic dataset contains only two labels (True and False), a three-label prompt structure,
as shown in Figure 4, is maintained to ensure consistency across different languages. While the prompt
suggests the possibility of three output labels, the fine-tuned model, trained solely on two-label data,
produces only two labels during evaluation on the development set. This behavior does not result in
any errors or require additional normalization, and the model still achieves high performance.</p>
          <p>Table 5 presents our model configurations (LIS system) in comparison to the baseline system provided
by the organizers. Key differences lie in the evidence retrieval stage, where we employ claim-augmented
retrieval queries (combining the original claim with generated questions), as opposed to the baseline
system, which uses question-only queries. For retrieval, we use Linq-Embed-Mistral, which selects the
most relevant evidence based on cosine similarity, rather than relying on BM25 followed by reranking
with paraphrase-MiniLM-L6-v2 as in the baseline. For question generation, our system utilizes the
QwQ-32B model, while the baseline employs GPT-3.5. In the veracity prediction stage, our system
leverages the powerful Mistral-Small-24B-Instruct-2501, whereas the baseline is based on
FinQA-Roberta-Large.</p>
          <p>3.3.1. Evidence Retrieval</p>
          <p>
            As presented in Figure 1, the claim verification system is based on a two-stage inference pipeline. The
first stage consists of retrieving relevant evidence from the corpus to either support or
refute a given claim. Inspired by recent advances in generative retrieval methods [
            <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
            ], we employ
an instruction-following language model to generate questions based on the input claims. The prompt
used during this generation process is illustrated in Figure 2.
          </p>
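<p>The instruction prompt of Figure 2 is assembled from a fixed instruction block and the claim under verification. A minimal, hypothetical prompt builder (with the instruction text abridged from Figure 2, not the verbatim prompt) might look like:</p>

```python
# Hypothetical builder for the Figure-2-style question-generation prompt.

INSTRUCTION = (
    "You are a fact-checker. Your task is to prepare the fact-check:\n"
    "1. Interpret the claim and list its key points as reformulated sub-claims.\n"
    "2. Analyze what information is missing.\n"
    "3. Present a numbered list of 5 questions needed to verify the claim.\n"
    "Enclose each question in backticks."
)

def build_question_prompt(claim: str) -> str:
    # The claim is appended under its own header, as in Figure 2.
    return f"# Instructions\n{INSTRUCTION}\n\n# The Claim\n{claim}"
```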
          <p>Once the questions are generated, each question is concatenated with its corresponding claim to
form a query. This query is then embedded using our embedding model (Linq-Embed-Mistral) and
compared against the corpus to retrieve the most relevant piece of evidence (top-1 retrieval) based on
cosine similarity. This approach ensures that the selected evidence is contextually aligned with both
the claim and the generated question. An example of this retrieval process is illustrated in Figure 3.</p>
          <p>3.3.2. Veracity Prediction</p>
          <p>In the same vein, and as shown in Figure 1, the second stage handles the veracity prediction task
based on the generated question, the retrieved evidence, and the original claim as inputs to predict
the veracity of the claim. We employ an instruction-following LLM, specifically the fine-tuned
Mistral-Small-24B-Instruct-2501 model, to perform this task. This model predicts the final veracity label
(True, False, or Conflicting) based on the combined context. The prompt structure used for veracity
prediction is illustrated in Figure 4.</p>
          <p>Figure 2 illustrates a concrete example of an instruction prompt alongside its generated output for
the task of question generation. In this figure, the instruction is highlighted in bold text, providing
clear guidance to the model. The italicized text presents the claim or statement on which the question
is based. Finally, the question produced by the model is shown in blue, demonstrating how the model
interprets and transforms the input prompt into a relevant query. This visual representation helps to
clarify the interaction between the instruction, the claim, and the model’s generated response.
# Instructions
You are a fact-checker. Your general motivation is to verify a given claim. You are at the beginning of
the fact-checking process, meaning you have just received the claim, optionally with some additional
metadata (such as the date of the claim or the author), if available. Your task now is to prepare the
fact-check. That means:
1. Begin with an interpretation of the claim. As part of the interpretation, list the key points of the claim
as a list of reformulated sub-claims.
2. Then, analyze what information is missing.
3. Finally, present a complete, numbered list of 5 questions: These are questions that explore the
truthfulness of the claim and that we need to answer in order to factually verify the claim.
IMPORTANT:
Follow these rules:
* Phrase each question so that it can be understood independently and without additional context. Be
explicit and do not use pronouns or generic terms in place of names or objects.
* Enclose each question in backticks like this.
# Examples
Claim: "The new Food Bill in New Zealand bans gardening"
Good Question: “Did the New Zealand government pass a food bill that restricted gardening activities for
its citizens?”
Bad Question: “Did the government pass a bill?”
Bad Question: “Did the bill restrict activities?”
# The Claim
french president emmanuel macron is around 25 years younger than his wife, brigitte, and they started dating when he
was a teenager and she was one of his teachers. her daughter, laurence, was once one of his classmates, and later became
his stepdaughter.</p>
          <p>Output | Generated questions:
“What is the exact age difference between French President Emmanuel Macron and his wife Brigitte Macron?”
“Did Laurence, Brigitte Macron’s daughter, legally become Emmanuel Macron’s stepdaughter through the marriage
of Brigitte Macron and Emmanuel Macron?”
“Did Emmanuel Macron and Brigitte Macron begin their romantic relationship while Macron was still a teenager?”
“Was Brigitte Macron serving as a teacher at the same educational institution where Emmanuel Macron was a
student when their relationship began?”
“Was Laurence, Brigitte Macron’s daughter from a previous relationship, enrolled as a student in the same school
class as Emmanuel Macron before he married Brigitte?”</p>
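<p>Because the prompt requires each generated question to be enclosed in backticks, the questions can be recovered from the raw model output with a simple pattern match. The paper does not describe its exact post-processing, so the following is only a plausible sketch:</p>

```python
import re

def extract_questions(model_output: str, max_q: int = 5) -> list[str]:
    # The prompt asks for up to 5 questions, each enclosed in backticks;
    # capture backtick-delimited spans ending with a question mark.
    return re.findall(r"`([^`]+\?)`", model_output)[:max_q]
```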
          <p>Figure 3 illustrates the process of query formation by integrating a claim with a generated question
aimed at retrieving relevant evidence. In this example, the bold text represents the instruction guiding
the task, while the italic text shows the claim under verification. The gray text displays the most
pertinent evidence retrieved from the data source, and the blue text highlights the questions generated
by the model to support the evidence search. This structured approach facilitates targeted evidence
gathering for fact-checking purposes.</p>
          <p>Query: french president emmanuel macron is around 25 years younger than his wife, brigitte, and they started dating
when he was a teenager and she was one of his teachers. her daughter, laurence, was once one of his classmates, and later
became his stepdaughter. What is the exact age difference between French President Emmanuel Macron and his wife
Brigitte Macron?
Output | Relevant evidence 1:
Emmanuel, who became president in 2017, and Brigitte first met when the now-president was in high school. In a
November 2023 interview with Paris Match — translated by The Independent — Brigitte shared how she never
thought of entertaining a relationship with her student, who was 15 at the time. Brigitte and Emmanuel met when
he was a student at Le Providence, a Catholic school in northern France. Brigitte worked as a drama teacher and
was 39 years old at the time, while Emmanuel was 15. The president’s parents originally thought he was seeing
Brigitte’s daughter, but when they found out he had feelings for the teacher, they sent him to boarding school.
French President Emmanuel Macron and his wife, Brigitte, have been married for nearly two decades, but their
relationship dates back even further.</p>
          <p>Query: french president emmanuel macron is around 25 years younger than his wife, brigitte, and they started dating
when he was a teenager and she was one of his teachers. her daughter, laurence, was once one of his classmates, and
later became his stepdaughter. Was Brigitte Macron serving as a teacher at the same educational institution where
Emmanuel Macron was a student when their relationship began?
Output | Relevant evidence 2:
Emmanuel Macron’s wife and former teacher thought he would “fall in love with someone his [own] age” after he
was moved from the school where she taught him. Mrs Macron said her older siblings used to joke about the gossip.
She divorced from her estranged husband, André-Louis Auzière, a banker, in 2006 and married Mr Macron a year
later, when he was a young civil servant. The French president was 15 when he fell for his married drama teacher,
Brigitte Auzière, then 40, at the Catholic Providence school in Amiens in the early 1990s. Her daughter Laurence was
the future French president’s classmate.</p>
          <p>Query: french president emmanuel macron is around 25 years younger than his wife, brigitte, and they started dating
when he was a teenager and she was one of his teachers. her daughter, laurence, was once one of his classmates, and later
became his stepdaughter. Question k, with k ∈ {3, 5}</p>
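<p>The claim-augmented retrieval illustrated above can be sketched with plain NumPy. The embed function below is a toy bag-of-words stand-in for Linq-Embed-Mistral, included only so the cosine-similarity logic is runnable end to end:</p>

```python
import numpy as np

def embed(texts: list[str], dim: int = 64) -> np.ndarray:
    # Toy stand-in for Linq-Embed-Mistral: per-run bag-of-words hashing.
    vecs = np.zeros((len(texts), dim))
    for i, text in enumerate(texts):
        for word in text.lower().split():
            vecs[i, hash(word) % dim] += 1.0
    return vecs

def retrieve_top1(claim: str, question: str, corpus: list[str]) -> str:
    # Claim-augmented query: the claim concatenated with one generated question.
    query_vec = embed([f"{claim} {question}"])[0]
    doc_vecs = embed(corpus)
    # Cosine similarity between the query and every corpus document.
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return corpus[int(np.argmax(sims))]
```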
          <p>Figure 4 presents an example of an instruction prompt and output for veracity prediction. The bold
text corresponds to the instruction given to the model, the italic text is the claim to be verified, while
the gray text represents the extracted evidence deemed relevant. Finally, the blue text includes the
questions generated by the model as well as its final prediction regarding the claim’s veracity. This
figure clearly illustrates the complete process of automatic claim verification based on evidence and
generated questions.</p>
          <p># Instructions
You are a fact-checker. You have done a fact-check to verify a Claim based on the provided
question-answer pair evidence.</p>
          <p>Your task is to predict the verdict of a claim based on the provided question-answer pair evidence whether
the Claim is one of the labels: ’True’, ’False’, ’Conflicting’. Do this by following:
- Respond "True" only if the relevant evidence fully or almost fully supports and verifies the claim as
correct.
- Respond "False" if:
- The relevant evidence contradicts or disproves the claim.
- The claim is misleading based on the relevant evidence.
- The evidence is too weak or insufficient to support the claim.
- Respond "Conflicting" if the evidence is ambiguous, incomplete, or inconclusive, making it impossible
to determine if the claim is fully true or false.</p>
          <p>Always adhere to the following rules:
- Use information only from the recorded evidence: Avoid inserting information that is not implied by the
evidence. You may use commonsense knowledge, though.
- Avoid repeating yourself.</p>
          <p>Claim: french president emmanuel macron is around 25 years younger than his wife, brigitte, and they started dating
when he was a teenager and she was one of his teachers. her daughter, laurence, was once one of his classmates, and later
became his stepdaughter.</p>
          <p>Q1: “What is the exact age diference between French President Emmanuel Macron and his wife Brigitte Macron?”
A1: Emmanuel, who became president in 2017, and Brigitte first met when the now-president was in high school. In
a November 2023 interview with Paris Match — translated by The Independent — Brigitte shared how she never
thought of entertaining a relationship with her student, who was 15 at the time. Brigitte and Emmanuel met when
he was a student at Le Providence, a Catholic school in northern France. Brigitte worked as a drama teacher and
was 39 years old at the time, while Emmanuel was 15. The president’s parents originally thought he was seeing
Brigitte’s daughter, but when they found out he had feelings for the teacher, they sent him to boarding school.
French President Emmanuel Macron and his wife, Brigitte, have been married for nearly two decades, but their
relationship dates back even further.</p>
          <p>Q2: “Was Brigitte Macron serving as a teacher at the same educational institution where Emmanuel Macron was a
student when their relationship began?”
A2: Emmanuel Macron’s wife and former teacher thought he would “fall in love with someone his [own] age” after
he was moved from the school where she taught him. Mrs Macron said her older siblings used to joke about the
gossip. She divorced from her estranged husband, André-Louis Auzière, a banker, in 2006 and married Mr Macron a
year later, when he was a young civil servant. The French president was 15 when he fell for his married drama
teacher, Brigitte Auzière, then 40, at the Catholic Providence school in Amiens in the early 1990s. Her daughter
Laurence was the future French president’s classmate.</p>
          <p>Output | Verdict: True</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Results</title>
      <p>This section presents the results of our implementation, including the hyperparameter configurations
used to fine-tune Mistral-Small-24B-Instruct-2501 for veracity prediction and to generate questions
with QwQ-32B, as well as the environment setup and our experiment results and discussion.</p>
      <sec id="sec-4-1">
        <title>4.1. Implementation</title>
        <p>
          The hyperparameters used for fine-tuning the LLMs in our experiments are summarized in Table 6.
To enable efficient fine-tuning, we employed the Low-Rank Adaptation (LoRA) technique [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]. The
hyperparameters specific to the question generation process are provided in Table 7, while those for
veracity prediction using the fine-tuned model are detailed in Table 8. We also updated the evidence
corpus with the language-specific documents provided by the task organizers during the test phase to
ensure the retrieval of the most relevant evidence for each claim. All experiments were conducted on the
LIS cluster, equipped with NVIDIA A100 GPUs (80GB). The evidence retrieval phase took approximately
8 hours, while the question generation phase required around 2 days. Veracity prediction across all three
languages was completed in roughly 3 hours. To address potential cases where the language model
produced abnormal outputs lacking a final label in the required format, we implemented a fallback
mechanism whereby the generation process was repeated up to five times. At each iteration, the
temperature was incrementally increased by 0.1—starting from an initial value of 0.1—to promote more
diverse and correctly formatted outputs, as illustrated in Table 8. Nevertheless, during evaluation on
the test set, no such abnormal outputs were observed. For the comparison experiments, we trained our
models on the training set and evaluated them on the development set. These models were subsequently
used for the final submission to the shared task.
        </p>
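        <p>The fallback mechanism described above can be sketched as follows. This is a minimal illustration, not our actual implementation: generate stands in for any call to the model, and the default verdict returned after five failed attempts is an assumption on our part, since no abnormal outputs actually occurred on the test set.</p>
        <preformat>
```python
import re

# A well-formed output ends with a final "Verdict: <label>" line.
PATTERN = re.compile(r"Verdict:\s*(True|False|Conflicting)\s*$")

def extract_verdict(output: str):
    """Return the final verdict label if present and well-formed, else None."""
    m = PATTERN.search(output.strip())
    return m.group(1) if m else None

def predict_with_fallback(generate, prompt, max_tries=5, start_temp=0.1, step=0.1):
    """Retry generation up to max_tries, raising the sampling temperature by
    `step` after each malformed output to encourage a correctly formatted one."""
    temperature = start_temp
    for _ in range(max_tries):
        verdict = extract_verdict(generate(prompt, temperature=temperature))
        if verdict is not None:
            return verdict
        temperature += step
    # Conservative default if every attempt is malformed (an assumption:
    # the paper reports no such case occurred on the test set).
    return "False"
```
        </preformat>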
        <p>Table 6 summarizes the hyperparameters used for fine-tuning the model for veracity prediction. The
training was conducted over 2 epochs with a batch size of 2 and gradient accumulation steps of 4, using
the AdamW 8-bit optimizer. Key settings include a learning rate of 2e-4, weight decay of 0.01, and
a warmup of 5 steps. A temperature of 0.1 was applied for output control. LoRA-based fine-tuning
was used with alpha set to 16, dropout set to 0.1, and rank set to 64, enabling efficient and scalable
parameter adaptation.</p>
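        <p>To see why the rank-64 LoRA setting in Table 6 keeps fine-tuning efficient, note that LoRA trains only r × (d_in + d_out) parameters per adapted weight matrix instead of the full d_out × d_in. A minimal sketch of this count follows; the 4096-dimensional projection is an illustrative assumption, not a property of the actual model.</p>
        <preformat>
```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """LoRA freezes the original weight W (d_out x d_in) and learns a low-rank
    update B @ A, with A of shape (r, d_in) and B of shape (d_out, r), so only
    r * (d_in + d_out) parameters are trained per adapted matrix."""
    return r * (d_in + d_out)

# Illustrative 4096x4096 projection (hypothetical layer size) at rank 64:
full = 4096 * 4096                            # weights updated by full fine-tuning
lora = lora_trainable_params(4096, 4096, 64)  # weights updated by LoRA
ratio = lora / full                           # fraction LoRA actually trains (~3%)
```
        </preformat>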
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Results and Discussions</title>
        <p>4.2.1. English datasets</p>
        <p>Results on the development set show that our proposed model achieved nearly equivalent performance
in both the 3-question and 5-question scenarios, with the 3-question setup performing slightly better
by about 0.2%. Notably, both scenarios outperformed the baseline by approximately 3%.</p>
        <p>Due to the task requirement of selecting only one best-performing model for the final evaluation
on the test set, we submitted the proposed 3-question model. This model achieved a macro-F1 score
of 59.54%, ranking first ahead of all other participating teams. The full leaderboard is published on
the official CLEF website (https://codalab.lisn.upsaclay.fr/competitions/22699#results). Our system
outperformed the second-ranked team by approximately 3% and exceeded the lowest-ranked team by
about 24%. Notably, this test-set performance was approximately 1.5% lower than the result on the
development set. However, we could not compare this result against the baseline on the test set, as the
official baseline results for Task 3 have not yet been released.</p>
        <p>4.2.2. Spanish and Arabic datasets</p>
        <p>Table 10 presents the results obtained from various configurations evaluated on the Spanish and
Arabic datasets of CheckThat! 2025 Task 3, covering both the development and test sets. The evaluation
considered two scenarios for the number of generated questions per claim (3 versus 5), using the
Mistral-Small-24B-Instruct-2501 model fine-tuned on the training set. The evaluation metrics include
macro-F1 and class-wise F1 scores for each label. Unlike the English dataset, however, the organizers
did not provide baseline model results for comparison.</p>
        <p>On the development set, the macro-F1 scores indicate that, as with English, Arabic achieved
nearly equivalent performance in the 3-question and 5-question scenarios, with the 3-question
configuration ahead by about 1%. Notably, the 3-question setup yielded almost perfect performance,
reaching approximately 96%. For Spanish, using 3 questions improved performance by over 10%
compared to the 5-question scenario, suggesting that a smaller number of questions may help retrieve
more accurate and relevant evidence.</p>
        <p>As with the English dataset, only one best-performing model per language could be submitted
for test set evaluation, so we again submitted the 3-question configuration.</p>
        <p>For Spanish, the model attained a macro-F1 score of 50.3%, a decrease of roughly 7% compared
to its performance on the development set. Despite this, it secured first place, outperforming all other
competing teams. The complete leaderboard is available on the official CLEF website
(https://codalab.lisn.upsaclay.fr/competitions/22823#results). Our system surpassed the second-place
team by approximately 13% and outperformed the lowest-ranked team by around 25%.</p>
        <p>For Arabic, although a three-label prompt was used as described in Section 3.3, the model
consistently produced only two labels on the test set, without any errors or need for normalization. It
achieved a macro-F1 score of 96.15%, a slight improvement of about 0.15% relative to its development
set result, and it ranked first among all participants. The full rankings can be found on the official
CLEF website (https://codalab.lisn.upsaclay.fr/competitions/22699#results). Compared to the runner-up,
our model exceeded their performance by roughly 33%, and it outperformed the lowest-ranked team by
approximately 60%.</p>
        <sec id="sec-4-2-1">
          <p>
            We chose to generate three questions per claim based on consistent empirical results across Arabic,
English, and Spanish, which showed this configuration to be the most effective. This finding aligns with [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ],
who tested multiple settings (1, 3, 5, and 7 questions) and also reported optimal performance with three.
This convergence reinforces the robustness of our approach.
          </p>
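          <p>Since macro-F1 is the ranking metric throughout these comparisons, a minimal sketch of its computation may be useful: it is the unweighted mean of per-class F1 scores, so rare labels count as much as frequent ones. This is an illustrative reimplementation, not the official scorer.</p>
          <preformat>
```python
def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 scores over the given label set."""
    f1_scores = []
    for label in labels:
        # Per-class counts: true positives, false positives, false negatives.
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)
```
          </preformat>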
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This paper presents the experiments conducted by the LIS team for CheckThat! 2025 Task 3, which
focuses on verifying numerical claims in English, Spanish, and Arabic. We investigated the application
of large open-source language models in the two key stages of the fact-checking pipeline: evidence
retrieval and veracity prediction. Our proposed pipeline integrates instruction-following LLMs with
effective fine-tuning strategies such as LoRA. We applied the same methodology consistently across all
three languages and were the only team to fully participate in all three. The approach achieved strong
performance; notably, the model fine-tuned on the training data secured first place on the official
leaderboard in every language, outperforming submissions from 19 participating teams. These results
highlight the potential of open-source LLMs in multilingual fact-checking tasks. Furthermore, our
findings emphasize that fine-tuning the Mistral-Small-24B-Instruct-2501 model yields significantly better
performance than NLI models such as FinQA-Roberta-Large, which was used as the baseline on the
English development set. This demonstrates the benefit of leveraging both model scale and task-specific
adaptation in complex fact-checking scenarios involving numerical claims.</p>
      <p>However, many potentially promising open-source LLMs remain unexplored in this experiment,
and our test-set evaluation covers only a single question configuration (3 questions). Additionally, the
current fine-tuning approach has been limited to monolingual language models. In future work, we
plan to conduct experiments on a broader range of open-source LLMs as well as with more diverse
question configurations, and to explore fine-tuning multilingual models, which may yield better results
compared to monolingual ones. Furthermore, although the Arabic dataset includes only two labels
(True and False), the current prompt structure uses three possible verdicts to ensure consistency across
languages. This design choice may lead the model to generate non-applicable verdicts for certain
Arabic claims, potentially affecting system behavior. The impact of this discrepancy will be examined
to determine whether normalization techniques—such as mapping the model’s output to the available
binary labels—can enhance reliability and alignment with the ground truth.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was supported by the Eiffel Excellence Scholarship awarded to Quy Thanh Le by the French
government, as well as by the R2I and MoFED teams of the LIS Laboratory.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used GPT-4 for grammar and spelling checks, as
well as for paraphrasing and rewording. After using these tools, the author(s) reviewed and edited the
content as needed, and take full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Z. Guo, M. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Transactions of the Association for Computational Linguistics 10 (2022) 178-206. URL: https://aclanthology.org/2022.tacl-1.11/. doi:10.1162/tacl_a_00454.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, Language and Linguistics Compass 15 (2021) e12438.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: the first-ever end-to-end fact-checking system, Proc. VLDB Endow. 10 (2017) 1945-1948. URL: https://doi.org/10.14778/3137765.3137815. doi:10.14778/3137765.3137815.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146-1151. URL: https://www.science.org/doi/abs/10.1126/science.aap9559. doi:10.1126/science.aap9559.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] R. Panchendrarajan, A. Zubiaga, Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research, Natural Language Processing Journal 7 (2024) 100066. URL: http://dx.doi.org/10.1016/j.nlp.2024.100066. doi:10.1016/j.nlp.2024.100066.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] N. Sagara, E. Peters, Consumer understanding and use of numeric information in product claims, in: D. R. Deeter-Schmelz (Ed.), Proceedings of the 2010 Academy of Marketing Science (AMS) Annual Conference, Springer International Publishing, Cham, 2015, pp. 245-245. doi:10.1007/978-3-319-11797-3_140.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] V. Venktesh, A. Anand, A. Anand, V. Setty, QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 650-660. URL: https://doi.org/10.1145/3626772.3657874. doi:10.1145/3626772.3657874.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] R. Aly, Z. Guo, M. S. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, The fact extraction and VERification over unstructured and structured information (FEVEROUS) shared task, in: R. Aly, C. Christodoulopoulos, O. Cocarascu, Z. Guo, A. Mittal, M. Schlichtkrull, J. Thorne, A. Vlachos (Eds.), Proceedings of the Fourth Workshop on Fact Extraction and VERification (FEVER), Association for Computational Linguistics, Dominican Republic, 2021, pp. 1-13. URL: https://aclanthology.org/2021.fever-1.1/. doi:10.18653/v1/2021.fever-1.1.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] V. Setty, FactCheck Editor: Multilingual text editor with end-to-end fact-checking, 2024. URL: https://arxiv.org/abs/2404.19482. arXiv:2404.19482.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: E. M. Bender, L. Derczynski, P. Isabelle (Eds.), Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 3346-3359. URL: https://aclanthology.org/C18-1283/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] R. Aly, Z. Guo, M. Schlichtkrull, J. Thorne, A. Vlachos, C. Christodoulopoulos, O. Cocarascu, A. Mittal, FEVEROUS: Fact extraction and verification over unstructured and structured information, 2021. URL: https://arxiv.org/abs/2106.05707. arXiv:2106.05707.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] J.-F. Bussotti, L. Ragazzi, G. Frisoni, G. Moro, P. Papotti, Unknown claims: Generation of fact-checking training examples from unstructured and structured data, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 12105-12122. URL: https://aclanthology.org/2024.emnlp-main.675/. doi:10.18653/v1/2024.emnlp-main.675.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] J. Wallat, A. Abdallah, A. Jatowt, A. Anand, A study into investigating temporal robustness of LLMs, 2025. arXiv:2503.17073.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] V. Venktesh, V. Setty, A. Anand, M. Hasanain, B. Bendou, H. Bouamor, F. Alam, G. Iturra-Bocaz, P. Galuščáková, Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] N. Muennighoff, N. Tazi, L. Magne, N. Reimers, MTEB: Massive text embedding benchmark, in: A. Vlachos, I. Augenstein (Eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, Association for Computational Linguistics, Dubrovnik, Croatia, 2023, pp. 2014-2037. URL: https://aclanthology.org/2023.eacl-main.148/. doi:10.18653/v1/2023.eacl-main.148.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Chen, A. Sriram, E. Choi, G. Durrett, Generating literal and implied subquestions to fact-check complex claims, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3495-3516. URL: https://aclanthology.org/2022.emnlp-main.229/. doi:10.18653/v1/2022.emnlp-main.229.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Fact-checking complex claims with program-guided reasoning</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>6981</fpage>
          -
          <lpage>7004</lpage>
          . URL: https://aclanthology.org/2023.acl-long.386/. doi:10.18653/v1/2023.acl-long.386.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <source>CoRR abs/2106.09685</source>
          (
          <year>2021</year>
          ). arXiv:2106.09685.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>