<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>UCOM_UNAM_PLN at CheckThat! 2025: Evaluating LLMs in a Two-Step Architecture for Numerical Fact Checking</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Guido Acosta</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Eduardo Morales</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Gómez-Adorno</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Instituto de Investigaciones en Matemáticas Aplicadas y en Sistemas, Universidad Nacional Autónoma de México</institution>
          ,
          <addr-line>Av. Universidad 3000, C.U., Coyoacán, Ciudad de México</addr-line>
          ,
          <country country="MX">México</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Maestría en Ciencia de Datos, Universidad Comunera</institution>
          ,
          <addr-line>Monseñor Bogarín 284, Asunción</addr-line>
          ,
          <country country="PY">Paraguay</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper presents the approach developed by the UCOM-UNAM-PLN team for Task 3 of the CLEF 2025 CheckThat! Lab, focused on verifying numerical and temporal claims in Spanish. We propose a two-step fact-checking architecture leveraging large language models (LLMs) in both stages: (1) evidence retrieval through BM25-preselected documents, and (2) veracity classification using zero-shot and few-shot prompting strategies. We conduct a comprehensive evaluation of retrieval techniques, including statistical, embedding-based, and LLM-based models, on the CLEF 2024 development set, identifying GPT-4o as the top-performing retriever. Based on these findings, we deploy GPT-4o for both evidence extraction and classification. Our best-performing pipeline combines few-shot prompting with curated evidence, achieving a macro-averaged F1 score of 0.3595 on the official CLEF 2025 test set. Results highlight the effectiveness of hybrid LLM-based architectures in identifying false claims, while also revealing challenges in handling ambiguous or true statements. This study underscores the potential of combining semantic retrieval and task-specific prompting for robust fact verification in real-world contexts.</p>
      </abstract>
      <kwd-group>
        <kwd>Fact checking</kwd>
        <kwd>LLMs</kwd>
        <kwd>Classification</kwd>
        <kwd>Information Retrieval</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In their state-of-the-art review, [3] organize automated fact-checking methods according to the four-stage
architecture proposed by [4]: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) claim detection, where statements that can be verified are extracted; (
        <xref ref-type="bibr" rid="ref2">2</xref>
        )
evidence retrieval, which involves locating relevant information from external sources such as databases
or textual corpora; (3) claim verification, which entails determining whether the evidence supports,
refutes, or does not allow a conclusion about the claim; and (4) justification generation, where a textual
explanation is produced to accompany the system’s decision. In [5], the fact-checking process mainly
consists of two stages. First, an evidence retrieval model based on transformers extracts a relevant
text snippet from a trusted source, which is crucial for assessing the validity of a claim. Then, a claim
verification model, also based on transformers, uses the retrieved evidence to classify the claim as true,
false, or "not enough information" (NEI) if the evidence is inconclusive. In this paper, we will focus on
the two-step architecture.
      </p>
      <p>In their review on automated fact-checking, [3] identify three main approaches for evidence retrieval:
traditional lexical-based methods like BM25, dense embedding-based methods, and hybrid approaches
that combine initial retrieval and re-ranking with more complex models. For information retrieval, [5]
employs a transformer-based model inspired by extractive question-answering systems. This model,
typically BERT-based, takes a claim and a reliable source (such as a Wikipedia article) and extracts a
relevant text snippet for claim verification. The training is done by fine-tuning pre-trained BERT models
(such as BERT and DistilBERT) using the FEVER dataset. In the experiments conducted, DistilBERT
achieved an exact match score of 90.19% and an F1-score of 93.98%, slightly outperforming BERT, which
obtained an exact match score of 89.89% and an F1-score of 93.93%.</p>
      <p>In [5], the claim verification model is based on BERT. This model takes as input the original textual
claim and the retrieved textual evidence. Sentence embeddings for the claim and the evidence are
obtained through a pre-trained BERT embedding layer. These embeddings are then processed and passed
to a fully connected layer. Finally, the output of this layer is used to classify the claim into one of three
categories: true, false, or "not enough information" (NEI), determining whether the evidence is sufficient
to verify the claim. The verification stage in [6] focuses primarily on the use of Large Language Models
(LLMs). These LLMs, such as GPT-4 and Llama3-70B-Instruct, are guided by system prompts to
analyze a claim and the provided textual evidence. Their goal is to predict a label (SUPPORTS, REFUTES,
or NOT ENOUGH INFO) along with a confidence score, leveraging their deep understanding of natural
language for complex reasoning tasks. In [7], the authors conducted a comprehensive analysis on
the identification of critical statements using various strategies, including lexical representations,
embeddings, and LLMs. Specifically, they evaluated the use of LLMs as direct classifiers by applying
zero-shot and few-shot prompting strategies, without requiring additional training.</p>
      <p>Large Language Models (LLMs) are emerging as a promising component in fact-checking systems. In
this work, we explore their use in both stages of the fact-checking architecture.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>To address this challenge, we used only the development dataset of the CheckThat! 2025 Lab [2], which
contains elements in JSON format with the following structure:</p>
      <p>{
  "claim": "San Luis no tiene empleados públicos en exceso; hoy son 17 mil",
  "crawled_date": "2011-09-26",
  "country_of_origin": "argentina",
  "doc": "Las cifras del Presupuesto provincial son consistentes con esa afirmación. No así los resultados
de la Encuesta Permanente de Hogares. Otros indicadores muestran que la provincia tiene una alta tasa
de empleo en negro y de planes sociales. San Luis es [la provincia] mejor administrada de la República
porque no tiene empleados públicos en exceso; hoy son 17 mil e</p>
      <p>
        The test dataset, with 1,806 claims, has a different structure than the development dataset, with 377
claims: it contained only the claims, without associated documents. For this reason, it was
necessary to perform an additional task to collect the corresponding evidence documents for each claim.
To obtain the evidence, we used the corpus provided by the challenge organizers. In this context, we
considered two alternatives: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) using the pre-selected evidence provided by the organizers, which
was retrieved using the BM25 algorithm, and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) developing a custom function to perform automatic
evidence retrieval from the corpus. The procedures implemented in both strategies are detailed in the
following sections.
      </p>
      <p>Table 1 presents the statistics of the development dataset. It shows the class distribution, the
document lengths, and the average evidence length, measured in characters.</p>
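      <p>Such statistics can be reproduced with a short script; the following is a minimal sketch, assuming each development entry has been parsed into a dict with claim, doc, and label keys (the sample entries are illustrative, not taken from the dataset):</p>
      <p>
```python
from collections import Counter

def dataset_stats(entries):
    """Class distribution and average document length in characters."""
    labels = Counter(e["label"] for e in entries)
    avg_doc_len = sum(len(e["doc"]) for e in entries) / len(entries)
    return labels, avg_doc_len

# Illustrative entries; in practice these are loaded from the development JSON.
entries = [
    {"claim": "c1", "doc": "x" * 100, "label": "True"},
    {"claim": "c2", "doc": "y" * 200, "label": "False"},
]
labels, avg_doc_len = dataset_stats(entries)  # avg_doc_len == 150.0
```
      </p>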
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>From each entry of the development dataset, we extracted: the claim, the doc (the document containing
the associated evidence) and the label.</p>
      <p>
        Our proposed approach is based on the use of large language models (LLMs), which were employed in
two different ways: (
        <xref ref-type="bibr" rid="ref1">1</xref>
        ) as a classifier, assigning a label to each claim based on the available evidence,
and (
        <xref ref-type="bibr" rid="ref2">2</xref>
        ) as an evidence retriever, for identifying relevant fragments within the document that are later
used to support, refute, or indicate conflict regarding the claim.
      </p>
      <p>Both approaches were configured under two different inference schemes: zero-shot, in which the
model performs the task without having received prior examples during the interaction, and few-shot,
where representative examples are provided within the prompt to guide its behavior and attempt to
improve accuracy in classification or evidence extraction.</p>
      <p>The following section provides a detailed description of both stages of the approach.</p>
      <sec id="sec-4-1">
        <title>4.1. Claim classifier</title>
        <p>In this first approach, the LLM was instructed to act as an automatic claim classifier. Its goal was to
analyze each claim along with its associated evidence and assign a veracity label based on the provided
content. The instructions used are shown below:
You are an assistant focused on fact verification involving numerical claims and temporal expressions.
Your role is to classify the truthfulness of each claim based on the evidence provided.</p>
        <p>Numerical claims are defined as those requiring validation of explicit or implicit quantitative or temporal details.
You will receive data in JSON format. Each instance will include:
- a claim (the statement to verify),
- and an evidence (the supporting or contradicting information).
You must analyze the evidence and assign one of the following labels to the claim:
- True
- False
- Conflicting
Only respond with the label. No additional explanation is required.</p>
        <p>In the case of the few-shot approach, the same base instructions as in the zero-shot scenario were
used, but labeled examples were added to the prompt in order to guide the model in the task. To do this,
the following instruction was included:
You will first receive a few labeled examples. Use these examples as guidance to determine the correct label for the
following unlabeled instances. Examples:</p>
        <p>Following this instruction, an example of a claim with its corresponding evidence and label was
included for each of the possible classes (True, False, and Conflicting ). These examples were intended
to illustrate both the expected format and the necessary criteria for the model to perform a coherent
classification aligned with the task. It is worth noting that, in order to identify the examples to be used,
a manual curation was carried out to obtain representative examples.</p>
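        <p>The few-shot prompt assembly described above can be sketched as follows (a minimal sketch: the function name and the abridged instruction texts are ours, not taken verbatim from the system):</p>
        <p>
```python
# Abridged versions of the instructions shown in this section.
BASE_INSTRUCTIONS = (
    "You are an assistant focused on fact verification involving numerical "
    "claims and temporal expressions. Classify each claim as True, False, "
    "or Conflicting based on the evidence. Only respond with the label."
)
FEW_SHOT_PREFIX = (
    "You will first receive a few labeled examples. Use these examples as "
    "guidance to determine the correct label for the following unlabeled "
    "instances. Examples:"
)

def build_prompt(examples, claim, evidence):
    """One manually curated example per class precedes the target instance."""
    parts = [BASE_INSTRUCTIONS, FEW_SHOT_PREFIX]
    for ex in examples:
        parts.append(
            f'{{"claim": "{ex["claim"]}", "evidence": "{ex["evidence"]}", '
            f'"label": "{ex["label"]}"}}'
        )
    parts.append(f'{{"claim": "{claim}", "evidence": "{evidence}"}}')
    return "\n".join(parts)
```
        </p>
        <p>In the actual system one example per class (True, False, Conflicting) was supplied, so the model sees both the expected format and a representative instance of each label.</p>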
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Relevant evidence retriever</title>
        <p>In a second approach, the LLM was instructed to act as an evidence retriever, with the goal of identifying
relevant fragments within the document that could later be used in the classification of the claim.
Specifically, the model was asked to extract three sentences from the doc field, assigning each one a
label indicating its relation to the claim (True, False, or Conflicting ), depending on whether the sentence
supports, contradicts, or shows ambiguity with respect to the claim. The instructions used for the
zero-shot case are provided below:</p>
        <p>In the case of the few-shot approach, the same base instructions as in the zero-shot scenario were
used, but labeled examples were added to the prompt in order to guide the model in the task. To do this,
the following instruction was included:
Here are some examples. Use these examples as a guide to get the correct sentences.</p>
        <p>Once the relevant sentences were obtained, the following logic was applied to determine the final
label of the claim:
• If at least one of the sentences was labeled as Conflicting, the claim was classified as Conflicting.
• In the absence of Conflicting sentences, if at least one sentence was labeled as False, the claim
was classified as False.
• If all the sentences were labeled as True, the claim was classified as True.
• In cases where the retriever failed to identify relevant fragments and therefore returned no
sentences, the claim was classified as False by default.</p>
        <p>In this work, we adopted this heuristic under the assumption that the absence of relevant evidence
suggests that the claim is false. If a claim were true or conflicting, it is likely that verifiable information
associated with it would exist in the document corpus. Therefore, a complete lack of retrieved evidence
could lead us to conclude that the claim is false in this context. It is possible that there are cases in
which no evidence exists, either because the evidence retriever failed to find the relevant evidence
or because such evidence simply does not exist. In either case, a default class must be chosen at the
time of classification. We acknowledge that an alternative could be to classify these cases as “not
enough information for a verdict”; however, this class is not used in the current system. It is clear that
this heuristic may introduce bias toward the False class. Nevertheless, introducing a new class like
NOT ENOUGH INFO could also introduce bias toward that class, since the classification stage strongly
depends on the evidence retrieval stage.</p>
        <sec id="sec-4-2-1">
          <title>4.2.1. Use of evidence provided by the organizers</title>
          <p>The organizers provided a file containing the claims along with the 100 most relevant pieces of evidence
retrieved from the corpus using the BM25 algorithm. Each entry in the file had the following structure:
{
  "doc_id": [list of document identifiers],
  "scores": [list of scores associated with each selected evidence],
  "query_id": "claim identifier number",
  "claim": "claim",
  "docs": [list of corresponding evidence]
}</p>
          <p>For the experiments, the five pieces of evidence with the highest scores were selected for each claim.
These pieces of evidence were concatenated into a single text, which served as the input document (doc).
This document, along with the corresponding claim, was sent to the assistant to assign the veracity
label or to extract the most relevant evidence.</p>
        </sec>
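        <p>The precedence rules described in this section can be sketched as a small aggregation function (a minimal sketch; the function name is ours, not from the system's code):</p>
        <p>
```python
def aggregate_label(sentence_labels):
    """Combine per-sentence verdicts into a claim-level label.

    Precedence: Conflicting > False > True; an empty list falls back to
    False, the default class discussed in this section.
    """
    if not sentence_labels:
        return "False"
    if "Conflicting" in sentence_labels:
        return "Conflicting"
    if "False" in sentence_labels:
        return "False"
    return "True"

label = aggregate_label(["True", "False"])  # mixed verdicts resolve to "False"
```
        </p>
        <p>Note how a mix of True and False verdicts with no explicit Conflicting sentence resolves to False, the behavior later discussed in the limitations section.</p>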
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Previous Experiments on Fact Checking</title>
      <p>Before addressing Task 3 of the CheckThat! 2025 challenge (Fact-Checking of Numerical Claims),
we ran preliminary experiments using the dataset from the CheckThat! 2024 Rumor Verification task
[8]. The goal of this task was to explore information retrieval techniques and select the most suitable
one as the relevant evidence retriever.</p>
      <p>We conducted a comprehensive evaluation of a wide range of retrieval techniques, grouped into
three main categories:
• Traditional statistical techniques [9, 10, 11, 12, 13, 14, 15, 16]: BM25, BM25PLUS, BM25-OKAPI,
BM25+PL2, BM25LARGE, PL2, TF-IDF, DPH, DPH+PL2, DLH, LEMUR, HIEMSTRA, DFIZ, DFIC,
DFR-BM25.
• Embedding-based models [17, 18, 19]: OpenAI: text-embedding-3-large, text-embedding-3-small;
Gemini: text-embedding-005-FACT-VERIFICATION, text-embedding-005-SEMANTICSIMILARITY,
text-embedding-005-RETRIEVAL-DOCUMENT; SBERT: SBERT-all-MiniLM-L6-v3.
• Large Language Models (LLMs) used as semantic retrievers [20, 21, 22, 23]: GPT-4o:
gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18; Qwen: qwen2.5-72b; LLAMA: LLAMA3.2LARGE;
Deepseek: deepseek-llm-67b.</p>
      <p>These techniques were evaluated on the CheckThat! 2024 development set, which contains 32 rumors,
using two evaluation metrics: Recall@5 (R@5) [24], which measures the model’s ability to retrieve at
least one relevant document among the top 5 results, and Mean Average Precision (MAP), which
measures the model’s ability to rank relevant documents higher across the entire result list. The key
results are summarized in Table 2. Below we draw some conclusions from these results.</p>
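      <p>The two metrics can be made concrete with a short sketch (our own implementation of the standard definitions; R@5 is computed per query as the hit rate described above and averaged over queries):</p>
      <p>
```python
def recall_at_k(ranked, relevant, k=5):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant for d in ranked[:k]) else 0.0

def average_precision(ranked, relevant):
    """Mean of precision values at the ranks where relevant docs appear,
    normalized by the number of relevant documents."""
    hits, precisions = 0, []
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0
```
      </p>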
      <p>LLMs as Retrievers. Large Language Models (LLMs) demonstrated a remarkable improvement
in semantic retrieval capabilities. Notably, GPT-4o-2024-08-06 achieved the best overall performance,
with R@5=0.940 and MAP=0.918, significantly outperforming all other techniques. This performance
indicates a deep semantic understanding of claims and an outstanding ability to locate relevant evidence,
even when it is expressed implicitly or through paraphrasing. Other LLMs, such as Qwen 2.5-72B and
GPT-4o-mini, also delivered strong results (MAP &gt; 0.74), albeit with slightly lower recall. In contrast,
models like DeepSeek-67B and LLAMA3.2 Large showed significantly lower performance, possibly due
to limitations in their training or in how they represent truthfulness or evidential relations in retrieval
tasks.</p>
      <p>Embedding-based Models. Pretrained embedding models offered an interesting balance between
efficiency and quality. The OpenAI text-embedding-3-large model achieved notable results
(R@5=0.800, MAP=0.731), approaching the performance of some LLMs while maintaining a lower
computational cost. Within this category, Gemini-FACT_VERIFICATION remained competitive (MAP
= 0.695), confirming its suitability for contextual verification tasks. SBERT also yielded acceptable
results considering its lightweight nature.</p>
      <p>Traditional Statistical Techniques. While often considered baselines, traditional statistical
retrieval methods such as BM25, PL2, TF-IDF, and DFR variants demonstrated surprisingly
competitive performance, particularly when evaluating MAP. Notably, DFIZ achieved a MAP of 0.680 and
DPH reached 0.676, both comparable to modern embedding models such as GEMINI-fact-verification
(MAP=0.695), GEMINI-semantic-similarity (MAP=0.685), and OpenAI-embedding-3-small (MAP=0.674).
In fact, DFIZ and DPH outperformed other embedding-based retrievers like GEMINI-retrieval-document
(MAP=0.670), showing that statistical methods can still offer strong baseline performance under the right
conditions. These results suggest that, despite lacking semantic understanding, term-frequency-based
approaches remain viable for certain fact-checking tasks—particularly when the evidence is lexically
similar to the claim. Their lower computational cost and strong MAP scores make them attractive for
large-scale or resource-constrained settings.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Checkthat 2025 Results</title>
      <p>To address Task 3 of the CheckThat! 2025 Lab, which consists in verifying numerical and temporal
claims in Spanish, we implemented and evaluated two distinct pipelines using the gpt-4o-2024-08-06
model. This model was selected based on its outstanding retrieval performance in the preliminary
experiments (MAP = 0.918, R@5 = 0.940), as shown in Section 5.</p>
      <p>We explored two main strategies:
• Direct classifier: a single-stage prompt that directly predicts the veracity label (True, False, or
Conflicting) given the claim and evidence.
• Evidence collection + classification: a two-stage pipeline where evidence is first retrieved
using a semantic retriever (top-5 from BM25 preselection), and then the claim is classified based
on this curated evidence.</p>
      <p>Each strategy was evaluated using two prompting approaches:
• Zero-shot: the LLM receives only the task instruction.
• Few-shot: the LLM is provided with a small number of labeled examples before prediction.</p>
      <p>The experiments were carried out using the OpenAI assistant configuration, with a top-p value
set to 1 and a temperature of 0.01. This setup minimizes the randomness of the generated responses,
making the model highly deterministic by prioritizing the most probable tokens at each step, while still
considering the full probability distribution due to the top-p value of 1.</p>
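      <p>As an illustration, this configuration can be expressed as a payload for the OpenAI chat completions endpoint (a sketch only: the helper name and the abridged system prompt are ours):</p>
      <p>
```python
def build_request(claim: str, evidence: str) -> dict:
    """Payload for a near-deterministic classification call."""
    system = (
        "You are an assistant focused on fact verification involving "
        "numerical claims and temporal expressions. Respond only with "
        "True, False, or Conflicting."
    )
    return {
        "model": "gpt-4o-2024-08-06",
        "temperature": 0.01,  # minimize sampling randomness
        "top_p": 1,           # keep the full probability distribution
        "messages": [
            {"role": "system", "content": system},
            {"role": "user",
             "content": f'{{"claim": "{claim}", "evidence": "{evidence}"}}'},
        ],
    }
```
      </p>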
      <p>We report the macro-averaged F1 score [25] on the official Spanish development set. The results are
summarized in Table 3. This metric was selected as it aligns with the official evaluation criteria used by
the competition organizers, ensuring consistency and comparability with the leaderboard results.</p>
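      <p>Macro-averaged F1 weights the three classes equally regardless of their frequency, which matters given the skew toward False claims; a minimal sketch of the computation (our own implementation of the standard definition):</p>
      <p>
```python
def macro_f1(gold, pred, labels=("True", "False", "Conflicting")):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for lab in labels:
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)
```
      </p>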
      <p>The few-shot evidence+classifier pipeline yielded the best result (F1 = 0.672), indicating that combining
relevant evidence selection with prompting examples significantly improves performance. In contrast,
the zero-shot direct classification approach underperformed, achieving only F1 = 0.349, highlighting
the importance of both evidence quality and task-specific guidance.</p>
      <sec id="sec-6-1">
        <title>6.1. Checkthat Submission Results</title>
        <p>In the Checkthat 2025 Lab, we participated in Task 3 (Fact-Checking Numerical Claims) for the Spanish
dataset. According to the competition rules, teams were allowed to submit only one final run for
evaluation on the official test set.</p>
        <p>Based on our validation results (see Table 3), we selected the Evidence + classifier strategy with
few-shot prompting using the gpt-4o-2024-08-06 model, as it achieved the best macro F1 (0.672)
on the development set.</p>
        <p>The final submission was evaluated by the organizers using macro-averaged F1, as well as
class-specific F1 scores.</p>
        <p>Our approach achieved the third-best macro F1 score (0.3595) in the official CheckThat! 2025 Task 3
evaluation for Spanish (see Table 4). While not leading in overall ranking, our model demonstrated
particularly strong performance in identifying False claims, with an F1 score of 0.7443, indicating a
reliable ability to detect and reject incorrect numerical or temporal assertions. However, the F1 scores
for the True (0.1853) and Conflicting (0.1490) labels were significantly lower on the test dataset.
This disparity in class-wise performance suggests a particular challenge in distinguishing and handling
claims that are not clearly "False", an aspect we address in more detail in the limitations and future
perspectives of this work.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations of the Approach</title>
      <p>Despite the promising results obtained by the proposed pipeline, several limitations must be
acknowledged. The computational cost associated with using large language models like gpt-4o could be
a practical limitation for large-scale deployments or resource-constrained environments, despite the
observed performance gains. This limitation highlights the need to explore smaller models in future
work.</p>
      <p>Furthermore, the effectiveness of the LLM-based components, particularly in the few-shot
configuration, relies heavily on the quality and representativeness of the examples provided; poorly chosen
examples can lead to suboptimal classifications or evidence extraction. The classification is highly
dependent on the evidence retrieved; in cases where relevant evidence is not retrieved by BM25, the
system may fail to properly assess the claim’s veracity.</p>
      <p>A crucial limitation that explains the lower F1 scores for the "True" and "Conflicting" labels lies
in the classification heuristic applied. When the evidence retriever identifies relevant sentences, the
current logic assigns specific priority to the verdicts of the extracted sentences. If at least one sentence
is "Conflicting", the claim is classified as "Conflicting". In the absence of "Conflicting" sentences, if at
least one sentence is "False", the claim is classified as "False". Only if all sentences are "True" is the claim
classified as "True". This precedence implies that if sentences with both "True" and "False" verdicts are
retrieved simultaneously (i.e., contradictions exist among the relevant sentences) and no sentence is
explicitly labeled as "Conflicting" by the LLM, the system tends to classify the claim as "False" due to
the priority given to the "False" verdict. This may result in the failure of identification of "Conflicting"
or even "True" claims when the retrieved evidence is not uniformly positive.</p>
      <p>Likewise, in cases where the retriever failed to identify relevant fragments and therefore returned no
sentences, the claim was classified as False by default. This heuristic, adopted under the assumption that
the absence of verifiable evidence suggests the claim is false in this context, further biases the system
toward the "False" classification, contributing to the poor performance in the "True" and "Conflicting"
categories and potentially reducing the overall robustness of the classification.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Perspectives for Future Work</title>
      <p>This paper presented an effective two-step architecture for numerical fact-checking, leveraging the
power of large language models within a hybrid pipeline. Our approach, combining semantic evidence
retrieval with a few-shot classification strategy using the gpt-4o-2024-08-06 model, achieved a
competitive macro F1 score (0.3595) and ranked third on the CheckThat! 2025 Task 3 official test set
for the Spanish dataset [1]. The results underscore the significant impact of evidence quality and
task-specific guidance through few-shot prompting on fact verification performance.</p>
      <p>For future work, several promising avenues remain open for exploration. One potential direction is
to investigate alternative evidence retrieval strategies, including the full implementation and evaluation
of embedding-based retrieval methods using dedicated performance metrics. Such approaches could
enrich the initial evidence pool and potentially lead to improved overall system performance.</p>
      <p>In addition, to more effectively address the identified limitations in the classification of "True" and
"Conflicting" claims, there is a need to improve the classification heuristic. Specifically, we will explore
a more sophisticated logic to handle situations where the retrieved sentences present contradictions
(e.g., a mix of "True" and "False" verdicts without an explicitly "Conflicting" sentence). The goal is to
ensure that such cases are more accurately classified as "Conflicting", better reflecting the inherent
ambiguity of contradictory evidence.</p>
      <p>Finally, given the computational cost and scalability limitations of large models like GPT-4o, exploring
the potential of smaller language models is identified as a priority for future work. This line of research
will include both a comparative evaluation and the application of fine-tuning techniques specifically
tailored to fact-checking tasks.</p>
      <p>Fine-tuning lighter models will significantly reduce deployment costs and also open up the possibility
of training models to capture complex semantic nuances. For example, in situations where all retrieved
evidence appears to be "True" or "False", but the claim contains a subtlety that requires a "Conflicting"
classification, a properly fine-tuned model could learn to recognize these patterns beyond traditional
heuristic rules. By incorporating such subtleties into supervised training, we expect to enhance the
system’s ability to produce more accurate and robust judgments.</p>
      <p>These future work directions are considered fundamental steps to increase both the technical viability
and the practical applicability of the proposed architecture, especially in contexts where efficiency is
required without sacrificing accuracy.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This work was partially supported by UNAM PAPIIT project IN104424 and by the Mexican Government
through SECIHTI Project FC-2023-G-64.</p>
    </sec>
    <sec id="sec-10">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT for grammar and spelling checking.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
      <p>on Research and Development in Information Retrieval, SIGIR ’24, Association for Computing
Machinery, New York, NY, USA, 2024, pp. 650–660. doi:10.1145/3626772.3657874.
[3] Z. Guo, M. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Transactions of the</p>
      <p>Association for Computational Linguistics 10 (2022) 178–206. doi:10.1162/tacl_a_00454.
[4] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: C. Danescu-
Niculescu-Mizil, J. Eisenstein, K. McKeown, N. A. Smith (Eds.), Proceedings of the ACL 2014
Workshop on Language Technologies and Computational Social Science, Association for
Computational Linguistics, Baltimore, MD, USA, 2014, pp. 18–22. URL: https://aclanthology.org/W14-2508/.
doi:10.3115/v1/W14-2508.
[5] A. Adesokan, S. Elbassuoni, Factify: An automated fact-checker for web information, in:
2024 IEEE International Conference on Big Data (BigData), 2024, pp. 1546–1551. doi:10.1109/
BigData62323.2024.10825147.
[6] L. Kolb, A. Hanbury, Authev-lkolb at checkthat! 2024: a two-stage approach to evidence-based
social media claim verification, Faggioli et al.[22] (2024).
[7] G. E. Pianciola Bartol, A. Tommasel, Towards automated fact-checking: An exploratory study on
identifying check-worthy phrases for verification, in: 2024 L Latin American Computer Conference
(CLEI), 2024, pp. 1–10. doi:10.1109/CLEI64178.2024.10700241.
[8] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari,
M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The clef-2024 checkthat! lab: Check-worthiness,
subjectivity, persuasion, roles, authorities, and adversarial robustness, in: N. Goharian, N.
Tonellotto, Y. He, A. Lipani, G. McDonald, C. Macdonald, I. Ounis (Eds.), Advances in Information
Retrieval, Springer Nature Switzerland, Cham, 2024, pp. 449–458.
[9] A. Trotman, A. Puurula, B. Burgess, Improvements to bm25 and language models examined, in:
Proceedings of the 19th Australasian Document Computing Symposium, ADCS ’14, Association
for Computing Machinery, New York, NY, USA, 2014, p. 58–65. doi:10.1145/2682862.2682863.
[10] Kocabaş, B. T. Dinçer, B. Karaoğlan, A nonparametric term weighting method for information
retrieval based on measuring the divergence from independence, Information Retrieval 17 (2013)
153–176. doi:10.1007/s10791-013-9225-4.
[11] G. Amati, E. Ambrosi, M. Bianchi, C. Gaibisso, G. Gambosi, Fub, iasi-cnr and university of tor
vergata at trec 2007 blog track, 2007.
[12] G. Amati, Frequentist and bayesian approach to information retrieval, 1970, pp. 13–24. doi:10.</p>
      <p>1007/11735106_3.
[13] B. He, I. Ounis, Term frequency normalisation tuning for bm25 and dfr models, in: D. E. Losada,
J. M. Fernández-Luna (Eds.), Advances in Information Retrieval, Springer Berlin Heidelberg, Berlin,
Heidelberg, 2005, pp. 200–214.
[14] S. Robertson, H. Zaragoza, The probabilistic relevance framework: Bm25 and beyond, Foundations
and Trends® in Information Retrieval 3 (2009) 333–389. doi:10.1561/1500000019.
[15] J. Perea-Ortega, M. García-Cumbreras, M. García-Vega, L. López, Comparing several textual
information retrieval systems for the geographical information retrieval task, volume 5039, 2008,
pp. 142–147. doi:10.1007/978-3-540-69858-6_15.
[16] D. Hiemstra, Using language models for information retrieval, 2001.
[17] OpenAI, text-embedding-3: Openai embedding models, 2024. URL: https://platform.openai.com/
docs/guides/embeddings, accessed: 2025-05-28.
[18] G. Research, Generalizable embeddings from gemini, arXiv preprint arXiv:2503.07891 (2025). URL:
https://arxiv.org/abs/2503.07891.
[19] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using Siamese BERT-networks, in:
K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019,
pp. 3982–3992. doi:10.18653/v1/D19-1410.
[20] OpenAI, GPT-4 technical report, CoRR abs/2303.08774 (2023). doi:10.48550/ARXIV.2303.</p>
      <p>08774. arXiv:2303.08774.
[21] Qwen, et al., Qwen2.5 technical report, 2025. URL: https://arxiv.org/abs/2412.15115.</p>
      <p>arXiv:2412.15115.
[22] A. G. et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783.</p>
      <p>arXiv:2407.21783.
[23] D.-A. et al., Deepseek llm: Scaling open-source language models with longtermism, 2024. URL:
https://arxiv.org/abs/2401.02954. arXiv:2401.02954.
[24] A. Yates, R. Nogueira, J. Lin, Pretrained transformers for text ranking: Bert and beyond, SIGIR ’21,
Association for Computing Machinery, New York, NY, USA, 2021, p. 2666–2668. doi:10.1145/
3404835.3462812.
[25] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34
(2002) 1–47. doi:10.1145/505282.505283.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bendou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Iturra-Bocaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          ,
          CLEF
          <year>2025</year>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>Quantemp: A real-world open-domain benchmark for factchecking numerical claims</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, Association for Computing Machinery, New York, NY, USA, 2024, pp. 650–660. doi:10.1145/3626772.3657874</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>