KSU at CheckThat! 2025: Two-stage approach to fact-checking numerical claims

Keito Fukuoka

Hisashi Miyamori

Kyoto Sangyo University of Japan (KSU University), Kamigamo Motoyama, Kita-ku, Kyoto City, Kyoto, Japan

2025

The spread of misinformation containing numerical claims online poses a severe threat, undermining the very foundation of democracy. This paper proposes a fact-checking method for automatically determining the veracity of claims that include numerical and temporal elements. The proposed method consists of a two-stage process: evidence retrieval and classification. Specifically, it combines comprehensive evidence retrieval using a Contriever model enhanced by SimCSE-based contrastive learning with a classification method that extracts crucial evidence using a Large Language Model (LLM). For experiments, we used the English dataset provided by CheckThat! 2025 Task 3. In the evidence retrieval task, the Contriever model with SimCSE-based contrastive learning achieved a Recall@100 of 0.524, significantly outperforming conventional methods like BM25. Conversely, in the classification task, the method utilizing search results from BM25 achieved the highest performance with a macro F1 of 0.5054. A significant insight gained from this study is that improvements in evidence retrieval ranking accuracy do not necessarily directly lead to enhanced classification performance.

Keywords: Fact-checking, Numerical claims, Evidence retrieval, Contrastive learning

1. Introduction

The spread of misinformation online, particularly prominent during election periods, not only triggers social and political unrest but also poses a severe threat, undermining the very foundation of democracy [1]. Among various forms of misinformation, verifying claims that include numerical and temporal elements is of paramount importance in fact-checking. Indeed, numerical claims constitute a significant component of political discourse.

This paper addresses the CheckThat! Lab's Task 3: Fact-checking numerical claims [2]. The objective of this task is to determine the veracity of claims containing numerical quantities and temporal expressions. For each claim, participants are provided with a short list of evidence and are required to classify the claim as "True," "False," or "Conflicting" based on this evidence.

We propose a two-stage fact-checking method consisting of an evidence retrieval step enhanced by contrastive learning and a classification step that incorporates LLM-based crucial evidence extraction. First, in the evidence retrieval step, we observed that claims and their corresponding evidence often have different phrasings, even when their content is highly relevant. To comprehensively retrieve highly relevant evidence, we adopted an evidence retrieval system composed of a Contriever model further trained with SimCSE-based contrastive learning to capture the semantic relevance between claims and evidence. Second, in the classification step using the retrieved evidence, we confirmed that gold evidence in this task tends to be lengthy. To mitigate any negative impact on classification, we therefore adopted a method that uses an LLM to extract important information from the evidence and then performs classification based on these extracted results.

2. Related Work

Automated fact-checking has garnered significant attention as a crucial countermeasure against online misinformation [3, 4, 5]. Existing fact-checking research has largely been limited to synthetic claims [6] and non-numerical claims [7], with a notable lack of focus on claims containing numerical information.

Addressing this gap, Viswanathan et al. constructed QUANTEMP [8], an open-domain benchmark specifically designed for real-world numerical claims. QUANTEMP is a diverse dataset encompassing comparisons, statistics, durations, and temporal aspects, offering detailed metadata and evidence. Using this dataset, they evaluated the limitations of existing methods and presented new challenges in numerical claim verification.

Task 3 (Fact-checking numerical claims), which we address in this paper, aligns with the challenges posed by QUANTEMP. This task defines two sub-tasks for determining the veracity of claims: an "evidence retrieval task" to search for relevant evidence and a "classification task" to categorize claims based on that evidence.

3. Method

This task broadly consists of the following two components:
• Evidence retrieval task: retrieving evidence relevant to a given claim.
• Classification task: determining whether a claim is True, False, or Conflicting based on the claim and retrieved evidence.

3.1. Task Formulation

This task is formulated as follows. Given a claim $c \in \mathcal{C}$ ($\mathcal{C}$ is the claim space) as a query, and a sequence of top-$k$ retrieved evidences $E_k = (e_1, e_2, \ldots, e_k) \in \mathcal{E}$ ($\mathcal{E}$ is the evidence sequence space) obtained by a retrieval system $R$, a classification function $f$ outputs a label $y \in \mathcal{L} = \{\text{True}, \text{False}, \text{Conflicting}\}$:

$$f : (\mathcal{C}, \mathcal{E}) \rightarrow \mathcal{L} \tag{1}$$

Here, the process of obtaining the evidence sequence $E_k$ for a claim $c$ by the retrieval system $R$ is expressed as follows:

$$E_k = \text{top-}k\left(\operatorname{sort}_{d \in D_c}(\operatorname{score}(c, d))\right) = (e_1, e_2, \ldots, e_k) \quad \text{s.t.} \quad \operatorname{score}(c, e_1) \geq \operatorname{score}(c, e_2) \geq \cdots \geq \operatorname{score}(c, e_k) \tag{2}$$

where $D_c$ is the set of evidences relevant to claim $c$, $D_c = \{d_1, d_2, \ldots, d_n\}$, $\operatorname{score}(q, d)$ is a function that returns the relevance score of document $d$ for query $q$, $\operatorname{sort}_{x \in X}(g(x))$ is a function that sorts each element $x$ in set $X$ in descending order based on the value of function $g(x)$, $\text{top-}k(S)$ is a function that returns the top-$k$ elements of sequence $S$, and $e_i$ represents the $i$-th evidence.

Furthermore, each label $y \in \mathcal{L}$ represents one of the following three types of content:
• True: Based on the retrieved evidence, the claim is determined to be true.
• False: Based on the retrieved evidence, the claim is determined to be false.
• Conflicting: Based on the retrieved evidence, it is not possible to determine whether the claim is true (insufficient evidence or conflicting content).

The classification model takes the claim $c$ and the retrieved evidence sequence $E_k$ as input and outputs the probability $P(y \mid c, E_k)$ for label $y$:

$$P(y \mid c, E_k) = \operatorname{softmax}(h(c, E_k)) \tag{3}$$

where $h(c, E_k)$ represents the feature representation by a neural network, and $y \in \mathcal{L} = \{\text{True}, \text{False}, \text{Conflicting}\}$ represents the predicted label.
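As a rough illustration of the formulation above, the following Python sketch ranks candidate evidence by a score function, keeps the top-k items, and turns one score per label into probabilities with a softmax. The token-overlap scorer, the example sentences, and the label scores are hypothetical stand-ins chosen for this sketch; they are not the retrieval or classification models used in this paper.

```python
import math
from typing import Callable, List, Sequence


def top_k_evidence(claim: str, candidates: Sequence[str],
                   score: Callable[[str, str], float], k: int) -> List[str]:
    """Return the k candidates with the highest score(claim, candidate)."""
    ranked = sorted(candidates, key=lambda e: score(claim, e), reverse=True)
    return ranked[:k]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one score per label."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]


def overlap_score(claim: str, evidence: str) -> float:
    """Hypothetical stand-in for BM25 or a dense retriever: shared-token count."""
    return float(len(set(claim.lower().split()) & set(evidence.lower().split())))


claim = "Unemployment fell to 4 percent in 2019"
candidates = [
    "The weather was mild in 2019",
    "Unemployment fell to 4 percent in 2019 according to the bureau",
    "Exports rose sharply",
]
evidence_k = top_k_evidence(claim, candidates, overlap_score, k=2)

# Illustrative scores for the labels True, False, Conflicting (not model output).
probs = softmax([2.0, 0.5, -1.0])
```

Any scoring function with the same signature can be passed to `top_k_evidence`, which is what lets sparse and dense retrievers be swapped without changing the classification stage.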

The final predicted label $\hat{y}$ is determined as follows:

$$\hat{y} = \arg\max_{y \in \mathcal{L}} P(y \mid c, E_k) \tag{4}$$

Thus, it is important to note that the classification task depends on the results of the evidence retrieval task, and the ranking performance of the retrieval system affects the accuracy of the classification results.

3.2. Evidence Retrieval

3.2.1. Dataset Construction for Evidence Retrieval Evaluation

Evidence retrieval is the process of selecting highly relevant evidence for a given claim. In this task, explicit claim-evidence pairs are not provided in the supplied data, which makes evaluating ranking performance challenging. To address this, we explicitly constructed claim-evidence pairs by leveraging the gold evidences present in the validation data.

Let $C = \{c_1, c_2, \ldots, c_N\}$ be the set of claims in the validation data, and $g_i$ be the gold evidence corresponding to each claim $c_i$. We segmented each gold evidence $g_i$ into individual sentences to obtain a set of evidence sentences $E_i = \{e_{i,1}, e_{i,2}, \ldots, e_{i,m_i}\}$. The relevance label $y(c_i, e)$ is defined as follows:

$$y(c_i, e) = \begin{cases} 1 & \text{if } e \in E_i \\ 0 & \text{if } e \in \bigcup_{j \neq i} E_j \end{cases} \tag{5}$$

Through this process, we constructed a dataset $D = \{(c_i, e_j, y(c_i, e_j)) \mid i \in \{1, \ldots, N\}, j \in \{1, \ldots, |E_{\text{all}}|\}\}$ consisting of 13,019 claim-evidence pairs (train: 9,935 pairs, dev: 3,084 pairs), enabling quantitative evaluation of ranking performance in evidence retrieval. Here, $E_{\text{all}} = \bigcup_{i=1}^{N} E_i$ represents the set of all evidence sentences.

3.2.2. SimCSE-based Contrastive Learning for Contriever

When retrieving evidence sentences relevant to claims using dense retrieval, models may struggle to generalize to novel topics not present in the training data, potentially performing worse than conventional sparse retrieval methods like BM25 [9]. Contriever [10], a dense retriever pre-trained with contrastive learning in an unsupervised manner, has been shown to outperform BM25 in terms of Recall@100 on various datasets. Therefore, to enable the Contriever model to more comprehensively retrieve evidence sentences relevant to claims, we further trained the model using SimCSE-based contrastive learning on the semantic relatedness between claims and gold evidences.

Contrastive learning is performed using claim $c_i$ and a sentence $e_{i,j}$ extracted from its gold evidence as a positive pair, and claim $c_i$ and a sentence $e_{i',j'}$ ($i' \neq i$) extracted from another claim's gold evidence as a negative pair.

First, the Contriever encoder is used to convert claim $c_i$ and evidence sentence $e_{i,j}$ (or $e_{i',j'}$) into vector representations:

$$\mathbf{h}_c = \operatorname{MeanPooling}(\operatorname{Contriever}(c_i), \text{mask}_c) \tag{6}$$

$$\mathbf{h}_e = \operatorname{MeanPooling}(\operatorname{Contriever}(e_{i,j}), \text{mask}_e) \tag{7}$$

Here, MeanPooling is an average pooling operation that considers the attention mask, and $\text{mask}_c$ and $\text{mask}_e$ are the attention masks for the claim and evidence sentence, respectively.

The resulting representation vectors $\mathbf{h}_c \in \mathbb{R}^{d}$ and $\mathbf{h}_e \in \mathbb{R}^{d}$ are then transformed by an MLP layer:

$$\mathbf{h}'_c = \sigma(\mathbf{W}\mathbf{h}_c + \mathbf{b}) \tag{8}$$

$$\mathbf{h}'_e = \sigma(\mathbf{W}\mathbf{h}_e + \mathbf{b}) \tag{9}$$

where $\sigma$ is the activation function, $\mathbf{W} \in \mathbb{R}^{d' \times d}$ is the weight matrix, $\mathbf{b} \in \mathbb{R}^{d'}$ is the bias vector, and $d'$ is the dimension after transformation.
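The masked mean pooling and the MLP transformation applied to the encoder outputs can be sketched in plain Python as follows. This is a minimal illustration on hand-written token vectors, not the actual Contriever encoder; the choice of tanh as the activation and all numeric values are assumptions, since the paper does not specify them.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_pooling(token_embeddings: Sequence[Sequence[float]],
                 attention_mask: Sequence[int]) -> Vector:
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    count = max(len(kept), 1)  # guard against an all-zero mask
    return [sum(vec[d] for vec in kept) / count for d in range(dim)]


def mlp_layer(h: Sequence[float], weight: Sequence[Sequence[float]],
              bias: Sequence[float]) -> Vector:
    """Compute activation(W h + b); tanh is an assumed choice of activation."""
    return [math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(weight, bias)]


# Three token vectors (d = 2); the last position is padding and is masked out.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
h = mean_pooling(tokens, mask)

# Hypothetical weights mapping d = 2 to d' = 3.
W = [[0.5, 0.0], [0.0, 0.5], [0.1, 0.1]]
b = [0.0, 0.0, 0.0]
h_prime = mlp_layer(h, W, b)
```

Masking matters because padded positions would otherwise pull the pooled vector toward whatever the encoder emits for padding tokens.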

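With a batch size of $B$, the transformed representations of all claims in the batch are represented as a matrix $\mathbf{H}_c = [\mathbf{h}'_{c,1}, \mathbf{h}'_{c,2}, \ldots, \mathbf{h}'_{c,B}] \in \mathbb{R}^{B \times d'}$, and the transformed representations of all evidence sentences as a matrix $\mathbf{H}_e = [\mathbf{h}'_{e,1}, \mathbf{h}'_{e,2}, \ldots, \mathbf{h}'_{e,B}] \in \mathbb{R}^{B \times d'}$.

The similarity matrix $\mathbf{S} = \mathbf{H}_c(\mathbf{H}_e)^{\top} \in \mathbb{R}^{B \times B}$ is computed within the batch, and the model is trained using the SimCSE loss:

$$\mathcal{L}_{\text{SimCSE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S_{ii}/\tau)}{\sum_{j=1}^{B} \exp(S_{ij}/\tau)} \tag{10}$$

Here, $S_{ii}$ is the similarity between the $i$-th claim and its corresponding positive evidence sentence, $S_{ij}$ ($i \neq j$) is the similarity between the $i$-th claim and the $j$-th evidence sentence (negative example), and $\tau$ is the temperature parameter.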
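The in-batch contrastive loss can be written out directly from the similarity matrix, as in the sketch below. It uses plain Python lists instead of a tensor library, and the temperature value of 0.05 and the two-dimensional toy vectors are assumptions made for illustration only.

```python
import math
from typing import Sequence


def in_batch_contrastive_loss(claim_vecs: Sequence[Sequence[float]],
                              evidence_vecs: Sequence[Sequence[float]],
                              temperature: float = 0.05) -> float:
    """Mean over i of -log softmax(S[i] / tau)[i], with S the dot-product matrix."""
    batch = len(claim_vecs)
    total = 0.0
    for i in range(batch):
        # Row i of the similarity matrix: claim i against every evidence sentence.
        row = [sum(a * b for a, b in zip(claim_vecs[i], e)) / temperature
               for e in evidence_vecs]
        m = max(row)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        total += -(row[i] - log_denominator)
    return total / batch


claims = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # evidence i matches claim i
swapped = [[0.0, 1.0], [1.0, 0.0]]  # evidence i matches the other claim
loss_aligned = in_batch_contrastive_loss(claims, aligned)
loss_swapped = in_batch_contrastive_loss(claims, swapped)
```

The loss is close to zero when each claim is most similar to its own evidence sentence and grows when another sentence in the batch scores higher, which is the signal that pulls claims toward their gold evidence.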

3.3. Classification Task

3.3.1. Evidence Sentence Processing for Classification Model Training

Two primary approaches can be considered for training the classification model:
• Retrieving relevant evidence using a ranking algorithm like BM25 with the claim as a query, and then training the classification model using these results.
• Directly using the gold evidence corresponding to the claim to train the classification model.

While the former allows for automatic evidence acquisition, it carries the risk of retrieving irrelevant sentences, which could negatively impact classification performance. The latter approach is advantageous for leveraging highly reliable evidence. However, gold evidence is generally lengthy and not suitable for direct use in training a classification model. Therefore, we propose extracting crucial information from the gold evidence and transforming it into a format suitable for classification model training.

Given a claim $c \in \mathcal{C}$ ($\mathcal{C}$ is the claim space) and its corresponding gold evidence $g \in \mathcal{G}$ ($\mathcal{G}$ is the gold evidence space), an LLM-based crucial segment extraction function $f_{\text{extract}}$ outputs a set of important evidence sentences $E_{\text{ext}} = \{e^{\text{ext}}_1, e^{\text{ext}}_2, \ldots, e^{\text{ext}}_m\} \in \mathcal{E}_{\text{ext}}$ ($\mathcal{E}_{\text{ext}}$ is the space of important evidence sentence sets):

$$f_{\text{extract}} : (\mathcal{C}, \mathcal{G}) \rightarrow \mathcal{E}_{\text{ext}} \tag{11}$$

The extraction process is achieved by providing a prompt $p(c, g)$ as input to a function LLM corresponding to an LLM model:

$$E_{\text{ext}} = \operatorname{LLM}(p(c, g)) \tag{12}$$

Here, $E_{\text{ext}}$ is the set of important evidence sentences extracted by the LLM.

The prompt $p(c, g)$ is constructed by combining the claim $c$ and the gold evidence $g$, taking the following form:

$$p(c, g) = \text{Template} \oplus c \oplus g \tag{13}$$

Here, $\oplus$ is the string concatenation operator, and Template is the prompt template specifying the extraction task. Using the extracted evidence sentence set $E_{\text{ext}}$, the classification model $f_{\text{cl}}$ infers the predicted label $\hat{y}$ as follows:

$$f_{\text{cl}}(c, E_{\text{ext}}) = \hat{y} \in \mathcal{L} \tag{14}$$

Prompt Template and LLM Details. For the extraction of crucial evidence sentences, we used the prompt template shown in Figure 1. For all evidence extraction using an LLM, we utilized the unsloth/Qwen3-8B-bnb-4bit model without employing Chain-of-Thought prompting.

Please output the following information in a bulleted list:
### Claim: [claim]
### Document: [gold evidence]
Output Example:
- result1
- result2
### Judgment: Extract and concisely output only the direct evidence from the document needed to determine if the claim is [label]. Do not include any unnecessary explanations or analysis.

Figure 1: Prompt template used for crucial evidence extraction.
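To make the prompt construction concrete, the sketch below fills the template with a claim, its gold evidence, and a label, and then parses a bulleted response into a list of evidence sentences. The placeholder names, the helper functions `build_prompt` and `parse_bullets`, and the sample model output are hypothetical; no LLM is called here, and the actual post-processing used in this work may differ.

```python
from typing import List

# Wording follows the template in Figure 1; the placeholder names are assumptions.
TEMPLATE = (
    "Please output the following information in a bulleted list:\n"
    "### Claim: {claim}\n"
    "### Document: {gold_evidence}\n"
    "Output Example:\n"
    "- result1\n"
    "- result2\n"
    "### Judgment: Extract and concisely output only the direct evidence from "
    "the document needed to determine if the claim is {label}. "
    "Do not include any unnecessary explanations or analysis."
)


def build_prompt(claim: str, gold_evidence: str, label: str) -> str:
    """Fill the template with one claim, its gold evidence and its label."""
    return TEMPLATE.format(claim=claim, gold_evidence=gold_evidence, label=label)


def parse_bullets(llm_output: str) -> List[str]:
    """Keep only lines of the form '- text' and return the text parts."""
    items = []
    for line in llm_output.splitlines():
        line = line.strip()
        if line.startswith("- "):
            items.append(line[2:].strip())
    return items


prompt = build_prompt(
    claim="Unemployment fell to 4 percent in 2019",
    gold_evidence="The bureau reported a 4 percent rate for 2019.",
    label="True",
)

# Hypothetical model output, shown only to exercise the parser.
extracted = parse_bullets("- The bureau reported a 4 percent rate for 2019.\nDone.")
```

Parsing only the bulleted lines is one simple way to discard any extra commentary the model produces despite the instruction not to add explanations.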

3.3.2. Data Augmentation for Improved Noise Robustness

Training the classification model solely on crucial information extracted by an LLM could lead to training with only correct positive examples. This raises concerns about its ability to effectively learn robustness against erroneous information, which is expected in real-world deployments. Therefore, we decided to intentionally inject irrelevant sentences during the training of the classification model.

For each claim $c_i$ and its LLM-extracted evidence sentence set $E_i = \{e_{i,1}, e_{i,2}, \ldots, e_{i,m_i}\}$, we randomly inject irrelevant sentences as noise. Let $E_{\text{all}} = \bigcup_{i=1}^{N} E_i$ be the union of all extracted evidence sentences across all claims. The set of noise candidates $\mathcal{N}_i$ for claim $c_i$ is defined as follows:

$$\mathcal{N}_i = E_{\text{all}} \setminus E_i \tag{15}$$

Here, $\mathcal{N}_i$ is the set of evidence sentences extracted from the gold evidence of claims other than $c_i$. The noise injection function AddNoise is defined as:

$$\operatorname{AddNoise}(E_i, \mathcal{N}_i, n) = \tilde{E}_i \tag{16}$$

Here, $\tilde{E}_i$ is generated by the following process:

$$\tilde{E}_i = E_i \cup \operatorname{RandomSample}(\mathcal{N}_i, n) \tag{17}$$

$\operatorname{RandomSample}(\mathcal{N}_i, n)$ is a function that randomly selects $n$ sentences from the noise candidate set $\mathcal{N}_i$, where $n$ is a hyperparameter representing the number of sentences to be injected as noise. The final training evidence sentence set is expressed as:

$$E_i^{\text{train}} = \operatorname{AddNoise}(E_i, \mathcal{N}_i, n) \tag{18}$$

Through this, we anticipate that the classification model will operate robustly even when noise is present during inference. The classification model $f_{\text{robust}}$ is trained to minimize:

$$\ell(f_{\text{robust}}(c_i, E_i^{\text{train}}), y_i^{\text{gt}}) \tag{19}$$

where $\ell(\cdot, \cdot)$ represents the loss function, and $y_i^{\text{gt}}$ represents the ground truth label. Here, cross-entropy loss was used as the loss function.
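The noise injection step can be sketched as follows, assuming the extracted evidence is held in a dictionary keyed by claim id. The example sentences, the fixed random seed, and the choice of one injected sentence per claim are illustrative assumptions rather than the settings used in the experiments.

```python
import random
from typing import Dict, List


def add_noise(evidence: List[str], noise_candidates: List[str],
              n_noise: int, rng: random.Random) -> List[str]:
    """Return the evidence plus n_noise candidates sampled without replacement."""
    n_noise = min(n_noise, len(noise_candidates))
    return list(evidence) + rng.sample(noise_candidates, n_noise)


# Hypothetical LLM-extracted evidence sentences per claim.
extracted: Dict[str, List[str]] = {
    "c1": ["GDP grew 2.1 percent in 2019."],
    "c2": ["The city has 1.4 million residents.", "The census was held in 2020."],
    "c3": ["Turnout reached 66 percent."],
}

rng = random.Random(0)  # fixed seed so the sketch is reproducible
train_evidence: Dict[str, List[str]] = {}
for claim_id, own in extracted.items():
    own_set = set(own)
    # Noise candidates: sentences of all other claims, minus the claim's own.
    candidates = [s for other_id, sents in extracted.items()
                  if other_id != claim_id for s in sents if s not in own_set]
    train_evidence[claim_id] = add_noise(own, candidates, n_noise=1, rng=rng)
```

Sampling without replacement keeps the injected sentences distinct, and excluding a claim's own sentences ensures that everything added really is unrelated to that claim under the definition above.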

3.3.3. 4-Class Classification for Irrelevant Evidence Detection

During inference, a retrieval system might present evidence irrelevant to a given claim. To address such situations, we introduced a new label, "Irrelevant," to the classification model. We trained the model to categorize evidence sentences unrelated to the claim under this new label.

For the training data of the "Irrelevant" label, we used, for each claim $c_i$, the evidence sentences $\mathcal{N}_i = E_{\text{all}} \setminus E_i$ extracted from the gold evidence of other claims. This means that an evidence sentence $e_{j,l}$ is defined as irrelevant to claim $c_i$ if it was extracted from the gold evidence of a different claim $c_j$ ($j \neq i$).

Extending the conventional 3-class classification, a 4-class classification function $f_{\text{irr}}$ now outputs a label $y \in \mathcal{L}_{\text{irr}} = \{\text{True}, \text{False}, \text{Conflicting}, \text{Irrelevant}\}$:

$$f_{\text{irr}} : (\mathcal{C}, \mathcal{E}) \rightarrow \mathcal{L}_{\text{irr}} \tag{20}$$

This allows the model to identify irrelevant evidence even if appropriate evidence sentences are not retrieved. Consequently, it enables a strategy where the system can re-perform evidence retrieval and re-classify if irrelevant evidence is detected.
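One possible shape for such a retry strategy is sketched below. The paper does not specify how retrieval is repeated, so doubling the number of retrieved sentences on each round, the limit of three rounds, and the toy `toy_retrieve` and `toy_classify` functions are all assumptions introduced for illustration.

```python
from typing import Callable, List


def verify_with_retry(claim: str,
                      retrieve: Callable[[str, int], List[str]],
                      classify: Callable[[str, List[str]], str],
                      k: int = 3, max_rounds: int = 3) -> str:
    """Classify; on 'Irrelevant', double k, retrieve again and re-classify."""
    label = "Irrelevant"
    for _ in range(max_rounds):
        label = classify(claim, retrieve(claim, k))
        if label != "Irrelevant":
            break
        k *= 2
    return label


corpus = [
    "Exports rose sharply",
    "The weather was mild",
    "A new bridge opened",
    "Turnout reached 66 percent",
    "The bureau reported a 4 percent unemployment rate in 2019",
]
requested_k: List[int] = []


def toy_retrieve(claim: str, k: int) -> List[str]:
    """Hypothetical retriever: returns the first k corpus sentences."""
    requested_k.append(k)
    return corpus[:k]


def toy_classify(claim: str, evidence: List[str]) -> str:
    """Hypothetical 4-class model: 'Irrelevant' unless the key figure appears."""
    return "True" if any("4 percent" in e for e in evidence) else "Irrelevant"


label = verify_with_retry("Unemployment fell to 4 percent in 2019",
                          toy_retrieve, toy_classify)
```

In this toy run the first round sees only unrelated sentences, so the loop widens the retrieval and the second round reaches the relevant one.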

4. Experiments

4.1. Experimental Settings

For the claim classification task, the following settings were used for training and evaluation.
• Model: FacebookAI/roberta-base
• Maximum sequence length: 512
• Number of labels: Automatically determined from the data (e.g., Conflicting, False, True, etc.)

Training settings:
• Learning rate: $2 \times 10^{-5}$
• Batch size: 128 (training), 128 (evaluation)
• Number of epochs: 10
• Weight decay: 0.01
• Adam epsilon: $1 \times 10^{-8}$
• Scheduler: linear
• Warmup ratio: 0.1
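For readers unfamiliar with a linear scheduler combined with a warmup ratio, the sketch below computes the learning rate at a given step under one common definition: a linear ramp from zero during warmup, followed by a linear decay to zero. The total of 1,000 steps is a hypothetical value, and the rounding of the warmup length may differ slightly from the training library actually used.

```python
def linear_schedule_lr(step: int, total_steps: int,
                       base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` for linear warmup followed by linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr over the warmup phase.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / remaining)


total = 1000  # hypothetical number of optimizer steps
lr_start = linear_schedule_lr(0, total)
lr_mid_warmup = linear_schedule_lr(50, total)
lr_peak = linear_schedule_lr(100, total)
lr_mid_decay = linear_schedule_lr(550, total)
lr_end = linear_schedule_lr(1000, total)
```

With a warmup ratio of 0.1, the peak learning rate is reached after the first tenth of training and then decreases steadily until the final step.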

All experiments were conducted using a Tesla V100-PCIE-32GB GPU.

4.2. Evidence Retrieval

We evaluated and compared three algorithms for retrieving evidence sentences relevant to claims: BM25, Contriever, and Contriever (additional training with SimCSE). Table 1 presents the results.

The fact that the three models showed similarly high performance in P@1 to P@3 indicates that when clear, relevant evidence sentences exist for numerical claims, the differences between retrieval methods are limited. Contriever with additional SimCSE training demonstrated significant performance improvements, particularly from P@10 onwards and in Recall metrics, achieving a substantial increase to 0.524 for Recall@100 and 0.731 for Recall@1000. This is likely due to the SimCSE-based contrastive learning enabling more effective learning of semantic relevance between claims and gold evidence. While BM25 showed excellent performance in top-tier precision, it lagged behind other methods in Recall metrics.

For fact-checking tasks, it is considered crucial to collect a wide range of diverse evidence sentences. Therefore, Contriever with additional SimCSE training, with its high Recall performance, proved to be the optimal choice for retrieval coverage. On the other hand, BM25 can still be a viable option when computational resources are limited or when only the highest-ranked evidence is required.
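For reference, the P@k and Recall@k metrics discussed above can be computed as in the following sketch, which assumes the standard definitions (relevant items in the top k divided by k, and divided by the number of relevant items, respectively). The ranked list and relevance set are made-up examples, not data from the experiments.

```python
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)


ranked = ["e3", "e7", "e1", "e9", "e2"]  # hypothetical retrieval order
relevant = {"e3", "e1", "e5"}            # hypothetical gold sentences

p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_3 = precision_at_k(ranked, relevant, 3)
r_at_3 = recall_at_k(ranked, relevant, 3)
r_at_5 = recall_at_k(ranked, relevant, 5)
```

In this example the top-ranked item is relevant, so P@1 is perfect, while recall stays below one at every cutoff because one gold sentence is never retrieved. This mirrors how a method can look strong at the top ranks yet still miss evidence overall.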

4.3. Classification for Fact-Checking

Table 2 presents the results of the fact-checking classification using the development data. Despite Contriever demonstrating high accuracy in the evidence retrieval results (Table 1), it achieved the lowest macro F1 in the classification task. This clearly indicates that improvements in retrieval performance do not necessarily translate directly to enhanced classification performance. SimCSE-finetuned Contriever improved recall by including a greater number of relevant evidence sentences in the retrieval results. However, the precision at top ranks (e.g., top-1 or top-3) did not sufficiently improve, and thus this did not lead to better overall classification performance. This suggests that while the fine-tuned Contriever is effective at broadly collecting semantically related sentences, it is less effective at ranking the most crucial evidence at the top. Therefore, a two-stage retrieval approach would likely be more effective: Contriever is first used for initial retrieval to gather a wide range of candidates, and a reranking model then places the most relevant evidence at higher ranks. BM25-based retrieval achieved the most stable classification performance, proving to be the optimal choice from a practical perspective.
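A retrieve-then-rerank pipeline of the kind suggested above could be organized as in the following sketch. Both scoring functions are deliberately simple, hypothetical stand-ins (token overlap for the recall-oriented first stage, shared numeric tokens for the reranker); this approach was not implemented or evaluated in this paper.

```python
from typing import Callable, List, Sequence

Scorer = Callable[[str, str], float]


def retrieve_then_rerank(claim: str, corpus: Sequence[str], first_stage: Scorer,
                         reranker: Scorer, n_candidates: int, k: int) -> List[str]:
    """Gather n_candidates with the first stage, then keep the reranker's top k."""
    candidates = sorted(corpus, key=lambda e: first_stage(claim, e),
                        reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda e: reranker(claim, e), reverse=True)[:k]


def token_overlap(claim: str, evidence: str) -> float:
    """Hypothetical recall-oriented scorer: number of shared tokens."""
    return float(len(set(claim.lower().split()) & set(evidence.lower().split())))


def numeric_overlap(claim: str, evidence: str) -> float:
    """Hypothetical reranker: number of shared purely numeric tokens."""
    claim_nums = {t for t in claim.split() if t.isdigit()}
    evidence_nums = {t for t in evidence.split() if t.isdigit()}
    return float(len(claim_nums & evidence_nums))


claim = "Unemployment fell to 4 percent in 2019"
corpus = [
    "Unemployment fell in many regions in recent years",
    "The rate was 4 percent in 2019",
    "Exports rose sharply",
    "Unemployment fell to 5 percent in 2018",
]
top = retrieve_then_rerank(claim, corpus, token_overlap, numeric_overlap,
                           n_candidates=3, k=1)
```

Here the first stage ranks the sentence about 2018 highest because it shares the most words with the claim, while the reranker promotes the sentence whose figures actually match, illustrating why separating coverage from final ranking can help.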

Contrary to expectations, the noise augmentation method led to a performance decrease, particularly a significant drop in Conflicting predictions. This phenomenon can be attributed to the model's tendency, after being exposed to irrelevant sentences during training, to classify ambiguous or weakly supported cases as True or False rather than Conflicting. By learning to make predictions even in the presence of noise, the model becomes less sensitive to ambiguity and is more likely to output a definitive label. As a result, the recall and F1 score for the Conflicting class decreased, while misclassifications into the True or False classes increased. However, a slight improvement was observed for True predictions, partially confirming the effect of improved noise robustness. These findings suggest that noise injection can enhance robustness to irrelevant information while also blurring the criteria for identifying ambiguous cases in the model. In the future, further improvements are needed, such as optimizing data augmentation and loss function design, to enhance robustness against irrelevant information while more accurately identifying ambiguous cases.
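Because macro F1 averages the per-class F1 scores with equal weight, a collapse in one class has a large effect on the overall score. The sketch below computes per-class and macro F1 on a small made-up set of predictions in which the Conflicting class is never predicted; the labels are illustrative and are not taken from the experimental results.

```python
from typing import Dict, List, Sequence


def per_class_f1(gold: Sequence[str], pred: Sequence[str],
                 labels: Sequence[str]) -> Dict[str, float]:
    """F1 for each label, using F1 = 2TP / (2TP + FP + FN)."""
    scores: Dict[str, float] = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-class F1 scores."""
    return sum(scores.values()) / len(scores)


labels: List[str] = ["True", "False", "Conflicting"]
gold = ["True", "False", "False", "Conflicting", "Conflicting", "False"]
pred = ["True", "False", "False", "False", "True", "False"]

scores = per_class_f1(gold, pred, labels)
overall = macro_f1(scores)
```

In this toy case four of six predictions are correct, yet the macro F1 is only about 0.51 because the Conflicting class contributes a score of zero.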

4.4. Irrelevant Evidence Detection

5. Conclusion

In this paper, we proposed and validated a method following a two-stage approach for evidence retrieval and classification in fact-checking numerical claims. In the evidence retrieval task, Contriever, further trained with SimCSE-based contrastive learning, achieved substantial performance improvements, particularly in Recall metrics, demonstrating its ability to effectively learn the semantic relevance between claims and gold evidence. Meanwhile, BM25 maintained stable performance in top-tier precision, confirming its practicality from a computational efficiency perspective.

However, a crucial insight gained from the classification task was that high accuracy in evidence retrieval does not necessarily directly lead to improved classification performance. The classification model using BM25-based search results achieved the most stable macro F1, indicating its optimality from a practical standpoint. Class-wise analysis revealed that False predictions consistently had the highest F1 score across all methods, while True predictions proved to be the most challenging. Although the noise augmentation method unexpectedly led to an overall performance decrease, a slight improvement was observed for True predictions, suggesting potential for improved noise robustness.

As future work, we aim to build a two-stage retrieval system that leverages the high Recall performance of the SimCSE-trained Contriever. By applying re-ranking techniques to comprehensively retrieved candidate documents, we expect to place more relevant evidence sentences higher in the results and thereby achieve performance improvements that balance retrieval coverage and ranking accuracy.

Acknowledgments

A part of this work was supported by JSPS KAKENHI Grant Number 23K11342.

Declaration on Generative AI

During the preparation of this work, the author utilized Gemini for revisions related to grammar and clarity. These tools were employed to refine sentence structure, correct typographical errors, and enhance the overall quality of the language. They were also used for translating content into English. No generative content was used in the analysis, figures, or experimental sections. After using these tools/services, the author reviewed and edited the content as needed and assumes full responsibility for the content of this publication.

References

[1] Guo, Schlichtkrull, Vlachos, A survey on automated fact-checking, Transactions of the Association for Computational Linguistics 10 (2022) 178–206.

[2] Venktesh, Setty, Anand, Hasanain, Bendou, Bouamor, Alam, Iturra-Bocaz, Galuščáková, Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims, in: G. Faggioli, Ferro, Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.

[3] Mori, Papotti, Bellomarini, O. Giudice, Neural machine translation for fact-checking temporal claims, in: R. Aly, C. Christodoulopoulos, O. Cocarascu, Z. Guo, A. Mittal, M. Schlichtkrull, J. Thorne, A. Vlachos (Eds.), Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 78–82. URL: https://aclanthology.org/2022.fever-1.8/. doi:10.18653/v1/2022.fever-1.8.

[4] J. Chen, A. Sriram, E. Choi, G. Durrett, Generating literal and implied subquestions to fact-check complex claims, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3495–3516. URL: https://aclanthology.org/2022.emnlp-main.229/. doi:10.18653/v1/2022.emnlp-main.229.

[5] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen, MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4685–4697. URL: https://aclanthology.org/D19-1475/. doi:10.18653/v1/D19-1475.

[6] A. Sathe, S. Ather, T. M. Le, N. Perry, J. Park, Automated fact-checking of claims from Wikipedia, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6874–6882. URL: https://aclanthology.org/2020.lrec-1.849/.

[7] M. Schlichtkrull, Z. Guo, A. Vlachos, Averitec: A dataset for real-world claim verification with evidence from the web, 2023. URL: https://arxiv.org/abs/2305.13117. arXiv:2305.13117.

[8] V. V, A. Anand, A. Anand, V. Setty, Quantemp: A real-world open-domain benchmark for fact-checking numerical claims, 2024. URL: https://arxiv.org/abs/2403.17169. arXiv:2403.17169.

[9] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.

[10] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave, Unsupervised dense information retrieval with contrastive learning, arXiv preprint arXiv:2112.09118 (2021).