<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>KSU at CheckThat! 2025: Two-stage approach to fact-checking numerical claims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Keito Fukuoka</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hisashi Miyamori</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Kyoto Sangyo University of Japan (KSU University)</institution>
          ,
          <addr-line>Kamigamo Motoyama, Kita-ku, Kyoto City, Kyoto</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The spread of misinformation containing numerical claims online poses a severe threat, undermining the very foundation of democracy. This paper proposes a fact-checking method for automatically determining the veracity of claims that include numerical and temporal elements. The proposed method consists of a two-stage process: evidence retrieval and classification. Specifically, it combines comprehensive evidence retrieval using a Contriever model enhanced by SimCSE-based contrastive learning with a classification method that extracts crucial evidence using a Large Language Model (LLM). For experiments, we used the English dataset provided by CheckThat! 2025 Task 3. In the evidence retrieval task, the Contriever model with SimCSE-based contrastive learning achieved a Recall@100 of 0.524, significantly outperforming conventional methods like BM25. Conversely, in the classification task, the method utilizing search results from BM25 achieved the highest performance with a macro F1 of 0.5054. A significant insight gained from this study is that improvements in evidence retrieval ranking accuracy do not necessarily directly lead to enhanced classification performance.</p>
      </abstract>
      <kwd-group>
        <kwd>Fact-checking</kwd>
        <kwd>Numerical claims</kwd>
        <kwd>Evidence retrieval</kwd>
        <kwd>Contrastive learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The spread of misinformation online, particularly prominent during election periods, not only
triggers social and political unrest but also poses a severe threat, undermining the very foundation of
democracy [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Among various forms of misinformation, verifying claims that include numerical and
temporal elements is of paramount importance in fact-checking. Indeed, numerical claims constitute a
significant component of political discourse.
      </p>
      <p>
        This paper addresses the CheckThat! Lab’s Task 3: Fact-checking numerical claims [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. The objective
of this task is to determine the veracity of claims containing numerical quantities and temporal
expressions. For each claim, participants are provided with a short list of evidence and are required to classify
the claim as "True," "False," or "Conflicting" based on this evidence.
      </p>
      <p>We propose a two-stage fact-checking method consisting of an evidence retrieval step enhanced
by contrastive learning and a classification step that incorporates LLM-based extraction of crucial evidence.
First, in the evidence retrieval step, we observed that claims and their corresponding evidence are often
phrased differently, even when their content is highly relevant. To comprehensively retrieve
highly relevant evidence, we adopted an evidence retrieval system composed of a Contriever model
further trained with SimCSE-based contrastive learning to capture the semantic relevance between
claims and evidence. Second, in the classification step using the retrieved evidence, we confirmed that
gold evidence in this task tends to be lengthy. To mitigate the negative impact of this length on classification, we
adopted a method that uses an LLM to extract important information from the evidence and
then performs classification based on the extracted results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Automated fact-checking has garnered significant attention as a crucial countermeasure against online
misinformation [
        <xref ref-type="bibr" rid="ref3">3, 4, 5</xref>
        ]. Existing fact-checking research has largely been limited to synthetic claims [6]
and non-numerical claims [7], with a notable lack of focus on claims containing numerical information.
      </p>
      <p>Addressing this gap, Viswanathan et al. constructed QUANTEMP [8], an open-domain benchmark
specifically designed for real-world numerical claims. QUANTEMP is a diverse dataset encompassing
comparisons, statistics, durations, and temporal aspects, offering detailed metadata and evidence.
Using this dataset, they evaluated the limitations of existing methods and presented new challenges in
numerical claim verification.</p>
      <p>The Task 3: Fact-checking numerical claims that we address in this paper aligns with the challenges
posed by QUANTEMP. This task defines two sub-tasks for determining the veracity of claims: an
"evidence retrieval task" to search for relevant evidence and a "classification task" to categorize claims
based on that evidence.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Method</title>
      <p>This task broadly consists of the following two components:
• Evidence retrieval task: retrieving evidence relevant to a given claim.
• Classification task: determining whether a claim is True, False, or Conflicting based on the claim
and retrieved evidence.</p>
      <sec id="sec-3-1">
        <title>3.1. Task Formulation</title>
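        <p>This task is formulated as follows. Given a claim <inline-formula><tex-math>c \in \mathcal{C}</tex-math></inline-formula> (<inline-formula><tex-math>\mathcal{C}</tex-math></inline-formula> is the claim space) as a query, and a sequence
of top-<inline-formula><tex-math>k</tex-math></inline-formula> retrieved evidences <inline-formula><tex-math>E_k = (e_1, e_2, \ldots, e_k) \in \mathcal{E}</tex-math></inline-formula> (<inline-formula><tex-math>\mathcal{E}</tex-math></inline-formula> is the evidence sequence space) obtained by a
retrieval system <inline-formula><tex-math>R</tex-math></inline-formula>, a classification function <inline-formula><tex-math>f</tex-math></inline-formula> outputs a label <inline-formula><tex-math>y \in \mathcal{L} = \{\mathrm{True}, \mathrm{False}, \mathrm{Conflicting}\}</tex-math></inline-formula>:
<disp-formula><label>(1)</label><tex-math>f : (\mathcal{C}, \mathcal{E}) \rightarrow \mathcal{L}</tex-math></disp-formula></p>
        <p>Here, the process of obtaining the evidence sequence <inline-formula><tex-math>E_k</tex-math></inline-formula> for a claim <inline-formula><tex-math>c</tex-math></inline-formula> by the retrieval system <inline-formula><tex-math>R</tex-math></inline-formula> is
expressed as follows:
<disp-formula><label>(2)</label><tex-math>E_k = \operatorname{top-}k\bigl(\operatorname{sort}_{d \in D}(\operatorname{score}(c, d))\bigr) = (e_1, e_2, \ldots, e_k) \quad \text{s.t.} \quad \operatorname{score}(c, e_1) \geq \operatorname{score}(c, e_2) \geq \cdots \geq \operatorname{score}(c, e_k)</tex-math></disp-formula>
where <inline-formula><tex-math>D = \{d_1, d_2, \ldots, d_n\}</tex-math></inline-formula> is the set of evidences relevant to claim <inline-formula><tex-math>c</tex-math></inline-formula>, <inline-formula><tex-math>\operatorname{score}(q, d)</tex-math></inline-formula> is a function
that returns the relevance score of document <inline-formula><tex-math>d</tex-math></inline-formula> for query <inline-formula><tex-math>q</tex-math></inline-formula>, <inline-formula><tex-math>\operatorname{sort}_{x \in X}(g(x))</tex-math></inline-formula> is a function that sorts each
element <inline-formula><tex-math>x</tex-math></inline-formula> in set <inline-formula><tex-math>X</tex-math></inline-formula> in descending order based on the value of function <inline-formula><tex-math>g(x)</tex-math></inline-formula>, <inline-formula><tex-math>\operatorname{top-}k(S)</tex-math></inline-formula> is a function
that returns the top-<inline-formula><tex-math>k</tex-math></inline-formula> elements of sequence <inline-formula><tex-math>S</tex-math></inline-formula>, and <inline-formula><tex-math>e_i</tex-math></inline-formula> represents the <inline-formula><tex-math>i</tex-math></inline-formula>-th evidence.</p>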
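        <p>The ranking step described above can be sketched as follows. This is a minimal illustration, not the system used in the paper: the function names and the token-overlap score are assumptions standing in for BM25 or a dense retriever, and the example sentences are made up.</p>
        <preformat><![CDATA[
```python
from typing import Callable, List, Sequence


def top_k_evidence(
    claim: str,
    candidates: Sequence[str],
    score: Callable[[str, str], float],
    k: int,
) -> List[str]:
    """Sort candidates by score(claim, candidate), descending, and keep the first k."""
    ranked = sorted(candidates, key=lambda d: score(claim, d), reverse=True)
    return ranked[:k]


def overlap_score(query: str, document: str) -> float:
    """Toy relevance score (assumption): number of distinct shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(document.lower().split())))


claim = "unemployment fell to 4 percent in 2019"
candidates = [
    "The weather was mild in 2019",
    "Unemployment fell to 3.5 percent in 2019",
    "Football results for the season",
]
top2 = top_k_evidence(claim, candidates, overlap_score, k=2)
```
]]></preformat>
        <p>Any scoring function with the same signature can be passed in, so the same selection logic applies to sparse and dense retrievers.</p>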
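        <p>Furthermore, each label <inline-formula><tex-math>y \in \mathcal{L}</tex-math></inline-formula> represents one of the following three types of content:
• True: Based on the retrieved evidence, the claim <inline-formula><tex-math>c</tex-math></inline-formula> is determined to be true.
• False: Based on the retrieved evidence, the claim <inline-formula><tex-math>c</tex-math></inline-formula> is determined to be false.
• Conflicting: Based on the retrieved evidence, it is not possible to determine whether the claim
<inline-formula><tex-math>c</tex-math></inline-formula> is true (insufficient evidence or conflicting content).</p>
        <p>The classification model takes the claim <inline-formula><tex-math>c</tex-math></inline-formula> and the retrieved evidence sequence <inline-formula><tex-math>E_k</tex-math></inline-formula> as input and outputs
the probability <inline-formula><tex-math>P(y \mid c, E_k)</tex-math></inline-formula> for label <inline-formula><tex-math>y</tex-math></inline-formula>:
<disp-formula><label>(3)</label><tex-math>P(y \mid c, E_k) = \operatorname{softmax}(h(c, E_k))</tex-math></disp-formula>
where <inline-formula><tex-math>h(c, E_k)</tex-math></inline-formula> represents the feature representation by a neural network, and <inline-formula><tex-math>y \in \mathcal{L} = \{\mathrm{True}, \mathrm{False}, \mathrm{Conflicting}\}</tex-math></inline-formula> represents the predicted label.</p>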
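        <p>The final predicted label <inline-formula><tex-math>\hat{y}</tex-math></inline-formula> is determined as follows:
<disp-formula><label>(4)</label><tex-math>\hat{y} = \operatorname*{arg\,max}_{y \in \mathcal{L}} P(y \mid c, E_k)</tex-math></disp-formula></p>
        <p>Thus, it is important to note that the classification task depends on the results of the evidence retrieval
task, and the ranking performance of the retrieval system affects the accuracy of the classification
results.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Evidence Retrieval</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Dataset Construction for Evidence Retrieval Evaluation</title>
          <p>Evidence retrieval is the process of selecting highly relevant evidence for a given claim. In this task,
explicit claim-evidence pairs are not provided in the supplied data, which makes evaluating ranking
performance challenging. To address this, we explicitly constructed claim-evidence pairs by leveraging
the gold evidences present in the validation data.</p>
          <p>Let <inline-formula><tex-math>C = \{c_1, c_2, \ldots, c_N\}</tex-math></inline-formula> be the set of claims in the validation data, and <inline-formula><tex-math>g_i</tex-math></inline-formula> be the gold evidence
corresponding to each claim <inline-formula><tex-math>c_i</tex-math></inline-formula>. We segmented each gold evidence <inline-formula><tex-math>g_i</tex-math></inline-formula> into individual sentences to
obtain a set of evidence sentences <inline-formula><tex-math>E_i = \{e_{i,1}, e_{i,2}, \ldots, e_{i,m_i}\}</tex-math></inline-formula>. The relevance label <inline-formula><tex-math>\operatorname{rel}(c_i, e_j)</tex-math></inline-formula> is defined as
follows:
<disp-formula><label>(5)</label><tex-math><![CDATA[\operatorname{rel}(c_i, e_j) = \begin{cases} 1 & \text{if } e_j \in E_i \\ 0 & \text{if } e_j \in \bigcup_{l \neq i} E_l \end{cases}]]></tex-math></disp-formula></p>
          <p>Through this process, we constructed a dataset <inline-formula><tex-math>\mathcal{D} = \{(c_i, e_j, \operatorname{rel}(c_i, e_j)) \mid i \in \{1, \ldots, N\},\ j \in \{1, \ldots, |E|\}\}</tex-math></inline-formula> consisting of 13,019 claim-evidence pairs (train: 9,935 pairs, dev: 3,084 pairs), enabling
quantitative evaluation of ranking performance in evidence retrieval. Here, <inline-formula><tex-math>E = \bigcup_{i=1}^{N} E_i</tex-math></inline-formula> represents the
set of all evidence sentences.</p>
        </sec>
        <sec id="sec-3-1-1">
          <title>3.2.2. SimCSE-based Contrastive Learning for Contriever</title>
          <p>When retrieving evidence sentences relevant to claims using dense retrieval, models may struggle
to generalize to novel topics not present in the training data, potentially performing worse than
conventional sparse retrieval methods like BM25 [9]. Contriever, a dense retriever pre-trained with
unsupervised contrastive learning, has been shown to outperform BM25 in terms of Recall@100 on
various datasets [10]. Therefore, to enable the Contriever
model to more comprehensively retrieve evidence sentences relevant to claims, we further trained the
model using SimCSE-based contrastive learning on the semantic relatedness between claims and gold
evidences.</p>
          <p>Contrastive learning is performed using claim <inline-formula><tex-math>c_i</tex-math></inline-formula> and a sentence <inline-formula><tex-math>e_{i,j}^{+}</tex-math></inline-formula> extracted from its gold evidence
as a positive pair, and claim <inline-formula><tex-math>c_i</tex-math></inline-formula> and a sentence <inline-formula><tex-math>e_{i,j}^{-}</tex-math></inline-formula> extracted from another claim’s gold evidence as a
negative pair.</p>
          <p>First, the Contriever encoder is used to convert claim <inline-formula><tex-math>c_i</tex-math></inline-formula> and evidence sentence <inline-formula><tex-math>e_{i,j}^{+}</tex-math></inline-formula> (or <inline-formula><tex-math>e_{i,j}^{-}</tex-math></inline-formula>) into
vector representations:
<disp-formula><label>(6)</label><tex-math>\mathbf{h}_c = \operatorname{MeanPooling}(\operatorname{Contriever}(c_i), \mathrm{mask}_c)</tex-math></disp-formula>
<disp-formula><label>(7)</label><tex-math>\mathbf{h}_e = \operatorname{MeanPooling}(\operatorname{Contriever}(e_{i,j}), \mathrm{mask}_e)</tex-math></disp-formula></p>
          <p>Here, MeanPooling is an average pooling operation that considers the attention mask, and <inline-formula><tex-math>\mathrm{mask}_c</tex-math></inline-formula> and
<inline-formula><tex-math>\mathrm{mask}_e</tex-math></inline-formula> are the attention masks for the claim and evidence sentence, respectively.</p>
          <p>The resulting representation vectors <inline-formula><tex-math>\mathbf{h}_c \in \mathbb{R}^{d}</tex-math></inline-formula> and <inline-formula><tex-math>\mathbf{h}_e \in \mathbb{R}^{d}</tex-math></inline-formula> are then transformed by an MLP layer:
<disp-formula><label>(8)</label><tex-math>\mathbf{h}'_c = \sigma(\mathbf{W}\mathbf{h}_c + \mathbf{b})</tex-math></disp-formula>
<disp-formula><label>(9)</label><tex-math>\mathbf{h}'_e = \sigma(\mathbf{W}\mathbf{h}_e + \mathbf{b})</tex-math></disp-formula>
where <inline-formula><tex-math>\sigma</tex-math></inline-formula> is the activation function, <inline-formula><tex-math>\mathbf{W} \in \mathbb{R}^{d' \times d}</tex-math></inline-formula> is the weight matrix, <inline-formula><tex-math>\mathbf{b} \in \mathbb{R}^{d'}</tex-math></inline-formula> is the bias vector, and
<inline-formula><tex-math>d'</tex-math></inline-formula> is the dimension after transformation.</p>
          <p>With a batch size of <inline-formula><tex-math>B</tex-math></inline-formula>, the transformed representations of all claims in the batch are represented as
a matrix <inline-formula><tex-math>\mathbf{H}_c = [\mathbf{h}'_{c,1}, \mathbf{h}'_{c,2}, \ldots, \mathbf{h}'_{c,B}] \in \mathbb{R}^{B \times d'}</tex-math></inline-formula>, and the transformed representations of all evidence
sentences as a matrix <inline-formula><tex-math>\mathbf{H}_e = [\mathbf{h}'_{e,1}, \mathbf{h}'_{e,2}, \ldots, \mathbf{h}'_{e,B}] \in \mathbb{R}^{B \times d'}</tex-math></inline-formula>.</p>
          <p>The similarity matrix <inline-formula><tex-math>\mathbf{S} = \mathbf{H}_c (\mathbf{H}_e)^{\top} \in \mathbb{R}^{B \times B}</tex-math></inline-formula> is computed within the batch, and the model is trained
using the SimCSE loss:
<disp-formula><label>(10)</label><tex-math>\mathcal{L}_{\mathrm{SimCSE}} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{\exp(S_{ii} / \tau)}{\sum_{j=1}^{B} \exp(S_{ij} / \tau)}</tex-math></disp-formula>
Here, <inline-formula><tex-math>S_{ii}</tex-math></inline-formula> is the similarity between the <inline-formula><tex-math>i</tex-math></inline-formula>-th claim and its corresponding positive evidence sentence,
<inline-formula><tex-math>S_{ij}</tex-math></inline-formula> (<inline-formula><tex-math>i \neq j</tex-math></inline-formula>) is the similarity between the <inline-formula><tex-math>i</tex-math></inline-formula>-th claim and the <inline-formula><tex-math>j</tex-math></inline-formula>-th evidence sentence (negative example),
and <inline-formula><tex-math>\tau</tex-math></inline-formula> is the temperature parameter.</p>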
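          <p>The following is a minimal pure-Python sketch of mask-aware mean pooling and the in-batch contrastive loss described in this section. It is illustrative only: the function names are assumptions, the MLP projection is omitted, and plain lists stand in for the encoder outputs that a real implementation would hold as tensors.</p>
          <preformat><![CDATA[
```python
import math
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1."""
    dim = len(token_vectors[0])
    kept = max(sum(mask), 1)  # guard against an all-padding input
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, mask)) / kept
        for d in range(dim)
    ]


def in_batch_contrastive_loss(
    claims: Sequence[Sequence[float]],
    evidences: Sequence[Sequence[float]],
    temperature: float,
) -> float:
    """Mean over i of -log(exp(S_ii / t) / sum_j exp(S_ij / t)), with S_ij a dot product."""
    total = 0.0
    for i, h_c in enumerate(claims):
        logits = [
            sum(a * b for a, b in zip(h_c, h_e)) / temperature for h_e in evidences
        ]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_norm - logits[i]
    return total / len(claims)


pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], mask=[1, 1, 0])
aligned = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], 0.05)
swapped = in_batch_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]], 0.05)
```
]]></preformat>
          <p>In this toy batch the loss is close to zero when each claim is most similar to its own evidence vector and large when the pairs are swapped, which is the behaviour the contrastive objective rewards.</p>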
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Classification Task</title>
        <sec id="sec-3-3-1">
          <title>3.3.1. Evidence Sentence Processing for Classification Model Training</title>
          <p>Two primary approaches can be considered for training the classification model:
• Retrieving relevant evidence using a ranking algorithm like BM25 with the claim as a query, and
then training the classification model using these results.
• Directly using the gold evidence corresponding to the claim to train the classification model.</p>
          <p>While the former allows for automatic evidence acquisition, it carries the risk of retrieving
irrelevant sentences, which could negatively impact classification performance. The latter approach is
advantageous for leveraging highly reliable evidence. However, gold evidence is generally lengthy and
not suitable for direct use in training a classification model. Therefore, we propose extracting crucial
information from the gold evidence and transforming it into a format suitable for classification model
training.</p>
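          <p>Given a claim <inline-formula><tex-math>c \in \mathcal{C}</tex-math></inline-formula> (<inline-formula><tex-math>\mathcal{C}</tex-math></inline-formula> is the claim space) and its corresponding gold evidence <inline-formula><tex-math>g \in \mathcal{G}</tex-math></inline-formula> (<inline-formula><tex-math>\mathcal{G}</tex-math></inline-formula> is the gold
evidence space), an LLM-based crucial segment extraction function <inline-formula><tex-math>f_{\mathrm{extract}}</tex-math></inline-formula> outputs a set of important
evidence sentences <inline-formula><tex-math>E_{\mathrm{ext}} = \{e_1^{\mathrm{ext}}, e_2^{\mathrm{ext}}, \ldots, e_m^{\mathrm{ext}}\} \in \mathcal{E}_{\mathrm{ext}}</tex-math></inline-formula> (<inline-formula><tex-math>\mathcal{E}_{\mathrm{ext}}</tex-math></inline-formula> is the space of important evidence sentence
sets):
<disp-formula><label>(11)</label><tex-math>f_{\mathrm{extract}} : (\mathcal{C}, \mathcal{G}) \rightarrow \mathcal{E}_{\mathrm{ext}}</tex-math></disp-formula></p>
          <p>The extraction process is achieved by providing a prompt <inline-formula><tex-math>\mathrm{Prompt}(c, g)</tex-math></inline-formula> as input to a function
<inline-formula><tex-math>\mathrm{LLM}</tex-math></inline-formula> corresponding to an LLM model:
<disp-formula><label>(12)</label><tex-math>E_{\mathrm{ext}} = \mathrm{LLM}(\mathrm{Prompt}(c, g))</tex-math></disp-formula>
Here, <inline-formula><tex-math>E_{\mathrm{ext}} = \{e_1^{\mathrm{ext}}, e_2^{\mathrm{ext}}, \ldots, e_m^{\mathrm{ext}}\}</tex-math></inline-formula> is the set of important evidence sentences extracted by the LLM.</p>
          <p>The prompt <inline-formula><tex-math>\mathrm{Prompt}(c, g)</tex-math></inline-formula> is constructed by combining the claim <inline-formula><tex-math>c</tex-math></inline-formula> and the gold evidence <inline-formula><tex-math>g</tex-math></inline-formula>, taking the
following form:
<disp-formula><label>(13)</label><tex-math>\mathrm{Prompt}(c, g) = \mathrm{Template} \oplus c \oplus g</tex-math></disp-formula>
Here, <inline-formula><tex-math>\oplus</tex-math></inline-formula> is the string concatenation operator, and <inline-formula><tex-math>\mathrm{Template}</tex-math></inline-formula> is the prompt template specifying the
extraction task. Using the extracted evidence sentence set <inline-formula><tex-math>E_{\mathrm{ext}}</tex-math></inline-formula>, the classification model <inline-formula><tex-math>f_{\mathrm{cl}}</tex-math></inline-formula> infers the
predicted label <inline-formula><tex-math>y</tex-math></inline-formula> as follows:
<disp-formula><label>(14)</label><tex-math>f_{\mathrm{cl}}(c, E_{\mathrm{ext}}) = y \in \mathcal{L}</tex-math></disp-formula></p>
          <p><bold>Prompt Template and LLM Details.</bold> For the extraction of crucial evidence sentences, we used the
prompt template shown in Figure 1. For all LLM-based evidence extraction,
we utilized the unsloth/Qwen3-8B-bnb-4bit model without employing Chain-of-Thought prompting.</p>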
          <fig id="fig-1">
            <label>Figure 1</label>
            <preformat>Please output the following information in a bulleted list:
### Claim:
[claim]
### Document:
[gold evidence]
Output Example:
- result1
- result2
### Judgment:
Extract and concisely output only the direct evidence from the document needed
to determine if the claim is [label]. Do not include any unnecessary
explanations or analysis.</preformat>
          </fig>
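          <p>As a rough sketch of how such a prompt might be assembled and how the bulleted output might be parsed, consider the code below. The template text is copied from the prompt shown above; the function names, the placeholder names, and the example strings are assumptions made for illustration, and the call to the LLM itself is omitted.</p>
          <preformat><![CDATA[
```python
from typing import List

# Template text copied from the prompt template; placeholder names are illustrative.
TEMPLATE = (
    "Please output the following information in a bulleted list:\n"
    "### Claim:\n{claim}\n"
    "### Document:\n{document}\n"
    "Output Example:\n- result1\n- result2\n"
    "### Judgment:\n"
    "Extract and concisely output only the direct evidence from the document "
    "needed to determine if the claim is {label}. Do not include any "
    "unnecessary explanations or analysis."
)


def build_prompt(claim: str, gold_evidence: str, label: str) -> str:
    """Fill the template with the claim, its gold evidence, and the label."""
    return TEMPLATE.format(claim=claim, document=gold_evidence, label=label)


def parse_bullets(llm_output: str) -> List[str]:
    """Collect the text of every output line that starts with '- '."""
    return [
        line.strip()[2:].strip()
        for line in llm_output.splitlines()
        if line.strip().startswith("- ")
    ]


prompt = build_prompt(
    "GDP grew 3% in 2020.",
    "Official data show GDP shrank by 3.4% in 2020.",
    "False",
)
sentences = parse_bullets("- GDP shrank by 3.4% in 2020\n- Source: national accounts\nnote")
```
]]></preformat>
          <p>Lines of the model output that are not bullets are dropped, so stray commentary does not reach the classifier input.</p>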
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Data Augmentation for Improved Noise Robustness</title>
          <p>Training the classification model solely on crucial information extracted by an LLM could lead to
training with only correct positive examples. This raises concerns about its ability to effectively learn
robustness against erroneous information, which is expected in real-world deployments. Therefore, we
decided to intentionally inject irrelevant sentences during the training of the classification model.</p>
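          <p>For each claim <inline-formula><tex-math>c_i</tex-math></inline-formula> and its LLM-extracted evidence sentence set <inline-formula><tex-math>E_i^{\mathrm{ext}} = \{e_{i,1}^{\mathrm{ext}}, e_{i,2}^{\mathrm{ext}}, \ldots, e_{i,m_i}^{\mathrm{ext}}\}</tex-math></inline-formula>, we
randomly inject irrelevant sentences as noise. Let <inline-formula><tex-math>E_{\mathrm{all}}^{\mathrm{ext}} = \bigcup_{i=1}^{N} E_i^{\mathrm{ext}}</tex-math></inline-formula> be the union of all extracted
evidence sentences across all claims. The set of noise candidates <inline-formula><tex-math>N_i</tex-math></inline-formula> for claim <inline-formula><tex-math>c_i</tex-math></inline-formula> is defined as follows:
<disp-formula><label>(15)</label><tex-math>N_i = E_{\mathrm{all}}^{\mathrm{ext}} \setminus E_i^{\mathrm{ext}}</tex-math></disp-formula></p>
          <p>Here, <inline-formula><tex-math>N_i</tex-math></inline-formula> is the set of evidence sentences extracted from the gold evidence of claims other than <inline-formula><tex-math>c_i</tex-math></inline-formula>.
The noise injection function AddNoise is defined as:
<disp-formula><label>(16)</label><tex-math>\operatorname{AddNoise}(E_i^{\mathrm{ext}}, N_i, n) = \tilde{E}_i</tex-math></disp-formula>
Here, <inline-formula><tex-math>\tilde{E}_i</tex-math></inline-formula> is generated by the following process:
<disp-formula><label>(17)</label><tex-math>\tilde{E}_i = E_i^{\mathrm{ext}} \cup \operatorname{RandomSample}(N_i, n)</tex-math></disp-formula></p>
          <p><inline-formula><tex-math>\operatorname{RandomSample}(N_i, n)</tex-math></inline-formula> is a function that randomly selects <inline-formula><tex-math>n</tex-math></inline-formula> sentences from the noise
candidate set <inline-formula><tex-math>N_i</tex-math></inline-formula>, where <inline-formula><tex-math>n</tex-math></inline-formula> is a hyperparameter representing the number of sentences to be injected as
noise. The final training evidence sentence set <inline-formula><tex-math>E_i^{\mathrm{train}}</tex-math></inline-formula> is expressed as:
<disp-formula><label>(18)</label><tex-math>E_i^{\mathrm{train}} = \operatorname{AddNoise}(E_i^{\mathrm{ext}}, N_i, n)</tex-math></disp-formula></p>
          <p>Through this, we anticipate that the classification model will operate robustly even when noise is
present during inference. The classification model <inline-formula><tex-math>f_{\mathrm{robust}}</tex-math></inline-formula> is trained to minimize:
<disp-formula><label>(19)</label><tex-math>\ell\bigl(f_{\mathrm{robust}}(c_i, E_i^{\mathrm{train}}), y_i^{\mathrm{gt}}\bigr)</tex-math></disp-formula>
where <inline-formula><tex-math>\ell(\cdot, \cdot)</tex-math></inline-formula> represents the loss function, and <inline-formula><tex-math>y_i^{\mathrm{gt}}</tex-math></inline-formula> represents the ground truth label. Here,
cross-entropy loss was used as the loss function.</p>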
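          <p>The noise injection procedure can be sketched as below. This is an illustrative reading of the AddNoise and RandomSample definitions rather than the authors' code: the dictionary layout, the function name, the seeded random generator, and the example sentences are all assumptions.</p>
          <preformat><![CDATA[
```python
import random
from typing import Dict, List


def add_noise(
    extracted: Dict[str, List[str]],
    claim_id: str,
    n: int,
    rng: random.Random,
) -> List[str]:
    """Return the claim's extracted sentences plus up to n sentences from other claims."""
    own = extracted[claim_id]
    # Noise candidates: every extracted sentence that does not belong to this claim.
    candidates = sorted(
        {s for sentences in extracted.values() for s in sentences} - set(own)
    )
    noise = rng.sample(candidates, min(n, len(candidates)))
    return own + noise


extracted = {
    "c1": ["Exports rose 5% in 2021."],
    "c2": ["The law passed in 1998.", "Turnout was 61%."],
}
train_c1 = add_noise(extracted, "c1", n=1, rng=random.Random(0))
```
]]></preformat>
          <p>Passing the random generator explicitly keeps the sampled noise reproducible across training runs.</p>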
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. 4-Class Classification for Irrelevant Evidence Detection</title>
          <p>During inference, a retrieval system might present evidence irrelevant to a given claim. To address such
situations, we introduced a new label, "Irrelevant," to the classification model. We trained the model to
categorize evidence sentences unrelated to the claim under this new label.</p>
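          <p>For the training data of the "Irrelevant" label, we used, for each claim <inline-formula><tex-math>c_i</tex-math></inline-formula>, the evidence sentences <inline-formula><tex-math>N_i = E_{\mathrm{all}}^{\mathrm{ext}} \setminus E_i^{\mathrm{ext}}</tex-math></inline-formula>, i.e., those extracted
from the gold evidence of other claims. This means that an evidence sentence is
defined as irrelevant to claim <inline-formula><tex-math>c_i</tex-math></inline-formula> if it was extracted from the gold evidence of a different claim <inline-formula><tex-math>c_j</tex-math></inline-formula> (<inline-formula><tex-math>j \neq i</tex-math></inline-formula>).</p>
          <p>Extending the conventional 3-class classification, a 4-class classification function <inline-formula><tex-math>f_{\mathrm{irr}}</tex-math></inline-formula> now outputs a
label <inline-formula><tex-math>y \in \mathcal{L}_{\mathrm{irr}} = \{\mathrm{True}, \mathrm{False}, \mathrm{Conflicting}, \mathrm{Irrelevant}\}</tex-math></inline-formula>:
<disp-formula><label>(20)</label><tex-math>f_{\mathrm{irr}} : (\mathcal{C}, \mathcal{E}) \rightarrow \mathcal{L}_{\mathrm{irr}}</tex-math></disp-formula></p>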
          <p>This allows the model to identify irrelevant evidence even if appropriate evidence sentences are not
retrieved. Consequently, it enables a strategy where the system can re-perform evidence retrieval and
re-classify if irrelevant evidence is detected.</p>
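          <p>One possible form of the re-retrieval strategy mentioned above is sketched here. The paper does not specify the retry policy, so the ordered list of retrievers, the function names, and the stub retrievers and classifier are all hypothetical.</p>
          <preformat><![CDATA[
```python
from typing import Callable, List


def classify_with_fallback(
    claim: str,
    retrievers: List[Callable[[str], List[str]]],
    classify: Callable[[str, List[str]], str],
) -> str:
    """Try each retriever in turn until the classifier returns a label other than Irrelevant."""
    label = "Irrelevant"
    for retrieve in retrievers:
        label = classify(claim, retrieve(claim))
        if label != "Irrelevant":
            break
    return label


def empty_retriever(claim: str) -> List[str]:
    return []


def fixed_retriever(claim: str) -> List[str]:
    return ["Official figures put the rate at 3.5% in 2019."]


def toy_classifier(claim: str, evidence: List[str]) -> str:
    # Stand-in for the 4-class model: no evidence is treated as "Irrelevant".
    return "False" if evidence else "Irrelevant"


label = classify_with_fallback(
    "The rate was 9% in 2019.", [empty_retriever, fixed_retriever], toy_classifier
)
```
]]></preformat>
          <p>If every retriever yields evidence judged irrelevant, the function returns "Irrelevant", leaving the final decision to the caller.</p>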
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments</title>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>For the claim classification task, the following settings were used for training and evaluation.
• Model: FacebookAI/roberta-base
• Maximum sequence length: 512
• Number of labels: Automatically determined from the data (e.g., Conflicting, False, True, etc.)
Training settings:
• Learning rate: 2 × 10<sup>−5</sup>
• Batch size: 128 (training), 128 (evaluation)
• Number of epochs: 10
• Weight decay: 0.01
• Adam epsilon: 1 × 10<sup>−8</sup>
• Scheduler: linear
• Warmup ratio: 0.1</p>
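        <p>To make the scheduler settings concrete, the sketch below computes a learning rate under the assumption that "linear" with a warmup ratio of 0.1 means a linear ramp from zero to the peak rate over the first 10% of steps, followed by a linear decay to zero. That interpretation, the function name, and the step counts are assumptions; the paper does not state the exact schedule.</p>
        <preformat><![CDATA[
```python
import math


def linear_schedule_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,
    warmup_ratio: float = 0.1,
) -> float:
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if warmup_steps > step:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * remaining


start_lr = linear_schedule_lr(0, total_steps=1000)
peak_lr = max(linear_schedule_lr(s, total_steps=1000) for s in range(1001))
end_lr = linear_schedule_lr(1000, total_steps=1000)
```
]]></preformat>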
        <p>All experiments were conducted using a Tesla V100-PCIE-32GB GPU.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Evidence Retrieval</title>
        <p>We evaluated and compared three algorithms for retrieving evidence sentences relevant to claims:
BM25, Contriever, and SimCSE-finetuned Contriever (Contriever with additional SimCSE-based training). Table 1 presents the results.</p>
        <p>The three models showed similarly high performance from P@1 to P@3, which indicates that when
clear, relevant evidence sentences exist for numerical claims, the differences between retrieval methods
are limited. SimCSE-finetuned Contriever demonstrated significant performance improvements, particularly from
P@10 onwards and in Recall metrics, reaching 0.524 for Recall@100 and 0.731
for Recall@1000. This is likely because SimCSE-based contrastive learning enables more effective
learning of semantic relevance between claims and gold evidence. While BM25 showed excellent
precision at the top ranks, it lagged behind the other methods in Recall metrics.</p>
        <p>For fact-checking tasks, it is considered crucial to collect a wide range of diverse evidence sentences.
Therefore, SimCSE-finetuned Contriever, with its high Recall, is the most suitable choice for retrieval coverage. On the
other hand, BM25 remains a viable option when computational resources are limited or when only
the highest-ranked evidence is required.</p>
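        <p>For reference, P@k and Recall@k over binary relevance labels such as those constructed in Section 3.2.1 can be computed as in the following sketch. The function names and the toy ranking are assumptions for illustration and do not reproduce the reported numbers.</p>
        <preformat><![CDATA[
```python
from typing import List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / len(relevant)


ranked = ["e3", "e7", "e1", "e9"]
relevant = {"e3", "e1", "e5"}
p_at_2 = precision_at_k(ranked, relevant, 2)  # one hit among the top 2
r_at_4 = recall_at_k(ranked, relevant, 4)     # two of three relevant sentences found
```
]]></preformat>
        <p>Because Recall@k divides by the number of relevant sentences, a claim with long gold evidence needs many hits to score well, which is why a large k such as 100 or 1000 is informative here.</p>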
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Classification for Fact-Checking</title>
        <p>Table 2 presents the results of the fact-checking classification using the development data. Although
SimCSE-finetuned Contriever achieved high recall in the evidence retrieval results (Table 1), it achieved
the lowest macro F1 in the classification task. This indicates that improvements in retrieval
performance do not necessarily translate directly into better classification performance.
SimCSE-finetuned Contriever improved recall by including a greater number of relevant evidence sentences in the
retrieval results. However, the precision at top ranks (e.g., top-1 or top-3) did not sufficiently improve,
and thus the recall gain did not lead to better overall classification performance. This suggests that while the
fine-tuned Contriever is effective at broadly collecting semantically related sentences, it is less effective
at ranking the most crucial evidence at the top. Therefore, a two-stage retrieval approach would likely be
more effective: first using Contriever for initial retrieval to gather a wide range of candidates, then applying
a reranking model to place the most relevant evidence at higher ranks. BM25-based retrieval
achieved the most stable classification performance, making it the most practical choice in our
experiments.</p>
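        <p>Macro F1, the metric used for the classification results, averages the per-class F1 scores with equal weight per class. A minimal sketch follows; the function name and the toy labels are assumptions and are unrelated to the values in Table 2.</p>
        <preformat><![CDATA[
```python
from typing import List


def macro_f1(gold: List[str], pred: List[str], labels: List[str]) -> float:
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


gold = ["True", "False", "False", "Conflicting"]
pred = ["True", "False", "True", "False"]
score = macro_f1(gold, pred, ["True", "False", "Conflicting"])
```
]]></preformat>
        <p>In this toy case the Conflicting class is never predicted, so its F1 of zero pulls the macro average down regardless of how frequent the class is.</p>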
        <p>Contrary to expectations, the noise augmentation method led to a performance decrease, particularly
a significant drop in Conflicting predictions. This phenomenon can be attributed to the model’s tendency,
after being exposed to irrelevant sentences during training, to classify ambiguous or weakly supported
cases as True or False rather than Conflicting. By learning to make predictions even in the presence of
noise, the model becomes less sensitive to ambiguity and is more likely to output a definitive label. As a
result, the recall and F1 score for the Conflicting class decreased, while misclassifications into the True
or False classes increased. However, a slight improvement was observed for True predictions, partially
confirming the effect of improved noise robustness. These findings suggest that noise injection can
enhance robustness to irrelevant information while also blurring the criteria for identifying ambiguous
cases in the model. In the future, further improvements are needed, such as optimizing data augmentation
and loss function design, to enhance robustness against irrelevant information while more accurately
identifying ambiguous cases.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Irrelevant Evidence Detection</title>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we proposed and validated a two-stage approach to evidence retrieval
and classification for fact-checking numerical claims. In the evidence retrieval task, SimCSE-finetuned Contriever,
i.e., Contriever further trained with SimCSE-based contrastive learning, achieved substantial performance improvements,
particularly in Recall metrics, demonstrating its ability to effectively learn the semantic relevance
between claims and gold evidence. Meanwhile, BM25 maintained stable performance in top-rank
precision, confirming its practicality from a computational efficiency perspective.</p>
      <p>However, a crucial insight gained from the classification task was that high accuracy in evidence
retrieval does not necessarily directly lead to improved classification performance. The classification
model using BM25-based search results achieved the most stable macro F1, indicating its optimality from
a practical standpoint. Class-wise analysis revealed that False predictions consistently had the highest
F1 score across all methods, while True predictions proved to be the most challenging. Although the
noise augmentation method unexpectedly led to an overall performance decrease, a slight improvement
was observed for True predictions, suggesting potential for improved noise robustness.</p>
      <p>As future work, we aim to build a two-stage retrieval system that leverages the high Recall
of SimCSE-finetuned Contriever. By applying re-ranking techniques to the comprehensively retrieved candidate
documents, we expect to place more relevant evidence sentences higher in the results and thereby
balance retrieval coverage and ranking accuracy.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>A part of this work was supported by JSPS KAKENHI Grant Number 23K11342.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author utilized Gemini for revisions related to grammar and
clarity. These tools were employed to refine sentence structure, correct typographical errors, and
enhance the overall quality of the language. They were also used for translating content into English.
No generative content was used in the analysis, figures, or experimental sections. After using these
tools/services, the author reviewed and edited the content as needed and assumes full responsibility for
the content of this publication.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Z. Guo, M. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Transactions of the Association for Computational Linguistics 10 (2022) 178–206.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] V. Venktesh, V. Setty, A. Anand, M. Hasanain, B. Bendou, H. Bouamor, F. Alam, G. Iturra-Bocaz, P. Galuščáková, Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] M. Mori, P. Papotti, L. Bellomarini, O. Giudice, Neural machine translation for fact-checking temporal claims, in: R. Aly, C. Christodoulopoulos, O. Cocarascu, Z. Guo, A. Mittal, M. Schlichtkrull, J. Thorne, A. Vlachos (Eds.), Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 78–82. URL: https://aclanthology.org/2022.fever-1.8/. doi:10.18653/v1/2022.fever-1.8.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Chen, A. Sriram, E. Choi, G. Durrett, Generating literal and implied subquestions to fact-check complex claims, in: Y. Goldberg, Z. Kozareva, Y. Zhang (Eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Abu Dhabi, United Arab Emirates, 2022, pp. 3495–3516. URL: https://aclanthology.org/2022.emnlp-main.229/. doi:10.18653/v1/2022.emnlp-main.229.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen, MultiFC: A real-world multi-domain dataset for evidence-based fact checking of claims, in: K. Inui, J. Jiang, V. Ng, X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics, Hong Kong, China, 2019, pp. 4685–4697. URL: https://aclanthology.org/D19-1475/. doi:10.18653/v1/D19-1475.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Sathe, S. Ather, T. M. Le, N. Perry, J. Park, Automated fact-checking of claims from Wikipedia, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Twelfth Language Resources and Evaluation Conference, European Language Resources Association, Marseille, France, 2020, pp. 6874–6882. URL: https://aclanthology.org/2020.lrec-1.849/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] M. Schlichtkrull, Z. Guo, A. Vlachos, AVeriTeC: A dataset for real-world claim verification with evidence from the web, 2023. URL: https://arxiv.org/abs/2305.13117. arXiv:2305.13117.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] V. V, A. Anand, A. Anand, V. Setty, QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims, 2024. URL: https://arxiv.org/abs/2403.17169. arXiv:2403.17169.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] S. Robertson, H. Zaragoza, et al., The probabilistic relevance framework: BM25 and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] G. Izacard, M. Caron, L. Hosseini, S. Riedel, P. Bojanowski, A. Joulin, E. Grave, Unsupervised dense information retrieval with contrastive learning, arXiv preprint arXiv:2112.09118 (2021).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>