<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Fraunhofer SIT at CheckThat! 2025: Multi-Instance Evidence Pooling for Numerical Claim Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>André Runewicz</string-name>
          <email>andre.runewicz@sit.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Moritz Ranly</string-name>
          <email>paul.ranly@sit.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Inna Vogel</string-name>
          <email>inna.vogel@advisori.de</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Steinebach</string-name>
          <email>martin.steinebach@sit.fraunhofer.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ATHENE - National Research Center for Applied Cybersecurity</institution>
          ,
<addr-line>Rheinstrasse 75, 64295 Darmstadt, Germany</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Advisori FTC GmbH</institution>
          ,
<addr-line>Kaiserstrasse 44, 60329 Frankfurt am Main, Germany, https://advisori.de/</addr-line>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Fraunhofer Institute for Secure Information Technology SIT</institution>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The growing spread of misinformation, particularly on social media, has increased the demand for scalable and accurate automated fact-checking systems. This paper presents our approach to Task 3 of the CheckThat! 2025 lab, which focuses on the verification of numerical claims. We propose a three-stage architecture comprising (i) semantic evidence retrieval using dense bi-encoder representations stored in a FAISS index, (ii) contrastive re-ranking with a fine-tuned cross-encoder leveraging weak supervision from gold evidences, and (iii) claim classification using multi-instance learning (MIL) with various evidence pooling strategies. Experiments on the QuanTemp dataset demonstrate that attention and LogSumExp pooling outperform standard concatenation methods, with our best model achieving a macro-F1 score of 0.5213 on the test set. Additionally, ablation studies confirm the effectiveness of contrastive re-ranking and the practical advantages of dense retrieval, which achieves both higher accuracy and significantly faster retrieval than traditional BM25.</p>
      </abstract>
      <kwd-group>
<kwd>Fact-checking</kwd>
        <kwd>Claim verification</kwd>
        <kwd>Misinformation detection</kwd>
        <kwd>Numerical claims</kwd>
        <kwd>Multi-instance learning</kwd>
<kwd>Cross-encoder re-ranking</kwd>
        <kwd>Contrastive learning</kwd>
        <kwd>Dense retrieval</kwd>
        <kwd>Natural language processing</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Misinformation is one of today’s greatest challenges in the digital world. Whether it is spread
intentionally or not, it can strongly influence our opinions and propagate rapidly [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since many
adults refer to social media platforms when consuming news [
        <xref ref-type="bibr" rid="ref2">2</xref>
], the risk of further dissemination of
disinformation is more present than ever. Conventional fact-checking organizations like PolitiFact (https://www.politifact.com/) or
Snopes (https://www.snopes.com/) struggle to keep up with the amount of information worth checking, as manual fact-checking is
both time-consuming and tedious [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. To address these concerns and automatically detect
misinformation, Nakov et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] presented the full claim verification pipeline for the CheckThat! 2022 lab.
The pipeline consists of four main tasks, namely check-worthiness estimation, verified claim retrieval,
evidence retrieval, and claim verification.
      </p>
      <p>
The CheckThat! lab has addressed different parts of the claim verification pipeline across multiple
editions, for instance in 2018 (check-worthiness detection and claim verification), 2020 (all subtasks),
and 2024 (check-worthiness detection) [
        <xref ref-type="bibr" rid="ref5 ref6 ref7">5, 6, 7</xref>
        ]. In the 2025 edition
of the CheckThat! lab, task 3 addresses the last step of the pipeline — claim verification — with a
focus on numerical claims [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Numerical claims are here defined as claims that require validation of
quantitative or temporal statements, e.g., a mention of a date or year, or a quantity of participants.
      </p>
      <p>In this paper, we present our approach to the third task, which follows a three-stage architecture: (i)
initial evidence candidate retrieval using dense vector similarity from a FAISS index, (ii) fine-grained
re-ranking through a contrastively trained cross-encoder, and (iii) final claim classification via a MIL
framework based on roberta-large-mnli. Our method emphasizes semantic retrieval over
traditional surface-level approaches (e.g., BM25), leverages weakly supervised contrastive signals from gold
evidences, and avoids noisy input concatenation by aggregating evidence representations through
pooling mechanisms. This modular setup allows the system to flexibly combine retrieval and reasoning,
while remaining scalable for real-world application.</p>
      <p>Our main contributions are as follows:
• We propose a modular pipeline for claim verification that integrates dense retrieval, contrastive
re-ranking, and multi-instance classification for numerical claims.
• We introduce a contrastive training setup using weak supervision from summarized gold evidences
to improve re-ranking quality without relying on manual annotation.
• We evaluate multiple evidence aggregation strategies (Max, LogSumExp, Attention), and show
that pooling-based MIL improves over standard concatenation approaches.</p>
      <p>
        The paper is structured as follows: Section 2 gives an overview of the fake news detection pipeline
as presented in [
        <xref ref-type="bibr" rid="ref4">4</xref>
] and the problem of the specific task. Section 3 summarizes prior work in
this field. In Section 4 we present our methodology, which we then evaluate and analyze in
Section 5. Finally, we conclude and provide an outlook on future work in Section 6.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <p>This section provides an overview of the fake news detection pipeline (Subsection 2.1) and presents a
formal definition of the claim verification task (Subsection 2.2).</p>
      <sec id="sec-2-1">
        <title>2.1. Fake News Detection Pipeline</title>
        <p>
          The process of detecting and evaluating fake news was first formulated by Nakov et al. in 2018 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] and
further refined in the following editions. Since 2020, the pipeline has been presented in four distinct stages. We
consider an adaptation of this pipeline, shown in Figure 1.
        </p>
        <p>The first task, check-worthiness estimation, assesses whether a statement needs further verification,
i.e., if it needs to be passed on to a verifying authority. The second task, verified claim retrieval, attempts
to determine whether the claim has already been fact-checked by searching a database of previously
verified claims. If a matching verified claim is found, its label can directly support the classification
of the new claim. The third step, evidence retrieval, involves gathering additional relevant evidence
from external sources, such as web documents or news outlets, which may help in verifying the claim.
Finally, the fourth task, claim verification, combines the original claim with the retrieved evidences to
assess its veracity. This is described in more detail in the following subsection.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Problem Formulation</title>
        <p>The task of claim verification involves determining the veracity of a claim by consulting a set of
evidences retrieved from a large corpus. Formally, given a query (claim) and relevant evidences, the
goal is to predict whether the claim is true, false, or in contradiction with the available information.
Mathematically, this can be described as follows:</p>
        <sec id="sec-2-2-1">
          <title>1. Definitions</title>
<p>Let:
• $Q = \{q_1, q_2, \dots, q_n\}$ be the set of queries (claims).
• $D = \{d_1, d_2, \dots, d_m\}$ be the corpus of evidences.
• $f : Q \times D^k \to Y$ be the verification model, which maps a query and a set of $k$ evidences to a label.
• $Y = \{\text{True}, \text{False}, \text{Contradiction}\}$ be the label space.</p>
        </sec>
        <sec id="sec-2-2-2">
          <title>2. Evidence Retrieval Function</title>
<p>Let $R$ be the retrieval function, selecting the top-$k$ relevant evidence passages from the corpus for a given query $q$.</p>
          <p>This can be formalized as:
$$R : Q \to D^k \tag{1}$$
$$R(q) = \arg\max_{E \subset D,\, |E| = k} \sum_{d \in E} \mathrm{rel}(q, d) \tag{2}$$
where $\mathrm{rel}(q, d)$ is a relevance scoring function, and we write $E_q = R(q)$ for the retrieved evidence set.</p>
        </sec>
        <sec id="sec-2-2-3">
          <title>3. Classification Objective</title>
<p>Given a query $q \in Q$ and its retrieved evidence set $E_q = R(q)$, the classifier predicts:
$$\hat{y} = f(q, E_q) \tag{3}$$
where $\hat{y} \in Y$ is the predicted label.</p>
        </sec>
        <sec id="sec-2-2-4">
          <title>4. Learning Objective</title>
<p>Given a training dataset $\{(q_i, E_i, y_i)\}_{i=1}^{N}$, where $y_i$ is the ground-truth label for query $q_i$ with evidence
set $E_i$, the model parameters $\theta$ are optimized by minimizing the classification loss:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(q_i, E_i), y_i\big) \tag{4}$$
where $\ell$ is a loss function (cross-entropy).</p>
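          <p>To make the formulation concrete, the following minimal Python sketch ties equations (2)-(4) together; the helper names (rel, classifier) are hypothetical placeholders, not components of our actual pipeline:</p>
          <preformat>
import math

def retrieve(q, corpus, rel, k=10):
    # Equation (2): pick the k evidences with the highest relevance score rel(q, d).
    return sorted(corpus, key=lambda d: rel(q, d), reverse=True)[:k]

def cross_entropy(probs, y):
    # Per-example loss: negative log-probability of the gold label index y.
    return -math.log(probs[y])

def empirical_loss(data, corpus, rel, classifier, k=10):
    # Equation (4): average loss over the training set {(q_i, y_i)}.
    losses = []
    for q, y in data:
        e_q = retrieve(q, corpus, rel, k)   # E_q = R(q), equations (1)-(2)
        probs = classifier(q, e_q)          # distribution over Y, equation (3)
        losses.append(cross_entropy(probs, y))
    return sum(losses) / len(losses)
          </preformat>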
        </sec>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Dataset</title>
        <p>
The relevant dataset for the claim verification task in English is QuanTemp [
          <xref ref-type="bibr" rid="ref9">9</xref>
], a multi-domain dataset
built from real-world data and focusing on numerical claims. While past iterations of the task comprised
political data (mainly PolitiFact) or social media data (from X, formerly Twitter) [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
], QuanTemp spans
multiple domains and thus has broader applicability. The detailed distribution of the dataset as
provided by the authors is given in Table 1.
        </p>
<p>As can be seen, the dataset is imbalanced and consists mostly of false claims. It can be argued that
the sources of the data, mostly PolitiFact and Snopes, skew the labels toward false classifications,
as fact-checking organizations mainly deal with questionable or incorrect claims that need further
verification.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Related Work</title>
      <p>
Claim verification as a standalone task was first presented in 2018 and featured in the two following editions. It
was then reintroduced in the 2023 CheckThat! lab. It has been held in both English and Arabic,
but it generally has not attracted as many participants as other tasks in the domain of the full
claim verification pipeline. In most cases, the participants retrieved external information to be able
to satisfactorily classify the given claims [
        <xref ref-type="bibr" rid="ref5">5</xref>
]. Methods used for classification include more traditional
approaches like SVMs, random forest classifiers, or logistic regression [10], but also neural architectures
like CNNs [11]. Ghanem et al. [12] proposed a three-stage verification approach, consisting of evidence
retrieval, evidence ranking, and textual entailment. For the 2023 task, which was held in Arabic, Touahri
and Mazroui [13] implemented linguistic features that focus on the alignment between claims and
externally extracted texts.
      </p>
<p>Apart from the challenges in the CheckThat! lab, other approaches utilized Transformer models like
BERT [14] or other language model ensembles [15] for claim verification. More recently, architectures
based on graph networks [16] and large language models (LLMs) [17] also showed promising results
and outperformed many traditional approaches.</p>
      <p>
        Many of the presented approaches were evaluated using FEVER [18], a dataset consisting of over
185,000 generated claims used for claim verification. In 2024, Venktesh et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] introduced a novel
real-world dataset in which all claims are sourced from real fact-checks and involve numerical
information. Furthermore, they also experimented with claim decomposition to improve
claim verification, which showed promising results on the QuanTemp dataset.
      </p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
<p>Our submitted pipeline (English only) follows a three-stage architecture: (i) evidence candidate retrieval
using dense vectors, (ii) re-ranking using a fine-tuned cross-encoder, and (iii) final claim classification
using a large NLI model. In the following, each step is described in detail.</p>
<p>Evidence Retrieval We first pre-computed document embeddings using
sentence-transformers/all-MiniLM-L6-v2 (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) and stored them in a FAISS index for efficient
inner-product search. The embeddings were L2-normalized to enable cosine-similarity retrieval. At inference
time, each claim was encoded into a query embedding, normalized, and used to retrieve the top-100
most similar evidence snippets from the index. The goal was to retrieve not only lexically similar
evidence snippets, but also semantically relevant ones that may be phrased differently.</p>
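      <p>A minimal sketch of this retrieval stage (assuming corpus, a list of evidence snippets, and claim are given; the index type and variable names are our illustrative choices):</p>
      <preformat>
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Pre-compute and L2-normalize document embeddings so that
# inner-product search is equivalent to cosine similarity.
doc_emb = model.encode(corpus, convert_to_numpy=True)
faiss.normalize_L2(doc_emb)

index = faiss.IndexFlatIP(doc_emb.shape[1])  # exact inner-product index
index.add(doc_emb)

# Inference: encode the claim, normalize, retrieve the top-100 candidates.
query = model.encode([claim], convert_to_numpy=True)
faiss.normalize_L2(query)
scores, ids = index.search(query, 100)
candidates = [corpus[i] for i in ids[0]]
      </preformat>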
      <sec id="sec-4-1">
        <title>Re-ranking with Contrastive Fine-Tuning</title>
<p>The initial candidates were re-ranked using
cross-encoder/ms-marco-MiniLM-L6-v2 (https://huggingface.co/cross-encoder/ms-marco-MiniLM-L6-v2), fine-tuned in a contrastive setup. To improve
re-ranking beyond the vanilla MS MARCO model, we generated weak supervision signals from
gold-labeled evidence snippets in the training and validation sets. Further, these snippets were summarized
using LLaMA-3.1-8B to remove noise and produce cleaner positives. During training, the model
was optimized using a margin ranking loss (https://pytorch.org/docs/stable/generated/torch.nn.MarginRankingLoss.html), where each claim served as an anchor, its corresponding
summarized gold evidence as the positive, and the top-100 BM25-retrieved candidates as negatives. This
training strategy encouraged the model to assign higher relevance scores to gold summaries by pulling
them closer to the claim in the embedding space. Crucially, this step addressed the main bottleneck in
the task: retrieval quality.</p>
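        <p>A minimal sketch of one such contrastive training step (the scoring callable score(claim, passage) stands in for the cross-encoder and is our simplification):</p>
        <preformat>
import torch
from torch import nn

margin_loss = nn.MarginRankingLoss(margin=1.0)

def contrastive_step(score, claim, gold_summary, bm25_negatives):
    # Anchor: the claim. Positive: its summarized gold evidence.
    # Negatives: the top-100 BM25-retrieved candidates.
    pos = score(claim, gold_summary)                      # scalar tensor
    negs = torch.stack([score(claim, n) for n in bm25_negatives])
    target = torch.ones_like(negs)   # positive should outrank every negative
    return margin_loss(pos.expand_as(negs), negs, target)
        </preformat>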
      </sec>
      <sec id="sec-4-2">
<title>Claim Classification with Multi-Instance Learning</title>
<p>The final decision was produced by
FacebookAI/roberta-large-mnli trained in a MIL setup. For every claim $q \in Q$, we took the
$k = 10$ re-ranked evidence passages $E_q = \{e_1, \dots, e_k\} \subset D$ and forwarded each pair $(q, e_i)$ through
the encoder, obtaining a class-logit vector $z_i \in \mathbb{R}^C$, with $C = 3$ (True, False, Contradiction). The
$k$ snippet-level logit vectors were then pooled into a single bag-level vector $\phi(q, E_q) \in \mathbb{R}^C$ before
applying softmax. Each pooling operator produced a scalar score $\phi_\ell(q, E_q)$ for class $\ell$. The final
prediction was obtained by:
$$\hat{y} = \arg\max_\ell \phi_\ell(q, E_q)$$
where $\phi(\cdot)$ denotes the choice of pooling operator:</p>
        <p>1. Max pooling (hard winner-take-all) picks the single most confident snippet per class:
$$\phi_\ell^{\max}(q, E_q) = \max_{1 \le i \le k} z_{i,\ell}$$
where $z_{i,\ell}$ is the logit of class $\ell$ for snippet $e_i$.</p>
        <p>2. LogSumExp pooling is a smooth approximation of the maximum:
$$\phi_\ell^{\mathrm{lse}}(q, E_q) = \log \sum_{i=1}^{k} \exp(z_{i,\ell})$$</p>
        <p>3. Attention pooling learns a soft relevance weight for every snippet:
$$\alpha_i = \frac{\exp(w^\top h_i)}{\sum_{j=1}^{k} \exp(w^\top h_j)}, \qquad \phi_\ell^{\mathrm{att}}(q, E_q) = \sum_{i=1}^{k} \alpha_i z_{i,\ell}$$
where $h_i \in \mathbb{R}^d$ is the CLS embedding of $(q, e_i)$ and $w \in \mathbb{R}^d$ is a single learned parameter vector
($d = 1024$ for roberta-large).</p>
        <p>Max pooling simply takes the highest logit among all snippets for each class and ignores the rest. It is
noise-tolerant, but discards any partial support from other evidences. Moreover, it performs poorly
when all top-$k$ evidence snippets are irrelevant, as no useful evidence contributes to the prediction.</p>
        <p>LogSumExp pooling is a smooth approximation of max and amplifies strong evidence while still
giving weaker evidences some influence, which makes it more stable than hard max.</p>
<p>
          Attention-based pooling, used in our final submission, allows the model to focus on the most
informative snippets while ignoring irrelevant ones, without requiring an arbitrary truncation or noisy
concatenation of top-$k$ candidates. Furthermore, it is naturally robust to variable evidence quality across
claims. In essence, it performs an additional, internal ranking of evidence snippets during classification,
letting the model weigh each piece of evidence according to its usefulness for the final decision.</p>
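        <p>The three pooling operators can be sketched directly over the snippet-level logits (a minimal PyTorch illustration with random tensors; shapes follow the definitions above):</p>
        <preformat>
import torch

k, C, d = 10, 3, 1024            # snippets per bag, classes, hidden size
logits = torch.randn(k, C)       # z_i: class logits per (claim, snippet) pair
h = torch.randn(k, d)            # h_i: CLS embeddings of each pair
w = torch.randn(d)               # learned attention parameter vector

# Max pooling: highest logit per class across all snippets.
phi_max = logits.max(dim=0).values               # shape (C,)

# LogSumExp pooling: smooth approximation of the maximum.
phi_lse = torch.logsumexp(logits, dim=0)         # shape (C,)

# Attention pooling: softmax over w^T h_i, then weighted sum of logits.
alpha = torch.softmax(h @ w, dim=0)              # shape (k,)
phi_att = (alpha.unsqueeze(1) * logits).sum(0)   # shape (C,)

prediction = phi_att.argmax().item()             # bag-level decision
        </preformat>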
<p>Training Details To ensure comparability with Venktesh et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
], all models were trained using only
the official data splits provided in the task description/GitLab repository. We did not use any external
datasets or knowledge bases. The training and validation splits exactly followed those defined in the
official paper. While it might have been possible to significantly boost test performance by externally
scraping sources like Google Fact Check, we deliberately avoided such shortcuts to stay within the
intended constraints of the task.
        </p>
        <p>Training was conducted on an NVIDIA L4 GPU with 22.5 GB VRAM. The model was trained for up
to 10 epochs with early stopping (patience = 3), and best validation performance was reached between
epochs 5 and 6. We used a batch size of 3 with gradient accumulation over 8 steps (effective batch size
of 24). The learning rate was set to 2e-5, with a cosine scheduler and 10% linear warm-up. The model
was compiled using torch.compile for performance, and trained using bfloat16 precision via PyTorch
AMP. We used a class-weighted cross-entropy loss to handle class imbalance, which led to improved
macro-F1, albeit with a slight decrease in weighted-F1. For reproducibility, all experiments were run
with random seed 42.</p>
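        <p>A condensed sketch of this configuration (the optimizer choice of AdamW and the variables model, total_steps, class_weights, batch, and labels are our assumptions; the text above fixes only the learning rate, schedule, precision, and loss weighting):</p>
        <preformat>
import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

torch.manual_seed(42)                              # reproducibility

accum_steps = 8                                    # batch size 3 x 8 = 24 effective
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.10 * total_steps),      # 10% linear warm-up
    num_training_steps=total_steps,                # cosine decay afterwards
)

# Class-weighted cross-entropy against the label imbalance (mostly False claims).
criterion = nn.CrossEntropyLoss(weight=class_weights)

model = torch.compile(model)                       # compiled for performance
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    logits = model(batch)                          # bfloat16 forward via AMP
    loss = criterion(logits, labels) / accum_steps # scale for accumulation
        </preformat>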
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results &amp; Analysis</title>
      <p>In this section, we present the results of our main method, followed by ablation studies on the test set,
and a summary of additional experiments and insights.</p>
      <p>Results on Development Set We first evaluated our approach using three diferent pooling strategies:
attention-based, LogSumExp, and max pooling. All experiments used the same retriever
(all-MiniLML6-v2) and the contrastively fine-tuned cross-encoder described in Section 4. As shown in Table 2,
attention-based pooling achieved the best performance, with a macro-F1 of 0.5937 and a weighted-F1
of 0.6682. LogSumExp and max pooling performed comparably but slightly below attention pooling.
Based on these results, we selected the attention-based pooling model as our final submission.
Ablation Studies on Test Set To better understand the impact of individual components, we
conducted ablation studies on the test set (see Table 3). Specifically, we varied the retriever (all-MiniLM-L6-v2
vs. BM25) and the re-ranker (fine-tuned vs. vanilla) across the three pooling strategies. Several key
observations emerged:
• Retriever quality matters: The bi-encoder (MiniLM) consistently outperformed BM25 across
all pooling methods. Furthermore, retrieving the top-100 initial candidates using pre-computed
MiniLM embeddings with a FAISS index was approximately six times faster than BM25 retrieval,
making it significantly more practical for large-scale applications.
• Contrastive fine-tuning helps: In all configurations using MiniLM, the fine-tuned re-ranker
outperformed the vanilla version, confirming the effectiveness of our contrastive training strategy
with summarized gold evidences. At this point, it is worth noting that we used all top-100 retrieved
evidence snippets as negatives during training. This simplification is theoretically suboptimal,
since some of these snippets may actually be relevant, and treating them as negatives could
push useful evidence away from the claim in the embedding space. Due to the absence of labels
(i.e., which specific snippets are truly supportive for a given claim), we adhered to this heuristic
approach.
• Pooling strategy robustness: LogSumExp pooling achieved the highest macro-F1 (0.5213),
suggesting that it generalized slightly better to the test set than attention or max pooling.</p>
<p>Interestingly, when using BM25 for retrieval, the vanilla re-ranker occasionally performed better
than its fine-tuned counterpart, likely because BM25 returned noisier candidates that disrupted the
fine-tuned model’s learned ranking behavior. Overall, the results affirm the benefits of our modular
architecture, particularly the fine-tuned re-ranking step and the use of attention-based or LogSumExp
pooling in a MIL framework.</p>
      <p>When inspecting the class-wise F1 scores (Table 3) together with the confusion matrices in Figure 2,
three clear trends appear:</p>
      <p>First, the False label is by far the easiest to recognize. All three pooling strategies achieve an F1
between 0.75 and 0.78, which is driven by the large number of correct False predictions (≈ 1,700
instances in every matrix) and relatively few confusions with the other classes. Because this class
accounts for roughly two-thirds of the test set, its high F1 score drives the micro- and weighted-F1,
whereas macro-F1 weights all classes equally.</p>
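      <p>For illustration, with hypothetical per-class F1 scores of 0.78 (False), 0.45 (True), and 0.41 (Conflicting) and supports of 2,000, 700, and 600 instances, the weighted-F1 would be (0.78·2000 + 0.45·700 + 0.41·600)/3300 ≈ 0.64, while the macro-F1 would only be (0.78 + 0.45 + 0.41)/3 ≈ 0.55: the dominant False class lifts the weighted score far more than the macro score.</p>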
      <p>Second, the True label benefits most from the smoother LogSumExp pooling. Compared with
attention pooling, LogSumExp leaves the number of True→False errors virtually unchanged (252 vs.
254) but substantially increases correct True→True hits (264 → 338) and cuts True→Conflicting
confusions (201 → 125), raising the class-wise F1 from 0.4087 to 0.4489. In contrast, Max pooling
sacrifices True accuracy (F1=0.3594) by sending many true claims to the Conflicting bucket (260
instances). This indicates that its hard selection amplifies noise in borderline evidence.</p>
      <p>Third, the Conflicting label remains the hardest case overall. Even the best setting (Max pooling,
F1=0.4104) still misclassifies roughly half of the conflicting claims, mostly as False. This corroborates the
qualitative observation that distinguishing genuinely contradictory evidence from merely missing or
weak evidence is challenging. The models often fall back to the majority False decision when evidence
is ambiguous.</p>
<p>In general, LogSumExp offers the best balance between True and False, whereas Max pooling excels
at recognizing Conflicting cases, and attention pooling sits in between. These findings highlight a
trade-off: improving the minority classes (True, Conflicting) comes at a slight cost to the False performance,
and vice versa.</p>
      <sec id="sec-5-1">
        <title>Further Experiments and Findings</title>
        <p>
          We explored a number of additional modeling strategies:
• Claim decomposition: Incorporating decomposed claims (as proposed in Venktesh et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
])
did not improve performance. In fact, using the claim-only method yielded slightly better results.</p>
        <p>[Figure 2: Confusion matrices on the test set for the LogSumExp, attention, and max pooling models; rows and columns correspond to the labels True, False, and Conflicting.]</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
      <p>This paper presented a three-stage architecture for numerical claim verification in the CheckThat!
2025 lab. We first retrieve evidence passages with a dense bi-encoder stored in a FAISS index, then refine
the candidate list with a contrastively fine-tuned cross-encoder, and finally classify each claim in a MIL
framework based on roberta-large-mnli that aggregates snippet-level logits through alternative
pooling operators. Evaluated on the QuanTemp dataset, the complete system reaches a macro-F1 of
0.5213 and a weighted-F1 of 0.6280 with LogSumExp pooling, which outperforms concatenation-based
baselines and demonstrates that pooling-based MIL offers a viable alternative to simplistic evidence
concatenation. Ablation studies confirm that dense retrieval offers both higher accuracy and a six-fold
speed-up over BM25, while the contrastive fine-tuning of the re-ranker consistently enhances ranking
quality.</p>
      <p>Despite these gains, the overall macro-F1 is still capped by the recall of the retrieval stage, which
remains the main bottleneck of this task. Future work could therefore focus on retrieval-centric
improvements, including late-interaction models such as ColBERT that combine token-level matching
with efficient vector search. In addition, manually labeling the truly relevant evidence snippets for
each claim would allow us to avoid treating helpful passages as negatives during contrastive training,
providing much cleaner learning signals for the re-ranker (a catalogue of pre-trained cross-encoder
re-rankers is available at https://sbert.net/examples/sentence_transformer/applications/retrieve_rerank/README.html#pre-trained-cross-encoders-re-ranker).</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, we used ChatGPT-4o to assist with light editing tasks such as
grammar correction, writing style improvement, and occasional rephrasing. At no point was any section
of the paper fully generated by the tool. All content and ideas originate from us. All outputs from
the AI assistant were critically reviewed, revised where necessary, and verified by us. We take full
responsibility for the content of this publication and confirm that the use of AI-assisted tools is in
accordance with the CEUR-WS policy on AI-assisted writing.</p>
    </sec>
    <sec id="sec-8">
<title>Acknowledgments</title>
      <p>
        This research work was supported by the National Research Center for Applied Cybersecurity ATHENE.
ATHENE is funded jointly by the German Federal Ministry of Research, Technology and Space and the
Hessian Ministry of Science and Research, Arts and Culture.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>E.</given-names>
            <surname>Aïmeur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Amri</surname>
          </string-name>
          , G. Brassard,
          <article-title>Fake news, disinformation and misinformation in social media: a review</article-title>
          ,
          <source>Social Network Analysis and Mining</source>
          <volume>13</volume>
          (
          <year>2023</year>
          )
          <fpage>30</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
<mixed-citation>[2] Pew Research Center, Social media and news fact sheet, 2023. URL: https://www.pewresearch.org/journalism/fact-sheet/social-media-and-news-fact-sheet/, accessed: 2025-05-27.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Georgiev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mohamed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Assisting the human fact-checkers: Detecting all previously fact-checked claims in a document</article-title>
          ,
          <source>arXiv preprint arXiv:2109.07410</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
<article-title>Overview of the CLEF-2022 CheckThat! lab task 2 on detecting previously fact-checked claims</article-title>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kyuchukov</surname>
          </string-name>
          ,
G. Da San Martino,
          <article-title>Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , G. Da San Martino, M. Hasanain,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babulkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hamdan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          , et al.,
<article-title>Overview of CheckThat! 2020: Automatic identification and verification of claims in social media</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction: 11th International Conference of the CLEF Association, CLEF 2020</source>
          , Thessaloniki, Greece,
          <source>September 22-25</source>
          ,
          <year>2020</year>
          , Proceedings 11, Springer,
          <year>2020</year>
          , pp.
          <fpage>215</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the clef-2024 checkthat! lab: check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</article-title>
          ,
          <source>in: International Conference of the Cross-Language Evaluation Forum for European Languages</source>
          , Springer,
          <year>2024</year>
          , pp.
          <fpage>28</fpage>
          -
          <lpage>52</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in: J.
          <string-name>
            <surname>Carrillo-de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>García Seco de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>Quantemp: A real-world open-domain benchmark for fact-checking numerical claims</article-title>
          ,
<source>in: 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024</source>
          , Association for Computing Machinery (ACM),
          <year>2024</year>
          , pp.
          <fpage>650</fpage>
          -
          <lpage>660</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Yasser, M. Kutlu, T. Elsayed, bigIR at CLEF 2018: Detection and verification of check-worthy political claims, in: CLEF (Working Notes), 2018.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] D. Wang, J. Simonsen, B. Larsen, C. Lioma, The Copenhagen team participation in the factuality task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 fact checking lab, in: CLEF (Working Notes), 2018.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] B. Ghanem, G. Glavaš, A. Giachanou, S. P. Ponzetto, P. Rosso, F. Rangel, UPV-UMA at CheckThat! lab: Verifying Arabic claims using a cross-lingual approach, in: CEUR Workshop Proceedings, volume 2380, RWTH Aachen, 2019, pp. 1–10.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] I. Touahri, A. Mazroui, EvolutionTeam at CLEF2020-CheckThat! lab: Integration of linguistic and sentimental features in a fake news detection approach, in: CLEF (Working Notes), 2020.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] A. Soleimani, C. Monz, M. Worring, BERT for evidence retrieval and claim verification, in: European Conference on Information Retrieval, Springer, 2020, pp. 359–366.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] S. Gurrapu, L. Huang, F. A. Batarseh, ExClaim: Explainable neural claim verification using rationalization, in: 2022 IEEE 29th Annual Software Technology Conference (STC), IEEE, 2022, pp. 19–26.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] Y. Chen, H. Liu, Y. Liu, R. Yang, H. Yuan, Y. Fu, P. Zhou, Q. Chen, J. Caverlee, I. Li, GraphCheck: Breaking long-term text barriers with extracted knowledge graph-powered fact-checking, arXiv preprint arXiv:2502.16514 (2025).</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] G. Fenza, D. Furno, V. Loia, P. P. Trotta, Multi-LLM agents architecture for claim verification (2025).</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: A large-scale dataset for fact extraction and verification, arXiv preprint arXiv:1803.05355 (2018).</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>