<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DS@GT at CheckThat! 2025: Evaluating Context and Tokenization Strategies for Numerical Fact Verification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maximilian Heil</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleksandar Pramov</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Georgia Institute of Technology</institution>
          ,
          <addr-line>North Ave NW, Atlanta, GA 30332</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <fpage>6459</fpage>
      <lpage>6459</lpage>
      <abstract>
        <p>Numerical claims - statements involving quantities, comparisons, and temporal references - pose unique challenges for automated fact-checking systems. In this study, we evaluate modeling strategies for veracity prediction of such claims using the QuanTemp dataset and building our own evidence retrieval pipeline. We investigate three key factors: (1) the impact of more evidence snippets with longer input context windows using ModernBERT, (2) the effect of right-to-left (R2L) tokenization, and (3) their combined influence on classification performance. Contrary to prior findings in arithmetic reasoning tasks, R2L tokenization does not boost natural language inference (NLI) on numerical tasks. A longer context window does not enhance veracity performance either, highlighting evidence quality as the dominant bottleneck. Our best-performing system achieves a competitive macro-average F1 score of 0.57 and places us among the Top-4 submissions in Task 3 of CheckThat! 2025. Our code is available at https://github.com/dsgt-arc/checkthat-2025-numerical.</p>
      </abstract>
      <kwd-group>
        <kwd>Transformer</kwd>
        <kwd>Retrieval</kwd>
        <kwd>ModernBERT</kwd>
        <kwd>Tokenization</kwd>
        <kwd>Context Window</kwd>
        <kwd>Numerical Understanding</kwd>
        <kwd>Fact Checking</kwd>
        <kwd>CEUR-WS</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In the broader scope of automatic fact-verification systems, the CLEF CheckThat! 2025 edition, through
its different tasks, investigates best practices for each component of the typical pipeline of such a
system: from establishing claim check-worthiness, through claim normalization, and culminating
in evidence retrieval and natural language inference.[
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ] Task 3, in particular, focuses on systems for
automatic claim verification of numerical and temporal claims [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Numerical misinformation — claims
involving statistics, comparisons, intervals, and temporal expressions — poses a distinct challenge to
automated fact-checking systems. While advances in neural architectures and the availability of
large-scale datasets have significantly improved claim verification for general text, verifying numerical claims
remains underexplored and substantially more difficult. This is especially critical given the documented
“numeric-truth effect,” where the presence of numbers lends false credibility to misinformation [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>To that end, CheckThat! Task 3 employs the QuanTemp dataset, which was introduced as the first
large-scale, real-world benchmark dedicated entirely to the verification of numerical and temporal claims
[6]. Collected from 45 fact-checking organizations worldwide, it comprises over 15,000 diverse claims
annotated with fine-grained labels—True, False, or Conflicting—and is accompanied by a web-sourced
evidence corpus of over 423,000 specially parsed evidence snippets. This dataset spans statistical,
comparative, interval, and temporal categories, offering a comprehensive resource for evaluating and
improving models on numerical fact-checking tasks.</p>
      <p>Our approach investigates improvements over existing methods based on two premises: 1) more
evidence and longer context will improve the prediction quality of the model, provided that the model
can actually handle the longer context with an increased input token length; 2) the way numbers are
tokenized plays a role in abstract reasoning for LLMs, so a switch in number tokenization from
left-to-right to right-to-left (R2L) could aid even a BERT-based model in the veracity estimation of
numerical claims [7]. To this end, our study postulates the following three research questions (RQs):
RQ 1 Does a longer context (3 vs. 9) of retrieved evidence snippets improve veracity prediction?
RQ 2 Does R2L tokenization improve performance?
RQ 3 Does combining long context and R2L tokenization outperform the other settings?</p>
      <p>Through ablation studies, we answer each of them by employing ModernBERT [8] - a BERT
architecture specifically suited to maintaining longer context - as well as applying R2L numerical
tokenization. In addition, the dataset exhibits an unbalanced distribution of the True/False/Conflicting
labels, with False claims notably being the most frequent1. Therefore, we also experiment
with focal loss to address the mild class imbalance [9]. Additionally, for fast prototyping we employ
LoRA, which adapts only a small portion of the parameters during fine-tuning [10].</p>
      <p>In our modeling pipeline, we conduct an in-depth analysis on QuanTemp’s English dataset
by building a hybrid BM25 + transformer evidence retrieval-reranking pipeline, subsequently plugged
into a ModernBERT natural language inference (NLI) core. Moreover, following [6], we also employ a
claim decomposition step [11], which aims to split the claim into separate parts and improve the retrieval
stage, in a manner akin to query expansion.</p>
      <p>To answer RQ1-RQ3, we run a systematic ablation that varies context length and applies R2L
numerical tokenization. Our experiments—spanning ModernBERT, MathRoBERTa, and GPT-4o-mini
[12] few-shot prompting—show that the longer-context, R2L-tokenized ModernBERT configuration
delivers the strongest performance.</p>
      <p>The paper is structured as follows: Section 2 discusses related work, Section 3 presents the exploratory
data analysis (EDA), Section 4 explains our modeling approach, Section 5 presents our results, Section 6
outlines further research avenues, and Section 7 concludes.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related work</title>
      <p>The major inspiration for our modeling pipeline (and indeed the core of the English dataset provided
by CLEF) is the work of QuanTemp, which itself builds on an earlier version that appears
in NumTemp [13]. Their contribution closes a particular gap in the field of fact-checking research:
the verification of numerical claims, which are typical in, e.g., political debates. While many prior
efforts have proposed automated claim verification systems and evidence retrieval architectures (e.g.,
[14], [15]), these largely target general textual claims and rarely emphasize the intricacies of quantitative
or temporal reasoning.</p>
      <p>At the first step of the pipeline, [6] adopts a typical two-step evidence retrieval-reranking approach [16].
First, it uses a sparse (BM25) retrieval system to retrieve a broader set of (100) evidence snippets, which are then
reranked by a neural reranking model that takes into account semantic relevance on
top of the keyword matching that BM25 delivers.</p>
      <p>For claim verification, they experiment with several transformer-based natural language inference
(NLI) models, including BERT variants, and also explore prompting strategies for large language models
in both few-shot and zero-shot configurations. This dual focus on retrieval and verification for
numerical claims makes their framework especially relevant for real-world applications, where numerical
misinformation can be particularly persuasive due to the well-documented "numeric-truth effect." Our
work builds on this foundation, extending it through targeted ablations, alternate tokenization schemes,
and additional model comparisons that further explore the boundaries of model performance in this
domain.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Exploratory Data Analysis</title>
      <p>The (English) dataset consists of 432,320 evidence snippets and 15,514 claims labeled in three categories:
18.79% True, 57.93% False, and 23.27% Conflicting. More details on the composition of the dataset and
the procedure used to retrieve the evidence are outlined in QuanTemp.</p>
      <p>It is of interest to consider the semantic overlap between the labels of Conflicting, True, and False
statements. If these labels form well-separated clusters in a lower-dimensional embedding space, this
could suggest that veracity prediction is more tractable—since semantically distinct groups are easier to
classify. Conversely, overlapping regions between these classes may highlight ambiguous or nuanced
cases that are inherently more difficult to resolve, especially for automated systems.</p>
      <p>To that end, we apply a Uniform Manifold Approximation and Projection (UMAP) [17] dimensionality
reduction to the ModernBERT-large embeddings of the claims in the English training dataset (in their
original, non-decomposed form). The resulting scatterplot visualization, shown in Figure 1, displays the
embeddings colored by their respective veracity labels: True, False, and Conflicting.</p>
      <p>The plot reveals a noticeable separation of False claims from the other two categories, while True and
Conflicting statements frequently overlap. This pattern can be attributed both to the inherent difficulty
of distinguishing these categories—particularly when evidence is ambiguous or contradictory—and to
the complex mapping strategy employed in the QuanTemp dataset, where nuanced labels (e.g., “mostly
true”) are abstracted into broader categories like Conflicting [13, Table 9]. Such overlap in semantic
space underscores the challenges in automated veracity assessment, especially for claims requiring
nuanced interpretation or context-dependent evidence.</p>
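<p>The projection step can be sketched as follows. This is a minimal illustration using plain PCA (via a numpy SVD) as a lightweight stand-in for UMAP, which is non-linear and preserves local neighborhood structure; the embedding matrix here is a synthetic placeholder, not the actual ModernBERT-large claim embeddings.</p>

```python
import numpy as np

def project_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional claim embeddings to 2D via PCA (SVD).

    A simple stand-in for the UMAP projection used in the paper: both map
    embeddings to a low-dimensional space so label separation can be
    inspected in a scatterplot, though UMAP does so non-linearly.
    """
    centered = embeddings - embeddings.mean(axis=0)
    # Right singular vectors give the principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Toy placeholder for claim embeddings (real ModernBERT-large dim: 1024).
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
coords = project_2d(emb)  # shape (100, 2), ready to color by veracity label
```

<p>For the actual figure, one would replace the synthetic matrix with the claim embeddings and pass the 2D coordinates to a scatterplot colored by label.</p>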
    </sec>
    <sec id="sec-4">
      <title>4. Modeling Approach</title>
      <p>Our modeling approach follows closely the typical claim verification pipeline of evidence retrieval &amp;
reranking, followed by a veracity classifier. Figure 2 depicts the process.</p>
      <sec id="sec-4-1">
        <title>4.1. Evidence Retrieval</title>
        <p>At the first step, as done in [6], we also perform claim decomposition using GPT-4o-mini [12],
which aims to split the original underlying claim into 3 separate smaller claims, for which we retrieve
evidence from the evidence corpus. To generate structured sub-questions from factual claims, we
used a prompt tailored for decomposition. The prompt was run with a low temperature (0.3) to reduce
randomness and increase overall coherence in the generated yes/no questions. A moderate frequency
penalty (0.6) and a higher presence penalty (0.8) encouraged semantic variety while still controlling for
drift from the original claim. The maximum number of tokens (300) proved to strike a good balance
between the model generating diverse but also concise sub-claims.</p>
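<p>The decomposition call can be summarized by its request parameters. The sketch below is hypothetical in its prompt wording and payload shape (an OpenAI-style chat-completion request is assumed); only the sampling parameters (temperature 0.3, frequency penalty 0.6, presence penalty 0.8, 300 max tokens) are taken from the text.</p>

```python
def build_decomposition_request(claim: str) -> dict:
    """Assemble a chat-completion payload for claim decomposition.

    Illustrative sketch: the system prompt is a paraphrase, not the exact
    prompt used; the sampling parameters match those reported in the paper.
    """
    system_prompt = (
        "Decompose the claim into at most 3 yes/no sub-questions "
        "that can each be checked against retrieved evidence."
    )
    return {
        "model": "gpt-4o-mini",
        "temperature": 0.3,        # low randomness, coherent sub-questions
        "frequency_penalty": 0.6,  # moderate: discourages repetition
        "presence_penalty": 0.8,   # higher: encourages semantic variety
        "max_tokens": 300,         # diverse but still concise sub-claims
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": claim},
        ],
    }

payload = build_decomposition_request("GDP grew by 7% in 2023.")
```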
        <p>Next, we followed a standard two-step approach for retrieving evidence for each decomposed
sub-claim [16]: a sparse BM25 retrieval, followed by a reranking step using a cross-encoder
(which encodes the query and the retrieved document jointly). This two-step retrieval procedure (BM25 +
reranker) showed good performance in [6] and served as a motivation for our approach.</p>
        <p>After some experimentation with transformer-based rerankers, we settled on
ms-marco-MiniLM-L-12-v2 - a cross-encoder reranker fine-tuned on the MS MARCO passage ranking dataset [18]. The experimentation
was easily facilitated via a centralized API package by [19], and while we did not experiment with
LLM-based reranking, this certainly remains part of our plans for the future2. For each decomposed
sub-claim, the initial BM25 retrieval was set to 50 documents, on which we then performed the
reranking step. The post-reranking top 1-3 evidence snippets of each sub-claim were then kept for the next steps,
depending on the particular specification of the ablation study, in line with RQ 1. The input for the
subsequent NLI task is then a string composed of: the original claim, (a set of) the
decomposed claims, and the respective retrieved &amp; reranked evidence snippets.</p>
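<p>To make the sparse first stage concrete, the following is a minimal, self-contained sketch of Okapi BM25 scoring over a toy tokenized corpus, with the common k1 and b defaults. It is illustrative only; the pipeline itself uses an off-the-shelf BM25 implementation over the real evidence corpus, retrieving 50 documents per sub-claim before cross-encoder reranking.</p>

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score tokenized documents against a tokenized query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    # Document frequency per query term, with the usual +0.5 smoothing.
    df = {t: sum(1 for d in docs if t in d) for t in set(query)}
    idf = {t: math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            num = tf[t] * (k1 + 1)
            den = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            s += idf[t] * num / den
        scores.append(s)
    return scores

docs = [["inflation", "rose", "7", "percent"],
        ["population", "of", "atlanta"],
        ["inflation", "fell"]]
scores = bm25_scores(["inflation", "rose"], docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top document
```

<p>The top-k documents by this score would then be passed to the cross-encoder reranker for semantic rescoring.</p>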
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Natural Language Inference</title>
        <p>Our main workhorse model for the veracity classifier was ModernBERT - an optimized model based
on the BERT architecture that natively supports a sequence length of 8,192, a considerable increase
over the usual BERT context window of 512. It does so by introducing changes to the embedding
procedure (employing rotary positional embeddings), which are more scalable to longer contexts, as
well as an alternation between global and local attention mechanisms to manage memory and compute
efficiency. The resulting support for (much) longer context (max input length) lends itself to exploring
the effect of shorter vs. longer context as stated in RQ 1 and was thus the main driver behind our
modeling choice. Among transformer-based classifiers, we also experimented with MathRoBERTa - a
RoBERTa model [20] fine-tuned on mathematical discussion posts, which was part of the CheckThat!
Task 3 baseline. Both encoders were fine-tuned with a learning rate of 2e-5 and cross-entropy loss.
Training was performed on two Quadro RTX 6000-24GB, two Tesla V100-32GB, or one H100-80GB
on the Phoenix cluster of Georgia Tech’s Partnership for an Advanced Computing Environment [21], or
locally on an Apple M3 Pro GPU-36GB with Metal Performance Shaders.</p>
        <p>Lastly, for the dev set, we also experimented with a few-shot prompted GPT-4o-mini [12] model as
a classifier on the original claims. To incorporate dataset priors into the model’s output, we apply a
logit bias based on the empirical distribution of labels (True, False, Conflicting) in the validation set.
Specifically, for each target label, we compute the log-odds of its prior probability and scale these values
by a factor that controls the strength of the prior. These scaled log-odds are then assigned as biases
to the corresponding output tokens, encouraging the model to produce predictions that are nudged
towards the observed label distribution in the original claims. The empirical results, however, showed
significantly worse performance than the BERT-based classifier, and we thus did not pursue that avenue
further.</p>
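<p>The prior-injection step can be sketched as follows. The label counts here are illustrative round numbers, the scaling factor alpha stands in for the prior-strength factor described above, and the mapping from labels to output token ids (required to attach the biases in an actual API call) is omitted.</p>

```python
import math

def prior_logit_biases(counts: dict[str, int], alpha: float) -> dict[str, float]:
    """Scaled log-odds of each label's empirical prior, used as logit biases.

    alpha controls the strength of the prior (alpha = 0 disables it).
    In practice each bias would be attached to the output token of its label.
    """
    total = sum(counts.values())
    biases = {}
    for label, c in counts.items():
        p = c / total
        biases[label] = alpha * math.log(p / (1 - p))  # scaled log-odds
    return biases

# Illustrative label distribution (False is the majority class).
biases = prior_logit_biases({"False": 58, "Conflicting": 23, "True": 19}, alpha=0.5)
```

<p>Majority classes receive a positive bias (prior probability above 0.5) and minority classes a negative one, nudging predictions toward the observed distribution.</p>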
        <p>One additional change that we applied for the transformer-based models was focal loss [9]. While
the imbalance in the dataset is not severe by any means, we investigated whether accounting for said
imbalance could add value to the performance. Focal loss addresses class imbalance by
down-weighting easy examples and focusing training on hard, misclassified ones. It does so by applying
a modulating factor to the standard cross-entropy loss, reducing the impact of well-classified majority-class
examples and thereby improving the model’s sensitivity to underrepresented classes.</p>
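<p>A minimal sketch of the focal loss for a single example, following [9]; in practice it is computed over batches of logits, but the scalar form makes the modulating factor visible. The gamma value below is the common default from the original paper, not necessarily the one we tuned.</p>

```python
import math

def focal_loss(probs: list[float], target: int, gamma: float = 2.0) -> float:
    """Multi-class focal loss for one example: -(1 - p_t)^gamma * log(p_t).

    With gamma = 0 this reduces to cross-entropy; larger gamma down-weights
    well-classified examples so training focuses on hard ones.
    """
    p_t = probs[target]  # predicted probability of the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

easy = focal_loss([0.05, 0.9, 0.05], target=1)  # confident, heavily down-weighted
hard = focal_loss([0.6, 0.3, 0.1], target=1)    # misclassified, emphasized
```

<p>The confident example contributes almost nothing to the loss, while the misclassified one dominates, which is exactly the reweighting effect described above.</p>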
        <p>As noted by [7], tokenization can play a critical role in numerical reasoning with large language
models, as the usual left-to-right segmentation of numbers can hinder numerical reasoning. It
remains unclear whether the same effect holds for higher-level tasks, such as NLI over numerical
claims. We therefore integrate their right-to-left (R2L) tokenization into ModernBERT and evaluate
whether this altered token order also improves transformer-based NLI performance on numerical
fact verification.</p>
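<p>The R2L idea can be illustrated on digit strings: digits are grouped into fixed-size chunks starting from the right, so chunk boundaries align with place value (thousands, units, and so on). The helper below is an illustrative sketch of the grouping rule, not the exact pre-tokenizer wiring used with ModernBERT.</p>

```python
def r2l_number_chunks(number: str, size: int = 3) -> list[str]:
    """Split a digit string into fixed-size chunks grouped from the right.

    "12345" becomes ["12", "345"] under R2L grouping, rather than the
    left-to-right ["123", "45"], so each full chunk matches a place-value
    group regardless of the number's length.
    """
    first = len(number) % size or size  # the leading chunk may be shorter
    chunks = [number[:first]]
    chunks += [number[i:i + size] for i in range(first, len(number), size)]
    return chunks

chunks = r2l_number_chunks("9876543")  # ["9", "876", "543"]
```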
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation</title>
        <p>We fine-tune and evaluate our models based on the macro-averaged F1 score (macro F1),</p>
        <p>Macro-F1 = (1/N) Σ_{c=1}^{N} F1_c, (1)</p>
        <p>F1_c = 2 × (P_c × R_c) / (P_c + R_c), (2)</p>
        <p>where F1_c is the class-wise F1 score, R_c is the recall of class c, and P_c is the precision of class c.</p>
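<p>Equations (1) and (2) can be implemented directly; the sketch below computes the unweighted mean of per-class F1 scores, so each class contributes equally regardless of its frequency, which is why macro F1 is more sensitive to the rare classes than accuracy is.</p>

```python
def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["False", "True", "Conflicting", "False"],
                 ["False", "True", "False", "False"])
```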
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>Following our research questions, Table 1 presents our results. First, we show results obtained with the
QuanTemp dataset and natural language inference with MathRoBERTa (Benchmark). The macro-avg. F1
drops from train (0.75) to validation (0.56), indicating weak generalization and overfitting after 3 epochs
of fine-tuning. Train accuracy (0.67) and validation accuracy (0.66) remain stable, as accuracy is less sensitive
to the class distribution. When we recreate the claim decomposition and the evidence retrieval but stick
with MathRoBERTa (Our-Data), we observe a slight performance drop for train (macro-avg. F1: 0.56)
and validation (macro-avg. F1: 0.52). This is especially driven by fewer correct classifications
of conflicting claims. Apparently, our recreation of the numerical claim verification pipeline provides
poorer evidence for the encoder to verify the claims during natural language inference.</p>
      <p>Next, we explore RQ 1 by comparing short (256 tokens) and long (1,024 tokens) context windows given our
dataset with up to 3 evidence snippets per question and ModernBERT, an encoder with a maximum context
window of 8,192 tokens. Here we contrast 1 evidence snippet per question and a 256-token context window
(Short-Context) with 3 evidence snippets per question and a 1,024-token context window (Long-Context). While training
performance between short-context (macro-avg. F1: 0.50) and long-context (macro-avg. F1: 0.64) varies
significantly, validation performance is similar. This indicates that a longer context window is not
helpful for numerical veracity prediction. Hence, our results agree with findings in [6]. Nevertheless,
our results could be limited by poorer evidence for the veracity prediction. Ultimately, providing the
inference system with three weak pieces of evidence instead of one could prove to be without an
effect on the final outcome.</p>
      <p>Moreover, for RQ 2 we investigate the effect of tokenization by switching the ModernBERT tokenizer to
right-to-left (R2L) instead of left-to-right. Results are based on our retrieved evidence dataset. R2L
Short-Context achieves a 0.38 macro-avg. F1 for train and a 0.45 macro-avg. F1 for validation. Similarly,
R2L Long-Context also performs poorly, and macro-avg. F1 for train (validation) drops to 0.42 (0.47). These
results contradict findings in [7]. This surprise in performance can be attributed to the vast difference
in use cases between arithmetic tests of language models and our numerical and temporal
veracity prediction in CheckThat! Task 3. Here, the numerical aptitude of the language model might
not be as critical as in tests of arithmetic. In addition, these findings support the conclusion that
the quality of the retrieved evidence is poor; high-quality evidence is vital for high-performance
veracity prediction.</p>
      <p>These results also show a negative outcome for RQ 3: the combination of long context and R2L
tokenization shows underwhelming performance. These results motivated us to fine-tune ModernBERT
on the QuanTemp dataset (Submission) for submission. Here, we do not employ R2L tokenization
and use just 1 evidence snippet per question. Nevertheless, we keep the context window at 1,024
tokens to empower the encoder to attend to all tokens during the veracity prediction. This results
in a validation performance similar to the organizer’s benchmark (macro-avg. F1: 0.57), with
our submission model performing better when classifying true claims (validation F1: 0.55) and
worse when classifying conflicting claims (validation F1: 0.36). We extend our research
by exploring parameter-efficient fine-tuning (PEFT) and focal loss. As expected, ModernBERT with
PEFT on the QuanTemp dataset performs worse, as validation macro-avg. F1 drops to 0.49.
Results with focal loss are not statistically different from the submission results with cross-entropy
loss. ModernBERT trained on the organizer dataset with focal loss achieves 0.65 macro-avg. F1 for train
and 0.57 macro-avg. F1 for validation. This demonstrates that focal loss is not helpful in the case of weak
class imbalance.</p>
      <p>Table 2 shows the time complexity of fine-tuning the different scenarios with an NVIDIA
H100-80GB. We report runtime in minutes, epochs until fine-tuning is finished (early stopping after 2
epochs), and finally the ratio of runtime to epochs (time efficiency). We see that the smaller
model, MathRoBERTa, typically finishes fine-tuning earlier (1.35 - 1.39 ratios) than the ModernBERT
approaches (2.12 - 7.83 ratios). Prolonged fine-tuning with ModernBERT is driven by the quadratic
cost of attention when using a large context window. Surprisingly, when only fine-tuning 1% of
the trainable parameters of the encoder via PEFT, the runtime per epoch does not drop. This shows
that our training pipeline has significant overhead, which eliminates the time complexity benefits of PEFT.</p>
      <p>Finally, Table 3 presents our test results, placing us 4th out of 10 participants. It documents a
further drop in performance from validation (macro-avg. F1: 0.63) to test (macro-avg. F1: 0.58). This is
driven by poorer precision and recall for the true claim class (validation F1: 0.36, test F1: 0.38).</p>
    </sec>
    <sec id="sec-6">
      <title>6. Future work</title>
      <p>Our findings point to several directions for future improvement. Replacing or augmenting BM25 with
hybrid retrieval methods that combine dense and sparse representations, and incorporating LLM-based
reranking, could enhance evidence quality, which remains a key bottleneck in our pipeline. Both the
reranking models and the ModernBERT classifier could benefit from additional fine-tuning on
math- and number-centric corpora (e.g., MathQA [22]) to strengthen numerical reasoning. Furthermore,
normalizing numbers, dates, and other numerical or temporal expressions in both claims and evidence
may help language models better capture semantic equivalence and distinctions in embedding space,
thereby improving inference precision. Lastly, leveraging ensembles of veracity classifiers could
improve robustness by integrating complementary strengths across model architectures and training
configurations.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This paper presents our modeling choices for numerical fact verification in Task 3 of the CheckThat!
2025 Lab. We evaluated the impact of longer input context windows, right-to-left (R2L) numerical
tokenization, and their combination on the veracity prediction of numerical and temporal claims. After
recreating the evidence retrieval pipeline with claim decomposition, we observed a drop in performance,
suggesting weaker evidence quality. In the absence of high-quality evidence, we show that
neither longer context nor R2L tokenization improves performance. Contrary to our expectations, this
suggests that extending input size or altering the tokenization strategy is less important than constructing a
high-quality evidence retrieval pipeline.</p>
      <p>Our strongest results were achieved by a ModernBERT-based pipeline using only one evidence
snippet per decomposed question, a 1,024-token context length, and classical left-to-right tokenization. This
configuration achieved competitive validation performance and placed us among the top-performing
systems in the shared task. However, the overall drop in performance from validation to test highlights
generalization challenges in this domain.</p>
      <p>Our results motivate future efforts in hybrid retrieval, numerical normalization, and ensemble
modeling to further advance the underexplored subfield of numerical claim verification.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgments</title>
      <p>We thank the DS@GT CLEF team for providing valuable comments and suggestions. This research
was supported in part through research cyberinfrastructure resources and services provided by the
Partnership for an Advanced Computing Environment (PACE) at the Georgia Institute of Technology,
Atlanta, Georgia, USA.</p>
    </sec>
    <sec id="sec-9">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI GPT-4o for grammar and spelling checking. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>[6] V. Venktesh, A. Anand, A. Anand, V. Setty, QuanTemp: A real-world open-domain benchmark for fact-checking numerical claims, arXiv preprint arXiv:2403.17169 (2024).
[7] G. Lee, G. Penedo, L. von Werra, T. Wolf, From digits to decisions: How tokenization impacts arithmetic in llms, https://huggingface.co/spaces/huggingface/number-tokenization-blog, 2024. Accessed: 2025-07-06.
[8] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, et al., Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference, arXiv preprint arXiv:2412.13663 (2024).
[9] T.-Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal loss for dense object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[10] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, ICLR (2022).
[11] D. Wadden, Z. Lin, L. Liu, M. Gardner, H. Hajishirzi, L. Zettlemoyer, Generating literal and implied subquestions to fact-check complex claims, in: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL), 2022.
[12] OpenAI, GPT-4 technical report, arXiv (2023).
[13] V. Venktesh, A. Anand, A. Anand, V. Setty, NumTemp: A real-world benchmark to verify claims with statistical and temporal expressions, CoRR (2024).
[14] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, et al., ClaimBuster: The first-ever end-to-end fact-checking system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948.
[15] J. Chen, G. Kim, A. Sriram, G. Durrett, E. Choi, Complex claim verification with evidence retrieved in the wild, arXiv preprint arXiv:2305.11859 (2023).
[16] K. A. Hambarde, H. Proenca, Information retrieval: recent advances and beyond, IEEE Access 11 (2023) 76581–76604.
[17] L. McInnes, J. Healy, J. Melville, UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426 (2018).
[18] P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, et al., MS MARCO: A human generated machine reading comprehension dataset, arXiv preprint arXiv:1611.09268 (2016).
[19] B. Clavié, rerankers: A lightweight python library to unify ranking methods, 2024. arXiv:2408.17344.
[20] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, RoBERTa: A robustly optimized BERT pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[21] PACE, Partnership for an Advanced Computing Environment (PACE), 2017. URL: http://www.pace.gatech.edu.
[22] A. Amini, S. Gabriel, S. Lin, R. Koncel-Kedziorski, Y. Choi, H. Hajishirzi, MathQA: Towards interpretable math word problem solving with operation-based formalisms, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 2357–2367.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Carrillo-de Albornoz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mothe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Piroi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025)</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Dietze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hafid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Korre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Schellhammer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Todorov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <article-title>The CLEF-2025 CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval</article-title>
          , in:
          <string-name>
            <given-names>C.</given-names>
            <surname>Hauff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macdonald</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jannach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kazai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Pinelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2025</year>
          , pp.
          <fpage>467</fpage>
          -
          <lpage>478</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum (CLEF 2025)</source>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bendou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Iturra-Bocaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuscakova</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims</article-title>
          , in: [3],
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>N.</given-names>
            <surname>Sagara</surname>
          </string-name>
          ,
          <article-title>Consumer Understanding and Use of Numeric Information in Product Claims</article-title>
          ,
          <source>Ph.D. dissertation</source>
          , University of Oregon, Eugene, OR,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>