<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NGU_Research at CheckThat 2025: An LLM-Based Hybrid Fact Checking Pipeline for Numerical Claims</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mohamed A. Abdallah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rokayah M. Fekry</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Samhaa R. El-Beltagy</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>NewGiza University, First 6th of October</institution>
          ,
          <addr-line>Giza Governorate 3294701, Cairo</addr-line>
          ,
          <country country="EG">Egypt</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this work, we present a four-stage, retrieval-augmented LLM pipeline for fact-checking numerical claims. The pipeline rewrites each numerical claim into a focused question, fuses OpenAI dense vectors with BM25 to fetch evidence, answers in context with DeepSeek-Chat, and issues language-aware verdicts via GPT-4.1-mini. The system ranked second in Arabic (macro F1 = 0.635) and fourth in Spanish (0.244) on the CLEF-2025 leaderboard, showing that balanced hybrid retrieval and encoding proper insights into prompts can deliver competitive accuracy on limited hardware.</p>
      </abstract>
      <kwd-group>
        <kwd>BM25</kwd>
        <kwd>Macro F1 score</kwd>
        <kwd>Hallucinations</kwd>
        <kwd>Hybrid Retrieval Models</kwd>
        <kwd>Instruction Tuning</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Prompt Engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        AI hallucinations, defined as plausible-sounding but incorrect output, have been observed in
generative AI tools built on large language models such as GPT, T5, and BART.
Kamel [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] showed how fear of hallucinations is distributed across various fields, specifically
when AI-generated output is used for decision making.
      </p>
      <p>
        Hallucinations are not just a linguistic issue: they extend to factual correctness, which is dangerous when
numerical data is involved. Numerical hallucinations involve the generation of incorrect or fabricated
numerical data (e.g., statistics, dates, values) in AI-generated outputs. Bera et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] addressed this
issue by introducing a framework for validating numerical values in AI-generated financial summaries.
Their system uses a T5-based model to predict masked numbers and compares them with ground-truth
values extracted from source reports.
      </p>
      <p>
        As the summarization of technical reports advances, the need for models that
ensure the factuality of the numerical data in these reports keeps growing. Bera et al.
[
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] provided an automated framework that focuses on validating the numerical data in generated
financial report summaries. Their approach predicts masked numerical values in
the generated summaries using a T5-based model. Taking only the most relevant sections of the
original report of interest, they cross-check the predicted and actual numerical data to verify
the numerical factuality of a given sentence.
      </p>
      <p>
        Venktesh et al. introduced QuanTemp to address the gap between the available models and the
numerical claims found in real life. It is a multi-domain dataset that covers temporal, statistical,
and other diverse aspects. The dataset is supported by detailed and precise metadata as well
as a solid evidence collection built to avoid data leakage. They achieved a model with a
macro-F1 score of 58.32 on this real-life data [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>The studies above highlight that numerical hallucination is a growing subfield requiring
domain-specific fact-checking approaches. Accordingly, recent work has emphasized the value of multi-stage or
hybrid fact-checking pipelines.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Exploration, Parsing, and Feature Extraction</title>
        <p>The corpus we work with mixes three languages and three data splits. After merging the official training
and validation partitions, we have 13632 training claims and 4048 validation claims: English dominates
(≈ 73 %), followed by Arabic and Spanish. The class balance is skewed: half the claims are labelled
"False", one-fifth "Conflicting", and the rest "True", with Arabic containing no Conflicting examples at
all. A separate test set (3656 English claims, 1806 Spanish claims, and 482 Arabic claims) is held back
for leaderboard scoring.</p>
        <p>The dataset is organized as JSON files per language (English, Spanish, and Arabic) containing each
claim’s text, taxonomy tag, original verdict, normalized quantities, source URLs, and oracle document
metadata for training and validation. For every claim, a pool of top-100 BM25-retrieved evidence is
pulled from a shared, language-specific corpus. Each piece of evidence provides a gold context to guide
the three-way (true/false/conflicting) classification task.</p>
        <p>Our analysis of the data and the annotation process shows that nearly every claim carries at least
one explicit quantity: in the merged data, numeric-bearing statements outnumber text-only statements
by roughly seven to one. Correlation analysis with Cramér’s V, shown in Fig. 1, highlights
which auxiliary signals help determine the final label.</p>
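        <p>For readers unfamiliar with the statistic, the following is a minimal sketch of how Cramér’s V between two categorical columns can be computed. It is not the authors’ analysis code; the function name and the toy inputs are assumptions made for illustration, and no bias correction is applied.</p>
        <preformat>
```python
import math
from collections import Counter

def cramers_v(xs, ys):
    """Cramér's V between two equal-length sequences of categorical values."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row = Counter(xs)              # marginal counts of the first variable
    col = Counter(ys)              # marginal counts of the second variable
    k = min(len(row), len(col))
    if k == 1:
        return 0.0                 # a constant variable carries no association
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))

# Toy example (hypothetical values): a feature that fully determines the label.
print(cramers_v(["stat", "date", "stat", "date"], ["True", "False", "True", "False"]))
```
        </preformat>
        <p>A value near 1 indicates that the auxiliary signal almost determines the label, while a value near 0 indicates little association.</p>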
        <p>English and Spanish fact-checking teams often talk in shades of truth, from the wild “pants-fire” label
all the way through “barely-true,” “half-true,” and “mostly-true.” To fit our three-way task, we simply
treat “barely-true” as False, “half-true” as Conflicting, and “mostly-true” as True, as shown in Fig. 2a and
Fig. 2b.</p>
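        <p>The collapsing of fine-grained ratings can be sketched as a small lookup table. Only the three mappings named in the text come from the paper; the entry for “pants-fire” and the helper name are assumptions added for illustration.</p>
        <preformat>
```python
# Fine-grained source ratings collapsed into the three task labels.
LABEL_MAP = {
    "pants-fire": "False",        # assumption, not stated in the text
    "barely-true": "False",
    "half-true": "Conflicting",
    "mostly-true": "True",
}

def to_three_way(raw_label):
    """Normalize spelling variants, then look up the three-way label."""
    key = raw_label.strip().lower().replace(" ", "-").replace("_", "-")
    if key not in LABEL_MAP:
        raise KeyError(f"no mapping defined for label {raw_label!r}")
    return LABEL_MAP[key]

print(to_three_way("Half True"))
```
        </preformat>
        <p>Raising an error on unmapped ratings, rather than guessing, keeps any unexpected label visible during data preparation.</p>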
        <p>Figure 1: Cramér’s V correlations. (a) Training set. (b) Validation set.</p>
        <p>In Arabic, the vocabulary looks different: terms that carry any hint of doubt are mapped straight
into "False," reserving "True" only for rock-solid claims (Fig. 3a and Fig. 3b). That sharper boundary gives
Arabic a very different feel, and our model needs to work within that editorial style. By baking these
language-specific tendencies into our prompts as default priors, we let the LLM start its reasoning with
the same instincts that human fact-checkers bring. Because every claim in a language taps the same
pool, the overlap between evidence sets is huge.</p>
        <p>Figure 3: Arabic label distribution. (a) Training set. (b) Validation set.</p>
        <p>In general, the data shows a serious bias towards the False label (Fig. 4a and Fig. 4b), suggesting
that the benchmark mainly tests the ability of a fact-checking pipeline to detect which claims are false.</p>
        <p>English training claims point to 2.17 million document mentions, yet only 383k IDs are unique, and
more than 320k of those are cited by multiple claims (nearly 84 % re-use). Spanish and Arabic exhibit
even tighter re-use: Spanish training data shows nearly 117k mentions versus 6.9k unique IDs (98% re-use),
and Arabic training data shows nearly 221k mentions versus 5k unique IDs (greater
than 99% re-use). In the validation and test sets, evidence is also shared across multiple claims. This
structure is more than a curiosity; it has two direct consequences for our work. First, the dense overlap
simplifies the retrieval task. Second, by encoding the language-specific label priors in our prompts,
we expected to boost system performance, as later results show.</p>
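        <p>A minimal sketch of how such re-use statistics can be counted is given below. The paper does not state the exact formula behind its percentages, so this sketch assumes one reading: the share of unique document IDs cited by more than one claim, which is consistent with the English figure (320k of 383k, roughly 84%). The function name and the input layout are hypothetical.</p>
        <preformat>
```python
from collections import Counter

def evidence_reuse(claim_to_doc_ids):
    """Count document mentions, unique IDs, and IDs cited by several claims."""
    mentions = 0
    claims_per_doc = Counter()
    for doc_ids in claim_to_doc_ids.values():
        mentions += len(doc_ids)
        claims_per_doc.update(set(doc_ids))   # count each claim once per document
    unique = len(claims_per_doc)
    shared = sum(1 for n in claims_per_doc.values() if n > 1)
    return {
        "mentions": mentions,
        "unique": unique,
        "shared": shared,
        "shared_ratio": shared / unique if unique else 0.0,
    }

# Toy pools (hypothetical IDs): document d2 is cited by all three claims.
pools = {"c1": ["d1", "d2"], "c2": ["d2", "d3"], "c3": ["d2"]}
print(evidence_reuse(pools))
```
        </preformat>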
        <p>Figure 4: Label distribution. (a) Training set. (b) Validation set.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Large Language Models (LLMs)</title>
        <sec id="sec-3-2-1">
          <title>3.2.1. Introduction to LLMs</title>
          <p>
            GPT-3, GPT-4, Gemini, and many other well-known models that have become part of daily life are
examples of large language models (LLMs). These are transformer-based models
trained on vast corpora of mainly textual data through self-supervised learning.
Their ability to encode intricate linguistic, semantic, and factual patterns has enabled
state-of-the-art performance on a variety of natural language understanding and generation tasks [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Training an LLM involves two main stages:
pretraining and fine-tuning. In the pretraining stage, the model learns general language patterns from
large text datasets such as Common Crawl and Wikipedia [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. In the second stage, to improve alignment
with human expectations, the model is fine-tuned on specific
datasets using supervised learning or reinforcement learning with human
feedback (RLHF) [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Prompt engineering as a paradigm</title>
          <p>
            Prompt engineering refers to formulating input texts (prompts) that steer the model to
generate the desired responses. It has emerged as a lightweight yet powerful
alternative to fully retraining models. There are multiple prompting strategies, such as zero-shot, few-shot,
and chain-of-thought prompting. All of these strategies provide the model with the task
description; however, the number of examples provided varies from one strategy to the
other. In zero-shot prompting, only the task description is provided, without
any examples, while in few-shot prompting a few examples are also provided [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. Chain-of-thought
prompting differs in that the model is given step-by-step reasoning for the task.
This method is generally used in tasks that involve arithmetic or factual reasoning [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ]. Adapting
general-purpose LLMs to domain-specific tasks, such as fact-checking, requires prompt engineering
because it points the LLMs towards pertinent information and promotes lines of
reasoning that strengthen factual consistency [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ].
          </p>
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Automating LLM for factuality checking</title>
          <p>
            With the improvement of LLMs in text generation tasks, it became crucial to develop models that are
also capable of evaluating the quality, relevance, and factual correctness of the content. The role of
LLMs in fact-checking tasks involves judging the truthfulness of a claim of interest with respect to
given evidence that serves as ground truth. This method expands on the idea that LLMs can internally
model world knowledge and reasoning pathways adequate for factual verification when trained on a
variety of fact-rich corpora [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ].
          </p>
          <p>
            Zheng et al. [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] and Bai et al. [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] have shown that LLMs were able to judge generated text successfully,
surpassing human performance in assessing correctness, helpfulness, or even
harmfulness. Such studies covered not only judging text generations but also
evaluating summaries or answers in QA systems, as Bubeck et al. have shown in their work [12].
Moreover, Jiang et al. [13] assessed the factuality of statements by comparing them with
structured or unstructured statements. FEVER, introduced by Thorne et al. [14], has
inspired other studies to follow the same approach: the model is prompted with a claim along
with a piece of evidence that supports or contradicts the claim. A verdict is then returned
with one of the following labels: “SUPPORTED”, “REFUTED”, or “NOT ENOUGH INFO”.
          </p>
        </sec>
        <sec id="sec-3-2-4">
          <title>3.2.4. Instruction Tuning of LLMs for Fact-Checking of Numerical Claims</title>
          <p>
            Instruction tuning helps LLMs generalize to new or unseen instructions without any additional training.
The process exposes the model to a set of instruction-response pairs,
which teaches it to follow natural language instructions. Instruction
tuning differs from general fine-tuning, which typically focuses on a specific task, in that it
teaches the model to follow task descriptions and expectations. Such a process improves the
versatility of a model across various downstream tasks [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ], [15]. T0 [16], FLAN-T5 [17], and InstructGPT
[
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] are notable instruction-tuned models that showed strong zero-shot
and few-shot performance in summarization, reasoning, and QA tasks.
          </p>
          <p>Instruction tuning plays a significant role in fact-checking, specifically for numerical
claims. It helps in several respects, such as teaching models what
constitutes a factual claim and how to reason over numerical and symbolic information.
When provided with evidence, structured reasoning, or multi-hop deduction, the models are able to
justify their judgments. Models can be trained using prompts such as:
"Given the claim and the evidence, decide whether the claim is supported, refuted, or Not
Enough Information. Explain your reasoning."</p>
          <p>Several datasets, such as the previously mentioned FEVER, AQuA, and other corpora that pair
numerical statements with evidence, have been adapted for this purpose. However, the
transparency and consistency of the model’s outputs should also be considered. Techniques
such as chain-of-thought prompting and rationale generation are used to address these aspects [18].</p>
          <p>In the context of numerical data, it is quite challenging to rely on LLMs for evaluating and checking
the factuality and correctness of a given numerical claim. When it comes to numerical and arithmetic
operations, LLMs can fail to generate precise numbers since they are prone to numerical hallucinations.
The key challenge is that context understanding alone is not enough: mathematical
verification of a claim is also needed to evaluate its factuality. In such cases, typical instruction tuning
can be insufficient if the model is not explicitly trained on numerical datasets or if external tools such as
calculator modules or symbolic reasoning chains are not integrated with it. Recent hybrid setups, such as
those presented by Chen et al. [19] and Jiang et al. [13], combine LLMs
with retrieval modules, symbolic calculators, or even rule-based verification pipelines. Their
results have shown considerable reliability in verifying quantitative claims.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. System Design</title>
        <p>Our fact-checking system (Fig. 5) moves each claim through four modular stages. It first calls a
multilingual LLM that creates a concise investigative question from the provided claim. The generated
question and the original claim then drive a hybrid retriever that blends dense semantic similarity from
OpenAI embeddings with exact-term BM25 scores, retrieving the top 15 passages most likely to mention
the claim. These passages are provided as context to a model that answers the generated
question with a short reference answer in the same language. Finally, a verdict classifier compares that
answer against the original claim.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Query Generation</title>
          <p>The first functional block rewrites each raw claim into a single-sentence question in the very same
language. We feed the claim to DeepSeek-Chat under a few-shot prompt that instructs the model to
hunt explicitly for all numbers, dates, quantities, and any stated causal links or consequences while
forbidding any judgment of trustworthiness. Because the training pool spans Arabic, Spanish, and
English, the prompt illustrates all three languages and enforces language preservation in the output. To
avoid redundant calls, the module keeps a pickled cache keyed by the original claim; once a question
has been generated, it can be reused throughout later experiments.</p>
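          <p>The caching behaviour described above can be sketched as follows. This is an illustrative sketch rather than the authors’ implementation: the class name, the file path, and the generate callable (standing in for the actual DeepSeek-Chat call) are all assumptions.</p>
          <preformat>
```python
import os
import pickle

class QuestionCache:
    """Pickle-backed cache mapping a claim to its generated question."""

    def __init__(self, path):
        self.path = path
        self.store = {}
        if os.path.exists(path):
            with open(path, "rb") as fh:
                self.store = pickle.load(fh)

    def get_or_create(self, claim, generate):
        # `generate` is any callable that turns a claim into a question,
        # for example a wrapper around the LLM call (hypothetical here).
        if claim not in self.store:
            self.store[claim] = generate(claim)
            with open(self.path, "wb") as fh:
                pickle.dump(self.store, fh)   # persist after every new entry
        return self.store[claim]
```
          </preformat>
          <p>Writing the file after each new entry trades some disk traffic for robustness: an interrupted run loses at most the question that was being generated.</p>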
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Hybrid Evidence Retrieval</title>
          <p>In the retrieval stage, we concatenate the DeepSeek-generated question directly with the original
claim, creating a single “stacked” query that anchors both the numeric surface forms and their possible
paraphrases. This composite query is submitted to a Qdrant index where every passage is stored
twice: once as a dense vector produced by OpenAI’s text-embedding-3-large model and again as a
sparse term-frequency map for BM25 scoring. Dense vectors help us match paraphrased or translated
numbers (“three-and-a-half million”, “3.5 M”), while the sparse map guards against LLM hallucinations
by insisting on literal term overlap. To rank candidates we apply a simple linear fusion: each passage
receives a min-max-normalized cosine score from the dense space and a normalized BM25 score; the
two are blended with an equal fusion weight of 0.5. Our ablations showed that this midpoint offers the
best balance between early precision and deep-rank recall compared with either signal alone, and a cut-off of fifteen
passages gives us a comfortable recall buffer within token limits. The top-15 passages, ordered by this
hybrid score, advance as the evidence context for the question-answering module.</p>
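          <p>The linear fusion can be illustrated with a short sketch. It assumes the dense and BM25 scores for a query are already available as dictionaries keyed by passage ID; the function names and toy scores are hypothetical, and the Qdrant and embedding calls are deliberately left out. Following the convention used in the ablation tables, a weight of 1 means similarity-only and a weight of 0 means BM25-only.</p>
          <preformat>
```python
def min_max(scores):
    """Rescale a dict of raw scores to [0, 1]; constant inputs map to 0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 0.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def hybrid_rank(dense, bm25, weight=0.5, top_k=15):
    """Blend normalized dense and BM25 scores and return the top_k passages."""
    d, b = min_max(dense), min_max(bm25)
    ids = set(d) | set(b)
    # A passage missing from one retriever contributes 0 for that signal.
    fused = {i: weight * d.get(i, 0.0) + (1 - weight) * b.get(i, 0.0) for i in ids}
    # Sort by descending fused score; break ties by passage ID for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

# Toy scores (hypothetical): cosine similarities and raw BM25 values.
dense = {"p1": 0.9, "p2": 0.5, "p3": 0.1}
bm25 = {"p1": 2.0, "p2": 12.0, "p3": 7.0}
print([pid for pid, _ in hybrid_rank(dense, bm25)])
```
          </preformat>
          <p>In this toy case the passage that is second-best on both signals ranks first after fusion, which is the kind of balancing effect an equal blend is meant to achieve.</p>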
        </sec>
        <sec id="sec-3-3-3">
          <title>3.3.3. Question Answering</title>
          <p>The fifteen ranked passages are concatenated (truncated if they exceed an 8k-character safety cap)
and supplied, along with the generated question, to DeepSeek-Chat under a “document-first” prompt,
i.e., the model must harvest its answer from the provided context and may consult prior knowledge
only when the evidence is genuinely silent. This constraint keeps explanations grounded and reduces
hallucination. The prompt also enforces brevity and language fidelity, so the reply is a compact,
fact-centered sentence in Arabic or Spanish, mirroring the question’s language. The result is a crisp
reference answer that distils the numeric gist of the evidence and is ready for verdict comparison.</p>
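          <p>The context assembly step can be sketched in a few lines. The exact cap and separator are not given in the text, so the 8000-character limit and the blank-line separator below are assumptions, as is the function name.</p>
          <preformat>
```python
def build_context(passages, max_chars=8000, separator="\n\n"):
    """Join ranked passages in order and cut the result at max_chars characters."""
    context = separator.join(p.strip() for p in passages if p.strip())
    return context[:max_chars]

# Two long toy passages: the second one is cut once the cap is reached.
context = build_context(["a" * 5000, "b" * 5000])
print(len(context))
```
          </preformat>
          <p>Because passages are joined in rank order, truncation discards the lowest-ranked evidence first.</p>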
        </sec>
        <sec id="sec-3-3-4">
          <title>3.3.4. Verdict Classification</title>
          <p>In the final step, we present the original claim and the reference answer as a tuple to a lightweight
GPT-4.1-mini judge. The prompt spells out a quantitative rubric that is guided by the insights we
extracted from the original labels and annotation tendencies. If the answer confirms at least 75% of
the claim’s numeric and causal content, the label is True; if it contradicts or diverges on most key facts,
the label is False; anything in between is Conflicting. Because Arabic training data never uses the
Conflicting tag, we deploy a parallel prompt for Arabic that collapses all ambiguous or mixed cases into
False, mirroring native editorial practice. The judge runs at zero temperature and is instructed to output
the single-word label only (no rationale, no extra tokens), so its decision can be consumed directly by the
evaluation stage. To stay within LLM rate limits and OpenAI embedding quotas, we batch embedding
calls and maintain a pickled question cache that eliminates redundant DeepSeek prompts, keeping
per-claim latency near one second in a Colab environment with an A100 GPU. With the design fully laid out,
we now turn to the empirical studies that shaped these choices.</p>
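          <p>Since the judge returns a bare label, its output can be consumed with a small normalization step. The sketch below is an assumption about how such a step might look, not the authors’ code: in the paper the Arabic collapse is handled by the prompt itself, whereas here it is shown as an additional guard, and the function name and the language code "ar" are hypothetical.</p>
          <preformat>
```python
VALID_LABELS = ("True", "False", "Conflicting")

def normalize_verdict(raw_output, language):
    """Map the judge's one-word reply to a task label, with an Arabic fallback."""
    word = raw_output.strip().strip(".\"'").capitalize()
    if word not in VALID_LABELS:
        raise ValueError(f"unexpected judge output: {raw_output!r}")
    # Arabic has no Conflicting class, so mixed cases collapse to False.
    if language == "ar" and word == "Conflicting":
        return "False"
    return word

print(normalize_verdict(" conflicting.\n", "ar"))
```
          </preformat>
          <p>Rejecting anything other than the three expected words makes a prompt regression visible immediately instead of silently skewing the label counts.</p>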
        </sec>
        <sec id="sec-3-3-5">
          <title>3.3.5. Effect of Hybrid Fusion</title>
          <p>Tables 3, 4, and 5 sweep the fusion weight from 0 (BM25-only) through 0.5 (equal blend) to 1
(similarity-only) while holding every other setting fixed to the OpenAI dense backbone and stacked queries. A
clear precision-recall trade-off emerges: BM25 dominates deep recall in Arabic (R@15 = 0.970) but
collapses on English early precision (MRR = 0.221). Pure similarity flips that pattern, giving English the
best head-of-rank scores (MRR = 0.369, R@15 = 0.713) yet losing recall in Arabic. The midpoint weight of 0.5
offers the most balanced profile: it maintains near-BM25 recall in Arabic (0.911) and near-similarity
precision in English, while even posting the highest Spanish R@15 (0.962). These results confirm that a
modest blend is the safest cross-lingual choice for downstream QA and verdict stages.</p>
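          <p>For reference, the two retrieval metrics quoted above can be computed as in the following sketch. The paper does not spell out its definitions, so this assumes that R@k is the fraction of claims with at least one gold passage in the top k, and that MRR uses the rank of the first gold passage; the function names and toy runs are hypothetical.</p>
          <preformat>
```python
def reciprocal_rank(ranked_ids, gold_ids):
    """1 / rank of the first gold passage, or 0.0 if none was retrieved."""
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in gold_ids:
            return 1.0 / rank
    return 0.0

def evaluate(runs, k=15):
    """runs: list of (ranked_ids, gold_ids) pairs; returns (MRR, R@k)."""
    if not runs:
        raise ValueError("no queries to evaluate")
    mrr = sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)
    hits = sum(1 for r, g in runs if any(pid in g for pid in r[:k]))
    return mrr, hits / len(runs)

# Toy runs (hypothetical IDs): gold at rank 2, gold missing, gold at rank 1.
runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
    (["m", "n"], {"m"}),
]
print(evaluate(runs, k=15))
```
          </preformat>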
        </sec>
        <sec id="sec-3-3-6">
          <title>3.3.6. Impact of k (number of retrieved passages)</title>
          <p>Reading only the hybrid rows in Tables 3, 4, and 5, we see how recall widens when the cutoff grows
from the first 5 passages to the full 15. Arabic improves from 0.832 (R@5) to 0.911 (R@15); English
climbs from 0.525 to 0.693; and Spanish posts the biggest gain, leaping from 0.750 to 0.962. Mean
Reciprocal Rank changes by less than 0.02 in any language, showing that extra passages rarely shift the
position of the first hit but do rescue edge cases that would otherwise be missed. We standardize on k =
15 for the remainder of the pipeline but cap the context at 8k characters, since the
overhead cost of longer contexts is still considerable.</p>
        </sec>
        <sec id="sec-3-3-7">
          <title>3.3.7. Stacking the Question with the Claim</title>
          <p>We stacked the generated question together with the original claim. The intuition is
that claims may contain inaccurate pieces of information; hence, stacking both gives the
retriever a more useful signal for finding the gold chunk. The question spells out
the “how many?” or “what percent?”, giving the encoder clearer clues, while BM25 still sees the original
text. We keep this approach as the default for our main pipeline. Having unpacked the system components and
the supporting experiments, we now turn to the results.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our final submission, run on the shared test servers, placed 2nd on the Arabic leaderboard with a
macro-average F1 of 0.635 (3 trials), driven by a True-class F1 of 0.542 and a False-class F1 of 0.729;
Conflicting is naturally 0.0 because that label does not exist in Arabic. On the Spanish leaderboard we
finished 4th; the single run we submitted scored a macro-average F1 of 0.244, with class-wise F1s of
0.169 (True), 0.142 (Conflicting), and 0.421 (False). The F1 of the False class is notably better in both
languages, largely due to the natural imbalance in the data and its tendency towards the False class (Fig. 4a and Fig.
4b): predicting the majority class more often tends to raise
the F1-score of that class rather than the F1-scores of the other
two classes. Although we skipped English entirely due to limited compute hours, the two languages
we tackled still placed competitively in their respective tables. Two important
insights emerge. First, hybrid retrieval with a fusion weight of about 0.5 proves robust: it offers the most balanced
profile across languages compared with the BM25-only and similarity-only approaches. Second, language-aware labelling insights matter for
instructing the LLM well. Together, these results confirm that careful prompt engineering and a balanced
fusion strategy can offset modest hardware budgets. This closes the empirical evaluation of our system.</p>
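      <p>The reported macro-average scores can be checked against the class-wise values with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the function name is hypothetical, and the two-class reading of the Arabic score is an inference from the arithmetic rather than something stated by the organizers.</p>
      <preformat>
```python
def macro_f1(per_class_f1):
    """Unweighted mean of the per-class F1 scores that are passed in."""
    if not per_class_f1:
        raise ValueError("at least one class is required")
    return sum(per_class_f1.values()) / len(per_class_f1)

spanish = {"True": 0.169, "Conflicting": 0.142, "False": 0.421}
arabic_two_class = {"True": 0.542, "False": 0.729}
arabic_three_class = {"True": 0.542, "Conflicting": 0.0, "False": 0.729}

print(round(macro_f1(spanish), 3))
print(round(macro_f1(arabic_two_class), 4))
print(round(macro_f1(arabic_three_class), 3))
```
      </preformat>
      <p>The Spanish average reproduces the reported 0.244. For Arabic, averaging over True and False only gives about 0.6355, which agrees with the reported 0.635 up to rounding, whereas including a zero for Conflicting would give about 0.424. This suggests the Arabic score was averaged over the two classes that actually occur.</p>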
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>This study addressed the CLEF-2025 task of fact-checking numerical claims across Arabic, Spanish,
and English. We framed the problem as a four-step, LLM-centric pipeline: a multilingual question
generator rewrites each claim, a hybrid retriever (OpenAI dense vectors + BM25, fusion weight 0.5) gathers
fifteen evidence passages, a document-first LLM extracts a concise answer, and a verdict judge assigns
True, False, or Conflicting with an Arabic-specific prompt that collapses ambiguity to False. Our
ablations showed that each module contributes measurably. Key design choices, including stacking the
generated question with the claim, normalizing fusion scores, and caching LLM calls, kept the system
both interpretable and resource-light while respecting language-specific editorial styles. The hybrid
retriever offered a more balanced profile across languages than the BM25-only or similarity-only baselines. On the hidden test
servers our submission ranked 2nd in Arabic (Macro F1 = 0.635) and 4th in Spanish (Macro F1 = 0.244)
despite omitting English to conserve GPU hours, evidence that prompt-level language adaptation and
balanced retrieval can deliver competitive accuracy under tight hardware and time budgets.</p>
    </sec>
    <sec id="sec-6">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used OpenAI-GPT-4o: Grammar and spelling check.
After using this tool, the authors reviewed and edited the content as needed and take full responsibility
for the publication’s content.</p>
      <p>[12] S. Bubeck, V. Chandrasekaran, R. Eldan, et al., Sparks of artificial general intelligence: Early
experiments with gpt-4, arXiv preprint arXiv:2303.12712 (2023).</p>
      <p>[13] H. Jiang, X. Lin, Y. Gao, et al., Can language models verify factual claims?, arXiv preprint
arXiv:2305.14271 (2023).</p>
      <p>[14] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, Fever: a large-scale dataset for fact
extraction and verification, in: Proceedings of the 2018 Conference of the North American Chapter
of the Association for Computational Linguistics, 2018, pp. 809–819.</p>
      <p>[15] Y. Wang, Y. Kordi, S. Mishra, et al., Self-instruct: Aligning language models with self-generated
instructions, arXiv preprint arXiv:2212.10560 (2022).</p>
      <p>[16] V. Sanh, A. Webson, C. Raffel, et al., Multitask prompted training enables zero-shot task
generalization, arXiv preprint arXiv:2110.08207 (2021).</p>
      <p>[17] H. W. Chung, L. Hou, S. Longpre, et al., Scaling instruction-finetuned language models, arXiv
preprint arXiv:2210.11416 (2022).</p>
      <p>[18] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, D. Zhou,
Chain-of-thought prompting elicits reasoning in large language models, 2022. URL: https://arxiv.org/abs/2201.11903.
doi:10.48550/arXiv.2201.11903. arXiv:2201.11903.</p>
      <p>[19] Y. Chen, Y. Zhao, B. Y. Lin, et al., Factuality enhanced language models for numerical reasoning,
arXiv preprint arXiv:2302.04279 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Anand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bendou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Iturra-Bocaz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Galuščáková</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 3 on fact-checking numerical claims</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          ,
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          <year>2025</year>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kamel</surname>
          </string-name>
          ,
          <article-title>Understanding the impact of ai hallucinations on the university community</article-title>
          ,
          <source>Cybrarians Journal</source>
          <volume>62</volume>
          (
          <year>2024</year>
          )
          <fpage>74</fpage>
          -
          <lpage>91</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Bera</surname>
          </string-name>
          , et al.,
          <article-title>Validating numerical information in ai-generated financial summaries</article-title>
          ,
          <source>in: Proceedings of the ACL Workshops</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>V.</given-names>
            <surname>Venktesh</surname>
          </string-name>
          , et al.,
          <article-title>Quantemp: Quantitative fact verification across temporal domains</article-title>
          ,
          <source>in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T. B.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Herbert-Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Ziegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Winter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hesse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Sigler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litwin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chess</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Berner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <article-title>Language Models are Few-Shot Learners</article-title>
          ,
          <year>2020</year>
          . URL: http://arxiv.org/abs/2005.14165. doi:10.48550/arXiv.2005.14165, arXiv:2005.14165 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Almeida</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Wainwright</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Slama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schulman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Kelton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Simens</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Welinder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Christiano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Leike</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lowe</surname>
          </string-name>
          ,
          <article-title>Training language models to follow instructions with human feedback</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2203.02155. doi:10.48550/arXiv.2203.02155, arXiv:2203.02155 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ichter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2201.11903. doi:10.48550/arXiv.2201.11903, arXiv:2201.11903 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Gong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <article-title>Allies: Prompting Large Language Model with Beam Search</article-title>
          ,
          <year>2023</year>
          . URL: http://arxiv.org/abs/2305.14766. doi:10.48550/arXiv.2305.14766, arXiv:2305.14766 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Glaese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>McAleese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aslanides</surname>
          </string-name>
          , et al.,
          <article-title>Improving alignment of language models via human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2204.05862</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lee</surname>
          </string-name>
          , et al.,
          <article-title>Judging LLM-as-a-judge: Evaluating LLMs as evaluators</article-title>
          ,
          <source>arXiv preprint arXiv:2305.14688</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ndousse</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>DasSarma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Drain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Fort</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ganguli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Henighan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Joseph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kernion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Conerly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>El-Showk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Elhage</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hatfield-Dodds</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hume</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Johnston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kravec</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lovitt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Nanda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olsson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
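          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,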
          <string-name>
            <given-names>S.</given-names>
            <surname>McCandlish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Olah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <article-title>Training a helpful and harmless assistant with reinforcement learning from human feedback</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2204.05862. arXiv:2204.05862.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>