<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Premtim Sahitaj</string-name>
          <email>sahitaj@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vera Schmitt</string-name>
          <email>vera.schmitt@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Junichi Yamagishi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jawan Kolanowski</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Möller</string-name>
          <email>sebastian.moeller@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Deutsches Forschungszentrum für Künstliche Intelligenz, Speech and Language Technology Lab, Berlin</institution>
          ,
          <addr-line>10559</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Harz University of Applied Sciences, Faculty of Automation and Computer Science</institution>
          ,
          <addr-line>Wernigerode, 38855</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National Institute of Informatics, Digital Content and Media Sciences Research Division</institution>
          ,
          <addr-line>Tokyo, 101-8430</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Technische Universität Berlin, Quality and Usability Lab, Berlin</institution>
          ,
          <addr-line>10587</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Fact-checking is necessary to address the increasing volume of misinformation. Traditional fact-checking relies on manual analysis to verify claims, but it is slow and resource-intensive. This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs) across multiple labeling schemes (binary, three-class, five-class) and extends traditional claim verification by incorporating analysis, verdict classification, and explanation in a structured setup to provide comprehensive justifications for real-world claims. We evaluate Llama-3 models of varying sizes (3B, 8B, 70B) on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches. We utilize TIGERScore as a reference-free evaluation metric to score the justifications. Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning. We find that smaller LLMs in a few-shot inference scenario provide comparable task performance to fine-tuned Small Language Models (SLMs) with large context sizes, while larger LLMs consistently surpass them. Evidence integration improves performance across all models, with larger LLMs benefiting most. Distinguishing between nuanced labels remains challenging, emphasizing the need for further exploration of labeling schemes and alignment with evidence. Our findings demonstrate the potential of retrieval-augmented AFC with LLMs.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Misinformation, whether spread inadvertently or with the intention to deceive, is a global challenge
that can be mitigated effectively through fact-checking efforts [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Generally, fact-checking is defined
as the assessment of the truthfulness of a check-worthy claim [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ]. For fact-checking to be effective,
fact-checking itself must be convincing and justified [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. A well-known source of human-verified
knowledge is PolitiFact, where experts manually identify check-worthy claims from news and social
media and document their verification efforts in written articles. Traditional fact-checking of these
claims relies on human-driven exploration, analysis, and conclusion. Consequently, this process is rather
slow and expensive, lagging behind the rapid spread of misinformation. Delayed fact-checking efforts
allow false narratives to take hold, distort reality, and influence public opinion, a vulnerability that is
often exploited by bad actors [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Additionally, moderation policies [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and pre-bunking methodologies
[
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] offer proactive strategies by addressing misinformation before it spreads widely.
AFC systems assist human efforts to combat misinformation by leveraging state-of-the-art techniques
from areas such as Natural Language Processing (NLP), Natural Language Generation (NLG), and
Information Retrieval (IR). Ideally, these systems automatically extract claims from the presented media,
retrieve relevant and credible references, and provide evidence-based verdicts on the aggregated results.
      </p>
      <p>ROMCIR 2025: The 5th Workshop on Reducing Online Misinformation through Credible Information Retrieval. CEUR, ceur-ws.org.</p>
      <p>
As opposed to style-based detection approaches that learn to distinguish claims based on writing
patterns, AFC systems follow a knowledge-based approach that relies on verification knowledge to
make judgements on claims [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. Expert fact checkers can utilize AFC systems as intelligent decision
support assistance to eliminate repetitive manual tasks, highlight inconsistencies, and present their
findings [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        Humans often distrust fact-checking work that challenges their beliefs, perceiving it as biased or
manipulated [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This skepticism is likely to be aggravated with closed systems, where the lack of
transparency around internal mechanisms and design decisions further erodes trust. Brandtzaeg and
Følstad [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] argue that to strengthen trust, fact-checking processes must be made transparent.
LLMs such as GPT-4, Claude 3.5 Sonnet, and Llama-3 have shown significant potential for a broad
range of text-to-text reasoning tasks. Integrating LLMs as an inference engine into AFC systems may
enhance transparency by generating veracity predictions and the accompanying natural language
explanations. However, Setty [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] demonstrates that, for AFC-related classification tasks, fine-tuned
small language models (SLM) outperform LLMs. This indicates that further research is needed to
effectively utilize LLMs for AFC.
      </p>
      <p>This paper investigates the task formulation and assessment for AFC of real-world claims with LLMs to
establish baselines in various settings and to evaluate whether truthfulness ratings can be effectively
modeled or if alternative approaches to claim annotations and task formulation are necessary. Based on
these findings, future approaches can make more informed design choices and improve the reliability and
effectiveness of AFC systems. We propose a framework for AFC with LLMs in a few-shot setup without
model fine-tuning for claim analysis, claim veracity prediction, and the generation of justifications as
natural language explanations. In the scope of this work, we assess the performance of our framework
on 17,856 real-world claims from PolitiFact based on three labeling schemes, with or without web
evidence, and across models of different sizes (i.e. 3B, 8B, 70B). Using reference-free evaluation metrics
and conducting extensive experiments, we provide insight into how evidence integration, model size,
and labeling complexity impact system performance. Additionally, we consider fine-tuning small
state-of-the-art classification models for estimating the upper bound of predictive performance extractable
from different components of the data points in the collected dataset and assess the performance of
few-shot LLMs relative to this limit. Thus, our findings contribute to the development of more
robust and transparent AFC systems using LLMs.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Our work builds on the existing body of research in fact-checking and retrieval-augmented generation
while addressing several gaps in the literature. Prior studies have established the value of
transformer-based architectures such as BERT and GPT for tasks ranging from sequence classification to text
generation [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] and have shown that integrating retrieval mechanisms via RAG can improve factual
grounding [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Automated fact-checking frameworks typically consist of claim detection, evidence
retrieval, and claim verification [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Claim detection identifies check-worthy claims [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], often guided by
factors such as relevance or harm, while evidence retrieval involves collecting and selecting relevant
information to justify verdicts [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ]. Claim verification can be broken down into two main tasks: (a)
verdict prediction and (b) justification production [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        Our approach unifies the components of claim verification into a single structured framework. However,
unlike previous works that often rely on fine-tuned models or separate stages for classification and
explanation [
        <xref ref-type="bibr" rid="ref16 ref17 ref18 ref19 ref20">16, 17, 18, 19, 20</xref>
        ], we propose a few-shot inference setup using LLMs that simultaneously
produces analysis, verdict classification, and justification generation in a structured format. Prior work
has highlighted that while LLMs can generate justifications, they are prone to hallucinations [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] and
may lead users to over-rely on potentially incorrect explanations [22]. Our integrated approach is
motivated by chain-of-thought reasoning techniques [23], which implement step-by-step analysis and
aim to facilitate consistency between the generated verdict and its justification.
      </p>
      <p>
        Additionally, we evaluate our approach on open-source LLMs of different scales, in contrast to similar
prior work that utilizes only closed-source models such as ChatGPT [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. Moreover, our experimental
analysis extends previous findings by exploring the effects of varying label granularity, from binary
to multi-class setups, and by systematically investigating the impact of evidence integration on both
classification performance and justification quality. While some works, for example, Augenstein et al.
[24], have studied diverse labeling schemes, our study directly compares performance across a hierarchy
of related schemes. This empirical insight addresses a notable gap in the literature regarding the
interplay between label complexity, model scale, and the integration of external evidence.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        Fact-checking organizations that document their efforts and share them publicly offer a great
opportunity to analyze relevant misinformation and model the verification process. Moreover, by providing the
initial judgment on what is check-worthy or not, fact-checking experts greatly reduce the complexity of
the task at hand [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. At PolitiFact, experts select check-worthy claims by determining whether they are
verifiable as opposed to opinions and personal experiences, potentially misleading, significant enough
to influence public discourse, likely to be repeated, or if a typical reader would reasonably question
their truthfulness. The content at PolitiFact is localized around topics that can be found in US news.
In this work, we utilize a dataset collected from PolitiFact’s online repository of fact-checking efforts.
PolitiFact is a frequently used source of misinformation data, as seen in LIAR [25] or Mocheg [26].
We collect 23,495 data points from English PolitiFact articles between 2007 and January 26, 2024.
Claims not attributed to public figures (i.e. social media posts) were excluded, as these were
predominantly evaluated as fake, resulting in a refined dataset of 17,856 claims. In the context of this research, we
are interested in collecting the claims that have been deemed check-worthy, the entity that shared said
claim, the context in which the claim has been produced, and finally the rating that has been assigned
to the claim. We also match and provide the background descriptions of the entity that produced the
claim. Figure 1 illustrates the available features.
      </p>
      <sec id="sec-3-1">
        <title>Source: New York Times Editorial Board</title>
        <p>Background: The editorial board is made up of 16 journalists ...</p>
        <p>Context: ... stated on June 14, 2017 in a New York Times editorial
Claim: ”A political map circulated by Sarah Palin’s PAC
incited Rep. Gabby Giffords’ shooting”</p>
        <p>Label: False</p>
        <p>PolitiFact’s rating system follows an ordinal six-class labeling scheme. Table 1 provides the official
descriptions of these six classes.</p>
      </sec>
      <sec id="sec-3-2">
        <title>TRUE ... is accurate and there’s nothing significant missing.</title>
        <p>MOSTLY TRUE ... is accurate but needs clarification or additional information.</p>
        <p>HALF TRUE ... is partially accurate but leaves out important details or takes things out of context.
MOSTLY FALSE ... contains an element of truth but ignores critical facts [...].</p>
        <p>FALSE ... is not accurate.</p>
        <p>PANTS ON FIRE ... is not accurate (thus false) and makes a ridiculous claim.</p>
        <p>While PolitiFact assigns a separate PANTS ON FIRE label to document the characteristic of ridiculousness
in claims, we are only interested in the dimension of truthfulness and therefore treat this special label
as a sub-case of False. Thus, we merge the classes, discard the sixth label, and reduce the overall set of
labels to five. Table 2 illustrates the resulting distribution of classes.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section outlines the methodology used to design and evaluate our framework for automated
fact-checking with LLMs. Following the description of the data collection from PolitiFact, we formulate
the problem and the experimental setup. Specifically, we discuss model selection, labeling scheme
choices, and evidence retrieval.</p>
      <sec id="sec-4-1">
        <title>4.1. Task Formulation</title>
        <p>The approach in this study is motivated by the need to enhance coherence, consistency, and
interpretability in automated fact-checking systems. By combining reasoning, classification, and explanation
as justification within a single framework, we aim to leverage intermediate analysis to improve
performance and ensure consistency between outputs. This study approaches automated fact-checking as a
multi-component task with three key objectives:
1. Reasoning: Producing a detailed, step-by-step analysis of the claim using the available
information.
2. Verdict: Assigning a veracity label to the claim based on a predefined set of categories.
3. Explanation: Providing a clear and concise explanation in natural language to support the
assigned verdict.</p>
        <p>The reasoning task follows the idea of chain-of-thought reasoning [23] by constructing a step-by-step
analysis of the available information as a natural language explanation [27]. Thus, the verdict
classification is integrated with both preceding analysis and subsequent explanation to enhance performance,
building on insights from existing research. Zhang et al. [28] demonstrate that jointly generating
explanations and predictions outperforms explain-then-predict models. Similarly, Atanasova et al.
[29] find that generating fact-checking explanations alongside veracity predictions improves both
the performance and the quality of the explanations. These tasks are addressed within a few-shot
classification framework, utilizing instruction-based prompts to guide LLMs in generating structured
outputs.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Prompt Design</title>
        <p>We design the prompts based on the previously outlined problem formulation and established principles
of prompt engineering [30]. Each prompt is composed of three main components: system, user, and
assistant. The system message sets the model’s context and provides the instructions, including the
selected labeling scheme. The user message specifies the speaker, context, and claim, with evidence
included when available. The assistant message contains the model’s response to the input. To simulate
a chat history with desired outputs, few-shot examples, one per label, are included as user and assistant
message pairs following the system message and preceding the actual input.</p>
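        <p>The message assembly described above can be sketched as follows (a minimal illustration in Python; the helper name build_messages and the example strings are our own, not the paper’s exact prompts):</p>

```python
# Sketch of the chat-history construction: system message, one user/assistant
# pair per label as few-shot examples, then the actual input.
def build_messages(system_msg, few_shot_pairs, speaker, context, claim, evidence=None):
    messages = [{"role": "system", "content": system_msg}]
    for user_text, assistant_text in few_shot_pairs:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    user = f"{speaker}{context} the claim {claim}."
    if evidence:  # evidence is appended only when available
        user += f" Evidence: {evidence}"
    messages.append({"role": "user", "content": user})
    return messages

msgs = build_messages(
    "You are an intelligent decision support system for automated fact-checking.",
    [("Example claim ...", '{"reasoning": "...", "verdict": "False", "explanation": "..."}')],
    "The New York Times editorial board",
    " stated on June 14, 2017 in a New York Times editorial",
    '"A political map ..."',
)
```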
        <p>SYSTEM: You are an intelligent decision support system for automated fact-checking.
Your tasks are:
1. Analyze the claim step-by-step.
2. Classify the claim’s veracity based on your analysis. [LABELS]
3. Provide a concise natural language explanation for the verdict prediction.</p>
        <p>USER: [SPEAKER][CONTEXT] the claim [CLAIM]. Evidence: [EVIDENCE]
To ensure consistency and enable automated processing, we enforce a structured output format using
the vLLM2 and outlines3 libraries. In this context, structure refers to the property of generated output
satisfying a constrained syntax [31]. The output is generated as a parsable JSON object with the
following properties: reasoning, verdict, and explanation. The reasoning is a free-text, step-by-step analysis
of the claim; the verdict is the predicted veracity label, constrained to one option of the predefined set
of labels; and the explanation is a concise natural language argument for the verdict prediction.</p>
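        <p>The structured output contract can be illustrated with a minimal sketch (we only validate a finished JSON object here; in the actual setup, outlines constrains decoding so that the output satisfies the schema by construction; the helper name parse_output is our own):</p>

```python
import json

# The model's output must be a parsable JSON object with exactly the
# properties reasoning, verdict, and explanation; the verdict is constrained
# to the active label set (five-class scheme shown here).
LABELS = {"True", "Mostly True", "Half True", "Mostly False", "False"}

def parse_output(raw: str) -> dict:
    obj = json.loads(raw)
    assert set(obj) == {"reasoning", "verdict", "explanation"}
    assert obj["verdict"] in LABELS
    return obj

out = parse_output(
    '{"reasoning": "Step-by-step analysis ...",'
    ' "verdict": "False",'
    ' "explanation": "The claim is not accurate ..."}'
)
```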
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Model Selection</title>
        <p>To evaluate performance across different model scales, we selected a range of LLMs from the Llama 3
series. We choose Llama architecture models due to their state-of-the-art performance and open-source
availability, making them well-suited for evaluating automated fact-checking systems. The models
used in this study are Llama-3.2-3B, Llama-3.1-8B, Llama-3.1-70B, and Llama-3.3-70B in their
instruction-finetuned state. The selection covers varying parameter sizes (3B, 8B, 70B) to investigate the relationship
between model scale and task performance. Our strategy is to evaluate the most recent model available
at each size. The 3.2 line was the first to introduce the 3B size, while the only 8B version is found
in the 3.1 line. For the 70B size, checkpoints are available in both the 3.1 and 3.3 lines. All models
have a December 2023 knowledge cutoff. During pre-training, the 3.2 models processed 9 trillion
tokens, whereas the 3.1 and 3.3 models processed 15 trillion tokens. The 3.3 70B Llama model achieves
comparable performance to the 3.1 405B model4, making it one of the most performant open source
models at this size. This justifies its inclusion as an additional option in model selection. All models are
used in their instruction-tuned state to ensure alignment with the task. Instead of further fine-tuning, we
rely on the models’ available capabilities to perform few-shot reasoning, classification and explanation.</p>
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Label Schemes</title>
        <p>Fact-checkers adopt varied approaches to labeling schemes, reflecting different priorities and
methodologies. Some, such as FullFact5, rely solely on justifications without assigning explicit ratings to claims.
Others, like PolitiFact and Snopes6, implement labeling systems grounded in the idea of truthfulness. A
further extension of these schemes includes labels for scenarios where evidence is incomplete or
unavailable. In the AFC community, truthfulness labels are frequently mapped to a conceptual dimension
that evaluates factuality based on available ground-truth evidence. Labels such as supported, refuted,
cherry-picked, or not enough information (NEI) are commonly used [32, 33, 26], requiring significant
human effort for exploration and annotation. While these approaches provide valuable insights, they
also introduce complexities related to interpretation and consistency in annotations. We postpone this
perspective to future work. Our focus in this study is to assess whether fact-checking can be effectively
modeled across different granularities of truthfulness on the collected data. Specifically, we aim to
evaluate the trade-offs between simpler and more nuanced labeling schemes in terms of their impact
on classification performance and justification quality. To evaluate the impact of label granularity
on fact-checking performance, we merge the five original PolitiFact labels (True, Mostly True, Half
True, Mostly False, False) into coarser schemes, progressively reducing complexity while preserving
interpretability. In the three-class scheme, the original labels true and false are grouped into mostly
true and mostly false, respectively. In the binary scheme, the label half-true is merged into mostly true.
We aim to align our label aggregation with PolitiFact’s definitions, as introduced in Table 1. Table 2
illustrates the resulting distributions.
2https://github.com/vllm-project/vllm
3https://github.com/dottxt-ai/outlines
4https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct#benchmarks
5https://fullfact.org/
6https://snopes.com/
[Table 2: class distribution: 14.18%, 18.75%, 19.79%, 17.99%, 29.30%]
The PolitiFact label definitions, as specified in Section 3, are consistent across schemes. By evaluating
these schemes, we aim to understand how different levels of granularity influence the model’s ability to
classify claims and provide useful explanations.</p>
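        <p>The label aggregation described above can be written as a simple mapping (a sketch; the dictionary names are our own encoding of the merges defined in the text):</p>

```python
# Five-class PolitiFact labels (PANTS ON FIRE already merged into False, see Section 3).
FIVE = ["True", "Mostly True", "Half True", "Mostly False", "False"]

# Three-class scheme: True and False are folded into their "mostly" neighbours.
THREE_CLASS = {
    "True": "Mostly True", "Mostly True": "Mostly True",
    "Half True": "Half True",
    "Mostly False": "Mostly False", "False": "Mostly False",
}

# Binary scheme: additionally merge Half True into Mostly True.
BINARY = {
    label: "Mostly True" if THREE_CLASS[label] in ("Mostly True", "Half True")
    else "Mostly False"
    for label in FIVE
}
```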
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Evidence Retrieval</title>
        <p>Although PolitiFact’s fact-checking articles provide human-collected evidence that informs the
justification and final verdict, extracting and decontextualizing this evidence is not trivial and requires
additional specialized modeling and annotation. Consequently, in this study we focus on web-based
fact-checking to gather relevant information. We collect the evidence by querying a web search API7
for each claim and retrieve the top 10 search results. We do not apply any query optimization or
re-ranking of results. We restrict the search to exclude a list of well-known US fact-checking sites
as well as snippets that mention keywords such as ”PolitiFact”, ”fact-check”, or ”debunk” to exclude
fact-checking articles or direct references. This way, we aim to reduce information leaking in from
pages reporting the actual verification results rather than evidence. Due to these constraints, we were
not able to retrieve evidence for 667 claims. Table 3 lists three search results for the claim presented in
Figure 1, where we shortened the snippet text and removed the titles and source URLs.</p>
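        <p>The snippet filtering described above can be sketched as follows (the domain blocklist and the helper name keep_result are illustrative assumptions; the paper does not publish its exact exclusion list):</p>

```python
# Filter web search results: drop known fact-checking domains (assumed
# examples below) and snippets mentioning fact-checking keywords.
BLOCKED_DOMAINS = {"politifact.com", "snopes.com", "factcheck.org"}
BLOCKED_KEYWORDS = ("politifact", "fact-check", "debunk")

def keep_result(url: str, snippet: str) -> bool:
    host = url.split("/")[2].lower().removeprefix("www.")
    if host in BLOCKED_DOMAINS:
        return False
    return not any(kw in snippet.lower() for kw in BLOCKED_KEYWORDS)

results = [
    ("https://www.nytimes.com/2017/...", "The editorial linked a map to the shooting ..."),
    ("https://www.politifact.com/...", "Our fact-check found no established link ..."),
]
evidence = [r for r in results if keep_result(*r)]  # keeps only the first result
```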
      </sec>
      <sec id="sec-4-6">
        <title>4.6. Experimental Setup</title>
        <p>To assess the performance of our automated fact-checking approach, we utilize a combination of
classification and generation evaluation metrics. These metrics evaluate both the performance of
verdict classification and the quality of generated outputs, ensuring a comprehensive analysis of system
performance. We report accuracy and F1-scores under different aggregation strategies to observe different
aspects of the classification results. To evaluate the quality of generated outputs, we use TIGERScore,
a reference-free metric that has been fine-tuned to assess generated text quality based on a set of
criteria and assign penalties to mistakes [34]. Specifically, comprehension, accuracy, informativeness,
and coherence are evaluated. TIGERScore provides an error evaluation of the generated outputs and
assigns penalty scores between [−5, −0.5] for each error without relying on ground truth references.
The penalty scores are added up and reported for each case. Thus, a score close to 0 shows higher
quality output. In this study, we utilize the 13B TIGERScore model with default hyperparameters to
evaluate generated outputs. The evaluation prompt design follows our task prompt as described in
Section 4.2.</p>
        <p>Due to the stochastic nature of LLMs, evaluation is often not trivial. Thus, we run each fact-checking
task three times and report the majority vote for the classification performance evaluation. Additionally,
as TIGERScore is a generative evaluation metric, we also run it three times and report the average
metric for the justification quality assessment.</p>
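        <p>The repeated-run aggregation can be sketched as follows (a minimal illustration; the example verdicts and scores are invented for demonstration):</p>

```python
from collections import Counter
from statistics import mean

# Each fact-checking task is run three times; the reported verdict is the
# majority vote, and TIGERScore (also run three times) is averaged.
def majority_vote(verdicts):
    return Counter(verdicts).most_common(1)[0][0]

verdict_runs = ["False", "Mostly False", "False"]
final_verdict = majority_vote(verdict_runs)

# TIGERScore assigns per-error penalties in [-5, -0.5], summed per case;
# the values below are invented for illustration.
tiger_runs = [-1.5, -2.0, -1.0]
avg_tigerscore = mean(tiger_runs)
```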
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation</title>
      <p>The evaluation section presents a detailed analysis of our automated fact-checking approach. We assess
the task performance based on model size, labeling scheme, and the impact of evidence retrieval on
both classification performance and the quality of generated outputs. This evaluation is structured
around our predefined hypotheses and utilizes the previously introduced range of metrics to ensure
a robust assessment. Additionally, statistical analyses are conducted to determine the significance of
observed performance differences.</p>
      <sec id="sec-5-1">
        <title>5.1. Hypotheses</title>
        <p>Our evaluation focuses on several fundamental questions regarding the introduced problem setting.
We examine whether models can reliably distinguish between the original truthfulness labels, or if
alternative approaches to claim annotation and the fact-checking task formulation are required. We
also consider potential limitations on the granularity of truthfulness labels that models can effectively
handle. Additionally, we assess the role of parametric knowledge in task performance, specifically
whether model size yields the expected effect of better performance. Finally, we investigate the impact
of evidence integration on task performance. Based on these research questions, our evaluation is
structured around the following hypotheses:
Hypothesis 1: Classification task performance decreases as label complexity increases.
Hypothesis 2: Justification quality decreases as label complexity increases.</p>
        <p>Hypothesis 3: Retrieving and incorporating evidence improves both classification accuracy and the
quality of generated justifications.</p>
        <p>Hypothesis 4: Larger models perform better in the classification task and produce higher quality
justifications.</p>
        <p>Hypothesis 5: Smaller models benefit more significantly from evidence integration than larger
models due to less parametric knowledge being available.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Example Output</title>
        <p>Previously, we introduced a claim involving the New York Times editorial and Sarah Palin in Table 1
and showcased examples of retrieved web evidence in Table 3. In Figure 2, we now present an actual
output generated by the Llama3.3-70B model under the evidence-augmented setting with the five-class
labeling scheme.</p>
        <p>The output in Figure 2 demonstrates good justification quality. The verdict is correctly classified as False,
aligning with the evidence and reasoning provided. The reasoning section effectively incorporates the
retrieved evidence, presenting a detailed analysis of the claim and referencing the correction issued by
the New York Times. It also mentions the court’s ruling in favor of the publication, which is not directly
relevant to the claim verification. The explanation is concise and supports the verdict, accurately
summarizing the key points without introducing ambiguity. This example highlights the potential of
retrieval-augmented generation to improve classification accuracy and justification quality.
The results presented in Tables 4, 5, and 6 illustrate the classification performance across
different labeling schemes and model sizes, with and without evidence retrieval. For the five-class setup
(Table 4), evidence retrieval consistently enhances model performance, as seen in higher F1 scores and
TIGERScore improvements. However, the 3B model struggles to outperform the baseline significantly,
indicating limited capacity in handling a complex task such as automated fact-checking.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Results</title>
        <p>Content
The New York Times editorial board claimed that a political map circulated by
Sarah Palin’s PAC incited Rep. Gabby Gifords’ shooting. However, there is no
established link between the two events, and the Times subsequently issued
a correction. The case was later brought to court, and the New York Times
was sued for defamation. The court ultimately ruled in favor of the New York
Times, stating that the editorial did not defame Palin.</p>
        <p>Verdict: False</p>
        <p>Explanation: The statement is not accurate. There is no established link between the political
map circulated by Sarah Palin’s PAC and the shooting of Rep. Gabby Giffords,
and the New York Times issued a correction after publishing the claim.</p>
<p>In the three-class classification scheme (Table 5), evidence retrieval again provides a notable performance
boost across all models, with improvements becoming more pronounced in larger models. This indicates
that as label complexity decreases, models are better able to leverage evidence to enhance classification
accuracy and justifications. The Llama-3.3-70B-Instruct model achieves the highest scores, emphasizing
the advantage of scale when combined with external knowledge.</p>
<p>For binary classification (Table 6), the reduced complexity of the task yields the highest overall
performance across all models. Evidence retrieval continues to provide a measurable benefit, particularly
in the largest models, where the highest F1 scores and TIGERScore improvements are observed.
In the following, we examine these observations with statistical analyses to draw conclusions about
the hypotheses specified in Section 5.1.</p>
<p>We conducted a Friedman test on F1 across the three classification schemes, with and without
evidence. The result indicates that at least one of the schemes differs significantly (p &lt; 0.05) in terms
of classification performance. These findings support hypothesis H1, that classification performance
tends to decrease as labeling becomes more complex, potentially because more nuanced distinctions
between labels increase the number of prediction errors.
For TIGERScore, the Friedman test was significant (p &lt; 0.05) for the setting with evidence,
but a subsequent Conover’s test revealed no significant pairwise differences. Additionally, the Friedman
test shows no significance for the setting without evidence. This suggests that there is no
measurable difference in justification quality across the three schemes, with or without evidence. These
findings reject hypothesis H2, that more complex label sets negatively affect overall justification quality.
This may be in part because claim analysis and explanation are difficult enough, regardless of whether
the label scheme is more or less complex.</p>
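The scheme comparison above can be sketched with SciPy's Friedman test on matched scores; the numbers below are synthetic placeholders, not our results:

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# Synthetic matched F1 scores for the same runs evaluated under the
# three labeling schemes (illustrative values only).
f1_five   = rng.normal(0.45, 0.02, size=10)
f1_three  = rng.normal(0.55, 0.02, size=10)
f1_binary = rng.normal(0.70, 0.02, size=10)

# Friedman is a non-parametric repeated-measures test over per-run ranks.
stat, p = friedmanchisquare(f1_five, f1_three, f1_binary)
print(f"chi2={stat:.2f}, p={p:.2g}")
# A post-hoc pairwise comparison (e.g. Conover's test, available in the
# scikit-posthocs package) then identifies which schemes differ.
```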
<p>To determine the statistical significance of including evidence, we conducted paired
t-tests comparing models with and without evidence across all classification schemes, for both F1
and TIGERScore. The results indicate a statistically significant difference (p &lt; 0.01) for both metrics
when evidence retrieval is included during the fact-checking task. This supports hypothesis H3: external
evidence helps the model disambiguate classes and produce more useful justifications for
the fact-checking task.</p>
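The paired comparison can be sketched as follows; the scores are synthetic placeholders under the assumption of matched runs with and without evidence:

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)
# Synthetic matched scores for the same model/scheme, without and with
# retrieved evidence (illustrative values only).
f1_without = rng.normal(0.50, 0.02, size=12)
f1_with    = f1_without + rng.normal(0.05, 0.01, size=12)

# A paired t-test operates on the per-run differences, which controls
# for run-to-run variation shared by both conditions.
t, p = ttest_rel(f1_with, f1_without)
print(f"t={t:.2f}, p={p:.2g}")
```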
<p>To evaluate whether larger models outperform their smaller counterparts, we performed a Friedman
test on both F1 and TIGERScore, with and without evidence, across four different model sizes. The
results indicate a significant difference (p &lt; 0.05), confirming that model size has a measurable impact
on performance. These findings support hypothesis H4.</p>
<p>Finally, to investigate whether smaller models benefit more from evidence integration than larger models,
we examined the performance gains obtained by subtracting the no-evidence scores from the with-evidence
scores for both F1 and TIGERScore across all model sizes. For the F1 gains, the Friedman test
showed no significant difference (p = 0.167), whereas the TIGERScore gains were statistically significant
(p &lt; 0.05). Thus, we partially reject hypothesis H5. This implies that larger models benefit even more
from external evidence, presumably due to their ability to reason effectively across long contexts,
whereas smaller models exhibit relatively limited improvements. We expect that integrating more
credible and complete information sources could enhance overall performance even further for both
smaller and larger models.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Ablation Study</title>
<p>Since we consider fine-tuning impractical for real-world automated fact-checking, because the
dynamic and fast-changing nature of misinformation limits the usefulness of models trained on static
datasets, our primary focus in this study has been on few-shot inference with large language models.
Earlier encoder-based architectures, such as BERT, were constrained by a maximum sequence length of
512 tokens, which restricted their ability to incorporate additional context. Recent advancements, such
as ModernBERT [35], adjust the original BERT architecture to support sequence lengths of up to 8192
tokens. This allows more contextual information and retrieved evidence to be integrated directly into
the classification process, enabling an evaluation of their utility for veracity prediction.</p>
<p>To complement our few-shot evaluation and to better understand how different input signals contribute
to classification outcomes, we conduct an ablation study using the ModernBERT-large architecture.
Specifically, we fine-tune the model across a series of input configurations to assess how predictive
performance changes when incrementally adding contextual information. We begin with
the claim alone as input. We then add information about the surrounding context in which the claim
appeared, such as a speech, interview, or social media post. Next, we incorporate the speaker who
issued the claim. Finally, we include retrieved web evidence that provides external factual grounding.
This study helps quantify the individual impact of each component and provides an empirical upper
performance bound for fine-tuning on the dataset, enabling a more informed comparison with
few-shot LLM performance.
The results in Table 7 show that incorporating evidence consistently produces the most significant gains
across all label granularities. In the five-class setting, starting with only the claim results in the lowest
performance. Adding context leads to modest improvements, suggesting that surrounding details help
disambiguate some claims. For example, knowing whether a statement was made during a campaign
rally or in an official policy document can influence its interpretation. Speaker information further
improves performance, which may be attributed to prior knowledge about the speaker’s reliability,
role, or political alignment that implicitly guides veracity estimation. In the binary setting, adding
context does not improve performance and even reduces it slightly. This outcome likely stems from the
way binary labels are constructed by merging more nuanced classes. As a result, different claims with
dissimilar contexts may be grouped under the same binary label, making context a noisy feature. In
contrast, speaker information helps more consistently. This may reflect the fact that in coarse-grained
tasks, speaker identity acts as a high-level signal about the probable factuality of a claim.
The classification results align closely with the previously presented LLM few-shot inference results,
showing that evidence consistently provides significant performance improvements across all label
schemes. For both classifiers and LLMs, the inclusion of evidence enables better disambiguation and
enhances predictive performance, particularly in more complex multi-class tasks. While smaller LLMs
provide comparable performance across tasks in few-shot inference, larger LLMs consistently surpass
the fine-tuned SLMs without requiring fine-tuning and provide the additional advantage of generating
reasoning and detailed justifications across more extensive contexts. This highlights the general utility
of LLMs for AFC.</p>
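The four incremental input configurations of the ablation can be sketched as a simple assembly helper; the field names and formatting below are illustrative, not the exact representation used for fine-tuning:

```python
def build_input(claim, context=None, speaker=None, evidence=None):
    """Assemble classifier input by incrementally adding signals:
    claim -> +context -> +speaker -> +evidence."""
    parts = [f"Claim: {claim}"]
    if context:
        parts.append(f"Context: {context}")
    if speaker:
        parts.append(f"Speaker: {speaker}")
    if evidence:
        parts.append("Evidence: " + " ".join(evidence))
    return "\n".join(parts)

# Richest configuration, corresponding to the final ablation step.
text = build_input(
    "The map incited the shooting.",
    context="newspaper editorial",
    speaker="NYT editorial board",
    evidence=["No established link was found.", "A correction was issued."],
)
print(text)
```

The resulting string would then be tokenized (up to ModernBERT's 8192-token limit) and fed to the sequence classifier.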
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion and Conclusion</title>
<p>This study investigated AFC of real-world claims using LLMs in a few-shot inference scenario. By
evaluating task performance across three labeling schemes and multiple LLM sizes of the same
architecture, we demonstrated the importance of evidence integration, model scale, and labeling complexity
in determining system effectiveness. Evidence retrieval consistently improved classification accuracy
and justification quality, with larger models showing the most significant gains. In contrast, smaller
models struggled to perform or to benefit as much from evidence integration, highlighting the need
for further optimization in computationally constrained environments. While more coarse-grained
labels naturally yield higher performance, future work should explore how to integrate alternative
labeling strategies and a more nuanced assessment of claims across different perspectives to develop
a robust AFC approach. Our experiments show that LLMs can effectively perform multi-component
tasks by reasoning over presented data and generating detailed justifications. However, our results also
indicate that alternative approaches leveraging fine-tuning can be advantageous for specific subtasks or
in resource-constrained settings. For instance, while LLMs excel in knowledge-based reasoning and
explanation generation in few-shot scenarios, models like ModernBERT can be sufficiently effective for
classification tasks when supervised training data is available. This suggests the potential for hybrid
frameworks in which supervised fine-tuning is employed for tasks that rarely change, such as document-type
or natural language inference classification, while LLMs are reserved for dynamic scenarios that
require the integration of up-to-date facts to produce grounded reports. Moreover, integrating more credible
evidence, including human-aggregated sources, could further enhance AFC performance by providing
more reliable context for claim evaluation. Furthermore, although our study focused on LLMs from the
Llama family, future work could benefit from expanding the comparison to include different model
families. A broader analysis of diverse model families would be valuable, especially for applications
where training and inference costs, use cases, and interpretability requirements differ substantially.</p>
      <p>To extend AFC toward intelligent decision assistance for expert fact-checkers, future research should
focus on structuring justifications to align more closely with human verification strategies. This
includes presenting concise, faithful explanations that detail key reasoning steps and clearly highlight
the integrated evidence. Preliminary observations indicate that LLM-based systems can suffer from
hallucinations, underscoring the need for extensive evaluation and user studies to understand how
experts interpret and trust the generated explanations. Such studies would not only help refine the
presentation of justifications but also identify gaps in current AFC systems and better define their role
in supporting, rather than replacing, human fact-checking efforts.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
<p>This study is partially funded by the German Federal Ministry of Education and Research (BMBF,
reference: 03RU2U151C) in the scope of the research project news-polygraph, and by JST AIP Acceleration
Research (JPMJCR24U3) and JST CREST Grants (JPMJCR20D3).</p>
      <p>Linguistics, Online, 2020, pp. 1906–1919. doi:10.18653/v1/2020.acl-main.173.
[22] C. Si, N. Goyal, S. T. Wu, C. Zhao, S. Feng, H. Daumé III, J. Boyd-Graber, Large Language Models Help Humans Verify Truthfulness – Except When They Are Convincingly Wrong, 2024.
[23] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, D. Zhou, Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, 2023.
[24] I. Augenstein, C. Lioma, D. Wang, L. Chaves Lima, C. Hansen, C. Hansen, J. G. Simonsen, MultiFC: A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, in: Proceedings of 2019 EMNLP-IJCNLP, Association for Computational Linguistics, 2019, pp. 4685–4697.
[25] W. Y. Wang, “Liar, Liar Pants on Fire”: A New Benchmark Dataset for Fake News Detection, in: R. Barzilay, M.-Y. Kan (Eds.), Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Vancouver, Canada, 2017, pp. 422–426. doi:10.18653/v1/P17-2067.
[26] B. M. Yao, A. Shah, L. Sun, J.-H. Cho, L. Huang, End-to-End Multimodal Fact-Checking and Explanation Generation: A Challenging Dataset and Models, in: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2023, pp. 2733–2743. doi:10.1145/3539618.3591879. arXiv:2205.12487.
[27] N. Kotonya, F. Toni, Towards a Framework for Evaluating Explanations in Automated Fact Verification, 2024. arXiv:2403.20322.
[28] Z. Zhang, K. Rudra, A. Anand, Explain and Predict, and then Predict Again, in: Proceedings of the 14th ACM International Conference on Web Search and Data Mining, WSDM ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 418–426. doi:10.1145/3437963.3441758.
[29] P. Atanasova, J. G. Simonsen, C. Lioma, I. Augenstein, Generating Fact Checking Explanations, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, 2020, pp. 7352–7364.
[30] S. Schulhoff, M. Ilie, N. Balepur, K. Kahadze, A. Liu, C. Si, Y. Li, A. Gupta, H. Han, S. Schulhoff, P. S. Dulepet, S. Vidyadhara, D. Ki, S. Agrawal, C. Pham, G. Kroiz, F. Li, H. Tao, A. Srivastava, H. Da Costa, S. Gupta, M. L. Rogers, I. Goncearenco, G. Sarli, I. Galynker, D. Peskoff, M. Carpuat, J. White, S. Anadkat, A. Hoyle, P. Resnik, The Prompt Report: A Systematic Survey of Prompting Techniques, 2024. arXiv:2406.06608.
[31] B. T. Willard, R. Louf, Efficient Guided Generation for Large Language Models, 2023.
[32] A. Hanselowski, C. Stab, C. Schulz, Z. Li, I. Gurevych, A Richly Annotated Corpus for Different Tasks in Automated Fact-Checking, 2019. doi:10.48550/arXiv.1911.01214. arXiv:1911.01214.
[33] M. Schlichtkrull, Z. Guo, A. Vlachos, AVeriTeC: A Dataset for Real-world Claim Verification with Evidence from the Web, 2023. doi:10.48550/arXiv.2305.13117. arXiv:2305.13117.
[34] D. Jiang, Y. Li, G. Zhang, W. Huang, B. Y. Lin, W. Chen, TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks, 2024.
[35] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen, N. Cooper, G. Adams, J. Howard, I. Poli, Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference, 2024. doi:10.48550/arXiv.2412.13663. arXiv:2412.13663.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lewandowsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cook</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lombardi</surname>
          </string-name>
          ,
          <source>Debunking Handbook</source>
          <year>2020</year>
          ,
          <year>2020</year>
          . doi:
          <volume>10</volume>
          .17910/B7.1182.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          , Fact Checking:
          <article-title>Task definition and dataset construction</article-title>
          , in: C.
          <string-name>
            <surname>DanescuNiculescu-Mizil</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Eisenstein</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>McKeown</surname>
            ,
            <given-names>N. A.</given-names>
          </string-name>
          <string-name>
            <surname>Smith</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science</source>
          , Association for Computational Linguistics, Baltimore,
          <string-name>
            <surname>MD</surname>
          </string-name>
          , USA,
          <year>2014</year>
          , pp.
          <fpage>18</fpage>
          -
          <lpage>22</lpage>
          . doi:
          <volume>10</volume>
          .3115/v1/
          <fpage>W14</fpage>
          -2508.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tremayne</surname>
          </string-name>
          ,
          <string-name>
            <surname>Toward Automated</surname>
          </string-name>
          Fact-Checking:
          <article-title>Detecting Checkworthy Factual Claims by ClaimBuster</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <string-name>
            <surname>ACM</surname>
          </string-name>
          ,
          <source>Halifax NS Canada</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1803</fpage>
          -
          <lpage>1812</lpage>
          . doi:
          <volume>10</volume>
          .1145/3097983.3098131.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <source>A Survey on Automated Fact-Checking, Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>178</fpage>
          -
          <lpage>206</lpage>
          . doi:
          <volume>10</volume>
          .1162/tacl_a_
          <fpage>00454</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>B.</given-names>
            <surname>Nyhan</surname>
          </string-name>
          , Facts and Myths about Misperceptions,
          <source>Journal of Economic Perspectives</source>
          <volume>34</volume>
          (
          <year>2020</year>
          )
          <fpage>220</fpage>
          -
          <lpage>236</lpage>
          . doi:
          <volume>10</volume>
          .1257/jep.34.3.220.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Cresci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trujillo</surname>
          </string-name>
          , T. Fagni,
          <article-title>Personalized Interventions for Online Moderation</article-title>
          ,
          <source>in: Proceedings of the 33rd ACM Conference on Hypertext and Social Media</source>
          , HT '22,
          <string-name>
            <surname>Association</surname>
          </string-name>
          for Computing Machinery, New York, NY, USA,
          <year>2022</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>251</lpage>
          . doi:
          <volume>10</volume>
          .1145/3511095.3536369.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>B.</given-names>
            <surname>Thompson</surname>
          </string-name>
          ,
          <article-title>How to battle misinformation with Sander van der Linden</article-title>
          ,
          <source>Nature</source>
          (
          <year>2023</year>
          ).
          <source>doi:10. 1038/d41586-023-00899-0.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zafarani</surname>
          </string-name>
          ,
          <article-title>A Survey of Fake News: Fundamental Theories, Detection Methods, and Opportunities</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>53</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>40</lpage>
          . doi:
          <volume>10</volume>
          .1145/3395046. arXiv:
          <year>1812</year>
          .00315.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>L.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <source>Understanding the promise and limits of automated fact-checking</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>P. B. Brandtzaeg</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Følstad</surname>
          </string-name>
          ,
          <article-title>Trust and distrust in online fact-checking services</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>60</volume>
          (
          <year>2017</year>
          )
          <fpage>65</fpage>
          -
          <lpage>71</lpage>
          . doi:
          <volume>10</volume>
          .1145/3122803.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>V.</given-names>
            <surname>Setty</surname>
          </string-name>
          ,
          <article-title>Surprising Eficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models</article-title>
          ,
          <source>in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval</source>
          , ACM, Washington DC USA,
          <year>2024</year>
          , pp.
          <fpage>2842</fpage>
          -
          <lpage>2846</lpage>
          . doi:
          <volume>10</volume>
          . 1145/3626772.3661361.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          , M.-
          <string-name>
            <given-names>W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          , BERT:
          <article-title>Pre-training of Deep Bidirectional Transformers for Language Understanding</article-title>
          ,
          <year>2019</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1810</year>
          .
          <volume>04805</volume>
          . arXiv:
          <year>1810</year>
          .04805.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Narasimhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I. Sutskever</given-names>
            ,
            <surname>Improving Language Understanding by Generative Pre-Training</surname>
          </string-name>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          , W.-t. Yih,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks</article-title>
          ,
          <year>2021</year>
          . doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>2005</year>
          .
          <volume>11401</volume>
          . arXiv:
          <year>2005</year>
          .11401.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>W.</given-names>
            <surname>Ferreira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <article-title>Emergent: A novel data-set for stance classification</article-title>
          , in: K.
          <string-name>
            <surname>Knight</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Nenkova</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          Rambow (Eds.),
          <source>Proceedings of the</source>
          <year>2016</year>
          <article-title>Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</article-title>
          , San Diego, California,
          <year>2016</year>
          , pp.
          <fpage>1163</fpage>
          -
          <lpage>1168</lpage>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <fpage>N16</fpage>
          -1138.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>N.</given-names>
            <surname>Kotonya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Toni</surname>
          </string-name>
          ,
          <source>Explainable Automated Fact-Checking: A Survey</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>D.</given-names>
            <surname>Russo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Tekiroğlu</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Guerini, Benchmarking the Generation of Fact Checking Explanations, Transactions of the Association for Computational Linguistics 11 (</article-title>
          <year>2023</year>
          )
          <fpage>1250</fpage>
          -
          <lpage>1264</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>I.</given-names>
            <surname>Eldifrawi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Trabelsi</surname>
          </string-name>
          ,
          <source>Automated Justification Production for Claim Veracity in Fact Checking: A Survey on Architectures and Approaches</source>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2407</volume>
          .
          <fpage>12853</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. T.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-Y.</given-names>
            <surname>Kan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Fact-Checking Complex Claims with Program-Guided Reasoning</article-title>
          , in:
          <string-name>
            <given-names>A.</given-names>
            <surname>Rogers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Okazaki</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>6981</fpage>
          -
          <lpage>7004</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shu</surname>
          </string-name>
          ,
          <article-title>Explainable Claim Verification via Knowledge-Grounded Reasoning with Large Language Models</article-title>
          , in:
          <string-name>
            <given-names>H.</given-names>
            <surname>Bouamor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Bali</surname>
          </string-name>
          (Eds.),
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Singapore,
          <year>2023</year>
          , pp.
          <fpage>6288</fpage>
          -
          <lpage>6304</lpage>
          . doi:10.18653/v1/2023.findings-emnlp.416.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.</given-names>
            <surname>Maynez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bohnet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>McDonald</surname>
          </string-name>
          ,
          <article-title>On Faithfulness and Factuality in Abstractive Summarization</article-title>
          , in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Jurafsky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Schluter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tetreault</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <publisher-name>Association for Computational Linguistics</publisher-name>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>