<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Benjamin Curtis</string-name>
          <email>curtisbw@dukes.jmu.edu</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefania Dzhaman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthew Maisonave</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jian Wu</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science &amp; Engineering, Lehigh University</institution>
          ,
          <addr-line>Bethlehem, PA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Computer Science, Old Dominion University</institution>
          ,
          <addr-line>Norfolk, VA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>James Madison University</institution>
          ,
          <addr-line>Harrisonburg, VA</addr-line>
          ,
          <country country="US">United States</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
        <p>Verifying scientific claims is challenging for the general public because most people lack domain knowledge. Manual verification by subject domain experts is accurate, but it is obviously not scalable to meet the rising number of scientific claims on the Web. Whether the emerging large language models and large reasoning models can be used for scientific claim verification, and how their performance compares to that of humans, are still open research questions. To this end, we developed a new benchmark, MSVEC2, that consists of 138 claims from credible fact-verification websites and science news outlets. Two tasks were given to both human and LLM participants. Task 1 asks the tester (LLMs or humans) to discern the truthfulness of claims using only prior knowledge. Task 2 asks testers to determine the stance of an abstract of a research paper relative to a scientific claim. The LLMs evaluated include GPT-3.5, GPT-4, GPT-4o, GPT-o1, and DeepSeek-R1. We recruited 23 college students from various majors to participate in the human study. We found that all LLMs score higher in F1 and accuracy compared to human testers in truthfulness classification (Task 1), with GPT-4o achieving the highest F1 score among all the models. The performance of LLMs in stance classification (Task 2) depended on the prompting configuration, with chain-of-thought prompting yielding consistent improvements for all LLMs except GPT-o1. However, the best performance of LLMs is still not sufficient for reliable scientific claim verification under standard prompt settings.</p>
      </abstract>
      <kwd-group>
        <kwd>scientific claim verification</kwd>
        <kwd>large language model</kwd>
        <kwd>large reasoning model</kwd>
        <kwd>prompt engineering</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Online scientific disinformation misrepresents the findings of scientific papers and disseminates
misleading or even malicious information to internet users. Scientific misinformation
has become rampant in the news and on social media sites. Fact verification websites such as
Reuters.com and Snopes.com use teams of professionals to fact-check claims from multiple sources
before judging their truthfulness. However, manually verifying scientific claims is time-consuming and
often requires extensive domain knowledge (e.g., to read and digest scientific literature), and therefore
does not scale to the massive number of claims spread on the internet. This leaves a pressing need
for tools that can automatically verify scientific claims by assessing their credibility and providing a
rationale for the assessment. Large language models (LLMs) and their variants, large reasoning models
(LRMs), have been shown to have exceptional skills in text parsing and reasoning tasks, e.g., [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. For
convenience, we call both types LLMs. Although LLMs have been evaluated in their fact-checking
capabilities against benchmark datasets such as FEVER [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the majority of existing datasets focus on
verifying general claims. Whether contemporary LLMs’ capabilities on scientific claim verification
(SCV)
have reached the level of human beings has not been systematically investigated. It also remains unclear
whether their performance is sufficient for reliable deployment in SCV applications.
      </p>
      <p>
        In this paper, we aim to fill this gap by evaluating the performance of five widely used LLMs on
a carefully curated dataset containing 138 scientific claims compiled from credible fact verification
websites. As a pilot study, we explore baseline prompting methods, including zero-shot, one-shot, and
Chain-of-Thought (CoT; [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) on two tasks. Task 1 requires testers (i.e., LLMs or human respondents) to
judge the truthfulness of scientific claims. Task 2 requires testers to classify the stances of an abstract
from a scientific paper relative to a claim. To evaluate the human performance on the same tasks, we
recruited 23 college students and asked them the same questions. The results allow us to perform a
comparative study across LLMs and between LLMs and college students.
      </p>
      <p>
        To evaluate the performance, we developed a new dataset by carefully selecting a subsample of
scientific claims from an existing dataset, MSVEC [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The new dataset consists of 138 scientific claims,
each of which is annotated as true or false based on its original labels in the fact verification websites
or credible science news outlets, and paired with a reference abstract that supports the claim, refutes
the claim, or does not have enough information to determine the truthfulness of the claim.
      </p>
      <p>We perform extensive experiments by varying the prompts and evaluate performance in multiple
settings. For Task 1, we find that all LLM versions outperform humans in discerning both true and
false claims. We also observe that the two LRMs (GPT-o1 and DeepSeek-R1) tend to assign the “false”
label to claims they struggle with. For Task 2, we find that CoT prompting on all GPT versions
outperforms humans in nearly all trials; the few-shot prompting method generally outperforms humans,
and the zero-shot prompting method yields mixed results. We also find that for Task 2, the Refutes
stance achieves performance roughly 10% lower in both the human and LLM experiments, suggesting that
contradictory relationships are the hardest to identify correctly compared with Supports or NEI (not
enough information) stances.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. SCV Datasets</title>
        <p>
          Early SCV datasets for evidence-based fact-checking were created from general fact-checking sources
rather than scientific sources. The FEVER dataset [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] laid the groundwork for claim–evidence alignment.
The dataset contains 185,000 claims sourced from Wikipedia and uses the claim labeling structure
Supports/Refutes/NEI, which FEVER helped define. The SciFact dataset [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] contains 1,409 scientific
claims, rewritten from human-coded citation contexts and supported by 5,183 paper abstracts, mostly in
biomedical science domains, which are also labeled as Supports/Refutes/NEI.
        </p>
        <p>
          The SCitance [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] and RECV [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] datasets emphasize the reasoning process. The aim behind SCitance
was to manually rewrite FEVER-style claims with “citances”, described as naturally occurring citation
sentences. The RECV benchmark introduced either deductive or abductive reasoning-type labels, which
span across multiple datasets, including VitaminC [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], CLIMATE-FEVER [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], and PHEMEPlus [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Other
datasets are developed in specific domains, such as CliVER [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] (biomedical sciences), HealthVer [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]
(health-related claims), and NLI4CT [12] (clinical trials). These datasets contributed reasoning awareness
and naturalistic text to the space; however, they lack human baselines and cross-domain diversity, and
their claims and claim labels are not collected from credible fact-verification websites.
        </p>
        <p>As SCV datasets grew, researchers focused their attention on expanding the size of the datasets by
automatically generating claims and evidence. The datasets SciClaimHunt and SciClaimHuntNum
[13] were built using this methodology. Synthetic datasets achieve impressive scalability, wide domain
diversity, and a meaningful inclusion of numerical reasoning. However, synthetic negations can
misrepresent reasoning due to a lack of human nuance, and they may embed generator bias.</p>
        <p>Our dataset distinguishes itself from existing datasets in several key aspects. First, instead of rewriting
citation contexts in scientific papers as claims, our claims are collected from fact-checking websites or
credible science news outlets, making them closer to the scientific claims seen in the real world. The
global truthfulness has been verified by experts or science news editors instead of being inferred from
the citation relationships in scientific papers. Furthermore, the dataset contains 9 distinct domains with
a stance distribution balanced as 35.5% Supports, 21% Refutes, and 43.5% NEI. For the binary truthfulness
task, the stances are balanced as 53.6% True and 46.4% False. In our dataset, a True claim does not
necessarily have to be associated with a supportive abstract. Table 1 summarizes the properties of
selected SCV datasets and ours. [Table 1, partially recovered: RECV — 2,000 claim–evidence pairs; labels: Supports / Refutes (Reasoning); scope: multi-domain reasoning tasks; notes: derived from SCITANCE with added deductive/abductive reasoning annotations. MSVEC2 (ours) — 138 claims; labels: True / False and Supports / Refutes / NEI; scope: open-domain scientific claims (9 domains); notes: human-verified benchmark updated from MSVEC (2023), with curated claim–abstract pairs for LLM evaluation.]</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. SCV Methods</title>
        <p>The two mainstream SCV approaches are retrieval-based systems and LLMs.</p>
        <p>FactDetect [15] introduced a modular pipeline that performs claim decomposition, evidence retrieval,
fact-level evaluation, and aggregation. Both the lexical retriever (BM25 [16]) and the dense retriever
(ColBERT [17]) were used to locate relevant sentences for evidence retrieval. The CliVER framework
consists of document collection (in which hybrid lexical and dense retrieval from PubMed is used),
document retrieval, sentence selection, label prediction, and training and evaluation. An ensemble of
RoBERTa [18], PubMedBERT [19], and T5 [20] models predicts whether the rationale Supports, Refutes,
or is Neutral to the claim. Recently, CoVERt [21] was introduced along with the PICO [22] structured
evidence framework. This approach emphasizes scalability and domain specialization. Both FactDetect
and CliVER highlight retrieval and decomposition for accurate verification. Limitations to these systems
include a supervised data dependency, domain-specific design, and limited reasoning depth.</p>
        <p>Recently, researchers have shifted their focus to improving SCV using LLMs, which provide a
generalizable solution for open-domain claim verification. For example, ProToCo [23] is a prompt-based
consistency training framework that uses three claim variants: affirmation, negation, and uncertainty.
The framework trains LLMs to keep answers logically coherent across variants and improves
factual reliability in few-shot and zero-shot settings. MAPLE [24] models micro-language evolution between
claims and evidence, capturing subtle semantic shifts that signal factual entailment. A T5
model with LoRA [25] is trained to generate claims from evidence and vice versa.</p>
        <p>This paper focuses on providing a baseline comparison of SCV performance between commonly used
LLMs and humans (represented by college students), which has not been done by any of the previous
studies.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. The MSVEC2 Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset Construction and Properties</title>
        <p>
          Our SCV dataset, named MSVEC2, is derived from the original MSVEC dataset consisting of 200 labeled
claims and claim–abstract pairs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The claims were sourced from fact-checking websites and credible
news outlets, and the abstracts are from peer-reviewed scientific articles. We removed 62 claims in the
following categories: (1) non-scientific claims; (2) claims that are not self-contained (i.e., needing more
context); and (3) compound claims (i.e., claims composed of multiple sub-claims) (see Table 3 for examples).
        </p>
        <p>In addition to the removal of the above unqualified claims, each claim-abstract pair is also manually
inspected by two undergraduate researchers independently against the source to ensure the paper was
actually used to support/refute the claims. In certain cases, the MSVEC data may identify a different
paper from the correct one in the claim–abstract pair because the original news article cites several
papers. The consensus rate is 99%. Pairs with misidentified papers or without reviewer consensus
were removed, leaving 138 scientific claims in the final dataset. Each claim is labeled True or False
and paired with an abstract that supports, refutes, or does not provide enough information (NEI) relative to
the claim. Covering nine distinct scientific domains (Table 2), MSVEC2 was designed as a multi-domain
benchmark dataset rather than a domain-specific corpus. The distribution of stance labels is shown in
Table 2. In total, 53.6% of the claims were labeled True and 46.4% False.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Research Tasks</title>
        <p>MSVEC2 supports two tasks. Task 1 evaluates the ability to determine the truthfulness of a scientific
claim. The tester, either an LLM or a human respondent, is presented with the claim text only and
asked to judge whether it is true or false. Task 2 evaluates the ability to classify the stance of a scientific
abstract relative to a claim. Given a claim and an abstract, the tester, either an LLM or a human
respondent, selects one of three stances: Supports, Refutes, or NEI.</p>
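The two task inputs can be sketched as a minimal record type. This is an illustrative sketch only: the field names below (claim, label, abstract, stance) are our assumptions, not the dataset's actual schema.

```python
# Sketch of one MSVEC2 entry and what each task exposes to the tester.
# Field names are assumptions for illustration, not the dataset's real schema.
from dataclasses import dataclass

@dataclass
class ClaimRecord:
    claim: str      # claim text shown to the tester
    label: str      # Task 1 gold label: "True" or "False"
    abstract: str   # paired reference abstract
    stance: str     # Task 2 gold label: "Supports", "Refutes", or "NEI"

record = ClaimRecord(
    claim="Drinking coffee cures the common cold.",
    label="False",
    abstract="We surveyed 500 patients ... no such association was found ...",
    stance="Refutes",
)

# Task 1 presents only the claim; Task 2 presents the claim plus the abstract.
task1_input = record.claim
task2_input = (record.claim, record.abstract)
```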
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Evaluation Metrics</title>
        <p>Both tasks can be treated as classification problems, so we adopt precision (P), recall (R), and F1 score as the
evaluation metrics. We also calculate accuracy to evaluate the overall performance. For Task 1, we
calculate the P, R, and F1 of the True and False claims. For Task 2, we calculate the P, R, and F1 for
the Supports, Refutes, and NEI stances.</p>
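For concreteness, the per-class metrics above can be computed as follows. This is a minimal sketch using the standard definitions, not the authors' evaluation code.

```python
# Standard per-class precision/recall/F1 and overall accuracy, as used for
# both tasks. A minimal sketch, not the authors' evaluation code.
def per_class_metrics(gold, pred, label):
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

# Toy Task 1 example: one True claim misclassified as False.
gold = ["True", "False", "True", "False"]
pred = ["True", "False", "False", "False"]
p, r, f1 = per_class_metrics(gold, pred, "True")  # P=1.0, R=0.5
acc = accuracy(gold, pred)  # 0.75
```

The same functions apply to Task 2 by passing the three stance labels in turn.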
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Human Study</title>
        <p>Because we aim to compare humans’ performance against LLMs’, we selected human participants
with educational backgrounds sufficient to understand scientific claims and make independent decisions.
Although varied across countries, the general academic goal of K-12 education is to equip students with
foundational knowledge and critical thinking skills. The majority of college students have finished K-12
education, so they should possess a reasonable educational background to make independent decisions
about scientific claims. The goal of graduate school is to achieve a deep, specialized education in a chosen
field. Therefore, choosing graduate students would significantly narrow the range of the represented
population of this study. According to the US Census Bureau, more than 90% of the US population aged
18 and above has finished secondary education. Obtaining a large-scale human subject sample with
diverse ages and backgrounds is beyond our capability and will be reserved for future study. Therefore,
we chose to focus on college students because they meet our educational level criteria, and we can draw
meaningful conclusions based on a reasonably sized human subject sample.</p>
        <p>We recruited a total of 23 college students from the 1st through the 4th year from an R1 university
according to the Carnegie classification system. The participants include 12 females and 11 males, with
an average GPA of 3.57. Among the participants, 69.6% majored in engineering disciplines, 17.4% in
the nursing, biological, and chemistry sciences, and 13.0% in other disciplines. Each participant took
part in a survey to carry out Task 1 and Task 2. Qualtrics, an online survey platform, was used to pose
the queries. Participants took the surveys on their own devices and on their own time. Participants
were shown a five-minute instructional video before beginning the surveys, which gave examples of
questions they would encounter and explained the protocol for answering them.</p>
        <p>The whole survey was divided into 5 sessions, each covering 14 claims. Each claim had two
questions, one for Task 1 and one for Task 2 (see Section 3.2). Each session generally took participants
between 30 and 60 minutes to complete, and they were asked to complete all 5 sessions within 10 days.
A limit of 10 days was given to balance the workload and reduce the possibility of acquiring external
knowledge relevant to the claims through school education or life experience, so their performance
stayed relatively consistent across all sessions. Participants were required not to refer to any external
sources when working on the tasks. Each participant was awarded an Amazon gift card worth $80 upon
completion as compensation for their time. The human study results were micro-averaged, or pooled
together and evaluated as one participant, and the F1-score was compared to the F1-score observed in
the LLM trials.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Large Language Model Study</title>
        <p>
          Here, we evaluate 5 commonly used LLMs, including GPT-3.5, GPT-4, GPT-4o, GPT-o1, and DeepSeek-R1
on Tasks 1 and 2. GPT-3.5, GPT-4, and GPT-4o were selected due to their strong performance on many
general tasks and their popularity as baselines for comparison, e.g., [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. GPT-o1 and DeepSeek-R1 are
usually considered LRMs [26].
        </p>
        <p>
          For Task 1, we only test the zero-shot prompting method because the claims are ad hoc and thus do
not need examples or an articulation of the reasoning process. For Task 2, we test three prompting
methods for each LLM (including LRMs), zero-shot, few-shot, and chain-of-thought (CoT; [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]). In
few-shot prompts, we provide an LLM with examples of correctly answered queries before posing the
test query. In the CoT prompts, the examples additionally articulate the reasoning process: in each
example, an abstract and a claim are first given, followed by a four-step reasoning
process, shown below.
        </p>
        <p>Read the claim and abstract below, then reason step by step before answering the
question:
Claim: [example claim]
Abstract: [example abstract]
Question: Does the abstract of the scientific paper support the claim, refute the claim,
or is there not enough information?
Answer: Step 1: Read the whole abstract and extract information relevant to the question:
[relevant information]
Step 2: Identify the relevant statement: [relevant statement]
Step 3: Give reasoning to rationalize your decision: [rationale]
Step 4: Conclusion: [conclusion]
Now, read the new claim and abstract below and answer the question at the end:
Claim: [target claim]
Abstract: [target abstract]
Question: Does the abstract of the scientific paper support or refute the claim, or is
there not enough information?</p>
        <p>Answer with one of the following labels: SUPPORTS, REFUTES, or NOT ENOUGH INFORMATION</p>
        <sec id="sec-4-3-1">
          <title>4.3.1. Experimental Settings</title>
          <p>All model runs were performed with temperature 0 using standardized prompt templates to ensure
consistency across models and tasks. Model outputs were normalized to canonical labels (i.e., True/False
for Task 1 and Supports/Refutes/NEI for Task 2) before scoring. Fifteen entries in the dataset were
reserved as examples for few-shot and CoT prompting (Task 2 only). The remaining entries were used as
test samples for Tasks 1 and 2. For each claim in Task 2, three examples, corresponding to the three stance
labels (i.e., Supports/Refutes/NEI), were given. We experimented with up to 3 shots. The examples
were selected by prioritizing an even distribution of claims from different domains.</p>
        </sec>
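The normalization of model outputs to canonical labels can be sketched as simple keyword matching. The exact rules used in the experiments are not specified above, so the matching logic below is an assumption for illustration.

```python
# Sketch of the output-normalization step: mapping free-form model output to
# canonical stance labels. The actual rules used by the authors are not
# specified; this keyword matching is an illustrative assumption.
CANONICAL = {
    "NOT ENOUGH INFORMATION": "NEI",
    "SUPPORTS": "Supports",
    "REFUTES": "Refutes",
    "NEI": "NEI",
}

def normalize_stance(raw_output: str) -> str:
    text = raw_output.strip().upper()
    # Match longest keys first so "NOT ENOUGH INFORMATION" wins over "NEI".
    for key in sorted(CANONICAL, key=len, reverse=True):
        if key in text:
            return CANONICAL[key]
    return "NEI"  # fall back to NEI when no label is recognized

print(normalize_stance("Answer: the abstract REFUTES the claim."))  # Refutes
print(normalize_stance("Not enough information to decide."))        # NEI
```

An analogous mapping handles Task 1 outputs (True/False).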
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <sec id="sec-5-1">
        <title>5.1. Task 1</title>
        <p>The results of Task 1 are summarized in Table 4. The results suggest that all LLMs outperform humans
in terms of F1 scores and accuracy at determining the truthfulness of scientific claims, whether the
original claim is true or false. The discrepancy in accuracy ranges from 0.10 to 0.19. The discrepancies
in F1 on True claims and F1 on False claims range from 0.10 to 0.22 and from 0.09 to 0.19, respectively.
All LLMs achieve slightly better F1 scores for False claims compared with True claims.</p>
        <p>The two LRM models (GPT-o1 and DeepSeek-R1) favored recall on False claims, showing that they
prefer rejecting uncertain statements rather than incorrectly affirming them. In contrast, the three
general LLMs (GPT-3.5, GPT-4, and GPT-4o) favored recall on True claims, indicating an opposite bias.
The results from Task 1 demonstrate that state-of-the-art LLMs are more accurate than
humans at discerning true or false scientific claims. However, even the best-performing LLM (an
accuracy of 0.90 and an F1 of 0.91 on False claims for GPT-4o) has significant room to improve.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Task 2</title>
        <p>The F1 scores of humans and LLMs with various prompting methods are shown in Table 5 (detailed
results are shown in the Appendix), suggesting that, depending on the version and prompting method,
LLMs may outperform humans at classifying the stance of a scientific paper abstract with respect to
a given claim. LLMs outperformed human participants in most stance categories, with the amount of
improvement depending on the prompting method and model type. CoT prompting produced the most
consistent performance gains and was especially beneficial for Refute stances, likely because the
examples and step-wise reasoning instructions help models resolve contradictions between the
claim and the abstract. Few-shot prompting performed well on Support stances, but was less effective
on Refute and NEI stances. GPT-4o achieved the most balanced results across all stances. GPT-o1 and
DeepSeek-R1 reached similar accuracy and performed particularly well on the NEI and Refute classes. The
human group achieves relatively low performance on the Refute stance.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <sec id="sec-6-1">
        <title>6.1. Performance Discussion</title>
        <p>The implications of our study shed light on the possibility of using state-of-the-art LLMs to discern
the truthfulness of scientific claims seen on the web. Our results suggest that LLMs have powerful
reasoning and text-parsing capabilities that allow them to outperform humans, here represented by
college students, at scientific claim verification tasks. However, the overall performance of the best
LLMs is still unsatisfactory for deployment as a service. For example, the best performance on Task
1 was achieved by GPT-4o with an accuracy of 0.90. The best F1 scores are 0.87 and 0.91 for True and False
claims, respectively. This indicates that a significant fraction of claims are still mislabeled.</p>
        <p>In contrast, the low F1 scores of the Refutes stance in Task 2 (Table 5) for both humans and LLMs
suggest that contradictory relationships between the claims and abstracts are likely more difficult
to discern than support or NEI relationships.</p>
        <p>The limitations of our study are the size of the human study and the participant population being
limited to college students. In future work, a larger and more diverse human participant pool will be
constructed to better represent the web content consumers.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Reasoning-Optimized Model Behavior</title>
        <p>
          GPT-o1 and DeepSeek-R1 are reasoning-optimized models that generate intermediate reasoning traces
before outputting answers and use deliberative reasoning over direct factual recall. The results of Task 1
(Table 4) indicate that GPT-o1 and DeepSeek-R1 lean toward cautious labeling like False, which is
seen from the high recall compared with low precision values. This pattern aligns with findings from
previous benchmarks, which show that explicit reasoning and structured explanation traces can lead
models to over-reject partially supported claims, e.g., [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Our findings show that producing longer
reasoning chains does not always improve F1, which could be attributed to generated rationales being
occasionally self-contradictory or disconnected from evidence. If so, this suggests that explicit reasoning
may introduce error propagation when intermediate steps are not grounded in fact.
        </p>
        <p>GPT-o1 performed best with concise few-shot prompts in Task 2, indicating that GPT-o1 likely
performs implicit internal reasoning that explicit CoT disrupts. Interestingly, DeepSeek-R1 showed the
opposite trend, with 3-CoT prompting yielding the best performance. This may be due to a difference in
structure between the two models, such that DeepSeek-R1 benefits from explicit external reasoning traces that
reinforce stance alignment and coherence. The difference suggests that reasoning-optimized design can
manifest differently and that it shapes the decision style of the model rather than uniformly improving
factual accuracy.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>We developed a new benchmark dataset, MSVEC2, consisting of 138 scientific claims from credible
fact-verification websites and science news outlets, each with a truthfulness label and an abstract that
supports, refutes, or does not contain enough information with respect to the claim. We benchmarked
humans, represented by 23 college students, and 5 state-of-the-art LLMs on two tasks, namely
truthfulness classification and stance classification. We found that all LLMs score higher in F1
and accuracy compared to humans in truthfulness classification, with GPT-4o achieving the highest F1
score among all the models. The performance of LLMs in stance classification depends on the prompting
configuration, with chain-of-thought prompting yielding consistent improvements for all LLMs except GPT-o1.
However, the performance of LLMs is still not sufficient for reliable scientific claim verification under
standard prompt settings.</p>
    </sec>
    <sec id="sec-8">
      <title>Generative AI Declaration</title>
      <p>The authors used ChatGPT and Grammarly to perform grammar and spelling checks, paraphrase, and
reword. After using the tools, the authors reviewed and edited the content as needed and took full
responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgments</title>
      <p>This project is partially supported by the US National Science Foundation Award #2149607.
Evidence-based fact-checking of health-related claims, in: Findings of the association for
computational linguistics: EMNLP 2021, 2021, pp. 3499–3512.
[12] Jullien, Maël and Valentino, Marco and Frost, Hannah and O’Regan, Paul and Landers, Donal and
Freitas, André, SemEval-2023 task 7: Multi-evidence natural language inference for clinical trial
data, arXiv preprint arXiv:2305.02993 (2023).
[13] Kumar, Sujit and Sharma, Anshul and Khincha, Siddharth Hemant and Shrof, Gargi and Singh,
Sanasam Ranbir and Mishra, Rahul, Sciclaimhunt: A large dataset for evidence-based scientific
claim verification, arXiv preprint arXiv:2502.10003 (2025).
[14] D. Wadden, K. Lo, B. Kuehl, A. Cohan, I. Beltagy, L. L. Wang, H. Hajishirzi, Scifact-open: Towards
open-domain scientific claim verification, arXiv preprint arXiv:2210.13777 (2022).
[15] Jafari, Nazanin and Allan, James, Robust claim verification through fact detection, arXiv preprint
arXiv:2407.18367 (2024).
[16] Robertson, Stephen and Zaragoza, Hugo and others, The probabilistic relevance framework: BM25
and beyond, Foundations and Trends® in Information Retrieval 3 (2009) 333–389.
[17] Khattab, Omar and Zaharia, Matei, ColBERT: Eficient and efective passage search via
contextualized late interaction over BERT, in: Proceedings of the 43rd International ACM SIGIR conference
on research and development in Information Retrieval, 2020, pp. 39–48.
[18] Liu, Yinhan and Ott, Myle and Goyal, Naman and Du, Jingfei and Joshi, Mandar and Chen, Danqi
and Levy, Omer and Lewis, Mike and Zettlemoyer, Luke and Stoyanov, Veselin, RoBERTa: A
robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692 (2019).
[19] Gu, Yu and Tinn, Robert and Cheng, Hao and Lucas, Michael and Usuyama, Naoto and Liu,
Xiaodong and Naumann, Tristan and Gao, Jianfeng and Poon, Hoifung, Domain-Specific Language
Model Pretraining for Biomedical Natural Language Processing, ACM Trans. Comput. Healthcare
3 (2021). URL: https://doi.org/10.1145/3458754. doi:10.1145/3458754.
[20] j. Rafel, Colin and Shazeer, Noam and Roberts, Adam and Lee, Katherine and Narang, Sharan and
Matena, Michael and Zhou, Yanqi and Li, Wei and Liu, Peter J, Exploring the limits of transfer
learning with a unified text-to-text transformer 21 (2020) 1–67.
[21] Liu, Hao and Soroush, Ali and Nestor, Jordan G and Park, Elizabeth and Idnay,
Betina and Fang, Yilu and Pan, Jane and Liao, Stan and Bernard, Marguerite and
Peng, Yifan and Weng, Chunhua, Retrieval augmented scientific claim
verification, JAMIA Open 7 (2024) ooae021. URL: https://doi.org/10.1093/jamiaopen/ooae021.
doi:10.1093/jamiaopen/ooae021.
arXiv:https://academic.oup.com/jamiaopen/articlepdf/7/1/ooae021/56904263/ooae021.pdf.
[22] S. A. Miller, J. L. Forrest, Enhancing your practice through evidence-based decision making:
PICO, learning how to ask good questions, Journal of Evidence Based Dental Practice 1 (2001)
136–141. URL: https://www.sciencedirect.com/science/article/pii/S1532338201700243.
doi:10.1016/S1532-3382(01)70024-3.
[23] F. Zeng, W. Gao, Prompt to be consistent is better than self-consistent? Few-shot and
zero-shot fact verification with pre-trained language models, in: Findings of the Association
for Computational Linguistics: ACL 2023, Association for Computational Linguistics, 2023,
pp. 4555–4569.
[24] X. Zeng, A. Zubiaga, MAPLE: Micro analysis of pairwise language evolution for few-shot claim
verification, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational
Linguistics: EACL 2024, Association for Computational Linguistics, St. Julian’s, Malta, 2024, pp.
1177–1196. URL: https://aclanthology.org/2024.findings-eacl.79/.
[25] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, et al.,
LoRA: Low-rank adaptation of large language models, in: International Conference on Learning
Representations (ICLR), 2022.
[26] Z.-Z. Li, D. Zhang, M.-L. Zhang, J. Zhang, Z. Liu, Y. Yao, H. Xu, J. Zheng, P.-J. Wang,
X. Chen, et al., From System 1 to System 2: A survey of reasoning large language models,
arXiv preprint arXiv:2502.17419 (2025).
</p>
      <p>Appendix table: Extended Task 2 Metrics by Model and Configuration (columns: Configuration, Stance, Accuracy, Precision, Recall, F1).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
<string-name>
            <given-names>J.</given-names>
            <surname>Dougrez-Lewis</surname>
          </string-name>
          , M. E. Akhter, F. Ruggeri, S. Löbbers, Y. He, M. Liakata,
          <article-title>Assessing the Reasoning Capabilities of LLMs in the context of Evidence-based Claim Verification</article-title>
          ,
          <source>arXiv preprint arXiv:2402.10735</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>FEVER: a large-scale dataset for fact extraction and VERification</article-title>
, in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Walker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Ji</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stent</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
, Volume
          <volume>1</volume>
          (Long Papers)
          ,
          <source>Association for Computational Linguistics</source>
          , New Orleans, Louisiana,
          <year>2018</year>
          , pp.
          <fpage>809</fpage>
          -
          <lpage>819</lpage>
. URL: https://aclanthology.org/N18-1074/. doi:10.18653/v1/N18-1074.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , et al.,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
<string-name>
            <given-names>M.</given-names>
            <surname>Evans</surname>
          </string-name>
          , D. Soós, E. Landers, J. Wu,
          <article-title>MSVEC: A multidomain testing dataset for scientific claim verification</article-title>
, in:
          <source>Proceedings of the Twenty-fourth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>504</fpage>
          -
          <lpage>509</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>D.</given-names>
            <surname>Wadden</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
<string-name>
            <given-names>M.</given-names>
            <surname>van Zuylen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cohan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>Fact or fiction: Verifying scientific claims</article-title>
, in:
          <string-name>
            <given-names>B.</given-names>
            <surname>Webber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cohn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>7534</fpage>
          -
          <lpage>7550</lpage>
. URL: https://aclanthology.org/2020.emnlp-main.609/. doi:10.18653/v1/2020.emnlp-main.609.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
<string-name>
            <given-names>C.</given-names>
            <surname>Alvarez</surname>
          </string-name>
          , M. Bennett, L. L. Wang,
          <article-title>Zero-shot scientific claim verification using LLMs and citation text</article-title>
          , in:
          <source>Proceedings of the Fourth Workshop on Scholarly Document Processing (SDP 2024)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>269</fpage>
          -
          <lpage>276</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
<string-name>
            <given-names>T.</given-names>
            <surname>Schuster</surname>
          </string-name>
          , A. Fisch, R. Barzilay,
          <article-title>Get Your Vitamin C! Robust Fact Verification with Contrastive Evidence</article-title>
, in:
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rumshisky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hakkani-Tur</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Beltagy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bethard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cotterell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>624</fpage>
          -
          <lpage>643</lpage>
. URL: https://aclanthology.org/2021.naacl-main.52/. doi:10.18653/v1/2021.naacl-main.52.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>T.</given-names>
            <surname>Diggelmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Boyd-Graber</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bulian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ciaramita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Leippold</surname>
          </string-name>
          ,
          <article-title>Climate-fever: A dataset for verification of real-world climate claims</article-title>
,
          <source>arXiv preprint arXiv:2012.00614</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
<string-name>
            <given-names>J.</given-names>
            <surname>Dougrez-Lewis</surname>
          </string-name>
          , E. Kochkina, M. Arana-Catania, M. Liakata, Y. He,
          <article-title>PHEMEPlus: enriching social media rumour verification with external evidence</article-title>
          ,
          <source>arXiv preprint arXiv:2207.13970</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Soroush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. G.</given-names>
            <surname>Nestor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Park</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Idnay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bernard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          , et al.,
          <article-title>Retrieval augmented scientific claim verification</article-title>
          ,
          <source>JAMIA Open</source>
          <volume>7</volume>
          (
          <year>2024</year>
          )
          ooae021.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sarrouti</surname>
          </string-name>
          , A. B. Abacha, Y. Mrabet,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>