<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UNH at CheckThat! 2025: Fine-tuning Vs Prompting in Claim Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Joe Wilder</string-name>
          <email>Joe.Wilder@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikhil Kadapala</string-name>
          <email>Nikhil.Kadapala@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Benji Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mohammed Alsaadi</string-name>
          <email>Mohammed.Alsaadi@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aiden Parsons</string-name>
          <email>Aiden.Parsons@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mitchell Rogers</string-name>
          <email>Mitchell.Rogers@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Palash Agrawal</string-name>
          <email>Palash.Agrawal@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Adam Hassick</string-name>
          <email>Adam.Hassick@unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Dietz</string-name>
          <email>dietz@cs.unh.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of New Hampshire</institution>
          ,
          <addr-line>Durham, NH, 03824</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
<p>We participate in the English edition of CheckThat! 2025 Task 2 and explore various methods of prompting and in-context learning, including few-shot prompting and fine-tuning with different LLM families, with the goal of extracting check-worthy claims from social media passages. Our best METEOR score is achieved by fine-tuning a FLAN-T5 model. However, we observe that higher-quality claims can sometimes be extracted using other methods, even when their METEOR scores are lower.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Faithful extractions. We are especially concerned about extracted claims where LLM hallucinations
introduce content not present in the original post. An example of a hallucinated and overly verbose
claim follows:</p>
      <sec id="sec-1-1">
        <title>Social Media Post:</title>
        <p>The salary of a U.S. Senator is $174,000 per year. This is Joe Biden’s house. . .
seems legit :) The salary of a U.S. Senator is $174,000 per year. This is Joe
Biden’s house. . . seems legit :) The salary of a U.S. Senator is $174,000 per
year. This is Joe Biden’s house. . . seems legit :)</p>
      </sec>
      <sec id="sec-1-2">
<title>Hallucinated Extracted Claim:</title>
        <p>Joe Biden’s house, purchased for an amount significantly exceeding the cumulative
value of his annual U.S. Senator’s salary of $174,000, raises questions about
potential additional, undisclosed sources of income that may have contributed
to the down payment, mortgage payments, property taxes, insurance premiums, and
ongoing maintenance costs associated with the property.</p>
<p>Our emphasis is on exploring the design space across different LLMs and methods—fine-tuning
and few-shot prompting—in search of the best trade-off between optimizing the METEOR score and
producing claims that are genuinely useful for human fact-checkers.</p>
<p>We conduct a broad exploration of methods, prompts, and LLMs, casting a wide net. Our approaches
fall into three overarching categories:
1. Fine-tuning approaches,
2. Prompting approaches, and
3. "Frustratingly easy" baselines (we did not use the baselines provided by the organizers).</p>
        <p>We describe all explored approaches and submit those performing best on the validation set in terms
of METEOR.</p>
        <p>We only use resources provided by the Task 2 organizers and publicly available large language models
from Hugging Face and the Together.AI API service.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Approaches: Fine-Tuning and Prompting</title>
      <p>In this section, we describe methods relying on fine-tuning across LLMs of different parameter scales.</p>
      <p>
        Our key takeaway: Flan-T5 Large [
        <xref ref-type="bibr" rid="ref18 ref3">3</xref>
        ] ofered the best compromise between raw capability and
practical fine-tuning feasibility under our hardware and time constraints.
      </p>
      <sec id="sec-2-1">
<title>2.1. Fine-Tuned Flan-T5 Large</title>
        <p>
          This approach fine-tuned the Flan-T5 Large [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] model on the CLEF 2025 Task 2 training dataset to
align its outputs more closely with the gold-standard claims. Fine-tuning was performed using the
Hugging Face transformers library without advanced techniques such as LoRA or PEFT.
        </p>
        <p>Due to resource limitations, billion-parameter models were out of scope. We opted for Flan-T5 Large
(783M parameters), which can run locally and is more manageable to train due to its smaller size. A
straightforward task-specific prompt was prepended to training examples:</p>
        <p>Please read the following social media post and extract the claim made within it.
Normalize the claim by rephrasing it in a clear and concise manner.
Post: $text
Extracted Claim:</p>
        <p>The training ran for 10 epochs on an NVIDIA 4060 GPU, taking nearly four days to complete.
This approach’s strength lies in its ability to internalize extraction patterns not easily expressible via
prompting alone. It achieved an average validation-set METEOR score of 0.5569.</p>
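        <p>For concreteness, a minimal sketch of this setup using the Hugging Face transformers Trainer follows;
the dataset path, column names, and hyperparameters are illustrative assumptions, not our exact configuration:</p>
        <p># Minimal fine-tuning sketch (assumed CSV columns: "post" and "claim")
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large")

PROMPT = ("Please read the following social media post and extract the claim "
          "made within it. Normalize the claim by rephrasing it in a clear "
          "and concise manner.\nPost: {text}\nExtracted Claim:")

def preprocess(example):
    # Prepend the task prompt to each post; the gold claim is the target.
    enc = tok(PROMPT.format(text=example["post"]), truncation=True, max_length=512)
    enc["labels"] = tok(example["claim"], truncation=True, max_length=64)["input_ids"]
    return enc

train_ds = load_dataset("csv", data_files="task2_train.csv")["train"].map(preprocess)
args = Seq2SeqTrainingArguments(output_dir="flan-t5-claims", num_train_epochs=10,
                                per_device_train_batch_size=4, learning_rate=3e-5)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train_ds,
                         data_collator=DataCollatorForSeq2Seq(tok, model=model))
trainer.train()</p>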
      </sec>
      <sec id="sec-2-2">
<title>2.2. LoRA Fine-Tuning of Flan-T5 Base</title>
        <p>
          The motivation behind this submission was to balance performance with efficiency. Full fine-tuning of
Flan-T5 Base [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] gives strong results but incurs high computational costs. Prompt tuning, while efficient,
yields limited gains—especially on larger models. LoRA [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] provides a middle ground by updating only
0.4% of parameters, keeping the rest of the model frozen.
        </p>
        <p>LoRA allowed us to adapt the model effectively with minimal overhead. We chose Flan-T5 Base due
to its strong baseline performance, aiming to retain quality while reducing resource demands.</p>
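        <p>A minimal sketch of the adapter configuration with the peft library follows; the rank, alpha, and
target modules are illustrative choices that freeze the base model and train roughly 0.4% of its parameters:</p>
        <p>from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

base = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")
lora = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM, r=8, lora_alpha=32,
                  lora_dropout=0.05, target_modules=["q", "v"])
model = get_peft_model(base, lora)  # base weights stay frozen
model.print_trainable_parameters()</p>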
        <p>This model achieved a validation METEOR score of 0.3054 and was our third-best test-set run, with a
test METEOR of 0.28.</p>
      </sec>
      <sec id="sec-2-3">
<title>2.3. Fine-Tuned DeepSeek-R1-Distill-Llama-8b Approach</title>
        <p>
          Here, we fine-tuned the DeepSeek-R1-Distill-Llama-8b [
          <xref ref-type="bibr" rid="ref20 ref5">5</xref>
          ] causal language model on the training
dataset to assess its ability to produce reference-style summaries. The training used the following
system prompt:
        </p>
        <p>Extract the verifiable claim as one sentence from the user input.</p>
        <p>This approach achieved a validation METEOR score of 0.2541.</p>
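        <p>As a sketch, training pairs can be rendered with the model’s chat template along the following lines;
the example post variable is a placeholder:</p>
        <p>from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-Distill-Llama-8B")
post = "The salary of a U.S. Senator is $174,000 per year. ..."  # placeholder
messages = [
    {"role": "system",
     "content": "Extract the verifiable claim as one sentence from the user input."},
    {"role": "user", "content": post},
]
input_ids = tok.apply_chat_template(messages, add_generation_prompt=True,
                                    return_tensors="pt")</p>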
        <p>Fine-tuning yields marginal improvements over a baseline model using a single-shot prompt on
the validation set, while zero-shot prompting offers no measurable gain. We conclude that achieving
meaningful improvements through fine-tuning—even with moderate-sized 8B-parameter LLMs and
using LoRA—would require computational resources beyond our current budget constraints.</p>
        <p>
          We compare our fine-tuning methods to prompting-based and in-context learning approaches. These
include variations on few-shot prompting, self-refinement [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], self-scoring [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], and post-processing.
        </p>
      </sec>
      <sec id="sec-2-4">
        <title>2.4. Claimifying Social Media Posts with Self-Refinement</title>
        <p>This method uses a combination of prompting strategies and an iterative Self-Refinement stage. In this
stage, the same LLM that generates the initial claim provides feedback based on specific criteria that
evaluate the check-worthiness of the claim against the input text. This feedback, along with the input
and initial claim, is fed back into the same LLM, which generates a refined version of the claim based
on the feedback.</p>
        <p>We tested both zero-shot and few-shot prompts, with and without a Chain-of-Thought (CoT) trigger
phrase. The CoT phrase acts as a cue for the model to reason step by step before producing a final claim.
We evaluated these configurations with and without one or more iterations of the self-refinement stage.</p>
        <p>
          For the Task 2 submission, we used the step-by-step "Claimify" process [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. After extracting the
initial claim, we applied one iteration of SELF-REFINE [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ].
        </p>
        <p>
          We also evaluated a variant using a few-shot prompt followed by the CoT trigger phrase
Let’s think step by step [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ].
        </p>
        <p>
          Models explored: GPT-4.1-nano [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], Gemini-2.0-Flash [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], LLaMA-3.3-70B [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], Grok 3 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
Highest average METEOR score: 0.332 (Grok 3 + Few-shot-CoT).
Prompt variations tested:
1. Zero-shot:
        </p>
        <p>Identify the decontextualized, stand-alone, and verifiable central claim in
the given post: ${post}
2. Zero-shot-CoT: Zero-shot + Let’s think step by step.
3. Few-shot: Four examples from the training set followed by the Zero-shot instruction
4. Few-shot-CoT: Few-shot + Let’s think step by step.</p>
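        <p>Schematically, one round of extraction, feedback, and refinement looks as follows; chat stands in
for any LLM completion call, and the prompt constants refer to the prompts listed in Appendix A.1:</p>
        <p>def self_refine_once(chat, post):
    # 1. Extract an initial claim (zero-shot or few-shot, optionally with CoT).
    claim = chat(EXTRACTION_SYSTEM_PROMPT,
                 f"Identify the decontextualized, stand-alone, and verifiable "
                 f"central claim in the given post: {post}\nLet's think step by step.")
    # 2. The same LLM critiques the claim against the input post.
    feedback = chat(FEEDBACK_PROMPT, f"Post: {post}\nNormalized claim: {claim}")
    # 3. The claim is regenerated in light of the feedback.
    return chat(REFINE_PROMPT,
                f"Post: {post}\nInitial claim: {claim}\nFeedback: {feedback}")</p>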
      </sec>
      <sec id="sec-2-4">
<title>2.5. Keyword-Based Few-Shot Prompting (KBFP) and Self-Refine</title>
        <p>
          This method explores a smart selection of few-shot examples using keyword matching, with or without
a Self-Refine step. All implementations used the LLaMA 3.3 70B model [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
Keyword Few-Shot. The Keyword Few-Shot method selects relevant examples from the training
set by matching keywords found in the target social media post. These examples are then used to
construct a few-shot prompt [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], as sketched below.
        </p>
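        <p>A minimal sketch of the example selection; plain word overlap is an assumption about the
matching heuristic:</p>
        <p>import re

def keywords(text):
    # Lowercased content words longer than three characters.
    return {w for w in re.findall(r"[a-z']+", text.lower()) if len(w) > 3}

def select_examples(post, train_set, k=4):
    # Rank training (post, claim) pairs by keyword overlap with the target post.
    kw = keywords(post)
    ranked = sorted(train_set,
                    key=lambda ex: len(kw.intersection(keywords(ex["post"]))),
                    reverse=True)
    return ranked[:k]</p>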
        <p>For the example post:
The salary of a U.S. Senator is $174,000 per year. This is Joe Biden’s house...
seems legit :)
The method extracts a claim such as:
The main claim is that Joe Biden’s house appears to be too expensive for him to
afford on a U.S. Senator’s salary of $174,000 per year, implying that there may
be some other, potentially questionable, source of income.</p>
        <p>
          Self-Refine. As an additional step, we apply one iteration of the Self-Refine procedure [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], using the
following prompt:
        </p>
        <p>Refine the following claim to make it more precise.</p>
        <p>Here is the text: ${the claim}
Output only the refined claim and nothing else.</p>
        <p>While the base Keyword Few-Shot method yields a higher METEOR score on average, we observe
that the addition of Self-Refine often produces more concise and less redundant claims.</p>
        <p>The refined version of the above claim is:
Joe Biden’s house appears to be too expensive to be affordable solely based on
his U.S. Senator’s salary of $174,000 per year, suggesting that there may be an
additional, unreported, or unexplained source of wealth that contributed to its
purchase or maintenance.</p>
        <p>We find that repeated applications of Self-Refine do not improve the results. On the contrary, multiple
iterations tend to introduce verbosity and hallucinated facts not grounded in the original post.
Issues with the Gold Claim. It’s worth noting that the gold-standard claims themselves can have
shortcomings. For instance, the gold claim for the example post is:</p>
        <p>Joe Biden lives in a large estate bought on a senator’s salary.</p>
        <p>This omits key details like the senator’s actual salary ($174,000), which may be necessary for
verification, and doesn’t reflect the original post’s implication that the estate seems unaffordable based
on that salary alone.</p>
      </sec>
      <sec id="sec-2-5">
<title>2.6. Subclaim Extraction and Filtering with Refinement</title>
        <p>
          We explore a multi-stage approach that begins by extracting several potential claims, so-called
“subclaims”, from each social media post. These are then scored [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and filtered before a final synthesis step
generates the main predicted claim.
        </p>
        <p>
          In the first stage, the LLaMA 3.3 70B model [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] is prompted to extract multiple subclaims from the
post. These subclaims represent possible interpretations or factual assertions implied by the content.
        </p>
        <p>Next, we introduce a filtering stage. Rather than passing all subclaims to the claim synthesis step,
we rank them using a self-assigned importance score (1 to 10) and retain only those scoring 7 or higher,
as sketched below. This limits noise and reduces the cognitive load on the synthesis LLM.</p>
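        <p>The filtering stage itself reduces to a few lines, assuming the importance score has already been
parsed from the model output:</p>
        <p>def filter_subclaims(scored_subclaims, threshold=7):
    # scored_subclaims: list of (subclaim_text, importance_score) pairs
    return [claim for claim, score in scored_subclaims if score >= threshold]</p>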
        <p>Despite these refinements, the METEOR score did not improve significantly. To address this, we added
a third step: post-synthesis revision. A final LLM call revisits the synthesized claim, comparing it with
the original post. It performs a "quality check" focused on factual accuracy, emphasis, and eliminating
redundancy or verbosity. The prompt in this stage explicitly instructs the model to consolidate language
while preserving core meaning.</p>
        <p>This three-stage pipeline—subclaim extraction, importance-based filtering, and post-synthesis
refinement—aims to balance comprehensiveness with clarity and precision.</p>
      </sec>
      <sec id="sec-2-6">
        <title>2.7. Max Multi-Prompt</title>
        <p>
          We observed that many social media posts are comments on images found online, while the claims in
our dataset often describe those images. If we use a generic prompt, the extracted content won’t align
well with the dataset’s gold claims. To address this, we designed a prompt that instructs the model to
imagine searching for the referenced image online and then describe its likely content. A similar idea
has also been reported by Perez et al. [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
        </p>
        <p>We also noted that many posts rely on metaphor or sarcasm, often targeting the government. For
instance, when a user writes, “Biden’s annual salary is only $170K,” the implication—delivered with
irony—is that Biden must be supplementing his income through questionable means to afford a luxury
home. Similarly, posts about epidemics often question vaccine policies sarcastically, implicitly accusing
public health authorities of negligence or malice.</p>
        <p>
          To account for these nuances, we created targeted prompts tailored to each type of rhetorical device.
Empirical results were obtained using the LLaMA 3.3 70B model [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ].
        </p>
        <p>This approach demonstrates the potential benefits of intelligently triaging between multiple prompt
templates. To simulate an upper bound on this strategy, we applied several different prompts to the
same social media post and selected the resulting claim with the highest METEOR score, as sketched below.</p>
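        <p>A sketch of the upper-bound selection using the NLTK METEOR implementation; the per-prompt
candidate claims are assumed to be computed elsewhere:</p>
        <p>from nltk import word_tokenize
from nltk.translate.meteor_score import meteor_score

def best_claim(candidates, gold_claim):
    # Keep the candidate with the highest METEOR against the gold claim.
    gold = word_tokenize(gold_claim)
    return max(candidates, key=lambda c: meteor_score([gold], word_tokenize(c)))</p>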
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluation Approach: Baseline</title>
      <sec id="sec-3-1">
        <title>3.1. Regurgitation Baseline</title>
        <p>
          To evaluate the significance of our METEOR scores, we designed a “frustratingly easy” [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] baseline
that simply reuses the original social media post or a truncated version as a stand-in for actual claim
extraction. Surprisingly, this sets a strong reference point for METEOR performance.
        </p>
        <p>We explored the following variations:
• Full social media post
• Truncating after the first 100 characters, omitting partial words at the end (see the sketch after this list)
• Using only the nouns and verbs from the post</p>
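        <p>The truncation variant is only a few lines:</p>
        <p>def truncate_post(post, limit=100):
    # Keep the first 100 characters, dropping a trailing partial word.
    if len(post) > limit:
        cut = post[:limit]
        post = cut.rsplit(" ", 1)[0] if " " in cut else cut
    return post</p>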
        <p>On the validation set, using the full post yields a METEOR score of 0.19. Truncating the post leads to
improved results, with the best configuration achieving 0.24. This baseline scored 0.23 on the test set.</p>
        <p>By contrast, using only the nouns and verbs significantly reduces METEOR performance.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experimental Evaluation</title>
      <sec id="sec-4-1">
        <title>4.1. Setup</title>
        <p>We used only the datasets provided by the CheckThat! Lab Task 2 organizers, focusing exclusively on
the English-language subset.</p>
      <p>Several methods relied on LLMs without additional training, using in-context learning only. In cases
where few-shot examples were used, they were selected from the training set. Table 1 lists which LLM
was used in each method that contributed to our empirical results.</p>
      <p>For methods involving fine-tuning, all available training data was used.</p>
      <p>All approaches were evaluated on the validation subset using the METEOR metric, as implemented
in the NLTK toolkit. While we submitted the methods with the highest METEOR scores, we were also
interested in exploring methods along the following criteria:</p>
      <p>• Validation METEOR score
• Novelty and insightfulness of the approach
• Our subjective preference for the style of the extracted claims</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Evaluation Results on Validation and Test Set</title>
      <p>Discussion of Results. The best-performing method was fine-tuning Flan-T5-Large (Section 2.1),
which achieved a METEOR score of 0.5569. This result highlights that larger models do not always
guarantee better outcomes—Flan-T5-Large struck the best balance in our experiments. On the test set,
this is still our best approach despite the performance drop.</p>
      <p>[Table: validation and test METEOR results per method and model — fine-tuned FLAN-T5-Large,
Flan-T5 Base (LoRA), and DeepSeek-R1-Distill-Llama-8b; claimifying social media posts with
self-refinement; Self-Refine with KBFP; subclaim extraction and filtering with refinement; Max
Multi-Prompt / single prompt score.]</p>
      <p>The runner-up is a prompting-based method based on Claimify with Self-Refine, described in Section
2.4. It achieved a METEOR score of 0.331 on the validation set, which translated to 0.33 on the official
test set.</p>
      <p>The third-best was Flan-T5 Base fine-tuned using LoRA. Although it achieved a lower score of 0.3054,
its advantage lies in being a smaller and more efficient model. On the test set, it still obtains a reasonable
0.28 in METEOR.</p>
      <p>The fourth method was Keyword-Based Few-Shot Prompting (KBFP), which used few-shot examples
with LLaMA 3.3 70B. We did not submit this method to the challenge.</p>
      <p>Although fine-tuning methods obtained the highest METEOR scores, we believe that somewhat better
claims can be extracted with other approaches.</p>
      <p>Subjectively, as we discuss below in Section 5.2, the most useful claims were generated by one
iteration of Self-Refine combined with KBFP or Claimify, particularly for the Joe Biden house example.
This approach correctly highlighted the assertion that the house seemed too expensive for his reported
salary. In contrast, other methods either focused only on stating the salary amount ($174,000) or made
vague claims about the size of the house. Several outputs hallucinated or speculated about visual content
in the image, which was not part of the dataset.</p>
      <p>We also observed that the gold claim for this example was not ideal: it omitted both the assertion
about affordability and the actual salary figure—both of which are critical for verifying the claim.</p>
      <sec id="sec-4-1">
        <title>5.1. Overall Leaderboard</title>
        <p>Our best method, described in Section 2.1, placed us 9th on the leaderboard (Table 3). Notably, rank 12
was occupied by a test submission of the method from Section 2.4. Our naive baseline outperformed
the final two teams in the rankings.</p>
      </sec>
      <sec id="sec-4-2">
        <title>5.2. Examples of Claims Extracted on Validation Set</title>
        <p>We manually reviewed extracted claims for the first few validation instances. We found considerable
variation in claim styles: some were more actionable for fact-checking, while others focused more on
the rhetorical tone or the poster’s motivation.</p>
        <p>Although these differences significantly impact utility for human fact-checkers, they are not well
captured by the METEOR metric, which is limited to overlap with gold-standard claims.
Fine-tuned Flan-T5-Large. METEOR score: 0.5569, described in Section 2.1.</p>
        <p>• Extracted Claim: Joe Biden owns the house in this photo.
• Extracted Claim: A Holocaust story told in the New York Times
• Extracted Claim: This leopard cub's mother was killed by a trophy hunter
• Extracted Claim: Video shows a crocodile spotted in a residential area in Hyderabad during the
ongoing heavy rains.</p>
        <p>This method was able to closely mimic the patterns of the training dataset in some samples but still
failed in other instances. It often does not produce proper claims, as it focuses solely on optimizing the
METEOR score.</p>
        <p>Flan-T5 Base (LoRA, fine-tuned). METEOR score: 0.3054, described in Section 2.2.</p>
        <p>• Extracted Claim: Joe Biden’s house is a fake
• Extracted Claim: Jewish boy adopted by US Jewish family
• Extracted Claim: Video of a crocodile in Hyderabad</p>
        <p>This model extracts short, concise claims that are easy to verify. However, the claims are often overly
literal and fail to capture the deeper meaning or intent behind the social media post.</p>
        <sec id="sec-4-2-1">
          <title>Fine-tuned DeepSeek-R1-Distill-Lama-8b</title>
          <p>METEOR score: 0.2541, described in Section 2.3.</p>
          <p>The essential primary claim is that a U.S. Senator earns $174,000 per year,
and it is Joe Biden's house.</p>
          <p>The Karnofsky family adopted a 7-year-old boy into their home, providing him
with food through his homework until he was 12, when they helped him buy his
first instrument, showcasing their support for his musical talent.</p>
          <p>The essential primary claim is that trophy hunting is horrific.</p>
          <p>The essential primary claim is that none of the listed items (Magarmacch,
Heavy Rain, Hyderabad, Crocodile, Alert) are the main focus.</p>
          <p>The administration is now blaming the victims of today’s deadly attacks in
Kabul for not leaving earlier.</p>
          <p>The extracted claims are generally suitable for fact-checking and, in several cases, elaborate on the
underlying message of the post.</p>
          <p>We observed that the trained model often prefixes responses with phrases like “the essential primary
claim is...” This model tends to extract short, concise claims that are easy to verify. However, these
claims are frequently overly literal and fail to capture the deeper meaning or intent behind the social
media post.</p>
          <p>Claimifying social media posts with self-refinement. METEOR score: 0.332, described in Section 2.4.
• Extracted Claim: US Senator’s annual salary is $174,000
• Extracted Claim: A Lithuanian Jewish family employed a 7-year-old boy until he was 12 and gave
him money to buy his first instrument.
• Extracted Claim: Rescued animal’s mother was killed by a hunter.
• Extracted Claim: Crocodile sighted in Hyderabad during heavy rain.</p>
          <p>The generated claims are semantically closer to the gold-standard claims. In a separate analysis, we
find that the generated claims achieve an average BERTScore (F1 mean) of 0.82 against the original input
text, indicating that they remain true to both the content and context of the topic, despite
achieving a lower METEOR score.</p>
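          <p>A sketch of this analysis with the bert-score package, assuming parallel lists of generated claims
and original posts:</p>
          <p>from bert_score import score

P, R, F1 = score(generated_claims, original_posts, lang="en")
print(f"Average BERTScore F1: {F1.mean():.2f}")  # ~0.82 in our analysis</p>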
        </sec>
        <sec id="sec-4-2-2">
          <title>Max multi-prompt.</title>
          <p>METEOR score: (up to) 0.3277, described in Section 2.7.
• Extracted Claim: The annual salary of a U.S. Senator is $174,000.
• Extracted Claim: Owning such a home on that salary doesn’t add up. This is a tongue-in-cheek
critique of perceived wealth versus official pay.</p>
          <p>While the first prompt yields a claim that is objectively fact-checkable, we believe the second prompt
better captures the motivation behind the social media post.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Overall Conclusions and Main Findings</title>
      <p>We found that most LLMs produced claims related to the content of the social media posts. However,
these often diverged from the gold-standard claims, which negatively impacted their METEOR scores.
When averaged across 1,170 validation examples, most of our methods converged on a METEOR score of
approximately 0.27. The differences between methods only became apparent through manual inspection
and checking for coverage of all key claim elements.</p>
      <p>We also observed that many gold claims failed to capture all the critical assertions. In longer social
media posts, multiple check-worthy claims were often present, making it difficult, even for human
judges, to determine the primary claim without additional context about the user’s information needs.</p>
      <p>Several gold claims referenced images linked in the social media posts. Since these images were not
included in the dataset, any claims based on them were speculative.</p>
      <p>With sufficient fine-tuning, the smaller FLAN-T5 model was able to approximate the method used to
extract gold-standard claims.</p>
      <p>We found that multiple iterations of self-refinement often led LLMs to hallucinate or produce overly
verbose responses. This pattern was consistent across a wide range of models, including LLaMA,
GPT-4.1-nano, Gemini 2.0 Flash, and Grok 3. In some cases, these models returned the same claim without
any improvements after a few iterations, which can be attributed to the rigid criteria imposed by the
prompt used for both extraction and feedback.</p>
      <p>In many cases, the baseline outputs without self-refinement achieved higher METEOR scores and
produced claims that more accurately reflected the content of the original post.</p>
      <p>We also noted that instruction following was inconsistent. For example, directives such as
omitting phrases like "The main claim is..." were often ignored, particularly by LLaMA models. Using
structured outputs, such as JSON or Pydantic formats, improved adherence but frequently resulted in
truncated outputs that were no longer valid JSON, as illustrated below.</p>
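      <p>A minimal sketch of such a structured-output schema (the field name is illustrative); a truncated
model response fails JSON validation at the parsing step:</p>
      <p>from pydantic import BaseModel

class ExtractedClaim(BaseModel):
    claim: str

def parse_claim(raw_output):
    # Raises pydantic.ValidationError on truncated or malformed JSON.
    return ExtractedClaim.model_validate_json(raw_output).claim</p>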
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgment and Declaration on Generative AI</title>
      <p>This work was conducted in part by participants of the CS881 graduate course “Data Science for
Knowledge Graph and Text” at the University of New Hampshire, as well as by students in the
Computing Research Association’s UR2PhD program. We gratefully acknowledge their contributions and
enthusiasm throughout the research process.</p>
      <p>This material is based upon work supported by the National Science Foundation under Grant No.
1846017. Any opinions, findings, conclusions, or recommendations expressed in this material are those
of the authors and do not necessarily reflect the views of the National Science Foundation.</p>
      <p>During the preparation of this work, the author(s) used ChatGPT and Grammarly in order to: check
grammar and spelling, and paraphrase and reword. After using these tools/services, the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Appendix</title>
      <p>A.1. Prompts used in Claimifying social media posts with Self-refinement</p>
      <sec id="sec-6-1">
<title>Claim Extraction System Prompt: Identity / System Message</title>
        <p>You are a helpful AI assistant and an expert in claim detection, extraction, and
normalization.</p>
      </sec>
      <sec id="sec-6-2">
        <title>Instructions</title>
        <p>You are given a noisy social media post that contains only text but it might
have been posted alongside a photo or video on the platform it was extracted
from.
* Your task is to detect, extract, and respond with a normalized claim.
* A claim is a statement or assertion that can be objectively verified as true or
false based on empirical evidence or reality.
* Follow the below steps to analyze the input text and arrive at the final
response.
* Step 1: Sentence Splitting and Context Creation
* Start by splitting the post into individual sentences.
* Now create or retrieve a context for each of those sentences by looking at
the two preceding and two following sentences.
* Step 2: Selection
* Now determine if each sentence contains any verifiable information based
on the context created for the sentence in the previous step.
* For each sentence, do the following:
* If the sentence does not contain any verifiable information, discard
that sentence.
* If the sentence contains both verifiable and unverifiable information,
rewrite the sentence retaining only verifiable information.
* If the sentence does not contain any unverifiable information, return
the original sentence.
* Step 3: Disambiguation
* Use only the words found in the original input text when generating a response.
* The claim must be strictly extracted from the input without adding any inferred
or assumed context.
* The claim should be a concise single sentence (up to a maximum of 25 words)
that captures the main point of the post without any additional context or
details. Prioritize the main claim if multiple claims are present.
* The claim should be a self-contained factual statement that can be verified. It
should not contain any subjective opinions, speculations, or interpretations.
* Pay attention to negative sentiment, named entities, names of people, and
linguistic features like assertions, hedges, implications, etc. If any one
of these features is present in the post, it should be reflected in the claim.
* Do not include any additional information or explanations in your response.
* Minor clarifications (e.g., implied agent) are allowed if they are logically
unavoidable and directly inferable from the input.
* If the input text contains any Named Entities, they must be included in your
responses.
* Return your response in the style of a short caption or headline of a news
bulletin.
* If the given input text is most likely to be referencing or directly talking
about or posted alongside a photo or video, return the response that starts
with either (1) Photo shows &lt;your_response&gt; or (2) Video shows &lt;your_response&gt;.
* Always return your response in English even if the original input is in a
different language.</p>
      </sec>
      <sec id="sec-6-2b">
        <title>Claim Extraction User Prompt</title>
        <p>
# Here are some examples of how to identify a decontextualized, stand-alone,
and verifiable central claim in a post.
&lt;user_query id="example-1"&gt;
**Identify the decontextualized, stand-alone, and verifiable central claim in
the given post:** Lieutenant Retired General Asif Mumtaz appointed as
Chairman Pakistan Medical Commission PMC Lieutenant Retired General Asif
Mumtaz appointed as Chairman Pakistan Medical Commission PMC Lieutenant
Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission
PMC None.</p>
        <p>Let’s think step by step.
&lt;/user_query&gt;
&lt;assistant_response id="example-1"&gt;
**Normalized claim:** Pakistani government appoints former army general to
head medical regulatory body.
&lt;/assistant_response&gt;
--------------------------------------------&lt;user_query id="example-2"&gt;
**Identify the decontextualized, stand-alone, and verifiable central claim in
the given post:** A priceless clip of 1970 of Bruce Lee playing Table Tennis
with his Nan-chak !! His focus on speed A priceless clip of 1970 of Bruce Lee
playing Table Tennis with his Nan-chak !! His focus on speed A priceless clip
of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on
speed None
Let’s think step by step.
&lt;/user_query&gt;
&lt;assistant_response id="example-2"&gt;
**Normalized claim:** Late actor and martial artist Bruce Lee playing table
tennis with a set of nunchucks.
&lt;/assistant_response&gt;
--------------------------------------------&lt;user_query id="example-3"&gt;
**Identify the decontextualized, stand-alone, and verifiable central claim in
the given post:** Hydrate YOURSELF W After Waking Up Water 30 min Before a Meal
DRINK Before Taking a Shower 2192 2192 Before Going to Bed at the correct time
T A YE Helps activate internal organs Helps digestion Helps lower blood
pressure. Helps to avoid heart attack Health+ by Punjab Kesari.</p>
        <p>Let’s think step by step.
&lt;/user_query&gt;
&lt;assistant_response id="example-3"&gt;
**Normalized claim:** Drinking water at specific times can have different
health benefits
&lt;/assistant_response&gt;
--------------------------------------------&lt;user_query id="example-4"&gt;
**Identify the decontextualized, stand-alone, and verifiable central claim in
the given post:** Eating vaginal fluids makes you immune to cancer, and other
diseases. Do it for health. Scientists at St. Austin University in North
Carolina, they investigated the benefits of vaginal or cervical mucus
consumption and the results were amazing. These fluids contain high levels of
active proteins up to 10 minutes after leaving the female body. The vaginal
fluid is rich in protein, sodium, vitamins like C1, C4, C4, vc and others.
This study confirms what was exposed by Dr. John d. Moore in his 2009 study of
the "equivalent exchange" theory, which indicates that women and men benefit in
the same way. The benefits of "swallowing" vaginal fluids are:
1. **Eliminates buttons and buttons**
2. **Stimulates the electrical charges of the cells**
3. **Prevents prostate cancer.**
4. **Improved digestion.**
5. **Very effective against constipation.**
6. **It makes teeth and bones stronger.**
7. **Helps the functioning of the kidneys Share men! Everything is for your health!</p>
        <p>Share it on all social networks.**
Let’s think step by step.
&lt;/user_query&gt;
&lt;assistant_response id="example-4"&gt;
**Normalized claim:** St. Austin University North Carolina says eating vaginal
fluid makes you immune to cancer
&lt;/assistant_response&gt;
--------------------------------------------&lt;user_query id="example-5"&gt;
**Identify the decontextualized, stand-alone, and verifiable central claim in
the given post:** Corona virus before it reaches the lungs it remains in the
throat for four days drinking water a lot and gargling with warm water &amp;
salt or vinegar eliminates the virus $\ldots$
Let’s think step by step.
&lt;/user_query&gt;
&lt;assistant_response id="example-5"&gt;
**Normalized claim:** Gargling water can protect against coronavirus
&lt;/assistant_response&gt;
--------------------------------------------
&lt;|User|&gt;: Identify the decontextualized, stand-alone, and verifiable central
claim in the given post: {input} &lt;|End_user|&gt;</p>
        <p>Let’s think step by step.</p>
      </sec>
      <sec id="sec-6-3">
        <title>Feedback Generation Prompt</title>
      </sec>
      <sec id="sec-6-4">
        <title>Identify / System Message</title>
      </sec>
      <sec id="sec-6-5">
        <title>Instructions</title>
        <p>You are a professional fact-checker and an expert in claim normalization.</p>
        <p>Your task is to provide detailed, constructive feedback on the generated
response based on the criteria provided to ensure that the normalized claims
are not only consistent with the original post, but are also self-contained and
verifiable.</p>
        <p>We want to iteratively improve the above generated response. To help with this,
please score the response on the following criteria using a 0-10 scale, and
provide a brief justification for each score:
1. **Verifiability:** To what extent does the response contain claims that can
be independently verified using reliable sources? (0 = not verifiable,
10 = fully verifiable)
2. **Likelihood of Being False:** How likely is it that the response contains
false or misleading information? (0 = very unlikely, 10 = very likely)
3. **Public Interest:** How likely is the response to be of general public
interest or relevance? (0 = not interesting, 10 = highly interesting)
4. **Potential Harm:** How likely is the response to be harmful, offensive, or
cause negative consequences? (0 = not harmful, 10 = extremely harmful)
5. **Check-Worthiness:** How important is it to fact-check this response?
(0 = not worth fact-checking, 10 = highly worth fact-checking)</p>
        <p>Optionally, suggest specific improvements to the response based on your
evaluation.</p>
        <p>Response/Normalized Claim: ${Extracted Claim}</p>
      </sec>
      <sec id="sec-6-6">
        <title>Refined Claim Generation Prompt</title>
      </sec>
      <sec id="sec-6-7">
        <title>Identity / System Message</title>
        <p>You are a professional fact-checker and expert in claim normalization.</p>
        <p>Instructions:
* Your task is to refine the generated response in light of the feedback
provided.
* Using the feedback provided, return a refined version of the generated
response, ensuring that the normalized claim is consistent with the original
post, self-contained, and verifiable.
* Your response must only be based on the feedback provided.
* Do not speculate, provide subjective opinions, or add any additional
information or explanations.
* Only include the refined, normalized claim in your response.
* If no meaningful refinement is necessary, re-output the original normalized
claim as-is.
* If the response is not decontextualized, stand-alone, and verifiable,
improve the response by adding more context from the original post if needed.
&lt;|user_query|&gt;{original user query}&lt;|end_of_user_query|&gt;
&lt;|assistant_response|&gt;{Initial Claim}&lt;|end_of_assistant_response|&gt;
&lt;|feedback|&gt;{feedback}&lt;|end_of_feedback|&gt;
&lt;|instruction|&gt;Based on the feedback provided, please refine the above
generated response/normalized claim.&lt;|end_of_instruction|&gt;</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2025 CheckThat! lab task 2 on claim normalization</article-title>
          , in: G. Faggioli,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , D. Spina (Eds.), Working Notes of CLEF 2025 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          ,
          <string-name>
            <surname>CLEF</surname>
          </string-name>
          <year>2025</year>
          , Madrid, Spain,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Thorne</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Christodoulopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mittal</surname>
          </string-name>
          ,
          <article-title>FEVER: a large-scale dataset for fact extraction and verification</article-title>
          ,
          <source>in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL)</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>809</fpage>
          -
          <lpage>819</lpage>
          . URL: https://aclanthology.org/N18-1079.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Webson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suzgun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chowdhery</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petrov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q. V.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <article-title>Scaling instruction-finetuned language models</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2210.11416.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-rank adaptation of large language models</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2106.09685. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <article-title>DeepSeek-R1-Distill-Llama-8B</article-title>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          , et al.,
          <article-title>Self-refine: Iterative refinement with self-feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2303.17651</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kadavath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kundu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kernion</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Goldie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mirhoseini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>McKinnon</surname>
          </string-name>
          , et al.,
          <article-title>Training a helpful and harmless assistant with reinforcement learning from human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2204.05862</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Metropolitansky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Larson</surname>
          </string-name>
          ,
          <article-title>Towards effective extraction and evaluation of factual claims</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.10855. arXiv:2502.10855.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Madaan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tandon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hallinan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wiegrefe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Alon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Dziri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Prabhumoye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. P.</given-names>
            <surname>Majumder</surname>
          </string-name>
          , K. Hermann,
          <string-name>
            <given-names>S.</given-names>
            <surname>Welleck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yazdanbakhsh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <article-title>Self-refine: Iterative refinement with self-feedback</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2303.17651. arXiv:2303.17651.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <year>2023</year>
          . URL: https://arxiv.org/abs/2205.11916. arXiv:2205.11916.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <article-title>GPT-4.1 nano</article-title>
          ,
          <year>2025</year>
          . URL: https://platform.openai.com/docs/models/gpt-4.1-nano.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <surname>Google-DeepMind</surname>
          </string-name>
          ,
          <source>Gemini 2.0 flash</source>
          ,
          <year>2024</year>
          . URL: https://blog.google/technology/google-deepmind/google-gemini-ai-update-december-2024/.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>Meta</surname>
          </string-name>
          ,
          <source>Llama 3.3 70b instruct</source>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14] xAI,
          <article-title>Grok 3 beta - the age of reasoning agents</article-title>
          ,
          <year>2025</year>
          . URL: https://x.ai/news/grok-3.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          , et al.,
          <article-title>What makes good in-context examples for gpt-3?</article-title>
          , in: Findings of EMNLP,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Cho</surname>
          </string-name>
          ,
          <article-title>True few-shot learning with language models</article-title>
          ,
          <source>in: Advances in Neural Information Processing Systems</source>
          <volume>34</volume>
          (NeurIPS
          <year>2021</year>
          ),
          <year>2021</year>
          , pp.
          <fpage>11054</fpage>
          -
          <lpage>11070</lpage>
          . URL: https://arxiv.org/abs/2105.11447.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>H.</given-names>
            <surname>Daumé</surname>
          </string-name>
          <string-name>
            <surname>III</surname>
          </string-name>
          ,
          <article-title>Frustratingly easy domain adaptation</article-title>
          ,
          <source>in: Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics</source>
          ,
          <year>2007</year>
          , pp.
          <fpage>256</fpage>
          -
          <lpage>263</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          3. **Public Interest:
          <article-title>** How likely is the response to be of general public interest or relevance? (0 = not interesting</article-title>
          ,
          <volume>10</volume>
          = highly interesting)
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          4. **Potential Harm:
          <article-title>** How likely is the response to be harmful, offensive, or cause negative consequences? (0 = not harmful</article-title>
          ,
          <volume>10</volume>
          = extremely harmful)
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          5. **Check-Worthiness:
          <article-title>** How important is it to fact-check this response? (0 = not worth fact-checking, 10 = highly worth fact-checking)</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>