<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>dfkinit2b at CheckThat! 2025: Leveraging LLMs and Ensemble of Methods for Multilingual Claim Normalization</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Tatiana Anikina</string-name>
          <email>tatiana.anikina@dfki.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ivan Vykopal</string-name>
          <email>ivan.vykopal@kinit.sk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastian Kula</string-name>
          <email>sebastian.kula@kinit.sk</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ravi Kiran Chikkala</string-name>
          <xref ref-type="aff" rid="aff4">4</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Natalia Skachkova</string-name>
          <email>natalia.skachkova@dfki.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jing Yang</string-name>
          <email>jing.yang@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Veronika Solopova</string-name>
          <email>veronika.solopova@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vera Schmitt</string-name>
          <email>vera.schmitt@tu-berlin.de</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff5">5</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simon Ostermann</string-name>
          <email>simon.ostermann@dfki.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Centre for European Research in Trusted AI</institution>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Faculty of Information Technology, Brno University of Technology</institution>
          ,
          <addr-line>Brno</addr-line>
          ,
          <country country="CZ">Czech Republic</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>German Research Center for Artificial Intelligence, Saarland Informatics Campus</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Kempelen Institute of Intelligent Technologies</institution>
          ,
          <addr-line>Bratislava</addr-line>
          ,
          <country country="SK">Slovakia</country>
        </aff>
        <aff id="aff4">
          <label>4</label>
          <institution>Saarland University</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff5">
          <label>5</label>
          <institution>Technische Universität Berlin</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>The rapid spread of misinformation on social media across languages presents a major challenge for fact-checking efforts. Social media posts are often noisy, informal, and unstructured, with irrelevant content, making it difficult to extract concise, verifiable claims. To address this, the CLEF 2025 CheckThat! Shared Task on Multilingual Claim Extraction and Normalization focuses on transforming social media posts into normalized claims: short, clear, check-worthy statements that capture the essence of potentially misleading content. In this paper, we investigate several approaches to this task, including parameter-efficient fine-tuning, prompting large language models (LLMs), and an ensemble of methods. We evaluate our approaches in two settings: monolingual, where we are provided with training and validation data, and zero-shot, where no training data is available for the target language. Our approaches achieved first place in 6 out of 13 languages in the monolingual setting and ranked second or third in the remaining languages. In the zero-shot setting, we achieved the highest performance across all seven languages, demonstrating strong generalization to unseen languages.</p>
      </abstract>
      <kwd-group>
<kwd>Fact-Checking</kwd>
        <kwd>Claim Normalization</kwd>
        <kwd>Claim Extraction</kwd>
        <kwd>Multilingual NLP</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The proliferation of false and misleading information online has emerged as a pressing global concern.
Social media platforms, due to their rapid dissemination and high popularity, have become a fertile
ground for the spread of misinformation. From public health mis- and disinformation to political
propaganda, unverified and often harmful content can quickly gain traction, influencing public opinions
in significant ways. Moreover, misinformation generated by LLMs poses an additional risk to society, as these models are able to generate convincing texts that can be misused to spread mis- and disinformation [
        <xref ref-type="bibr" rid="ref1 ref2 ref29 ref30">1, 2</xref>
        ].
      </p>
<p>In response, automated fact-checking has become a vital tool in the fight against mis- and disinformation. However, an issue arises from the need to extract and represent claims from noisy, informal and contextually ambiguous social media posts. Such posts often lack clarity and use slang or subjective and emotional language, which makes it difficult for automated tools, as well as for fact-checkers, to focus on the most important statements contained within them. This necessitates an intermediate step – claim normalization – where unstructured and noisy social media posts are transformed into clear, concise, and verifiable claims. This process is crucial for extracting meaningful information from unstructured and cluttered posts, enabling more accurate and scalable fact-checking.</p>
<p>
        The global nature of false information highlights the importance of developing methods that are robust across languages. Deploying a unified approach for content moderation in multiple languages is not only more cost-effective, particularly for media organizations and journalists with limited computational resources, but also facilitates the identification and matching of related claims across different countries. In addition, tools that are limited to a single language are insufficient for addressing the full scale of false information, making multilingual claim normalization essential for comprehensive fact-checking. To address these challenges, the CLEF 2025 Shared Task on Multilingual Claim Extraction and Normalization [
        <xref ref-type="bibr" rid="ref31 ref32 ref33">3, 4, 5</xref>
        ] focuses on simplifying and restructuring social media content by generating normalized claims. For instance, below is an example of a short social media post with the corresponding normalized claim:
      </p>
      <p>Post: "A 40-ton truck lifted by 2,000 drones https://t.co/lyBi5JNJ7X A 40-ton truck lifted by 2,000
drones https://t.co/lyBi5JNJ7X A 40-ton truck lifted by 2,000 drones https://t.co/lyBi5JNJ7X
None."</p>
      <p>Normalized Claim: "Thousands of drones lift a truck."</p>
      <p>The shared task is organized into two settings: monolingual and zero-shot. The monolingual setting
covers 13 languages, including both high and low-resource ones: English, German, French, Spanish,
Portuguese, Hindi, Marathi, Punjabi, Tamil, Arabic, Thai, Indonesian, and Polish. This setting contains
training, development and test data and thus enables model fine-tuning and language-specific evaluation
when models are trained and tested on the data in the same language. Zero-shot is a more challenging
setting that includes only the test data in 7 unseen languages — Dutch, Romanian, Bengali, Telugu,
Korean, Greek, and Czech. The goal of this setting is to assess the generalization capabilities of LLMs
without any language-specific training data.</p>
<p>We address the shared task by exploring various multilingual LLM-based approaches: zero-shot and few-shot prompting, LoRA adapters, and ensembling methods.2 Based on the experimental results and our submissions to the shared task, we found that the best-performing approach largely depends on the language, the multilingual support of the LLM, and the amount of data available for fine-tuning and few-shot prompting. In the zero-shot setting, the best scores were achieved either by prompting the large multilingual Gemma3 27B model or by using an ensemble of methods, as described in Section 3.2.4, which combines the outputs of different approaches by selecting the most representative samples. In the monolingual setting, the best scores were obtained with adapter-based fine-tuning (for 4 languages), few-shot prompting (3 languages), or ensembling (6 languages). The ensemble method proved to be an overall very successful strategy for selecting the most appropriate normalized claims in our experiments.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Multilingual Fact-Checking. Fact-checking is a multi-step process, typically involving claim
detection, claim-matching, evidence retrieval and claim verification [ 6]. In multilingual contexts, the
pipeline faces additional challenges due to the linguistic diversity and varying resource availability
across languages. Previous work aimed to address this issue by extending fact-checking datasets beyond English with additional languages. Chang et al. [7] introduced a multilingual version of the FEVER dataset [8], constructed via machine translation into five additional languages. Other popular multilingual datasets include X-Fact [9] and MultiClaim [10], which focus on more diverse languages, including low-resource ones.</p>
      <sec id="sec-2-1">
        <title>2Our code is available at: https://github.com/tanikina/clef2-normalization</title>
        <p>
Existing research on multilingual approaches has mostly focused on two directions: (1) translating data into English and using monolingual models [11]; or (2) directly applying multilingual models to the data, whether by fine-tuning or by developing novel approaches for multilingual fact-checking [ 12, 13]. Recent studies have explored the use of LLMs in multilingual fact-checking. Singhal et al. [14] evaluated the multilingual capabilities of LLMs across five diverse languages using various techniques. However, challenges remain, and performance on low-resource languages is still suboptimal [
          <xref ref-type="bibr" rid="ref3">15</xref>
          ].
Verified Claim Retrieval. Verified claim retrieval, also known as claim-matching [
          <xref ref-type="bibr" rid="ref4">16</xref>
          ] or previously
fact-checked claim retrieval [10], is one of the important tasks within the fact-checking process [
          <xref ref-type="bibr" rid="ref5">17</xref>
          ].
While the primary goal of verified claim retrieval is to determine whether a given claim has already
been fact-checked based on a set of previously verified claims, there are also auxiliary tasks designed to
enhance the performance on this task [
          <xref ref-type="bibr" rid="ref6">18</xref>
          ].
        </p>
        <p>
Since the spread of false information is a global phenomenon, it is necessary to check fact-checked claims across languages, not only in English. Therefore, the first multilingual datasets for claim-matching were developed [
          <xref ref-type="bibr" rid="ref4 ref7">16, 19</xref>
          ]. Pikuliak et al. [10] introduced the largest multilingual dataset, which
includes fact-checks in 39 languages and social media posts in 27 languages.
        </p>
        <p>
          The most common approach for verified claim retrieval includes using text embedding models
(TEMs) [
          <xref ref-type="bibr" rid="ref8 ref9">20, 10, 21</xref>
          ] or BM25 [
          <xref ref-type="bibr" rid="ref10">22, 10</xref>
          ] for the identification of similar claims based on a given input.
However, since the multilingual datasets mostly contain social media posts, the retrieval phase faces several challenges. One of the main problems is that some social media posts are long, especially those from Facebook, which makes retrieval based on semantic similarity more challenging. Furthermore, social media posts can contain information that is unnecessary for retrieval and fact verification, which can hurt performance on these tasks.
        </p>
        <p>
Claim Normalization. Claim normalization, a task related to verified claim retrieval, aims to transform complex, unstructured and noisy claims or social media posts into concise, standalone and verifiable statements. This process enhances the efficiency of fact-checking by facilitating better verified claim retrieval, evidence retrieval and verification. Sundriyal et al. [
          <xref ref-type="bibr" rid="ref31">3</xref>
          ] defined claim normalization as the task of simplifying the claim made in a social media post into a concise form.
        </p>
        <p>
          Sundriyal et al. [
          <xref ref-type="bibr" rid="ref6">18</xref>
] introduced the claim normalization task, which focuses on decomposing complex and noisy social media posts into more straightforward and understandable forms, termed normalized claims. They proposed CACN, a novel approach that leverages chain-of-thought reasoning and few-shot demonstrations to produce normalized claims. Their experiments demonstrated that CACN outperforms several baselines. However, they limited their experiments to English social media posts and English fact-checking data only.
        </p>
        <p>
          Ni et al. [
          <xref ref-type="bibr" rid="ref11">23</xref>
] addressed challenges in factual claim detection, including inconsistent definitions. In their work, they aimed to standardize the definition of factual claims to avoid misconceptions. The authors defined a factual claim as a statement that contains objectively verifiable facts without subjective opinions. In some of our approaches, we build upon this definition and use it as a characteristic of the normalized claims.
        </p>
        <p>
          In addition, Metropolitansky and Larson [
          <xref ref-type="bibr" rid="ref12">24</xref>
] proposed a framework for evaluating claim extraction in the context of fact-checking. They introduced Claimify, an LLM-based claim extraction method, and demonstrated that it outperforms existing methods under their evaluation framework. While claim normalization and claim extraction are different tasks, both aim to produce concise and verifiable claims: normalization simplifies and clarifies existing claims from a given text, whereas extraction identifies such claims in a broader context and usually decontextualizes them for further verification. Despite the differences, both share the goal of generating clear claims suitable for automated fact-checking.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <p>
          The dataset for the CheckThat 2025 task of extracting and normalizing social media posts includes
20 languages from diverse language families and scripts [
          <xref ref-type="bibr" rid="ref31">3</xref>
]. Table 1 presents the statistics for each language. The task provides data in two settings: monolingual and zero-shot. In the monolingual setting, the data contain all three splits – train, development and test – while in the zero-shot setting, only the test split is provided. Importantly, the shared task data are imbalanced: even when training splits are available, their sizes differ substantially between languages, from 102 samples in Tamil to 11,374 samples in English (see Table 1).
        </p>
        <p>
Data Collection. The data are sourced from the Google Fact-check Explorer API3 and are extracted from the Claim Review Schema4. The Claim Review Schema pairs fact-checked claims with the posts they address through the corresponding fact-check. The data for the task thus consist of pairs of social media posts and fact-checked claims, where the latter serve as the normalized claims for the specific post [
          <xref ref-type="bibr" rid="ref31">3</xref>
          ].
        </p>
        <p>
Data Pre-Processing. We found that for some languages in the monolingual setting, there was a substantial overlap between the samples in the training and development data (see Figure 1 for the claim overlap and Figure 12 in the Appendix for the post overlap). We therefore applied pre-processing and filtered out all exact duplicates, ensuring that the training and development data are non-overlapping. We also found that some posts and claims mix languages, e.g., a post can be in Hindi while its normalized claim is in English. Even when the languages match, some claims in the training data have very low similarity to the corresponding gold posts. This can happen, e.g., when the post refers to an image or video that is not provided together with the textual input, making it impossible for the model to generate a correct claim in such cases. We used SentenceTransformers5 [
          <xref ref-type="bibr" rid="ref13">25</xref>
          ] to measure the similarity between the claims and posts and filtered out all
cases with a similarity score less than 0.05. For language detection, we employed the fasttext-langdetect
library [
          <xref ref-type="bibr" rid="ref14">26</xref>
          ] and discarded the cases where either the post or the gold claim was in English while the
expected target was another language. The statistics regarding the filtered training data can be found
in Table 2.
        </p>
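<p>The de-duplication and similarity filtering described above can be sketched as follows. This is a minimal sketch: encode stands in for the paraphrase-multilingual-MiniLM-L12-v2 sentence encoder, the 0.05 threshold follows the text, and the (post, claim) pair format is an illustrative assumption.</p>

```python
import math

def cosine(u, v):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_pairs(pairs, encode, threshold=0.05):
    """Drop exact duplicate (post, claim) pairs and pairs whose post-claim
    similarity falls below the threshold (near-unrelated pairs)."""
    seen, kept = set(), []
    for post, claim in pairs:
        if (post, claim) in seen:          # exact duplicate
            continue
        seen.add((post, claim))
        if cosine(encode(post), encode(claim)) >= threshold:
            kept.append((post, claim))
    return kept
```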
<p>Moreover, we experimented with additional filtering and normalization methods. We tested on the development set whether we could improve the results by removing excessive punctuation and normalizing hashtags and URLs, i.e., extracting meaningful tokens from them, such as converting #MasksDoNotWork into masks do not work, or https://www.technocracy.news/blaylock-face-masks-pose-serious-risks-to-the-healthy/ into https://www.technocracy.news/ blaylock face masks pose serious risks</p>
        <sec id="sec-3-1-1">
          <title>3https://toolbox.google.com/factcheck/apis 4https://schema.org/ClaimReview 5https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2</title>
          <p>to the healthy. We also tried removing repeated text sequences in posts (see example in Section 1).
However, cleaning the data in this way and using the “normalized posts” for prompting did not result
in any substantial improvement of the final performance. Therefore, we only performed de-duplication
and similarity filtering as described above and did not modify the original posts.</p>
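<p>The hashtag and URL normalization we experimented with can be illustrated as follows; both helper names are ours, and the actual pipeline may tokenize slugs differently:</p>

```python
import re
from urllib.parse import urlparse

def normalize_hashtag(tag):
    # Split a CamelCase hashtag into lower-cased words,
    # e.g. #MasksDoNotWork -> "masks do not work".
    words = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", tag.lstrip("#"))
    return " ".join(w.lower() for w in words)

def normalize_url(url):
    # Keep the site, then expand the last path segment into words.
    parsed = urlparse(url)
    slug = parsed.path.strip("/").split("/")[-1]
    words = [w for w in re.split(r"[-_]", slug) if w]
    return f"{parsed.scheme}://{parsed.netloc}/ " + " ".join(words)
```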
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Experimental Setup</title>
<p>To perform the normalization of the social media posts, we experimented with various strategies and LLMs. Specifically, we focused on model fine-tuning with LoRA adapters and on prompting experiments. To evaluate the performance of the proposed methods, we used the METEOR score. We evaluated the final performance on the development sets of the particular languages in the monolingual setting. In addition, we provide the results on the test sets from the submitted results for both the monolingual and zero-shot settings.</p>
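<p>METEOR rewards unigram matches with an emphasis on recall and penalizes fragmented alignments. Below is a simplified, exact-match-only sketch of the metric; the full METEOR (e.g., nltk.translate.meteor_score) additionally matches stems and synonyms:</p>

```python
def simple_meteor(reference, hypothesis):
    """Exact-match METEOR sketch: harmonic mean of precision and recall
    (recall weighted 9:1) times a fragmentation penalty."""
    ref, hyp = reference.split(), hypothesis.split()
    used, align = set(), []
    for i, tok in enumerate(hyp):          # greedy one-to-one alignment
        for j, r in enumerate(ref):
            if j not in used and tok == r:
                used.add(j)
                align.append((i, j))
                break
    m = len(align)
    if m == 0:
        return 0.0
    precision, recall = m / len(hyp), m / len(ref)
    fmean = 10 * precision * recall / (recall + 9 * precision)
    # Count contiguous matched runs; more chunks => more fragmentation.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(align, align[1:])
        if not (i2 == i1 + 1 and j2 == j1 + 1)
    )
    penalty = 0.5 * (chunks / m) ** 3
    return fmean * (1 - penalty)
```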
<p>In this section, we describe the models used in our experiments (Section 3.2.1), the fine-tuning of selected LLMs (Section 3.2.2) and the prompting experiments with various scenarios (Section 3.2.3).</p>
<p>3.2.1. Models</p>
<p>For our experiments and the proposed methods, we selected multiple LLMs, which are detailed in Table 3. Specifically, we focused on multilingual LLMs with sizes ranging from 8B to 405B parameters and compared their efficiency in generating normalized claims.</p>
<p>In total, we employed 9 LLMs in various experiments, focusing especially on parameter-efficient fine-tuning and prompting. Most of these LLMs were used primarily for prompting experiments across all languages or for experiments specific to the Polish language. Additionally, Gemma3 4B, Gemma3 27B, and Qwen3 14B were fine-tuned using LoRA adapters to further tailor their performance to the claim normalization task.</p>
        <p>3.2.2. Parameter-Efficient Fine-Tuning</p>
        <p>
          For the monolingual setting, we fine-tuned LoRA adapters [
          <xref ref-type="bibr" rid="ref21">33</xref>
          ] for the Qwen3 14B model6 using the Unsloth library7. In addition, we experimented with fine-tuning Gemma3 4B and Gemma3 27B; however, based on the performance on the development set, we chose Qwen3 14B for the shared task submission. We also experimented with both short and verbose task descriptions as additional input to the model and found that the verbose version yields better METEOR scores. This verbose version provides a detailed task description and the definition of the normalized claim with criteria based on [
          <xref ref-type="bibr" rid="ref6">18</xref>
          ]; we used this version for all adapter-based submissions. More details regarding the adapter fine-tuning, including the hyperparameter values, can be found in Appendix B.
        </p>
<p>We also checked whether the generated claim is valid text, because an LLM sometimes generates a long string of repeated characters or tokens. To avoid such nonsensical outputs, we checked whether the output claim contained fewer than three different tokens or fewer than five different characters, and repeated the generation if this was the case. We also set the constraint that the output must not contain http, because this is an indicator that a URL was copied from the post, which typically results in badly normalized claims.</p>
        <p>3.2.3. Prompting Experiments</p>
        <p>In this section, we describe several experiments for the monolingual and zero-shot settings across languages. We divided these experiments into two categories: (1) monolingual and zero-shot experiments, where we experimented with LLMs across all 20 languages within the shared task; and (2) Polish experiments, in which we experimented with LLMs only for the Polish language, including one Polish LLM – Bielik Instruct v2.3.</p>
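<p>The validity check on generated claims described in Section 3.2.2 can be sketched as follows (whitespace tokenization is our simplifying assumption; the thresholds follow the text):</p>

```python
def is_valid_claim(claim):
    """Return False for degenerate outputs that should trigger regeneration."""
    if "http" in claim:                   # a URL was copied from the post
        return False
    if len(set(claim.split())) < 3:       # fewer than three distinct tokens
        return False
    if len(set(claim)) < 5:               # fewer than five distinct characters
        return False
    return True
```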
<p>Furthermore, we performed additional prompting experiments using Direct and Summarization-based normalization techniques for both the monolingual and zero-shot settings across languages; see Section C.3 in the Appendix.</p>
<p>Monolingual and Zero-Shot Experiments. Given that the claim normalization task also includes a zero-shot setting, where training and development data are not available, we experimented with various prompting techniques to address this limitation. Specifically, we experimented with: (1) zero-shot prompting; (2) few-shot prompting with a varied number of demonstrations; (3) translated zero-shot prompting; and (4) translated few-shot prompting. In addition, for few-shot prompting and translated few-shot prompting, we experimented with using filtered and unfiltered data to select demonstrations for the prompt. In our experiments with LLMs, we set do_sample=False to enforce greedy decoding, ensuring deterministic output by always selecting the most probable next token.</p>
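<p>Greedy decoding, as enforced by do_sample=False, always takes the argmax of the next-token distribution, so repeated runs give identical outputs. A toy illustration (next_token_probs stands in for an LLM's softmax over candidate tokens; any transition table used with it is invented for illustration):</p>

```python
def greedy_decode(next_token_probs, prompt, max_new_tokens=10, eos="<eos>"):
    """Deterministically extend `prompt` by repeatedly taking the most
    probable next token, stopping at `eos` or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)   # maps candidate token -> probability
        best = max(probs, key=probs.get)   # argmax: no sampling involved
        if best == eos:
            break
        tokens.append(best)
    return tokens
```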
        <sec id="sec-3-2-1">
          <title>6https://huggingface.co/unsloth/Qwen3-14B 7https://github.com/unslothai/unsloth</title>
          <p>In Zero-Shot prompting, we provide LLMs with the task description and the main characteristics
that the normalized claims should fulfill. In this scenario, we rely on the LLM’s understanding of the
task based on the given instructions in English without any previous examples (see Figure 7). For the
Translated Zero-Shot prompting, we utilized Google Translate for translating the English prompt
into particular languages for both monolingual and zero-shot settings.</p>
<p>The characteristics of the normalized claim can be complex to comprehend, and there is variation across languages in what normalized claims look like. Therefore, we employed Few-Shot prompting, in which we extended the zero-shot prompt by providing demonstrations from the training data, while the instruction is in English (see Figure 8). In the Translated Few-Shot prompting, we translated the instruction into the particular languages, while the demonstrations are kept in the original languages as sampled from the training set.</p>
          <p>
            To select few-shot demonstrations, we utilized the semantic similarity between posts using the
GTE-Multilingual-Base8 [
            <xref ref-type="bibr" rid="ref22">34</xref>
] embedding model, which supports more than 70 languages. We calculated the similarity between the analyzed social media post and the posts contained in the training data and selected the top K as demonstrations. We experimented with prompts containing 1, 2, 5 and 10 demonstrations. For few-shot prompting, we selected the most similar samples across all languages rather than only from the particular language: since there are languages for which we have no training data, we decided to select samples from the combined training set of all languages.
          </p>
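<p>The demonstration selection step can be sketched as follows, with encode standing in for the gte-multilingual-base embedding model:</p>

```python
import math

def top_k_demonstrations(post, train_pairs, encode, k=5):
    """Rank (post, claim) training pairs by cosine similarity of their posts
    to the input post and return the top K as few-shot demonstrations."""
    def cos(u, v):
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0
    query = encode(post)
    ranked = sorted(train_pairs, key=lambda pair: cos(query, encode(pair[0])), reverse=True)
    return ranked[:k]
```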
<p>Few-shot experiments were done in two variants: filtered and unfiltered. In the filtered scenario, we selected demonstrations from the filtered training data, from which we removed posts that appeared in multiple splits for a particular language rather than only in the training set, as described in Section 3.1. In the unfiltered scenario, we employed the original training sets, in particular the combination of all training data, for the sample selection process.</p>
<p>Polish Experiments. In our additional experiments, we specifically focused on Polish, a low-resource language in this task with a total of 304 samples. The limited size of this dataset motivated a more comprehensive analysis of various LLMs and diverse prompts aimed at achieving performance comparable to that of models for high-resource languages, such as English and French. We selected Polish over other low-resource languages, such as Tamil or Marathi, to focus on Latin-script languages and reduce variability from different writing systems. Polish also represents the Slavic language family, which is underrepresented in multilingual NLP, allowing us to address this gap. Furthermore, having a native Polish speaker among the authors enabled more accurate evaluation and interpretation of LLM outputs.</p>
<p>For the experiments with Polish, we employed three LLMs: the Polish Bielik v2.3 model and the multilingual Llama3.1 Nemotron Ultra and Llama3.1 405B. For these LLMs, we leveraged two prompting strategies: (1) Chain-of-Thought (CoT) and (2) Few-Shot prompting.</p>
          <p>
            The CoT prompt in Polish was developed with the assistance of the Llama3.1 405B model and
relevant research papers, especially by Sundriyal et al. [
            <xref ref-type="bibr" rid="ref6">18</xref>
            ] and Sundriyal et al. [
            <xref ref-type="bibr" rid="ref31">3</xref>
            ]. We instructed
the Llama3.1 405B model to generate a CoT prompt based on the description of the task and the
normalized claim. We refer to this prompting strategy as Polish-CoT, which is shown along with the
English translation in Figure 9 in the Appendix. These experiments with Polish-CoT were done only
for Bielik v2.3.
          </p>
<p>The second set of experiments investigated the effectiveness of a few-shot strategy, specifically using 3, 10, and 20-shot prompting. For few-shot prompting, we selected demonstrations from the unfiltered training set based on cosine similarity using the paraphrase-multilingual-MiniLM-L12-v2 model. Examples of the system prompt and a few-shot prompt can be found in Figure 10 and Figure 11 in the Appendix.</p>
        </sec>
        <sec id="sec-3-2-2">
          <title>8https://huggingface.co/Alibaba-NLP/gte-multilingual-base</title>
<p>3.2.4. Ensemble of Methods</p>
          <p>For method ensembling, we first collected the outputs of the five top-performing generation strategies (the exact set depends on the language and may include few-shot prompting with different models as well as fine-tuned LoRA adapters). Second, we computed an embedding for each normalized claim using the paraphrase-multilingual-MiniLM-L12-v2 model and, for each post, averaged the embeddings of the claims produced by the top-5 methods into a centroid. Third, we computed the similarity between each of these claims and the centroid embedding, and selected as the final output the claim with the highest similarity to the centroid. The idea behind this approach is to leverage the “wisdom of the crowd” and find the most common representation of the generated claims. LLM outputs may differ in quality depending on the input; for instance, the claim is sometimes generated in the wrong language or includes hallucinated content. But if in 4 out of 5 cases the generated claim uses the correct target language and references the same core content, this issue is self-corrected by automatically picking the most representative sample, i.e., the one whose embedding is closest to the centroid.</p>
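<p>A minimal sketch of this centroid-based selection, with encode standing in for the paraphrase-multilingual-MiniLM-L12-v2 sentence encoder:</p>

```python
import math

def most_representative(claims, encode):
    """Return the candidate claim whose embedding is closest (by cosine
    similarity) to the centroid of all candidate embeddings."""
    vectors = [encode(c) for c in claims]
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    def cos(u, v):
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return sum(a * b for a, b in zip(u, v)) / norm if norm else 0.0
    return max(claims, key=lambda c: cos(encode(c), centroid))
```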
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation</title>
<p>In this section, we present our findings on parameter-efficient fine-tuning and LLM prompting for
the claim normalization task. We begin with the results from the monolingual setting (Section 4.1),
including observations from LoRA fine-tuning (Section 4.1.1) and evaluations of various prompting
techniques (Section 4.1.2). Additionally, we report the final results from the shared task submission
platform (Section 4.2), covering both the monolingual and zero-shot settings. This includes ranking of
our methods and the identification of the best-performing approaches for specific languages on the test
set.</p>
      <sec id="sec-4-1">
        <title>4.1. Monolingual Settings</title>
<p>In the monolingual setting, we evaluate our approaches by training and testing on data from the same language. This allows us to focus on language-specific performance and assess the effectiveness of parameter-efficient fine-tuning and prompting methods. Since ground truth labels for the test set are unavailable, we report the final performance of our approaches based on the development set.</p>
        <p>4.1.1. Parameter-Efficient Fine-Tuning Results</p>
        <p>Based on the initial prompting results, we found that the multilingual Gemma3 27B model achieves good results for many languages in the monolingual setting. Therefore, we focused on that model in our experiments with LoRA adapters (see Table 4), but replaced Gemma3 27B with Qwen3 14B for the final submission, because Qwen3 outperformed Gemma3 and showed the best average performance in our later prompting experiments (see Section 4.1.2 for more detail). We did not repeat the same experiments with Qwen3 due to the lack of time and computational resources, and directly fine-tuned the adapters on the de-duplicated training set prepared according to Section 3.1.</p>
        <p>Given that the shared task data are imbalanced (see Table 1), we experimented with different ways
of augmenting and balancing the data to mitigate this issue. For instance, LoRA-translated in Table 4
relies on data augmentation via translation from English into the target languages. We used the Google
Translate API and selected the posts with less than 1500 characters as source data. The translated posts
and normalized claims were then combined with the original samples and used for fine-tuning the
adapters (see Appendix B for the fine-tuning details). We also experimented with filtering out “bad
translations” by applying a set of heuristics (LoRA-translated-v2 in Table 4). In this setting, we ensure
that both social media posts and claims share the same target language, and the cosine similarity between
each translated post and the corresponding gold claim is above the median computed on the train data for
each language using Sentence Transformer model paraphrase-multilingual-MiniLM-L12-v2.</p>
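<p>The filtering heuristics can be sketched as follows. Here <code>embed</code> and <code>detect_language</code> are stand-ins for a Sentence Transformer (paraphrase-multilingual-MiniLM-L12-v2) and a language-identification model; the function name and signature are illustrative, not our actual implementation.</p>

```python
import numpy as np

def filter_translations(pairs, embed, detect_language, target_lang, sim_median):
    """Keep only translated (post, claim) pairs that pass the heuristics:
    both texts are in the target language, and the cosine similarity between
    the post and its gold claim exceeds the per-language median computed on
    the original training data.
    """
    kept = []
    for post, claim in pairs:
        # Heuristic 1: both the post and the claim must be in the target language.
        if detect_language(post) != target_lang or detect_language(claim) != target_lang:
            continue
        # Heuristic 2: post and gold claim must be semantically close.
        p, c = embed(post), embed(claim)
        cos = float(np.dot(p, c) / (np.linalg.norm(p) * np.linalg.norm(c)))
        if cos > sim_median:
            kept.append((post, claim))
    return kept
```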
        <p>To mitigate the imbalance without adding new data points, we also considered a setting where we
fine-tune a single adapter on mixed data from different languages (LoRA-all-balanced in Table 4), with
all post-claim pairs subsampled to 500 per language to ensure equal representation and diversity.
Since the gains in performance were marginal, we did not repeat these experiments for Qwen3 14B
and used the original, non-translated data, training a separate adapter for each language.</p>
        <p>The results in Table 4 (based on the development data) indicate that the basic LoRA adapter
separately fine-tuned on each target language (LoRA-target) already achieves the optimal
performance for English, Hindi, Marathi, Punjabi and Thai. Using a single adapter fine-tuned on the
mixture of different languages with roughly equal representation (LoRA-balanced) results in small
improvements for Arabic, French, and Portuguese, and using translated data without any additional
filtering (LoRA-translated) is slightly beneficial only for German. Note that filtering out bad examples
from the training set and ensuring high similarity between the translated claims and posts that have
the correct target language (LoRA-translated-v2) is beneficial for some languages and namely leads to
small improvements for Indonesian, Polish, Spanish, and Tamil. However, because fine-tuning
adapters on a large amount of translated data is computationally expensive and brings only marginal
gains, we decided to fine-tune the Qwen3 adapters only on the original data for each language. The
final results, including the fine-tuned adapters and the ensemble method, are discussed in more detail
in Section 4.2.</p>
        <p>The comparison between the fine-tuned LoRA adapters with Qwen3 14B and the few-shot prompting
of Qwen3 32B (best strategy according to Table 5 with filtered data) is shown in Figure 2. The results
indicate that for some languages (e.g., German, Polish, Arabic) the difference in performance is
negligible, while for others (e.g., Indonesian and English) adapters substantially outperform few-shot
prompting. Although the amount of training data has some impact on the downstream performance (as
indicated by much better performance on the English data), the pattern is not consistent. For instance,
Portuguese has more than 1500 samples in the training set, but few-shot prompting outperforms adapters,
while Tamil has only 100 samples, but adapters achieve the best METEOR score (+10.8% compared to
the few-shot prompting).</p>
        <p>Overall, high-resource languages with a significant amount of training data (English, Spanish,
Portuguese, and French) demonstrate relatively good performance (0.44-0.65), while high- or mid-resource
languages with comparatively less data (&lt;500 samples for German and Arabic) tend to
underperform (0.30-0.36). As for the low-resource languages, adapters work well for Tamil and Punjabi
but achieve slightly worse results for Marathi. (We did not fine-tune adapters for Qwen3 32B because
of the limited computational resources at the time of the submission.) Both methods obtain almost
identical scores for Polish, which has a very small amount of training data (only 151 samples after
filtering). On average, languages with non-Latin script (Arabic, Hindi, Marathi, Punjabi, Tamil, and
Thai) obtain lower scores than the ones with Latin script (0.33 vs. 0.47).</p>
        <p>[Table 5 caption fragment: Og refers to prompts translated into the target language (e.g., Arabic
prompts for the ara language). The Fil. column specifies the few-shot prompting setup: ✓ denotes that
filtered data was used to sample demonstrations, whereas an empty cell indicates the use of unfiltered
data. Best results for each language are in bold and second-best are underlined.]</p>
        <p>4.1.2. Prompting Experiments</p>
        <p>In addition to LoRA adapters, we employed various strategies for instructing LLMs, with a specific
focus on evaluating the results of the proposed approaches in the zero-shot setting, where we are not
provided with the training and development sets.</p>
        <sec id="sec-4-1-1">
          <title>Zero and Few-Shot Prompting</title>
          <p>For the comprehensive evaluation of various settings across languages in the monolingual
setting, we evaluated zero-shot and few-shot prompting along with the translated version. The overall
results are shown in Table 5, where we provide results across 13 languages, five LLMs, and six settings.
In addition, we compare prompts written in English versus those written in the target language to
measure the impact of the instruction language on the model’s performance.</p>
          <p>Across all models, we observe a consistent improvement when moving from zero-shot to few-shot
prompting. The best average performance in the monolingual setting is achieved by Qwen3 32B in the
10-shot setting with the English instruction and unfiltered data for selecting samples. In addition,
Gemma3 27B performed comparably well when using 10-shot prompting with the instruction in the
target language and filtered data.</p>
          <p>In zero-shot prompting, the prompts written in English consistently outperformed those in
the target language. Since LLMs are trained on a variety of languages but English still constitutes the
major part of the pre-training data, the models can process input better with English instructions than
with translated ones. However, in few-shot prompting, prompts in the target language (Og) outperformed
those in English across most languages, specifically for the Gemma3 and Qwen2.5 models, suggesting
that aligning the instruction language with the language of the input and demonstrations helps the
model better contextualize the task.</p>
          <p>High-resource Western European languages, such as Spanish, English, French, and Portuguese,
demonstrated consistently strong performance, with English and Portuguese achieving the highest scores,
both exceeding 0.55. In addition, as can be expected, English showed the strongest performance
across many LLMs, particularly using the 10-shot setting. Notably, Qwen3 32B reaches the highest
score of 0.59, indicating the model’s strong performance in English.</p>
          <p>Languages with non-Latin scripts, such as Arabic, Thai, Tamil, Hindi, Marathi, and Punjabi, showed
more variable performance. Among them, Arabic and Tamil performed best with the Qwen3
models (both 8B and 32B). On the other hand, Thai achieved relatively low performance, especially with
zero-shot prompting, reaching a maximum METEOR score of only 0.07. However, providing demonstrations
increased the performance to more than 0.28.</p>
          <p>Surprisingly, the German language exhibited very low performance across LLMs, particularly using
zero-shot prompting. This outcome may be attributed to issues with data quality, as our manual
inspection revealed several issues. In some cases, the normalized claims associated with social media
posts were written in a different language, or the key information from the normalized claim was absent
from the post. Such discrepancies likely hinder the model’s ability to generate appropriate claims,
especially without additional context. Moreover, many normalized claims referenced images or videos
that were not included in the input. As a result, LLMs were not able to recognize or indicate that certain
claims were grounded in visual evidence.</p>
          <p>Prompting Results for Polish. For the Polish language, we conducted a separate set of experiments
and evaluated it on the development set, where the samples from the unfiltered training set were
employed as demonstrations for few-shot prompting. In addition, we provide the results on the
test set obtained from the submission site, where both training and development sets were used for
demonstration selection.</p>
          <p>The results from Table 6 indicate that the optimal performance for Polish on the development dataset
was achieved using the Llama3.1 Nemotron Ultra model with a 10-shot learning approach. In
contrast, the best results on the test dataset were obtained using the Llama3.1 405B model with a
20-shot learning approach.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Final Results</title>
        <p>Table 7 presents the final evaluation results of our proposed approaches on the official test set for the
shared task, covering both monolingual and zero-shot settings. Our approaches performed competitively
across a wide range of languages, achieving the first rank in 13 out of the 20 evaluated languages.</p>
        <p>In the monolingual setting, we achieved the top score in six languages, namely Arabic, English,
Hindi, Marathi, Punjabi, and Tamil. Notably, among our proposed approaches, the ensemble methods
performed best for six languages, while fine-tuned LoRA adapters for the Qwen3 model achieved
superior performance on four languages. Fine-tuned Qwen3 demonstrated strong performance
in low-resource scenarios such as Marathi, Indonesian, and Tamil. In addition, prompting techniques with
Qwen3 proved to be effective for German and Punjabi.</p>
        <p>In the zero-shot setting, our methods obtained the highest score in all seven languages. This
demonstrated the generalization capabilities of our approaches even in the absence of training data for
the target languages. Here, the use of Gemma3 and the ensemble of methods were crucial for achieving
the best performance.</p>
        <p>The largest gap between our score and the overall best score occurred for Thai, where our ensemble
of methods scored 0.30 against the best of 0.59, placing us third. This suggests a potential area for
improvement that involves exploring further prompting strategies and model adaptation in syntactically
diverse languages.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Discussion</title>
      <p>Our Main Findings. Our experiments show that LLMs are capable of performing the task of claim
normalization for a variety of languages even when no or only few samples are available. However,
different models may generate claims of different quality for the same post. Therefore, it is important to
further “normalize” and post-process generated claims by using the ensemble method to find the most
representative sample for each claim. This method resulted in the best score for 5 out of 7 languages in
the zero-shot setting of the shared task, and it was the best strategy for almost half of the languages in
the monolingual setting.</p>
      <p>Overall, LLMs like Gemma3 and Qwen3 demonstrate strong multilingual capabilities. Gemma3 turned
out to be the strongest model for Czech and Korean in the zero-shot setting, while Qwen3 showed better
performance in the monolingual setting. Larger models (e.g., 32B for Qwen and 27B for Gemma)
are generally better at claim normalization, but for some configurations and languages, smaller models
perform on par with or even outperform the larger ones. Additionally, we found that extra pre-processing
and cleaning of the data does not substantially improve the scores, and our best results, depending on
the language, were achieved with few-shot prompting or fine-tuning on the original data.
Limitations &amp; Challenges. The shared task presents several challenges and limitations. The provided
dataset is unbalanced: for some languages there are thousands of examples (English, Spanish), while
for others there are only a few hundred (Polish, Marathi, Tamil). Languages have different scripts, and some
of them are very low-resource (e.g., Bengali, Punjabi, and Telugu).</p>
      <p>A key limitation concerns the post and claim overlap across the dataset splits. While we identified and
addressed the problem of overlapping claims and posts between the original training and development
data, the potential overlap between the training and testing data has not been analyzed. This makes the
fine-tuning and few-shot prompting somewhat unreliable unless all overlapping instances are removed.
This issue introduces an evaluation bias, especially in monolingual settings, where the models may
appear to perform better due to memorization rather than generalization.</p>
      <p>The input lengths can vary significantly, and some posts are very long and exceed the context window
of LLMs, requiring truncation (posts can be up to 31843 characters and 5020 tokens in the English
training set). Posts often include a lot of repetition along with excessive punctuation, emojis, URLs,
hashtags, and ungrammatical sentences. Some gold posts and claims also appear in different languages,
adding complexity to both fine-tuning and the interpretation of demonstrations. A number of claims
also reference external media, such as videos or images, which are not included in the input, and this
leads to potential loss of context and incorrectly or incompletely generated claims.</p>
      <p>In addition, some social media posts include language that can be offensive, and LLMs refuse to
generate any normalized claims based on such content, e.g., “I understand you’ve expressed strong
negative feelings and used offensive language towards Greta Thunberg. I want to be clear that I cannot
and will not generate responses that include hate speech, insults, or profanity. My purpose is to be helpful
and harmless, and that includes respecting individuals regardless of differing opinions.” Although the
models were instructed to act as fact-checkers or experts in detecting misinformation, they still refused
to generate normalized claims in certain cases. This behaviour, however, also demonstrates their ability
to refuse potentially harmful content and to avoid further spreading misinformation.</p>
      <p>Furthermore, understanding certain claims may require world knowledge or familiarity with specific
events. The lack of context is an important limitation of the shared task data because some of the posts
cannot be normalized without access to the conversational threads and additional media accompanying
the post. E.g., it is not possible to infer “girl” in the gold normalized claim “Girl from Ethiopia’s Mursi
tribe” based solely on “Mursi tribe Ethiopia Africa Mursi tribe Ethiopia Africa Mursi tribe Ethiopia Africa
None”. Therefore, gold annotations are not always a realistic goal for the generated output, and having
such examples in the gold data may encourage model hallucinations.</p>
      <p>Future Work. In the future, researchers can consider experimenting with different approaches to
data augmentation. For instance, LLMs can be leveraged to generate more samples (post-claim pairs)
for underrepresented languages, and such data could then be further used for adapter fine-tuning. In
addition, we see the potential in refining model predictions by applying self-revision, and the ensemble
method that proved to be successful in our experiments could be applied to the outputs of the same
model (i.e., one could do self-ensemble and find the most representative claims among all generated
variants). Although both self-ensemble and self-revision increase inference time, they have the potential
to improve the quality of generated data and avoid outliers, which is very important for low-resource
scenarios.</p>
      <p>
        Another direction to pursue is to test different ways of integrating additional constraints in the prompt
and performing checks after the generation (e.g., ensuring that both the post and its normalized claim
have the same language, and their similarity score is above the threshold derived based on the training
data). We used some of the constraints when generating claims with adapters, but not in the prompting
experiments. One could also benchmark additional multilingual models (e.g., Aya-100 [
        <xref ref-type="bibr" rid="ref23">35</xref>
        ]) and use
soft prompts instead of adapters for parameter-eficient fine-tuning. Furthermore, the integration of
dynamic selection of in-context demonstrations without relying on a fixed number of samples (top K)
can be investigated in future work. This is especially important for the languages that do not have much
data that can be used as demonstrations. The selection of the demonstrations based on the similarity
threshold can help to eliminate those examples that could potentially harm the performance.
      </p>
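<p>The proposed similarity-threshold selection of demonstrations can be sketched as follows. The function and parameter names are illustrative, and embeddings are assumed to be precomputed; instead of always taking a fixed top-K, only training examples sufficiently similar to the input post are kept.</p>

```python
import numpy as np

def select_demonstrations(query_emb, pool_embs, pool_pairs, threshold, max_k=10):
    """Select in-context demonstrations dynamically: keep only the training
    examples whose cosine similarity to the input post exceeds `threshold`,
    ordered by decreasing similarity and capped at `max_k`.

    `pool_pairs` is a list of (post, normalized_claim) tuples aligned with the
    rows of `pool_embs`.
    """
    q = query_emb / np.linalg.norm(query_emb)
    p = pool_embs / np.linalg.norm(pool_embs, axis=1, keepdims=True)
    sims = p @ q
    order = np.argsort(-sims)  # most similar first
    chosen = [pool_pairs[i] for i in order if sims[i] > threshold]
    return chosen[:max_k]
```

<p>Unlike fixed top-K sampling, this scheme can return fewer demonstrations for languages with little usable data, filtering out examples that could harm performance.</p>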
      <p>
        In addition, future work could explore the impact of using normalized claims on other fact-checking
tasks, such as claim matching, evidence retrieval, or fact verification. There are already efforts to
evaluate the impact of claim decomposition on fact-checking performance [
        <xref ref-type="bibr" rid="ref24">36</xref>
        ]. However, the effect
of normalized claims on the performance of particular tasks has not been analyzed. Claim normalization
may help reduce noise and ambiguity, potentially leading to improved model performance on these
tasks. In claim matching especially, normalized claims can enhance the identification of whether a
given claim was previously fact-checked, since normalized claims more closely resemble the statements
with which they are being compared in this task, e.g., when using semantic similarity. A comparative
analysis between using raw social media posts and their corresponding normalized claims would
provide valuable insights into the benefits and limitations of normalization. Moreover, it would be
interesting to conduct a feasibility analysis in real-time settings by integrating multilingual LLM-based
claim normalization into fact-checking workflows and seeing how this approach can be scaled.
      </p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we presented our approaches to multilingual claim normalization in the context of the
CLEF 2025 CheckThat! shared task. By combining parameter-efficient fine-tuning, prompting strategies,
and ensemble methods, we addressed the challenges posed by noisy, informal, and multilingual social
media content. Our methods demonstrated strong performance across both monolingual and zero-shot
settings, achieving first place in 6 out of 13 monolingual languages and top scores in all 7 zero-shot
languages.</p>
      <p>We found that the effectiveness of each approach varied by language and resource availability.
LoRA-based fine-tuning proved effective for low-resource scenarios, while few-shot prompting with models
like Qwen3 32B yielded the best results in high-resource settings. The ensemble method, leveraging
outputs from multiple strategies, emerged as a robust solution for selecting representative normalized
claims, especially in zero-shot scenarios. Our findings highlight the potential of multilingual LLMs for
claim normalization and their adaptability across diverse languages.</p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgments</title>
      <p>This research was partially supported by DisAI (Improving scientific excellence and creativity in
combating disinformation with artificial intelligence and language technologies), a project funded by
Horizon Europe under GA No. 101079164; by LorAI (Low Resource Artificial Intelligence), a project
funded by Horizon Europe under GA No. 101136646; by the Ministry of Education, Youth and Sports of
the Czech Republic through e-INFRA CZ (ID: 90254); by the German Federal Ministry of Research,
Technology and Space (BMFTR) as part of the projects TRAILS (01IW24005) and VeraExtract
(01IS24066); as well as by the BIFOLD Agility Project FakeXplain.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>
        During the preparation of this work, some authors used Grammarly and ChatGPT in order to check the
spelling and paraphrase. After using these tools, the authors reviewed and edited the content as needed
and take full responsibility for the publication’s content.</p>
      <p>
        [
        <xref ref-type="bibr" rid="ref31">3</xref>
        ] M. Sundriyal, T. Chakraborty, P. Nakov, Overview of the CLEF-2025 CheckThat! lab task 2 on
claim normalization, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF
2025 - Conference and Labs of the Evaluation Forum, CLEF 2025, Madrid, Spain, 2025.
[
        <xref ref-type="bibr" rid="ref32">4</xref>
        ] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri,
S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, The CLEF-2025 CheckThat! lab: Subjectivity,
fact-checking, claim normalization, and retrieval, in: C. Hauff, C. Macdonald, D. Jannach, G. Kazai,
F. M. Nardini, F. Pinelli, F. Silvestri, N. Tonellotto (Eds.), Advances in Information Retrieval,
Springer Nature Switzerland, Cham, 2025, pp. 467–478.
[
        <xref ref-type="bibr" rid="ref33">5</xref>
        ] F. Alam, J. M. Struß, T. Chakraborty, S. Dietze, S. Hafid, K. Korre, A. Muti, P. Nakov, F. Ruggeri,
S. Schellhammer, V. Setty, M. Sundriyal, K. Todorov, V. Venktesh, Overview of the CLEF-2025
CheckThat! Lab: Subjectivity, fact-checking, claim normalization, and retrieval, in: J. Carrillo-de
Albornoz, J. Gonzalo, L. Plaza, A. García Seco de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina,
G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction.
      </p>
      <p>Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), 2025.
[6] P. Nakov, D. Corney, M. Hasanain, F. Alam, T. Elsayed, A. Barrón-Cedeño, P. Papotti, S. Shaar,
G. Da San Martino, Automated Fact-Checking for Assisting Human Fact-Checkers, in: Z.-H. Zhou
(Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence,
IJCAI21, International Joint Conferences on Artificial Intelligence Organization, 2021, pp. 4551–4558.</p>
      <p>URL: https://doi.org/10.24963/ijcai.2021/619. doi:10.24963/ijcai.2021/619, survey Track.
[7] Y.-C. Chang, C. Kruengkrai, J. Yamagishi, XFEVER: Exploring Fact Verification across Languages,
in: J.-L. Wu, M.-H. Su (Eds.), Proceedings of the 35th Conference on Computational Linguistics and
Speech Processing (ROCLING 2023), The Association for Computational Linguistics and Chinese
Language Processing (ACLCLP), Taipei City, Taiwan, 2023, pp. 1–11. URL: https://aclanthology.org/2023.rocling-1.1/.
[8] J. Thorne, A. Vlachos, C. Christodoulopoulos, A. Mittal, FEVER: a Large-scale Dataset for Fact
Extraction and VERification, in: M. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018
Conference of the North American Chapter of the Association for Computational Linguistics:
Human Language Technologies, Volume 1 (Long Papers), Association for Computational
Linguistics, New Orleans, Louisiana, 2018, pp. 809–819. URL: https://aclanthology.org/N18-1074/.
doi:10.18653/v1/N18-1074.
[9] A. Gupta, V. Srikumar, X-Fact: A New Benchmark Dataset for Multilingual Fact Checking, in:
C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association
for Computational Linguistics and the 11th International Joint Conference on Natural Language
Processing (Volume 2: Short Papers), Association for Computational Linguistics, Online, 2021, pp.
675–682. URL: https://aclanthology.org/2021.acl-short.86/. doi:10.18653/v1/2021.acl-short.86.
[10] M. Pikuliak, I. Srba, R. Moro, T. Hromadka, T. Smoleň, M. Melišek, I. Vykopal, J. Simko, J. Podroužek,
M. Bielikova, Multilingual Previously Fact-Checked Claim Retrieval, in: H. Bouamor, J. Pino,
K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, Singapore, 2023, pp. 16477–16500. URL:
https://aclanthology.org/2023.emnlp-main.1027/. doi:10.18653/v1/2023.emnlp-main.1027.
[11] A. Singhal, V. Shao, G. Sun, R. Ding, J. Lu, K. Zhu, A Comparative Study of Translation Bias and
Accuracy in Multilingual Large Language Models for Cross-Language Claim Verification, 2024.</p>
      <p>URL: https://arxiv.org/abs/2410.10303. arXiv:2410.10303.
[12] R. F. Cekinel, P. Karagoz, Ç. Çöltekin, Cross-Lingual Learning vs. Low-Resource Fine-Tuning:
A Case Study with Fact-Checking in Turkish, in: N. Calzolari, M.-Y. Kan, V. Hoste, A. Lenci,
S. Sakti, N. Xue (Eds.), Proceedings of the 2024 Joint International Conference on Computational
Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino,
Italia, 2024, pp. 4127–4142. URL: https://aclanthology.org/2024.lrec-main.368/.
[13] R. Panchendrarajan, A. Zubiaga, Entity-aware Cross-lingual Claim Detection for Automated</p>
      <p>Fact-checking, 2025. URL: https://arxiv.org/abs/2503.15220. arXiv:2503.15220.
[14] A. Singhal, T. Law, C. Kassner, A. Gupta, E. Duan, A. Damle, R. L. Li, Multilingual Fact-Checking</p>
    </sec>
    <sec id="sec-9">
      <title>A. Computational Resources</title>
      <p>
        For our experiments, we leveraged a computational infrastructure consisting of NVIDIA A40 PCIe 40GB
and H100 NVL 94GB GPUs, with our experiments running in parallel on multiple GPUs. In addition, the
Polish experiments were conducted on a local workstation equipped with an NVIDIA GeForce RTX
3080 GPU and utilising the NVIDIA NIM platform [
        <xref ref-type="bibr" rid="ref25">37</xref>
        ].
      </p>
    </sec>
    <sec id="sec-10">
      <title>B. Details on Parameter-Efficient Fine-Tuning</title>
      <p>The adapters were tuned for each language separately, using the filtered training data. In the pilot
experiments with German we found that a maximum sequence length of 2048, a learning rate of 2e-4, and a
linear scheduler work well for normalized claim generation, so we re-used these hyperparameters
for training adapters in all languages. We use r = 32 and lora_alpha = 32 with lora_dropout = 0,
and train the adapters for 3 epochs to avoid overfitting. At inference time we set max_new_tokens
to 256 and generate the claims with the following hyperparameters: temperature = 0.7, top_p = 0.8,
top_k = 20.</p>
      <p>[Prompt template: “You are a fact-checking expert. Create a normalized claim from the unstructured post. Now process this post: {post}”]</p>
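<p>These hyperparameters can be collected into a configuration sketch (plain dictionaries standing in for, e.g., a PEFT LoraConfig and Transformers generation arguments; the values follow the text above, while the dictionary layout itself is only illustrative):</p>

```python
# LoRA adapter settings used for all languages (values as reported above).
lora_config = {
    "r": 32,                 # LoRA rank
    "lora_alpha": 32,        # scaling factor
    "lora_dropout": 0.0,
    "max_seq_length": 2048,
    "learning_rate": 2e-4,
    "lr_scheduler_type": "linear",
    "num_train_epochs": 3,   # kept low to avoid overfitting
}

# Decoding settings for claim generation at inference time.
generation_config = {
    "max_new_tokens": 256,
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.8,
    "top_k": 20,
}
```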
    </sec>
    <sec id="sec-11">
      <title>C. Prompting Experiments</title>
      <sec id="sec-11-1">
        <title>C.1. Prompt Templates</title>
        <p>In this section, we present the system and prompt templates used for specific prompting experiments.
Figure 4 and Figure 5 illustrate the prompt templates for the Direct Normalization and
SummarizationBased Normalization approaches, respectively. Each template includes two demonstration examples.
The prompt design emphasizes key aspects such as maintaining focus on important points, eliminating
redundancy, ensuring objectivity in claims, and using clear, simple language. In the zero-shot approach
for monolingual experiments, we assign a fact-checker role to the LLM and prompt it to generate a
normalized claim from an unstructured input post; see Figure 3 for the prompt template. For the zero-shot
experiments we use the direct normalization prompt without any demonstrations from the training set.</p>
        <p>For the zero-shot and few-shot prompting experiments, described in Section 3.2.3, we used the system
prompt shown in Figure 6. Our zero-shot prompt is shown in Figure 7, while the extended version for
the few-shot prompting is illustrated in Figure 8. In few-shot prompting, we replace {examples} with
a list of social media posts along with the normalized claims. The number of demonstrations depends
on the setting and whether we used 1, 2, 5 or 10-shot prompting.</p>
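<p>The {examples} substitution can be sketched as follows (the template string and function name are illustrative, not the exact prompts from Figures 6-8):</p>

```python
def build_few_shot_prompt(template, demonstrations, post):
    """Fill the {examples} and {post} placeholders of a few-shot template.

    `demonstrations` is a list of (post, normalized_claim) pairs; its length
    corresponds to the 1-, 2-, 5- or 10-shot setting.
    """
    examples = "\n\n".join(
        f"Post: {p}\nNormalized Claim: {c}" for p, c in demonstrations
    )
    return template.format(examples=examples, post=post)
```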
        <p>Figure 10 and Figure 11 show, respectively, the system prompt and user prompt for few-shot prompting
experiments for the Polish language.</p>
      </sec>
      <sec id="sec-11-2">
        <title>C.2. Post Overlap in Development Data</title>
        <p>Figure 12 presents the overlap between the gold training and development data.</p>
      </sec>
      <sec id="sec-11-3">
        <title>C.3. Additional Results</title>
        <p>Few-Shot Prompting. Table 8 presents the results for varying numbers of demonstrations for the
few-shot prompting. In this scenario, we employed instructions written in the English language. Overall,
both Qwen3 models consistently outperformed the Gemma3 model using few-shot prompting.
The best average performance was achieved by Qwen3 32B with 10-shot prompting using unfiltered data. This
demonstrates that the Qwen3 models are better equipped to handle the demonstrations and also show stronger
multilingual capabilities.</p>
        <p>Increasing the number of demonstrations in the prompt generally improves performance,
particularly for large models. For example, Qwen3 32B improved from 0.315 (1-shot, unfiltered) to 0.375
(10-shot, unfiltered). Moreover, using unfiltered data often led to better results on average.</p>
        <p>Similarly to the results using zero-shot and 10-shot prompting, Latin-script Indo-European
languages yielded the highest scores, reflecting both their prevalence in pre-training data and their linguistic
similarity to English. In contrast, languages using non-Latin scripts showed lower performance,
highlighting the challenges of multilingual generalization for underrepresented scripts. However, there are
some exceptions, such as Indonesian and Tamil, where the best performance was over 0.42.</p>
        <p>Direct and Summarization-Based Normalization Methods. As additional experiments for the
monolingual setting, we experimented with three different approaches: two few-shot prompting methods
and zero-shot prompting. In the first approach (hereafter referred to as the Direct Normalization Approach),
we instructed LLMs to generate the most accurate normalized claims directly from the unstructured data.
The prompt template used for Direct Normalization is illustrated in Figure 4. In the second method
(Summarization-Based Normalization), we summarized the unstructured data into a normalized claim,
as shown in the prompt template in Figure 5. For both approaches, we included two demonstrations
randomly selected from the training set as references and evaluated performance on the
development set. In the zero-shot approach for the monolingual experiments, where the LLMs rely solely
on their pre-trained knowledge, we provided instructions without including any training examples.
For the monolingual experiments, we used the instruction in each specific language by
translating the prompt into that language with the Google Translate API.</p>
        <p>Prompt template for the Direct Normalization Approach (Figure 4):
Create a best normalized claim from the unstructured data.
Follow these guidelines:
Example 1:
Post: 'Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC None.'
Normalized Claim: 'Pakistani government appoints former army general to head medical regulatory body.'
Example 2:
Post: 'A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed None'
Normalized Claim: 'Late actor and martial artist Bruce Lee playing table tennis with a set of nunchucks.'
Now process this claim:
{post}</p>
        <p>Prompt template for Summarization-Based Normalization (Figure 5):
Create a summary from the unstructured data in the form of a normalized claim.
Follow these guidelines:
1. Focus on the main message — Extract only the most important factual statement from the post.
2. Remove redundancy — Ignore repetition, extraneous details, and any irrelevant content (hashtags, usernames, etc.).
3. Keep it objective — Avoid opinions, judgments, or speculation.
4. Use simple language — Rephrase complex or convoluted sentences into clear, direct statements.
5. Formatting — Use ONLY this format: Normalized Claim: [your claim here]
Example 1:
Post: 'Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC Lieutenant Retired General Asif Mumtaz appointed as Chairman Pakistan Medical Commission PMC None.'
Normalized Claim: 'Pakistani government appoints former army general to head medical regulatory body.'
Example 2:
Post: 'A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed A priceless clip of 1970 of Bruce Lee playing Table Tennis with his Nan-chak !! His focus on speed None'
Normalized Claim: 'Late actor and martial artist Bruce Lee playing table tennis with a set of nunchucks.'
Now process this claim:
{post}</p>
        <p>Zero-shot prompt template:
You are an expert in misinformation detection and fact-checking. Your task is to identify the central claim in the given post while preserving its original language.
The central claim should meet the following criteria:
- **Verifiable**: It must be a factual assertion that can be checked against evidence.
- **Concise**: It should be a single, clear sentence that captures the main claim of the post.
- **Socially impactful**: It should be a statement that could influence public opinion, health, or policy.
- **Free from rhetorical elements**: Do not include opinions, rhetorical questions, or unnecessary context.
- **Preserve Original Language**: The output should be in the same language as the input post.
Output only the central claim without additional explanation or formatting.
Post: {post}
Normalized claim:</p>
        <p>Few-shot prompt template:
You are an expert in misinformation detection and fact-checking. Your task is to identify the central claim in the given post while preserving its original language.
The central claim should meet the following criteria:
- **Verifiable**: It must be a factual assertion that can be checked against evidence.
- **Concise**: It should be a single, clear sentence that captures the main claim of the post.
- **Socially impactful**: It should be a statement that could influence public opinion, health, or policy.
- **Free from rhetorical elements**: Do not include opinions, rhetorical questions, or unnecessary context.
- **Preserve Original Language**: The output should be in the same language as the input post.
Output only the central claim without additional explanation or formatting.
Examples: {examples}
Post: {post}
Normalized claim:</p>
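        <p>As a concrete illustration of how a few-shot prompt with {examples} and {post} placeholders can be instantiated, the sketch below fills an abbreviated version of the Direct Normalization template with demonstrations drawn at random from the training set. This is a minimal sketch: the template string is shortened and all function and variable names are illustrative, not taken from the authors' code.</p>

```python
# Minimal sketch of few-shot prompt assembly. TEMPLATE abbreviates the
# paper's Direct Normalization template; helper names are illustrative.
import random

TEMPLATE = (
    "Create a best normalized claim from the unstructured data.\n"
    "Follow these guidelines:\n"
    "{examples}\n"
    "Now process this claim:\n"
    "{post}"
)

def format_examples(pairs):
    """Render (post, normalized claim) pairs as numbered demonstrations."""
    blocks = []
    for i, (post, claim) in enumerate(pairs, start=1):
        blocks.append(
            f"Example {i}:\nPost: '{post}'\nNormalized Claim: '{claim}'"
        )
    return "\n".join(blocks)

def build_prompt(post, train_pairs, k=2, seed=0):
    """Fill the template with k demonstrations sampled from the training set."""
    demos = random.Random(seed).sample(train_pairs, k)
    return TEMPLATE.format(examples=format_examples(demos), post=post)
```

        <p>With k=2, this mirrors the two randomly selected demonstrations used in the monolingual experiments.</p>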
        <p>In the zero-shot setting with 7 languages, we relied on the direct normalization approach, as it
produced the best results in the monolingual experiments. In this case, however, we instructed LLMs
using prompts written entirely in English, without any translated prompts or demonstrations. This
setup evaluates the model’s ability to generalize across languages using its pre-trained multilingual
capabilities.</p>
        <p>
          For these experiments, we selected three LLMs, specifically Llama4 Scout [
          <xref ref-type="bibr" rid="ref26">38</xref>
          ], Llama3.3
Instruct 70B [
          <xref ref-type="bibr" rid="ref15">27</xref>
          ] and Mistral Saba [
          <xref ref-type="bibr" rid="ref27">39</xref>
          ]. Additionally, for running the experiments, we used the
Groq API [
          <xref ref-type="bibr" rid="ref28">40</xref>
          ], configured with a maximum output limit of 80 tokens and a temperature setting of 0.3.
        </p>
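        <p>The generation setup described above (Groq API, 80-token output limit, temperature 0.3) can be sketched as follows using only the standard library. The decoding parameters come from the text; the endpoint URL, model identifier, and helper names are assumptions for illustration, not the authors' implementation.</p>

```python
# Sketch of zero-shot direct normalization via an OpenAI-compatible chat
# endpoint. Only max_tokens=80 and temperature=0.3 come from the paper;
# the URL and model ID below are illustrative assumptions.
import json
import urllib.request

API_URL = "https://api.groq.com/openai/v1/chat/completions"  # assumed endpoint

SYSTEM_PROMPT = (
    "You are an expert in misinformation detection and fact-checking. "
    "Your task is to identify the central claim in the given post while "
    "preserving its original language. Output only the central claim "
    "without additional explanation or formatting."
)

def build_payload(post: str, model: str = "llama-3.3-70b-versatile") -> dict:
    """Assemble the request body with the paper's decoding settings."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Post: {post}\n\nNormalized claim:"},
        ],
        "max_tokens": 80,     # output limit used in the experiments
        "temperature": 0.3,   # low temperature for stable normalizations
    }

def normalize(post: str, api_key: str) -> str:
    """Send one post to the chat endpoint and return the normalized claim."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(post)).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"].strip()
```

        <p>Separating payload construction from the network call keeps the prompt and decoding settings testable without an API key.</p>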
        <p>Table 9 presents the results for Mistral Saba, Llama 3.3 Instruct, and Llama 4 Scout
in the monolingual setting on the development set, using the zero-shot, direct, and summarization-based
normalization approaches. Among these, the direct normalization approach with Mistral Saba
achieves the highest average score on the development set. The lowest average score is observed with
Mistral Saba using the zero-shot approach. The difference between the highest and lowest average
scores is 0.083. We observe that all three models perform better with direct and
summarization-based normalization than in the zero-shot setting.</p>
        <p>Prompt template (Polish setting):
Your task is to simplify a noisy, unstructured social media post into a concise form while preserving the core assertion. You will be given a post and you need to generate a normalized claim. Please respond with the normalized claim.
The normalised claim must contain a maximum of 10 words or fewer. The normalised claim must be in the Polish language only.</p>
        <p>[Table 9: per-language scores (ara, deu, eng, hi, msa, pol, ...) for the Zero-shot, Direct, and Summarization approaches with Mistral Saba, Llama 3.3 Instruct, and Llama 4 Scout; the table layout was lost during extraction and the cell values cannot be reliably reconstructed.]</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] I. Vykopal, M. Pikuliak, I. Srba, R. Moro, D. Macko, M. Bielikova, Disinformation Capabilities of Large Language Models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 14830–14847. URL: https://aclanthology.org/2024.acl-long.793/. doi:10.18653/v1/2024.acl-long.793.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] A. Zugecova, D. Macko, I. Srba, R. Moro, J. Kopal, K. Marcincinova, M. Mesarcik, Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation, 2024. URL: https://arxiv.org/abs/2412.13666. arXiv:2412.13666. … using LLMs, in: D. Dementieva, O. Ignat, Z. Jin, R. Mihalcea, G. Piatti, J. Tetreault, S. Wilson, J. Zhao (Eds.), Proceedings of the Third Workshop on NLP for Positive Impact, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 13–31. URL: https://aclanthology.org/2024.nlp4pi-1.2/. doi:10.18653/v1/2024.nlp4pi-1.2.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[15] I. Vykopal, M. Pikuliak, S. Ostermann, M. Šimko, Generative Large Language Models in Automated Fact-Checking: A Survey, 2024. URL: https://arxiv.org/abs/2407.02351. arXiv:2407.02351.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[16] A. Kazemi, K. Garimella, D. Gaffney, S. A. Hale, Claim Matching Beyond English to Scale Global Fact-Checking, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Association for Computational Linguistics, Online, 2021, pp. 4504–4517. URL: https://aclanthology.org/2021.acl-long.347/. doi:10.18653/v1/2021.acl-long.347.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[17] A. Hrckova, R. Moro, I. Srba, J. Simko, M. Bielikova, Autonomation, not Automation: Activities and Needs of Fact-checkers as a Basis for Designing Human-Centered AI Systems, 2024. URL: https://arxiv.org/abs/2211.12143. arXiv:2211.12143.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[18] M. Sundriyal, T. Chakraborty, P. Nakov, From chaos to clarity: Claim normalization to empower fact-checking, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 6594–6609. URL: https://aclanthology.org/2023.findings-emnlp.439/. doi:10.18653/v1/2023.findings-emnlp.439.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[19] S. Shaar, F. Haouari, W. Mansour, M. Hasanain, N. Babulkov, F. Alam, G. Da San Martino, T. Elsayed, P. Nakov, Overview of the CLEF-2021 CheckThat! lab task 2 on detecting previously fact-checked claims in tweets and political debates (2021).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[20] S. Shaar, N. Babulkov, G. Da San Martino, P. Nakov, That is a Known Lie: Detecting Previously Fact-Checked Claims, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 3607–3618. URL: https://aclanthology.org/2020.acl-main.332/. doi:10.18653/v1/2020.acl-main.332.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[21] I. Larraz, R. Míguez, F. Sallicati, Semantic similarity models for automated fact-checking: ClaimCheck as a claim matching tool, Profesional de la Información 32 (2023).</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[22] S. Shaar, F. Alam, G. Da San Martino, P. Nakov, The Role of Context in Detecting Previously Fact-Checked Claims, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 1619–1631. URL: https://aclanthology.org/2022.findings-naacl.122/. doi:10.18653/v1/2022.findings-naacl.122.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[23] J. Ni, M. Shi, D. Stammbach, M. Sachan, E. Ash, M. Leippold, AFaCTA: Assisting the Annotation of Factual Claim Detection with Reliable LLM Annotators, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 1890–1912. URL: https://aclanthology.org/2024.acl-long.104/. doi:10.18653/v1/2024.acl-long.104.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[24] D. Metropolitansky, J. Larson, Towards Effective Extraction and Evaluation of Factual Claims, 2025. URL: https://arxiv.org/abs/2502.10855. arXiv:2502.10855.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[25] N. Reimers, I. Gurevych, Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, 2019. URL: https://arxiv.org/abs/1908.10084.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[26] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models, arXiv preprint arXiv:1612.03651 (2016).</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[27] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan, A. Yang, A. Fan, A. Goyal, A. Hartshorn, A. Yang, A. Mitra, A. Sravankumar, A. Korenev, A. Hinsvark, et al., The Llama 3 Herd of Models, 2024. URL: https://arxiv.org/abs/2407.21783. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[28] A. Bercovich, I. Levy, I. Golan, M. Dabbah, R. El-Yaniv, O. Puny, I. Galil, Z. Moshe, T. Ronen, N. Nabwani, et al., Llama-Nemotron: Efficient Reasoning Models, 2025. URL: https://arxiv.org/abs/2505.00949. arXiv:2505.00949.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          , et al.,
          <source>Qwen2 Technical Report</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.10671. arXiv:2407.10671.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lv</surname>
          </string-name>
          , et al.,
          <source>Qwen3 Technical Report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.09388. arXiv:2505.09388.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kamath</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ferret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vieillard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Merhej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Matejovicova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramé</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rivière</surname>
          </string-name>
          , et al.,
          <source>Gemma 3 Technical Report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2503.19786. arXiv:2503.19786.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>K.</given-names>
            <surname>Ociepa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Flis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wróbel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gwoździej</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kinas</surname>
          </string-name>
          ,
          <source>Bielik 11B v2 Technical Report</source>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2505.02410. arXiv:2505.02410.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>
          , in:
          <source>The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022</source>
          , OpenReview.net,
          <year>2022</year>
          . URL: https://openreview.net/forum?id=nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.19669. arXiv:2407.19669.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Üstün</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Aryabumi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-Y.</given-names>
            <surname>Ko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>D'souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Onilude</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bhandari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.- L.</given-names>
            <surname>Ooi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kayid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Vargus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Blunsom</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Muennighoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fadaee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kreutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hooker</surname>
          </string-name>
          ,
          <article-title>Aya Model: An Instruction Finetuned Open-Access Multilingual Language Model</article-title>
          , in:
          <string-name>
            <given-names>L.-W.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>15894</fpage>
          -
          <lpage>15939</lpage>
          . URL: https://aclanthology.org/2024.acl-long.845/. doi:10.18653/v1/2024.acl-long.845.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Long</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Decomposition dilemmas: Does claim decomposition boost or burden fact-checking performance?</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Albuquerque, New Mexico,
          <year>2025</year>
          , pp.
          <fpage>6313</fpage>
          -
          <lpage>6336</lpage>
          . URL: https://aclanthology.org/2025.naacl-long.320/. doi:10.18653/v1/2025.naacl-long.320.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [37]
          NVIDIA Corporation,
          <source>NIM Platform</source>
          ,
          <year>2023</year>
          . URL: https://developer.nvidia.com/nim.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [38]
          Meta AI,
          <source>Llama-4-Scout-17B-16E</source>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/meta-llama/Llama-4-Scout-17B-16E.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [39]
          Mistral AI,
          <source>Mistral-Small-24B-Base-2501</source>
          ,
          <year>2025</year>
          . URL: https://huggingface.co/mistralai/Mistral-Small-24B-Base-2501.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [40]
          Groq Inc.,
          <source>Quickstart - GroqDocs</source>
          ,
          <year>2025</year>
          . URL: https://console.groq.com/docs/quickstart, accessed: 2025-05-26.
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          1. Focus on the main message - Extract only the most important factual statement from the post.
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          2. Remove redundancy - Ignore repetition, extraneous details, and any irrelevant content (hashtags, usernames, etc.).
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>3. Keep it objective - Avoid opinions, judgments, or speculation.</mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          4. Use simple language - Rephrase complex or convoluted sentences into clear, direct statements.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          5. Formatting - Use ONLY this format: Normalized Claim: [your claim here]
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>