<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <title-group>
        <article-title>Factiverse and IAI at CheckThat! 2025: Adaptive ICL for Claim Extraction</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pratuat Amatya</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinay Setty</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Factiverse AS</institution>
          ,
          <addr-line>Prof. Olav Hanssens v. 7 A, 4021 Stavanger</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Stavanger</institution>
          ,
          <addr-line>Kjell Arholms gate 41, 4021 Stavanger</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>In this paper, we describe the methods and results of our participation in the 2025 CheckThat! Lab “Task 2: Claim Normalization”. The task aims to develop methods that extract concise claims from noisy, unstructured social media posts. It comprises two settings: monolingual and zero-shot. We focus on the monolingual setting, where training and test data are available for 13 different languages (e.g., English, German, and French), and employ fine-tuning-based methods, zero-shot prompting, and in-context learning (ICL) methods with both a fixed and an adaptive number of examples to improve the generation of normalized claims. While model performance was not consistent across all languages, we achieved notable placements on the organizers’ leaderboard using the fine-tuning-based method: fifth-best in English; fourth-best in German, French, and Indonesian; and sixth-best in Spanish and Portuguese. Our experiments highlight the effectiveness of both fine-tuning and ICL-based methods, which achieve comparable scores on the labeled test dataset and clearly outperform the zero-shot prompting baseline. The experiments also revealed challenges associated with the nature of social media data and limitations of the models we experimented with, providing insight into how the method could be improved for claim normalization.</p>
      </abstract>
      <kwd-group>
        <kwd>Claim Normalization</kwd>
        <kwd>Fact-checking</kwd>
        <kwd>In-Context Learning</kwd>
        <kwd>Fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The rapid proliferation of misinformation on social media platforms has created an urgent need for
effective automated fact-checking technologies. A foundational step in this pipeline is claim
normalization—the task of simplifying noisy and unstructured posts into concise, structured statements that
can be further verified [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. The CheckThat! Lab 2025 Task 2 focuses on this crucial preprocessing step,
particularly in a multilingual setting, where diverse linguistic and stylistic features pose additional
challenges [2].
      </p>
      <p>Claim normalization is inherently challenging due to the variability and informality of social media
language, including the use of slang, sarcasm, code-switching, and regional idioms. These
characteristics are further exacerbated in multilingual contexts, where linguistic diversity introduces additional
complexity. Thus, building robust claim normalization systems that generalize well across languages
and domains is of paramount importance.</p>
      <p>The 2025 CheckThat! Lab Task 2 provides a rigorous benchmark for this task, encompassing both
monolingual and zero-shot settings over 13 languages. In our submission, we address both tracks but
place particular emphasis on the monolingual scenario where annotated training data is available.</p>
      <p>We evaluate and compare three core approaches for claim normalization: (i) zero-shot prompting, (ii)
fine-tuning large language models (LLMs), and (iii) in-context learning (ICL). Within the ICL framework,
we experiment with two variants: Fixed In-Context Learning (FICL), where a constant number of
examples is included in every prompt, and Adaptive In-Context Learning (AICL), where the number of
examples is dynamically adjusted for each input based on semantic similarity. While prior work has
investigated fixed-K ICL, our key contribution lies in the implementation and empirical evaluation of
the adaptive variant, which leverages cosine similarity in a vector-based retrieval setup to tailor the
example set to each input. This enables more context-aware prompting without the need for model
retraining or handcrafted selection strategies.</p>
      <p>The novelty of our AICL approach lies in its use of a vector-based retrieval mechanism (ChromaDB)
combined with a cosine similarity heuristic to select contextually relevant examples from the training
data. Unlike traditional ICL methods that rely on a fixed number of examples, AICL tailors the quantity
and content of examples based on the semantic proximity of the input, striking a balance between
informativeness and cognitive load in the prompt. This adaptation mechanism allows the model to
operate more efficiently in heterogeneous data scenarios without requiring explicit model retraining or
extensive hyperparameter tuning.</p>
      <p>Our results demonstrate that both fine-tuning and ICL-based methods significantly outperform the
zero-shot baseline across multiple languages, with AICL matching the performance of FICL while
offering superior flexibility and scalability. Specifically, our system achieved top-6 leaderboard placements
in several languages, including fifth-best in English and fourth-best in German, French, and Indonesian.</p>
      <p>To provide a structural overview of this paper: in Section 2, we review related work on claim
normalization, emphasizing its significance in the context of fact-checking, together with an exploration of ICL
methods. Section 3 describes the dataset. Section 4 details our experimental setup, including architectural diagrams and formal problem
descriptions. The evaluation of four distinct approaches to claim normalization using standard metrics
is presented in Section 5, along with a discussion of the results. In Section 6, we outline directions for
future research. Finally, Section 7 summarizes our contributions and key findings. We believe these
contributions not only advance the state-of-the-art in claim normalization but also provide a reusable
framework for multilingual NLP tasks involving few-shot learning and LLM-based generation.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Automated Fact-Checking and Claim Detection</title>
        <p>Automated fact-checking has become an essential area of research in response to the growing spread
of misinformation across digital platforms. The typical fact-checking pipeline consists of multiple
stages, including claim detection, evidence retrieval, claim verification, and justification generation.
The process begins with claim detection, which identifies statements that require verification.</p>
        <p>Early efforts in this field established foundational frameworks for structuring automated fact-checking,
encompassing key stages such as claim detection, evidence retrieval, and verdict prediction [3]. These
frameworks have since evolved to address challenges in open-domain fact-checking and to incorporate
the role of contextual information in verification [4].</p>
        <p>An important sub-area of research focuses on identifying check-worthy claims. Early works focused
on feature engineering for claim detection [5]. CLEF CheckThat! has had several editions of a claim detection
task [6]. The best solutions have used fine-tuned transformers and LLMs [7]. However, claim
extraction is akin to abstractive summarization, which is a fundamentally different task than claim
detection.</p>
        <p>Recent advances have also highlighted the practical implementation of fact-checking systems.
Fine-tuned transformer-based models have demonstrated strong performance in multilingual fact-checking
scenarios, often outperforming larger language models [8]. Furthermore, integrated tools that combine
claim detection and verification within user-friendly interfaces have been developed to support the
deployment of automated fact-checking in real-world applications [9].</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Claim Normalization and Summarization</title>
        <p>Claim normalization, the process of transforming verbose or unstructured text into concise,
check-worthy claims, is crucial for effective fact-checking. This task shares similarities with abstractive
summarization, where the goal is to generate a coherent summary capturing the essence of the source
text.</p>
        <p>
          Recently, Sundriyal et al. [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ] introduced the Check-worthiness Aware Claim Normalization (CACN) framework, which combines
chain-of-thought prompting with claim check-worthiness estimation to decompose complex social media
posts into normalized claims. Their CLAN dataset provides
valuable resources for training and evaluating claim normalization systems.
        </p>
        <p>The analogy between claim normalization and summarization lies in their shared objective of distilling
essential information. However, claim normalization is more constrained, focusing on extracting factual
statements suitable for verification, whereas summarization may include interpretative content.</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. In-Context Learning and Adaptive Example Selection</title>
        <p>ICL has gained prominence with the advent of LLMs, allowing models to perform tasks by conditioning
on a few examples provided in the prompt. Traditional ICL methods often use a fixed number of
examples (fixed-K ICL), which may not be optimal for all inputs.</p>
        <p>Chandra et al. [10] proposed an adaptive approach that predicts the optimal number of in-context
examples based on the input, leading to significant improvements in text classification tasks. Their
AICL method dynamically adjusts the number of examples, enhancing performance without extensive
hyperparameter tuning.</p>
        <p>Our work builds upon these insights by implementing an AICL approach for claim normalization. By
dynamically selecting the number and content of in-context examples based on semantic similarity
thresholds, our method balances the trade-off between providing sufficient contextual information and
avoiding prompt overload. This enhances the quality of generated normalized claims, contributing to
the development of more robust and scalable fact-checking systems.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>Brief statistics on the dataset provided for the CheckThat! 2025 claim normalization task (Task 2) are presented
in Table 1. The dataset is organized into two setups: i) monolingual and ii) zero-shot. In the
monolingual setup, training, development, and test datasets are provided for 13 different languages, and
models are trained, validated, and tested in isolation on one particular language, the intent being
that the model learns language-specific patterns and structures. In the zero-shot setup, only a test dataset
is provided for 7 different languages; the ambition here is to evaluate the generalization capability of a
model on unseen languages.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>We conducted a series of experiments on claim normalization using four distinct approaches: (i)
zero-shot prompting (serving as the baseline), (ii) a fine-tuned model approach, (iii) FICL, and (iv) AICL.
The task involved a multilingual dataset comprising labeled training, development, and test sets
spanning 13 languages. The training set was utilized for model fine-tuning as well as for constructing
example sets in the ICL methods, while the development and test sets were reserved for validation and
final evaluation. Performance was evaluated using the average METEOR score calculated between the
generated claims and their corresponding reference normalized claims.</p>
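      <p>As a concrete illustration, the following is a minimal sketch of how the average METEOR score can be computed with NLTK; the helper name average_meteor and the whitespace tokenization are our own simplifications, not the official task scorer.</p>
      <preformat>
# Minimal sketch of the evaluation metric, assuming NLTK with the
# 'wordnet' resource installed; whitespace tokenization is a simplification.
from nltk.translate.meteor_score import meteor_score

def average_meteor(generated, references):
    """Mean METEOR over parallel lists of generated and reference claims."""
    scores = [
        meteor_score([ref.split()], gen.split())
        for gen, ref in zip(generated, references)
    ]
    return sum(scores) / len(scores)
      </preformat>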
      <sec id="sec-4-1">
        <title>4.1. Zero-shot Prompting</title>
        <p>This approach involves using a carefully designed prompt (Figure 6) with explicit instructions to perform
claim normalization, without providing any in-context examples.</p>
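        <p>To make the setup concrete, here is a small sketch of how such a prompt can be assembled; the template paraphrases the zero-shot prompt in Appendix A, the constant and function names are ours, and the downstream LLM call is omitted.</p>
        <preformat>
# Sketch of zero-shot prompt construction; the template is paraphrased
# from Appendix A, and all names are illustrative.
ZERO_SHOT_TEMPLATE = """You are a helpful AI assistant. Given a noisy and
unstructured social media post, rewrite it as a simple and concise statement.

Produce a concise statement for the following post (delimited by ###).
The original language of the post is {language}.

###
{post}
###

Always produce a valid JSON string as a final output using the format below.
{{"normalized_claim": &lt;generated normalized claim translated in {language}&gt;}}"""

def build_zero_shot_prompt(post: str, language: str) -> str:
    return ZERO_SHOT_TEMPLATE.format(post=post, language=language)
        </preformat>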
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Fine-tuning</title>
        <p>Building upon the provided baseline, which fine-tunes the mT5 model—a multilingual
sequence-to-sequence (Seq2Seq) transformer model—we extended our experimentation by fine-tuning the
google/flan-t5-large [11] model on the available training data. The flan-t5-large model
is a variant of T5 [12] that has been instruction-tuned to follow natural language instructions more
effectively, which we hypothesized would be advantageous for the claim normalization task.</p>
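        <p>A minimal sketch of this fine-tuning setup is shown below, assuming the Hugging Face transformers library; the column names, hyperparameters, and the pre-tokenized train_dataset/dev_dataset objects are illustrative assumptions, not our exact configuration.</p>
        <preformat>
# Sketch of Seq2Seq fine-tuning for claim normalization with Hugging Face
# transformers; hyperparameters and dataset columns are illustrative.
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def preprocess(batch):
    # "post" and "normalized_claim" are assumed column names.
    inputs = tokenizer(batch["post"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["normalized_claim"],
                       max_length=128, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

# train_dataset / dev_dataset: per-language splits mapped with `preprocess`.
trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-claim-norm",
                                  per_device_train_batch_size=4,
                                  learning_rate=3e-5,
                                  num_train_epochs=5,
                                  predict_with_generate=True),
    train_dataset=train_dataset,
    eval_dataset=dev_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
        </preformat>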
        <p>However, one key challenge in this task is that not all target languages had sufficient training data
available. For languages where training data was provided, the fine-tuned flan-t5-large model
performed reasonably well, demonstrating its capacity to generalize to this specific task. In contrast, for
languages lacking annotated training data, model performance was inherently limited by the absence
of in-language supervision.</p>
        <p>Although larger variants of the model, such as google/flan-t5-xl and google/flan-t5-xxl,
were considered for experimentation, they could not be fine-tuned in our current setup due to GPU
memory constraints and computational resource limitations.</p>
        <p>To address the data scarcity issue for low-resource languages, we explored data augmentation through
machine translation. In this approach, training data from high-resource languages was translated into
the low-resource target languages to create synthetic training data. This method aimed to enhance
the model’s ability to perform cross-lingual generalization and improve performance in languages
for which no direct training data was provided. While this translation-based strategy is not without
limitations—such as potential translation artifacts and domain mismatch—it proved to be a promising
direction for improving coverage and robustness of our multilingual fact-checking pipeline.</p>
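        <p>The augmentation step can be sketched as follows, using a generic MT model through the transformers pipeline; the OPUS-MT checkpoint named here is only an example of a high-resource-to-target translator, not necessarily the system we used.</p>
        <preformat>
# Sketch of translation-based data augmentation; the MT checkpoint is an
# example (English-to-German), not necessarily the one used in our runs.
from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")

def augment(pairs_en):
    """Translate (post, claim) pairs into the target language to create
    synthetic training data for a lower-resource language."""
    synthetic = []
    for post, claim in pairs_en:
        post_t = translator(post, max_length=512)[0]["translation_text"]
        claim_t = translator(claim, max_length=128)[0]["translation_text"]
        synthetic.append((post_t, claim_t))
    return synthetic
        </preformat>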
      </sec>
      <sec id="sec-4-3">
        <title>4.3. In-Context Learning (ICL)</title>
        <p>A key strength of LLMs lies in their ability to leverage examples and instructions embedded within the
input prompt. ICL leverages this capability by incorporating task-relevant examples into the prompt
(Figure 7) to guide generation. Our ICL strategies were inspired by the work of Chandra et al. [10], who
demonstrated that dynamically adjusting the number of in-context examples can substantially improve
performance in text classification tasks. Their AICL method showed an improvement of 17% over the
fixed-example baseline.</p>
        <p>In Figure 1 we illustrate a block diagram of our ICL approach to claim normalization. As a preparatory
step, we use the vector database ChromaDB to store training examples and retrieve relevant samples for
a query input. ChromaDB utilizes the all-MiniLM-L6-v2 [13] model from Sentence Transformers as its
default embedding function, which generates contextualized and semantically meaningful sentence
embeddings. These embeddings capture sentence-level semantics and are well suited for similarity
search tasks. We used cosine distance as the dissimilarity measure to retrieve top-ranked examples
from the vector store for any given input and applied the FICL and AICL approaches accordingly, as sketched below.</p>
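        <p>The following sketch illustrates this retrieval layer with ChromaDB’s Python client (cosine space, default all-MiniLM-L6-v2 embeddings); collection and field names such as train_examples and claim are our own choices, and train_posts/train_claims stand for parallel lists from the training split.</p>
        <preformat>
# Sketch of the ChromaDB retrieval layer; identifiers are illustrative.
import chromadb

client = chromadb.Client()
collection = client.create_collection(
    name="train_examples",
    metadata={"hnsw:space": "cosine"},  # rank neighbours by cosine distance
)

# Index training posts; gold normalized claims travel along as metadata.
collection.add(
    ids=[str(i) for i in range(len(train_posts))],
    documents=train_posts,
    metadatas=[{"claim": c} for c in train_claims],
)

def retrieve(input_post, m=9):
    """Return up to m (post, claim, distance) triples, most similar first."""
    res = collection.query(query_texts=[input_post], n_results=m)
    return list(zip(res["documents"][0],
                    [md["claim"] for md in res["metadatas"][0]],
                    res["distances"][0]))
        </preformat>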
        <sec id="sec-4-3-1">
          <title>4.3.1. Fixed In-Context Learning (FICL)</title>
          <p>The FICL approach inserts a fixed number K of examples selected from the training set based on their
similarity to the input. The K most similar examples are injected into the prompt (Figure 7) to generate
a normalized claim. Formally, the posterior probability of generating the true claim can be expressed as
$P(y \mid x, K) = F(x, N_K(x); \theta_{\mathrm{LLM}})$ (1)
where $x$ is the input text, $N_K(x)$ is the example set of the K documents most similar to $x$,
and $\theta_{\mathrm{LLM}}$ are the decoder parameters of the pretrained LLM.</p>
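          <p>A sketch of the corresponding selection and prompt-formatting step is given below, reusing the retrieve helper from the previous sketch; K = 4 is shown because it was the best dev-set value, and the formatting is a simplification of the {examples} slot in the ICL prompt of Appendix A.</p>
          <preformat>
# Sketch of FICL (Equation 1): keep the K most similar training examples.
def select_ficl_examples(input_post, k=4):
    return retrieve(input_post, m=9)[:k]  # results are distance-sorted

def format_examples(selected):
    """Render examples for the {examples} slot of the ICL prompt."""
    return "\n".join(f"Post: {post}\nNormalized claim: {claim}"
                     for post, claim, _dist in selected)
          </preformat>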
          <p>We ran the generation of normalized claims with FICL using different values of K, ranging from
0 (zero-shot) to an upper bound of M = 9 examples, and computed the average METEOR score of the
generated claims against the test labels. The results, illustrated in Figure 2 and Table 3, are discussed
in Section 5.</p>
        </sec>
        <sec id="sec-4-3-2">
          <title>4.3.2. Adaptive In-Context Learning (AICL)</title>
          <p>The main idea behind the AICL approach is to dynamically determine the number of examples based on
similarity metrics between the input and candidate examples, in contrast to the fixed size of the example
set in FICL. Our implementation of AICL adopts a simple approach instead of a prediction model as
proposed by Chandra et al. [10]. Instead of using a trained model to predict the optimal number of
examples K, we applied a heuristic based on cosine distance thresholds. For each test instance, we
selected all training examples whose cosine distances to the input are below a predefined threshold. This
approach eliminates the need for a dedicated prediction model while maintaining the adaptability of
AICL. Formally, the posterior probability of generating the normalized claim can be expressed as Equation 2
$P(y \mid x, \tau) = F(x, N_\tau(x); \theta_{\mathrm{LLM}})$ (2)
where $\tau$ is the cosine distance threshold that is used to filter similar documents to generate an example
set. The notable difference here is that while $N_K(x)$ in Equation 1 is a set of examples of fixed size K,
$N_\tau(x)$ returns a set of examples of varying size based on $\tau$.</p>
          <p>We experimented with cosine distance threshold values ranging from 0.00 to 1.50 in increments of
0.05 (similar to a hyperparameter grid search) and evaluated the corresponding average METEOR scores
to determine the best-performing threshold. The results, obtained using the labeled dev dataset, are illustrated in
Figure 3 and Table 4 and discussed in Section 5.</p>
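          <p>Both the threshold filter and the sweep are easy to express on top of the earlier sketches; in the sketch below, generate stands in for the LLM call with the ICL prompt, and dev_posts/dev_claims for the labeled dev split.</p>
          <preformat>
# Sketch of AICL (Equation 2) plus the threshold sweep; `generate`,
# `dev_posts`, and `dev_claims` are stand-ins, and the candidate pool
# is capped at 9 retrieved examples for this illustration.
def select_aicl_examples(input_post, tau):
    """Keep every retrieved example closer than the cosine threshold tau,
    so the example-set size varies per input."""
    return [(post, claim, dist)
            for post, claim, dist in retrieve(input_post, m=9)
            if dist &lt; tau]

best_tau, best_score = None, -1.0
for step in range(31):                      # tau = 0.00, 0.05, ..., 1.50
    tau = round(0.05 * step, 2)
    outputs = [generate(post, select_aicl_examples(post, tau))
               for post in dev_posts]
    score = average_meteor(outputs, dev_claims)
    if score > best_score:
        best_tau, best_score = tau, score
          </preformat>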
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
      <p>We present results on the accuracy of the approaches in Table 2. Here, we focus our results on the
English, German, French, Spanish, and Portuguese languages, for which we report an average METEOR
score for the normalization of the test datasets with gold labels. Among the four approaches we
tried, the Google flan-t5-large model fine-tuned on training data showed superior performance
across the majority of languages except Portuguese, while the zero-shot approach performed the worst.
Between the ICL-based methods, FICL consistently performs better than AICL across
all languages, even surpassing the fine-tuned method in Portuguese, although the performance gap between FICL
and AICL there is marginal. Regarding language-specific trends, English and
Spanish have higher scores across all approaches, suggesting easier generalization or better training
data in these languages. German shows relatively lower overall performance, possibly due to an inadequate
training dataset. We further analyze the results of the ICL methods using the dev dataset in the following
section.</p>
      <sec id="sec-5-1">
        <title>5.1. Fixed In-Context Learning (FICL)</title>
        <p>More detailed results of the FICL method on the dev dataset are presented in Figure 2 and Table 3. The
zero-shot baseline achieved an average METEOR score of 0.24. In contrast, FICL attained a peak score
of 0.43 when the number of in-context examples K was set to 4 (Table 3), an approximate 80%
improvement over the baseline. As illustrated in Figure 2, incorporating a small number of relevant
examples leads to substantial gains in generation quality. However, increasing the number of
examples beyond this optimal point results in diminishing performance, likely due to the introduction
of noise and less relevant contextual information.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Adaptive In-Context Learning (AICL)</title>
        <p>The AICL configuration achieved its highest performance with an average METEOR score of 0.40 at a
cosine distance threshold of 0.75 (Figure 3). This result is comparable to the optimal score observed
in the FICL setting, indicating that AICL can approach peak performance without requiring a fixed
number of in-context examples.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Future Work</title>
      <p>Among the approaches we have evaluated (as in Table 2), we assume that the margin of improvement
for the zero-shot method and the FICL method is narrow due to their reduced model complexity, and
therefore we want to prioritize further investigation of the fine-tuned models and the AICL method.</p>
      <sec id="sec-6-1">
        <title>6.1. Fine-tuned model</title>
        <p>To further enhance the performance of the fine-tuning approach, several avenues can be explored in
future work. One promising direction is to conduct systematic hyperparameter tuning of key training
parameters which were not exhaustively optimized in our current setup. Fine-tuning these parameters
could lead to more stable convergence and improved generalization across languages. Additionally,
while our experiments were limited to the flan-t5-large model due to resource constraints, larger variants
such as flan-t5-xl and flan-t5-xxl offer greater representational capacity and could potentially yield
significant performance gains. With access to enhanced computational resources, fine-tuning these
larger models on multilingual training data may further improve the quality of claim normalization,
especially in high-resource language settings.</p>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Adaptive In-Context Learning (AICL)</title>
        <p>To further analyze the effectiveness of the AICL approach, we compared the example-set
size (the K value for each test data point) produced by the AICL method against the ideal example-set size that would
have resulted in the best claim normalization performance using an identical ICL prompt and underlying
model, henceforth referred to as the oracle method. We compared the example-set size statistics of the
best-performing AICL method and the oracle method visually using a heatmap, as in Figure 4 (a sketch of the
oracle computation follows). We outline the following notes for future work in this direction.</p>
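        <p>The oracle can be computed per instance by brute force, as in the following sketch that reuses the earlier helpers (retrieve, generate, average_meteor); it is an illustration of the analysis, not the exact script used.</p>
        <preformat>
# Sketch of the per-instance oracle: try every K and keep the best one.
def oracle_k(post, reference, max_k=9):
    best_k, best = 0, -1.0
    candidates = retrieve(post, m=max_k)
    for k in range(max_k + 1):
        output = generate(post, candidates[:k])
        score = average_meteor([output], [reference])
        if score > best:
            best_k, best = k, score
    return best_k
        </preformat>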
        <list list-type="bullet">
          <list-item>
            <p>The expectation was to see high datapoint counts along the diagonal of the heatmap in Figure 4 if
AICL were able to predict the best K values (example-set sizes). We realize that this is not evident
from the heatmap. We also visually compared the distribution of predicted K values with that of
the best-scoring K values (oracle method) using Figure 5 and observed a significant divergence
between the two distributions. Hence, this analysis does not support AICL’s effectiveness.
However, since the average METEOR score improved significantly compared to the zero-shot
baseline, as in Figure 3, we maintain our confidence in the approach and believe it deserves further
investigation.</p>
          </list-item>
          <list-item>
            <p>We also notice a region in the heatmap in Figure 4 (the bright yellow region at the top-left) that represents
a significant number of instances where including semantically dissimilar examples seems to also
have a positive effect on claim normalization, which is counterintuitive. It may hint that there may
exist a mechanism other than semantic similarity that could be used for example selection and
improved performance, making these cases worth investigating.</p>
          </list-item>
          <list-item>
            <p>Due to constraints in time and computational resources, we were unable to comprehensively
evaluate the effectiveness of larger or more advanced language models for claim normalization.
However, in preliminary testing on a limited set of samples, we observed that using
Qwen8B, a model known for superior reasoning capabilities and instruction-following performance
compared to Mistral-7B, led to a notable improvement in the METEOR scores of generated claims.
Based on these encouraging results, we plan to further evaluate the AICL method using more
advanced models such as DeepSeek-R1, Qwen3-32B, and Llama 4 in future work.</p>
          </list-item>
        </list>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>In this study, we evaluated four approaches to claim normalization in a multilingual setting: zero-shot
prompting, fine-tuning, FICL, and AICL. Among these, the fine-tuning-based method yielded the highest
performance across all evaluated languages except for Portuguese, followed closely by FICL and AICL
methods. The zero-shot approach consistently underperformed relative to the others.</p>
      <p>Our experiments demonstrate the effectiveness of fine-tuning LLMs such as Flan-T5, particularly
when adequate training data is available. However, we recognize that the fine-tuning approach can be
further enhanced through systematic hyperparameter tuning and the use of more capable pre-trained
models.</p>
      <p>Although AICL did not outperform FICL in our current setup, we believe it remains a promising
approach. Its dynamic example selection mechanism enables context-aware prompting without additional
model training. Moreover, AICL’s performance, while slightly behind FICL, surpassed the zero-shot
baseline by a significant margin. Our analysis reveals opportunities for improving example selection
strategies—potentially beyond semantic similarity—to boost AICL’s effectiveness. These insights, along
with encouraging early results from more advanced models, motivate further exploration of AICL using
larger architectures such as DeepSeek-R1, Qwen3-32B, and LLaMA 4 in future work.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT and Overleaf in order to perform
grammatical refinements. After using these tool(s)/service(s), the author(s) reviewed and edited the
content as needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Prompts used for Zero-shot and ICL experiments</title>
      <p>Zero-shot prompt (Figure 6):</p>
      <preformat>
You are a helpful AI assistant. Given a noisy and unstructured social media
post, rewrite it as a simple and concise statement.

Produce a concise statement for the following post (delimited by ###).
The original language of the post is {language}.

###
{post}
###

Always produce a valid JSON string as a final output using the format below.

{{
"normalized_claim": &lt;generated normalized claim translated in {language}&gt;
}}
      </preformat>
      <p>ICL prompt (Figure 7):</p>
      <preformat>
You are a helpful AI assistant. Given a noisy and unstructured social media
post, rewrite it as a simple and concise statement.

Below are some examples of the task intended with input post and expected outcome.

————- Examples ————
{examples}
——— End of Examples ———

Produce a concise statement for the following post (delimited by ###).
The original language of the post is {language}.

###
{post}
###

Always produce a valid JSON string as a final output using the format below.

{{
"normalized_claim": &lt;generated normalized claim translated in {language}&gt;
}}
      </preformat>
    </sec>
    <sec id="sec-10">
      <title>B. Average METEOR scores obtained for Zero-shot and ICL experiments</title>
      <p>Average METEOR scores on the dev dataset for the zero-shot baseline and the AICL cosine distance
threshold sweep (τ in increments of 0.05; τ = 0.00 admits no examples and matches the zero-shot baseline):</p>
      <p>τ = 0.00: 0.241939; τ = 0.05: 0.368395; τ = 0.10: 0.368404; τ = 0.15: 0.371689;
τ = 0.20: 0.376016; τ = 0.25: 0.377882; τ = 0.30: 0.381005; τ = 0.35: 0.384421;
τ = 0.40: 0.386896; τ = 0.45: 0.388922; τ = 0.50: 0.391500; τ = 0.55: 0.392321;
τ = 0.60: 0.394702; τ = 0.65: 0.397776; τ = 0.70: 0.399228; τ = 0.75: 0.401437;
τ = 0.80: 0.400422; τ = 0.85: 0.400107; τ = 0.90: 0.400828; τ = 0.95: 0.399999;
τ = 1.00: 0.396168; τ = 1.05: 0.396700; τ = 1.10: 0.390203; τ = 1.15: 0.389940;
τ = 1.20: 0.391778; τ = 1.25: 0.394388; τ = 1.30: 0.393908; τ = 1.35: 0.394985;
τ = 1.40: 0.394675; τ = 1.45: 0.394162; τ = 1.50: 0.394125; τ = 1.55: 0.393938</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sundriyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>From chaos to clarity: Claim normalization to empower fact-checking</article-title>
          , in:
          <source>Findings of the Association for Computational Linguistics: EMNLP 2023</source>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] M. Sundriyal, T. Chakraborty, P. Nakov, Overview of the CLEF-2025 CheckThat! lab task 2 on claim normalization, 2025.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] A. Vlachos, S. Riedel, Fact checking: Task definition and dataset construction, in: Proceedings of the ACL 2014 Workshop on Language Technologies and Computational Social Science, 2014, pp. 18–22.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] J. Thorne, A. Vlachos, Automated fact checking: Task formulations, methods and future directions, in: Proceedings of the 27th International Conference on Computational Linguistics (COLING 2018), Association for Computational Linguistics, 2018, pp. 3346–3359. doi:10.18653/v1/C18-1283.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] N. Hassan, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, A. K. Nayak, V. Sable, C. Li, M. Tremayne, ClaimBuster: The first-ever end-to-end fact-checking system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948. doi:10.14778/3137765.3137815.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, et al., Overview of the CLEF-2024 CheckThat! lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: International Conference of the Cross-Language Evaluation Forum for European Languages, Springer, 2024, pp. 28–52.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] P. R. Aarnes, V. Setty, P. Galuščáková, IAI group at CheckThat! 2024: Transformer models and data augmentation for checkworthy claim detection, in: Notebook for the CheckThat! Lab Task 1 at CLEF 2024, CEUR Workshop Proceedings, Grenoble, France, 2024.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] V. Setty, Surprising efficacy of fine-tuned transformers for fact-checking over larger language models, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, 2024, pp. 2842–2846.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] V. Setty, FactCheck Editor: Multilingual text editor with end-to-end fact-checking, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’24, 2024, pp. 2744–2748.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] M. Chandra, D. Ganguly, I. Ounis, One size doesn’t fit all: Predicting the number of examples for in-context learning, in: Advances in Information Retrieval, 2025.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scaling instruction-finetuned language models, J. Mach. Learn. Res. (2024).</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, J. Mach. Learn. Res. (2020).</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] W. Wang, F. Wei, L. Dong, H. Bao, N. Yang, M. Zhou, MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, 2020.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>