<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>DSHacker at CheckThat! 2024: LLMs and BERT for Check-Worthy Claims Detection with Propaganda Co-occurrence Analysis</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Paweł Golik</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arkadiusz Modzelewski</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Aleksander Jochym</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Polish-Japanese Academy of Information Technology</institution>
          ,
          <country country="PL">Poland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Padua</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>This paper presents our approach to check-worthiness detection, one of the main tasks in the CheckThat! Lab 2024 at the Conference and Labs of the Evaluation Forum. The challenge was to create a system to determine whether a claim found in Dutch and Arabic tweets or English debate snippets needs fact-checking. We explored fine-tuning pre-trained BERT-based models and employing a few-shot prompting technique with OpenAI GPT models. Our study compared monolingual models based on the BERT architecture with a multilingual XLM-RoBERTa-large model capable of processing data in multiple languages. Additionally, we investigated the link between propaganda detection and the check-worthiness of content. We also incorporated the recently released OpenAI GPT-4o model. Our systems' impressive performance, surpassing baseline results across all languages, is highlighted by our high-ranking positions: 3rd in Arabic, 2nd in Dutch, and 8th in English, with even better outcomes in post-deadline experiments.</p>
      </abstract>
      <kwd-group>
        <kwd>Check-Worthiness</kwd>
        <kwd>Fact-Checking</kwd>
        <kwd>XLM-RoBERTa</kwd>
        <kwd>GPT-3.5</kwd>
        <kwd>GPT-4o</kwd>
        <kwd>Propaganda</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>1.1. Problem Overview</title>
        <p>Nowadays, information spreads from many online sources. As a result, it is crucial to ensure the
information is accurate, as it affects public discourse and people’s decisions. The spread of disinformation
and misinformation threatens the integrity of public discussions, impacting areas such as news reporting,
political debates, and social media interactions. It is therefore essential to have solid fact-checking
systems.</p>
        <p>
          Fact-checking is not just about verifying that something is true. It is also essential for helping
people make good decisions based on accurate information and ensuring that the shared information is
trustworthy [
          <xref ref-type="bibr" rid="ref1 ref12">1</xref>
          ]. Such efforts help create an environment where people can have thoughtful discussions
and make deliberate choices. Due to the recent rapid advancements in Artificial Intelligence, automated
systems can now offer assistance and support the fact-checking process.
        </p>
      </sec>
      <sec id="sec-1-2">
        <title>1.2. Task Description</title>
        <p>
          CheckThat! Lab at CLEF 2024 addresses issues that aid research and decision-making throughout the
fact-checking process [
          <xref ref-type="bibr" rid="ref13 ref2">2</xref>
          ]. In the initial editions, CheckThat! Lab focused on developing an automated
system to assist journalist fact-checkers during the key stages of the text verification process, which
follows a structured pipeline [
          <xref ref-type="bibr" rid="ref3 ref4">3, 4, 5, 6, 7</xref>
          ]:
1. Assessing whether a document or claim is check-worthy, i.e., determining if its veracity should
be checked by a journalist.
2. Retrieving previously verified claims that could aid in fact-checking the current claim.
3. Gathering further evidence from the Web, if necessary, to support the verification.
4. Making a final decision on the factual accuracy of the claim based on the collected evidence.
        </p>
        <p>
          CheckThat! Lab Task 1 at CLEF 2024 focuses on the first step of the pipeline. Its goal is to provide
an automated system for deciding whether a tweet or transcription claim needs fact-checking [
          <xref ref-type="bibr" rid="ref13 ref2">2, 8</xref>
          ].
Traditionally, this decision involves experts or human reviewers considering whether the claim can be
proven true and if it could cause harm before labeling it as worth checking [
          <xref ref-type="bibr" rid="ref13 ref2">2, 8</xref>
          ].
        </p>
        <p>In this scenario, we are dealing with a binary classification task. For each instance, which is a short
text like a tweet or a caption from a political debate transcription, we aim to predict one of two labels:
"Yes" or "No," indicating whether the text is worth fact-checking.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.3. Our Experiments</title>
        <p>Our approach to check-worthiness detection involved experimenting with both monolingual and
multilingual models, as well as leveraging Large Language Models (LLMs) through few-shot prompting.
For monolingual models, we fine-tuned BERT-based architectures tailored to specific languages like
English, Dutch, and Arabic. In the multilingual approach, we utilized the XLM-RoBERTa-large model,
fine-tuning it on combined datasets from multiple languages. Additionally, we employed OpenAI’s
GPT-3.5 and the recently released GPT-4o models for few-shot prompting, which demonstrated results
comparable to fine-tuned BERT-based models. Furthermore, we explored the relationship between
propaganda detection and check-worthiness by fine-tuning models on a propaganda detection dataset
before applying them to the check-worthiness task. Our comprehensive experiments enabled us to
surpass baseline results and achieve high-ranking positions across different languages: 3rd in Arabic,
2nd in Dutch, and 8th in English, with even better outcomes in post-deadline experiments.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Detecting disinformation has become a crucial area of research. Researchers work not only on
disinformation in general but also address particular challenges associated with identifying disinformation,
misinformation, and fake news. One such challenge is determining which claims are worthy of checking
[7]. Hassan et al. [9] introduced a dataset from U.S. presidential debates and created classification
models to differentiate among three distinct categories: check-worthy factual claims, non-factual claims,
and insignificant factual claims. Jaradat et al. [10] developed ClaimRank, an online tool designed to
identify check-worthy claims, with support for two languages: English and Arabic. Kartal and Kutlu
[11] proposed a hybrid model which combines BERT with various features to prioritize claims based on
their check-worthiness.</p>
      <p>
        The detection of check-worthy claims has also been a research focus in past years within CheckThat!
Labs [
        <xref ref-type="bibr" rid="ref1 ref12">12, 13, 1, 7, 14</xref>
        ]. Alam et al. [
        <xref ref-type="bibr" rid="ref1 ref12">1</xref>
        ] introduced a task in 2023 for check-worthiness detection in
multimodal and multigenre content with a multilingual dataset with three languages: Arabic, Spanish,
and English. Team ES-VRAI proposed different methods based on pre-trained transformer models and
sampling techniques for detecting check-worthiness in multigenre content [15]. Their approach
resulted in the first position for the Arabic language [15]. Team OpenFact was the best-performing team
on English [
        <xref ref-type="bibr" rid="ref1 ref12">1</xref>
        ]. They fine-tuned the GPT-3 curie model using more than 7K instances of sentences from
debates and speeches annotated for check-worthiness [16]. In Spanish, the best performance was
achieved by Team DSHacker [17]. Modzelewski et al. [17] presented a system based on
fine-tuning XLM-RoBERTa on all languages with additional data augmentation. For data augmentation,
Team DSHacker utilized the GPT-3.5 model for translating and paraphrasing the available training data [17].
      </p>
      <p>[Table 1: dataset split sizes for TRAIN, DEV, DEV-TEST, and TEST across the EN, ES, NL, and AR languages.]</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The dataset consists of texts and their corresponding gold labels annotated by human experts, forming
a multilingual dataset. It includes data in four languages: English, Spanish, Dutch, and Arabic. The
dataset was divided into training, validation (dev), and dev-test splits with gold labels.
We made final predictions for the test datasets published in English, Dutch, and Arabic. The
English texts depict captions from political debates, while in the other languages, they represent tweets.
In most datasets, the classes are imbalanced, with a predominance of texts that are not check-worthy.
Table 1 provides detailed information about the datasets. For more information, refer to Hasanain et al.
[8].</p>
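The class imbalance noted above can be illustrated with a quick label count; the labels here are toy data, not the actual dataset:

```python
# Sketch: measuring the label distribution of a split. The labels below
# are illustrative, not drawn from the CheckThat! data.
from collections import Counter

labels = ["No", "No", "No", "Yes", "No", "Yes", "No", "No"]

counts = Counter(labels)  # frequency of each gold label
not_check_worthy_ratio = counts["No"] / len(labels)
# With a predominance of "No" labels, plain accuracy becomes misleading,
# which is one reason the F1 score is commonly preferred for imbalanced data.
```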
    </sec>
    <sec id="sec-4">
      <title>4. Our Approach</title>
      <p>We experimented with two approaches for text classification: fine-tuning BERT-based models and
utilizing few-shot prompting with GPT models, including the recently released GPT-4o model. These
methodologies are both widely used today and represent the state-of-the-art in the industry [18].
However, deciding whether fine-tuning or in-context learning yields better results is not trivial. Many
scientists have addressed the challenge of comparing the two approaches fairly [18].</p>
      <sec id="sec-4-1">
        <title>4.1. Fine-tuning BERT-based Models</title>
        <p>To provide a comprehensive overview of the effectiveness of BERT-based models in identifying
check-worthy content across multiple languages, we conducted experiments using two types of models. First,
we employed monolingual models that we fine-tuned exclusively on the training data from a single
language. Secondly, we used multilingual models that we fine-tuned on different combinations of the
training datasets from multiple languages. Our experiments allowed us to compare the performance of
both multilingual and monolingual models in the context of check-worthiness detection.</p>
        <p>We started with comprehensive hyperparameter tuning for all chosen models (mono- and
multilingual). This involved fine-tuning each model on the training dataset using every combination of
hyperparameter values we specified. We used the dev dataset for validation. Then, we
assessed each model’s performance by measuring the F1 score to determine the most effective hyperparameter
values.</p>
        <p>To obtain the final models for submission, we merged the training and validation (dev) datasets
to form the ultimate training dataset. We then retrained the model with the best hyperparameter values
using the dev-test dataset as the final validation set. Please refer to Appendix A, which presents the
optimal hyperparameters of each final submission.</p>
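The search-then-select procedure above can be sketched as follows. The grid values and the scoring function are illustrative placeholders, not the paper's actual hyperparameters or training code; a real run would fine-tune a BERT-based model for each combination and return its F1 score on the dev split.

```python
# Sketch of exhaustive hyperparameter search over a specified grid,
# keeping the combination with the best validation F1 score.
from itertools import product

# Hypothetical search space -- the exact values are not given in the paper.
GRID = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "epochs": [3, 5],
}

def train_and_score(params):
    """Stand-in for fine-tuning on TRAIN and scoring F1 on DEV.

    A real implementation would fine-tune the model with these
    hyperparameters and return its F1 score on the validation split.
    Here we return a toy deterministic score so the sketch runs end to end.
    """
    return params["learning_rate"] * 1e4 + params["epochs"] / 10

def grid_search(grid):
    keys = list(grid)
    best_params, best_f1 = None, float("-inf")
    # Enumerate every combination of hyperparameter values.
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        f1 = train_and_score(params)
        if f1 > best_f1:
            best_params, best_f1 = params, f1
    return best_params, best_f1

best_params, best_f1 = grid_search(GRID)
```

After the search, the model would be retrained once with `best_params` on the merged train + dev data, with dev-test as the final validation set.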
        <sec id="sec-4-1-1">
          <title>4.1.1. Monolingual Models</title>
          <p>
            For each language, we chose a single pre-trained monolingual model available on HuggingFace
that we later fine-tuned on the corresponding language’s data:
• ENGLISH (MONO-EN) - FacebookAI/roberta-large - the language model (355M parameters)
trained on English data in a self-supervised fashion [19].
• DUTCH (MONO-NL) - DTAI-KULeuven/robbert-2023-dutch-large - the first Dutch large (355M
parameters) model trained on the OSCAR2023 dataset [
            <xref ref-type="bibr" rid="ref5">20</xref>
            ].
• ARABIC (MONO-AR) - UBC-NLP/MARBERT - trained on randomly sampled 1B Arabic tweets
(with at least 3 Arabic words) from a large in-house dataset of about 6B tweets [
            <xref ref-type="bibr" rid="ref6">21</xref>
            ].
          </p>
        </sec>
        <sec id="sec-4-1-2">
          <title>4.1.2. Multilingual Models</title>
          <p>
            We also fine-tuned a multilingual FacebookAI/xlm-roberta-large [
            <xref ref-type="bibr" rid="ref7">22</xref>
            ] model on a combined dataset from
all available languages (MULTI-ALL). Since Spanish was not included in the final submission, we
performed another fine-tuning using only English, Dutch, and Arabic data (MULTI-NO-ES).
          </p>
          <p>Additionally, we experimented with fine-tuning a multilingual model previously fine-tuned on a
different propaganda-related dataset. We first fine-tuned the FacebookAI/xlm-roberta-large model on the
propaganda presence binary classification task and then fine-tuned the model again on the CheckThat!
Task 1 data (MULTI-PROP2). Refer to Section 5 for more information.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Few-shot Prompting with GPT Models</title>
        <p>We also employed Large Language Models (LLMs) to generate check-worthiness predictions. Our
experiments included OpenAI’s gpt-4o (GPT-4o) and gpt-3.5-turbo-1106 (GPT-3.5) generative models.
We implemented the few-shot prompting technique using the OpenAI Chat Completions API. Few-shot
prompting with GPT models leverages pre-trained language models to perform specific tasks without
retraining. Instead, the model is guided by providing a few examples and their expected responses
within the input prompt.</p>
        <p>Each prediction request sent to the GPT model consisted of a list of messages presented to the model.
Each message contains a role and a content attribute. There are three roles available:
1. system message helps set the behavior of the model (assistant) by providing it with context and
guidelines.
2. user messages can provide exemplary requests for the assistant. In our case - example requests
for a provided text’s check-worthiness evaluation.</p>
        <p>3. assistant messages indicate the expected output of the assistant.</p>
        <p>In our experiments, the conversation is formatted starting with a system message that clarifies the
task and the concept of check-worthiness. This is followed by alternating pairs of user and assistant
messages, one pair for each few-shot example: a user message poses a question about the
example’s check-worthiness, and the corresponding assistant message provides the gold label
for the example, either ’Yes’ or ’No’. The final message following the pairs is one user message with
the actual text to be classified by the model (see Appendix B). For each instance to be classified, we
included four few-shot examples from the training dataset, two of which are check-worthy.
The chosen few-shot examples were consistent in a given language. The prompt templates remained
consistent for both the GPT-4o and GPT-3.5 experiments.</p>
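The message layout described above can be sketched as follows. The system prompt wording and the example texts are illustrative placeholders, not the team's actual prompts (those are shown in Appendix B):

```python
# Sketch of few-shot prompt construction for the OpenAI Chat Completions API:
# one system message, alternating user/assistant pairs (one per few-shot
# example), and a final user message carrying the text to classify.

SYSTEM_PROMPT = (
    "You decide whether a text is check-worthy, i.e., whether its factual "
    "claims should be verified by a fact-checker. Answer 'Yes' or 'No'."
)

# (text, gold label) pairs drawn from the training split; two of the four
# examples are check-worthy, as in the paper. Texts here are placeholders.
FEW_SHOT_EXAMPLES = [
    ("Example claim 1 ...", "Yes"),
    ("Example claim 2 ...", "No"),
    ("Example claim 3 ...", "Yes"),
    ("Example claim 4 ...", "No"),
]

def build_messages(text):
    """Build the message list sent with each classification request."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user",
                         "content": f"Is this text check-worthy? TEXT: {example}"})
        messages.append({"role": "assistant", "content": label})
    messages.append({"role": "user",
                     "content": f"Is this text check-worthy? TEXT: {text}"})
    return messages

messages = build_messages("The unemployment rate fell to 3% last year.")
# The list would then be sent via the OpenAI client, e.g.:
# client.chat.completions.create(model="gpt-4o", messages=messages)
```

Because the few-shot examples and system message are fixed per language, only the final user message changes between requests.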
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Propaganda Co-occurrence Analysis</title>
      <p>
        Since propaganda often involves misleading, biased, or manipulative information, such content is
naturally more likely to warrant fact-checking [
        <xref ref-type="bibr" rid="ref8 ref9">23, 24</xref>
        ]. Therefore, we decided to indirectly analyze
whether propaganda co-occurs with check-worthy claims. For that purpose, we predicted the presence
of propaganda using a model fine-tuned on a propaganda dataset. The underlying assumption was that
check-worthy statements are more likely to contain propaganda techniques, given their potential to
persuade or manipulate public opinion. We leveraged a multilingual FacebookAI/xlm-roberta-large [
        <xref ref-type="bibr" rid="ref7">22</xref>
        ]
model fine-tuned by Modzelewski et al. [
        <xref ref-type="bibr" rid="ref10">25</xref>
        ] on the IberLEF DIPROMATS 2024 Task 1a [
        <xref ref-type="bibr" rid="ref11">26</xref>
        ] and then we
employed it on the check-worthiness dev-test dataset to evaluate this hypothesis (MULTI-PROP1).
IberLEF DIPROMATS 2024 Task 1a is a binary classification task for propaganda detection in English and
Spanish tweets [
        <xref ref-type="bibr" rid="ref11">26</xref>
        ].
      </p>
      <p>Table 2 shows the performance metrics of the MULTI-PROP1 model calculated for the dev-test
dataset. Relatively high precision shows that many texts with detected propaganda are indeed worth
fact-checking. However, low recall illustrates that many check-worthy texts do not contain propaganda
detectable by our model. Such results lead to an intuitive conclusion that propaganda often signals the
need for fact-checking, but not all check-worthy statements necessarily rely on propaganda methods.</p>
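The co-occurrence evaluation amounts to scoring the propaganda model's binary predictions against the gold check-worthiness labels. A minimal sketch, using toy labels rather than the actual dev-test data:

```python
# Treat propaganda-presence predictions as if they were check-worthiness
# predictions and compute precision/recall against the gold labels.

def precision_recall(gold, predicted):
    """Precision and recall of `predicted` ("Yes"/"No") against `gold`."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == "Yes" and p == "Yes")
    fp = sum(1 for g, p in zip(gold, predicted) if g == "No" and p == "Yes")
    fn = sum(1 for g, p in zip(gold, predicted) if g == "Yes" and p == "No")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# gold: check-worthiness labels; pred: propaganda-presence predictions.
# Toy data illustrating the pattern reported in Table 2.
gold = ["Yes", "Yes", "Yes", "Yes", "No", "No"]
pred = ["Yes", "No", "No", "No", "No", "No"]  # propaganda found in one text

precision, recall = precision_recall(gold, pred)
# High precision with low recall here means: texts flagged as containing
# propaganda are usually check-worthy, but many check-worthy texts contain
# no propaganda detectable by the model.
```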
      <p>We then further fine-tuned this model on CheckThat! Task 1 data and utilized it to predict
check-worthiness (MULTI-PROP2). Due to time constraints, we did not perform hyperparameter tuning for
this model. Instead, we used the hyperparameter values obtained from the search conducted during the
MULTI-ALL experiment.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Results</title>
      <p>Since the number of allowed submissions was limited to one submission per language, we selected
the models for the final predictions based on the F1 scores obtained on the dev-test dataset. The
models selected for the final submission are: MONO-EN, which ranked 8th on the final leaderboard for
English, and MULTI-NO-ES, which ranked 2nd on the final leaderboard for Dutch and 3rd on the final
leaderboard for Arabic. Table 3 provides the details of the models we submitted. Additionally, the table
shows the baseline provided by the organisers of CheckThat! Lab 2024 Task 1 and the score of the best
team for each language.</p>
      <p>After the submission deadline, we experimented with Large Language Models, namely GPT-3.5 and
GPT-4o. Moreover, our experiments that combined knowledge from propaganda classification with
check-worthiness detection through the MULTI-PROP1 and MULTI-PROP2 models also took
place after the deadline. Due to strict deadlines and time constraints, we could not
complete these experiments before submission. Therefore, we did not consider their results
on the dev-test dataset when selecting models for submission. However, some of our post-deadline
results are superior to our final submission results and, in the case of the Dutch language, even surpass
the leaderboard winner.</p>
      <p>Table 4 shows the results of all experiments we conducted. We report the F1 scores calculated on
the dev-test and test datasets. We did not have access to the ground truth for the test data while
developing this system. Nevertheless, we calculated the test dataset scores after the organizers
released the labels for this dataset following the submission deadline. Each record of the DEV-TEST columns
represents the F1 score yielded by a model fine-tuned once on the combined training dataset (train +
dev) with dev-test used as the validation dataset.</p>
      <sec id="sec-6-1">
        <title>6.1. Results</title>
        <sec id="sec-6-1-1">
          <title>6.1.1. English Language Results</title>
          <p>The submission of the MONO-EN model earned us the 8th position on the leaderboard. Notably, the
difference between the winning model’s F1 score and ours was small: just 0.042. Interestingly, the
MULTI-NO-ES model, an xlm-roberta-large fine-tuned on English, Dutch, and Arabic data, achieved a
better F1 score than our chosen submission, with a score of 0.7647. However, the difference may not be
statistically significant. Most of the remaining models nearly matched our best result.</p>
        </sec>
        <sec id="sec-6-1-2">
          <title>6.1.2. Dutch Language Results</title>
          <p>In Dutch, we came very close to winning with the MULTI-NO-ES model, achieving an F1 score of 0.73,
just 0.02 points behind the actual winner. Two of our post-deadline models surpassed our submission
result: MULTI-PROP2 with a slightly better F1 score of 0.7336 and GPT-4o with an impressive
0.7915. The results from GPT-4o even outperformed those of the leaderboard winner. Interestingly, the
monolingual approach MONO-NL performed relatively poorly, with an F1 score of 0.6182.</p>
        </sec>
        <sec id="sec-6-1-3">
          <title>6.1.3. Arabic Language Results</title>
          <p>Similarly, we submitted the predictions of the MULTI-NO-ES model for the Arabic language. This
submission secured us third place on the leaderboard. The difference in F1 scores between the
MULTI-NO-ES model and the 1st ranked model on the leaderboard is 0.031. The monolingual model performed
worse than the multilingual systems, but the difference is minor. Interestingly, GPT-3.5 produced one
of the best results: 0.5539. Our top post-deadline model, MULTI-PROP2, nearly matched the F1 score
of the leaderboard winner, with a difference of just 0.002.</p>
        </sec>
        <sec id="sec-6-1-4">
          <title>6.1.4. GPT Few-shot Prompting Results</title>
          <p>The few-shot prompting approach using GPT models showed similar results to the fine-tuning approach
with BERT-based models. Across all languages, the GPT-4o model consistently outperformed or
performed similarly to the GPT-3.5 model. Specifically for English, GPT-4o performed noticeably
better than GPT-3.5 and achieved results only slightly lower than fine-tuned BERT-based models. In the
case of Dutch, GPT-4o was also superior to GPT-3.5, and it emerged as the best model among those we
tested for this language. An interesting observation from our experiments is that the differences between
the Dutch dev-test and test results were more pronounced for GPT models than for other models. For
Arabic, GPT-3.5 and GPT-4o showed nearly identical performance, comparable to experiments using
BERT-based models.</p>
        </sec>
        <sec id="sec-6-1-5">
          <title>6.1.5. Propaganda Detection Transfer Learning Results</title>
          <p>The results of a multilingual model fine-tuned on DIPROMATS 2024 Task 1a data and further fine-tuned on
English, Dutch, Arabic, and Spanish using CheckThat! 2024 Task 1 data (MULTI-PROP2) in one case
outperformed all other approaches, but most probably the difference was statistically insignificant. We
observed negligibly better results for Arabic and slightly worse results for the remaining languages
compared to the MULTI-ALL model.</p>
        </sec>
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Results Discussion</title>
        <p>Our results show that the outcomes on the test and dev-test datasets are significantly different.
The explanation for this phenomenon is not straightforward. One possible reason is overfitting to the
dev-test data when selecting the best model during the final fine-tuning. However, this does not explain
the discrepancy observed with the GPT models with few-shot prompting, which were not fine-tuned
yet still showed significant mismatches between the dev-test and test results. A potential reason for
the differences in the large GPT models is that the dev-test and test data may have differed significantly.</p>
        <p>The GPT models, particularly GPT-4o, showed results comparable to the fine-tuned BERT-based
models. The few-shot prompting approach with the OpenAI API offers a more straightforward solution for
such tasks, delivering similar performance at a lower time cost.</p>
        <p>The monolingual models generally performed worse than the multilingual models in our experiments.
Thus, using a language-specific model is not always the best choice. Adding Spanish to the training
data did not always improve performance for the multilingual models.</p>
        <p>The model obtained by fine-tuning on the IberLEF DIPROMATS 2024 propaganda detection task and
further fine-tuning on check-worthiness data did not yield any significant performance improvement
compared to multilingual models fine-tuned exclusively on the check-worthiness data. A limitation of this
approach was that we did not perform hyperparameter tuning for MULTI-PROP2. Hyperparameter
optimization could significantly change our results.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions and Future Work</title>
      <p>In our work, we experimented with monolingual and multilingual approaches for various languages for
check-worthy claims detection. For the monolingual approach, we utilized BERT models tailored to
specific languages, optimizing hyperparameters and fine-tuning each model separately for each language.
For the multilingual approach, we used XLM-RoBERTa-large. First, we optimized and fine-tuned it on the
entire dataset. Then, we excluded Spanish from the training data in a second experiment. Additionally,
we employed two LLMs, namely GPT-3.5-turbo and the recently released GPT-4o, for each language, using
few-shot prompting to classify texts. We also fine-tuned a model on the IberLEF DIPROMATS 2024 Task
1 dataset and used this model to predict whether the data from CheckThat! Lab 2024 Task 1 contained
propaganda. With this analysis, we aimed to indirectly determine whether check-worthy data also
includes propaganda. We later utilized the mentioned model (XLM-RoBERTa-large fine-tuned on the
IberLEF DIPROMATS 2024 Task 1a for binary propaganda classification) and further fine-tuned it for
check-worthiness classification. Our final submissions for all languages were: English - RoBERTa-large,
fine-tuned and optimized exclusively on English data; Dutch - XLM-RoBERTa-large, fine-tuned and
optimized on all data except Spanish; Arabic - XLM-RoBERTa-large, fine-tuned and optimized on all
data except Spanish.</p>
      <p>From our experiments, we can conclude that GPT models, particularly GPT-4o, showed results
comparable to the fine-tuned BERT-based models. Moreover, the language-specific BERT-based model performed
better only on the English dataset. For the other languages, we obtained better results by utilizing a multilingual model.</p>
      <p>Future work could incorporate a detailed linguistic analysis of texts to understand the different
linguistic features among check-worthy texts and those that do not require checking. By identifying
specific linguistic features and patterns, we could develop more nuanced systems that better differentiate
between these types of texts. This analysis could involve examining rhetorical devices and stylistic
elements that are prevalent in check-worthy claims.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Ethics</title>
      <p>We acknowledge that our research may raise a number of ethical issues. The first shortcoming is a lack
of clear explainability of our models’ results. Each model generates check-worthiness ratings without
explanations as to why a statement was rated check-worthy or not. Users may need explanations to
understand the basis for the model’s decisions. One of the most crucial tools for our research was
fine-tuning BERT-based models and utilizing GPT-based Large Language Models. As a result, if the BERT-
or GPT-based models were trained on data containing any bias, disinformation, or misinformation,
these problems may affect the results of our experiments. The next potential shortcoming is a possible
specialization of our systems to detect specific types of check-worthy information. Our systems may
not handle more subtle content or off-topic statements well. We did not check whether
the check-worthy information in the dataset was limited to a single topic, and we relied on the work of the workshop
organizers to cover more than a specific type.
Conference of the CLEF Association, CLEF 2019, Lugano, Switzerland, September 9–12, 2019,
Proceedings 10, Springer, 2019, pp. 301–321.
[5] A. Barrón-Cedeno, T. Elsayed, P. Nakov, G. Da San Martino, M. Hasanain, R. Suwaileh, F. Haouari,
Checkthat! at clef 2020: Enabling the automatic identification and verification of claims in social
media, in: Advances in Information Retrieval: 42nd European Conference on IR Research, ECIR
2020, Lisbon, Portugal, April 14–17, 2020, Proceedings, Part II 42, Springer, 2020, pp. 499–507.
[6] P. Nakov, G. Da San Martino, T. Elsayed, A. Barrón-Cedeño, R. Míguez, S. Shaar, F. Alam, F. Haouari,
M. Hasanain, W. Mansour, et al., Overview of the clef–2021 checkthat! lab on detecting
checkworthy claims, previously fact-checked claims, and fake news, in: Experimental IR Meets
Multilinguality, Multimodality, and Interaction: 12th International Conference of the CLEF Association,
CLEF 2021, Virtual Event, September 21–24, 2021, Proceedings 12, Springer, 2021, pp. 264–291.
[7] P. Nakov, A. Barrón-Cedeño, G. da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli,
M. Kutlu, W. Zaghouani, et al., Overview of the clef–2022 checkthat! lab on fighting the covid-19
infodemic and fake news detection, in: International Conference of the Cross-Language Evaluation
Forum for European Languages, Springer, 2022, pp. 495–520.
[8] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov,
F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of
multigenre content, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.),
Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble,
France, 2024.
[9] N. Hassan, C. Li, M. Tremayne, Detecting check-worthy factual claims in presidential debates,
in: Proceedings of the 24th acm international on conference on information and knowledge
management, 2015, pp. 1835–1838.
[10] I. Jaradat, P. Gencheva, A. Barrón-Cedeño, L. Màrquez, P. Nakov, Claimrank: Detecting
checkworthy claims in arabic and english, arXiv preprint arXiv:1804.07587 (2018).
[11] Y. S. Kartal, M. Kutlu, Re-think before you share: a comprehensive study on prioritizing
checkworthy claims, IEEE transactions on computational social systems 10 (2022) 362–375.
[12] P. Atanasova, L. Marquez, A. Barron-Cedeno, T. Elsayed, R. Suwaileh, W. Zaghouani, S. Kyuchukov,
G. Da San Martino, P. Nakov, et al., Overview of the clef-2018 checkthat! lab on automatic
identification and verification of political claims. task 1: Check-worthiness, in: CEUR WORKSHOP
PROCEEDINGS, volume 2125, CEUR-WS, 2018, pp. 1–13.
[13] P. Atanasova, P. Nakov, G. Karadzhov, M. Mohtarami, G. Da San Martino, Overview of the clef-2019
checkthat! lab: Automatic identification and verification of claims. task 1: Check-worthiness
(2019).
[14] S. Shaar, M. Hasanain, B. Hamdan, Z. S. Ali, F. Haouari, A. Nikolov, Y. S. Kartal, F. Alam,
G. Da San Martino, et al., Overview of the clef-2021 checkthat! lab task 1 on check-worthiness
estimation in tweets and political debates., 2021.
[15] H. T. Sadouk, F. Sebbak, H. E. Zekiri, Es-vrai at checkthat! 2023: Analyzing checkworthiness in
multimodal and multigenre (2023).
[16] M. Sawiński, K. Węcel, E. P. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz,
OpenFact at CheckThat! 2023: Head-to-head GPT vs. BERT - a comparative study of transformers
language models for the detection of check-worthy claims, Working Notes of CLEF (2023).
[17] A. Modzelewski, W. Sosnowski, A. Wierzbicki, DSHacker at CheckThat! 2023: Check-worthiness
in multigenre and multilingual content with GPT-3.5 data augmentation, Working Notes of CLEF
(2023).
[18] M. Mosbach, T. Pimentel, S. Ravfogel, D. Klakow, Y. Elazar, Few-shot fine-tuning vs. in-context
learning: A fair comparison and evaluation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of
the Association for Computational Linguistics: ACL 2023, Association for Computational
Linguistics, Toronto, Canada, 2023, pp. 12284–12314. URL: https://aclanthology.org/2023.findings-acl.779.
doi:10.18653/v1/2023.findings-acl.779.
[19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, CoRR abs/1907.11692 (2019). URL: http://arxiv.org/abs/1907.11692. arXiv:1907.11692.</p>
    </sec>
    <sec id="sec-9">
      <title>A. Optimal Hyperparameter Values</title>
      <p>This appendix includes the optimal hyperparameter values for our best models.</p>
    </sec>
    <sec id="sec-10">
      <title>B. Prompt Messages</title>
      <p>In this appendix, we present the prompt messages included with each text classification request. For brevity, we show only the prompts used for the Spanish data.
1. System prompt: You are a fact-checking expert. Your task aims to assess the check-worthiness of a presented text. As a fact-checker, you know that to decide whether a text is check-worthy, you must answer several auxiliary questions, such as “Does the text contain a verifiable factual claim?” or “Is the text harmful?”. Please provide only the final label: “Yes” if the text is check-worthy and “No” otherwise.
2. Pairs of user and assistant prompts for few-shot prompting:
a) Example 1:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer
using only a single English word: Yes or No. TEXT: Mañana, viernes, no puedes perderte el gran
acto de cierre de campaña en Madrid. A las 19.00 h en el Pabellón 1 de IFEMA (Madrid). Con
Kiko Veneno y O'Funk'illo en concierto y la intervención de @Pablo_Iglesias_, @AdaColau,
@Irene_Montero_, @agarzon... ¡Te esperamos!
• Assistant content: No
b) Example 2:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer
using only a single English word: Yes or No. TEXT: tve_tve vuelve a quedar en evidencia.
Desplaza al minuto 18 la denuncia del #CGPJ ante las críticas de Iglesias y habla de
“diferencias”. Exigimos al responsable de edición del telediario explicaciones y a Sánchez
que deje de utilizar rtve a su antojo @Enric_Hernandez
• Assistant content: Yes
c) Example 3:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer
using only a single English word: Yes or No. TEXT: El PSOE demostró su apoyo a Torra
durante la moción #PorLaConvivencia Lroldansu "Cs sigue siendo el referente incontestable
del Constitucionalismo en Cataluña. No podemos dejar el futuro de 7,5 millones de catalanes
en manos de Torra" #ActualidadCs
• Assistant content: No
d) Example 4:
• User content: Answer whether the following text in Spanish is worth fact-checking. Answer
using only a single English word: Yes or No. TEXT: Pedro Sánchez ha dado el visto bueno
a la apertura de las ’embajadas’ catalanas en Argentina, México y Túnez. Empiezan las
cesiones a sus socios separatistas. Incrementan gasto en sus majaderías, despreciando la
urgencia de las política sociales.</p>
      <p>• Assistant content: Yes
3. Final user prompt: Answer whether the following text in Spanish is worth fact-checking. Answer
using only a single English word: Yes or No. TEXT: &lt;Here we provide the text to be classified by the LLM&gt;</p>
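      <p>The message structure above can be sketched as an OpenAI-style chat request. The snippet below is an illustrative reconstruction, not our exact implementation: the helper names and the abbreviated demonstration texts are assumptions.</p>

```python
# Illustrative sketch of how the system prompt, few-shot pairs, and final
# user prompt described above are assembled into one chat message list.
# Prompt wording is abridged; demonstration texts are placeholders.

SYSTEM_PROMPT = (
    "You are a fact-checking expert. Your task aims to assess the "
    "check-worthiness of a presented text. Please provide only the final "
    "label: Yes if the text is check-worthy and No otherwise."
)

USER_TEMPLATE = (
    "Answer whether the following text in Spanish is worth fact-checking. "
    "Answer using only a single English word: Yes or No. TEXT: {text}"
)

# (demonstration text, gold label) pairs shown to the model in context;
# the real system used four labeled tweets, as listed in this appendix.
FEW_SHOT_EXAMPLES = [
    ("Tuit promocional sin afirmaciones verificables.", "No"),
    ("Tuit con una afirmacion factual verificable.", "Yes"),
]


def build_messages(text):
    """Compose the full message list for one classification request."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for example_text, label in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user",
                         "content": USER_TEMPLATE.format(text=example_text)})
        messages.append({"role": "assistant", "content": label})
    # Final user prompt carrying the text to classify.
    messages.append({"role": "user",
                     "content": USER_TEMPLATE.format(text=text)})
    return messages


def parse_label(answer):
    """Map the model's one-word reply onto a binary check-worthiness label."""
    return 1 if answer.strip().lower().startswith("yes") else 0
```

      <p>The resulting list is passed as the messages argument of a chat-completion call, and the single-word reply is mapped back to a binary label.</p>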
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Cheema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hakimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          , et al.,
          <article-title>Overview of the CLEF-2023 CheckThat! lab task 1 on check-worthiness in multimodal and multigenre content</article-title>
          , Working Notes of CLEF (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <article-title>The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</article-title>
          , in:
          <string-name><given-names>N.</given-names> <surname>Goharian</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Tonellotto</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>He</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Lipani</surname></string-name>
          ,
          <string-name><given-names>G.</given-names> <surname>McDonald</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Macdonald</surname></string-name>
          , I. Ounis (Eds.),
          <source>Advances in Information Retrieval</source>
          , Springer Nature Switzerland, Cham,
          <year>2024</year>
          , pp.
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Màrquez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kyuchukov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Da San Martino</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2018 CheckThat! lab on automatic identification and verification of political claims</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: 9th International Conference of the CLEF Association, CLEF 2018, Avignon, France, September 10-14, 2018, Proceedings 9</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>372</fpage>
          -
          <lpage>387</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          , G. Da San Martino, P. Atanasova,
          <article-title>Overview of the CLEF-2019 CheckThat! lab: Automatic identification and verification of claims</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction: 10th International Conference of the CLEF Association, CLEF 2019</source>
          , Springer,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>P.</given-names>
            <surname>Delobelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Remy</surname>
          </string-name>
          ,
          <article-title>RobBERT-2023: Keeping Dutch language models up-to-date at a lower cost thanks to model conversion</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdul-Mageed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elmadany</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. M. B.</given-names>
            <surname>Nagoudi</surname>
          </string-name>
          ,
          <article-title>ARBERT &amp; MARBERT: Deep bidirectional transformers for Arabic</article-title>
          , in:
          <source>Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>7088</fpage>
          -
          <lpage>7105</lpage>
          . URL: https://aclanthology.org/2021.acl-long.551. doi:10.18653/v1/2021.acl-long.551.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <article-title>Unsupervised cross-lingual representation learning at scale</article-title>
          , CoRR abs/1911.02116 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1911.02116. arXiv:1911.02116.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>H.</given-names>
            <surname>Rashkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Jang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Volkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Truth of varying shades: Analyzing language in fake news and political fact-checking</article-title>
          , in:
          <string-name><given-names>M.</given-names> <surname>Palmer</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Hwa</surname></string-name>
          , S. Riedel (Eds.),
          <source>Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Copenhagen, Denmark,
          <year>2017</year>
          , pp.
          <fpage>2931</fpage>
          -
          <lpage>2937</lpage>
          . URL: https://aclanthology.org/D17-1317. doi:10.18653/v1/D17-1317.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Feldman</surname>
          </string-name>
          ,
          <article-title>Findings of the NLP4IF-2021 shared tasks on fighting the COVID-19 infodemic and censorship detection</article-title>
          , in: A.
          <string-name>
            <surname>Feldman</surname>
          </string-name>
          , G. Da San Martino, C. Leberknight, P. Nakov (Eds.),
          <source>Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda</source>
          , Association for Computational Linguistics, Online,
          <year>2021</year>
          , pp.
          <fpage>82</fpage>
          -
          <lpage>92</lpage>
          . URL: https://aclanthology.org/2021.nlp4if-1.12. doi:10.18653/v1/2021.nlp4if-1.12.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>A.</given-names>
            <surname>Modzelewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Golik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wierzbicki</surname>
          </string-name>
          ,
          <article-title>Bilingual propaganda detection in diplomats' tweets using language models and linguistic features</article-title>
          , in: IberLEF@SEPLN,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Moral</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fraile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Marco</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Peñas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalo</surname>
          </string-name>
          ,
          <article-title>Overview of DIPROMATS 2024: Detection, characterization and tracking of propaganda in messages from diplomats and authorities of world powers</article-title>
          ,
          <source>Procesamiento del Lenguaje Natural</source>
          <volume>73</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>