<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peter Røysland Aarnes</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vinay Setty</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Petra Galuščáková</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Stavanger</institution>
          ,
          <addr-line>Kjell Arholms gate 41, 4021 Stavanger</addr-line>
          ,
          <country country="NO">Norway</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>09</fpage>
      <lpage>12</lpage>
      <abstract>
<p>This paper describes the IAI group's participation in automated check-worthiness estimation for claims, within the framework of the 2024 CheckThat! Lab “Task 1: Check-Worthiness Estimation”. The task involves the automated detection of check-worthy claims in English, Dutch, and Arabic political debates and Twitter data. We utilized various pre-trained generative decoder and encoder transformer models, employing methods such as few-shot chain-of-thought reasoning, fine-tuning, data augmentation, and transfer learning from one language to another. Despite variable success in terms of performance, our models achieved notable placements on the organizer's leaderboard: ninth-best in English, third-best in Dutch, and the top placement in Arabic, utilizing multilingual datasets for enhancing the generalizability of check-worthiness detection. Despite a significant drop in performance on the unlabeled test dataset compared to the development test dataset, our findings contribute to the ongoing efforts in claim detection research, highlighting the challenges and potential of language-specific adaptations in claim verification systems.</p>
      </abstract>
      <kwd-group>
<kwd>Check-worthiness</kwd>
        <kwd>Fact-checking</kwd>
        <kwd>RoBERTa</kwd>
        <kwd>LLM fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In an era where information spreads faster than our capacity to verify it, the need for robust mechanisms
to assess the veracity of circulating claims has become increasingly critical. In the automated
fact-checking research community, a claim is commonly defined as “an assertion about the world that can
be checked”, as formalized by Full Fact [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. However, this definition does not address the worthiness of
checking a claim, since not every claim warrants scrutiny due to its triviality. To determine whether
a claim is verifiable, several factors could be considered: whether the assertion is of public interest;
whether it is factually verifiable, such as statements about the present or the past, or involving correlation
and causation; if the claim is a rumor or conspiracy; or if it could potentially cause social harm [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ].
By directing the efforts of fact-checkers and automated systems toward claims with widespread impact,
such as those affecting public health or policy decisions, we ensure that critical information remains
reliable and verification resources are utilized effectively.
      </p>
      <p>
        In this paper, we detail our approach to training numerous models for the detection of check-worthy
claims, specifically within the framework of the 2024 CheckThat! Lab “Task 1: Check-Worthiness
Estimation” [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. This task seeks to determine whether claims found in tweets or political speech
transcriptions merit fact-checking, using a binary classification approach.
      </p>
<p>We conducted experiments across all three CheckThat! languages chosen by the organizers: English,
Dutch, and Arabic. Our submissions ranked best for Arabic, third for Dutch, and ninth for English.
We employed various exploratory methods tailored to each language, utilizing various pre-trained
autoregressive decoder models and encoder-only transformer models.</p>
      <p>For English and Dutch, our primary focus was to fine-tune our chosen models using the training
data provided by the organizers for each specific language. However, we also attempted to fine-tune
multilingual models using additional data beyond that of the language in which the model would be
tested. For Arabic, which proved to be the most challenging dataset, we initially fine-tuned
models on Arabic training data. However, the best results were achieved by translating the Arabic test
data into English and then using a GPT-3.5 model, fine-tuned in English, to classify the data.</p>
      <p>We also took part in Task 2 of the CLEF CheckThat! 2024 challenge, which aimed to determine
whether a sentence from a news article expressed the author’s subjective viewpoint or presented
an objective perspective on the topic. As check-worthy claims are inherently objective statements,
we employed the XLM-RoBERTa-Large model, which was trained for claim detection tasks. Given
its multilingual capabilities, we utilized this model for datasets spanning English, German, Italian,
Bulgarian, Arabic, and multilingual sources. The XLM-RoBERTa-Large’s ability to handle diverse
languages made it a suitable choice for this multilingual claim detection task, enabling us to analyze
and classify sentences across various linguistic contexts.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        As traditional news media experiences a decline in popularity, particularly among younger demographics
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], platforms like X (formerly known as Twitter) and other types of microblogging services have become
primary sources of current events for many individuals. With the surge in X's popularity, the spread of
misinformation and fake news has been increasing [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], leading to heightened awareness and concern
among researchers, policymakers, and the public. This growing attention has spurred numerous
initiatives aimed at combating false narratives, as exemplified by the pervasive misinformation during
the 2016 U.S. presidential election [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ] and the COVID-19 infodemic, both of which significantly
influenced public opinion and health behaviors [
        <xref ref-type="bibr" rid="ref3 ref9">9, 3</xref>
        ].
      </p>
      <p>
To counteract the spread of misinformation, the research community has intensified efforts to develop
datasets and methodologies for automated fact-checking. Claim detection plays a crucial role within
these systems, serving as a foundational component for effective automated fact-checking [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. The
most significant progress in this area has been observed in the English language, with the two largest
datasets designed for this purpose being ClaimBuster [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], containing approximately 23,500 manually
annotated sentences, and CT19-T1 [
        <xref ref-type="bibr" rid="ref12">12</xref>
], a dataset compiled from several years of data from
the CLEF CheckThat! Lab challenges.
      </p>
      <p>
        Additionally, multilingual datasets like those documented by Gupta and Srikumar [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], primarily
used for fact-checking, are also utilized for multilingual claim detection, further enhancing the resources
available for this research. Although smaller datasets exist, typically with fewer than 10,000
annotated sentences, they are predominantly in English [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>
Over the past two years, the CheckThat! Labs have consistently used F1 scores as the official
measurement for the check-worthiness estimation subtask. Although the specific task descriptions and
the languages tested have varied across different iterations of the CheckThat! Lab, the overarching
goal has remained consistent: to predict the check-worthiness of claims in various languages. This
work focuses primarily on text data drawn from sources such as political debates and Twitter [
        <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
        ].
For this year’s CheckThat! Lab, F1 is again the official measure used to assess performance, continuing with
a subset of the same languages as previous editions: Arabic, Dutch, and English.
      </p>
      <p>
In the 2022 CheckThat! Lab Task 1, focused on check-worthiness estimation, the
NUSIDS group [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] had the winning submission with their CheckthaT5 model, which won in four out
of the six language categories that year [
        <xref ref-type="bibr" rid="ref16">16</xref>
]. Their model was based on mT5, a
sequence-to-sequence, massively multilingual model [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], and was trained jointly on multiple languages to promote
language-independent knowledge. Their Arabic submission achieved an F1 score of 0.628, and the
Dutch submission had an F1 score of 0.642. The winning English submission, made by the AI Rational
group, used a fine-tuned RoBERTa model and achieved an F1 score of 0.698.
      </p>
      <p>
        For the 2023 CheckThat! Lab Task 1, again focused on check-worthiness estimation (Subtask 1B), the
OpenFact group attained the best submission for English. The group fine-tuned GPT-3, which resulted
in an F1 of 0.898; in addition, they trained a BERT-based model that achieved nearly identical results
[
        <xref ref-type="bibr" rid="ref16 ref19">16, 19</xref>
        ]. For Arabic, the ES-VRAI group submitted the best results, which were derived from a fine-tuned
MARBERT model [20] trained on a downsampled majority class, resulting in an F1 of 0.809 [21].
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>As shown in Table 1, there is a significant imbalance in the class label distribution within the training
data. If the model is exposed to one class more frequently during training, it may develop a bias towards
the majority class, leading to overfitting and poor generalization when encountering the minority class
in new data. To address these issues, one can either undersample the majority class or oversample the
minority class to create a more balanced training set. Alternatively, other data augmentation techniques,
such as back-translation or synthetic data generation, could be used to balance the class distribution
[22]. Additionally, instead of focusing only on the training data's class distribution, adjusting the evaluation
strategy during training to maximize the macro-averaged F1 score ensures that predictions for
different classes are treated with equal importance.</p>
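<p>The undersampling option described above can be sketched in a few lines of Python. This is a toy illustration under our own naming (the helper and the data are hypothetical), not the exact pipeline used in our experiments:</p>

```python
import random

def undersample(examples, labels, seed=42):
    """Randomly undersample the majority class so that every class
    appears equally often in the resulting training set."""
    rng = random.Random(seed)
    by_label = {}
    for ex, y in zip(examples, labels):
        by_label.setdefault(y, []).append(ex)
    # Keep as many examples per class as the smallest class has.
    n_min = min(len(v) for v in by_label.values())
    balanced = []
    for y, exs in by_label.items():
        for ex in rng.sample(exs, n_min):
            balanced.append((ex, y))
    rng.shuffle(balanced)
    return balanced

# Toy imbalance: 5 check-worthy ("Yes") vs. 2 not check-worthy ("No").
data = ["c1", "c2", "c3", "c4", "c5", "n1", "n2"]
labels = ["Yes"] * 5 + ["No"] * 2
balanced = undersample(data, labels)  # 2 "Yes" + 2 "No" pairs
```

<p>Oversampling the minority class is the mirror image (sampling with replacement up to the majority count); both change only the training distribution, not the evaluation data.</p>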
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
<p>To conduct our experiments, a series of methods was used to optimize the performance of
the different fine-tuned models for a given language. These methods include translating data from one
language to another to enlarge the training dataset for a particular language, text normalization,
style transfer, hyperparameter grid searches, and analyzing key performance indicators such as loss
and F1 scores during training, which were logged by the Weights &amp; Biases (W&amp;B) Python library and
online tool [23]. In this section, we explore in greater detail the methods used to fine-tune the
different models. The code used to train and test our models is available in our GitHub repository.1</p>
      <sec id="sec-4-1">
        <title>4.1. Data pre-processing and augmentation</title>
<p>Data pre-processing became one of the experimental methods used in fine-tuning the different
models. We applied the following methods:
• Text Normalization: The TweetNormalizer2 [24] script was used post-translation for the Arabic,
Dutch, and Spanish data. During our preliminary testing, the TweetNormalizer did not yield
promising results, leading us to exclude it from further experiments when training our models</p>
        <sec id="sec-4-1-1">
<title>1 IAI group code repository: https://github.com/iai-group/clef2024-checkthat</title>
<p>2 TweetNormalizer: https://github.com/VinAIResearch/BERTweet/blob/master/TweetNormalizer.py
using hyperparameter grid searches. The reasons behind the poor performance of
TweetNormalizer are not entirely clear, although it is plausible that the issue may be related to entity linking.
Unlike other approaches, TweetNormalizer does not preserve the specific “@&lt;username&gt;” tokens
in tweets. Instead, it replaces any distinct username with a generic “@USER” token, effectively
removing unique identifiers associated with different classes. This removal of specific usernames
could potentially disrupt contextual relevance, which might otherwise contribute positively when
fine-tuning the models.
• Machine Translation: Due to the large amount of data to translate, we opted to use free-of-charge
translation systems available in the deep-translator library. According to the recent WMT
report [25], the quality of such freely available commercial systems depends on the particular
language pair, but is relatively high for all the studied systems and language pairs. We thus
used the Google Translate implementation from deep-translator3 due to its support for all studied
languages, its usual performance quality, and the absence of a required subscription or API key. Google Translate
was used to translate datasets from any provided source language (English, Dutch, Arabic, and
Spanish) to any target language (English, Dutch, Arabic).
• Style Transfer: As the style of the English collection (political debates) substantially differs
from the style of the Dutch and Arabic collections (Twitter data), we also experimented with machine
translation with style transfer to prepare in-style training data. Specifically, we style-transferred
the translated English training data to more closely resemble the Arabic data. To do this
style transfer, we employed the gpt-3.5-turbo-0125 model via the ChatGPT API. We used a single prompt
for each sentence, in which we asked the system to translate the sentence and also to transfer the
style of debate into a Tweet. We used a few-shot approach with three example Tweets selected
from the Arabic training collection and the following prompt: Rephrase the following statement
as if somebody was Tweeting about it in Arabic. Output might use hashtags, emoticons, images
and links. Statement: ({text to translate}) + Here are a few examples: ({arabic examples}). Though
the quality of the translated sentences looked reasonable, using these data did not lead to any
improvement, suggesting that the domain mismatch between the collections is too large to be
bridged by a style change alone. Style transfer might even affect the check-worthiness of a claim.
The GPT model used for style transfer was paid and relatively slow, which did not allow more
extensive experimentation.</p>
          <p>Few-shot chain-of-thought reasoning instruction prompt
Your task is to identify whether a given tweet text in the {lang} language
is verifiable using a search engine in the context of fact-checking.
Let’s define a function named checkworthy(input: str).
The return value should be a string, selected from "Yes" or
"No".
"Yes" means the text is a factual checkworthy statement.
"No" means that the text is not checkworthy, it might be an opinion, a
question, or others.</p>
<p>For example, if a user calls checkworthy("I think Apple is a good company.")
You should return a string "No" without any other words,
checkworthy("Apple’s CEO is Tim Cook.") should return "Yes" since it is
verifiable.</p>
          <p>Note that your response will be passed to the python interpreter,
SO NO OTHER WORDS!
Always return "Yes" or "No" without any other words.</p>
          <p>checkworthy({text})
3 Google Translate deep-translator: https://deep-translator.readthedocs.io/en/latest/usage.html#google-translate</p>
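<p>Because the prompt instructs the model to answer with exactly "Yes" or "No", the raw response still needs defensive parsing before it is mapped to the task's binary labels. A minimal sketch of such a guard (the helper name and the fallback policy are our own illustration, not part of the shared prompt):</p>

```python
def parse_checkworthy(response: str) -> str:
    """Map a raw model response to the task's binary labels.
    Falls back to "No" when the reply deviates from the expected format."""
    cleaned = response.strip().strip('"').strip(".").lower()
    if cleaned.startswith("yes"):
        return "Yes"
    if cleaned.startswith("no"):
        return "No"
    # Unexpected output (extra words, an opinion, etc.): treat as not check-worthy.
    return "No"
```

<p>Defaulting malformed replies to "No" is a conservative choice; one could equally retry the prompt when the reply does not match.</p>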
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Model Types and Fine-tuning</title>
        <p>For our experiments, we utilized both pre-trained generative autoregressive decoder transformer models
and pre-trained encoder-only transformer models to assess their effectiveness at classifying text in
English, Dutch, and Arabic. Our selection of generative models was based on their popularity and
availability, which includes GPT-4 [26], Mistral-7b [27], GPT-3.5 with few-shot chain-of-thought (CoT)
reasoning, and a fine-tuned GPT-3.5 [28]. For the encoder models, we chose XLM-RoBERTa-Large
[29] and RoBERTa-Large [30], which are prominent in multilingual training classification and English
classification tasks, respectively.</p>
<p>For fine-tuning the encoder-only models, we utilized the Hugging Face Trainer4 class. Although
most hyperparameters were kept at their default settings, the number of epochs was fixed at 50.
The development dataset was evaluated after each epoch, optimizing for Macro F1 score to monitor
performance. We also employed the hyperparameter grid search using Weights &amp; Biases [23] sweep
functionality to conduct multiple training runs, testing the most critical hyperparameter combinations.
To save time during training, runs were terminated early if the F1 score on the
development dataset did not improve for 3 consecutive epochs.</p>
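<p>The macro-F1 evaluation described above can be sketched as a metric function in the shape the Hugging Face Trainer passes to its compute_metrics hook; this is an illustrative reconstruction, not necessarily the exact code from our repository:</p>

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Compute the macro-averaged F1 from a (logits, labels) pair,
    the format the Hugging Face Trainer hands to `compute_metrics`."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

# The Trainer can then select checkpoints by this metric and stop early
# (we used a patience of 3 epochs), roughly:
#   TrainingArguments(..., metric_for_best_model="f1", load_best_model_at_end=True)
#   trainer.add_callback(EarlyStoppingCallback(early_stopping_patience=3))
```

<p>Optimizing for macro F1 rather than accuracy keeps the minority (check-worthy) class from being drowned out during model selection.</p>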
<p>The following list gives an overview of the different models used in our experiments, including
the data used for fine-tuning, and specifies whether a particular model was used for only one
language.</p>
        <p>
          • GPT-4 [26]: Few-shot CoT reasoning. Tested on all three languages.
• Mistral 7b [27]: Few-shot CoT reasoning. Tested on all three languages.
• GPT-3.5 [28]: Few-shot chain-of-thought reasoning approach. Tested on all three languages.
• GPT-3.5 [28] (fine-tuned): Fine-tuned on English training data for the English tests and Arabic
to English translations, and another model was fine-tuned on Spanish, Arabic and Dutch for the
Dutch test.
• XLM-RoBERTa-Large [29] (XLMR): Fine-tuned on English ClaimBuster [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], Norwegian and
        </p>
<p>German podcast data for claim detection [31]. The model was tested on all three languages.
• XLM-RoBERTa-Large (fine-tuned) [29]: This version of XLM-RoBERTa (which we will refer
to as “XLMR fine-tuned”) builds upon the initial fine-tuning of the aforementioned XLMR model.
It underwent additional fine-tuning with the organizer’s training data, specifically tailored for a
particular language. It was evaluated across all three languages.
• RoBERTa-Large [30]: Fine-tuned on unaltered English organizer’s training data, and was tested
only on English data.
4.2.1. English Model fine-tuning and hyperparameter tuning
For the submission on the organizers' unlabeled English test data, we used a fine-tuned RoBERTa-Large model,
since it outperformed the other models on the development test (dev-test) datasets. The list of
hyperparameters employed for the grid search is provided in Table 2. Figure 1 illustrates the outcome of the 24
distinct training runs and their corresponding performance on the development dataset.</p>
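<p>A grid search like the one above amounts to enumerating the Cartesian product of candidate hyperparameter values and launching one run per combination. A sketch (the value lists here are illustrative placeholders, not the actual grid from Table 2):</p>

```python
from itertools import product

# Illustrative candidate values; the real grid is given in Table 2.
grid = {
    "learning_rate": [1e-5, 2e-5, 3e-5],
    "batch_size": [16, 32],
    "weight_decay": [0.0, 0.01],
}

# One config dict per combination; each would back one W&B sweep run.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
```

<p>With these placeholder lists the product yields 3 × 2 × 2 = 12 configurations; W&amp;B sweeps automate exactly this enumeration and log each run's metrics.</p>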
        <sec id="sec-4-2-1">
          <title>4https://huggingface.co/docs/transformers/main_classes/trainer</title>
<p>Given the consistently high F1 scores across multiple RoBERTa training runs on the development
and dev-test datasets, a more in-depth analysis was conducted. During the RoBERTa grid search, two
specific model runs, which we will refer to as model A and model B, performed exceptionally well on
the development and dev-test datasets. Since only one model's prediction results could be submitted
for the final evaluation, our objective was to make an informed decision on which model would likely
perform best. For example, model A demonstrated slightly better dev-test F1 scores than
model B, although model B performed better than model A on the development set during training.</p>
<p>As a final sanity check, we compared the prediction overlap between models A and B and a fine-tuned
GPT-3.5 model to determine which RoBERTa model deviated most from the GPT-3.5's predictions. A
significant diference in overlap with the GPT model suggests that one of the RoBERTa models might
have developed a unique pattern of predictions, which in turn could have a significant impact on its
performance on real-world data. We hypothesized that a higher percentage of overlap in predictions
with the GPT model would be advantageous.</p>
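<p>The overlap check described above reduces to the fraction of test items on which two models' predicted labels agree. A minimal sketch (the function name and toy label lists are ours):</p>

```python
def prediction_overlap(preds_a, preds_b):
    """Fraction of items on which two models predict the same label."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must be aligned")
    agree = sum(a == b for a, b in zip(preds_a, preds_b))
    return agree / len(preds_a)

# e.g. comparing one RoBERTa run against fine-tuned GPT-3.5 predictions
roberta = ["Yes", "No", "Yes", "Yes"]
gpt35   = ["Yes", "No", "No",  "Yes"]
overlap = prediction_overlap(roberta, gpt35)  # 3 of 4 labels agree
```

<p>Computing this for each candidate against the GPT-3.5 predictions gives the comparison used in our sanity check.</p>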
          <p>Based on comparisons and analysis of key performance indicators, such as the development dataset
F1, dev-test F1, and the loss rate, we systematically gathered and analyzed training and testing data
using W&amp;B, including test-data prediction overlaps. As a result, we selected one of the two RoBERTa
models for the final submission.
4.2.2. Dutch Model fine-tuning
For Dutch, we utilized XLMR, which was fine-tuned on ClaimBuster and podcast data (as detailed in
Section 4.2), as well as XLMR fine-tuned with datasets in the four languages provided by the organizers.
Additionally, we used GPT-3.5 fine-tuned on Dutch, Arabic, and Spanish data, and finally, leveraged
LLMs GPT-4 and Mistral-7b with few-shot CoT reasoning prompts. After extensive analysis, for Dutch,
GPT-4 was the best performing model on the dev-test dataset.
4.2.3. Arabic Model fine-tuning
For Arabic, none of the XLMR models or LLMs with CoT prompts performed well. Since we suspected
that the distributions of the dev-test and the organizer test set are different, we randomly sampled 10% of the test
dataset, manually annotated it with labels, and then tested each model on that annotated sample.
Since three of us contributed to these experiments, each person labeled the 10% sample
separately. We calculated Cohen's kappa to assess inter-annotator agreement (κ = 0.424). In cases of
inter-annotator disagreement, the sentence in question was annotated according to the majority
rule.</p>
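<p>With scikit-learn, pairwise Cohen's kappa is a one-liner; the label lists below are toy examples chosen so the arithmetic is easy to follow, not our actual annotations:</p>

```python
from sklearn.metrics import cohen_kappa_score

# Toy annotations from two annotators over ten sentences (1 = check-worthy).
annotator_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# kappa = (p_o - p_e) / (1 - p_e): here p_o = 0.8, p_e = 0.5, so kappa = 0.6.
kappa = cohen_kappa_score(annotator_1, annotator_2)
</ br>```

<p>With more than two annotators, kappa is typically averaged over annotator pairs (or Fleiss' kappa is used instead).</p>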
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Results and Discussion</title>
<p>In this section, we present the results that the different models produced for the dev-test and the
submission test dataset after the gold standard was published. For the dev-test set, we assess
performance using metrics that include accuracy, precision, recall, and F1 scores for the positive class
(check-worthy claims); for the test dataset, only the F1 score was measured.</p>
      <sec id="sec-5-1">
        <title>5.1. English</title>
<p>The models evaluated in English include GPT-4, Mistral-7b, GPT-3.5, XLMR, XLMR (fine-tuned), and
RoBERTa. Table 3 provides a detailed overview of the metrics for the positive class.
• RoBERTa emerged as the best performing model for accuracy (0.937), precision (0.958), and recall
(0.852) for the dev-test data, reflecting a strong ability to correctly identify relevant instances
without a high rate of false positives. This resulted in an impressive F1 score of 0.902 for the
dev-test; however, the F1 decreased to 0.753 on the test data. Conversely, Mistral-7b
and GPT-3.5 showed lower performance across most metrics, with Mistral-7b demonstrating a
particular weakness in precision (0.667), and GPT-3.5 with even worse recall (0.296).
• GPT-4 and XLMR displayed moderate performance, with XLMR having a slight edge over GPT-4
in accuracy and F1 scores. Interestingly, the fine-tuned XLMR (fine-tuned) achieved a perfect
precision score of 1.000 but at the cost of lower recall (0.315), suggesting a conservative prediction
behavior that limited its false positives, but missed several relevant predictions.
• The variation in performance between the test and dev-test datasets for our best performing model,
RoBERTa, suggests potential overfitting or dataset-specific biases that hurt its ability to
generalize across different data. Efforts that could benefit future experiments include
fine-tuning with an expanded hyperparameter grid search; data augmentation, such as
oversampling or undersampling techniques; or using additional translated English data to make the
training data more diverse, which could improve the model's ability to generalize.
5.2. Dutch
The models evaluated for the Dutch language include GPT-4, Mistral 7b, GPT-3.5, XLMR, and XLMR
(fine-tuned). Table 4 provides a detailed overview of the performance metrics for the diferent models.
This condensed analysis aims to highlight which models perform best in handling Dutch language data,
emphasizing their strengths and potential areas for improvement. For the final submission, GPT-4 was
the model used.</p>
<p>• In overall performance for Dutch, XLMR (fine-tuned) demonstrated the best F1 score on the
dev-test data (0.653), with a slight performance decrease on the test data (0.611). This model excelled
particularly in recall (0.722) compared to the other models. The high recall coupled with reasonable
precision (0.597) suggests a balanced approach to maximizing both positive identifications and
accuracy of predictions.
• GPT-4, Mistral 7b, and GPT-3.5 (all three using CoT reasoning) showed weaker performance
metrics overall compared to XLMR (fine-tuned). GPT-4, despite GPT-4’s lower accuracy and
precision in the dev-test (0.577 and 0.580, respectively), showed a significant increase in F1 score
on the test data (0.718), which may indicate better generalization under specific conditions. Mistral
7b, on the other hand, displayed lower metrics across the board with particularly low recall (0.310).
• The XLMR model, while not reaching the heights of its fine-tuned counterpart on the dev-test data,
still outperformed the GPT models in most metrics on the dev-test dataset, showing particular
strength in recall (0.611) that closely matches its precision (0.603). This balance resulted in robust
F1 scores in both the dev-test (0.607) and test (0.694) scenarios, underlining its utility as a reliable
model for this task.
• All the models showed significant performance variances across the datasets. Interestingly, only
the XLMR (fine-tuned) exhibited a performance decline from the dev-test to the test dataset, while
all other models performed significantly better on the test dataset. Notably, the GPT-3.5 model,
fine-tuned on Spanish, Arabic, and Dutch, achieved an F1 score of 0.781 on the test dataset. This
score would have placed it at the top of the CheckThat! leaderboard for Dutch, had we submitted
these results instead of those from GPT-4.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.3. Arabic</title>
        <p>The models evaluated for the Arabic language include GPT-4, Mistral 7b, GPT-3.5, XLMR, and XLMR
(fine-tuned). In addition, a fine-tuned English GPT-3.5 model was evaluated, which classifies
Arabic-to-English translated data. Table 5 provides a detailed overview of the performance metrics for the
“check-worthy” class, encompassing accuracy, precision, recall, and F1 for the dev-test dataset, as well
as the F1 score for the submission test dataset. In addition, we annotated 10% of the test data prior
to receiving the gold standard, attempting to gain a greater understanding of how the different models
might behave on the test data. The most promising model tested on the 10% sample was the
fine-tuned GPT-3.5, which attained an F1 score of 0.848. Consequently, we chose this model for our
final submission.</p>
<p>• GPT-3.5, fine-tuned on English, outperformed GPT-4 CoT (except for precision); it also
outperformed GPT-4 on the 10% annotated sample data. As a result, we chose to submit
test results from the GPT-3.5 model. However, there was a significant drop in performance on
the test data, where the F1 score decreased to 0.569. This indicates a potential issue with the
model’s ability to generalize from the development environment to more diverse or challenging
test scenarios.
• GPT-4 with CoT prompting demonstrated robust performance across most metrics, achieving the
highest precision on the dev-test set (0.890) and showcasing a strong F1 score (0.871) and recall
(0.854). However, it underperformed compared to the fine-tuned GPT-3.5. We see a similar drop in
performance on the test set, indicating that it differs significantly from the dev-test.
• XLMR showed consistent performance, with particularly strong recall (0.870) on the dev-test,
translating into an F1 score of 0.859. It attained the second highest F1 score on the standard test
dataset (0.549).
• The XLMR (fine-tuned) also performed well, improving significantly on precision (0.919) compared
to its XLMR counterpart on the dev-test data, which resulted in an F1 score of 0.807. However, like
GPT-4 and all the other models, XLMR (fine-tuned) saw a decrease in performance on the
test dataset (F1 0.519), which could suggest overfitting to the dev-test environment or a need
for further tuning to enhance its ability to generalize across different data.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion and Future Work</title>
<p>This study offers a detailed examination of the 2024 CheckThat! Lab competition, Task 1, focusing on
check-worthiness estimation for claims in political debates and Twitter data in English, Dutch, and
Arabic. We employ a strategic combination of few-shot chain-of-thought reasoning and
language-specific fine-tuning methods.</p>
<p>Our submissions attained first place for Arabic with an F1 of 0.569, where we translated the
Arabic test data into English and then used a GPT-3.5 model fine-tuned on English to classify the translated
data. For Dutch, we secured the third-best submission with an F1 of 0.718, using GPT-4 with few-shot
chain-of-thought reasoning. Lastly, the English submission earned us ninth place,
with an F1 score of 0.753, using a RoBERTa-Large model trained on unaltered English training data
provided by the competition organizers.</p>
      <p>Despite having the best submission for Arabic, we observed a significant drop in performance when
comparing the results from the dev-test and the actual submission test dataset. This signals possible
challenges such as overfitting and poor generalization across unseen data. These issues would be an
important area for future investigations, possibly through more robust model training techniques and
exploring additional data augmentation strategies.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
<p>This research is funded by SFI MediaFutures partners and the Research Council of Norway (grant
number 309339).</p>
<p>technical report, 2024.
[27] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand,
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril,
T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023.
[28] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder,
P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human
feedback, 2022.
[29] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott,
L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020.
[30] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov,
RoBERTa: A robustly optimized BERT pretraining approach, 2019.
[31] A. J. Becker, Automated fact-checking of podcasts, 2023.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>L.</given-names>
            <surname>Konstantinovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Price</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Babakar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <article-title>Toward automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection</article-title>
          ,
          <source>Digital Threats</source>
          <volume>2</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tremayne</surname>
          </string-name>
          ,
          <article-title>Toward automated fact-checking: Detecting checkworthy factual claims by claimbuster</article-title>
          ,
          <source>in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1803</fpage>
          -
          <lpage>1812</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dalvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sajjad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>A.</given-names>
            <surname>Abdelali</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Durrani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Darwish</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Homaid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Danoe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Stolk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bruntink</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Fighting the covid-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society</article-title>
          ,
          <source>in: Findings of the Association for Computational Linguistics: EMNLP</source>
          <year>2021</year>
          ,
          <year>2021</year>
          , pp.
          <fpage>611</fpage>
          -
          <lpage>649</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Chakraborty</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Przybyła</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Struß</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Haouari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Ruggeri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          , The CLEF-2024 CheckThat! Lab:
          <article-title>Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness</article-title>
          ,
          <source>in: Advances in Information Retrieval</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>449</fpage>
          -
          <lpage>458</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>I.</given-names>
            <surname>Siles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Boczkowski</surname>
          </string-name>
          ,
          <article-title>Making sense of the newspaper crisis: A critical assessment of existing research and an agenda for future work</article-title>
          ,
          <source>New Media &amp; Society</source>
          <volume>14</volume>
          (
          <year>2012</year>
          )
          <fpage>1375</fpage>
          -
          <lpage>1394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>S.</given-names>
            <surname>Vosoughi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Roy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Aral</surname>
          </string-name>
          ,
          <article-title>The spread of true and false news online</article-title>
          ,
          <source>Science</source>
          <volume>359</volume>
          (
          <year>2018</year>
          )
          <fpage>1146</fpage>
          -
          <lpage>1151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>H.</given-names>
            <surname>Allcott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gentzkow</surname>
          </string-name>
          ,
          <article-title>Social media and fake news in the 2016 election</article-title>
          ,
          <source>Journal of Economic Perspectives</source>
          <volume>31</volume>
          (
          <year>2017</year>
          )
          <fpage>211</fpage>
          -
          <lpage>236</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bovet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Makse</surname>
          </string-name>
          ,
          <article-title>Influence of fake news in twitter during the 2016 us presidential election</article-title>
          ,
          <source>Nature Communications</source>
          <volume>10</volume>
          (
          <year>2019</year>
          )
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          , G. Mirceva,
          <article-title>Covid-19 fake news detection by using bert and roberta models</article-title>
          ,
          <source>in: 2022 45th jubilee international convention on information, communication and electronic technology (MIPRO)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>312</fpage>
          -
          <lpage>316</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Schlichtkrull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Vlachos</surname>
          </string-name>
          ,
          <article-title>A survey on automated fact-checking</article-title>
          ,
          <source>Transactions of the Association for Computational Linguistics</source>
          <volume>10</volume>
          (
          <year>2022</year>
          )
          <fpage>178</fpage>
          -
          <lpage>206</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Arslan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Hassan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tremayne</surname>
          </string-name>
          ,
          <article-title>A benchmark dataset of check-worthy factual claims</article-title>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>T.</given-names>
            <surname>Elsayed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Suwaileh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. D. S.</given-names>
            <surname>Martino</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Atanasova</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2019 checkthat!: Automatic identification and verification of claims</article-title>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Srikumar</surname>
          </string-name>
          ,
          <article-title>X-fact: A new benchmark dataset for multilingual fact checking</article-title>
          ,
          <source>in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</source>
          (Volume
          <volume>2</volume>
          :
          Short Papers)
          ,
          <year>2021</year>
          , pp.
          <fpage>675</fpage>
          -
          <lpage>682</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. S.</given-names>
            <surname>Abumansour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zubiaga</surname>
          </string-name>
          ,
          <source>Automated fact-checking: A survey</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          , G. Da San Martino,
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Caselli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kutlu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shaar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. S.</given-names>
            <surname>Kartal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Beltrán</surname>
          </string-name>
          ,
          <article-title>Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets</article-title>
          ,
          <source>in: Working Notes of CLEF 2022-Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>F.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barrón-Cedeño</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Cheema</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. K.</given-names>
            <surname>Shahi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hakimov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hasanain</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Míguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mubarak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Zaghouani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <article-title>Overview of the clef-2023 checkthat! lab task 1 on checkworthiness of multimodal and multigenre content</article-title>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>M.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Gollapalli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.-K.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>NUS-IDS at CheckThat! 2022: identifying checkworthiness of tweets using CheckthaT5</article-title>
          , in: G. Faggioli, N. Ferro,
          <string-name>
            <given-names>A.</given-names>
            <surname>Hanbury</surname>
          </string-name>
          , M. Potthast (Eds.),
          <source>Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>L.</given-names>
            <surname>Xue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Constant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kale</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Al-Rfou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Siddhant</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Barua</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>mt5: A massively multilingual pre-trained text-to-text transformer</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>M.</given-names>
            <surname>Sawinski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Węcel</surname>
          </string-name>
          , E. Księżniak,
          <string-name>
            <given-names>M.</given-names>
            <surname>Stróżyna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lewoniewski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Stolarski</surname>
          </string-name>
          , W. Abramowicz,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>