IAI Group at CheckThat! 2024: Transformer Models and Data Augmentation for Checkworthy Claim Detection
Notebook for the CheckThat! Lab Task 1 at CLEF 2024

Peter Røysland Aarnes (peter.r.aarnes@uis.no), Vinay Setty (vsetty@acm.org, corresponding author) and Petra Galuščáková (petra.galuscakova@uis.no)
University of Stavanger, Kjell Arholms gate 41, 4021 Stavanger, Norway

Abstract
This paper describes the IAI group's participation in automated check-worthiness estimation for claims within the framework of the 2024 CheckThat! Lab "Task 1: Check-Worthiness Estimation". The task involves the automated detection of check-worthy claims in English, Dutch, and Arabic political debates and Twitter data. We utilized various pre-trained generative decoder and encoder transformer models, employing methods such as few-shot chain-of-thought reasoning, fine-tuning, data augmentation, and transfer learning from one language to another. Despite variable success in terms of performance, our models achieved notable placements on the organizers' leaderboard: ninth-best in English, third-best in Dutch, and the top placement in Arabic, utilizing multilingual datasets to enhance the generalizability of check-worthiness detection. Despite a significant drop in performance on the unlabeled test dataset compared to the development test dataset, our findings contribute to the ongoing efforts in claim detection research, highlighting the challenges and potential of language-specific adaptations in claim verification systems.

Keywords
Check-worthiness, Fact-checking, RoBERTa, LLM fine-tuning

1. Introduction
In an era where information spreads faster than our capacity to verify it, the need for robust mechanisms to assess the veracity of circulating claims has become increasingly critical. In the automated fact-checking research community, a claim is commonly defined as "an assertion about the world that can be checked", as formalized by Full Fact [1]. However, this definition does not address the worthiness of checking a claim, since not every claim requires scrutiny; many are simply too trivial. To determine whether a claim is check-worthy, several factors can be considered: whether the assertion is of public interest; whether it is factually verifiable, such as statements about the present or the past, or involving correlation and causation; whether the claim is a rumor or conspiracy; or whether it could potentially cause social harm [1, 2, 3]. By directing the efforts of fact-checkers and automated systems toward claims with widespread impact, such as those affecting public health or policy decisions, we ensure that critical information remains reliable and that verification resources are utilized effectively.
In this paper, we detail our approach to training numerous models for the detection of check-worthy claims, specifically within the framework of the 2024 CheckThat! Lab "Task 1: Check-Worthiness Estimation" [4]. This task seeks to determine whether claims found in tweets or political speech transcriptions merit fact-checking, using a binary classification approach. We conducted experiments in all three CheckThat! languages chosen by the organizers: English, Dutch, and Arabic. Our submissions ranked best for Arabic, third for Dutch, and ninth for English. We employed exploratory methods tailored to each language, utilizing various pre-trained autoregressive decoder models and encoder-only transformer models. For English and Dutch, our primary focus was to fine-tune our chosen models using the training data provided by the organizers for each specific language.
However, we also attempted to fine-tune multilingual models using additional data beyond that of the language in which the model would be tested. For Arabic, which proved to be the most challenging dataset, we initially fine-tuned models on Arabic training data. However, the best results were achieved by translating the Arabic test data into English and then using a GPT-3.5 model, fine-tuned in English, to classify the data.
We also took part in Task 2 of the CLEF CheckThat! 2024 challenge, which aimed to determine whether a sentence from a news article expressed the author's subjective viewpoint or presented an objective perspective on the topic. As check-worthy claims are inherently objective statements, we employed the XLM-RoBERTa-Large model, which was trained for claim detection tasks. Given its multilingual capabilities, we utilized this model for datasets spanning English, German, Italian, Bulgarian, Arabic, and multilingual sources. XLM-RoBERTa-Large's ability to handle diverse languages made it a suitable choice for this multilingual claim detection task, enabling us to analyze and classify sentences across various linguistic contexts.

2. Related Work
As traditional news media experiences a decline in popularity, particularly among younger demographics [5], platforms like X (formerly known as Twitter) and other microblogging services have become primary sources of current events for many individuals. With X's surge in popularity, the spread of misinformation and fake news has been increasing [6], leading to heightened awareness and concern among researchers, policymakers, and the public. This growing attention has spurred numerous initiatives aimed at combating false narratives, as exemplified by the pervasive misinformation during the 2016 U.S. presidential election [7, 8] and the COVID-19 infodemic, both of which significantly influenced public opinion and health behaviors [9, 3].
To counteract the spread of misinformation, the research community has intensified efforts to develop datasets and methodologies for automated fact-checking. Claim detection plays a crucial role within these systems, serving as a foundational component for effective automated fact-checking [10]. The most significant progress in this area has been observed for the English language, with the two largest datasets designed for this purpose being ClaimBuster [11], containing approximately 23,500 manually annotated sentences, and CT19-T1 [12], a dataset built from several years' worth of data from the CLEF CheckThat! Lab challenges. Additionally, multilingual datasets like those documented by Gupta and Srikumar [13], primarily used for fact-checking, are also utilized for multilingual claim detection, further enhancing the resources available for this research. Although smaller datasets exist, typically with fewer than 10,000 annotated sentences, they are predominantly in English [14].
Over the past two years, the CheckThat! Labs have consistently used F1 scores as the official measure for the check-worthiness estimation subtask. Although the specific task descriptions and the languages tested have varied across different iterations of the CheckThat! Lab, the overarching goal has remained consistent: to predict the check-worthiness of claims in various languages. This work focuses primarily on text data drawn from sources such as political debates and Twitter [15, 16]. For this year's CheckThat! Lab, F1 is again the official measure used to assess performance, continuing with a subset of the same languages as previous editions: Arabic, Dutch, and English.
In the 2022 CheckThat! Lab Task 1, focused on check-worthiness estimation, the NUS-IDS group [17] had the winning submission with their CheckthaT5 model, which won in four out of the six language categories that year [16]. Their model was based on mT5, a massively multilingual sequence-to-sequence model [18], and was trained jointly on multiple languages to promote language-independent knowledge. Their Arabic submission achieved an F1 score of 0.628, and their Dutch submission an F1 score of 0.642. The winning English submission, made by the AI Rational group, used a fine-tuned RoBERTa model and achieved an F1 score of 0.698. In the 2023 CheckThat! Lab Task 1, again focused on check-worthiness estimation (Subtask 1B), the OpenFact group attained the best submission for English. The group fine-tuned GPT-3, which resulted in an F1 of 0.898; in addition, they trained a BERT-based model which achieved near-identical results [16, 19]. For Arabic, the ES-VRAI group submitted the best results, which were derived from a fine-tuned MARBERT model [20] trained on a downsampled majority class, resulting in an F1 of 0.809 [21].

3. Datasets

Table 1
Data counts across the training, development, and development test (dev-test) dataset splits.

Language  Class  Train   Development  Dev-test  Total
English   No     17,088  794          210       18,092
          Yes    5,413   238          108       5,759
          Total  22,501  1,032        318       23,851
Dutch     No     590     150          350       1,090
          Yes    405     102          316       823
          Total  995     252          666       1,913
Arabic    No     5,090   682          123       5,895
          Yes    2,243   411          377       3,031
          Total  7,333   1,093        500       8,926

As shown in Table 1, there is a significant imbalance in the class label distribution within the training data. If a model is exposed to one class more frequently during training, it may develop a bias towards the majority class, leading to overfitting and poor generalization when encountering the minority class in new data. To address these issues, one can either undersample the majority class or oversample the minority class to create a more balanced training set. Alternatively, other data augmentation techniques, such as backtranslation or synthetic data generation, could also be used to balance the class distribution [22]. Additionally, instead of only adjusting the class distribution of the training data, the evaluation strategy during training can be set to maximize the macro-averaged F1 score, which ensures that predictions for both classes are treated with equal importance. A sketch of the oversampling option is shown below.
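The following sketch illustrates random minority-class oversampling on a training split; the file path and the "text"/"class_label" column names are illustrative placeholders rather than a description of the exact files, and the same idea applies to any of the three languages.

import pandas as pd

def oversample_minority(df: pd.DataFrame, label_col: str = "class_label",
                        seed: int = 42) -> pd.DataFrame:
    """Duplicate random minority-class rows until both classes are equally frequent."""
    counts = df[label_col].value_counts()          # sorted from most to least frequent
    majority, minority = counts.index[0], counts.index[-1]
    n_extra = counts[majority] - counts[minority]
    extra = (df[df[label_col] == minority]
             .sample(n=n_extra, replace=True, random_state=seed))
    # append the duplicated rows and shuffle the result
    return pd.concat([df, extra]).sample(frac=1.0, random_state=seed)

if __name__ == "__main__":
    train = pd.read_csv("english_train.tsv", sep="\t")   # placeholder path
    balanced = oversample_minority(train)
    print(balanced["class_label"].value_counts())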
4. Methodology
To conduct our experiments, a series of methods was used in an attempt to optimize the performance of the different fine-tuned models for each language. These methods include translating data from one language to another to enlarge the training dataset for a given language, text normalization, style transfer, hyperparameter grid searches, and analyzing key performance indicators such as loss and F1 scores during training, which were logged with the Weights & Biases (W&B) Python library and online tool [23]. In this section, we explore in greater detail the methods used to fine-tune the different models. The code used to train and test our models is available in our GitHub repository (https://github.com/iai-group/clef2024-checkthat).

4.1. Data pre-processing and augmentation
Data pre-processing was one of the experimental methods we used when fine-tuning the different models. We applied the following methods:

• Text Normalization: The TweetNormalizer script [24] (https://github.com/VinAIResearch/BERTweet/blob/master/TweetNormalizer.py) was used post-translation for the Arabic, Dutch, and Spanish data. During our preliminary testing, TweetNormalizer did not yield promising results, leading us to exclude it from further experiments when training our models using hyperparameter grid searches. The reasons behind the poor performance of TweetNormalizer are not entirely clear, although it is plausible that the issue is related to entity linking. Unlike other approaches, TweetNormalizer does not preserve the specific "@" tokens in tweets. Instead, it replaces any distinct username with a generic "@USER" token, effectively removing unique identifiers associated with different classes. This removal of specific usernames could potentially disrupt contextual relevance, which might otherwise contribute positively when fine-tuning the models.

• Machine Translation: Due to the large amount of data to translate, we opted for the free-of-charge translation systems available in the deep-translator library. According to the recent WMT report [25], the quality of such freely available commercial systems depends on the particular language pair, but is relatively high for all the studied systems and language pairs. We thus used the Google Translate implementation from deep-translator (https://deep-translator.readthedocs.io/en/latest/usage.html#google-translate) due to its support for all studied languages, its generally good quality, and the fact that it requires no subscription or API key. Google Translate was used to translate datasets from any provided source language (English, Dutch, Arabic, and Spanish) to any target language (English, Dutch, Arabic).

• Style Transfer: As the style of the English collection (political debates) substantially differs from the style of the Dutch and Arabic collections (Twitter data), we also experimented with machine translation combined with style transfer to prepare in-style training data. Specifically, we style-transferred the translated English training data to resemble the Arabic data more closely. To perform this style transfer, we employed the gpt-3.5-turbo-0125 model via the ChatGPT API. We used a single prompt for each sentence, in which we asked the system to translate the sentence and also to transfer the style of a debate into a Tweet. We used a few-shot approach with three example Tweets selected from the Arabic training collection and the following prompt: "Rephrase the following statement as if somebody was Tweeting about it in Arabic. Output might use hashtags, emoticons, images and links. Statement: ({text to translate}) + Here are a few examples: ({arabic examples})". A sketch of how the translation and style-transfer calls can be combined is shown below.
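The sketch below illustrates how such an augmentation pipeline can be wired together with the deep-translator and openai packages; the helper function names and the exact message formatting are illustrative assumptions rather than our exact scripts.

from deep_translator import GoogleTranslator
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def translate(text: str, source: str = "en", target: str = "ar") -> str:
    # plain machine translation via the Google Translate backend of deep-translator
    return GoogleTranslator(source=source, target=target).translate(text)

def debate_to_arabic_tweet(text: str, arabic_examples: list[str]) -> str:
    # few-shot translation plus style transfer with gpt-3.5-turbo-0125
    prompt = (
        "Rephrase the following statement as if somebody was Tweeting about it in Arabic. "
        "Output might use hashtags, emoticons, images and links. "
        f"Statement: ({text}) + Here are a few examples: ({' | '.join(arabic_examples)})"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content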
Though the quality of the translated sentences looked reasonable, using these data did not lead to any improvement, suggesting that the domain mismatch between the collections is too large to be bridged by a style change alone. Style transfer might even affect the check-worthiness of a claim. The GPT model used for style transfer was paid and also relatively slow, which did not allow more extensive experimentation.

Few-shot chain-of-thought reasoning instruction prompt:
Your task is to identify whether a given tweet text in the {lang} language is verifiable using a search engine in the context of fact-checking. Let's define a function named checkworthy(input: str). The return value should be a string, where each string selects from "Yes", "No". "Yes" means the text is a factual checkworthy statement. "No" means that the text is not checkworthy, it might be an opinion, a question, or others. For example, if a user calls checkworthy("I think Apple is a good company.") you should return a string "No" without any other words, checkworthy("Apple's CEO is Tim Cook.") should return "Yes" since it is verifiable. Note that your response will be passed to the python interpreter, SO NO OTHER WORDS! Always return "Yes" or "No" without any other words. checkworthy({text})

4.2. Model Types and Fine-tuning
For our experiments, we utilized both pre-trained generative autoregressive decoder transformer models and pre-trained encoder-only transformer models to assess their effectiveness in classifying text in English, Dutch, and Arabic. Our selection of generative models was based on their popularity and availability: GPT-4 [26], Mistral-7b [27], GPT-3.5 with few-shot chain-of-thought (CoT) reasoning, and a fine-tuned GPT-3.5 [28]. For the encoder models, we chose XLM-RoBERTa-Large [29] and RoBERTa-Large [30], which are prominent in multilingual and English classification tasks, respectively.
For fine-tuning the encoder-only models, we utilized the Hugging Face Trainer class (https://huggingface.co/docs/transformers/main_classes/trainer). Although most hyperparameters were kept at their default settings, the number of epochs was fixed at 50. The development dataset was evaluated after each epoch, optimizing for the macro F1 score to monitor performance. We also employed hyperparameter grid searches using the Weights & Biases [23] sweep functionality to conduct multiple training runs, testing the most critical hyperparameter combinations. To save time during training, a run would terminate early if the F1 score on the development dataset did not improve for 3 consecutive epochs. A sketch of this training setup is shown after the model list below.
The following list gives an overview of the different models used in our experiments, including which data was used for fine-tuning, and specifies whether a particular model was used for only one language.

• GPT-4 [26]: Few-shot CoT reasoning. Tested on all three languages.
• Mistral 7b [27]: Few-shot CoT reasoning. Tested on all three languages.
• GPT-3.5 [28]: Few-shot CoT reasoning. Tested on all three languages.
• GPT-3.5 [28] (fine-tuned): One model was fine-tuned on English training data for the English tests and for Arabic-to-English translations; another model was fine-tuned on Spanish, Arabic, and Dutch data for the Dutch test.
• XLM-RoBERTa-Large [29] (XLMR): Fine-tuned on the English ClaimBuster data [11] and on Norwegian and German podcast data for claim detection [31]. Tested on all three languages.
• XLM-RoBERTa-Large (fine-tuned) [29]: This version of XLM-RoBERTa (which we will refer to as "XLMR fine-tuned") builds upon the initial fine-tuning of the aforementioned XLMR model. It underwent additional fine-tuning with the organizers' training data, specifically tailored to a particular language. It was evaluated across all three languages.
• RoBERTa-Large [30]: Fine-tuned on the organizers' unaltered English training data; tested only on English data.
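The sketch below illustrates this encoder fine-tuning setup with the Hugging Face Trainer: per-epoch evaluation on the development split, macro F1 as the model selection metric, and early stopping after 3 evaluations without improvement. The file paths and column names are placeholders, and the batch size and learning rate shown are a single illustrative grid point.

import numpy as np
import pandas as pd
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          EarlyStoppingCallback, Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-large")

def load_split(path: str) -> Dataset:
    # placeholder loader: assumes a TSV with "text" and "class_label" (Yes/No) columns
    df = pd.read_csv(path, sep="\t")
    df["labels"] = (df["class_label"] == "Yes").astype(int)
    ds = Dataset.from_pandas(df[["text", "labels"]])
    return ds.map(lambda batch: tok(batch["text"], truncation=True), batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1": f1_score(labels, preds, average="macro")}

train_ds = load_split("english_train.tsv")   # placeholder paths
dev_ds = load_split("english_dev.tsv")

args = TrainingArguments(
    output_dir="checkworthy-roberta",
    num_train_epochs=50,                  # fixed epoch budget
    per_device_train_batch_size=16,       # grid-searched: 16 or 32
    learning_rate=2.5e-5,                 # grid-searched value
    evaluation_strategy="epoch",          # evaluate the development split every epoch
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1",           # select the checkpoint with the best macro F1
    greater_is_better=True,
    report_to="wandb",                    # log runs to Weights & Biases
)

trainer = Trainer(
    model=AutoModelForSequenceClassification.from_pretrained("roberta-large", num_labels=2),
    args=args,
    train_dataset=train_ds,
    eval_dataset=dev_ds,
    tokenizer=tok,                        # enables dynamic padding via DataCollatorWithPadding
    compute_metrics=compute_metrics,
    # stop when the development macro F1 has not improved for 3 consecutive epochs
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()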
4.2.1. English Model fine-tuning and hyperparameter tuning
For the submission on the organizers' unlabeled English test data, we used a fine-tuned RoBERTa-Large model, since it outperformed the other models on the development test (dev-test) dataset. The list of hyperparameters employed for the grid search is provided in Table 2. Figure 1 illustrates the outcome of the 24 distinct training runs and their corresponding performance on the development dataset.

Table 2
Hyperparameter grid search values for RoBERTa-Large English fine-tuning. *Early termination if the F1 score did not improve after 3 consecutive epochs.

Parameters                   Values
Batch Size                   16, 32
Epochs                       50*
Hidden Dropout Probability   0.1, 0.2, 0.3
Learning Rate                1.25e-05, 2.5e-05, 5e-05, 7.5e-05

Figure 1: Hyperparameter W&B parallel coordinates plot for the RoBERTa hyperparameter grid search. Eval/f1 relates to the best dev-test F1 in a given run.

Given the consistently high F1 scores across multiple RoBERTa training runs on the development and dev-test datasets, a more in-depth analysis was conducted. During the RoBERTa grid search, two specific model runs, which we will refer to as model A and model B, performed exceptionally well on the development and dev-test datasets. Since only one model's prediction results could be submitted for the final evaluation, our objective was to make an informed decision about which model would likely perform best. For example, model A demonstrated slightly better dev-test F1 scores than model B, although model B performed better than model A on the development set during B's training. As a final sanity check, we compared the prediction overlap between models A and B and a fine-tuned GPT-3.5 model to determine which RoBERTa model deviated most from the GPT-3.5 predictions. A significant difference in overlap with the GPT model suggests that one of the RoBERTa models might have developed a unique pattern of predictions, which in turn could have a significant impact on its performance on real-world data. We hypothesized that a higher percentage of prediction overlap with the GPT model would be advantageous. Based on comparisons and analysis of key performance indicators, such as the development dataset F1, the dev-test F1, and the loss rate, we systematically gathered and analyzed training and testing data using W&B, including test data prediction overlaps. As a result, we decided to go with RoBERTa model A.
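A minimal sketch of this prediction-overlap check is shown below; the file names and the "id"/"prediction" columns are placeholders for exported prediction files, not the actual artifacts from our runs.

import pandas as pd

def overlap(path_a: str, path_b: str, key: str = "id", col: str = "prediction") -> float:
    # fraction of test items on which two prediction files agree
    a = pd.read_csv(path_a, sep="\t").set_index(key)[col]
    b = pd.read_csv(path_b, sep="\t").set_index(key)[col]
    joined = a.to_frame("a").join(b.to_frame("b"), how="inner")
    return float((joined["a"] == joined["b"]).mean())

for name, path in [("RoBERTa A", "roberta_a_test_preds.tsv"),
                   ("RoBERTa B", "roberta_b_test_preds.tsv")]:
    print(f"{name} vs fine-tuned GPT-3.5: {overlap(path, 'gpt35_ft_test_preds.tsv'):.3f}")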
4.2.2. Dutch Model fine-tuning
For Dutch, we utilized XLMR, which was fine-tuned on ClaimBuster and podcast data (as detailed in Section 4.2), as well as XLMR fine-tuned with datasets in the four languages provided by the organizers. Additionally, we used GPT-3.5 fine-tuned on Dutch, Arabic, and Spanish data, and finally, we leveraged the LLMs GPT-4 and Mistral-7b with few-shot CoT reasoning prompts. After extensive analysis, GPT-4 was the best performing model for Dutch on the dev-test dataset.

4.2.3. Arabic Model fine-tuning
For Arabic, none of the XLMR models or LLMs with CoT prompts performed well. Since we suspected that the distributions of the dev-test data and the organizers' test data differ, we randomly sampled 10% of the test dataset and manually annotated it with labels, and thereafter tested each model on that annotated sample. Since there are three contributors to these experiments, each person labeled the 10% sample separately. We calculated Cohen's kappa to assess inter-annotator agreement (k = 0.424). In cases of disagreement, the sentence in question was annotated according to the majority rule. A sketch of this agreement computation is shown below.
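The sketch below shows one way to compute such an agreement score for three annotators (averaged pairwise Cohen's kappa) and to resolve disagreements by majority vote; the label lists are placeholders, and the exact aggregation behind the reported k = 0.424 may differ.

from itertools import combinations
from statistics import mode
from sklearn.metrics import cohen_kappa_score

# Placeholder annotations for the sampled tweets, one list per annotator.
labels_a = ["Yes", "No", "No", "Yes", "No"]
labels_b = ["Yes", "No", "Yes", "Yes", "No"]
labels_c = ["No", "No", "No", "Yes", "Yes"]

# Average pairwise Cohen's kappa as a simple multi-annotator agreement score.
pairs = list(combinations([labels_a, labels_b, labels_c], 2))
kappa = sum(cohen_kappa_score(x, y) for x, y in pairs) / len(pairs)
print(f"Mean pairwise Cohen's kappa: {kappa:.3f}")

# Majority rule: the label chosen by at least two of the three annotators wins.
gold = [mode(votes) for votes in zip(labels_a, labels_b, labels_c)]
print(gold)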
5. Results and Discussion
In this section, we present the results that the different models produced on the dev-test dataset and on the submission test dataset after the gold standard was published. For the dev-test set, we assess performance using metrics that include accuracy, precision, recall, and the F1 score for the positive class (check-worthy claims); for the test dataset, only the F1 score was measured.

5.1. English
The models evaluated for English include GPT-4, Mistral-7b, GPT-3.5, GPT-3.5 (fine-tuned), XLMR, XLMR (fine-tuned), and RoBERTa. Table 3 provides a detailed overview of the metrics for the positive class.

Table 3
English performance metrics (binary average) for the "check-worthy" class. Accuracy, precision, and recall are measured on the dev-test dataset.

Model                 Accuracy  Precision  Recall  F1 (dev-test)  F1 (test)
GPT-4                 0.808     0.813      0.565   0.667          0.658
Mistral-7b            0.726     0.667      0.389   0.491          0.503
GPT-3.5               0.745     0.865      0.296   0.441          0.397
GPT-3.5 (fine-tuned)  0.915     0.966      0.778   0.862          0.705
XLMR                  0.830     0.829      0.630   0.716          0.717
XLMR (fine-tuned)     0.767     1.000      0.315   0.479          0.662
RoBERTa               0.937     0.958      0.852   0.902          0.753

• RoBERTa emerged as the best performing model in terms of accuracy (0.937), precision (0.958), and recall (0.852) on the dev-test data, reflecting a strong ability to correctly identify relevant instances without a high rate of false positives. This resulted in an impressive F1 score of 0.902 on the dev-test; however, the F1 dropped to 0.753 on the test data. Conversely, Mistral-7b and GPT-3.5 showed lower performance across most metrics, with Mistral-7b demonstrating a particular weakness in precision (0.667) and GPT-3.5 an even weaker recall (0.296).
• GPT-4 and XLMR displayed moderate performance, with XLMR having a slight edge over GPT-4 in accuracy and F1 scores. Interestingly, XLMR (fine-tuned) achieved a perfect precision score of 1.000, but at the cost of lower recall (0.315), suggesting a conservative prediction behavior that limited its false positives but missed several relevant predictions.
• The variation in performance between the test and dev-test datasets for our best performing model, RoBERTa, suggests potential overfitting or dataset-specific biases, which hinder generalization across different data. Efforts that could be beneficial for future experiments include fine-tuning with an expanded parameter grid search, data augmentation such as oversampling or undersampling techniques, or using additional translated English data to make the training data more diverse, which could potentially improve the model's ability to generalize.

5.2. Dutch
The models evaluated for the Dutch language include GPT-4, Mistral 7b, GPT-3.5, GPT-3.5 (fine-tuned), XLMR, and XLMR (fine-tuned). Table 4 provides a detailed overview of the performance metrics for the different models. This condensed analysis aims to highlight which models perform best in handling Dutch-language data, emphasizing their strengths and potential areas for improvement. For the final submission, GPT-4 was the model used.

Table 4
Dutch performance metrics (binary average) for the "check-worthy" class. Accuracy, precision, and recall are measured on the dev-test dataset.

Model                 Accuracy  Precision  Recall  F1 (dev-test)  F1 (test)
GPT-4                 0.577     0.580      0.389   0.466          0.718
Mistral 7b            0.547     0.539      0.310   0.394          0.601
GPT-3.5               0.538     0.544      0.155   0.241          0.647
GPT-3.5 (fine-tuned)  0.677     0.706      0.547   0.617          0.781
XLMR                  0.625     0.603      0.611   0.607          0.694
XLMR (fine-tuned)     0.637     0.597      0.722   0.653          0.611

• In terms of overall performance for Dutch, XLMR (fine-tuned) demonstrated the best dev-test F1 score (0.653), with a slight performance decrease on the test data (0.611). This model excelled particularly in recall (0.722) compared to the other models. The high recall, coupled with reasonable precision (0.597), suggests a balanced approach to maximizing both positive identifications and the accuracy of predictions.
• GPT-4, Mistral 7b, and GPT-3.5 (all three using CoT reasoning) showed weaker performance metrics overall compared to XLMR (fine-tuned). GPT-4, despite its lower accuracy and precision on the dev-test set (0.577 and 0.580, respectively), showed a significant increase in F1 score on the test data (0.718), which may indicate better generalization under specific conditions. Mistral 7b, on the other hand, displayed lower metrics across the board, with particularly low recall (0.310).
• The XLMR model, while not reaching the heights of its fine-tuned counterpart on the dev-test data, still outperformed the GPT models in most metrics on the dev-test dataset, showing particular strength in recall (0.611), which closely matches its precision (0.603). This balance resulted in robust F1 scores in both the dev-test (0.607) and test (0.694) scenarios, underlining its utility as a reliable model for this task.
• All the models showed significant performance variance across the datasets. Interestingly, only XLMR (fine-tuned) exhibited a performance decline from the dev-test to the test dataset, while all other models performed significantly better on the test dataset. Notably, the GPT-3.5 model fine-tuned on Spanish, Arabic, and Dutch achieved an F1 score of 0.781 on the test dataset. This score would have placed it at the top of the CheckThat! leaderboard for Dutch, had we submitted these results instead of those from GPT-4.

5.3. Arabic
The models evaluated for the Arabic language include GPT-4, Mistral 7b, GPT-3.5, XLMR, and XLMR (fine-tuned). In addition, a fine-tuned English GPT-3.5 model was evaluated, which classifies Arabic-to-English translated data. Table 5 provides a detailed overview of the performance metrics for the "check-worthy" class, encompassing accuracy, precision, recall, and F1 for the dev-test dataset, as well as the F1 score for the submission test dataset.

Table 5
Arabic performance metrics (binary average) for the "check-worthy" class. Accuracy, precision, and recall are measured on the dev-test dataset.

Model                 Accuracy  Precision  Recall  F1 (dev-test)  F1 (test)
GPT-4                 0.810     0.890      0.854   0.871          0.526
Mistral 7b            0.700     0.865      0.714   0.782          0.493
GPT-3.5               0.664     0.862      0.661   0.799          0.397
GPT-3.5 (fine-tuned)  0.824     0.885      0.881   0.883          0.569
XLMR                  0.784     0.848      0.870   0.859          0.549
XLMR (fine-tuned)     0.740     0.919      0.719   0.807          0.519
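Our submitted Arabic configuration chains the machine translation step from Section 4.1 with the English fine-tuned GPT-3.5 classifier. The sketch below illustrates that pipeline; the fine-tuned model identifier and the instruction message are placeholders, not the exact model name or prompt used.

from deep_translator import GoogleTranslator
from openai import OpenAI

client = OpenAI()
FT_MODEL = "ft:gpt-3.5-turbo-0125:your-org::example123"  # placeholder fine-tuned model id

def classify_arabic_tweet(tweet: str) -> str:
    # 1) translate the Arabic tweet to English, 2) classify with the fine-tuned model
    english = GoogleTranslator(source="ar", target="en").translate(tweet)
    response = client.chat.completions.create(
        model=FT_MODEL,
        messages=[
            {"role": "system",
             "content": "Answer only Yes or No: is the following text a check-worthy factual claim?"},
            {"role": "user", "content": english},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()  # expected to be "Yes" or "No"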
In addition, we annotated 10% of the test data prior to receiving the gold standard, attempting to gain a better understanding of how the different models might behave on the test data. The most promising model tested on the 10% sample was the fine-tuned GPT-3.5, which attained an F1 score of 0.848. Consequently, we chose this model for our final submission.

• GPT-3.5, fine-tuned on English, outperformed GPT-4 with CoT reasoning (except in precision), and it also outperformed GPT-4 on the 10% annotated sample data. As a result, we chose to submit test results from the GPT-3.5 model. However, there was a significant drop in performance on the test data, where the F1 score decreased to 0.569. This indicates a potential issue with the model's ability to generalize from the development environment to more diverse or challenging test scenarios.
• GPT-4 with CoT reasoning demonstrated robust performance across most metrics, achieving the highest precision on the dev-test set (0.890) and showing a strong F1 score (0.871) and recall (0.854). However, it underperformed compared to the fine-tuned GPT-3.5. We see a similar drop in performance on the test set, indicating that the test set is significantly different from the dev-test set.
• XLMR showed consistent performance, with particularly strong recall (0.870) on the dev-test set, translating into an F1 score of 0.859. It attained the second-highest F1 score on the submission test dataset (0.549).
• XLMR (fine-tuned) also performed well, improving significantly on precision (0.919) compared to its XLMR counterpart on the dev-test data, which resulted in an F1 score of 0.807. However, like GPT-4 and all the other models, XLMR (fine-tuned) saw a decrease in performance on the test dataset (F1 0.519), which could suggest overfitting to the dev-test environment or a need for further tuning to enhance its ability to generalize across different data.

6. Conclusion and Future Work
This study offers a detailed examination of our participation in Task 1 of the 2024 CheckThat! Lab, focusing on check-worthiness estimation for claims in political debates and Twitter data in English, Dutch, and Arabic. We employed a strategic combination of few-shot chain-of-thought reasoning and language-specific fine-tuning methods. Our submission attained first place for Arabic with an F1 of 0.569, achieved by translating the Arabic test data into English and then using a GPT-3.5 model fine-tuned on English to classify the translated data. For Dutch, we secured the third-best submission with an F1 of 0.718, using GPT-4 with few-shot chain-of-thought reasoning. Lastly, the English submission earned us ninth place with an F1 score of 0.753, using a RoBERTa-Large model trained on the unaltered English training data provided by the competition organizers.
Despite having the best submission for Arabic, we observed a significant drop in performance when comparing the results on the dev-test dataset with those on the actual submission test dataset. This signals possible challenges such as overfitting and poor generalization to unseen data. These issues would be an important area for future investigation, possibly through more robust model training techniques and by exploring additional data augmentation strategies.

7. Acknowledgments
This research is funded by SFI MediaFutures partners and the Research Council of Norway (grant number 309339).

References
[1] L. Konstantinovskiy, O. Price, M. Babakar, A.
Zubiaga, Toward automated factchecking: Developing an annotation schema and benchmark for consistent automated claim detection, Digital Threats 2 (2021) 1–16. [2] N. Hassan, F. Arslan, C. Li, M. Tremayne, Toward automated fact-checking: Detecting check- worthy factual claims by claimbuster, in: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 1803–1812. [3] F. Alam, S. Shaar, F. Dalvi, H. Sajjad, A. Nikolov, H. Mubarak, G. Da San Martino, A. Abdelali, N. Durrani, K. Darwish, A. Al-Homaid, W. Zaghouani, T. Caselli, G. Danoe, F. Stolk, B. Bruntink, P. Nakov, Fighting the covid-19 infodemic: Modeling the perspective of journalists, fact-checkers, social media platforms, policy makers, and the society, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 611–649. [4] A. Barrón-Cedeño, F. Alam, T. Chakraborty, T. Elsayed, P. Nakov, P. Przybyła, J. M. Struß, F. Haouari, M. Hasanain, F. Ruggeri, X. Song, R. Suwaileh, The CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities, and adversarial robustness, in: Advances in Information Retrieval, 2024, pp. 449–458. [5] I. Siles, P. J. Boczkowski, Making sense of the newspaper crisis: A critical assessment of existing research and an agenda for future work, New Media & Society 14 (2012) 1375–1394. [6] S. Vosoughi, D. Roy, S. Aral, The spread of true and false news online, Science 359 (2018) 1146–1151. [7] H. Allcott, M. Gentzkow, Social media and fake news in the 2016 election, Journal of Economic Perspectives 31 (2017) 211–236. [8] A. Bovet, H. A. Makse, Influence of fake news in twitter during the 2016 us presidential election, Nature Communications 10 (2019) 7. [9] T. Pavlov, G. Mirceva, Covid-19 fake news detection by using bert and roberta models, in: 2022 45th jubilee international convention on information, communication and electronic technology (MIPRO), 2022, pp. 312–316. [10] Z. Guo, M. Schlichtkrull, A. Vlachos, A survey on automated fact-checking, Transactions of the Association for Computational Linguistics 10 (2022) 178–206. [11] F. Arslan, N. Hassan, C. Li, M. Tremayne, A benchmark dataset of check-worthy factual claims, 2020. [12] T. Elsayed, P. Nakov, A. Barrón-Cedeño, M. Hasanain, R. Suwaileh, G. D. S. Martino, P. Atanasova, Overview of the clef-2019 checkthat!: Automatic identification and verification of claims, 2019. [13] A. Gupta, V. Srikumar, X-fact: A new benchmark dataset for multilingual fact checking, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), 2021, pp. 675–682. [14] X. Zeng, A. S. Abumansour, A. Zubiaga, Automated fact-checking: A survey, 2021. [15] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, R. Míguez, T. Caselli, M. Kutlu, W. Za- ghouani, C. Li, S. Shaar, H. Mubarak, A. Nikolov, Y. S. Kartal, J. Beltrán, Overview of the CLEF-2022 CheckThat! lab task 1 on identifying relevant claims in tweets, in: Working Notes of CLEF 2022—Conference and Labs of the Evaluation Forum, 2022. [16] F. Alam, A. Barrón-Cedeño, G. S. Cheema, G. K. Shahi, S. Hakimov, M. Hasanain, C. Li, R. Míguez, H. Mubarak, W. Zaghouani, P. Nakov, Overview of the clef-2023 checkthat! lab task 1 on check- worthiness of multimodal and multigenre content (2023). [17] S. K. N. Mingzhe Du, Sujatha Das Gollapalli, NUS-IDS at CheckThat! 
2022: identifying check- worthiness of tweets using CheckthaT5, in: N. Faggioli, Guglielmo andd Ferro, A. Hanbury, M. Potthast (Eds.), Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CLEF 2022, 2022. [18] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mt5: A massively multilingual pre-trained text-to-text transformer, 2021. [19] M. Sawinski, K. Węcel, E. Księżniak, M. Stróżyna, W. Lewoniewski, P. Stolarski, W. Abramowicz, Openfact at checkthat! 2023: Head-to-head gpt vs. bert - a comparative study of transformers language models for the detection of check-worthy claims, CLEF 2023, 2023. [20] M. Abdul-Mageed, A. Elmadany, E. M. B. Nagoudi, Arbert & marbert: Deep bidirectional transform- ers for arabic, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021, pp. 7088–7105. [21] H. T. Sadouk, F. Sebbak, H. E. Zekiri, Es-vrai at checkthat! 2023: Analyzing checkworthiness in multimodal and multigenre contents through fusion and sampling approaches (2023). [22] S. Henning, W. Beluch, A. Fraser, A. Friedrich, A survey of methods for addressing class imbalance in deep-learning based natural language processing, in: Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, 2023, pp. 523–540. [23] L. Biewald, Experiment tracking with weights and biases}, 2020. [24] D. Q. Nguyen, T. Vu, A. Tuan Nguyen, Bertweet: A pre-trained language model for english tweets, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 9–14. [25] T. Kocmi, E. Avramidis, R. Bawden, O. Bojar, A. Dvorkovich, C. Federmann, M. Fishel, M. Freitag, T. Gowda, R. Grundkiewicz, B. Haddow, P. Koehn, B. Marie, C. Monz, M. Morishita, K. Murray, M. Nagata, T. Nakazawa, M. Popel, M. Popović, M. Shmatova, Findings of the 2023 conference on machine translation (WMT23): LLMs are here but not quite there yet, in: Proceedings of the Eighth Conference on Machine Translation, WMT’23, 2023, pp. 1–42. [26] OpenAI, J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Al- tenschmidt, S. Altman, S. Anadkat, R. Avila, I. Babuschkin, S. Balaji, V. Balcom, P. Baltescu, H. Bao, M. Bavarian, J. Belgum, I. Bello, J. Berdine, G. Bernadett-Shapiro, C. Berner, L. Bogdonoff, O. Boiko, M. Boyd, A.-L. Brakman, G. Brockman, T. Brooks, M. Brundage, K. Button, T. Cai, R. Campbell, A. Cann, B. Carey, C. Carlson, R. Carmichael, B. Chan, C. Chang, F. Chantzis, D. Chen, S. Chen, R. Chen, J. Chen, M. Chen, B. Chess, C. Cho, C. Chu, H. W. Chung, D. Cummings, J. Currier, Y. Dai, C. Decareaux, T. Degry, N. Deutsch, D. Deville, A. Dhar, D. Dohan, S. Dowling, S. Dunning, A. Ecoffet, A. Eleti, T. Eloundou, D. Farhi, L. Fedus, N. Felix, S. P. Fishman, J. Forte, I. Fulford, L. Gao, E. Georges, C. Gibson, V. Goel, T. Gogineni, G. Goh, R. Gontijo-Lopes, J. Gordon, M. Grafstein, S. Gray, R. Greene, J. Gross, S. S. Gu, Y. Guo, C. Hallacy, J. Han, J. Harris, Y. He, M. Heaton, J. Heidecke, C. Hesse, A. Hickey, W. Hickey, P. Hoeschele, B. Houghton, K. Hsu, S. Hu, X. Hu, J. Huizinga, S. Jain, S. Jain, J. Jang, A. Jiang, R. Jiang, H. Jin, D. Jin, S. Jomoto, B. Jonn, H. Jun, T. Kaftan, . Kaiser, A. Kamali, I. Kanitscheider, N. S. Keskar, T. Khan, L. Kilpatrick, J. W. Kim, C. Kim, Y. Kim, J. H. Kirchner, J. Kiros, M. Knight, D. 
Kokotajlo, . Kondraciuk, A. Kondrich, A. Kon- stantinidis, K. Kosic, G. Krueger, V. Kuo, M. Lampe, I. Lan, T. Lee, J. Leike, J. Leung, D. Levy, C. M. Li, R. Lim, M. Lin, S. Lin, M. Litwin, T. Lopez, R. Lowe, P. Lue, A. Makanju, K. Malfacini, S. Manning, T. Markov, Y. Markovski, B. Martin, K. Mayer, A. Mayne, B. McGrew, S. M. McKinney, C. McLeavey, P. McMillan, J. McNeil, D. Medina, A. Mehta, J. Menick, L. Metz, A. Mishchenko, P. Mishkin, V. Monaco, E. Morikawa, D. Mossing, T. Mu, M. Murati, O. Murk, D. Mély, A. Nair, R. Nakano, R. Nayak, A. Neelakantan, R. Ngo, H. Noh, L. Ouyang, C. O’Keefe, J. Pachocki, A. Paino, J. Palermo, A. Pantuliano, G. Parascandolo, J. Parish, E. Parparita, A. Passos, M. Pavlov, A. Peng, A. Perelman, F. d. A. B. Peres, M. Petrov, H. P. d. O. Pinto, Michael, Pokorny, M. Pokrass, V. H. Pong, T. Powell, A. Power, B. Power, E. Proehl, R. Puri, A. Radford, J. Rae, A. Ramesh, C. Raymond, F. Real, K. Rimbach, C. Ross, B. Rotsted, H. Roussez, N. Ryder, M. Saltarelli, T. Sanders, S. Santurkar, G. Sastry, H. Schmidt, D. Schnurr, J. Schulman, D. Selsam, K. Sheppard, T. Sherbakov, J. Shieh, S. Shoker, P. Shyam, S. Sidor, E. Sigler, M. Simens, J. Sitkin, K. Slama, I. Sohl, B. Sokolowsky, Y. Song, N. Staudacher, F. P. Such, N. Summers, I. Sutskever, J. Tang, N. Tezak, M. B. Thompson, P. Tillet, A. Tootoonchian, E. Tseng, P. Tuggle, N. Turley, J. Tworek, J. F. C. Uribe, A. Vallone, A. Vijayvergiya, C. Voss, C. Wainwright, J. J. Wang, A. Wang, B. Wang, J. Ward, J. Wei, C. J. Weinmann, A. Welihinda, P. Welinder, J. Weng, L. Weng, M. Wiethoff, D. Willner, C. Winter, S. Wolrich, H. Wong, L. Workman, S. Wu, J. Wu, M. Wu, K. Xiao, T. Xu, S. Yoo, K. Yu, Q. Yuan, W. Zaremba, R. Zellers, C. Zhang, M. Zhang, S. Zhao, T. Zheng, J. Zhuang, W. Zhuk, B. Zoph, Gpt-4 technical report, 2024. [27] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7b, 2023. [28] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, 2022. [29] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised cross-lingual representation learning at scale, 2020. [30] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, 2019. [31] A. J. Becker, Automated fact-checking of podcasts, 2023.