DataBees at CheckThat! 2024: Check-Worthiness Estimation
Notebook for the CheckThat! Lab at CLEF 2024

Tanisha Sriram1,*, Yadushree Venkatesh1, Sowmya Anand1 and Bharathi B1
1 SSN College Of Engineering

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France
* Corresponding author.
† These authors contributed equally.
tanisha2310538@ssn.edu.in (T. Sriram); yadushree2310494@ssn.edu.in (Y. Venkatesh); sowmya2310543@ssn.edu.in (S. Anand); bharathib@ssn.edu.in (B. B)

Abstract
In today's world, it is essential to identify claims in social media posts and transcriptions in order to combat misinformation. Task 1 [1] of the CheckThat! lab [2] gave us the opportunity to propose a solution to this problem. In this work, we present a machine learning-based approach to evaluate whether given claims are worth fact-checking. By employing pre-trained models such as BERT and RoBERTa, language-specific models such as AraBERT for Arabic, and traditional classifiers such as MultinomialNB and Logistic Regression, the system is designed to work across three languages: English, Dutch, and Arabic. Through training and experimentation on the provided datasets, the model with the highest F1 score was identified for each language. We observed that the best F1 scores were obtained by DistilBERT and BERT for English, AraBERT for Arabic, and BERT for Dutch. These models were then used to predict the check-worthiness of claims in unseen test data, which demonstrated the effectiveness of the proposed solution.

Keywords
fact-checking, pre-trained models, traditional classifiers

1. Introduction

In today's digital age, where information flows rapidly across platforms such as social media, email, and messaging apps, distinguishing between real and fake information has become extremely challenging. Misinformation can spread quickly, leading to confusion, misunderstandings, and even harmful consequences. Therefore, through this project, we aim to develop a check-worthiness estimation solution that enhances a user's ability to assess the credibility of information.

The problem of spreading misinformation is greatly magnified by the sheer volume and speed at which information travels. Social media, for instance, allows anyone to post and share content without checking its credibility. This democratization of information has its benefits, but it also increases the chance that misinformation is shared. Most people lack the tools or skills to evaluate the information they come across. They may unknowingly spread false information because it aligns with their beliefs, or because they assume it is accurate without verification. The psychological phenomenon of "confirmation bias" aggravates this issue, as people tend to accept information that confirms their preexisting beliefs and disregard information that contradicts them.

The consequences of misinformation can be severe. In the realm of health, false information about a vaccine may lead people to avoid it, allowing the spread of a disease that the vaccine could otherwise have controlled. Similarly, in politics, misinformation can influence public opinion and voting, which may eventually lead to harmful outcomes. Misinformation about disasters can create unnecessary panic and inappropriate responses from the public.

The aim of this task is to determine whether a claim in a tweet or transcription is worth fact-checking. Generally, this decision is made by professional fact-checkers or by the readers themselves, who have to answer several auxiliary questions such as "does it contain a verifiable factual claim?" and "is it harmful?" before deciding on the credibility of the content.
We used a machine learning approach: models were trained on the provided training sets and, for each language, the model with the best F1 score over the positive class was selected. Three languages were covered (English, Dutch and Arabic), and various pre-trained models along with some traditional methods were used depending on the language. The pre-trained models were BERT, RoBERTa, XLM-RoBERTa, DistilBERT, ALBERT, Electra, AraBERT (Arabic) and BERTje (Dutch). Traditional classifiers such as MultinomialNB, SVC-linear and Logistic Regression were also used.

In the realm of natural language processing (NLP) and information authenticity verification, we faced challenges when dealing with Dutch and Arabic, which have very different linguistic structures. These challenges become more pronounced because existing resources and tools are mostly designed for English. Pre-trained models for languages such as Dutch and Arabic are less available than for English: while English has many pre-trained models like BERT, GPT, and their variants, non-English languages often lack such resources. Arabic uses a completely different script from Latin-based languages like English and Dutch. The Arabic script is cursive and includes letters that change shape based on their position in a word, which adds complexity to text processing tasks such as tokenization and character recognition. Arabic is also known for its rich morphology: a single root can generate several related words. This is quite different from English and Dutch, which show less structural variation. As a result, achieving high F1 scores was particularly challenging for Arabic and Dutch. By focusing on these areas, we can improve the performance of models in many languages, enabling more accurate and reliable information verification across diverse languages.

The second section describes the datasets used for each language. The third section is the related work section, where relevant work on similar tasks is discussed. The fourth section presents the methodology, covering the complete process from pre-processing to the various models that were used. The ranks obtained are presented in the results section.

2. Dataset

We were given a dataset consisting of a training set, a development (dev) set, and a test set. The training set was the largest part of the dataset and was used to train our ML models. It contained labeled texts with their respective class labels. These samples were used by the models for learning, and the patterns and relationships in the data were captured by their algorithms. The development set, or dev set, was the intermediate step in the modeling process: it served as the validation stage and helped to fine-tune our models. Finally, the test set, used for the evaluation stage, contained unlabeled data that were passed to the models to generate predictions, which were then compared to the ground truth labels. These comparisons measure the overall performance and generalization capability of our solution.

Each instance is composed of text only, which could come from a tweet, the transcription of a debate or the transcription of a speech. The English dataset contains the sentence id, the text and the class label. Similarly, for Dutch and Arabic, we were provided with datasets containing the tweet id, tweet url, tweet text and the class label. The class labels were provided for the train and dev sets, while the labels for the test set had to be predicted by our system. The model with the best F1 score over the positive class on the dev set was selected for each language.
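To make this format concrete, the short sketch below shows how such files might be loaded with pandas; the file names and column names here are assumptions based on the description above, not necessarily the exact names used in the official distribution.

```python
# Minimal sketch of loading the task data with pandas.
# File names and column names are assumptions based on the dataset description,
# not necessarily the exact names used in the official distribution.
import pandas as pd

# English: sentence id, text, class label.
train_en = pd.read_csv("english/train.tsv", sep="\t")  # assumed columns: Sentence_id, Text, class_label
dev_en = pd.read_csv("english/dev.tsv", sep="\t")

# Dutch / Arabic: tweet id, tweet url, tweet text, class label.
train_nl = pd.read_csv("dutch/train.tsv", sep="\t")    # assumed columns: tweet_id, tweet_url, tweet_text, class_label

# Map the Yes/No class labels to 1/0 for the classifiers (see Section 4.1).
train_en["label"] = (train_en["class_label"] == "Yes").astype(int)
```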
3. Related Work

Check-worthiness estimation has been studied in various works. The Copenhagen team [3] obtained a MAP of 0.155 in the CheckThat! 2018 lab. Each sentence was represented using word embeddings along with POS tags and used as input to an RNN with GRU memory units; the output for each word was aggregated using attention and passed through a fully connected layer, from which the prediction was produced by a sigmoid function. In ClaimBuster [4], the authors used manually annotated transcripts of all the US presidential debates. They proposed an SVM-based model with sentence-level features such as sentiment, length, TF-IDF, POS tags, and entity types. Gencheva et al. [5] integrated several context-aware and sentence-level features to train both SVMs and feed-forward neural networks; this approach outperforms the ClaimBuster system in terms of MAP and precision. Patwari et al. [6] predicted whether a sentence would be selected by a fact-checking organization using a boosting-like model. Similarly, Vasileva et al. [7] used a multi-task learning neural network that predicts whether a sentence would be selected for fact-checking by each individual fact-checking organization (from a set of nine such organizations).

The NUS-IDS team was one of the top teams in the subtask on detecting check-worthiness of tweets in 2022 [8]. They explored the feasibility of adapting sequence-to-sequence models for detecting check-worthy social media content in the multilingual setting of the competition (Arabic, Bulgarian, Dutch, English, Spanish and Turkish) and ranked first in 4 out of 6 languages in CheckThat! 2022 Task 1A. The AI Rational team [9] in the same subtask employed three different pre-trained transformer models: BERT, DistilBERT (a distilled version of BERT) and RoBERTa, all taken from Hugging Face; parameter fine-tuning was carried out on the DistilBERT model since it was the fastest to train. The PoliMi-FlatEarthers team [10] used a generative pre-trained GPT-3 model in English. They showed that larger GPT-3 models, despite being developed primarily for text generation, outperformed previous language models on the task of automated claim detection on the 2022 CheckThat! challenge dataset. They also showed that GPT-3, while designed mainly for English tasks, can maintain competitive performance in other languages as well.

Glasgow Terrier at CLEF CheckThat! 2019 [11] proposed to represent each sentence through its mentioned entities using a TF-IDF representation and used an SVM classifier to predict the check-worthiness of each sentence.
Their approach ranked 4th out of 12 submissions. Their experiments showed that the pronoun and coreference resolution pre-processing procedure used as part of their approach improves the effectiveness of sentence check-worthiness prediction, and that entity analysis features provide valuable evidence for this task.

Zindex [12] used an oversampling technique to balance the dataset and applied SVM and Random Forest (RF) with TF-IDF representations. They also experimented with multilingual BERT (BERT-m) and XLM-RoBERTa-base pre-trained models. They used BERT-m for their official submissions, and their systems ranked 3rd, 5th, and 12th in Spanish, Dutch, and English, respectively. In further experiments, their evaluation showed that the transformer models (BERT-m and XLM-RoBERTa-base) outperformed SVM and RF for Dutch and English, while a different picture was observed for Spanish.

Hence, a variety of approaches have been used successfully for this task, including SVM classifiers, feed-forward neural networks, POS tags, RNNs, pre-trained models such as BERT, DistilBERT and RoBERTa, GPT-3, random forests and gradient boosting.

4. Methodology

4.1. Pre-Processing

Several important steps were taken in the pre-processing phase to increase the quality of the textual data before passing it to the machine learning models. First, all text was converted to lowercase for uniformity and to avoid duplicate words that differ only in case. Punctuation marks were eliminated to reduce noise and make the meaningful content more distinct. Stemming and lemmatization were applied to normalize words by reducing them to their roots, thereby shrinking the vocabulary size and increasing the efficiency of the models. Further, common words with little semantic value, such as articles and prepositions (stop-words), were removed to bring the content-bearing words into sharper focus. While emojis add an expressive element to the text, they were removed because they add nothing to the factual content and might introduce noise into the dataset. Second, binary encoding was carried out: the standardized responses "Yes" and "No" were converted into 1s and 0s, respectively, giving a uniform label representation across the dataset. All these pre-processing steps play an important role in refining the textual data and ensure that the machine learning models can effectively capture the semantic content and make accurate predictions regarding the check-worthiness of claims.
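The sketch below illustrates these steps in Python; it uses NLTK's Porter stemmer, and the stop-word list and emoji pattern are small illustrative assumptions rather than our exact configuration.

```python
# Minimal pre-processing sketch: lowercasing, emoji and punctuation removal,
# stop-word filtering, stemming, and binary label encoding.
# The stop-word list and emoji pattern are illustrative assumptions.
import re
import string
from nltk.stem import PorterStemmer

STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "to", "is", "it"}
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF]", flags=re.UNICODE)
stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    text = text.lower()                                   # case normalization
    text = EMOJI_PATTERN.sub(" ", text)                   # drop emojis
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]          # drop stop-words
    return " ".join(stemmer.stem(t) for t in tokens)      # stem to root forms

def encode_label(label: str) -> int:
    """Map the Yes/No class labels to 1/0."""
    return 1 if label.strip().lower() == "yes" else 0

print(preprocess("Vaccines DO NOT cause autism!!"))  # prints a lowercased, stemmed version
```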
4.2. Models

Table 1
F1 scores of various pre-trained models.

Model              English   Dutch   Arabic
BERT               0.78      0.64    0.64
RoBERTa            0.62      0.47    -
XLM-RoBERTa        0.70      0.48    -
DistilBERT         0.78      -       -
ALBERT             0.72      -       -
Electra            0.74      -       -
AraBERT (Arabic)   -         -       0.67
BERTje (Dutch)     -         0.52    -

Table 2
F1 scores of traditional classifiers.

Model                 English   Dutch   Arabic
MultinomialNB         0.63      0.38    0.58
SVC-linear            0.63      0.44    0.52
Logistic regression   0.57      0.43    0.42

BERT (Bidirectional Encoder Representations from Transformers), developed by Google AI Language, has greatly improved natural language processing (NLP). BERT employs an encoder-only architecture. Unlike earlier models that process text sequentially, from left to right or right to left, its bidirectional Transformer architecture attends to both directions at once, improving its understanding of language. After its breakthrough, BERT led to the creation of successor models such as RoBERTa, ALBERT, and DistilBERT, which build on its architecture and adopt strategies to improve performance on various NLP tasks. All of these models have raised the performance, efficiency, and scalability of NLP systems. BERT gave us an F1 score of 0.78 in English, 0.64 in Dutch and 0.64 in Arabic, as shown in Table 1.

RoBERTa, an evolution of BERT, aims to improve its language understanding. It removes the next sentence prediction task and uses dynamic masking during training, and it demonstrates better performance on most NLP benchmarks. Its bidirectional context encoding is well suited to tasks that require a subtle understanding of textual data, such as sentiment analysis, text classification, and question answering, and its ability to capture contextual information from large-scale corpora enables it to generate accurate representations of language semantics. It gave us an F1 score of 0.62 in English and 0.47 in Dutch, as shown in Table 1.

XLM-RoBERTa extends the capabilities of RoBERTa to other languages. Its cross-lingual pre-training combined with a transformer-based architecture allows XLM-RoBERTa to perform well on tasks requiring cross-lingual understanding, such as machine translation, cross-lingual document classification, and multilingual sentiment analysis. Its ability to handle diverse languages makes it an asset for applications that must understand and process multilingual content. It gave us an F1 score of 0.70 in English and 0.48 in Dutch, as shown in Table 1.

DistilBERT, a distilled version of BERT, addresses the computational and memory challenges associated with large transformer models. By reducing the model size and employing knowledge distillation techniques, DistilBERT retains the essence of BERT's contextual understanding while significantly reducing computational requirements, which makes it useful in environments where resources are constrained. It gave us an F1 score of 0.78 in English, as shown in Table 1.

ALBERT, also known as "A Lite BERT", further enhances the scalability and efficiency of transformer models. By implementing factorized embedding parameterization and cross-layer parameter sharing, ALBERT achieves performance comparable to BERT while significantly reducing memory and computational requirements. This makes it suitable for large-scale NLP tasks such as document classification, text generation, and language modeling, and an attractive option for applications requiring high-performance NLP models deployed at scale. It gave us an F1 score of 0.72 in English, as shown in Table 1.

ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) is a transformer model from Google. It differs from traditional masked language models (MLMs) such as BERT, which predict the masked tokens during pre-training: ELECTRA instead uses a replaced token detection task. The input text is corrupted by replacing some tokens with plausible alternatives generated by a small generator model, and the main model, the discriminator, is trained to detect these replacements. Thanks to this approach, ELECTRA is more sample-efficient and can reach performance competitive with or superior to BERT with much less computation. It gave us an F1 score of 0.74 in English, as shown in Table 1.
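The transformer models above were fine-tuned on the task's training data for binary classification. The sketch below shows one possible setup using the Hugging Face transformers library; the checkpoint name, hyperparameters and in-line example data are illustrative assumptions, not our exact configuration.

```python
# Hedged sketch: fine-tuning a pre-trained transformer for binary
# check-worthiness classification with Hugging Face transformers.
# Checkpoint name, hyperparameters and in-line example data are illustrative.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class ClaimDataset(torch.utils.data.Dataset):
    """Wraps pre-processed texts and their 0/1 labels as model inputs."""
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.enc = tokenizer(texts, truncation=True, padding="max_length",
                             max_length=max_len)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Placeholder data; in practice these come from the train/dev files.
train_texts, train_labels = ["claim one ...", "claim two ..."], [1, 0]
dev_texts, dev_labels = ["claim three ..."], [1]

# "bert-base-uncased" stands in for whichever checkpoint is used per language.
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

args = TrainingArguments(output_dir="checkpoints", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5)
trainer = Trainer(model=model, args=args,
                  train_dataset=ClaimDataset(train_texts, train_labels, tokenizer),
                  eval_dataset=ClaimDataset(dev_texts, dev_labels, tokenizer))
trainer.train()

# Predicted labels for held-out texts (here the dev texts as a stand-in).
preds = trainer.predict(ClaimDataset(dev_texts, dev_labels, tokenizer))
print(preds.predictions.argmax(axis=-1))
```

For Arabic and Dutch, the AraBERT and BERTje checkpoints published on Hugging Face would take the place of the English model name; these two models are described next.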
AraBERT is a language model developed specifically for Arabic text at the American University of Beirut. It is based on the BERT architecture but is adapted and fine-tuned for the particularities of the Arabic language. AraBERT has been trained on a large corpus of Arabic text and has shown impressive performance on many NLP tasks, such as text classification, named entity recognition, and sentiment analysis, which makes it a valuable resource for Arabic NLP research and applications. It gave us an F1 score of 0.67 in Arabic, as shown in Table 1.

BERTje is the Dutch version of BERT, developed at the University of Groningen. Like AraBERT, it is based on the BERT architecture and adapted to its target language. It is pre-trained on a large corpus of Dutch text and performs well on Dutch NLP tasks such as text classification, question answering, and language understanding. BERTje has become a standard choice for Dutch NLP research and applications because of its ability to capture the intricacies of the Dutch language. It gave us an F1 score of 0.52 in Dutch, as shown in Table 1.

Figure 1: Check-Worthiness Estimation

Multinomial Naive Bayes (MultinomialNB) is a probabilistic classifier based on Bayes' theorem with the assumption that every feature is independent of all others. It estimates class probabilities from feature frequencies, which makes it well suited to NLP tasks such as sentiment analysis and document classification, since it deals efficiently with large and very sparse feature spaces. Its main benefits are its ease of implementation and its very modest computational overhead, which make it a strong baseline in many text classification projects. It gave us an F1 score of 0.63 in English, 0.38 in Dutch and 0.58 in Arabic, as shown in Table 2.

Support Vector Classification with a linear kernel (SVC-linear) is another traditional machine learning algorithm commonly used for text classification. SVC-linear finds the best hyperplane that separates the classes in a high-dimensional space. It uses a linear kernel to compute the dot products between feature vectors and has proved very effective for text classification with high-dimensional feature spaces. It is not prone to overfitting, especially in combination with appropriate regularization, and the SVC family can also capture nonlinear relationships through kernel functions, making it flexible enough to model complicated data distributions. It gave us an F1 score of 0.63 in English, 0.44 in Dutch and 0.52 in Arabic, as shown in Table 2.

Logistic Regression estimates the likelihood of a binary outcome as a function of the input features by applying the logistic function to a linear combination of those features. It has been widely employed for NLP tasks where interpretability and efficiency are prime requirements, such as sentiment analysis and spam detection. Its advantages include returning probabilities that can easily be turned into class predictions, which makes the results easy to interpret. It works well when the relationship between the features and the target variable is linear or approximately linear, and it is often a stable baseline model in NLP research. It gave us an F1 score of 0.57 in English, 0.43 in Dutch and 0.42 in Arabic, as shown in Table 2.
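As an illustration, the sketch below runs the three traditional classifiers as scikit-learn pipelines over a bag-of-words representation; the TF-IDF features and default hyperparameters are assumptions for illustration rather than our exact setup.

```python
# Hedged sketch of the traditional baselines with scikit-learn.
# The TF-IDF representation and default hyperparameters are illustrative
# assumptions, not necessarily our exact configuration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Placeholder data; in practice the pre-processed train/dev texts are used.
train_texts, train_labels = ["claim one ...", "claim two ...", "not a claim"], [1, 1, 0]
dev_texts, dev_labels = ["another claim"], [1]

classifiers = {
    "MultinomialNB": MultinomialNB(),
    "SVC-linear": LinearSVC(),
    "Logistic regression": LogisticRegression(max_iter=1000),
}

for name, clf in classifiers.items():
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipeline.fit(train_texts, train_labels)
    preds = pipeline.predict(dev_texts)
    # F1 over the positive (check-worthy) class, as used for model selection.
    print(name, f1_score(dev_labels, preds, pos_label=1))
```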
5. Results and Analysis

5.1. Performance Metrics

A confusion matrix provides an overview of a machine learning model's performance on a set of test data. It shows how many instances are predicted correctly and incorrectly, and it is frequently used to assess how well classification models, which assign a categorical label to each input instance, perform. The matrix counts the instances predicted by the model on the test data:

True positives (TP): the model correctly predicts a positive data point.
True negatives (TN): the model correctly predicts a negative data point.
False positives (FP): the model incorrectly predicts a data point as positive.
False negatives (FN): the model incorrectly predicts a data point as negative.

The macro averages of precision, recall, and F1-score are used to evaluate this task. The metrics are computed for each class separately, and the averages are then taken so that each class receives equal weight. Precision relates to the likelihood that a classification is correct: it is the proportion of points predicted to belong to a class that are classified correctly.

Precision = TP / (TP + FP)    (1)

Conversely, recall estimates how many instances of a class are classified correctly. It is the ratio of a class's correctly classified points to the total of that class's correctly and wrongly classified points.

Recall = TP / (TP + FN)    (2)

The F1-score is a weighted average of recall and precision that is typically employed when there is a significant class imbalance or when both metrics need to be balanced.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)    (3)
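These metrics can be computed directly with scikit-learn, as in the short sketch below; the label vectors are placeholders rather than actual task predictions.

```python
# Hedged sketch: computing the evaluation metrics with scikit-learn.
# The label vectors below are placeholders, not actual task predictions.
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 0, 1]   # gold check-worthiness labels
y_pred = [1, 0, 0, 1, 0, 1, 1]   # model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP", tp, "TN", tn, "FP", fp, "FN", fn)

# Macro averages give each class equal weight (equations 1-3).
print("macro precision:", precision_score(y_true, y_pred, average="macro"))
print("macro recall:   ", recall_score(y_true, y_pred, average="macro"))
print("macro F1:       ", f1_score(y_true, y_pred, average="macro"))

# F1 over the positive class, used to pick the best model per language.
print("positive-class F1:", f1_score(y_true, y_pred, pos_label=1))
```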
5.2. Results

The test dataset was used to objectively assess how well the models fitted to the training data performed. For this task, the macro averages of precision, recall and F1-score, together with the confusion matrix, were employed alongside accuracy as performance indicators for the analysis.

For the dataset containing tweets in English, DistilBERT and BERT were clearly the best classifiers, with an F1 score of 0.78 on the positive class, followed by Electra and ALBERT with F1 scores of 0.74 and 0.72 on the positive class, respectively. This run secured the 18th rank in Task 1 English, as shown in Table 5; on the development data used for model selection, these models reached an F1 score of 0.78. The reason BERT performs so well on English tweets lies in its bidirectionality: through large-scale pre-training on English text, it gains deep contextual knowledge that lets it pick up subtle expressions, slang, and other contextual clues in the tweets, making it very effective for this classification task. While slightly smaller and faster than BERT thanks to distillation, DistilBERT retains much of this performance by distilling the essential knowledge from BERT pre-training. Much of BERT's ability to interpret complex language structures in tweets is retained by this model, making it very practical when computational efficiency matters and performance cannot be compromised.

Table 5
Rank list for English.

Team               F1
FactFinders        0.802
teamopenfact       0.796
innavogel          0.780
mjmanas54          0.778
ZHAWStudents       0.771
SemanticCUETSync   0.763
SINAI              0.761
DSHacker           0.760
visty              0.753
FiredfromNLP       0.745
TurQUaz            0.718
hybrinfox          0.711
SSNNLP             0.706
sz06571            0.696
NapierNLP          0.675
Mirela             0.658
KushalChandani     0.658
DataBees           0.619
TrioTitans         0.600
Madussree          0.583
pandas             0.579
JUNLP              0.541
mariuxi            0.517
grig95             0.497
CLaC2              0.494
AquaWave           0.339
Baseline           0.307

When the dataset of Dutch tweets was used to train a variety of models, we observed that BERT had the best performance, with an F1 score of 0.64 on the positive class. BERTje came second with an F1 score of 0.52 on the positive class. This run secured the 10th rank in Task 1 Dutch, as shown in Table 4; on the development data, the best model reached an F1 score of 0.64. The success of BERT on Dutch tweets can be attributed to pre-training on large corpora that include Dutch text, which lets the model pick up the specific nuances of Dutch usage. BERT handles the morphology, syntax, and contextual variation typical of Dutch tweets well, and hence generalizes with high accuracy on this classification task.

Table 4
Rank list for Dutch.

Team               F1
TurQUaz            0.732
DSHacker           0.730
visty              0.718
Mirela             0.650
Zamoranesis        0.601
FCRUG              0.594
teamopenfact       0.590
hybrinfox          0.589
mjmanas54          0.577
DataBees           0.563
JUNLP              0.550
FiredfromNLP       0.543
Madussree          0.482
Baseline           0.438
pandas             0.308
SemanticCUETSync   0.218

Considering the results on the Arabic tweet dataset, AraBERT was the best classifier, with an F1 score of 0.67 on the positive class, followed by MultinomialNB with an F1 score of 0.58 on the positive class. This run secured the 12th rank in Task 1 Arabic, as shown in Table 3; on the development data, the best model reached an F1 score of 0.67. AraBERT is tailored to Arabic text and excels on Arabic tweets thanks to fine-tuning on a large corpus of Arabic text. Its success lies in capturing the intricate morphology and syntax of Arabic, as well as expressions peculiar to Arabic-language tweets. This specialized training ensures that the subtle, context-specific nuances that matter for classification in Arabic are captured.

Table 3
Rank list for Arabic.

Team               F1
visty              0.569
teamopenfact       0.557
DSHacker           0.538
TurQUaz            0.533
SemanticCUETSync   0.532
mjmanas54          0.531
FiredfromNLP       0.530
Madussree          0.530
pandas             0.520
hybrinfox          0.519
Mirela             0.478
DataBees           0.460
Baseline           0.418
JUNLP              0.212

6. Conclusion

Although this solution promises to automate the checking of claims in several languages, there are a few areas where it can be improved. First, combining the advantages of different models may result in higher F1 scores. Second, larger and more varied training datasets can improve the models' generalization, especially for languages with limited resources. Moreover, cross-validation and hyperparameter tuning can be used to improve the overall performance.
Finally, given that both language and misinformation are constantly changing, periodic retraining of the models and adaptation to new linguistic quirks and misinformation strategies can improve the system's ability to detect misinformation.

References

[1] M. Hasanain, R. Suwaileh, S. Weering, C. Li, T. Caselli, W. Zaghouani, A. Barrón-Cedeño, P. Nakov, F. Alam, Overview of the CLEF-2024 CheckThat! lab task 1 on check-worthiness estimation of multigenre content, in: G. Faggioli, N. Ferro, P. Galuščáková, A. García Seco de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CLEF 2024, Grenoble, France, 2024.
[2] A. Barrón-Cedeño, F. Alam, J. M. Struß, P. Nakov, T. Chakraborty, T. Elsayed, P. Przybyła, T. Caselli, G. Da San Martino, F. Haouari, C. Li, J. Piskorski, F. Ruggeri, X. Song, R. Suwaileh, Overview of the CLEF-2024 CheckThat! Lab: Check-worthiness, subjectivity, persuasion, roles, authorities and adversarial robustness, in: L. Goeuriot, P. Mulhem, G. Quénot, D. Schwab, L. Soulier, G. M. Di Nunzio, P. Galuščáková, A. García Seco de Herrera, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fifteenth International Conference of the CLEF Association (CLEF 2024), 2024.
[3] C. Hansen, C. Hansen, J. G. Simonsen, C. Lioma, The Copenhagen team participation in the check-worthiness task of the competition of automatic identification and verification of claims in political debates of the CLEF-2018 CheckThat! lab, in: Conference and Labs of the Evaluation Forum, 2018. URL: https://api.semanticscholar.org/CorpusID:215822646.
[4] N. Hassan, A. Nayak, V. Sable, C. Li, M. Tremayne, G. Zhang, F. Arslan, J. Caraballo, D. Jimenez, S. Gawsane, S. Hasan, M. Joseph, A. Kulkarni, ClaimBuster: the first-ever end-to-end fact-checking system, Proceedings of the VLDB Endowment 10 (2017) 1945–1948. doi:10.14778/3137765.3137815.
[5] P. Gencheva, P. Nakov, L. Màrquez, A. Barrón-Cedeño, I. Koychev, A context-aware approach for detecting worth-checking claims in political debates, in: R. Mitkov, G. Angelova (Eds.), Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, INCOMA Ltd., Varna, Bulgaria, 2017, pp. 267–276. URL: https://doi.org/10.26615/978-954-452-049-6_037. doi:10.26615/978-954-452-049-6_037.
[6] A. Patwari, D. Goldwasser, S. Bagchi, Tathya: A multi-classifier system for detecting check-worthy statements in political debates, 2017, pp. 2259–2262. doi:10.1145/3132847.3133150.
[7] S. Vasileva, P. Atanasova, L. Màrquez, A. Barrón-Cedeño, P. Nakov, It takes nine to smell a rat: Neural multi-task learning for check-worthiness prediction, 2019.
[8] P. Nakov, A. Barrón-Cedeño, G. Da San Martino, F. Alam, J. M. Struß, T. Mandl, R. Míguez, T. Caselli, M. Kutlu, W. Zaghouani, C. Li, S. Shaar, G. K. Shahi, H. Mubarak, A. Nikolov, N. Babulkov, Y. S. Kartal, J. Beltrán, The CLEF-2022 CheckThat! lab on fighting the COVID-19 infodemic and fake news detection, in: M. Hagen, S. Verberne, C. Macdonald, C. Seifert, K. Balog, K. Nørvåg, V.
Setty (Eds.), Advances in Information Retrieval, Springer International Publishing, Cham, 2022, pp. 416–428.
[9] A. Savchev, AI Rational at CheckThat!-2022: Using transformer models for tweet classification, in: CLEF (Working Notes), 2022, pp. 656–659.
[10] S. Agresti, S. A. Hashemian, M. J. Carman, PoliMi-FlatEarthers at CheckThat!-2022: GPT-3 applied to claim detection, in: Conference and Labs of the Evaluation Forum, 2022. URL: https://api.semanticscholar.org/CorpusID:251471103.
[11] T. Su, C. Macdonald, I. Ounis, Entity detection for check-worthiness prediction: Glasgow Terrier at CLEF CheckThat! 2019, 2019.
[12] P. Tarannum, F. Alam, M. A. Hasan, S. Noori, Z-Index at CheckThat! lab 2022: Check-worthiness identification on tweet text, 2022. doi:10.48550/arXiv.2207.07308.